Node Crawler

Write a crawler using Node.js

5 Optimization

Published:
Category: { Node Crawler }
Summary: In this article, we optimize the crawler for better performance. Batch Jobs: In the article about using MongoDB as data storage, we wrote each record to the database as soon as we fetched it. In practice, this is not efficient at all; it is much better to write to the database in batches. If you recall, the code we used to write to the database is

    // ...other code
    localdb.test.save(data, (err, res) => {
      // do something
    })

The function save accepts not only a single document but also an array of documents:

    const array = []
    for (let i = INI_ID; i < MAX_ID; i++) {
      // fetch data from website
      const data = fetchData(i)
      array…
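A minimal sketch of where this excerpt is heading, reusing the mongojs setup from the earlier article and following its note that save also accepts an array; fetchData, INI_ID, and MAX_ID are placeholders standing in for the real fetch helper and ID range:

    // batch-write sketch: collect the documents first, then save them in one call
    const mongojs = require('mongojs')
    const localdb = mongojs('simple_spider', ['test'])

    const INI_ID = 1   // placeholder start ID
    const MAX_ID = 50  // placeholder end ID

    // placeholder stub; the real fetchData would request the data from the website
    const fetchData = (i) => ({ aid: i, fetchedAt: new Date() })

    const array = []
    for (let i = INI_ID; i < MAX_ID; i++) {
      const data = fetchData(i)
      array.push(data)
    }

    // one database write for the whole batch instead of one write per document
    localdb.test.save(array, (err, res) => {
      if (err) console.error(err)
      else console.log(`saved ${array.length} documents`)
      localdb.close()
    })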

4 Restrictions of Websites

Published:
Category: { Node Crawler }
Summary: Beware that scraping data off websites is neither always allowed nor as easy as a few lines of code. The preceding articles let you scrape a lot of data; however, many websites have countermeasures. In this article, we deal with some of the common ones. Request Frequency: Some websites limit the frequency of API requests. The solution is simply a brief pause after each request. In Node.js, the function setInterval enables this.

    // ... require packages here
    // define the function fetch to get data
    const fetch = (aid) => superagent
      .get('https://api.bilibili.com/x/web-interface/archive/stat')
      .query({ aid: aid })
      …
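A rough sketch of how that pause might look, building on the superagent-based fetch from the excerpt; the 2000 ms interval and the aid range are illustrative values rather than ones taken from the article:

    const superagent = require('superagent')

    // define the function fetch to get data for one video ID (aid)
    const fetch = (aid) => superagent
      .get('https://api.bilibili.com/x/web-interface/archive/stat')
      .query({ aid: aid })

    let aid = 1         // illustrative starting ID
    const MAX_AID = 10  // illustrative end ID

    // send one request every 2 seconds instead of firing them all at once
    const timer = setInterval(() => {
      fetch(aid)
        .then((res) => console.log(aid, res.body))
        .catch((err) => console.error(aid, err.message))

      aid += 1
      if (aid > MAX_AID) clearInterval(timer)
    }, 2000)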

3 Manage Data Using MongoDB

Published:
Category: { Node Crawler }
Summary: In most cases, a database makes managing data quite convenient. In this article, we scrape data using the code we discussed before, but write the data into MongoDB. For the installation of MongoDB, please refer to the official documentation. The Code: To write data to MongoDB from Node.js, we choose the package mongojs, which provides almost exactly the standard MongoDB syntax. To install mongojs:

    npm i mongojs --save

Here is a module that can write data to MongoDB. We create a file named dao.js and copy/paste the following code into it.

    // use mongojs
    const mongojs = require('mongojs')
    // connect to the database 'simple_spider' in MongoDB and use collection 'test'
    const localdb = mongojs('simple_spider', ['test'])
    // a function that saves data to MongoDB
    const saveData = (data, cb) => {
      localdb…
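The excerpt cuts off inside saveData; one plausible completion of dao.js, based on the fragment shown (the exact callback handling and export style in the original article may differ), is:

    // dao.js
    // use mongojs
    const mongojs = require('mongojs')

    // connect to the database 'simple_spider' in MongoDB and use collection 'test'
    const localdb = mongojs('simple_spider', ['test'])

    // a function that saves data to MongoDB and reports the result via a callback
    const saveData = (data, cb) => {
      localdb.test.save(data, (err, res) => {
        cb(err, res)
      })
    }

    module.exports = { saveData }

A crawler script could then require('./dao') and call saveData with each scraped document and a callback.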

2 Basic Node Crawler

Published:
Category: { Node Crawler }
Summary: Prerequisites: Node.js >= 8.9. Overview: A model for a crawler is as follows. The crawler requests data from the server, and the server responds with some data. Here is a graphic illustration:

    +----------+                  +-----------+
    |          |   HTTP Request   |           |
    |          +----------------->|           |
    |  Nodejs  |                  |  Servers  |
    |          |<-----------------+           |
    |          |   HTTP Response  |           |
    +----------+                  +-----------+

HTTP Requests: For a good introduction to HTTP requests, please refer to this video on YouTube: Explained HTTP, HTTPS, SSL/TLS. API: As a first step, we need to find which URL to request. Chrome and Firefox-based browsers come with developer tools.
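As a concrete taste of the request/response picture above, here is a minimal sketch that asks an API for JSON and prints the response. It assumes the superagent package used later in the series (npm i superagent --save); the endpoint is the one that appears in the later articles, and the aid value is purely illustrative:

    // minimal crawler: one HTTP request, one HTTP response
    const superagent = require('superagent')

    superagent
      .get('https://api.bilibili.com/x/web-interface/archive/stat')  // URL found via the browser's developer tools
      .query({ aid: 170001 })  // illustrative video ID
      .then((res) => {
        // the server responds with JSON; res.body holds the parsed data
        console.log(res.body)
      })
      .catch((err) => {
        console.error(err.message)
      })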

1 Introduction to Node Crawler Series

Published:
Category: { Node Crawler }
Summary: Installing Node.js and MongoDB.