Write a crawler using nodejs
In this article, we will optimize the crawler for better performance.

Batch Jobs

In the article about using MongoDB as data storage, we wrote each record to the database as soon as we got it. In practice, this is not efficient at all. This is where batch jobs come in: it is much better to write to the database in batches. If you recall, the code we used to write to the database is
Beware that scraping data off websites is not always allowed, nor is it as easy as a few lines of code. The preceding articles enable you to scrape a lot of data; however, many websites have countermeasures. In this article, we will deal with some of the common ones.

Request Frequency

Some websites limit the frequency of API requests. The solution is simply a brief pause after each request.
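The pause after each request can be sketched as follows. This is only a minimal illustration, not the code from the article: the names `sleep` and `crawlWithPause`, the injected `fetchPage` function, and the delay value are all assumptions made for the example.

```javascript
// Minimal sketch: pause briefly after each request to stay under a rate limit.
// `fetchPage` is a hypothetical function supplied by the caller; it takes a
// URL and returns a promise for the response data.

// Resolve after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Request each URL in turn, waiting `delayMs` after every request.
async function crawlWithPause(urls, fetchPage, delayMs) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchPage(url));
    await sleep(delayMs); // the brief pause between requests
  }
  return results;
}
```

Requests are sent one at a time here on purpose: running them in parallel would defeat the pause. A suitable delay depends on the website, so the value is left to the caller.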
In most cases, databases make the management of data quite convenient. In this article, we will scrape data using the code we discussed before and write it into MongoDB. For installation of MongoDB, please refer to the official documentation.

The Code

To write data to MongoDB using Node.js, we choose the package mongojs, which provides almost exactly the standard MongoDB syntax. To install mongojs, run

    npm i mongojs --save

Here is a module that can write data to MongoDB.
Prerequisites

Nodejs >= 8.9

Overview

A model for a crawler is as follows. A crawler requests data from the server, and the server responds with some data. Here is a graphic illustration:

    +----------+                    +-----------+
    |          |   HTTP Request     |           |
    |          +----------------->  |           |
    |  Nodejs  |                    |  Servers  |
    |          |  <-----------------+           |
    |          |   HTTP Response    |           |
    +----------+                    +-----------+

HTTP Requests

For a good introduction to HTTP requests, please refer to this video on YouTube: Explained HTTP, HTTPS, SSL/TLS

API

As the first step, we need to find which URL to request.
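To make the request/response model concrete, here is a minimal sketch that sends one GET request with Node's built-in `http` and `https` modules and collects the response. The helper name `fetchText` and the URL in the usage comment are placeholders chosen for illustration; the article may use a different library.

```javascript
// Minimal sketch of the request/response model using only built-in modules.
const http = require('http');
const https = require('https');

// Send a GET request to `url` and resolve with the status code and body text.
function fetchText(url) {
  // Pick the client module that matches the URL scheme.
  const client = url.startsWith('https:') ? https : http;
  return new Promise((resolve, reject) => {
    client
      .get(url, (res) => {
        let body = '';
        res.setEncoding('utf8');
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => resolve({ status: res.statusCode, body }));
      })
      .on('error', reject);
  });
}

// Usage with a placeholder URL (hypothetical, replace with the real endpoint):
// fetchText('https://example.com/api/items').then((r) => console.log(r.status));
```

This sketch does not follow redirects or set request headers, both of which a real crawler often needs.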
Installing Node.js and MongoDB.