

Using just the request-promise module and Cheerio.js should allow you to scrape the vast majority of sites on the internet.

Perfect! Now let’s wrap this up into a function and export it from this module. Let’s then return to our original file potusScraper.js and require the potusParse.js module. We’ll then apply it to the list of wikiUrls we gathered earlier. Voilà! A list of the names and birthdays of all 45 U.S. presidents!
Let’s once again use Chrome DevTools to find the syntax of the code we want to parse, so that we can extract the name and birthday with Cheerio.js. We see that the name is in a class called “firstHeading” and the birthday is in a class called “bday”. Let’s modify our code to use Cheerio.js to extract these two classes.

Next, let’s open a new text file (name the file potusScraper.js), and write a quick function to get the HTML of the Wikipedia “List of Presidents” page.

Cool, we got the raw HTML from the web page! But now we need to make sense of this giant blob of text. To do that, we’ll need to use Chrome DevTools to allow us to easily search through the HTML of a web page.

Using Chrome DevTools is easy: simply open Google Chrome, and right click on the element you would like to scrape (in this case I am right clicking on George Washington, because we want to get links to all of the individual presidents’ Wikipedia pages). Now, simply click inspect, and Chrome will bring up its DevTools pane, allowing you to easily inspect the page’s source HTML.

Parsing HTML with Cheerio.js

Awesome, Chrome DevTools is now showing us the exact pattern we should be looking for in the code (a “big” tag with a hyperlink inside of it). Let’s use Cheerio.js to parse the HTML we received earlier to return a list of links to the individual Wikipedia pages of U.S. presidents. We check to make sure there are exactly 45 elements returned (the number of U.S. presidents), meaning there aren’t any extra hidden “big” tags elsewhere on the page. Now, we can go through and grab a list of links to all 45 presidential Wikipedia pages by getting them from the “attribs” section of each element.

Now we have a list of all 45 presidential Wikipedia pages. Let’s create a new file (named potusParse.js), which will contain a function to take a presidential Wikipedia page and return the president’s name and birthday. First things first, let’s get the raw HTML from George Washington’s Wikipedia page.
We will be gathering a list of all the names and birthdays of U.S. presidents from Wikipedia, and the titles of all the posts on the front page of Reddit.

First things first: let’s install the libraries we’ll be using in this guide (Puppeteer will take a while to install, as it needs to download Chromium as well).
This guide will walk you through the process with the popular Node.js request-promise module, CheerioJS, and Puppeteer. Working through the examples in this guide, you will learn all the tips and tricks you need to become a pro at gathering any data you need with Node.js!

So what’s web scraping anyway? It involves automating away the laborious task of collecting information from websites. There are a lot of use cases for web scraping: you might want to collect prices from various e-commerce sites for a price comparison site, or perhaps you need flight times and hotel/Airbnb listings for a travel site.
