Web scraping with Node.js and Cheerio

Jake L.

Published Nov 26, 2018

This post is a guest contribution by Soham Kamani for ButterCMS. The original post can be found here.

Almost all the information on the web exists in the form of HTML pages. The information in these pages is structured as paragraphs, headings, lists, or one of the many other HTML elements. These elements are organized in the browser as a hierarchical tree structure called the DOM (short for Document Object Model). Each element can have multiple child elements, which can also have their own children. This structure makes it convenient to extract specific information from the page.
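
To make the tree structure concrete, here is a minimal sketch in plain JavaScript that models a small page as nested objects and walks it recursively. The node shape (tag, text, children) and the collectText helper are illustrative assumptions, not the browser's actual DOM API.

```javascript
// A simplified, hypothetical model of a DOM tree: each node has a tag name,
// optional text of its own, and a list of child nodes.
const page = {
  tag: 'body',
  children: [
    { tag: 'h1', text: 'Title', children: [] },
    {
      tag: 'ul',
      children: [
        { tag: 'li', text: 'First', children: [] },
        { tag: 'li', text: 'Second', children: [] },
      ],
    },
  ],
}

// Walk the tree depth-first and collect the text of every node with the given tag.
function collectText(node, tag, found = []) {
  if (node.tag === tag && node.text !== undefined) {
    found.push(node.text)
  }
  for (const child of node.children) {
    collectText(child, tag, found)
  }
  return found
}

console.log(collectText(page, 'li'))
// Output: [ 'First', 'Second' ]
```

Because every element sits at a known place in the tree, a scraper can reach the data it needs by walking down from the root, which is what selector-based libraries do behind the scenes.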

The process of extracting this information is called "scraping" the web, and it's useful for a variety of applications. Search engines, for example, crawl and scrape web pages to build the indexes behind their search results. We can also use web scraping in our own applications when we want to automate repetitive information-gathering tasks.

Cheerio is a Node.js library that helps developers interpret and analyze web pages using a jQuery-like syntax. In this post, I will explain how to use Cheerio to scrape the web. We will use the ButterCMS API documentation as an example and use Cheerio to extract all the API endpoint URLs from the web page.

Why Cheerio

There are many other web scraping libraries, and they are available for most popular programming languages and platforms. What sets Cheerio apart, however, is its jQuery-based API.

jQuery is one of the most widely used JavaScript libraries. It's used in browser-based JavaScript applications to traverse and manipulate the DOM. For example, if your document has the following paragraph:

<p id="example">This is an <strong>example</strong> paragraph</p>

You could use jQuery to get the text of the paragraph:

const txt = $('#example').text()
console.log(txt)
// Output: "This is an example paragraph"

The above code uses a CSS selector #example to get the element with the id of "example". The text method of jQuery extracts just the text inside the element (the <strong> tags disappeared in the output).

The jQuery API is useful because it uses standard CSS selectors to search for elements, and has a readable API to extract information from them. jQuery, however, depends on the browser's DOM, so it cannot be used directly in a server-side scraper. Cheerio solves this problem by providing jQuery's core functionality within the Node.js runtime, so that it can be used in server-side applications as well. Now, we can use the same familiar CSS selection syntax and jQuery methods without depending on the browser.

The Cheerio API

Unlike jQuery, Cheerio doesn't have access to the browser's DOM. Instead, we need to load the source code of the web page we want to scrape. Cheerio allows us to load HTML code as a string, and returns an instance that we can use just like jQuery.

Let's look at how we can implement the previous example using Cheerio:

// Import the Cheerio library
const cheerio = require('cheerio')

// Load the HTML code as a string, which returns a Cheerio instance
const $ = cheerio.load('<p id="example">This is an <strong>example</strong> paragraph</p>')

// We can use the same API as jQuery to get the desired result
const txt = $('#example').text()
console.log(txt)
// Output: "This is an example paragraph"

You can find more information on the Cheerio API in the official documentation.

Scraping the ButterCMS documentation page

The ButterCMS documentation page is filled with useful information on its APIs. For our application, we just want to extract the URLs of the API endpoints.

For example, the API to get a single page is documented below:

What we want is the URL:

https://api.buttercms.com/v2/pages/<page_type_slug>/<page_slug>/?auth_token=api_token_b60a008a

In order to use Cheerio to extract all the URLs documented on the page, we need to:

1. Download the source code of the web page and load it into a Cheerio instance.
2. Use the Cheerio API to select the HTML elements that contain the URLs.

To get started, make sure you have Node.js installed on your system. Create an empty folder as your project directory:

mkdir cheerio-example

Next, go inside the directory and start a new Node.js project:

To continue reading, click here.
