Custom Extraction in Screaming Frog: XPath and CSSPath
Screaming Frog is one of the most versatile and useful SEO tools currently available. Seer Interactive created an awesome guide for ‘doing almost anything’, and you should definitely check it out to get a grasp on some of the basic and advanced features in Screaming Frog.
That guide was published prior to Screaming Frog's 4.0 update, however, and I wanted to focus on one of the more advanced features that I think is vastly underutilized by the SEO community: custom extraction.
Screaming Frog announced custom extraction with version 4.0 on July 7th, 2015. Extraction is similar to their custom search feature, but you can actually extract pretty much anything out of the HTML source that you want – not just search for it. With this release, you have 10 fields that you can customize to extract information from HTML pages that return a 200 status code using XPath, CSSPath, or regex.
Before we get into how to use these functions, I will try to briefly explain what they are at a basic level (ELI5) since I am not a developer and most SEOs are in the same boat. This topic gets very technical quickly, so I will try to break it down to a level where anyone can understand and use these functions to make custom data extraction easy.
XPath:
XPath is short for XML Path Language, which is a query language that describes a way to find and process items in XML documents. XPath can also be used for HTML because it has a similar hierarchical structure and it is a short, quick, and easy way to find an element on a web page.
CSSPath:
CSSPath is a lot like XPath, but you use CSS selectors to pull elements. Which one is 'better' is a hotly debated topic that gets technical very quickly. The two have relatively similar syntax, but CSS selectors are often described as 'easier' to write, and many claim they are faster than XPath, which doesn't really matter for our purposes. CSSPath also gives you the option to pull attributes.
Regex:
Regex is short for regular expressions, a syntax developed to match patterns in text that is available in most programming languages. It should generally be used as a last resort (which is why we won't get into it here) to pull data that XPath and CSSPath cannot, like inline JavaScript and HTML comments. For SEO purposes it is used the least, as XPath and CSSPath will get you most of what you need. It is worth noting that regex is also used in the exclude option during a crawl, and I use it constantly for things like excluding pages with parameters.
So How Do We Use Them?
In order to get to the custom extraction option, click on Configuration, mouse over Custom, and click on Extraction:
This will bring up a box that looks like this:
These 10 fields allow you to insert XPath, CSSPath, or regex to search for and extract custom elements. You can type a name for the search in the white boxes and it will use that as the column title in Screaming Frog and the export in Excel. For example, OG Title, Google Analytics ID, H3[1], H3[2], H4[1], or whatever else you are searching for.
I will go over the basic syntax and provide some examples for each below, but I want to point out that Screaming Frog makes this easy by including a syntax validation indicator right next to the input field. A red X means that the syntax is invalid and a green check mark means that it is correct:
When you select XPath or CSSPath as the method of extraction, a field for the expression appears along with an option box that lets you choose what to extract:
- Extract HTML Element: The selected element and its inner HTML content.
- Extract Inner HTML: The inner HTML content of the selected element. If the selected element contains other HTML elements, they will be included.
- Extract Text: The text content of the selected element and the text content of any sub elements.
I did a quick crawl of Moz's site to illustrate how these will look:
Take a look at the blue highlighted examples. You can see that each selected field extracts different parts of the HTML. You can use whatever option serves your extraction purpose, but I find myself extracting the text more often than anything else.
Syntax and Common Extractions
XPath:
In order to understand XPath, we need to understand how nodes work. Essentially, the elements in a document relate to one another in a family-tree structure, and XPath navigates those relationships. Let's take a look at the commonly used <div>:
<div title="This is a Section">
<td name="Pie">
<tr class="Pecan">Pecan Pie</tr>
<tr class="Blueberry">Blueberry Pie</tr>
<tr class="Apple">Apple Pie</tr>
</td>
</div>
As you can see, the section has an opening <div> at the top and a closing </div> at the bottom. This means that the <td> and <tr> elements are contained within the div, and they are considered descendants of the div: the <td> is a child, and the <tr> elements are grandchildren. The three pie <tr> elements are siblings of one another. This is important to understand, as XPath uses these relationships to locate the element your custom extraction asks for. Still with me?
In this example, it turns out we need to find the Apple Pie. Any of these XPath expressions should do the trick:
//div[@title="This is a Section"]//tr[text()="Apple Pie"]
//tr[@class='Apple']
//div/td/tr[3]
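If you want to sanity-check expressions like these before a crawl, you can evaluate them in Python with the lxml library (assumed installed here; Screaming Frog runs its own XPath engine). All three expressions land on the same element:

```python
from lxml import etree

snippet = """<div title="This is a Section">
  <td name="Pie">
    <tr class="Pecan">Pecan Pie</tr>
    <tr class="Blueberry">Blueberry Pie</tr>
    <tr class="Apple">Apple Pie</tr>
  </td>
</div>"""

doc = etree.fromstring(snippet)

# Each expression finds the same Apple Pie element
print(doc.xpath('//div[@title="This is a Section"]//tr[text()="Apple Pie"]')[0].text)
print(doc.xpath("//tr[@class='Apple']")[0].text)
print(doc.xpath('//div/td/tr[3]')[0].text)
# All three print: Apple Pie
```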
You can find more syntax detail here, but here is a quick and dirty reference:
// - searches all descendant elements
/ - searches all child elements
[] - the predicate, which specifies something about the element you are looking for
@ - specifies an element attribute (like @title)
text() - gets the text of the element
. - specifies the current node
.. - specifies the parent node
and - Boolean and
or - Boolean or
= - equals
* - wildcard that selects all elements
@* - wildcard that matches any attribute node
node() - matches any node of any kind
() - groups operations to establish precedence
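A few of these operators, tried out on the pie snippet with lxml (an assumption for illustration, not Screaming Frog's own engine):

```python
from lxml import etree

doc = etree.fromstring(
    '<div title="This is a Section">'
    '<td name="Pie"><tr class="Apple">Apple Pie</tr></td>'
    '</div>'
)

apple = doc.xpath('//tr[@class="Apple"]')[0]
print(apple.xpath('text()'))               # ['Apple Pie']  text() gets the text
print(apple.xpath('..')[0].tag)            # td             .. walks to the parent
print(apple.xpath('@*'))                   # ['Apple']      @* matches any attribute
print([e.tag for e in doc.xpath('//*')])   # ['div', 'td', 'tr']  * selects all elements
```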
As you can see, you can do a lot with XPath and it can get very complex. So how does this help with SEO and what are some of the more common uses? Here are some examples:
Screaming Frog already pulls H1s and H2s, but what if you want H3s or H4s? We can use the // to search all descendant elements for the H tags:
(//h3)[1] - to pull the first H3 on the page
(//h3)[2] - to pull the second H3
//h4 - to pull the H4s
//h5 - to pull the H5s
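One subtlety worth knowing: //h3[1] and (//h3)[1] are not the same expression. The first returns every h3 that comes first within its own parent; the parenthesized form returns the single first h3 on the whole page. A quick sketch with lxml (assumed installed) shows the difference:

```python
from lxml import etree

page = etree.fromstring(
    '<body>'
    '<section><h3>First</h3><h3>Second</h3></section>'
    '<section><h3>Third</h3></section>'
    '</body>'
)

# //h3[1]: the first h3 inside each parent, so one per section
print([h.text for h in page.xpath('//h3[1]')])    # ['First', 'Third']

# (//h3)[1]: the first h3 in the entire document
print([h.text for h in page.xpath('(//h3)[1]')])  # ['First']
```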
If you wanted to extract any schema elements, you would use something like:
(//*[@itemtype])[1]/@itemtype
(//*[@itemtype])[2]/@itemtype
(//*[@itemtype])[3]/@itemtype
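Checked against a sample page with lxml (an illustrative assumption; the schema.org URLs below are made up for the demo), each expression grabs the nth itemtype attribute on the page:

```python
from lxml import etree

page = etree.fromstring(
    '<body>'
    '<div itemtype="https://schema.org/Organization"></div>'
    '<div itemtype="https://schema.org/BreadcrumbList"></div>'
    '</body>'
)

print(page.xpath('(//*[@itemtype])[1]/@itemtype'))  # ['https://schema.org/Organization']
print(page.xpath('(//*[@itemtype])[2]/@itemtype'))  # ['https://schema.org/BreadcrumbList']
```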
If you wanted to extract the content of social media tags like Open Graph or Twitter Cards, you would use:
//meta[starts-with(@property, 'og:title')][1]/@content
//meta[starts-with(@property, 'og:description')][1]/@content
//meta[starts-with(@property, 'og:type')][1]/@content
//meta[starts-with(@property, 'og:site_name')][1]/@content
//meta[starts-with(@property, 'og:locale')][1]/@content
//meta[starts-with(@property, 'og:image')][1]/@content
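Here is the og:title and og:type pattern verified with lxml against a minimal head fragment (the content values are placeholders for the demo):

```python
from lxml import etree

head = etree.fromstring(
    '<head>'
    '<meta property="og:title" content="My Page"/>'
    '<meta property="og:type" content="article"/>'
    '</head>'
)

# /@content at the end returns the attribute value rather than the element
print(head.xpath("//meta[starts-with(@property, 'og:title')][1]/@content"))  # ['My Page']
print(head.xpath("//meta[starts-with(@property, 'og:type')][1]/@content"))   # ['article']
```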
If you wanted to pull email addresses from a site to utilize for outreach, you could use:
(//a[starts-with(@href, 'mailto')])[1]
(//a[starts-with(@href, 'mailto')])[2]
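Wrapping the expression in parentheses before indexing counts across the whole page rather than within each parent, which matters here because mailto links usually sit in different containers. A sketch with lxml (the addresses are invented for the demo):

```python
from lxml import etree

page = etree.fromstring(
    '<body>'
    '<p><a href="mailto:editor@example.com">Email the editor</a></p>'
    '<p><a href="mailto:tips@example.com">Send a tip</a></p>'
    '</body>'
)

# Appending /@href pulls the address itself instead of the link element
print(page.xpath("(//a[starts-with(@href, 'mailto')])[1]/@href"))  # ['mailto:editor@example.com']
print(page.xpath("(//a[starts-with(@href, 'mailto')])[2]/@href"))  # ['mailto:tips@example.com']
```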
You can come up with any number of combinations to search for and the more you use XPath, the easier it will be to grasp how the syntax works.
CSSPath:
CSSPath relies on the same relational structure as XPath but uses CSS selectors to find the information. I think the easiest way for us non-developers to understand CSSPath is to contrast it with XPath:
Remember our pie example above?
XPath: //div[@title="This is a Section"]//tr[text()="Apple Pie"]
XPath: //tr[@class='Apple']
XPath: //div/td/tr[3]
CSSPath: div[title="This is a Section"] tr:contains(Apple Pie)
CSSPath: tr.Apple
CSSPath: div > td > tr:nth-of-type(3)
As for the H tags:
XPath: (//h3)[1] - to pull the first H3
XPath: (//h3)[2] - to pull the second H3
XPath: //h4 - to pull the H4s
XPath: //h5 - to pull the H5s
CSSPath: h3:nth-of-type(1)
CSSPath: h3:nth-of-type(2)
CSSPath: h4
CSSPath: h5
You can find more CSS selector details here, which is particularly useful because it compares CSS selectors against XPath. Ultimately, each method has its own syntax that you can learn. Personally, I am more comfortable with XPath and use it far more than CSSPath.
The Easy Way
Now that we have gone through the hard way, there is also a really easy shortcut you can use to generate these XPath expressions and CSSPath selectors with Google Chrome.
For this example, we will use our friends at Search Engine Land. Let's say we want a list of all of the contributing authors but don't know how to write the XPath expression for that exact website structure. If we go to the site and click on a random article, we can right-click on the author's name and select 'Inspect element':
This will bring up the developer console, where you can right-click the author line, click 'Copy XPath' or 'Copy selector', and paste the result into Screaming Frog (which validates it for us):
XPath: /html/body/div[2]/div[2]/div/div[3]/article/div/div[4]/a
CSSPath: html > body > div:nth-of-type(2) > div:nth-of-type(2) > div > div:nth-of-type(3) > article > div > div:nth-of-type(4) > a
When you run a crawl of the site, it will extract the names of the authors for each article:
Wrapping Up
While Screaming Frog's feature set was already extremely useful, adding the ability to extract 10 custom elements makes it a significantly more versatile tool.
XPath and CSSPath can be confusing when you first start experimenting with them, but I would recommend that all SEOs take the time to understand them and build their own cheat sheet of XPath/CSSPath expressions to pull the data that Screaming Frog could not previously extract.