How to Use Natural Language Processing, Classification, and Entity Recognition to Understand Your Content Gaps

In SEO, we often hear that content is king. What we often fail to realize, though, is that it’s not just about the content itself but the context in which that content is delivered: the right content, at the right time, to the right user. There is no shortage of methods to help you understand “context”, or whether you’re delivering the “right” content to your users. Today, I’ll describe one particular method you can use to better understand if you’re delivering the right content.

Content Gap Analysis is a great way to understand what your content is missing and whether there are any glaring “gaps” in your content that your users and readers might find relevant. But how do we go about understanding our content gaps, especially if, like me, you work with publishers that cover a wide variety of topics? You can’t be a subject matter expert on every topic. Luckily, you don’t have to be: we can use Google as a proxy. In other words, assuming that Google returns only the most “relevant” and “comprehensive” resources in its search results (which I really hope is the case), we can compare our content to the pages ranking above us to better understand where our content gaps may be.

But if, like me, you work with publishers or websites that produce a lot of content, this may seem like an impossible task. Luckily, with just a little bit of Python and some nifty libraries and APIs, we can easily scale and automate this process. Below, I’ll share a process that applies Natural Language Processing techniques to parse and analyze content and extract entity data, so that we can easily discover content gaps.

But before we get started, you’ll need to familiarize yourself with, and install, the following libraries and APIs, which will do most of the heavy lifting for us:

  • TextRazor — Entity Extraction, Disambiguation and Linking. Keyphrase Extraction. Automatic Topic Tagging and Classification.
  • Pipulate — Free and Open Source SEO Software. Automate Google Sheets without plug-ins.

To get started, let’s imagine we’re HighGroundGaming.com and we’d like to understand whether our “7 Best Standing Desks of 2020” article has any glaring content gaps. As of this writing, HGG ranks #15 on Google for “best stand up desks”. First, head over to Google and copy the top three URLs ranking for “best stand up desks” (or all of the URLs in the top 10 positions; up to you). Note that this step, too, can be automated fairly easily with Python, but that’s another post for another time. Once you’ve !pip installed both TextRazor and Pipulate and gathered all the URLs you’d like to analyze, follow the steps below.
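In a Jupyter or Colab notebook, the install is a single cell. I’m assuming both packages are published under these names on PyPI:

```python
# In a Jupyter/Colab cell; the leading "!" runs a shell command.
!pip install textrazor pipulate
```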

1. Import the necessary libraries, set your TextRazor API key, and authorize Pipulate.
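The original screenshot isn’t reproduced here, so here’s a minimal sketch of what this step might look like. It assumes the textrazor Python client; because I can’t show Pipulate’s exact authorization call, gspread’s service-account flow stands in for the Google Sheets side:

```python
import textrazor
import pandas as pd
import gspread  # stand-in for Pipulate's Google Sheets authorization

# Set your TextRazor API key (from textrazor.com).
textrazor.api_key = "YOUR_TEXTRAZOR_API_KEY"

# Authorize access to Google Sheets. The article authorizes Pipulate
# here; gspread's service-account flow is one equivalent way to do it.
gc = gspread.service_account()  # expects a service-account JSON credential
```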

2. Input the URLs you wish to analyze, including your own.

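Again in place of the screenshot, a sketch with placeholder URLs (the real list would be the ranking pages you copied from the SERP, plus the HGG article):

```python
# The pages to analyze: the top-ranking competitors plus your own article.
# These URLs are illustrative placeholders, not actual SERP results.
urls = [
    "https://competitor-one.example.com/best-standing-desks",
    "https://competitor-two.example.com/standing-desk-reviews",
    "https://competitor-three.example.com/best-stand-up-desks",
    "https://www.highgroundgaming.com/best-standing-desks/",  # our own page
]
```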

3. We are doing a few things in this step, and depending on the number of URLs you’re passing in, it may take a few seconds to run. First, we tell TextRazor to process the text contained in each URL we’ve passed in. Next, we sort the response to show the most relevant entities first. And finally, we create a response DataFrame.

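A sketch of what that loop might look like with the textrazor client and pandas. The entity fields used here (id, relevance_score, confidence_score) are real TextRazor entity properties; the column names are my own:

```python
# Ask TextRazor for entity extraction on each page.
client = textrazor.TextRazor(extractors=["entities"])

rows = []
for url in urls:
    # TextRazor fetches and processes the page's text for us.
    response = client.analyze_url(url)

    # Sort the response so the most relevant entities come first.
    entities = sorted(
        response.entities(), key=lambda e: e.relevance_score, reverse=True
    )

    for entity in entities:
        rows.append({
            "url": url,
            "entity": entity.id,
            "relevance": entity.relevance_score,
            "confidence": entity.confidence_score,
        })

# Collect everything into a single response DataFrame.
df = pd.DataFrame(rows)
```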

4. Finally, we output the data to a designated Google Sheet, populating it with all the columns in df.columns: the list of URLs, the entities, and their respective relevance and confidence scores. Check the TextRazor API documentation for definitions of the relevance and confidence scores and how to use them.

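The article does this with Pipulate; since I can’t reproduce that call from the screenshot, here’s an equivalent push using gspread and the gspread-dataframe helper (the sheet name is a placeholder), with a CSV fallback:

```python
from gspread_dataframe import set_with_dataframe

# Write every column of df (url, entity, relevance, confidence)
# into the designated Google Sheet.
sheet = gc.open("Content Gap Analysis").sheet1  # placeholder sheet name
set_with_dataframe(sheet, df)

# A plain CSV works just as well for the analysis that follows.
df.to_csv("content_gap_analysis.csv", index=False)
```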

Once you output the data into a Google Sheet (or a CSV, for that matter), it’s ready to be manipulated, filtered, and sorted in ways that allow you to better understand your content gaps. Note that the post-processing analysis can also be automated to a degree, but someone will still need to review the gap data to pick out the most relevant pieces; some amount of noise is inevitable. To give you an example of what the content gap analysis might look like, let’s take a look at our “best standing desk” data.

What I’ve done here is count up the number of times a particular “entity” appears across the sample set of competitive URLs, filter the list to show only those entities that appeared at least twice, sort the list by relevance, and filter out any duplicate or overlapping entities. A sketch of that post-processing follows.

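Here’s one way to do that post-processing in pandas, under the assumptions above (our own URL is the placeholder from step 2, and I aggregate relevance with the max across pages; de-duplicating overlapping entities is the part that still takes a human eye):

```python
# Entities already covered on our own page.
own_url = "https://www.highgroundgaming.com/best-standing-desks/"  # placeholder
own_entities = set(df.loc[df["url"] == own_url, "entity"])

# For each entity, count how many distinct competitor pages it appears on.
competitors = df[df["url"] != own_url]
summary = (
    competitors.groupby("entity")
    .agg(pages=("url", "nunique"), relevance=("relevance", "max"))
    .reset_index()
)

# Keep entities that appear on at least two competitor pages,
# most relevant first.
summary = summary[summary["pages"] >= 2].sort_values("relevance", ascending=False)

# The candidate content gaps: common competitor entities missing from our page.
gaps = summary[~summary["entity"].isin(own_entities)]
print(gaps.head(20))
```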

Look for the entities, or content gaps, that have a relatively high degree of relevance and confidence. Using our example, we can see that HGG doesn’t mention “treadmill desk”; although that concept has a fairly low relevance score, it might serve their audience better if HGG were to incorporate and review treadmill desks in addition to stand-up desks.




