Combination of Web Crawling and NLP as a Search Engine Optimization Solution

An important part of SEO (search engine optimization) is optimizing the meta description, title, and H1 tags to improve a site's position on search engine results pages (SERPs) for desired keywords. Crawling is one of the most common tasks in technical SEO. However, analyzing a crawl can take a long time, especially on high-volume websites. Good tools help analysts work faster and support more advanced analysis. In this data science project, I fetch the meta description, title, and H1 (header) text from every Harvard.edu page listed in its XML sitemap. I then apply NLP (natural language processing) to interpret the text and identify common keywords.

I used BeautifulSoup to retrieve the links from https://www.harvard.edu/sitemap.xml, which lists all sitemap URLs along with other information.

[Image: code that retrieves the sitemap URLs, with its output]


As the code above shows, the first cell prints the first five URLs in the list retrieved from the sitemap, and the second cell displays the total number of URLs.
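For readers who cannot see the original screenshot, the sitemap step can be sketched roughly as follows. This is not the article's code: the article used BeautifulSoup, while this sketch swaps in Python's standard library (urllib and xml.etree.ElementTree) so it runs without extra installs. The function names and the sample sitemap with example.edu URLs are made up for illustration, and the sketch assumes a plain sitemap that lists page URLs in <loc> elements.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset>, <url> and <loc>.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def fetch_sitemap_urls(sitemap_url, timeout=30):
    """Download a sitemap and return its URLs (performs a network request)."""
    with urllib.request.urlopen(sitemap_url, timeout=timeout) as response:
        return parse_sitemap(response.read())


# Made-up sample so the parsing step can be checked offline.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.edu/</loc></url>
  <url><loc>https://www.example.edu/about/</loc></url>
</urlset>"""

sample_urls = parse_sitemap(SAMPLE_SITEMAP)
print(sample_urls[:5])   # the first five URLs (here there are only two)
print(len(sample_urls))  # the number of URLs
```

Calling fetch_sitemap_urls("https://www.harvard.edu/sitemap.xml") would run the same parsing step against the live sitemap.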

H1, Title, and Meta Description: the most important SEO tags

I target three popular tags (title, meta description, and H1) that matter most for SEO. According to SearchMetrics, 80% of first-page search results on Google use an H1 header. A meta description (sometimes called a meta description attribute or tag) is an HTML element that describes and summarizes the contents of a page for the benefit of users and search engines. The title tag, in turn, conveys the main message of the web page. Of the three, only the H1 is displayed in the body of the page; the title and meta description sit in the page's head and are not shown in the page content. The picture below shows an example of how these tags are placed in one of the 922 pages.

[Image: example of the title, meta description, and H1 tags on one page]


SEO tags data frame

After cleaning the list of 922 URLs, I crawl every page in the list and extract the text inside its title, meta description, and H1 tags. The text fetched from all pages is collected into this data frame.
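The extraction step might look something like the sketch below. It is an illustration, not the article's code: it uses the standard library's html.parser instead of BeautifulSoup, and the class name, function name, column names, and sample page are all hypothetical. It builds one row per page from HTML that has already been downloaded.

```python
from html.parser import HTMLParser


class SeoTagParser(HTMLParser):
    """Collect the <title> text, the meta description and all <h1> text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.h1 = []
        self._current = None  # "title" or "h1" while inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._current = "title"
        elif tag == "h1":
            self._current = "h1"
            self.h1.append("")  # start a new heading
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.h1[-1] += data  # also captures text in nested tags


def extract_seo_tags(url, html):
    """Return one row (a dict) with the SEO tag text of a single page."""
    parser = SeoTagParser()
    parser.feed(html)
    return {
        "url": url,
        "title": parser.title.strip(),
        "meta_description": parser.description,
        "h1": " ".join(text.strip() for text in parser.h1),
    }


# Made-up page so the sketch runs offline.
SAMPLE_PAGES = {
    "https://www.example.edu/": (
        "<html><head><title>Example University</title>"
        '<meta name="description" content="Home page of Example University.">'
        "</head><body><h1>Welcome to <em>Example</em></h1></body></html>"
    ),
}

rows = [extract_seo_tags(url, html) for url, html in SAMPLE_PAGES.items()]
print(rows)
```

A list of dicts like rows can be passed directly to pandas.DataFrame(rows) to get one row per URL and one column per tag.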

[Image: excerpt of the SEO tags data frame]


As the excerpt above shows, this data frame has 922 rows, which matches the size of the URL list retrieved in the previous section.

Applying NLP to filter the most common keywords on the entire website

With the data frame prepared, the next step is to apply NLP to analyze the SEO tags and keywords across the website's pages. I process each column of the SEO tags data frame using functions from the NLTK library:

  • Remove HTML
  • Tokenization + Remove punctuation + Lowercase letters
  • Remove stop words + Remove non alphabet characters
  • Detokenize
  • Lemmatization and Stemming
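
The steps above can be approximated in a few lines. The sketch below is an assumption-laden illustration rather than the article's NLTK code: it uses only the standard library, a deliberately tiny stop-word list, and a pluggable stem function that defaults to leaving words unchanged. It also stems before joining the tokens back into a string, which may differ from the order used in the article. With NLTK installed, nltk.stem.PorterStemmer().stem could be passed as the stem argument.

```python
import html
import re

# Tiny illustrative stop-word list (an assumption); NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "is", "of", "on", "the", "to"}


def identity(word):
    """Default 'stemmer' that leaves the word unchanged."""
    return word


def clean_text(raw, stem=identity):
    """Apply the cleaning steps to one string and return the cleaned string."""
    text = re.sub(r"<[^>]+>", " ", raw)           # remove HTML tags
    text = html.unescape(text)                     # decode entities such as &amp;
    tokens = re.findall(r"[a-z]+", text.lower())   # lowercase, tokenize, drop punctuation and non-letters
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    tokens = [stem(t) for t in tokens]             # stem (no-op by default)
    return " ".join(tokens)                        # detokenize


print(clean_text("<p>The President &amp; Fellows of Harvard University!</p>"))
# prints: president fellows harvard university
```

Keeping the stemmer as a parameter makes it easy to compare the raw keywords with their stemmed forms.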

Frequent keyword data visualization

After refining the strings for each tag, we plot the 10 most frequent words used in the title, H1, and meta description tags across the entire website. Because the last NLP step stems the words, some of the words displayed below are word roots, such as univers, which stands for university, and presid, which stands for president.

[Images: top 10 most frequent words in the title, H1, and meta description tags]
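
The counts behind charts like these can be computed with collections.Counter. The sketch below is a minimal illustration: the function name and the sample strings are invented and are not the article's data. It assumes each input string has already been cleaned and stemmed into space-separated words.

```python
from collections import Counter


def top_words(texts, n=10):
    """Return the n most frequent words across an iterable of cleaned strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)


# Made-up cleaned titles, for illustration only.
sample_titles = ["harvard univers", "histori harvard univers", "presid harvard"]

for word, count in top_words(sample_titles, n=10):
    print(word, count)
```

The resulting (word, count) pairs can then be passed to any plotting library as the labels and heights of a bar chart.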

Keywords Analysis:

As shown above, the most frequent words relate to the branding and mission of Harvard.edu, including academic, university, Harvard, president, and history. Some words appear the same number of times in the meta descriptions, which suggests that some meta description content is duplicated across different URLs.
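
One way to confirm the suspected duplication directly, rather than inferring it from word counts, is to group URLs by their exact meta description. The sketch below is a hypothetical check that is not part of the original analysis. It assumes each row is a dict with "url" and "meta_description" keys, and the sample rows are invented.

```python
from collections import defaultdict


def find_duplicate_descriptions(rows):
    """Map each meta description used on more than one URL to those URLs."""
    urls_by_description = defaultdict(list)
    for row in rows:
        description = row["meta_description"].strip()
        if description:  # ignore pages without a meta description
            urls_by_description[description].append(row["url"])
    return {
        description: urls
        for description, urls in urls_by_description.items()
        if len(urls) > 1
    }


# Made-up rows, for illustration only.
sample_rows = [
    {"url": "https://www.example.edu/a", "meta_description": "Shared text."},
    {"url": "https://www.example.edu/b", "meta_description": "Shared text."},
    {"url": "https://www.example.edu/c", "meta_description": "Unique text."},
    {"url": "https://www.example.edu/d", "meta_description": ""},
]

duplicates = find_duplicate_descriptions(sample_rows)
print(duplicates)
```

Pages that show up in the result are candidates for a rewritten, page-specific meta description.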


Conclusion:

This research was conducted on the website of one of the preeminent universities in the world. Most universities are non-profit organizations, but this solution can also be applied to the websites of for-profit organizations, which rely on the return on investment from ranking for desired keywords. The goal of SEO is to rank higher for more of the desired keywords, relative to particular business interests.

By Hamid Zade (Gholizadegan), MBA
