How to Use Python With Jupyter Notebook to Run Correlation Studies

Marat Gaziev, MBA

Published Dec 3, 2017

This post originally appeared on ThinkGrow.co.

As a longtime practitioner or an active observer of SEO, you may have noticed that the term has become somewhat synonymous with correlation studies, and rightly so. In an industry shrouded by enigmatic ranking algorithms and Google’s apparent lack of transparency, correlation studies offer us an opportunity to pull back this shroud of mystery slightly. But as we all know, correlation studies come with a big, fat forewarning: just because two or more things are correlated does not mean one causes the other. Simply put, correlation does not imply causation. Regardless, correlations can tell us a lot about our dataset and can often indicate a predictive relationship in various contexts, SEO or otherwise. This isn’t a study in causality, though; rather, I want to demonstrate how Python and its ecosystem of libraries make it easy to run your own correlations. Below, I outline a step-by-step guide to using Python with Jupyter Notebook to find correlations within a dataset.

Exploring Dataset With Pandas

Pandas is a powerful data analysis library that allows us to quickly and easily manipulate data in Python. But before we do anything, we must first decide which dataset to explore. For the purposes of this example, I decided to use the dataset provided by AJ Ghergich as part of his excellent study on Featured Snippets.

Before we dive in, there are a few prerequisite steps you need to follow in order to set up Jupyter Notebook:

Download Anaconda. I strongly recommend the latest Python 3 distribution of Anaconda.
Follow the instructions on the download page to install Anaconda, then launch the Navigator.
Launch Jupyter Notebook from the Navigator, or run the following command in your Terminal window or Command Prompt: jupyter notebook
Click on the “New” dropdown and launch your Python 3 notebook.

Give your new notebook a title so you can easily find it later, and that’s pretty much it. You are now ready to write some Python code.

Reading In Your Data

The steps below show you how to read your data into a pandas DataFrame. You can use either the read_csv or the read_excel method, depending on how you saved your data. Keep in mind that the read_excel method reads both Excel 2003 files and Excel 2007 or later files.

As the very first step, you’ll need to import the pandas library. Execute the code by hitting the “run cell” button at the top.
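A minimal sketch of that first cell is below. The alias pd is the common community convention rather than something the original post confirms, and the version print is only there as a quick sanity check.

```python
# Import pandas under its conventional alias, pd.
import pandas as pd

# Printing the version is a quick way to confirm the import worked.
print(pd.__version__)
```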

Now you’re ready to read your file into a DataFrame. Make sure to give the variable a relevant name, and specify your local file path or buffer.
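Here is a rough sketch of what that cell might look like. To keep it runnable without the real study file, it reads from an in-memory buffer holding a few made-up rows; the values and the example.com URLs are assumptions for illustration only. The variable name mydata matches the one used later in this post.

```python
import io

import pandas as pd

# Stand-in for the real file: a tiny CSV held in memory (hypothetical values).
sample_csv = io.StringIO(
    "URL,Word Count,Dale-Chall Score,URL Total Shares\n"
    "https://example.com/a,1200,6.1,340\n"
    "https://example.com/b,450,8.3,12\n"
    "https://example.com/c,2300,5.7,980\n"
)

# read_csv accepts either a file path or a buffer. With a file saved on disk,
# you would pass its path as a string instead of the buffer.
mydata = pd.read_csv(sample_csv)

# Show the first few rows to confirm the data loaded as expected.
print(mydata.head())
```

For an Excel workbook, pd.read_excel takes a file path in the same way.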

You can print the column names contained in your data file by running the following print command.
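As a sketch, the command is likely something along these lines. The small DataFrame here is a made-up stand-in so the cell runs on its own; with the real file already loaded, only the print line is needed.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from the study file.
mydata = pd.DataFrame(
    {
        "URL": ["https://example.com/a", "https://example.com/b"],
        "Word Count": [1200, 450],
        "URL Total Shares": [340, 12],
    }
)

# The columns attribute holds the column labels as a pandas Index.
print(mydata.columns)
```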

You’ll get output similar to the following. This is what I get when I read in the file from the Featured Snippet study:

Index(['URL', 'Domain', 'TLD', 'Scheme', 'Content Length', 'HTML Length',
       'Text Length', 'Text to HTML Ratio', 'Title', 'Title Length',
       'Description', 'Description Length', 'Word Count', 'Sentence Count',
       'Header Count', 'Paragraph Count', 'Reading Time', 'Sentiment',
       'Sentiment Score', 'Dale-Chall Score', 'Flesch Kincaid Grade Level',
       'Flesch Kincaid Reading Ease Score', 'Flesch Kincaid Reading Ease',
       'Gunning Fog Score', 'Smog Index', 'Images', 'Images with Alt',
       'Images without Alt', 'Videos', 'External Link Count',
       'Internal Link Count', 'Total Link Count',
       'Domain Mozscape Domain Authority', 'Domain Mozscape Page Authority',
       'Domain Mozscape External Equity Links', 'Domain Mozscape MozRank',
       'Domain Mozscape MozTrust', 'Homepage LinkedIn Shares',
       'Homepage Pinterest Pins', 'Homepage Total Shares',
       'URL Mozscape Domain Authority', 'URL Mozscape Page Authority',
       'URL Mozscape External Equity Links', 'URL Mozscape MozRank',
       'URL Mozscape MozTrust', 'URL LinkedIn Shares', 'URL Pinterest Pins',
       'URL Total Shares', 'SEMRush Rank', 'SEMRush Organic Keywords',
       'SEMRush Organic Traffic', 'SEMRush Organic Cost',
       'SEMRush Adwords Keywords', 'SEMRush Adwords Traffic',
       'SEMRush Adwords Cost', 'Mobile Friendly', 'Mobile Friendly Score',
       'Uses Incompatible Plugins', 'Content Wider Than Screen',
       'Links Too Close Together', 'Text Too Small To Read',
       'Mobile Viewport Not Set', 'Mobile Friendly Url', 'PageSpeed Device',
       'Mobile Speed Score', 'Usability Score', 'Speed Score',
       'PageSpeed Mobile Url', 'PageSpeed Device.1', 'Desktop Speed Score',
       'Tables', 'Ordered List', 'Unordered List',
       'Lists - Either OR (OL OR UL Usage)'],
      dtype='object')

The above shows us all of the column names contained in our data file. You can also print the shape of the data if you’d like to see how many rows and columns it contains.
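A short sketch of checking the shape follows, again using a made-up stand-in DataFrame so the example is self-contained. The shape attribute returns a (rows, columns) tuple.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from the study file.
mydata = pd.DataFrame(
    {
        "Word Count": [1200, 450, 2300],
        "Dale-Chall Score": [6.1, 8.3, 5.7],
        "URL Total Shares": [340, 12, 980],
    }
)

# shape is an attribute, not a method, so there are no parentheses.
rows, columns = mydata.shape
print(f"{rows} rows x {columns} columns")
```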

Visualizing Your Data

Next, we may want to start visualizing our dataset, or at least the inputs we’re interested in. To do this, we are going to use Matplotlib, the foremost Python plotting library. To get started, we are going to import the library using the following import statement:

import matplotlib.pyplot as plt

Run the cell to execute the import.

Next, we are going to plot and show a histogram. I’m interested in visualizing the distribution of Domain Authority within this particular dataset, so that’s the column I’m going to use as the input.
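The histogram cell could look roughly like this. The post does not say which of the Domain Authority columns was plotted, so the column name "Domain Mozscape Domain Authority" is an assumption, and the values are invented so the sketch runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in values; the real column comes from the study file.
mydata = pd.DataFrame(
    {"Domain Mozscape Domain Authority": [35, 62, 78, 81, 88, 90, 93, 95]}
)

# Drop missing values first so they cannot interfere with the binning.
authority = mydata["Domain Mozscape Domain Authority"].dropna()

# plt.hist returns the bin counts, the bin edges, and the drawn patches.
counts, bin_edges, _ = plt.hist(authority, bins=10)

plt.xlabel("Domain Authority")
plt.ylabel("Number of URLs")
plt.title("Distribution of Domain Authority")
plt.show()
```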

After plotting our data, we can see what the distribution of Domain Authority looks like. Our dataset skews towards the higher end of the spectrum.

Finding Correlations

Pandas makes finding correlations extremely easy. We can use the corr method to compute the pairwise correlation of columns using the Pearson, Kendall, or Spearman method. At this point you should have a well-formulated hypothesis, or at least an idea of what you’d like to prove or disprove with the help of correlations. For example, looking at the data gathered by AJ Ghergich for his study on Featured Snippets, I’m interested in seeing how Social Shares are correlated with other data, particularly Word Count and the Dale-Chall Score. My hypothesis is that longer articles that have a low Dale-Chall Score, meaning they are easier to read, are shared more often through social media than those that are shorter and harder to read. So I’d expect a positive correlation between Word Count and Social Shares, and a strong negative correlation between Dale-Chall Score and Social Shares. Let’s take a look.

mydata.corr('pearson')["URL Total Shares"]
    
    Content Length                           0.025096
    HTML Length                              0.027943
    Text Length                              0.006617
    Text to HTML Ratio                      -0.009764
    Title Length                            -0.008611
    Description Length                       0.003576
    Word Count                               0.012393
    Sentence Count                           0.019827
    Header Count                             0.033657
    Paragraph Count                          0.019222
    Sentiment Score                          0.025975
    Dale-Chall Score                        -0.016972
    Flesch Kincaid Grade Level              -0.015324
    Flesch Kincaid Reading Ease Score        0.017746
    Gunning Fog Score                       -0.031502
    Smog Index                              -0.017827
    Images                                   0.046815
    Images with Alt                          0.027644
    Images without Alt                       0.042049
    Videos                                   0.015922
    External Link Count                      0.008700
    Internal Link Count                     -0.005896
    Total Link Count                        -0.004012
    Domain Mozscape Domain Authority        -0.015110
    Domain Mozscape Page Authority          -0.005406
    Domain Mozscape External Equity Links   -0.005328
    Domain Mozscape MozRank                  0.000044
    Domain Mozscape MozTrust                -0.011382
    Homepage LinkedIn Shares                -0.005707
    Homepage Pinterest Pins                  0.008726
    Homepage Total Shares                    0.008543
    URL Mozscape Domain Authority           -0.014716
    URL Mozscape Page Authority              0.021478
    URL Mozscape External Equity Links       0.002473
    URL Mozscape MozRank                     0.025344
    URL Mozscape MozTrust                    0.020126
    URL LinkedIn Shares                      0.014667
    URL Pinterest Pins                       0.999969
    URL Total Shares                         1.000000
    SEMRush Rank                            -0.006452
    SEMRush Organic Keywords                -0.017145
    SEMRush Organic Traffic                 -0.016675
    SEMRush Organic Cost                    -0.017952
    SEMRush Adwords Keywords                -0.005523
    SEMRush Adwords Traffic                 -0.006049
    SEMRush Adwords Cost                    -0.006742
    Mobile Friendly Score                    0.016018
    Usability Score                          0.017139
    Speed Score                             -0.030370
    Tables                                  -0.006556
    Ordered List                            -0.000192
    Unordered List                           0.007822
    Name: URL Total Shares, dtype: float64
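If you only care about a couple of columns, or want to compare methods, a sketch like the one below may help. It is not part of the original analysis: the data is invented, and selecting numeric columns with select_dtypes is my own precaution, since depending on your pandas version corr may raise an error when the DataFrame contains text columns such as URL.

```python
import pandas as pd

# Hypothetical stand-in data, including one text column like the real file.
mydata = pd.DataFrame(
    {
        "URL": [f"https://example.com/{i}" for i in range(6)],
        "Word Count": [450, 800, 1200, 1500, 2300, 3100],
        "Dale-Chall Score": [8.3, 7.9, 6.1, 6.8, 5.7, 5.2],
        "URL Total Shares": [12, 40, 340, 95, 980, 410],
    }
)

# Keep only numeric columns so text columns cannot break the calculation.
numeric = mydata.select_dtypes(include="number")

# Compare a linear measure (Pearson) with a rank-based one (Spearman).
results = {}
for method in ("pearson", "spearman"):
    results[method] = numeric.corr(method=method)["URL Total Shares"]
    print(method)
    print(results[method][["Word Count", "Dale-Chall Score"]])
```

Spearman works on ranks, so it is less sensitive to a handful of URLs with unusually high share counts than Pearson is.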

Although the data is not as strongly correlated as we would have liked, we do, in fact, see a positive correlation between Word Count and Social Shares (0.012) and a negative correlation between the Dale-Chall Score and Social Shares (-0.017). Both coefficients are very close to zero, so they offer only weak support for the hypothesis. Surprisingly, Pinterest Pins have by far the strongest correlation with Social Shares (0.99997). This can be interpreted in two ways: either the data skews strongly towards Pinterest, so Pinterest is overrepresented, or the URLs chosen for this study deal in subject matter that is visual in nature and contain a lot of images. So there you have it: with just a few lines of code, you’ll be running your own correlation studies in no time. Happy analyzing!

