How to Use Python With Jupyter Notebook to Run Correlation Studies
This post originally appeared on ThinkGrow.co.
If you are a longtime practitioner or an active observer of SEO, you may have noticed that the term has become somewhat synonymous with correlation studies, and rightly so. In an industry shrouded by enigmatic ranking algorithms and Google’s apparent lack of transparency, correlation studies offer us an opportunity to pull back this shroud of mystery, if only slightly. But as we all know, correlation studies come with a big, fat warning: just because two or more things are correlated does not mean one causes the other. Simply put, correlation does not imply causation. Regardless, correlations can tell us a lot about our dataset and can often indicate a predictive relationship in various contexts, SEO or otherwise.
This isn’t a study in causality, though. Rather, I want to demonstrate how Python and its ecosystem of libraries make it easy to run your own correlations. Below, I outline a step-by-step guide to using Python with Jupyter Notebook to find correlations within a dataset.
Exploring a Dataset With Pandas
Pandas is an amazing and powerful data analysis library which allows us to quickly and easily manipulate data in Python. But before we do anything we must first decide which dataset to explore. For the purposes of this example, I decided to use the following dataset provided by AJ Ghergich as part of his excellent study on Featured Snippets.
Before we dive in, there are a few prerequisite steps you need to follow in order to set up Jupyter Notebook:
- Download Anaconda. I strongly recommend you download the latest Python 3 distribution of Anaconda.
- Follow the instructions on the download page to install Anaconda. Now go ahead and launch the Navigator.
- You can launch Jupyter Notebook from the Navigator, or you can run the following command in your Terminal window or Command Prompt: jupyter notebook
- Click on the “New” dropdown and launch your Python 3 notebook.
You should give your new notebook a title so you can easily find it later, but that’s pretty much it. You are now ready to write some Python code.
Reading In Your Data
The steps below show how to read your data into a Pandas DataFrame. You can use either the read_csv or the read_excel function, depending on how you saved your data. Keep in mind that read_excel reads both Excel 2003 (.xls) files and Excel 2007 or later (.xlsx) files.
As the very first step, you’ll need to import the pandas library. Execute the code by hitting the “run cell” button at the top.
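A minimal sketch of that first cell, assuming the conventional alias pd (the alias is a community convention, not a requirement):

```python
# Import pandas under its conventional short alias.
import pandas as pd

# Printing the version is an optional sanity check that the import worked.
print(pd.__version__)
```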
Now you’re ready to read your file into a DataFrame. Make sure to give the variable a relevant name, and specify your local filepath or buffer.
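As a rough sketch, the read step might look like this. The variable name mydata matches the one used later in this post. The CSV text and the commented-out filename are hypothetical stand-ins so the cell runs without the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real export: a tiny in-memory CSV (a "buffer")
# with made-up values and three of the study's column names.
sample_csv = io.StringIO(
    "URL,Word Count,URL Total Shares\n"
    "https://example.com/a,1200,35\n"
    "https://example.com/b,450,4\n"
)

# With a file on disk you would pass its path instead, for example:
# mydata = pd.read_csv("featured_snippets.csv")  # hypothetical filename
mydata = pd.read_csv(sample_csv)

# Show the first rows to confirm the data loaded as expected.
print(mydata.head())
```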
You can print the column names contained in your datafile with a print command.
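For instance, assuming the DataFrame is called mydata (here a tiny stand-in with three of the real column names and made-up values, so the snippet runs on its own):

```python
import pandas as pd

# Stand-in for the DataFrame loaded from the study file; values are made up.
mydata = pd.DataFrame({
    "URL": ["https://example.com/a", "https://example.com/b"],
    "Word Count": [1200, 450],
    "URL Total Shares": [35, 4],
})

# .columns holds the column labels as a pandas Index.
print(mydata.columns)
```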
You’ll get output like the following. Here is what I get when I read in the file from the Featured Snippets study:
Index(['URL', 'Domain', 'TLD', 'Scheme', 'Content Length', 'HTML Length',
'Text Length', 'Text to HTML Ratio', 'Title', 'Title Length',
'Description', 'Description Length', 'Word Count', 'Sentence Count',
'Header Count', 'Paragraph Count', 'Reading Time', 'Sentiment',
'Sentiment Score', 'Dale-Chall Score', 'Flesch Kincaid Grade Level',
'Flesch Kincaid Reading Ease Score', 'Flesch Kincaid Reading Ease',
'Gunning Fog Score', 'Smog Index', 'Images', 'Images with Alt',
'Images without Alt', 'Videos', 'External Link Count',
'Internal Link Count', 'Total Link Count',
'Domain Mozscape Domain Authority', 'Domain Mozscape Page Authority',
'Domain Mozscape External Equity Links', 'Domain Mozscape MozRank',
'Domain Mozscape MozTrust', 'Homepage LinkedIn Shares',
'Homepage Pinterest Pins', 'Homepage Total Shares',
'URL Mozscape Domain Authority', 'URL Mozscape Page Authority',
'URL Mozscape External Equity Links', 'URL Mozscape MozRank',
'URL Mozscape MozTrust', 'URL LinkedIn Shares', 'URL Pinterest Pins',
'URL Total Shares', 'SEMRush Rank', 'SEMRush Organic Keywords',
'SEMRush Organic Traffic', 'SEMRush Organic Cost',
'SEMRush Adwords Keywords', 'SEMRush Adwords Traffic',
'SEMRush Adwords Cost', 'Mobile Friendly', 'Mobile Friendly Score',
'Uses Incompatible Plugins', 'Content Wider Than Screen',
'Links Too Close Together', 'Text Too Small To Read',
'Mobile Viewport Not Set', 'Mobile Friendly Url', 'PageSpeed Device',
'Mobile Speed Score', 'Usability Score', 'Speed Score',
'PageSpeed Mobile Url', 'PageSpeed Device.1', 'Desktop Speed Score',
'Tables', 'Ordered List', 'Unordered List',
'Lists - Either OR (OL OR UL Usage)'],
dtype='object')
The above shows us all of the column names contained in our datafile. You can also print the shape of the data if you’d like to see how many rows and columns your datafile contains.
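A short sketch of the shape check, again using a made-up stand-in for mydata so it runs on its own:

```python
import pandas as pd

# Stand-in for the DataFrame loaded from the study file; values are made up.
mydata = pd.DataFrame({
    "URL": ["https://example.com/a", "https://example.com/b"],
    "Word Count": [1200, 450],
    "URL Total Shares": [35, 4],
})

# shape is an attribute (no parentheses): (number_of_rows, number_of_columns).
rows, columns = mydata.shape
print(f"{rows} rows x {columns} columns")
```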
Visualizing Your Data
Next, we may want to start visualizing our dataset, or at least the inputs we’re interested in. To do this, we are going to use Matplotlib, the foremost Python plotting library. To get started, we import the library with the following import statement:
import matplotlib.pyplot as plt
Run the cell to execute the import.
Next, we are going to plot and show a histogram. I’m interested in visualizing the distribution of Domain Authority within this particular dataset, so that’s the column name I’m going to use as an input.
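One way this could look, as a sketch rather than the exact cell from the original post. It assumes the column in question is 'Domain Mozscape Domain Authority' (the file also has a 'URL Mozscape Domain Authority' column), and it uses made-up values in place of the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in values, made up for illustration; the real column comes from the file.
mydata = pd.DataFrame({
    "Domain Mozscape Domain Authority": [35, 52, 61, 78, 80, 84, 88, 91, 93, 95],
})

# Drop missing values first so they do not interfere with binning.
authority = mydata["Domain Mozscape Domain Authority"].dropna()

# hist returns the per-bin counts, the bin edges, and the drawn bars.
counts, bin_edges, _ = plt.hist(authority, bins=10)
plt.xlabel("Domain Authority")
plt.ylabel("Number of URLs")
plt.title("Distribution of Domain Authority")
plt.show()
```

The bins argument controls how coarse the histogram is. Ten bins is just a starting point to adjust for your own data.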
After plotting our data, here is what the distribution of Domain Authority looks like. We can see that our dataset skews towards the higher end of the spectrum.
Finding Correlations
Pandas makes finding correlations extremely easy. We can use the corr method to compute the pairwise correlation of columns using the Pearson, Kendall, or Spearman method. At this point you should have a well-formulated hypothesis, or at least an idea of what you’d like to test with the help of correlations. For example, looking at the data gathered by AJ Ghergich for his study on Featured Snippets, I’m interested in seeing how Social Shares are correlated with other data, particularly Word Count and the Dale-Chall Score. My hypothesis is that longer articles with a low Dale-Chall Score, meaning they are easier to read, are shared more often through social media than those that are shorter and harder to read. So I’d expect a positive correlation between Word Count and Social Shares, and a strong negative correlation between Dale-Chall Score and Social Shares. Let’s take a look.
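# numeric_only=True skips text columns such as URL (needed on pandas 2.0 and later)
mydata.corr(method='pearson', numeric_only=True)["URL Total Shares"]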
Content Length 0.025096
HTML Length 0.027943
Text Length 0.006617
Text to HTML Ratio -0.009764
Title Length -0.008611
Description Length 0.003576
Word Count 0.012393
Sentence Count 0.019827
Header Count 0.033657
Paragraph Count 0.019222
Sentiment Score 0.025975
Dale-Chall Score -0.016972
Flesch Kincaid Grade Level -0.015324
Flesch Kincaid Reading Ease Score 0.017746
Gunning Fog Score -0.031502
Smog Index -0.017827
Images 0.046815
Images with Alt 0.027644
Images without Alt 0.042049
Videos 0.015922
External Link Count 0.008700
Internal Link Count -0.005896
Total Link Count -0.004012
Domain Mozscape Domain Authority -0.015110
Domain Mozscape Page Authority -0.005406
Domain Mozscape External Equity Links -0.005328
Domain Mozscape MozRank 0.000044
Domain Mozscape MozTrust -0.011382
Homepage LinkedIn Shares -0.005707
Homepage Pinterest Pins 0.008726
Homepage Total Shares 0.008543
URL Mozscape Domain Authority -0.014716
URL Mozscape Page Authority 0.021478
URL Mozscape External Equity Links 0.002473
URL Mozscape MozRank 0.025344
URL Mozscape MozTrust 0.020126
URL LinkedIn Shares 0.014667
URL Pinterest Pins 0.999969
URL Total Shares 1.000000
SEMRush Rank -0.006452
SEMRush Organic Keywords -0.017145
SEMRush Organic Traffic -0.016675
SEMRush Organic Cost -0.017952
SEMRush Adwords Keywords -0.005523
SEMRush Adwords Traffic -0.006049
SEMRush Adwords Cost -0.006742
Mobile Friendly Score 0.016018
Usability Score 0.017139
Speed Score -0.030370
Tables -0.006556
Ordered List -0.000192
Unordered List 0.007822
Name: URL Total Shares, dtype: float64
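Because the hypothesis only concerns a few columns, it can help to narrow the result and compare methods. The sketch below uses made-up values (not the study's data) under the same column names. Spearman is rank-based, so it is less sensitive than Pearson to a handful of URLs with very large share counts.

```python
import pandas as pd

# Toy values, made up for illustration only; column names follow the study file.
mydata = pd.DataFrame({
    "Word Count": [300, 800, 1200, 1500, 2500],
    "Dale-Chall Score": [9.1, 8.4, 7.9, 7.2, 6.5],
    "URL Total Shares": [2, 10, 8, 40, 300],
})

target = "URL Total Shares"
results = {}
for method in ("pearson", "spearman"):
    # Correlate every column with the target, then drop the target's
    # correlation with itself (always 1.0).
    results[method] = mydata.corr(method=method)[target].drop(target)

# One row per column, one column per correlation method.
comparison = pd.DataFrame(results)
print(comparison)
```

If the two methods disagree sharply on the real data, that is a hint that a few extreme values are driving the Pearson coefficient.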
Although the data is not as strongly correlated as we would have liked it to be, we do, in fact, see a positive correlation between Word Count and Social Shares (0.012) and a negative correlation between the Dale-Chall Score and Social Shares (-0.017). Both coefficients are very close to zero, however, so they offer only weak support for the hypothesis. Surprisingly, Pinterest Pins appear to have the strongest correlation with Social Shares (0.99997). This can be interpreted in two ways: either the data skews strongly towards Pinterest, which is therefore overrepresented, or the URLs chosen for this study deal in subject matter that is visual in nature and contain a lot of images. So there you have it: with just a few lines of code, you'll be running your own correlation studies in no time. Happy analyzing!