Calculating Correlation of Data Attributes

In optimizing email campaigns, it is best to segment your email list into smaller lists, based on the attributes of the subscribers. The typical attributes that we collect are demographic, psychographic, and purchase behavior, as well as email open rates and click-through rates (CTR). However, how do we know that these attributes correlate in any statistically significant way?

I did a blog post on using a one-way ANOVA (https://fujoanalytics.com/blog/using_one-way-_anova_for_email_list_segmentation/). Here, however, I’m going to concentrate on correlation and how to determine if two attributes are correlated.

If we want to understand if two variables are correlated, the first thing we have to understand is if they covary. That means that when one variable moves away from its mean, we would expect a related variable to move away from its own mean in a similar way.

For this example, let’s use the number of repeat campaign emails sent to each subscriber and the conversion rate. Let’s say that we think that the more emails we send, the higher the conversion rate will be. We look at some of the data and we find that those subscribers that received more emails converted at a higher rate than those that received fewer.

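It appears that these two variables are related, but a calculation would let us be more confident. Luckily, there is one: the covariance equation, where x and y are the two variables and N is the number of subscribers in the sample:

$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$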

This equation may look daunting to some, but it basically takes each subscriber's deviation from the mean in one variable and multiplies it by that subscriber's deviation in the other variable. Those products are then added together; that sum is the top part of the equation and is called the sum of cross-product deviations. Finally, the sum is divided by the number of subscribers in the sample minus one (there is a good reason for the minus one; check out this video for an explanation: http://bit.ly/2FU7PFm).
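As a minimal sketch of that calculation in R, the snippet below computes the covariance step by step and compares it with the built-in cov() function. The vectors emails_sent and conversion_rate are made-up illustration data, not figures from a real campaign.

```r
# Hypothetical data: emails sent to each subscriber and conversion rate (%)
emails_sent <- c(1, 2, 3, 4, 5, 6)
conversion_rate <- c(1.2, 1.9, 2.4, 3.1, 3.3, 4.0)

# Deviation of each value from the mean of its variable
dev_emails <- emails_sent - mean(emails_sent)
dev_conv <- conversion_rate - mean(conversion_rate)

# Sum of the cross-product deviations, divided by N - 1
n <- length(emails_sent)
cov_by_hand <- sum(dev_emails * dev_conv) / (n - 1)

# The built-in function should agree with the manual result
cov_builtin <- cov(emails_sent, conversion_rate)
print(c(by_hand = cov_by_hand, builtin = cov_builtin))
```

A positive result here would suggest that subscribers who received more emails tended to convert at a higher rate in this toy sample.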

If the covariance value is positive, then the two variables are, to some degree, positively related. If the covariance is negative, then the two variables are, to some degree, negatively related. 

This is a good start in understanding correlation. However, there are issues with using the covariance equation. The most glaring is that it is not a standardized measurement. That means its size depends on the units the data are measured in, so covariances from different data sets cannot be compared directly.
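To see the scale problem concretely, here is a small sketch in R using made-up numbers (both vectors are hypothetical). The relationship between the variables is identical in both calls; only the unit of the conversion rate changes, from a percentage to a proportion.

```r
# Hypothetical data: emails sent and conversion rate in percent
emails_sent <- c(1, 2, 3, 4, 5, 6)
conversion_pct <- c(1.2, 1.9, 2.4, 3.1, 3.3, 4.0)

# The same conversion rates expressed as proportions (1.2% -> 0.012)
conversion_prop <- conversion_pct / 100

# Covariance changes with the unit of measurement
cov_pct <- cov(emails_sent, conversion_pct)
cov_prop <- cov(emails_sent, conversion_prop)

# The second value is 100 times smaller than the first
print(c(percent = cov_pct, proportion = cov_prop))
```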

Standardizing Covariance

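To move beyond the covariance equation, we need to standardize the covariance. The best way to do that is to use the standard deviation. If we express each deviation from the mean in standard deviation units, which amounts to dividing the covariance by the product of the two variables' standard deviations (s_x and s_y), we get a standardized measurement that we can use to compare different sets of data. This is called the correlation coefficient:

$$r = \frac{\mathrm{cov}(x, y)}{s_x s_y} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y}$$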

By using this equation (which is known as the Pearson correlation coefficient), you will have a value between -1 and 1. A -1 would indicate a perfectly negative relationship, while a 1 would indicate a perfectly positive relationship. 
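The following R sketch, again on made-up data, computes the coefficient as the covariance divided by the product of the two standard deviations and checks it against the built-in cor() function. It also shows that, unlike the covariance, r does not change when one variable is rescaled.

```r
# Hypothetical data: emails sent and conversion rate (%)
emails_sent <- c(1, 2, 3, 4, 5, 6)
conversion_rate <- c(1.2, 1.9, 2.4, 3.1, 3.3, 4.0)

# Covariance divided by the product of the standard deviations
r_by_hand <- cov(emails_sent, conversion_rate) /
  (sd(emails_sent) * sd(conversion_rate))

# Built-in Pearson correlation (Pearson is the default method)
r_builtin <- cor(emails_sent, conversion_rate)

# Rescaling a variable (percent -> proportion) leaves r unchanged
r_rescaled <- cor(emails_sent, conversion_rate / 100)

print(c(by_hand = r_by_hand, builtin = r_builtin, rescaled = r_rescaled))
```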

From here, we can test whether r differs significantly from zero by converting it into a z-score. Because the Pearson correlation coefficient does not have a normal sampling distribution, we first transform r: take one half of the natural log of the ratio (1 + r) / (1 - r).

$$z_r = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right)$$

To transform the result into a z-score, divide it by the standard error. The standard error is 1 divided by the square root of the number of samples (N) minus 3:

$$SE_{z_r} = \frac{1}{\sqrt{N - 3}}$$
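Here is a rough sketch of those two steps in R. It assumes the transformation is one half of the natural log of (1 + r) / (1 - r), which is what R's atanh() computes, and it uses a two-sided test against zero. The data are hypothetical.

```r
# Hypothetical data: emails sent and conversion rate (%)
emails_sent <- c(1, 2, 3, 4, 5, 6, 7, 8)
conversion_rate <- c(1.2, 2.4, 1.9, 3.1, 2.8, 4.0, 3.3, 4.5)

r <- cor(emails_sent, conversion_rate)
n <- length(emails_sent)

# Transform r so that its sampling distribution is approximately normal
z_r <- 0.5 * log((1 + r) / (1 - r))  # equivalent to atanh(r)

# Standard error of the transformed value
se_z_r <- 1 / sqrt(n - 3)

# z-score and two-sided p-value for the null hypothesis of no correlation
z_score <- z_r / se_z_r
p_value <- 2 * pnorm(-abs(z_score))

print(c(r = r, z = z_score, p = p_value))
```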

Another way to test r is to use the t-statistic instead of a z-score. To get the t-statistic, multiply r by the square root of N - 2, then divide the result by the square root of 1 minus r squared. The statistic has N - 2 degrees of freedom.

$$t_r = \frac{r \sqrt{N - 2}}{\sqrt{1 - r^2}}$$
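As a quick check of the t-statistic, this sketch computes it by hand on made-up data and compares it with what cor.test() reports. For the Pearson method, cor.test() uses the same t-statistic with N - 2 degrees of freedom.

```r
# Hypothetical data: emails sent and conversion rate (%)
emails_sent <- c(1, 2, 3, 4, 5, 6, 7, 8)
conversion_rate <- c(1.2, 2.4, 1.9, 3.1, 2.8, 4.0, 3.3, 4.5)

r <- cor(emails_sent, conversion_rate)
n <- length(emails_sent)

# t-statistic: r * sqrt(N - 2) / sqrt(1 - r^2)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with N - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# cor.test() runs the same test in one call
test_result <- cor.test(emails_sent, conversion_rate, method = "pearson")

print(c(t_by_hand = t_stat, t_cor_test = unname(test_result$statistic)))
print(c(p_by_hand = p_value, p_cor_test = test_result$p.value))
```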

This is a basic explanation of how to correlate data attributes. You can move beyond this by calculating confidence intervals. As a note and warning - this will tell you only that two variables are correlated, but does nothing to verify whether there is causation between the variables.

Doing it in R

There are three main functions to calculate correlation in R:

  • cor()
  • cor.test()
  • rcorr() (from the Hmisc package)

I’m just going to show cor(), but look up using the others as well.

cor() takes these arguments:

cor(x, y, use = "how missing values are handled", method = "type of correlation")

The use argument controls how missing values are handled. Common values are:

  • "everything" - which returns NA instead of a correlation coefficient whenever either variable has missing values.
  • "all.obs" - which assumes there are no missing values and returns an error if there are any.
  • "complete.obs" - which uses only the cases (rows) that have no missing values in any variable.
  • "pairwise.complete.obs" - which computes each correlation from the cases that are complete for that particular pair of variables.
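The sketch below shows how these settings behave on a small made-up example with one missing value. The vectors are hypothetical, and tryCatch() is used only so the error from "all.obs" does not stop the script.

```r
# Hypothetical data with one missing value in emails_sent
emails_sent <- c(1, 2, 3, 4, NA, 6)
conversion_rate <- c(1.2, 1.9, 2.4, 3.1, 3.3, 4.0)

# "everything" (the default) propagates the missing value and returns NA
r_everything <- cor(emails_sent, conversion_rate, use = "everything")

# "all.obs" stops with an error when anything is missing
r_all_obs <- tryCatch(
  cor(emails_sent, conversion_rate, use = "all.obs"),
  error = function(e) "error"
)

# "complete.obs" drops the incomplete case before computing r
r_complete <- cor(emails_sent, conversion_rate, use = "complete.obs")

# With only two variables, pairwise deletion drops the same case
r_pairwise <- cor(emails_sent, conversion_rate,
                  use = "pairwise.complete.obs")

print(list(everything = r_everything, all_obs = r_all_obs,
           complete = r_complete, pairwise = r_pairwise))
```

The difference between "complete.obs" and "pairwise.complete.obs" only shows up when cor() is given three or more variables, because a row that is incomplete for one pair may still be usable for another.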

Method is the type of correlation that you want to calculate:

  • pearson
  • spearman
  • kendall

So, for a data.frame, it could look like this:

cor(mydata$col1, mydata$col2, use = "all.obs", method = "pearson")
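To make that call runnable, here is a small sketch that builds a hypothetical mydata data frame (the values in col1 and col2 are invented for illustration) and tries the different method settings.

```r
# Hypothetical data frame: col1 = emails sent, col2 = conversion rate (%)
mydata <- data.frame(
  col1 = c(1, 2, 3, 4, 5, 6),
  col2 = c(1.2, 1.9, 2.4, 3.1, 3.3, 4.0)
)

# Pearson correlation between two columns
r_pearson <- cor(mydata$col1, mydata$col2,
                 use = "all.obs", method = "pearson")

# Rank-based alternatives; both equal 1 here because col2
# increases every time col1 increases
r_spearman <- cor(mydata$col1, mydata$col2, method = "spearman")
r_kendall <- cor(mydata$col1, mydata$col2, method = "kendall")

# Passing the whole data frame returns a correlation matrix
cor_matrix <- cor(mydata, method = "pearson")

print(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall))
print(cor_matrix)
```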

Hopefully this will help you think about correlation and start calculating correlations between your own variables.
