Calculating Correlation of Data Attributes
In optimizing email campaigns, it is best to segment your email list into smaller lists based on the attributes of the subscribers. The typical attributes that we collect are demographic, psychographic, and purchase behavior data, as well as email open and click-through rates. But how do we know whether these attributes correlate in any statistically significant way?
I did a blog post on using a one-way ANOVA (https://fujoanalytics.com/blog/using_one-way-_anova_for_email_list_segmentation/). Here, however, I'm going to concentrate on correlation and how to determine if two attributes are correlated.
If we want to understand whether two variables are correlated, the first thing we have to check is whether they covary. That means that when one variable deviates from its mean, we would expect a related variable to change in a similar way.
For this example, let's use the number of repeat campaign emails sent to each subscriber and the conversion rate. Say we think that the more emails we send, the higher the conversion rate will be. We look at some of the data and find that subscribers who received more emails converted at a higher rate than those who received fewer.
It appears that these two variables are related. We can be more assured if we use a calculation to quantify that relationship. Luckily, there is one - the covariance equation:

cov(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (N − 1)
This equation may look daunting to some, but it basically takes the deviations from the mean in one variable and multiplies them by the deviations in the other variable. Those products are the top part of the equation and are called the cross-product deviations. They are then added together and divided by the number of subscribers in the sample minus one (there is a good reason for the minus one - check out this video for an explanation: http://bit.ly/2FU7PFm).
If the covariance value is positive, then the two variables are, to some degree, positively related. If the covariance is negative, then the two variables are, to some degree, negatively related.
This is a good start in understanding correlation. However, there are issues with using the covariance equation. The most glaring is that it is not a standardized measurement: the scale of the covariance depends on the scale of the data, so covariances from different data sets cannot be compared directly.
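As a sketch of the calculation, here is the covariance worked out in Python. The subscriber numbers are made up purely for illustration:

```python
# Hypothetical example data: emails sent to each subscriber and that
# subscriber's conversion rate. The values are invented for illustration.
emails_sent = [2, 4, 6, 8, 10]
conversion_rate = [0.01, 0.03, 0.04, 0.06, 0.08]

def covariance(x, y):
    """Sample covariance: sum of cross-product deviations divided by N - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Cross-product deviations: deviation in x times deviation in y, pairwise.
    cross_products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
    return sum(cross_products) / (n - 1)

print(covariance(emails_sent, conversion_rate))  # positive -> positively related
```

A positive result here agrees with what we eyeballed in the data: more emails moving together with higher conversion rates.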
Standardizing Covariance
To move beyond the covariance equation, we need to standardize the covariance. The best way to do that is to use the standard deviation. If we divide the deviations from the mean by the standard deviations, we get a standardized measurement that we can use to compare different sets of data. This is called the correlation coefficient:

r = cov(x, y) / (sₓ sᵧ)
By using this equation (which is known as the Pearson correlation coefficient), you will have a value between -1 and 1. A -1 would indicate a perfectly negative relationship, while a 1 would indicate a perfectly positive relationship.
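A minimal Python sketch of this standardization (the data and function name are hypothetical, chosen for illustration):

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the two standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

# Made-up subscriber data: emails sent vs. conversion rate.
r = pearson_r([2, 4, 6, 8, 10], [0.01, 0.03, 0.04, 0.06, 0.08])
print(r)  # always lands between -1 and 1
```

Because the deviations are scaled by their standard deviations, the result is unit-free, which is what lets us compare correlations across different data sets.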
From here, we can test probabilities by transforming r into a z-score. We do that by first transforming r, because the Pearson correlation coefficient does not have a normal sampling distribution. Take 1/2 the natural log of (1 + r) divided by (1 − r):

zᵣ = ½ ln((1 + r) / (1 − r))
To turn the result into a z-score, divide it by the standard error. The standard error is 1 divided by the square root of (N − 3), where N is the number of samples.
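These two steps can be sketched in Python like this (the helper names are my own, not from any library):

```python
import math

def fisher_z(r):
    """Fisher's r-to-z transformation: 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def z_score(r, n):
    """Divide the transformed r by its standard error, 1 / sqrt(n - 3)."""
    standard_error = 1 / math.sqrt(n - 3)
    return fisher_z(r) / standard_error
```

For example, with r = 0.5 from a sample of N = 28 subscribers, z comes out to about 2.75, which is beyond the usual 1.96 cutoff for significance at the .05 level.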
Another way to test r is to use the t-statistic instead of a z-score. To get the t-statistic, multiply r by the square root of (N − 2), then divide the result by the square root of (1 − r²).
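In Python, that one-liner looks like this (again, a hypothetical helper of my own):

```python
import math

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r**2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
```

With r = 0.5 and N = 27, for instance, this gives t of roughly 2.89 on 25 degrees of freedom, which you would then look up in a t-table.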
This is a basic explanation of how to correlate data attributes. You can go beyond this by calculating confidence intervals. A note of warning - this will tell you only that two variables are correlated; it does nothing to verify whether there is causation between them.
Doing it in R
There are three main functions to calculate correlation in R:
- cor()
- cor.test()
- rcorr()
I’m just going to show cor(), but look up using the others as well.
cor() takes these arguments:
cor(x, y, use = "how missing values are handled", method = "type of correlation")
Use can have four values:
- "everything" - which will return NA instead of a correlation coefficient if there are any missing values.
- "all.obs" - which will use everything and return an error if there are any missing values.
- "complete.obs" - which will only compute cases where there are no missing values in any variable.
- "pairwise.complete.obs" - which computes the correlation between each pair of variables using only the observations where neither value is missing.
Method is the type of correlation that you want to calculate:
- "pearson"
- "spearman"
- "kendall"
So, for a data.frame, it could look like this:
cor(mydata$col1, mydata$col2, use = "all.obs", method = "pearson")
Hopefully this will help you think about correlation and get started calculating correlations between your own variables.