Super Simple Machine Learning — Simple Linear Regression Part 2 [Math and Python]
Originally posted on 16th Jan '18 here. Part 1 can be found here.
This is part of a series of articles I am writing, covering ML algorithms explained in a simple and light-hearted way for easy understanding. I may gloss over the more technical aspects and terms, as the goal here is to help myself and others understand concepts rather than just follow steps and throw out terms blindly.
However, if my explanations are fundamentally incorrect, let me know.
Now that you’ve sort of got the basic concept for Simple Linear Regression down from Part 1, let’s get to the nitty gritty.
In this post, I will go into some Python coding and the math behind it, plus touch on certain characteristics of a dataset.
LET’S BEGIN!
Fantastic Parameters/Statistics and Where to Find them
I want to get some terms out of the way first:
- Parameters
- Statistics
Big thanks to stats god Michael for correcting my misconceptions in the first version.
- Parameters: Characteristics of a POPULATION (e.g. all possible outcomes). They are most likely impossible to derive.
- Statistics: Characteristics of a SAMPLE (e.g. outcomes we can record). Statistics allow you to estimate parameters: "Inferential statistics enables you to make an educated guess about a population parameter."
Examples of characteristics you should be familiar with by now:
- Mean : average
- Median: middle value
- Variance: average of the squared difference between each x and the Mean of x. It describes how far spread out the data is. If variance is high, your 'low numbers' are low and your 'high numbers' are high; imagine an elastic band being stretched further and further apart, with the variance increasing accordingly. The lower the variance, the more 'stable' the data is as it converges to the Mean.
- Standard Deviation: the square root of Variance. It also describes how wide the spread of x is, same as Variance, BUT it is a matter of units. If you are looking at a dataset of heights (in cm), Variance will give you cm², but Standard Deviation, being its square root, gives you an answer in cm, which is sometimes better to calculate with and sits better with your OCD.
Get your head around Variance and Standard Deviation first, because you'll encounter them A LOT in statistical modelling. This explanation is pretty good… plus there's doggies!
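If it helps to see these statistics in code, here's a minimal sketch in Python. The heights are made-up numbers purely for illustration, and I'm computing the 'population' versions (dividing by n):

```python
import numpy as np

# Made-up sample of heights in cm, purely for illustration
heights = np.array([150.0, 160.0, 165.0, 170.0, 185.0])

mean = heights.mean()                      # the average
variance = ((heights - mean) ** 2).mean()  # average squared deviation from the mean (units: cm^2)
std_dev = variance ** 0.5                  # the square root brings the units back down to cm

print(mean, variance, std_dev)             # same as np.var(heights) and np.std(heights)
```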
Alright, So …Regression?
Remember in Part 1 how I spoke about trying to plot different lines to find the one with the least squared error, and how R and Python packages can just solve it for you?
Well, let’s look into what these packages are doing.
KEEP THIS IN MIND:
y = ax + b
The equations behind the Ordinary Least Squares method (finding the best-fit line) look like this:

y = ax + b

a = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

b = ȳ − ax̄

The first equation is basically the line equation.
The 2nd and 3rd equations are what you need to find a and b.
a and b are referred to as the "betas" of the linear regression, and are considered learned parameters, as they are eventually "learnt" by running the linear regression algorithm.
You can find the above equation here and the math behind it can be found here.
What’s going on in those formulae/formulas?
We are trying to minimise the Sum of Squared Errors (the squared differences between your actual and predicted values. Refer to Part 1 if unsure; are you even paying attention?!).
To do this, the partial derivatives of the SSE with respect to a and b have to be 0, because the slope is 0 at the bottom of the curve, where the error is at its least. Yada yada, you get those two equations.
Derivatives are not for me to explain, but you can do a quick revision here.
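For the curious, here is roughly how setting those partial derivatives to zero produces the two equations (this is the standard OLS derivation, so nothing here is specific to any particular dataset):

```latex
\text{SSE}(a, b) = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2

\frac{\partial\, \text{SSE}}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - a x_i - b \right) = 0
\quad \Rightarrow \quad b = \bar{y} - a \bar{x}

\frac{\partial\, \text{SSE}}{\partial a} = -2 \sum_{i=1}^{n} x_i \left( y_i - a x_i - b \right) = 0
\quad \Rightarrow \quad a = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
```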
X and Y have Cooler Hats than You
You’ll notice that in the 2nd and 3rd equation, the x and y have funny things on their heads.
x̄ = x bar
ȳ = y bar
Other than being difficult to type out properly, the bars basically mean mean.
x̄ refers to the mean (average) of x.
(x − x̄) is the difference between that value of x and the average value of all the different x values. This is referred to as the deviation score, meaning how far it deviates from the mean.
(xᵢ − x̄) looks really familiar, doesn't it? That's because it's used in calculating Variance and Standard Deviation as well.
See how useful the mean values are! This explains why parameters/statistics are so important.
Another symbol to take note of is the hat:
ŷ = y-hat
This refers to the predicted value of y from a prediction equation.
In other words, to be more correct,
y = ax + b
should be
ŷ = ax + b
And the error is basically
REAL Y - PREDICTED Y
which can be written as
y - ŷ
This is also referred to as the residual. (Remember that step in Part 1 about checking that your residuals are random and should not show a pattern?)
Anyway, because life is hard and complicated, the sum of squared errors in prediction is written as SSE,
but can also be called:
- residual sum of squares (RSS)
- sum of squared residuals (SSR)
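To make this concrete, here's a tiny sketch with made-up actual and predicted values:

```python
import numpy as np

# Made-up actual and predicted y values, purely for illustration
y_actual = np.array([5.0, 7.0, 9.0, 11.0])
y_predicted = np.array([5.5, 6.5, 9.5, 10.5])

residuals = y_actual - y_predicted  # real y minus predicted y (y - y-hat)
sse = np.sum(residuals ** 2)        # the sum of squared errors, a.k.a. RSS / SSR

print(residuals, sse)               # residuals: [-0.5  0.5 -0.5  0.5], SSE: 1.0
```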
OKAY, THE COOL CODING BEGINS HERE
Now that we’ve got all that math out of the way, let’s start coding.
We are going to start off using the Linear Regression module from the sklearn library in Python. This is similar to the one-line code I gave in Part 1 for R, in which I use something already pre-coded to find my regression line. Code can be found here. Does anyone know how to embed code into LinkedIn posts?!
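Since the original code lives behind that link, here's a minimal sketch of what it looks like with sklearn. The x and y values below are placeholders I made up, so the numbers you get will differ from the equation further down:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data; the post's actual dataset is behind the link above
x = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)  # sklearn wants a 2D array of features
y = np.array([6, 7, 8, 9, 11, 12, 14, 15])

model = LinearRegression()
model.fit(x, y)                     # this is where the least-squares magic happens

print("slope (a):", model.coef_[0])
print("intercept (b):", model.intercept_)
```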
Yay and you’re done with the modelling!
Look at that gorgeous line. Whether it is a good line or not has not been decided yet (wait for Part 3.. just wait for it), but for now it has been decided that this line has the least SSE (or RSS, or SSR).
However, since I spent a substantial amount of time going through the equations behind LinearRegression(), I want to show that it really is the math behind the Python module we just used.
The equation I will use in the following Python code is the slope formula from above, a = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)², together with b = ȳ − ax̄.
Also note that the power sign in Python is NOT "^", it's "**".
Full code can be found here
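Since the full code is also behind that link, here's a sketch of the manual version, using the same made-up placeholder data as the sklearn sketch above (note the ** for squaring):

```python
import numpy as np

# Same placeholder data as the sklearn sketch above
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([6, 7, 8, 9, 11, 12, 14, 15], dtype=float)

x_bar = x.mean()  # mean of x
y_bar = y.mean()  # mean of y

# slope: a = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar) ** 2)
a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# intercept: b = y_bar - a * x_bar
b = y_bar - a * x_bar

print("slope (a):", a)
print("intercept (b):", b)  # matches model.coef_ and model.intercept_ from sklearn
```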
Both give the exact same results, as seen below:
The regression equation is y = 1.37x + 4.27
AAAND WE’RE DONE.
Hopefully you have a better idea about how Simple Linear Regression works now :) I certainly do.
This is only the first step to Linear Regression, but feel free to try it yourself. I used this post as a guide and it’s proven to be very comprehensive, especially for the math part.
You can set up and code in Jupyter Notebook or just use a Python IDE.
In the next episode, I will be touching on evaluating the accuracy of the model and how to derive predictions from it.
STAY TUNED!
EXTRA EXTRA!! Context is important!!
In Part 1, the example I gave about ‘Tears Shed’ vs ‘Exam Score’ was, in hindsight, a bad one.
As much as statistics can prove a correlation, ALWAYS REMEMBER THAT
**correlation is not causation**
Perhaps the number of tears shed really did affect the score, but it could also have all just been a coincidence, with the two not directly related at all.
The correlation between number of tears shed and the exam score could have been a ….
*drum roll*
SPURIOUS CORRELATION! Check out the super fun spurious correlations collected by Tyler Vigen here.
This is where business knowledge and common sense come in. What’s relevant and what isn’t is not solely defined by a program.
Feature selection is very much a job for both humans and computers.
We’ve reached the end of PART 2! Thank you for sticking around so far. Keep those gorgeous eyes out for Part 3 which I will post tomorrow, and remember to let me know if you spot any errors.