DAY 5: MACHINE LEARNING
LINEAR REGRESSION:
Linear Regression models the continuous relationship between a dependent (predicted) variable and an independent (predictor) variable.
The formula behind Linear Regression is y = b + cx, where y is the dependent variable, x is the independent variable, b is the intercept, and c is the slope (coefficient).
In this article, you will see the complete architecture of Linear Regression.
- The figure alongside shows the dataset of Salary and Years of Experience.
- The dataset has two columns: YearsExperience and Salary.
- The .head() function prints the top five observations. If you would like to print the last ten observations, use .tail(10).
- The .info() function prints complete information about the dataset (column types, non-null counts, memory usage).
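A minimal sketch of the inspection steps above. The original CSV is not included here, so an equivalent 30-row salary dataset is built by hand; the column names match the article, but the numbers are illustrative, not the article's data.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the article's Salary dataset (30 rows).
rng = np.random.default_rng(0)
years = np.round(np.linspace(1.1, 10.5, 30), 1)
salary = 25000 + 9500 * years + rng.normal(0, 3000, 30)

dataset = pd.DataFrame({"YearsExperience": years, "Salary": salary})

print(dataset.head())      # top five observations
print(dataset.tail(10))    # last ten observations
dataset.info()             # column dtypes, non-null counts, memory usage
```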
- For predicting in machine learning you need to separate the dataset into x and y variables, as you can see in the figure alongside.
- x contains the YearsExperience column, whereas y contains the Salary column.
- Now we have to reshape the values in variable x. This is necessary because sklearn expects the feature values to be 2-D when predicting. As there are 30 rows in x, use .reshape(30, 1) to make it 2-D (or .reshape(-1, 1) to let NumPy infer the row count). X then holds the same data, but its dimension is changed.
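The feature/target separation and the reshape step can be sketched as follows; the dataset here is illustrative stand-in data, not the article's actual CSV.

```python
import numpy as np
import pandas as pd

# Stand-in data with the same column names as the article.
years = np.round(np.linspace(1.1, 10.5, 30), 1)
dataset = pd.DataFrame({
    "YearsExperience": years,
    "Salary": 25000 + 9500 * years,
})

x = dataset["YearsExperience"].values   # shape (30,)  -- 1-D
y = dataset["Salary"].values            # shape (30,)

X = x.reshape(30, 1)                    # shape (30, 1) -- 2-D, what sklearn expects
# X = x.reshape(-1, 1) works too; -1 lets NumPy infer the row count.
print(x.shape, X.shape)
```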
- After filling your data into X and y, you need to split the same data into a training set and a testing set. We train our model on the training data, then predict on the testing data to see whether the model is predicting the right values or not. To do so, we use the train_test_split function, which is available in the model_selection module of the sklearn library.
- X_train, y_train :- training set; X_test, y_test :- testing set
- train_test_split takes X, y, and test_size (the fraction of observations to keep in the testing part; here I chose 30% of the dataset for testing, so 70% automatically goes to training). random_state fixes the seed of the random shuffle so the same train/test split is reproduced every run; without it, the split changes each time you run the code.
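The 70/30 split above can be sketched like this; the data is a stand-in for the salary dataset, and random_state=0 is an arbitrary seed choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 30 rows, like the article's dataset.
X = np.linspace(1.1, 10.5, 30).reshape(-1, 1)
y = 25000 + 9500 * X.ravel()

# 30% of 30 rows -> 9 testing rows, 21 training rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(len(X_train), len(X_test))   # 21 9
```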
- As our target is continuous, we use the LinearRegression class from the linear_model module of the sklearn library, which lets us build a predictive model for continuous data.
- After that, we have to fit the training data so the model learns the patterns in the data. model.fit() takes the two parts of the training data, i.e. X_train and y_train.
- As the fitting part is completed and our model is trained on the dataset we provided, we move further to the predicting part.
- Our model was trained on seen data, but here in the predicting part the model predicts on data it hasn't seen. As you can see in the above figure, we predicted on the test data in X_test. Now we compare the results in y_pred with y_test. The salary of the 27th observation in y_test is 112635.0, whereas the model predicted 115573.62288352, which is close to 112635.0.
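The fit-then-predict steps above can be sketched end to end. Synthetic noisy linear data stands in for the salary dataset, so the printed numbers will differ from the article's 112635.0 / 115573.62288352 example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in noisy linear data: salary ~ 25000 + 9500 * years + noise.
rng = np.random.default_rng(42)
X = np.linspace(1.1, 10.5, 30).reshape(-1, 1)
y = 25000 + 9500 * X.ravel() + rng.normal(0, 2000, 30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)        # learn b and c from the training data
y_pred = model.predict(X_test)     # predict salaries for unseen experience values

# Compare a few predicted values against the actual test values.
for actual, predicted in zip(y_test[:3], y_pred[:3]):
    print(f"actual {actual:,.1f}  predicted {predicted:,.1f}")
```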
- Here are the complete actual and predicted values.
- Since the predicted values are close to the actual ones, we can say that our model has achieved good accuracy. The important thing behind this is the values of c and b in the formula y = b + cx. During training, the model keeps adjusting c and b; once fitting is done, those learned values stay fixed and are used for every prediction. By visualizing the actual and predicted values you will understand the difference.
- While fitting, the model evaluates the values of c and b and generates an output; if the output is too far from the actual value, it adjusts c and b again until the generated values are close to the actual values.
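The learned b (intercept) and c (slope) can be read off a fitted model directly; they are fixed once .fit() returns. Noise-free stand-in data is used here so the recovered values match the construction exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Exact linear data: b = 25000, c = 9500 by construction (no noise).
X = np.arange(1, 31).reshape(-1, 1)
y = 25000 + 9500 * X.ravel()

model = LinearRegression().fit(X, y)
print("b (intercept):", model.intercept_)   # ~25000.0
print("c (slope):    ", model.coef_[0])     # ~9500.0
```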
- The blue data points are the actual values, whereas the orange data points are the predicted ones.
- The difference between the actual value (blue) and the predicted value (orange) is known as the Error, Loss, or Residual.
- plt.scatter(X_test, y_test) plots the actual test values, and plt.scatter(X_test, y_pred) plots the predicted values.
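A sketch of the two scatter calls above, using illustrative stand-in data and the non-interactive Agg backend so it runs headless; the file name is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")            # render without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in noisy linear data.
rng = np.random.default_rng(1)
X = np.linspace(1.1, 10.5, 30).reshape(-1, 1)
y = 25000 + 9500 * X.ravel() + rng.normal(0, 2000, 30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

plt.scatter(X_test, y_test, label="Actual")      # blue points (default color cycle)
plt.scatter(X_test, y_pred, label="Predicted")   # orange points
plt.xlabel("YearsExperience")
plt.ylabel("Salary")
plt.legend()
plt.savefig("actual_vs_predicted.png")
```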
- In mathematics, the actual value is denoted by y and the predicted value by y_hat. So, Error (residual) = y - y_hat.
- In the figure alongside, we have plotted the line passing through the predicted values only. This line is known as the regressor line or best-fit line.
- To check the error, sklearn provides different functions.
- We can calculate the errors in three common ways:
We know Error = y - y_hat. Below, n is the number of observations.
Mean Squared Error (MSE): square each error (so negative errors become positive), then take the mean.
Error ---> (Error)^2 ---> +ve ---> Sum / n ---> MSE
Mean Absolute Error (MAE): take the absolute (mod) value of each error, then take the mean.
Error ---> | Error | ---> +ve ---> Sum / n ---> MAE
Root Mean Squared Error (RMSE): square each error, take the mean, and then take the square root.
Error ---> (Error)^2 ---> +ve ---> Sum / n ---> root(MSE) ---> RMSE
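The three error measures above can be computed with sklearn.metrics. A tiny hand-made example keeps the arithmetic checkable: the two errors are -2 and 2.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 20.0])
y_pred = np.array([12.0, 18.0])   # errors: -2 and +2

mae = mean_absolute_error(y_true, y_pred)   # (|-2| + |2|) / 2 = 2.0
mse = mean_squared_error(y_true, y_pred)    # ((-2)^2 + 2^2) / 2 = 4.0
rmse = np.sqrt(mse)                         # sqrt(4.0) = 2.0

print(mae, mse, rmse)   # 2.0 4.0 2.0
```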
Note: The lower the error, the better the model's performance. We don't have to optimize the error ourselves; the model does that automatically during fitting. We can only check the errors.