Building Predictive Model Using Machine Learning

What if you could predict the future? For instance, what if you could predict which marketing offer is most likely to convince customers to open an email and sign up for your product? Or which of your customers are most likely to take their business elsewhere? You can reach this level of insight through predictive analytics.

Forbes magazine[i] reports that data mining and predictive analytics have helped identify patients at the greatest risk of developing congestive heart failure. IBM collected three years of data on 350,000 patients, including measurements of over 200 factors such as blood pressure, weight, and drugs prescribed. Using predictive analytics, IBM was able to identify the 8,500 patients most at risk of dying of congestive heart failure within one year.

The McKinsey Global Institute (MGI)[ii] reports that most American companies with more than 1000 employees had an average of at least 200 TB of stored data. MGI projects that the amount of data generated worldwide will increase by 40% annually, creating profitable opportunities for companies to leverage their data to reduce costs and increase their bottom line.

Consider shopping on Amazon: each item you add to your cart becomes a new row in a database, a new “observation” in the information being collected about your shopping habits. The next time you open the Amazon app, you see a list of recommended items based on your past shopping behavior.

A lot of data is being collected. But what is being learned from all of it? What knowledge can be gained from all this information? This is where data mining and predictive analytics play an important role. So, what are data mining and predictive analytics? Data mining is the process of discovering useful patterns and trends in large data sets. Predictive analytics is the process of extracting information from large data sets to make predictions and estimates about future outcomes.

There are three main types of predictive models — decision trees, regression, and neural networks. Decision trees use a tree-shaped diagram to chart the possible outcomes of different courses of action, including how one choice leads to others. Regression techniques use statistics to help users understand the relationships between different variables. Neural networks are complex algorithms designed to mimic the way the human mind works, and by doing so, identify nonlinear relationships in data.

In this article, we will learn how to use regression techniques to build a predictive model in R.

What is Multiple Regression?

Multiple regression modeling provides an elegant method of describing the relationships between a target variable and its predictors. Compared to simple linear regression, multiple regression models provide improved precision for estimation and prediction, much as regression estimates improve on univariate estimates. A multiple regression model uses a linear surface to approximate the relationship between a continuous response variable and a set of predictor variables. While the predictor variables are typically continuous, categorical predictor variables may be included as well, using indicator (dummy) variables.
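
Before turning to the real dataset, here is a minimal, self-contained sketch (the data frame df and its values are purely illustrative) of how R expands a categorical predictor into indicator variables:

R Code:

# R encodes a factor as indicator (dummy) variables behind the scenes;
# model.matrix() shows the expansion explicitly
df <- data.frame(fueltype = factor(c("gas", "diesel", "gas")))
model.matrix(~ fueltype, data = df)
# The fueltypegas column is 1 for gas rows and 0 for diesel rows;
# diesel is absorbed into the intercept as the reference level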

We have a dataset on car prices and the different variables that impact price. Price is the response, or dependent variable, and all other factors are independent variables. There are 26 variables: price as the response, and car_ID, symboling, CarName, fueltype, aspiration, doornumber, carbody, drivewheel, enginelocation, wheelbase, carlength, carwidth, carheight, curbweight, enginetype, cylindernumber, enginesize, fuelsystem, boreratio, stroke, compressionratio, horsepower, peakrpm, citympg, and highwaympg as predictor variables. We will use the following variables for our analysis (a sketch for loading and subsetting the data follows the list).

Price: Response variable ($)

Fueltype: Qualitative variable

Carbody: Qualitative variable

Carlength: Quantitative variable

Enginesize: Quantitative variable

Horsepower: Quantitative variable

Peakrpm: Quantitative variable

Highwaympg: Quantitative variable
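
As a brief sketch of the setup (assuming the data lives in a CSV file; the file name CarPrice.csv is illustrative), the dataset can be loaded and restricted to these eight columns as follows:

R Code:

# Load the car price data and keep only the variables used in this analysis
Cars <- read.csv("CarPrice.csv")  # file name is illustrative
Cars <- Cars[, c("price", "fueltype", "carbody", "carlength",
                 "enginesize", "horsepower", "peakrpm", "highwaympg")]
# Treat the qualitative variables as factors so lm() creates dummy variables
Cars$fueltype <- factor(Cars$fueltype)
Cars$carbody  <- factor(Cars$carbody)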

Data Exploration - As part of data exploration, we will run a few tests and basic analyses to visualize the dataset and get a preliminary interpretation of the response and the factors.

To perform the preliminary analysis, we will create boxplots, correlations, basic plots, and a model summary. I will take an example of each of the above, starting with a quick look at the data itself (see the sketch below).
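
As a first pass (a quick sketch using the Cars data frame loaded above), R's built-in str() and summary() functions show the variable types and basic descriptive statistics:

R Code:

str(Cars)      # structure: the type of each variable
summary(Cars)  # min, quartiles, mean, and max for the quantitative variables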

1. Plotting the response against a predictor variable – This plot will help us establish a basic visualization of the relationship between the two variables. It is a scatter plot that shows a positive, negative, or no relationship between the two variables.

R Code:

library(ggplot2)
library(car)

# Plotting Price against Car Length
ggplot(data=Cars, aes(x=carlength, y=price)) +
  geom_point(alpha=I(0.2), color='blue') +
  xlab('Car Length') +
  ylab('Price') +
  ggtitle('Price vs. Car Length') +
  geom_smooth(method="lm", color='gray', se=FALSE)

[Image: scatter plot of Price vs. Car Length with a fitted regression line]

Based on the graph above, there appears to be a positive and linear relationship between the price of the car and the length of the car. Let’s test the correlation between the two variables.

2. Correlation – A correlation coefficient is a way to put a value on the relationship. Correlation coefficients have a value between -1 and 1. A value of 0 means there is no linear relationship between the variables at all, while -1 or 1 means there is a perfect negative or positive correlation.

R Code:

cor(Cars$carlength,Cars$price)

0.68292

Based on the correlation, price and car length have a moderately strong positive linear relationship. We formulated the hypothesis that price and length have a positive relationship based on the graph above, and the correlation further strengthens that hypothesis.
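
To go beyond the point estimate, a short sketch (an addition to the original analysis) uses R's built-in cor.test() to test whether the true correlation differs from zero:

R Code:

# Test H0: true correlation = 0, and report a 95% confidence interval
cor.test(Cars$carlength, Cars$price)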

3. Boxplot – A boxplot, also called a box-and-whisker plot, is a way to show the spread and center of a data set. Measures of spread include the interquartile range and the range; measures of center include the mean and median.

Let us review the boxplot of price and fueltype of the car.

R Code:

boxplot(price~fueltype,
        main="BoxPlot of Price & fueltype",
        xlab="fueltype",
        ylab="price",
        col=blues9,
        data=Cars)

[Image: boxplot of Price by fueltype]

The boxplot suggests that price differs between fuel types, so there does appear to be a relationship between price and fueltype. Prices in the gas category appear noticeably lower than in the diesel category, though a formal test is needed to confirm the difference.
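
As a sketch of such a formal test (an addition beyond the original analysis), a two-sample t-test compares mean price across the two fuel types:

R Code:

# Welch two-sample t-test of mean price for gas vs. diesel cars
t.test(price ~ fueltype, data = Cars)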

4. Performing Multiple Regression

R Code:

# Running multiple regression; Cars holds only price and the seven predictors,
# so the '.' in the formula expands to exactly those variables
model1 = lm(price ~ ., data = Cars)
summary(model1)

Output:

Call:
lm(formula = price ~ ., data = Cars)

Residuals:
    Min      1Q  Median      3Q     Max
-9012.6 -1848.1   -48.1  1658.0 13011.4

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)      -2.235e+04  9.325e+03  -2.397 0.017502 *
fueltypegas      -3.810e+03  9.596e+02  -3.970 0.000101 ***
carbodyhardtop   -2.904e+03  1.790e+03  -1.622 0.106401
carbodyhatchback -5.128e+03  1.436e+03  -3.571 0.000449 ***
carbodysedan     -4.305e+03  1.477e+03  -2.914 0.003985 **
carbodywagon     -5.504e+03  1.618e+03  -3.402 0.000811 ***
carlength         9.564e+01  3.915e+01   2.443 0.015471 *
enginesize        1.032e+02  1.277e+01   8.082 6.69e-14 ***
horsepower        4.703e+01  1.383e+01   3.400 0.000818 ***
peakrpm           2.126e+00  6.605e-01   3.218 0.001512 **
highwaympg       -6.235e+01  6.868e+01  -0.908 0.365114
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3272 on 194 degrees of freedom
Multiple R-squared: 0.8405, Adjusted R-squared: 0.8323
F-statistic: 102.2 on 10 and 194 DF, p-value: < 2.2e-16

We can conduct an F-test to assess the overall adequacy of the model.

The regression equation is y = β0 + β1x1 + ... + β10x10 + ε.

The null hypothesis is H0: β1 = ... = β10 = 0.

The alternative hypothesis is Ha: at least one of the slope coefficients is nonzero.

Test statistic: F = 102.2, p-value < 2.2e-16.

We can conclude: since the observed significance level (p < 2.2e-16) is below α = 0.05, we reject the null hypothesis. The data provide strong evidence that at least one of the slope coefficients is nonzero, so the overall model appears to be statistically useful in predicting price. We can also eliminate insignificant predictors with a p-value > 0.05 that do not have a statistically significant impact on the response and rerun the model. Based on the output, those are highwaympg (p = 0.365) and the carbodyhardtop coefficient (p = 0.106); note that carbodyhardtop is a single level of the carbody factor, so we would typically re-level or drop the whole factor rather than one level. A sketch of the rerun follows.
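
As a minimal sketch of that rerun (the name model2 is introduced here for illustration), we can drop highwaympg, the clearly insignificant quantitative predictor, and refit:

R Code:

# Refit the model without the insignificant highwaympg predictor
model2 = update(model1, . ~ . - highwaympg)
summary(model2)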

To conclude, while predictive analytics can never produce conclusions that are 100% accurate, it generally produces reliable forecasts that can improve business outcomes. A Forbes Insights[iii] report found that 86% of executives who used predictive marketing for at least two years reported an increased return on investment. The report found that one of the primary benefits of predictive marketing was that it enabled a much greater degree of focus, including “the ability to better identify market opportunities, better ad targeting, improved nurture programs, and more targeted accounts.”

By embracing predictive analytics across the organization, businesses will be able to unlock even more of the value hidden in their raw data.

[i] IBM and Epic Apply Predictive Analytics to Electronic Health Records, by Zina Moukheiber, Forbes magazine, February 19, 2014.

[ii] Big data: The next frontier for innovation, competition, and productivity, by James Manyika et al., McKinsey Global Institute, www.mckinsey.com, May 2011. Last accessed March 16, 2014.

[iii] https://www.salesforce.com/blog/2019/07/what-is-predictive-analytics.html


