How to do a predictive model without fancy AI techniques

OK, hands up if you are a worker bee in your company’s data analytics department, but you don’t have any statistical package installed on your computer, or you are not allowed to use any open source software, or worse yet, all you have available is Excel. How are you supposed to make a kick-ass predictive model that will save your company millions of dollars and earn yourself a promotion to the C-suite with a corner office? Well, I don’t know about that last part (that’s a topic for another article), but it is entirely possible to build a model from scratch with nary a Python, Pig, or Pandas in sight.

How do I know this? Because I have actually built a few at work over the years, and not once did I have to deal with any of the following buzzwords: deep neural networks, random forests, cluster analysis, ensemble models, TensorFlow, Kubernetes, and so on and so forth.

The two basic steps are as follows: (1) find some variables that individually seem to correlate well with the thing that you want to predict, and (2) try out different combinations of them all in one model until you get one that has good model quality.

For Step 1, if you have absolutely no idea what kind of explanatory variables will work, look up some journal articles to see what other people in your industry have used for a similar business problem. Once you have your list of factors to try, use your tool of choice (there are a ton of free web-based utilities where you just plug in the data and click a button) to test whether each one has a statistically significant correlation with your dependent variable. That means it has the positive or negative sign that matches your intuition, a t-stat of at least 2 in absolute value, and a p-value under 0.05.

These are not hard and fast rules. Maybe you found a brilliant variable no one has ever thought to use before, but the t-stat is only 1.8 or the p-value is 0.11. Are you going to just throw it away? Hell no! The key is you want to have a bunch of promising factors to try out in Step 2.
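You don't need code for any of this, but if you do happen to have Python available, here is a rough sketch of the Step 1 screen. It computes the correlation and its t-stat for one candidate factor and checks the sign against your intuition. The function name `screen_variable`, the variable names, and the numbers are all made up for illustration, and the p-value is left to whatever tool you are using.

```python
import math

def screen_variable(x, y, expected_sign):
    """Screen one candidate factor against the dependent variable.

    expected_sign is +1 or -1: the direction your intuition says the
    relationship should go. Assumes neither x nor y is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation

    # t-stat for a simple correlation, with n - 2 degrees of freedom
    if abs(r) >= 1.0:
        t_stat = math.copysign(math.inf, r)
    else:
        t_stat = r * math.sqrt((n - 2) / (1 - r * r))

    sign_ok = (r > 0) == (expected_sign > 0)
    return {
        "r": r,
        "t_stat": t_stat,
        "sign_ok": sign_ok,
        # The cutoff of 2 is the rule of thumb from the text, not a law
        "passes": sign_ok and abs(t_stat) >= 2,
    }

# Made-up data for illustration only
ad_spend = [1, 2, 3, 4, 5]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]
noise = [3, 1, 4, 1, 5]

strong = screen_variable(ad_spend, sales, expected_sign=+1)
weak = screen_variable(ad_spend, noise, expected_sign=+1)
print(strong)
print(weak)
```

Treat `passes` as a flag, not a verdict: a factor that narrowly misses the cutoff can still earn a spot on your Step 2 shortlist.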

To actually construct the model (i.e. the dependent variable is some function of the independent variables), as I said, you don’t need any complex artificial intelligence methods. You want to know what’s best at machine learning? Your brain is the best machine! That’s why I say you just need the Analysis ToolPak in Excel or some other widely available GUI tool where you can copy/paste, drag and drop, and check off boxes to have the software estimate the model and calculate all the goodness-of-fit statistics for you, without you having to write a single line of code.
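For the curious, here is roughly what the GUI tool is doing behind the button, sketched in plain Python. It assumes the model is an ordinary least squares linear regression, which is my assumption about what "some function" means here, and it uses R-squared and adjusted R-squared as the goodness-of-fit statistics. The function name `fit_ols` and the data are hypothetical.

```python
def fit_ols(columns, y):
    """Fit y = b0 + b1*x1 + ... by ordinary least squares.

    columns is a list of explanatory variables, each a list as long as y.
    Returns (coefficients, r_squared, adjusted_r_squared).
    """
    n, k = len(y), len(columns)
    # Design matrix with a leading column of 1s for the intercept
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = k + 1

    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]

    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        pivot = max(range(c, p), key=lambda i: abs(aug[i][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for i in range(p):
            if i != c:
                f = aug[i][c] / aug[c][c]
                aug[i] = [a - f * b for a, b in zip(aug[i], aug[c])]
    coefs = [aug[i][p] / aug[i][i] for i in range(p)]

    # Goodness of fit
    fitted = [sum(b * v for b, v in zip(coefs, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return coefs, r2, adj_r2

# Made-up data: y was built as 1 + 2*x1 + 3*x2
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [9, 8, 19, 18, 29, 28]

# Try two combinations and compare their fit
coefs_one, r2_one, adj_one = fit_ols([x1], y)
coefs_two, r2_two, adj_two = fit_ols([x1, x2], y)
print("x1 only:   adjusted R^2 =", round(adj_one, 3))
print("x1 and x2: adjusted R^2 =", round(adj_two, 3))
```

Adjusted R-squared is used for the comparison because plain R-squared never goes down when you add a variable, so it would always favor the bigger model.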

When you put together different combinations by eye instead of having the computer cycle through every possible permutation, you naturally avoid the dreaded Multicollinearity Problem by only combining factors that are conceptually and mathematically different from each other. You already have a variable called BMI in your model? Then don’t also include a variable called Weight, because BMI is calculated from weight in the first place. It’s just common sense, that thing machines DON’T have.
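If you want a number to back up the common sense, a quick pairwise correlation will do it. The sketch below uses invented heights and weights, computes BMI as weight in kilograms divided by height in meters squared, and flags pairs whose correlation is above 0.8 in absolute value. The 0.8 cutoff and the helper name `too_similar` are my own assumptions, not a standard.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def too_similar(a, b, threshold=0.8):
    """Flag a pair of factors as redundant (threshold is an assumption)."""
    return abs(pearson(a, b)) > threshold

# Invented data for illustration only
height_m = [1.60, 1.65, 1.70, 1.75, 1.80, 1.85]
weight_kg = [55, 72, 60, 90, 70, 100]
age = [30, 52, 41, 38, 60, 25]
bmi = [w / h ** 2 for w, h in zip(weight_kg, height_m)]

print("BMI vs weight too similar?", too_similar(bmi, weight_kg))
print("Age vs weight too similar?", too_similar(age, weight_kg))
```

A pairwise check only catches two-variable overlap. It is a sanity check on your judgment, not a replacement for it.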

Article originally published on Medium.com

I was looking for a mention of "data cleaning" in this article. That's also something you can do without fancy tools. All one needs is a vigilant understanding of each variable in the data set and common-sense identification of erroneous values.
