Why machine learning is better on a slower computer
[Image: waiting wheel. Credit: https://linux.m2osw.com/designing-rotating-wheel-ask-users-wait-while-working-background]

When you think about machine learning, what comes to your mind? Most likely, smart data scientists, statistics and mathematics, large and complicated datasets, programming, and of course powerful computer hardware.

As you no doubt know, over the past decades, computing hardware has become much more powerful, affordable and hence accessible. It would be dishonest to say that these hardware improvements have not helped with machine learning. They indeed have, but you do not need the latest super-duper hardware to do the job. In fact, I would say, there are some unexpected advantages in using slower hardware.

Not convinced? Allow me to explain.

Here are three ideas on how to productively use your waiting time while your “slow” hardware crunches the numbers:

  • Engage more
  • Study more
  • Think more

Have you ever had this experience: the first time you heard a problem, you thought “this is fairly easy to model”, but once you started digging deeper into the data, it became clear that the problem was far from simple?

Source: https://m.xkcd.com/1831/

Several things may not have gone the way you expected when you started on the problem. First, the data did not turn out to be as clean as you had hoped: bits and pieces were missing, and you worried whether this would hurt your ability to model. Then you had to impute some missing values or remove outliers, but were not confident about a good way to do so. Perhaps you did overcome the data issues, only to find on modeling that your cool “insights” turned out to be common knowledge in the domain, or could be attributed to data leakage. And finally, you thought you had understood the business problem, but on getting feedback on your results, you realized that you had misunderstood what kind of prediction would actually help!
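
The mechanics of that cleanup are the easy part; choosing a sensible method is the judgment call. As a minimal sketch, assuming a pandas DataFrame with a hypothetical numeric column, median imputation plus IQR-based outlier flagging might look like this in Python:

import pandas as pd

def clean_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    out = df.copy()
    # Impute missing values with the median, which is robust to skew and outliers.
    out[col] = out[col].fillna(out[col].median())
    # Flag (rather than silently drop) values outside 1.5 * IQR, so that
    # domain experts can review whether they are errors or real events.
    q1, q3 = out[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    out[col + "_is_outlier"] = (out[col] < q1 - 1.5 * iqr) | (out[col] > q3 + 1.5 * iqr)
    return out

Whether the median is the right fill value, and whether a flagged “outlier” is noise or signal, is exactly the kind of question only domain knowledge can answer.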

Mathematics, programming, fancy algorithms, and cutting-edge hardware won’t help you with any of these challenges. What you need is judgement informed by deep domain experience, which you can get either from domain experts or by educating yourself on the domain.

Engage problem owners, and study your data and domain more

Engage: You certainly need to engage the problem owner to get your problem definition right. You also need to engage the problem owner, and of course domain experts, to get the right context for the problem you are trying to solve: How is the data generated? What data is missing, and why? Why do they want a prediction? What kind of prediction will be helpful? Finally, beyond problem definition and context, you can learn a great deal about the domain from them by playing back your observations and results and getting their feedback on your data-driven insights.

Real-life example: In a manufacturing quality troubleshooting project that we did for a pharma company, we spotted a strange correlation between the day of the week and quality issues. Feeling pleased with the results, we felt we had found at least one important but unexpected root cause of the problem. We imagined that the correlation could be explained by a specific set of operators working on those days, or by the equipment-cleaning routine being tied to those days. After we shared the results, the client went back, studied logbooks and other documents, and reported that what we had seen was pure coincidence: there was no real link with the factors we had imagined. In the same project, we had initially received operations data for only the final step of manufacturing. On exploring the data, we saw hints of a problem with an intermediate product that was sourced from other plants. Together with the client, we traced this problem all the way back to a raw material used in the intermediate’s plant. The real data-driven insight emerged only after we investigated the operations data from those plants. This kind of detective work is impossible without deep engagement with your problem owners.
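
In hindsight, a quick statistical sanity check could have tempered our excitement before we presented the “finding”. Here is a minimal permutation-test sketch in Python, assuming a hypothetical DataFrame of batches with weekday and defect columns:

import numpy as np
import pandas as pd

def weekday_effect(df: pd.DataFrame) -> float:
    # Spread between the best and worst weekday defect rates.
    rates = df.groupby("weekday")["defect"].mean()
    return rates.max() - rates.min()

def permutation_pvalue(batches: pd.DataFrame, n_perm: int = 1000) -> float:
    observed = weekday_effect(batches)
    rng = np.random.default_rng(0)
    shuffled = batches.copy()
    hits = 0
    for _ in range(n_perm):
        # Shuffle the defect labels: any weekday pattern that remains is pure chance.
        shuffled["defect"] = rng.permutation(batches["defect"].to_numpy())
        if weekday_effect(shuffled) >= observed:
            hits += 1
    return hits / n_perm  # a large p-value suggests the pattern is coincidence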

So, as your “slow” hardware works on the numbers, grab a cup of your favorite hot beverage and engage your colleagues to build a deeper and better understanding of the data quirks and the problem to solve.

Study: Another way to inject some domain experience is to learn about the domain yourself through the literature. Google is your friend. I do not mean going to Stack Overflow for your code questions (important, no doubt), but browsing the web to read up on the business context. Your company’s internal documentation can also help here. Learning a bit more about the domain, in addition to your data, will allow you to form an independent perspective, which is usually very helpful: it prevents you from getting trapped in the same biases that domain experts may have, and it may spark new ideas that even your colleagues have not thought of.

Source: https://me.me/i/everything-we-hear-is-an-opinion-not-a-fact-everything-8475453

Real-life example: In a yield troubleshooting project that we did for a chemical company, the objective was to figure out why the yield of some production batches of a very expensive specialty chemical was low. We had quite a bit of sensor data available for every batch. Our initial attempts to model yield produced nothing, until we discovered in the data that, contrary to what the problem owners believed, the operators were using multiple recipes across batches to make the chemical. Most importantly, the operators were simply using judgment to decide when to stop the process, so the yield was effectively the result of a near-random operator decision. To discover this insight, we had to study the standard operating procedure of the process and compare it with what we observed in the data.
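
One simple way to hunt for such hidden “multiple recipes” is to cluster the per-batch setpoints and see whether more than one distinct group emerges. A sketch with scikit-learn, assuming a hypothetical feature matrix X with one row of setpoints per batch:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def detect_recipes(X: np.ndarray, max_k: int = 5) -> np.ndarray:
    # Scale features so that no single setpoint dominates the distances.
    X_scaled = StandardScaler().fit_transform(X)
    best_labels, best_score = np.zeros(len(X), dtype=int), -1.0
    for k in range(2, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
        score = silhouette_score(X_scaled, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    # Well-separated clusters hint that operators ran more than one recipe.
    return best_labels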

So, as your “slow” hardware works on the numbers, look for ways to learn more about the data and the context by reading up on the open and proprietary literature on the domain.

Think deeper about the features and test them out

After you have learned more about the problem and acquired some domain knowledge, it is time to apply it. You have probably heard the mantra “location, location, location” when it comes to property prices.

My mantra to succeed with machine learning is “features, features, features”.

You may or may not be able to create a better model by tweaking your machine learning algorithm or using a different algorithm, but you will be able to create a much better model if you can create better features. By a better model, I mean not just a more accurate model but, more importantly, a simpler and more robust one.

Machine learning algorithms are of course different from each other, but they are not as different as some may believe. For example, look at the decision boundaries of different classification algorithms for a two-class XOR classification problem with just two features, x and y. The classification performance of SVM, neural network, decision tree, and random forest algorithms is only slightly different.

Source: https://www.kdnuggets.com/2015/06/decision-boundaries-deep-learning-machine-learning-classifiers.html

If you are aiming for a top position in a Kaggle contest, these differences could be relevant, but for many real-life applications, such small differences in model performance are insignificant. The solution to the XOR problem above is dramatically simplified if one creates a new feature x*y out of the two features x and y. With this engineered feature, the two classes can be separated by a simple linear classifier.
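
You can verify this claim in a few lines. A runnable sketch with scikit-learn on synthetic XOR-style data (the dataset is made up purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # XOR-like labels: class 1 when x and y share a sign

raw = LogisticRegression(max_iter=1000).fit(X, y)
X_eng = np.column_stack([X, X[:, 0] * X[:, 1]])  # append the engineered feature x*y
eng = LogisticRegression(max_iter=1000).fit(X_eng, y)

print(f"accuracy with raw x, y:  {raw.score(X, y):.2f}")      # around 0.5, i.e. chance level
print(f"accuracy with x*y added: {eng.score(X_eng, y):.2f}")  # close to 1.0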

How does one create better features for more complex real-life problems? Once you get the hang of the problem and the domain, feature selection and engineering will come much more easily to you. For example, if you discover that the target variable z is likely to depend on x/y rather than on x and y by themselves, you will get better results by modeling with the engineered feature x/y.
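
As a small illustration of that last point, here is a synthetic experiment (the data and the dependence z = 3·x/y are made up): a linear model given x and y separately struggles, while the same model given the engineered ratio fits almost perfectly.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 1000)
y = rng.uniform(1, 10, 1000)
z = 3.0 * (x / y) + rng.normal(0, 0.1, 1000)  # target truly depends on the ratio x/y

raw = LinearRegression().fit(np.column_stack([x, y]), z)
eng = LinearRegression().fit((x / y).reshape(-1, 1), z)

print(f"R^2 with raw x and y: {raw.score(np.column_stack([x, y]), z):.2f}")
print(f"R^2 with x/y feature: {eng.score((x / y).reshape(-1, 1), z):.2f}")  # near 1.00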

So, as your “slow” hardware crunches the numbers, you can invest your time in thinking about and testing better features to solve your problem.

In summary, there are tons of useful things to do while you let your "slow" computer do its job. So what are you waiting for? …go ahead…

Infuse some EQ and IQ into your AI… and reap the benefits.


***

Vidya is the founder of Decodexis consulting services, a management consultant, and a data science practitioner. He loves to help B2B industry clients solve their problems using data analytics. He has deep experience with analytics at both the strategic and number-crunching levels, across R&D, operations, and commercial functions.


