An Intuitive Approach to Ensemble Learning
Introduction
Machine learning is fast emerging as an automated way of dealing with data. Every day, new information is generated in every sphere of our lives. Big data, as it is commonly called, is characterized by its sheer volume, the velocity at which it arrives, and the variety of forms it takes.
Considering that information is generated on such an enormous scale, it can be very challenging to design a single machine learning model that encapsulates all of these traits, approximates the real world as best it can, and generates predictions. And data is for the most part dynamic: stock market prices, the age, brand, and quantity of items sold, and so on. So it is a bit of a stretch to assume that one machine learning model can do all the heavy lifting and keep up with ever-changing trends. In a typical workflow, a single model is trained on historical data and then used to predict outcomes for new, unseen data.
While in most cases a single machine learning model is sufficient to make predictions from new information, it may not be adequate when the data is constantly subject to change. Consider Alexa, Amazon's go-to AI product. Suppose you ask Alexa, by voice command, to pick a favorite movie or TV show. The algorithm will not only recommend a choice but also suggest five other titles of the same genre that match the user's tastes. However, errors are bound to creep in: one of the recommendations may be a horror film the user does not care for. This can happen because user tastes change over time. Amazon must constantly update its catalog with new content, and if the popularity trend shifts significantly from what it was previously, the algorithm will probably need to be retrained to match the current scenario, which can put a tremendous strain on resources.
This is where ensemble learning comes into play. Put simply, rather than relying on a single machine learning model to make predictions, we create a group of models. Think of it as a panel of experts who each render an opinion on a critical decision. Each expert is uniquely suited to address the possible knowledge gaps of the others, allowing the chairperson to make a high-quality, informed decision with major implications, much like a strategy for tackling a natural disaster, a pandemic, or a huge business acquisition deal.
Now, there is a catch to this approach: by sampling the data and assigning a portion to each model, information may be lost in the process. But it does, to a great extent, address the problem of constantly morphing data. The data samples need not be statistically similar to one another (in their means, medians, modes, standard deviations, etc.), since the load is shared by an ensemble of models, making for a very robust technique.
Two very popular ensemble techniques are the random forest and boosting methods.
Boosting
Suppose we have a classification problem. Let's say a patient walks into a hospital, and the staff need to figure out whether this person has diabetes based on some markers (age, BMI, height, blood parameters, etc.). We already have data on past patients with these attributes and an outcome (1 if the patient has diabetes, 0 if not). A good example is the Pima Indian Diabetes Dataset from Kaggle, which records attributes such as Pregnancies, Glucose, BloodPressure, BMI, and Age, along with an Outcome column.
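To make this concrete, here is a minimal sketch of how the dataset might be loaded and inspected with pandas. The file name diabetes.csv is an assumption based on the usual Kaggle download, so adjust the path to wherever your copy lives.

    import pandas as pd

    # Load the Pima Indian Diabetes data (file name assumed from the Kaggle download)
    df = pd.read_csv("diabetes.csv")
    print(df.head())                     # a few sample patients
    print(df["Outcome"].value_counts())  # 1 = diabetic, 0 = not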
In instances like these, it is very likely that the non-diabetics will far outnumber the diabetics, much as prompt borrowers outnumber defaulters in loan data. The important thing is to ensure that our machine learning model can reliably pick out the diabetics from the rest so that the right treatment can be administered. The confusion matrix of our model's predictions would look something like this:

                         Predicted diabetic    Predicted normal
    Actually diabetic    true positives        false negatives
    Actually normal      false positives       true negatives
Realistically, the normal patients will likely far outnumber the diabetics, so the model may be biased toward learning from the majority class rather than the minority one. This is called data imbalance, and it can be the Achilles' heel of many machine learning classification problems. In this example, the cost of wrongly labelling a diabetic patient as normal can be very high.
So how do we get around this? While many algorithms exist to balance datasets, a common synergistic approach is the boosting method. With this technique, a first machine learning model makes an initial prediction of the positives. Any misclassified data points are then assigned extra weight, which serves as a cue for the next model to classify those points better, and so on, until performance stops improving. This is an ensemble approach: each prediction is vetted by the subsequent model and its shortcomings addressed, making for a robust final prediction!
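As a concrete illustration, here is a minimal sketch of this reweighting idea using scikit-learn's AdaBoostClassifier on a synthetic imbalanced dataset, a stand-in for the Pima data, so the numbers are illustrative rather than real results:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix

    # Synthetic stand-in: roughly 90% "normal" (0) vs 10% "diabetic" (1)
    X, y = make_classification(n_samples=2000, n_features=8,
                               weights=[0.9, 0.1], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=42)

    # Each new weak learner upweights the points its predecessors misclassified
    model = AdaBoostClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print(confusion_matrix(y_test, model.predict(X_test)))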
Random Forest
This reminds me of a popular Game of Thrones quote from the High Septon nicknamed the "High Sparrow": "The many, however weak, will always overpower the strong who are the few". A random forest is a group of trees: it starts off with numerous weak learners that each make their own predictions. Each tree trains on a random sample of observations drawn row-wise, and on top of that, only a random subset of the columns is considered at each split.
Each tree in the random forest therefore learns from its own slice of the data. And yes, even though no single learner has access to all of the original rows and features, recall the premise of an ensemble: each member addresses the gaps of the others. Some trees will perform excellently on the particular set of features they happen to sample. Eventually we end up with a collection of models, each with its own unique strengths and weaknesses, and the random forest capitalizes on those strengths, by aggregating the trees' votes, to deliver reliable predictions. Note: the row sampling is done with replacement, so there is always bound to be some overlap between trees. This is important to mitigate the loss of critical information across the forest, which would compromise the accuracy of its predictions.
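Here is a minimal random forest sketch in scikit-learn, reusing the X_train/y_train split from the boosting example above; max_features controls the column sampling at each split, and bootstrap=True resamples rows with replacement:

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(
        n_estimators=200,      # many weak-ish trees voting together
        max_features="sqrt",   # random subset of columns at each split
        bootstrap=True,        # rows sampled with replacement per tree
        random_state=42,
    )
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))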
Can we consider having a hybrid of the Random forest and Boosting method?
Why not? We can fine-tune each tree in the random forest with the boosting method to get around misclassified data labels. This is a great way to mitigate the data imbalance problem in most cases. The XGBoost random forest classifier is one Python algorithm that serves this specific purpose.
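Below is a hedged sketch of that hybrid, assuming the xgboost package is installed and reusing the earlier train/test split. scale_pos_weight upweights the minority class, echoing boosting's focus on the hard positives; the negatives-to-positives ratio used here is a common heuristic, not a prescription:

    from xgboost import XGBRFClassifier

    pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
    hybrid = XGBRFClassifier(
        n_estimators=200,
        subsample=0.8,           # row sampling per tree
        colsample_bynode=0.5,    # column sampling per split
        scale_pos_weight=pos_weight,
        random_state=42,
    )
    hybrid.fit(X_train, y_train)
    print(confusion_matrix(y_test, hybrid.predict(X_test)))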
Limitations of Ensemble learning
That said, ensembles are not a free lunch. Training and maintaining many models costs considerably more compute and memory than a single model; the combined prediction is harder to interpret than a single tree's decision path; and, as noted earlier, sampling the data across members can discard information. Choosing ensemble sizes and parameters also adds a layer of tuning effort.
Conclusion
Ensemble methods do make for robust predictions and get around many of the data imbalance problems commonly encountered. In situations where it is difficult to rely on a single model for accurate predictions, assembling a team of models that capitalizes on their individual strengths is the way to go.
As for selecting ensemble sizes and parameters, it is by no means an easy task; it is an art cultivated through experience and, more importantly, fine-tuned to the problem's needs.