Demystifying the Random Forest Algorithm


Just for the record, I have been a huge fan of Random Forest since I was first introduced to it, and I am sure many of you are as well. It is one of the easiest models to train: being non-parametric, it makes almost no assumptions about the data, it requires negligible pre-processing, and it almost always gives fairly decent and reliable output.

However, I strongly believe that one needs a complete understanding of how a model works in order to use it right and not get carried away. A data scientist has been described as "someone better at statistics than any software engineer, and better at software engineering than any statistician". I come from a Computer Science background, and hence I spend a lot of time getting my statistics concepts right.

Random Forest works well in a large number of cases, but not all. Many people use it without understanding the drawbacks and limitations that can hurt model performance. Below is a summary of key points to consider while training a Random Forest:


1. Random Forests do not perform well on datasets with high-cardinality categorical features.

For example, if one feature has 20 unique values and another has 50, there are 20 x 50 = 1,000 unique combinations to split on. There is a good chance that some of these combinations get fitted closely during the training phase, producing good accuracy scores, yet they may not perform as well on new, unseen data points, and hence the model does not generalize well.
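A quick sketch of this failure mode, using scikit-learn on made-up data (the "ID column" setup below is my own illustration, not from the article): a forest trained on a purely random, label-encoded high-cardinality column learns the training labels almost perfectly, yet scores no better than chance on held-out rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# A purely random "customer ID"-style feature with 500 unique values,
# label-encoded as integers, and labels that are pure coin flips.
n = 1000
ids = rng.integers(0, 500, size=n).reshape(-1, 1)
y = rng.integers(0, 2, size=n)

train, test = slice(0, 800), slice(800, None)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(ids[train], y[train])

train_acc = clf.score(ids[train], y[train])
test_acc = clf.score(ids[test], y[test])

# The forest can isolate individual IDs and memorize noise: training
# accuracy comes out high, test accuracy hovers around chance (~0.5).
print(train_acc, test_acc)
```

There is no real signal here at all; the gap between the two scores comes purely from the trees having enough splitting power to memorize hundreds of distinct category values.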


2. Like all trees, Random Forests have discrete outputs, since they have a finite number of leaf nodes. As a result, interpolating between the discrete predictions and extrapolating beyond the training data takes extra effort.

For example, with a continuous response variable you might only ever get 50 distinct values as predictions, because the tree has 50 leaf nodes. In a linear regression model, by contrast, the equation of the fitted line gives a continuous range of values. Usually this is not a deal-breaker, but it is something to look out for.
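A minimal illustration of the extrapolation issue, assuming scikit-learn (the y = 2x setup is my own toy example): a forest trained on a perfectly linear relation cannot predict beyond the range of targets it saw, because each leaf can only ever return an average of training targets.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on a perfectly linear relation y = 2x over x in [0, 10].
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2 * X.ravel()

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X, y)

# Inside the training range the fit looks fine...
pred_mid = reg.predict([[5.0]])[0]
print(pred_mid)  # close to the true value 10

# ...but outside it, the prediction flat-lines near max(y_train) = 20,
# because no leaf can output a value larger than the targets it averaged.
pred_far = reg.predict([[20.0]])[0]
print(pred_far)  # stuck near 20, nowhere near the true value 40
```

A linear model fitted to the same data would return 40 at x = 20 without any trouble; the forest structurally cannot.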


3. It does not work well with time-series data. Again, trees in general don't. A tree's prediction is always an average of training targets it has already seen, so a trending series constantly pushes the model into the extrapolation regime described in point 2, and random bootstrap sampling takes no notice of the temporal ordering of the rows.


4. Individual trees can sometimes be quite inaccurate.

For example, a tree sometimes needs to grow to an immense depth to capture a decision boundary as simple as y = x1 + x2, because axis-aligned splits can only approximate a diagonal surface as a staircase. Depending on your ‘max_depth’ parameter, none of the trees may ever be able to capture it.
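This point can be sketched with a single scikit-learn decision tree (my own toy setup): even the perfectly linear surface y = x1 + x2 needs considerable depth before the staircase of axis-aligned splits fits it well. The scores below are in-sample R², so they measure capacity, not generalization.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 2))
y = X[:, 0] + X[:, 1]  # a "simple" linear target

scores = {}
for depth in (2, 4, 8):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X, y)
    # In-sample R^2: how well the staircase of axis-aligned splits
    # can even represent the diagonal surface at this depth.
    scores[depth] = tree.score(X, y)
    print(depth, round(scores[depth], 3))
```

At depth 2 the tree has only four leaves and fits the plane poorly; each extra level roughly quarters the residual, so it takes many levels to approach a surface a linear model captures with three coefficients.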


5. Random Forest is known to work well with high-dimensional feature spaces. However, sometimes only a handful of those features are useful. Because of random feature sampling, some trees in the forest may not see any useful feature at all, yet they still contribute to the final prediction with equal weight.
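This effect can be mimicked with a toy simulation in plain Python (this is a deliberately simplified stand-in, not scikit-learn's actual sampling scheme): a majority vote over a few informed voters plus many coin-flip voters, all weighted equally, is far less accurate than the informed voters alone.

```python
import random

random.seed(0)

def vote_accuracy(n_informed, n_noise, trials=2000):
    """Equal-weight majority vote of n_informed voters that are right
    90% of the time, plus n_noise voters that flip a fair coin."""
    correct = 0
    for _ in range(trials):
        truth = random.randint(0, 1)
        votes = [truth if random.random() < 0.9 else 1 - truth
                 for _ in range(n_informed)]
        votes += [random.randint(0, 1) for _ in range(n_noise)]
        if sum(votes) * 2 > len(votes):
            guess = 1
        elif sum(votes) * 2 < len(votes):
            guess = 0
        else:
            guess = random.randint(0, 1)  # break ties randomly
        correct += guess == truth
    return correct / trials

acc_clean = vote_accuracy(5, 0)    # only informed voters
acc_noisy = vote_accuracy(5, 45)   # same voters drowned in noise voters
print(acc_clean, acc_noisy)
```

With five 90%-accurate voters alone the vote is almost always right; add forty-five coin-flippers with equal weight and accuracy drops sharply, which is the analogue of trees that drew no useful features still voting in the forest.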


6. Random Forests can be easy and fast to train, but they are comparatively slow at prediction time, since every query has to be run through every tree in the forest. This low prediction throughput is not ideal for real-time prediction use cases.
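A rough timing sketch with scikit-learn (absolute numbers will vary by machine; the data here is synthetic): prediction latency grows roughly linearly with the number of trees, because every query row must traverse every tree.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = X @ rng.normal(size=10)
X_query = rng.normal(size=(5000, 10))

elapsed = {}
for n_trees in (10, 200):
    reg = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    reg.fit(X, y)
    t0 = time.perf_counter()
    reg.predict(X_query)  # every row walks all n_trees trees
    elapsed[n_trees] = time.perf_counter() - t0
    print(n_trees, "trees:", round(elapsed[n_trees], 4), "s")
```

If latency matters, the usual levers are fewer or shallower trees, or batching queries; a single linear model answers the same queries in one matrix multiply.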



More articles by Shantanu Chandra

  • Underlying assumptions of ML algorithms
