Overview of CatBoost
CatBoost is an algorithm that belongs to the family of gradient boosting decision trees, alongside algorithms such as XGBoost and AdaBoost.
It is the newest addition to the family and brings features that have helped data scientists win competitions in recent years.
The feature that separates CatBoost from the rest is its unbiased boosting with categorical variables. Boosted trees generally handle categorical variables with one-hot encoding; however, for high-cardinality categories such as "User Id", this technique can generate an enormous amount of data. To handle this, we can group the categories and then encode the clusters, but that method can lead to loss of information. CatBoost instead uses Ordered Target Statistics, which is discussed in the next sections.
Note: Target statistics is a family of encoding methodologies, such as mean target encoding, in which each category is replaced by the mean of the target for that category, thus preserving its relationship with the target as well as with the other categories in the feature.
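To make the note concrete, here is a minimal sketch of mean target encoding. The data, category names, and function name are illustrative, not part of CatBoost itself:

```python
# Minimal sketch of mean target encoding (one form of target statistics).
from collections import defaultdict

def mean_target_encode(categories, targets):
    """Replace each category with the mean target over all rows in that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories]

cities = ["NY", "LA", "NY", "SF", "LA", "NY"]
clicked = [1, 0, 0, 1, 1, 1]
print(mean_target_encode(cities, clicked))
# "NY" becomes (1+0+1)/3, "LA" becomes (0+1)/2, "SF" becomes 1/1
```

Note that this naive version uses each row's own target when computing its category mean, which leaks target information; that is exactly the bias CatBoost's ordered variant avoids.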
The target statistics approach used in CatBoost has significantly outperformed other encoding approaches.
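A rough sketch of the idea behind Ordered Target Statistics follows: each row is encoded using only the targets of rows that appear *before* it in a random permutation, smoothed toward a prior, so a row never sees its own target. The function name, prior, and smoothing parameter here are illustrative assumptions, not CatBoost's exact internals:

```python
# Sketch of Ordered Target Statistics with a smoothing prior.
import random
from collections import defaultdict

def ordered_target_stats(categories, targets, prior=0.5, a=1.0, seed=0):
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # the random "history" order
    sums, counts = defaultdict(float), defaultdict(int)
    encoded = [0.0] * n
    for i in order:
        c = categories[i]
        # Only rows earlier in the permutation contribute -> no target leakage.
        encoded[i] = (sums[c] + a * prior) / (counts[c] + a)
        sums[c] += targets[i]
        counts[c] += 1
    return encoded

print(ordered_target_stats(["NY", "LA", "NY", "SF"], [1, 0, 1, 1]))
```

The first occurrence of each category in the permutation simply receives the prior, since it has no history yet.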
In gradient boosted decision trees, constructing a decision tree can be divided into two parts:
1. Choosing tree structure or splitting conditions with features
2. Assigning or choosing the values of the leaves
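The two phases above can be sketched for a single-split tree (a stump) with squared loss, where phase 2 reduces to taking the mean residual in each leaf. The data, function name, and exhaustive threshold search are illustrative:

```python
# Illustration of the two phases for one stump under squared loss:
# phase 1 chooses the split, phase 2 assigns each leaf the mean residual.

def best_stump(xs, residuals):
    best = None
    for t in sorted(set(xs)):  # phase 1: try each candidate threshold
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        # Phase 2: leaf value = mean residual of the samples in the leaf.
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]  # (threshold, left leaf value, right leaf value)

print(best_stump([1, 2, 3, 4], [0.1, 0.2, 0.9, 1.0]))
```

With this toy data the split lands between 2 and 3, where the residuals jump.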
CatBoost performs the second phase, the assignment of leaf values, in the standard way, but it offers two modes for the first phase: Ordered and Plain.
The development team found that Ordered mode is useful on small data sets, based on the hypothesis that the higher bias of Plain mode hurts performance there, though this also depends on the relationship between the features and the target. In general tests it has been verified that Ordered mode performs well on smaller data sets while Plain mode works best on larger ones.
· Working of CatBoost
At the start of the algorithm, CatBoost generates x + 1 independent random permutations of the data set for training. All permutations apart from the first are used for constructing the decision trees, while the first one serves for choosing the leaf values.
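The permutation setup described above can be sketched as follows. The function name, seed, and the value of x are illustrative assumptions, not CatBoost's actual internals:

```python
# Sketch: generate x + 1 random permutations of the row indices, reserving
# the first for choosing leaf values and the rest for building trees.
import random

def make_permutations(n_rows, x, seed=42):
    rng = random.Random(seed)
    perms = [rng.sample(range(n_rows), n_rows) for _ in range(x + 1)]
    leaf_value_perm, tree_perms = perms[0], perms[1:]
    return leaf_value_perm, tree_perms

leaf_perm, tree_perms = make_permutations(n_rows=6, x=3)
print(len(tree_perms))  # x permutations remain for constructing trees
```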
CatBoost uses the same splitting criterion across an entire level of the tree, so the tree remains balanced and less prone to overfitting.
CatBoost handles data very efficiently, and a few tweaks, such as choosing the boosting mode that matches the size of the data, can increase efficiency further.
Hope the overview helps!