DECISION TREE

A decision tree is a non-parametric supervised learning algorithm that can be used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.

A decision tree starts with a root node, which has no incoming branches. The outgoing branches from the root node then feed into the internal nodes, also known as decision nodes. Based on the available features, both node types conduct evaluations to form homogeneous subsets, which are denoted by leaf nodes, or terminal nodes. The leaf nodes represent all the possible outcomes within the dataset. As an example, imagine that you are trying to decide whether or not to go surfing; you might apply a sequence of decision rules (for example, checking the wave height and then the wind) until you reach a yes-or-no answer at a leaf node.

This type of flowchart structure also creates an easy-to-digest representation of decision making, allowing different groups across an organization to better understand why a decision was made.


Decision tree learning employs a divide-and-conquer strategy, conducting a greedy search to identify the optimal split points within a tree. This splitting process is then repeated in a top-down, recursive manner until all, or most, records have been classified under specific class labels.

Whether or not all data points end up in homogeneous sets depends largely on the complexity of the decision tree. Smaller trees are more easily able to attain pure leaf nodes, i.e. leaves whose data points all belong to a single class. As a tree grows in size, however, it becomes increasingly difficult to maintain this purity, and too little data tends to fall within a given subtree. When this occurs, it is known as data fragmentation, and it can often lead to overfitting. As a result, decision tree learning favors small trees, which is consistent with the principle of parsimony in Occam's Razor; that is, "entities should not be multiplied beyond necessity." Said differently, decision trees should add complexity only when necessary, since the simplest explanation is often the best.

To reduce complexity and prevent overfitting, pruning is usually employed; this process removes branches that split on features of low importance. The model's fit can then be evaluated through cross-validation. Another way decision trees can maintain their accuracy is by forming an ensemble via the random forest algorithm; this classifier produces more accurate predictions, particularly when the individual trees are uncorrelated with each other.
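To make these ideas concrete, here is a minimal sketch assuming scikit-learn (whose DecisionTreeClassifier is a CART-style learner) and one of its bundled example datasets; the ccp_alpha value is an arbitrary illustration, not a recommendation. It grows an unpruned tree and a cost-complexity-pruned tree and compares their fit with cross-validation.

# Minimal sketch: grow, prune, and cross-validate a decision tree.
# Assumes scikit-learn; the dataset and ccp_alpha value are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An unpruned tree keeps splitting until its leaves are pure, which risks
# data fragmentation and overfitting.
unpruned = DecisionTreeClassifier(random_state=0)

# Cost-complexity pruning (ccp_alpha > 0) removes branches whose splits
# add little value, keeping the tree small in the spirit of Occam's Razor.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)

for name, model in [("unpruned", unpruned), ("pruned", pruned)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy = {scores.mean():.3f}")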

Types of Decision Trees

Hunt’s algorithm, developed in the 1960s to model human learning in psychology, forms the foundation of many popular decision tree algorithms, such as the following:

- ID3: Ross Quinlan is credited with the development of ID3, which is shorthand for “Iterative Dichotomiser 3.” This algorithm leverages entropy and information gain as metrics to evaluate candidate splits; Quinlan published research on it in 1986.

- C4.5: This algorithm is a later iteration of ID3, also developed by Quinlan. It can use information gain or gain ratio to evaluate split points within decision trees.

- CART: The term CART is an abbreviation for “classification and regression trees” and was introduced by Leo Breiman. This algorithm typically utilizes Gini impurity to identify the ideal attribute to split on. Gini impurity measures how often a randomly chosen data point would be misclassified if it were labeled at random according to the class distribution of the set. When evaluating splits with Gini impurity, a lower value is better; a small sketch of this computation appears after this list.
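As a rough illustration (a hypothetical helper, not the CART implementation itself), the Gini impurity of a set of class labels can be computed like this:

# Gini impurity of a list of class labels: 0 means the set is pure,
# and higher values mean the classes are more mixed.
from collections import Counter

def gini_impurity(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["yes", "yes", "yes", "yes"]))  # 0.0 (pure node)
print(gini_impurity(["yes", "no", "yes", "no"]))    # 0.5 (maximally mixed, two classes)

When CART evaluates a candidate split, it weighs the impurity of each child subset by its size and prefers the split with the lowest weighted impurity.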

How to choose the best attribute at each node

While there are multiple ways to select the best attribute at each node, two methods, information gain and Gini impurity, are popular splitting criteria for decision tree models. They help evaluate the quality of each test condition and how well it will classify samples into a class.

Entropy and Information Gain

It’s difficult to explain information gain without first discussing entropy. Entropy is a concept that stems from information theory and measures the impurity of the sample values. It is defined by the following formula, where:

Entropy(S) = -Σ p(c) log2 p(c), summed over all classes c in S

S represents the data set that entropy is calculated for
c represents the classes in set S
p(c) represents the proportion of data points that belong to class c out of the total number of data points in set S

Entropy values fall between 0 and 1 (for a two-class problem). If all samples in data set S belong to one class, then entropy equals zero. If half of the samples belong to one class and the other half to another class, entropy is at its highest value of 1. To select the best feature to split on and find the optimal decision tree, the attribute that yields the smallest entropy after splitting should be used. Information gain represents the difference in entropy before and after a split on a given attribute. The attribute with the highest information gain produces the best split, since it does the best job of classifying the training data according to its target classification. Information gain is usually represented with the following formula, where:

Information Gain(S, a) = Entropy(S) - Σ (|Sv| / |S|) * Entropy(Sv), summed over the values v of attribute a

a represents a specific attribute
Entropy(S) is the entropy of dataset S
|Sv| / |S| represents the proportion of data points in subset Sv (the records for which attribute a takes the value v) to the number of data points in dataset S
Entropy(Sv) is the entropy of subset Sv
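Both formulas are straightforward to compute directly. The sketch below uses hypothetical helper names (entropy and information_gain) and represents records as plain dictionaries; it simply follows the definitions above:

# Entropy of a labeled set, and information gain of splitting on an attribute.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """rows: list of dicts, attribute: column to split on, target: class column."""
    parent = entropy([r[target] for r in rows])
    weighted = 0.0
    # Weight each subset Sv (rows where the attribute equals v) by |Sv| / |S|.
    for v in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == v]
        weighted += (len(subset) / len(rows)) * entropy(subset)
    return parent - weighted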

Let’s walk through an example to solidify these concepts. Imagine a small “Play Tennis” dataset covering 14 days, where the class label “Play Tennis” is “Yes” on 9 days and “No” on 5 days.

For this dataset, the entropy is 0.94. This can be calculated by taking the proportion of days where “Play Tennis” is “Yes” (9/14) and the proportion of days where “Play Tennis” is “No” (5/14), and plugging these values into the entropy formula above.

Entropy(Tennis) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94

We can then compute the information gain for each attribute individually. For example, the information gain for the attribute “Humidity” would be the following, where 7/14 is the proportion of days with each humidity value (“high” and “normal”), and 0.985 and 0.592 are the entropies of those two subsets:

Gain(Tennis, Humidity) = 0.94 - (7/14)*(0.985) - (7/14)*(0.592) = 0.151
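The same numbers can be reproduced with plain arithmetic. The per-value class counts used below (3 “Yes” / 4 “No” when humidity is high, 6 “Yes” / 1 “No” when it is normal) are assumed; they are the counts implied by the subset entropies of 0.985 and 0.592 quoted above.

# Reproducing the worked example with plain arithmetic.
import math

def two_class_entropy(p, q):
    """Entropy of a set with p samples in one class and q in the other."""
    n = p + q
    return -sum((c / n) * math.log2(c / n) for c in (p, q) if c > 0)

parent = two_class_entropy(9, 5)   # full 14-day set: ~0.94
high = two_class_entropy(3, 4)     # humidity = high (assumed 3 yes, 4 no): ~0.985
normal = two_class_entropy(6, 1)   # humidity = normal (assumed 6 yes, 1 no): ~0.592
gain = parent - (7 / 14) * high - (7 / 14) * normal
print(f"entropy = {parent:.2f}, gain = {gain:.3f}")  # entropy = 0.94, gain = 0.152
                                                     # (0.151 above comes from rounding the
                                                     #  subset entropies before combining)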

         
