Decision Tree - Introduction
A decision tree uses a tree-like model to make predictions. It resembles an upside-down tree. It is also very similar to how you make decisions in real life: you ask a series of questions to arrive at a decision.
A decision tree splits the data into multiple sets. Then, each of these sets is further split into subsets to arrive at a decision.
With high interpretability and an intuitive algorithm, decision trees mimic the human decision-making process and excel at handling categorical data. Unlike algorithms such as logistic regression or SVMs, decision trees do not fit a linear relationship between the independent variables and the target variable; rather, they can model highly nonlinear data.
With decision trees, you can easily explain all the factors leading to a decision/prediction. Hence, they are easily understood by business stakeholders. They also form the building blocks of random forests, which are very popular algorithms in the Kaggle community.
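To make this concrete, here is a minimal sketch (not from the original notes) of fitting a decision tree with scikit-learn; the iris data set and the parameter choices are assumptions made purely for illustration.

```python
# A minimal sketch, assuming scikit-learn is installed; the iris data set
# stands in for any tabular classification problem.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The fitted tree asks a series of questions about the features and routes
# each sample down to a leaf, where a class is predicted.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```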
- An attribute can appear in one or more tests/nodes of a decision tree. For example, in tests such as 'Age > 50' or 'Age > 20', Age is the attribute and 50 or 20 is the value being compared against.
- The most informative features are towards the top of a tree.
If a test splits the data into more than two partitions, the tree is called a multiway decision tree.
Almost always, you can identify the various factors that led to a decision. In fact, the ability of trees to relate the predictor variables to the predictions is often underestimated.
As a rule of thumb, if interpretability by laymen is what you're looking for in a model, decision trees should be at the top of your list.
Each decision is reached via a path that can be expressed as a series of 'if' conditions satisfied together, e.g., if 'thal' is not equal to 3, and if the number of vessels colored by fluoroscopy is greater than or equal to 0.5, then the patient has heart disease.
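Written out by hand, such a path is just nested 'if' statements. The function below is a hypothetical rendering of the path above; the feature names and thresholds mirror the heart-disease example, while the behaviour of the remaining branches is assumed for illustration.

```python
# Hypothetical hand-coded rendering of the decision path described above.
# 'thal' and the fluoroscopy count are features from the heart-disease
# example; the thresholds mirror the text. The `return False` branches are
# illustrative assumptions, not part of the original example.
def has_heart_disease(thal: float, fluoroscopy: float) -> bool:
    if thal != 3:                 # first test on the path
        if fluoroscopy >= 0.5:    # second test on the path
            return True           # leaf: patient has heart disease
    return False                  # other paths: assumed healthy here

print(has_heart_disease(thal=2, fluoroscopy=1.0))  # True
```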
In Python, visualizing a decision tree requires the 'pydotplus' library and the external software 'Graphviz'.
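The usual workflow looks like the sketch below, assuming a fitted classifier `clf` (such as the one from the earlier snippet) and a working Graphviz installation on the system path.

```python
# A sketch of the standard visualization workflow: export_graphviz emits
# DOT text describing the fitted tree, and pydotplus renders it to an image.
from sklearn.tree import export_graphviz
import pydotplus

dot_data = export_graphviz(
    clf,
    out_file=None,   # return the DOT source as a string
    filled=True,     # color nodes by majority class
    rounded=True,
)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png("decision_tree.png")  # rendered image of the tree
```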
Concepts Behind Decision Tree Construction:
- Homogeneity measures: given 10 attributes, how do you decide which attribute to split on first? (A sketch computing the two measures below follows this list.)
- Gini index
- Entropy and information gain
- Splitting by R-squared
- Advantages and disadvantages
- Tree truncation
- Pruning
- Hyperparameters
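Here is a short sketch of the two homogeneity measures named above, written directly from their standard definitions (Gini index: 1 minus the sum of squared class proportions; entropy: the negated sum of p * log2(p)); the toy label lists are assumptions for illustration.

```python
# Homogeneity measures computed from class proportions at a node.
from collections import Counter
from math import log2

def gini(labels):
    """Gini index: 1 - sum(p_i^2). Zero for a pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)). Zero for a pure node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy reduction from splitting `parent` into the `children` subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

mixed = ["yes"] * 5 + ["no"] * 5
print(gini(mixed))       # 0.5 (maximally impure for two classes)
print(entropy(mixed))    # 1.0
print(gini(["yes"] * 10))  # 0.0 (pure node)
print(information_gain(mixed, [["yes"] * 5, ["no"] * 5]))  # 1.0 (perfect split)
```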
Advantages:
- Predictions made by a decision tree are easily interpretable.
- A decision tree does not assume anything specific about the nature of the attributes in a data set. It can seamlessly handle all kinds of data — numeric, categorical, strings, Boolean, and so on.
- It does not require normalization, since it only compares values within a single attribute.
- Decision trees often give us an idea of the relative importance of the explanatory attributes used for prediction (see the sketch after this list).
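For instance, scikit-learn exposes impurity-based importances on any fitted tree; the sketch below assumes the `clf` fitted on iris in the earlier snippet.

```python
# A sketch, assuming `clf` is the tree fitted on iris earlier. Importances
# sum to 1; higher values mean the attribute drove more of the splits.
from sklearn.datasets import load_iris

for name, importance in zip(load_iris().feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```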
Disadvantages:
- Decision trees tend to overfit the data. If allowed to grow without any check on its complexity, a tree will keep splitting until it has correctly classified (or rather, mugged up) all the data points in the training set. (A sketch of how this growth is reined in follows this list.)
- Decision trees also tend to be very unstable, which is an implication of overfitting: even small changes in the data can change the tree considerably.
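A sketch of how that unchecked growth is typically reined in using the truncation/pruning hyperparameters listed earlier; the specific values, and the reuse of the earlier train/test split, are assumptions for illustration.

```python
# Comparing an unconstrained tree with one whose complexity is capped.
# Assumes X_train, X_test, y_train, y_test from the first snippet.
from sklearn.tree import DecisionTreeClassifier

unpruned = DecisionTreeClassifier(random_state=42)  # grows until leaves are pure
pruned = DecisionTreeClassifier(
    max_depth=3,         # truncation: stop after three levels of tests
    min_samples_leaf=5,  # require at least 5 training points per leaf
    random_state=42,
)
for model in (unpruned, pruned):
    model.fit(X_train, y_train)
    print(model.get_depth(),
          model.score(X_train, y_train),  # train accuracy (near 1.0 if overfit)
          model.score(X_test, y_test))    # test accuracy
```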