Classification and regression trees
Classification is a type of supervised learning: it assigns each data point to a class, and it is best used when the output takes finite, discrete values.
Regression is also a type of supervised learning; the difference between the two tasks is that the dependent attribute is numerical for regression and categorical for classification.
In both cases we can grow a feature tree. In a classification problem the goal is to reduce the impurity of the nodes, while in a regression problem the goal is to reduce the variance.
Impurity measures for binary classification, where p is the proportion of positives at a node (a short code sketch follows this list):
Entropy: -p log2(p) - (1-p) log2(1-p)
Gini Index: 2p(1-p)
Minority Class: min(p, 1-p)
√Gini Index: √(2p(1-p))
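As a quick sketch, here are these four measures in Python (the function names are my own; p is the proportion of positives at a node):

```python
import math

def entropy(p):
    # A pure node (p = 0 or 1) has zero entropy by convention.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)

def minority_class(p):
    return min(p, 1 - p)

def sqrt_gini(p):
    return math.sqrt(2 * p * (1 - p))

# A perfectly balanced node (p = 0.5) maximises every measure:
print(entropy(0.5), gini(0.5), minority_class(0.5), sqrt_gini(0.5))
# -> 1.0 0.5 0.5 0.7071067811865476
```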
Consider the following example for growing a feature tree:
We have five positive and five negative examples, where the label indicates whether an instance belongs to the 'Dolphin' category or not. Now consider each attribute and tally the number of positives and negatives against each of its values.
Length: {[3,4,5]} - {[2+, 0-], [1+, 3-], [2+, 2-]}
Gills: {[yes, no]} - {[0+, 4-], [5+, 1-]}
Beak: {[yes, no]} - {[5+, 3-], [0+, 2-]}
Teeth: {[many, few]} - {[3+, 4-], [2+, 1-]}
Total impurity of a split = weighted average of the impurities of its child nodes = Σ (weight × impurity), where each child's weight is its fraction of the examples and the impurity can be entropy, the Gini Index, or any of the measures above.
Note: Length = 3 forms a pure node, i.e. it contains only the positive class and no negatives, so its impurity is automatically zero. Length = 5 forms a balanced node of 50 percent positives and 50 percent negatives, so its impurity is maximal (1 in the case of entropy). (Go ahead and calculate the values using one of the impurity criteria; you will obtain the same extremes.)
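To make the weighted average concrete, here is a minimal sketch of a helper (the name weighted_impurity is my own) that takes a split as a list of (positives, negatives) counts per child node:

```python
import math

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def weighted_impurity(children, impurity):
    """children: one (positives, negatives) pair per child node of the split."""
    total = sum(p + n for p, n in children)
    return sum((p + n) / total * impurity(p / (p + n))
               for p, n in children if p + n > 0)

# Length = {3, 4, 5} partitions the 10 examples into [2+,0-], [1+,3-], [2+,2-]:
print(weighted_impurity([(2, 0), (1, 3), (2, 2)], entropy))  # ~0.7245
```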
Total Entropy(Length[3,4,5]) = (2/10) (0) + (4/10) [(-1/4)log2(1/4) - (3/4)log2(3/4)] + (4/10) (1) = 0.72
Total Entropy(Gills[yes,no]) = (4/10) (0) + (6/10) [(-5/6)log2(5/6) - (1/6)log2(1/6)] = 0.39
Total Entropy(Beak[yes,no]) = (8/10) [(-5/8)log2(5/8) - (3/8)log2(3/8)] + (2/10)(0) = 0.76
Total Entropy(Teeth[many,few]) = (7/10) [(-3/7)log2(3/7) - (4/7)log2(4/7)] + (3/10) [(-2/3)log2(2/3) - (1/3)log2(1/3)] = 0.97
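Reusing entropy and weighted_impurity from the sketch above, all four totals can be checked in one loop:

```python
splits = {
    "Length": [(2, 0), (1, 3), (2, 2)],
    "Gills":  [(0, 4), (5, 1)],
    "Beak":   [(5, 3), (0, 2)],
    "Teeth":  [(3, 4), (2, 1)],
}
for name, children in splits.items():
    print(name, round(weighted_impurity(children, entropy), 2))
# Length 0.72, Gills 0.39, Beak 0.76, Teeth 0.97
```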
The least impurity is observed for Gills, so we select Gills to split on.
In a similar manner, we can evaluate the Gini Index criterion as well; the values we obtain are as follows:
Total Gini(Length) = (2/10)(0) + (4/10)[(2)(1/4)(3/4)] + (4/10)[(2)(2/4)(2/4)] = 0.35
Total Gini(Gills) = (4/10)(0) + (6/10)[(2)(5/6)(1/6)] = 0.17
Total Gini(Beak) = (8/10)[(2)(5/8)(3/8)] + (2/10)(0) = 0.38
Total Gini(Teeth) = (7/10)[(2)(3/7)(4/7)] + (3/10)[(2)(2/3)(1/3)] = 0.48
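The same helper verifies the Gini totals once we swap in the Gini function:

```python
def gini(p):
    return 2 * p * (1 - p)

for name, children in splits.items():
    print(name, round(weighted_impurity(children, gini), 2))
# Length 0.35, Gills 0.17, Beak 0.38, Teeth 0.48
```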
As seen, the Gini Index criterion also picks 'Gills' as the best feature to split on. So 'Gills' is the first split of the tree. To find the next split, tally the positives and negatives with respect to Gills = yes:
Length: {[4,5]} - {[0+, 2-], [0+, 2-]}
Beak: {[yes, no]} - {[0+, 2-], [0+, 2-]}
Teeth: {[few, many]} - {[0+, 0-], [0+, 4-]}
Now, as observed, every example with Gills = yes is negative, so this branch is already pure (impurity = 0): Gills = yes is itself a leaf of the tree, and no further rules need to be added under this decision rule.
So we now tally the positives and negatives with respect to Gills = no and calculate the impurities in the same manner. Please try this and come up with the solution yourself; the short script below grows the complete tree and prints it, so you can check your answer.
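Since this article only lists per-attribute counts, the sketch below first reconstructs one possible dataset that matches every tally given above (this reconstruction is an assumption; any dataset with the same counts yields the same tree), then grows the tree greedily by entropy:

```python
import math
from collections import Counter

# One possible dataset consistent with every tally in this article.
# Only per-attribute counts are given above, so this reconstruction is
# an assumption; any dataset matching those counts gives the same tree.
# Each row: (Length, Gills, Beak, Teeth, label)
data = [
    (3, "no",  "yes", "many", "+"), (4, "no",  "yes", "many", "+"),
    (3, "no",  "yes", "few",  "+"), (5, "no",  "yes", "many", "+"),
    (5, "no",  "yes", "few",  "+"), (5, "yes", "yes", "many", "-"),
    (4, "yes", "yes", "many", "-"), (5, "yes", "no",  "many", "-"),
    (4, "yes", "no",  "many", "-"), (4, "no",  "yes", "few",  "-"),
]
FEATURES = {"Length": 0, "Gills": 1, "Beak": 2, "Teeth": 3}

def node_entropy(rows):
    p = sum(1 for r in rows if r[-1] == "+") / len(rows)
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def weighted_entropy(rows, feat):
    # Weighted average of child entropies for a split on feat.
    i = FEATURES[feat]
    counts = Counter(r[i] for r in rows)
    return sum(c / len(rows) * node_entropy([r for r in rows if r[i] == v])
               for v, c in counts.items())

def grow(rows, feats, depth=0):
    labels = Counter(r[-1] for r in rows)
    if len(labels) == 1 or not feats:   # pure node (or no features left): leaf
        print("    " * depth + "-> leaf:", dict(labels))
        return
    best = min(feats, key=lambda f: weighted_entropy(rows, f))
    i = FEATURES[best]
    for v in sorted({r[i] for r in rows}, key=str):
        print("    " * depth + f"{best} = {v}")
        grow([r for r in rows if r[i] == v],
             [f for f in feats if f != best], depth + 1)

grow(data, list(FEATURES))
```

The printed tree has Gills = yes as a pure negative leaf, while the Gills = no branch splits further on Length, with Teeth resolving the mixed Length = 4 node.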
This is the internal working of growing a feature tree. In the next article, I will explain reduction of variance and how to find the best split in the case of regression trees.