Decision Trees in Machine Learning
So, what's a decision tree? As intuitive as it sounds, it has some decisions to make on the way down to the leaves. Basically, imagine it this way: it's like an inverted tree, with the root on top, and the leaves are the points we have to reach.
The decisions we make change the structure of the tree, but the data remains the same.
So let's talk about grouping first. Why do we need to group data points? To sort them and get some insights from them; in machine-learning language we call this clustering. Grouping is simply sorting things into types, for example the types of species on Earth: humans, animals, plants, marine beings, and so on.
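To make that concrete, here is a minimal clustering sketch, assuming scikit-learn is available; the seed-like feature values are invented for illustration:

```python
from sklearn.cluster import KMeans
import numpy as np

# Invented 2-D feature vectors (say, size and colour intensity of seeds)
points = np.array([[1.0, 0.9], [1.1, 1.0], [0.2, 0.1], [0.25, 0.15]])

# Group the points into two clusters without using any labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # e.g. [1 1 0 0] -- each point's group
```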
What is classification? It's more or less a yes/no problem: will a particular person repay my loan, or not? There are two families of classification techniques in ML: descriptive classification and discriminative classification. Descriptive classification covers techniques like Bayesian methods, k-nearest neighbors, and the Parzen window, while the decision tree algorithm sits alongside neural networks and support vector machines among the discriminative techniques.
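As a minimal sketch of the yes/no loan question treated as a discriminative classifier (assuming scikit-learn; the income/debt features and labels are invented):

```python
from sklearn.tree import DecisionTreeClassifier

# Invented toy data: [income, existing_debt] -> 1 = repaid, 0 = defaulted
X = [[50, 5], [80, 10], [20, 15], [30, 25]]
y = [1, 1, 0, 0]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(clf.predict([[60, 8]]))  # e.g. [1], i.e. "will repay"
```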
How would we assign class labels? In mundane life, consider lentils with some stones mixed in. We call the packet lentils because most of what we see is lentils, and we ignore the stones, the impurities. Labeling in ML is done the same way: if n data points belong to a group, say lentils, and n' data points belong to the stones group, the cluster is called lentils when n > n' and stones when n < n'. Here a doubt arises: how much impurity can we allow in a packet of lentils that is to be sold? Obviously we want fewer stones, right? To optimize this, the ML world uses 'accuracy' as one parameter. How is it done? The packet is called lentils if lentils/(lentils+stones) is greater than stones/(lentils+stones). It looks like a plain ratio, but can we actually count the stones among the lentils? The straight answer is no, but we can approximate the count, using a maximization technique also called hard clustering.
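A minimal sketch of this majority labeling, with illustrative counts:

```python
from collections import Counter

# A cluster of data points with their true classes (illustrative counts)
cluster = ["lentil"] * 95 + ["stone"] * 5

counts = Counter(cluster)
label, n_majority = counts.most_common(1)[0]
accuracy = n_majority / len(cluster)

print(label, accuracy)  # lentil 0.95 -> label the whole packet "lentil"
```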
So here we need to understand which data point looks like a stone and which like a lentil. A stone looks black, and its shape is rather irregular compared to a lentil. We need to work out which features of a data point to consider: is it the color, or is it the shape? Is this the best way to divide the data, and is the feature we picked the best feature for splitting it into groups?
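A minimal sketch of picking the better feature, with invented color/shape values: split the points on each candidate feature and keep the one whose groups come out purer.

```python
from collections import Counter

# Each point: (colour, shape, class) -- invented toy data
points = [("black", "irregular", "stone"), ("black", "round", "stone"),
          ("orange", "round", "lentil"), ("orange", "round", "lentil"),
          ("orange", "irregular", "lentil")]

def weighted_purity(points, feature_idx):
    # Split points into groups by the chosen feature ...
    groups = {}
    for p in points:
        groups.setdefault(p[feature_idx], []).append(p[2])
    # ... then average each group's majority-class fraction, weighted by size,
    # which simplifies to: sum of majority counts / total number of points
    total = len(points)
    return sum(Counter(g).most_common(1)[0][1] / total for g in groups.values())

print(weighted_purity(points, 0))  # split on colour -> 1.0 (pure groups)
print(weighted_purity(points, 1))  # split on shape  -> 0.6 (mixed groups)
```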
Softening the cluster:
Moving forward from hard clustering to soft clustering, here we learn about weighted averages. Going back to the mundane world, consider the map of India below: the green parts are quite sparse overall, but where greenery is dense we call it a forest. India isn't full of forests, right? The dense patches are the pure regions, showing greenery with 100% accuracy in the visualization, and that is why we call them forests: where there's density, there's purity and accuracy. Now go to the brown circle; here too we can see greenery, but only in very light ink. In ML we would treat it like this: the purest regions get smaller weights compared to the impure regions, which leads to information theory, which says 'where the information is infrequent, there is more importance.' The expected accuracy of any cluster in the feature space is nothing but its purity and size combined; in the mathematical world 'and' converts to multiplication, hence it's: size * purity.
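A minimal sketch of that expected accuracy as a size-weighted purity, with invented cluster sizes and purities:

```python
# Each cluster: (number of points, purity = majority-class fraction) -- invented
clusters = [(500, 0.98), (200, 0.75), (50, 0.60)]

total = sum(size for size, _ in clusters)
# "size * purity" per cluster, normalised by the total number of points
expected_accuracy = sum(size * purity for size, purity in clusters) / total
print(expected_accuracy)  # ~0.89
```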
There are three techniques for measuring purity in ML:
Accuracy is the maximum of (number of data points of each class in a cluster / total number of data points). To get a good grip on that fraction, take an example: say packets of lentils come mixed with black gram and stones, and there are many such packets. To put dimensions on it, say one packet has 100 stones, 20 black grams, and 100 lentils; another has 30 stones, 10 black grams, and 500 lentil seeds; and so on. These class counts go in the numerator, and the denominator is the total: (100+20+100) + (30+10+500) + ... (Note: a Laplacian smoothing factor lambda is added to the numerator and the denominator.)
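A minimal sketch of the smoothed accuracy for a single packet (cluster), assuming the standard Laplace form where lambda is added to each class count and K*lambda to the total (K = number of classes):

```python
def smoothed_accuracy(counts, lam=1.0):
    """Accuracy of one packet = max class fraction, with Laplace smoothing.

    counts maps class -> number of points of that class in the packet.
    Assumes the standard form (n_c + lam) / (N + K*lam), K = number of classes.
    """
    total = sum(counts.values())
    k = len(counts)
    return max((n + lam) / (total + k * lam) for n in counts.values())

# The packet from the text: 100 stones, 20 black grams, 100 lentils
print(smoothed_accuracy({"stone": 100, "black_gram": 20, "lentil": 100}))
# (100 + 1) / (220 + 3) ~= 0.453
```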
So each division, or packet, will have a probability distribution over the classes (lentil, stone, or black gram), say P1, P2, ..., where each probability is the fraction seen in the paragraph above. Hence accuracy is max(probability of a particular class) and the error is 1 - accuracy. In other words, accuracy measures the rightly labeled data points and error the wrongly labeled ones.
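As a quick worked example with the first packet above (100 stones, 20 black grams, 100 lentils, smoothing left out): P(stone) = 100/220 ≈ 0.45, P(black gram) = 20/220 ≈ 0.09, P(lentil) = 100/220 ≈ 0.45, so accuracy = max(P) ≈ 0.45 and error = 1 - 0.45 = 0.55; such a packet is too impure to label confidently.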
GINI measure: consider the same packets of lentils, black grams, and stones as above. In one packet we see mostly lentils, with a noticeably higher percentage of lentils than black gram, so we say 'this might be a packet of lentils'. We check another packet, see more black gram, and say 'this packet might be of black gram'. The 'might be' in math is the probability: probably lentils, or probably black gram. Here is how the labeling is done: we calculate the probability of each class in the data cluster, say 6% stones, 4% black gram, and 90% lentils, hence this is a packet of lentils. By observing all the clusters this way we come to a conclusion about the expected accuracy.
GINI purity = Σ (probability²), summed over the classes; GINI impurity = 1 - purity.
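A minimal sketch of the GINI computation, using the 6% / 4% / 90% split from above:

```python
def gini_purity(probs):
    # GINI purity = sum of the squared class probabilities
    return sum(p * p for p in probs)

probs = [0.06, 0.04, 0.90]  # stones, black gram, lentils from the text
print(gini_purity(probs))      # 0.0036 + 0.0016 + 0.81 = 0.8152
print(1 - gini_purity(probs))  # GINI impurity ~= 0.1848
```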
Entropy: the less frequent the information, the more important it is; the expected information content of a cluster is its entropy.
Entropy = H = Σ p * log(1/p), where p is the probability of each class in the cluster and the sum runs over all classes.
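A minimal sketch of the entropy formula (using log base 2, so the result is in bits; zero-probability classes contribute nothing):

```python
import math

def entropy(probs):
    # H = sum of p * log2(1/p) over the classes; p = 0 terms are skipped
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.06, 0.04, 0.90]))  # ~0.57 bits: a nearly pure cluster
print(entropy([1/3, 1/3, 1/3]))     # ~1.585 bits: maximally mixed, 3 classes
```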
The hyperparameters of a decision tree are size and purity.
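In scikit-learn terms, this mapping is roughly (the mapping is my interpretation, though the parameters themselves are real): max_depth and min_samples_leaf control size, and min_impurity_decrease controls purity.

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=4,                 # "size": how deep the tree may grow
    min_samples_leaf=20,         # "size": smallest allowed leaf (cluster)
    min_impurity_decrease=0.01,  # "purity": split only if impurity drops enough
    criterion="gini",            # the purity measure: "gini" or "entropy"
)
```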
Below is a graph that briefly explains how the purity of the data clusters affects the decision tree. But how far can we go? That question is a good segue into thresholds.
Dedicated to my Professor: Shailesh Kumar _/\_