This Tree Can Help You Predict the Future
No, this post is not about some physical tree with mystical power, but rather about a wonderfully simple and powerful predictive analytic technique called a Decision Tree. Decision Trees can be used to find patterns in large quantities of data that may be difficult or impossible to find manually. The patterns that are discovered can and should be used to help you make better decisions by giving you insight into the outcomes of future events.
What are Decision Trees and How are they Used?
Decision trees are constructed using a supervised learning algorithm – which is to say that you need a training data set for which you know the prediction target (whether it was a sale or not, whether the employee churned or not, etc.). Ultimately the goal is for the algorithm to learn the patterns in the training data set which can then be used to predict – hopefully with some reasonable accuracy – the output for some unseen future data. Decision Trees are applicable to both regression and classification problems, though we will only focus on the latter case here.
To make the concept more concrete, let’s consider a simple use case. Assume we are a company that sells some product or service. We routinely comb through a large number of sales leads – more than we can possibly engage – to identify where we should spend our efforts trying to sell. Can we use past data to help prioritize the opportunities in the future that are most likely to bear fruit? Well fortunately for us, we have been recording data about all the sales opportunities that we've had – so the answer is yes! In our data we have a few features: Age, Height, Gender, College Degree Status, Income and Sales Outcome (this is our target variable – basically what we want our model to predict). PERFECT! This will be our training data set. In practice we would actually want to take a part of the data to train the model and another part to test the model, but we will leave that discussion for another post.
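To make that tangible, here is a minimal sketch of what such a data set and a train/test split could look like in Python (using pandas and scikit-learn). Every column name and value below is made up purely for illustration:

```python
# Hypothetical training data: one row per historical sales lead.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "age":            [25, 42, 60, 33, 51],
    "height_in":      [70, 64, 64, 68, 72],
    "gender":         ["F", "M", "M", "F", "M"],
    "college_degree": [True, False, True, True, False],
    "income_bracket": ["40-50k", "50-60k", "50-60k", "60-70k", "30-40k"],
    "sale_made":      [True, False, False, True, False],   # our target variable
})

# Hold out part of the data to evaluate the model later (the train/test split
# mentioned above).
features = data.drop(columns="sale_made")
target = data["sale_made"]
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42
)
```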
Below is an example decision tree that I created using our completely fictitious data set. Before we talk about how I actually created it, let's talk about how we can use it. The tree is simply a series of true/false evaluation conditions in the gray boxes (called “decision” nodes) which are used to arrive at the prediction in the blue boxes (called “leaf” or “terminal” nodes). The very first decision node is called the “root” node.
In order to utilize this tree to make a prediction on the outcome of a new sales lead, we start at the root node and apply the statement/question to our data: Is the person younger than 30 years old? If so, follow the branch to the right; if not, follow the branch to the left. Continue traversing the tree, evaluating the sales opportunity at each decision node. Ultimately we will arrive at a leaf (blue) node that informs us whether the sales opportunity will likely be successful or not. To illustrate this point, let's consider a new sales opportunity with the following individual:
Age: 60 yrs, Gender: Male, Height (in): 64, College Degree?: Yes, Estimated Income: $50-60k
Using our decision tree, we can predict that the sales lead is unlikely to result in a sale and accordingly may not be worth our resources (time, effort, money) to pursue. The image below shows how the tree was traversed to make that prediction.
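If you are curious what that traversal amounts to in code, here is a toy sketch written as nested if/else checks. Only the root question (younger than 30?) comes from the figure above; the other split values are hypothetical stand-ins, since a real tree would be learned from the data:

```python
# A hand-written sketch of what traversing a tree like the one pictured amounts to.
# Only the root split is from the figure; the remaining splits are hypothetical.
def predict_sale(age, gender, height_in, college_degree, income_bracket):
    if age < 30:                                    # root decision node
        if college_degree:                          # hypothetical decision node
            return "likely sale"
        return "unlikely sale"
    else:
        if income_bracket in ("60-70k", "70k+"):    # hypothetical decision node
            return "likely sale"
        return "unlikely sale"                      # leaf node our new lead lands in

# The new lead from above: 60 yrs old, male, 64 in, college degree, $50-60k income
print(predict_sale(60, "M", 64, True, "50-60k"))    # -> "unlikely sale"
```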
Pretty cool, right? But how did we build this tree? And more generally, how do we take an arbitrary set of historical data (training data) and generate the tree structure and associated decisions that we can use for future predictions? We’ll talk about that next.
Decision Tree Learning
The way the algorithm works at its core is to partition the training data set into smaller, mutually exclusive groups through a process called recursive partitioning. Recursive partitioning is just a fancy way of saying it splits the data into subgroups and further splits those subgroups into additional subgroups. In the end, each of the final groups (or partitions) will be given a label to indicate whether a sale is likely for all the sales opportunities that fall into that partition. The way the training data is partitioned is by creating a bunch of successive split criteria – which are nothing but the True/False evaluation statements like the ones we saw in the decision nodes in the above tree. The practical problem is the following: how do we know what evaluation criteria to use, and how do we assemble them into a tree-like structure in such a way that the result is useful? For our case, should we first split on age and then on gender? Vice versa? What value of age is best to split on? This is where the decision tree algorithm comes to our rescue. It builds the tree, starting at the root node, by choosing the best split criterion at each node. This is done by considering every possible split value for every possible feature (thank God for computers) and then selecting the “best” one as the split criterion. This is why it is called Decision Tree Learning, because the algorithm will actually learn the structure of the tree from the data we give it. This raises the question, though: how do we determine the quality of a split?
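For the curious, here is a rough sketch of that exhaustive search in plain Python. It assumes numeric features and uses Gini impurity (more on purity measures below) as the measure of split quality; it is meant to illustrate the idea rather than mirror any particular library's implementation:

```python
# Try every feature and every candidate split value; keep the split whose
# resulting partitions are most "pure".
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0.0 = perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, split_value) pair that minimizes weighted child impurity."""
    best = None
    best_impurity = float("inf")
    n = len(rows)
    for feature in range(len(rows[0])):
        for value in {row[feature] for row in rows}:        # every possible split value
            left  = [lab for row, lab in zip(rows, labels) if row[feature] < value]
            right = [lab for row, lab in zip(rows, labels) if row[feature] >= value]
            if not left or not right:
                continue                                     # split doesn't separate anything
            weighted = (len(left) / n) * gini_impurity(left) + \
                       (len(right) / n) * gini_impurity(right)
            if weighted < best_impurity:
                best_impurity, best = weighted, (feature, value)
    return best
```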
For me, the most intuitive way to think of the quality of a split is to think about how homogeneous (or how pure) the resulting partitions of data are in terms of the outcome our model is predicting. The combination of feature and value that maximizes the homogeneity of the resulting partitions is chosen. In other words, a good split criterion will separate – better than any other possible criterion – the data into those sales opportunities that were successful and those that were not. For those who are interested in reading a little more about the technical metrics used to measure what I am referring to as “purity” and “homogeneity”, take a look at Gini Impurity and Information Gain. This process of selecting the “best” split is repeated starting at the root node and then again on each subsequent node in a recursive fashion until all the data points in a node have the same value or no more splits can be made. Basically, we stop splitting when the terminal nodes are 100% pure or there are no more attributes to split on (i.e. all of the data points within a node are identical, even if the outcomes are not). The resulting decision tree is known as a “fully grown” decision tree.
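Putting those pieces together, a toy version of the recursive partitioning that stops only when a node is 100% pure or no useful split remains (i.e. a “fully grown” tree) might look like the sketch below, which reuses the hypothetical best_split helper from the previous snippet:

```python
def build_tree(rows, labels):
    """Recursively partition the data until each node is pure or cannot be split further."""
    # Stop if the node is 100% pure (all labels identical).
    if len(set(labels)) == 1:
        return {"leaf": True, "labels": labels}
    split = best_split(rows, labels)            # exhaustive search from the previous sketch
    # Stop if no split separates the data (e.g. all remaining rows are identical).
    if split is None:
        return {"leaf": True, "labels": labels}
    feature, value = split
    left_idx  = [i for i, row in enumerate(rows) if row[feature] < value]
    right_idx = [i for i, row in enumerate(rows) if row[feature] >= value]
    return {
        "leaf": False,
        "feature": feature,
        "value": value,
        "left":  build_tree([rows[i] for i in left_idx],  [labels[i] for i in left_idx]),
        "right": build_tree([rows[i] for i in right_idx], [labels[i] for i in right_idx]),
    }
```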
At each leaf node of the tree we use the most common value of the training data labels to assign the predicted value for that node. For example, if we come to a leaf node that has 100 training data points and all were Successful Sales, create a rule for that node that says any future data that fall into this partition should be predicted to be Successful Sales. If the target variable for the training data in a leaf node is not homogeneous, use the most common label from the training data or, better yet, assign an estimated probability. For example, if the leaf node has 70 Successful Sales and 30 Unsuccessful Attempts (out of 100 total samples in the node), we could label future data that falls in that node/partition to be Successful (this is called a Hard Rule) or we could label them 70% likely to be Successful (this is called a Soft Rule), to give an indication of the strength of the rule. Generally Soft Rules are preferable because they convey more information.
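In code, the hard and soft labeling rules for a single leaf might look like this (the label names and the 70/30 leaf are made up to match the example above):

```python
from collections import Counter

def hard_rule(leaf_labels):
    """Hard rule: predict the most common outcome among the training points in the leaf."""
    return Counter(leaf_labels).most_common(1)[0][0]

def soft_rule(leaf_labels):
    """Soft rule: predict the fraction of successful sales in the leaf as a probability."""
    return sum(1 for lab in leaf_labels if lab == "sale") / len(leaf_labels)

# The 70/30 leaf from the example above:
leaf = ["sale"] * 70 + ["no sale"] * 30
print(hard_rule(leaf))   # -> "sale"
print(soft_rule(leaf))   # -> 0.7, i.e. 70% likely to be a successful sale
```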
Our decision tree for predicting sales outcomes from above is shown again below, but now we can see the associated training data that was used to learn the actual tree. Each of the training samples – representing a historical sales lead – is shown as a black or red stick figure indicating whether the outcome of the lead was a successful sale or not. Observe that the further down the tree we get, the more homogeneous the groups become.
Practical Considerations
There are some drawbacks to training a "fully grown" decision tree as outlined in the procedure above. To motivate the potential shortcomings, consider that the overall objective of building a decision tree is to learn the patterns from a sample of training data to predict something about future samples from the population of data. Specifically, we want to learn about the patterns in our data that generalize to (i.e. are also present in) the population of unseen future data. At the same time, we want to make sure we don’t learn the data patterns that are specific to the training data but not representative of the patterns in the population – think of this as noise. The question becomes: how do we ensure we are learning as much useful information about our data as possible without learning superfluous patterns specific only to our sample of training data? This is commonly referred to as balancing underfitting and overfitting the training data.
The practical approaches to avoiding the issue of a Decision Tree overfitting the training data are to stop partitioning after the tree reaches a certain depth (which can serve to limit the complexity of a partition), stop partitioning after the number of data points in a node drops below some threshold level (limiting the minimum size of a partition), or stop splitting when the gain in “purity” resulting from any additional splits drops below some threshold value. By doing these things, it is often found that the performance of the resulting decision tree will be better on new, unseen data. I wish there were a way that I could tell you what the best tree depth was, or the best minimum leaf node size, but in practice you’ll need to try various combinations and figure out what works best for your specific problem (this is a procedure called “model tuning”).
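As a concrete illustration, here is how those three stopping rules map onto scikit-learn's DecisionTreeClassifier, along with a small grid search of the kind used for model tuning. The parameter values are arbitrary examples, and X_train/y_train are assumed to be numerically encoded training features and labels (like the ones sketched earlier, after encoding the categorical columns):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

tree = DecisionTreeClassifier(
    max_depth=4,                 # stop partitioning past a certain depth
    min_samples_leaf=20,         # stop when a node would drop below a minimum size
    min_impurity_decrease=0.01,  # stop when a split's gain in purity is too small
)
tree.fit(X_train, y_train)

# Model tuning: try combinations of the settings and keep whichever generalizes best,
# here judged by 5-fold cross-validation on the training data.
search = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid={
        "max_depth": [3, 5, 10, None],
        "min_samples_leaf": [1, 10, 50],
    },
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)
```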
Conclusions
In this post I introduced a basic decision tree algorithm, explained how simple decision trees are to use, and described at a high level how the algorithm builds a decision tree from a set of training data. I used a simple toy case to demonstrate some of the concepts, but in practice, decision trees can accommodate much more data. There have been numerous enhancements and extensions of the decision tree algorithm as described above to make it more powerful, but the heart of the algorithm is largely the same. In fact, decision trees formed the algorithmic backbone of the incredible accuracy that Microsoft’s Xbox Kinect camera has at tracking and identifying body parts. You can read more about that here. I have tried to keep this post as nontechnical as possible while still getting the idea across. If I missed the mark or if you’d like to see more technical details in the future, feel free to message me directly or leave a comment.