Decision Trees Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video, we continue to learn classification techniques with the decision tree classification algorithm.

...

So there are many algorithms that fall into a class known as tree based methods, because they are based upon the humble decision tree, which we will talk about in this notebook. This notebook can be found in the classification folder of the supervised learning lectures. So in this notebook, we will introduce the decision tree algorithm, we'll define a concept known as Gini impurity, and in the process we will learn the CART algorithm, which is used to fit a decision tree in sklearn.

So the really basic idea behind a decision tree is that it is used to segment the data space in a way that allows us to classify things well. So let's look at a simple example where we've got two features, X1 and X2, and two classes, zeros and ones. So zeros are blue circles, ones are orange triangles, X1 is on the horizontal axis and X2 is on the vertical axis. If you had to come up with a rule to classify these, what might you do? Well, one reasonable idea might be to put a vertical line at X1 = 1, and then say if we're to the left of that line, we classify the observation as a zero, and if we're to the right of that line, we classify the observation as a one.

And so one way we might illustrate this to others would be to draw a decision tree, or a diagram. So here is a diagram where we take all of our data, or any new observation, and we start at the top. If your point has an X1 that is less than 1, you go to the left and classify it as a zero. If your point has an X1 that is greater than or equal to 1, you go to the right and classify it as a one. So all the training data goes in here, as does any data you want to make a prediction for. You then make the decision: is X1 less than 1? If so, you classify it as a zero; if X1 is greater than or equal to 1, you classify it as a one.

So this is a simple example, but it's really the basic idea behind a decision tree. Your decision tree will look at the data space and try to make cut points along various features, which in the end will provide you with a logic tree that you can follow to figure out how to classify any point you might see.

So we're going to show you how to fit this on our silly data here and see what classification rule sklearn's decision tree model, the DecisionTreeClassifier, comes up with. You might see here that first I'm importing tree. The reason I'm doing this is because eventually I would like to plot the tree that sklearn comes up with, and to do that we'll use the tree subpackage. But in general, if you're just fitting a decision tree, you would say from sklearn.tree import DecisionTreeClassifier. OK. And then you're going to go ahead and make a decision tree object, DecisionTreeClassifier, ... and then you fit it just like we fit models before.
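As a rough sketch of what that fitting step might look like, here is a minimal example on made-up stand-in data. The toy data, the variable name clf, and the test points are illustrative assumptions, not the notebook's actual code:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the lecture's data: class 0 clustered to the left of X1 = 1,
# class 1 clustered to the right (the notebook's actual data will differ).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.uniform([0.0, 0.0], [1.0, 2.0], size=(100, 2)),   # class 0: X1 < 1
    rng.uniform([1.0, 0.0], [2.0, 2.0], size=(100, 2)),   # class 1: X1 >= 1
])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

# Make the decision tree object and fit it, just like other sklearn models.
clf = DecisionTreeClassifier()
clf.fit(X, y)

# Predict for a couple of new points on either side of X1 = 1.
print(clf.predict([[0.5, 1.0], [1.5, 1.0]]))  # expect [0, 1]
```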
And now that we have a fitted decision tree in sklearn, note that I have stored this fitted decision tree in a variable called fig, which I can use to plot the tree with tree's plot_tree function. So in plot_tree, you put in a stored fitted decision tree, and then you can also fill the boxes with color. We'll see what we mean here. OK.

...

So here we can see the decision tree produced by sklearn. And how do we read it? There's a lot going on. Well, the box at the top here corresponds to the top of the tree in the image above. Each of these little boxes is known as a node, and they sit at various levels in the tree: this is the highest level, and this is the next level down. Each of these nodes would be considered a child of the previous layer of the decision tree. And then you'll see that in each of these nodes we have a bunch of different words, equations, and equality symbols. So what do all of those mean?

Well, any node where a decision rule is made will have the decision rule at the top. So here, the first decision rule is that we look at column zero of X, and if it's less than or equal to the cut value, which is about 1 here, you go to the left. So if that's true, you go to the left, and it looks like that node is colored orange; we'll see what that means in a second. If that is not true, if your X at column zero is greater than the cut value, you go to the right. OK.

You also see this thing here called gini. This is the impurity metric. We'll talk more about it in a second, but this is how the decision tree determines where to make its cuts; it is the measure of how impure the classification is at this step. The number of samples is the number of observations within the node. So in the top node we start off with 200 samples, because that's all of the observations, and in these two children nodes we have 100 samples each, because we've split the data exactly 50-50. And then we see something called value = [100, 100]. The value will tell you, for all of the observations within this node, what the split of the different classes is. So here we have a binary classification problem, and in our highest level node, which is called the root node, we have 100 observations of class 0 and 100 observations of class 1. After we make the split, we have 100 observations in the left child node, all 100 of them of class 0, and 100 observations in the right child node, all 100 of them of class 1. And you can notice here that our impurity measure is now zero in both children. So the goal with this impurity measure, and again we'll see what it means in a second, is to make it as small as possible with each cut that you make in the decision tree. So that's how you read these diagrams.

...

And so the big question we have left is: what is an impurity measure? Or in other words, how do we measure how wrong we are with our classification? For a decision tree, wrongness is measured in one of two ways. The first is called Gini impurity, and the second, entropy, we've looked at in the multiclass classification metrics notebook. We're going to focus mostly on Gini, because that's the default in sklearn.
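Continuing the sketch from above (the fitted tree is stored in clf here; in the lecture it is stored in a variable called fig, and the feature and class labels below are assumed for readability), the plotting step might look roughly like this:

```python
import matplotlib.pyplot as plt
from sklearn import tree

# Plot the fitted tree; filled=True colors each node by its majority class.
plt.figure(figsize=(8, 5))
tree.plot_tree(clf, filled=True, feature_names=["X1", "X2"], class_names=["0", "1"])
plt.show()
```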
So in general, we're going to have capital N target classes. In this problem we had two, because it's binary classification. And then for each class i, the Gini impurity for that class within a single node is the probability that a randomly chosen sample of class i from the node is incorrectly classified as not class i. The way each node works is that classifications are determined by the majority class, and if there's a tie, it's just a random coin flip. And so this can be calculated as G_i = p_i (1 - p_i), where p_i is the probability of randomly selecting an observation of type i, and 1 - p_i is the probability of classifying that observation as not type i. The thing that's going to change at each level of the decision tree, and within each node, is your estimates for p_i and 1 - p_i. In a given node, these estimates are just given by the fraction of all observations within that node that fall into that class. So for instance, in this node the probability of being chosen and being of class 0 is one, and the probability of being chosen and being classified as not class 0 is zero, because all the observations here are of class 0. Hence the Gini impurity here is zero.

So that is the impurity for a single class. The total impurity is just the sum of all of those. If we call the total Gini impurity I_G, you just sum up all of the G_i, and you can rework this to be I_G = 1 minus the sum from i = 1 to N of p_i squared. Entropy is the same entropy we saw in the multiclass classification metrics notebook, so we're not going to go over the derivation and the formula; it's the same as in that notebook. So which might you want to use? Both of these measures are pretty comparable. Gini impurity tends to be faster to compute because you don't have to take a log of anything, so it is usually a good default to use, and it is the default that sklearn uses.

So how does the decision tree go about making these cut points, and how does it choose where to make them? This is the CART algorithm, which stands for Classification And Regression Tree. Suppose in general that the data set we're training on has n observations with m features, and for simplicity we're only going to talk about two target classes. The extension to capital N target classes, as we used earlier in the notebook, is relatively straightforward, but we're not going to cover it. Starting with the root node, the CART algorithm searches through each feature k of the m features and finds a split point t_k that will produce the purest subsets, weighted by the number of samples. So what does this mean? Once it looks at a particular feature k, the algorithm will find a cut point t_k that minimizes the weighted sum of the impurities of the resulting left and right nodes, where each node's impurity is weighted by the fraction of the observations that land in it. So for instance, if you had 10 observations, and seven of them end up in the left node and three of them end up in the right node, the weight on the left would be 70% and the weight on the right would be 30%. And these impurities are measured using whatever metric you choose.
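To make those formulas concrete, here is a small illustrative sketch (not the notebook's code) that computes the Gini impurity of a node from its class counts, and the weighted cost of a candidate split:

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node: I_G = 1 - sum_i p_i^2,
    where p_i is the fraction of the node's samples in class i."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_cost(left_counts, right_counts):
    """Weighted impurity of a candidate split: each child's impurity is
    weighted by the fraction of the samples that land in that child."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return ((n_left / n) * gini_impurity(left_counts)
            + (n_right / n) * gini_impurity(right_counts))

# Root node from the lecture's example: 100 of class 0 and 100 of class 1.
print(gini_impurity([100, 100]))       # 0.5, the most impure a binary node can be
# A split into two perfectly pure children has cost 0.
print(split_cost([100, 0], [0, 100]))  # 0.0
```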
So the default is Gini, but it could also be entropy. Essentially, you're looking to find the cut point that minimizes this weighted impurity, which is typically done by searching over candidate cut points along that feature, for example the sorted observed values. OK. Once it has gone through all the features and found each feature's best cut point, it looks at all of the features it has searched and all the corresponding cut points, and then it finds the pair (k, t_k) that has the smallest of these costs. That is the cut point it chooses, and that's what produces the split at the root node. Then the algorithm is going to repeat this entire process on each of the children nodes it has produced. So after the root node cut, you'll have a left node and a right node, and it's essentially going to repeat the entire algorithm on that left node, repeat the entire algorithm on that right node, and then keep going. So it keeps doing this in a loop until you meet a stopping condition.

There are a number of stopping conditions you can look for. One that's pretty common is reaching a maximum depth. The decision tree above had a depth of one because it had one decision rule. If it had two decision rules, say the left node wasn't completely pure class 0 and there were some class ones in there, it would then try to split that node even further, and we would have a decision tree that is two layers deep.

...

So this maximum depth is a hyperparameter you can choose, which just says: OK, once you reach depth max_depth, say depth three, I don't want you to keep making any more cuts, you can stop there. Another option is the minimum number of samples in each node, which is min_samples_leaf. So for instance, if a split would give you fewer samples than whatever this number is, then it would say, well, I can't make that split, and stop there. Similarly, you can choose a minimum weight to be in a node (min_weight_fraction_leaf in sklearn), which is the same thing, but instead of a number of samples it's a weight, so you could set it at 0.1 of the data set, or 0.05, something like that. And there are even further options that you can look at in the DecisionTreeClassifier documentation on scikit-learn. The one that we'll usually deal with is max_depth.

Another stopping point is that you can no longer reduce the impurity. So in the example above, after one cut we had two entirely pure nodes, meaning that no other cut we could make would make them any more pure, so it stopped. You could also end up in a situation where you don't have a completely well separated data set, but you can't cut it any more and make a purer split. So that's another stopping point.

So here we're going to look at a new data set, implement this, and show you an example of using maximum depth. Here we have, I think we may have used this data before, a bunch of blue zeros up in the upper left hand side and a bunch of orange triangles, which are ones, in the bottom right hand side. And so we're going to go ahead and fit a bunch of different decision tree classifiers with varying maximum depth, from 1 to 11, and we'll see how that changes the decision boundary on this data set. A rough sketch of that experiment is shown below.
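Here is a hedged sketch of that experiment. The stand-in data (class 0 in the upper left, class 1 in the lower right, roughly separated by the line X2 = X1), the figure layout, and the choice of depths 1 through 10 are illustrative assumptions, not the notebook's exact code:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for the lecture's second data set.
rng = np.random.default_rng(216)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] > X[:, 1]).astype(int)   # class 1 below the line X2 = X1, class 0 above

fig, axes = plt.subplots(2, 5, figsize=(20, 8), sharex=True, sharey=True)
xx, yy = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))

for ax, depth in zip(axes.ravel(), range(1, 11)):
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(X, y)
    # Color the plane by predicted class to show the axis-aligned decision boundary.
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)
    ax.set_title(f"max_depth = {depth}")

plt.show()
```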
So for instance, with a maximum depth of one, so after one cut we stop, we're limited to this classifier that just chooses a vertical split. But then as we allow a deeper and deeper tree, we start to fit the data a little bit better, or one could even argue that we start to overfit the data, right? And that's the point of these controls: they tell the tree to stop before it reaches a point where it has overfit too heavily to the training data. One way you can see this is that as you keep going deeper and deeper, we start to get more and more cuts in the decision boundary that are specific to this particular data set, producing this very wonky looking staircase. OK. So typically you would find the best choice of depth with something like cross validation or a validation set, as sketched below.

So some nice advantages of decision trees are that they're interpretable, meaning you can entirely interpret the decision process, because all you have to do is walk yourself through the decision tree. Predictions are very fast, because once you have a fitted decision tree you essentially just have to put the observation through and let the machine sort everything out. And you really don't have to preprocess your data too much before you train.

Some disadvantages are greediness. The CART algorithm chooses the best feature and cut point pair at each step, right? That means that maybe the best decision tree, the one that has the best impurity at the end of the day or makes the best predictions, isn't the one you end up with, because that tree has a cut point at the top that looks suboptimal on its own, and the greedy CART algorithm is going to avoid that cut. We'll see some ways to counteract this in a later notebook, but for now, what we've presented here is a greedy algorithm. Decision trees are also very prone to overfitting, but you can try to control that a little bit, as we said, with hyperparameters like max_depth and min_samples_split. Another weakness is that the decision boundaries are orthogonal, meaning axis-aligned. So for instance, you and I might be able to look at this data set and say that the line y = x would be a pretty good boundary, but the decision tree produces these weird orthogonal cuts in its attempt to approximate that line.

...

And then finally, decision trees are sensitive, which goes back to the overfitting point. They're very sensitive to the training data, so if you take out or put in certain points, it's possible that that's going to really change your decision boundary, if those points are in the right position in the data space. One way around this is using what's known as an averaged algorithm, like a random forest, which we will talk about in a later notebook.

OK. So that is the DecisionTreeClassifier and the decision tree classification algorithm. I hope you enjoyed learning about things like Gini impurity and the CART algorithm, and I hope you enjoyed learning in general how to make a decision tree. I enjoyed having you watch this video, and I hope you have a great rest of your day. Bye.
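Following up on the cross validation comment above, here is a minimal sketch of choosing max_depth with GridSearchCV, assuming a feature array X and labels y as in the earlier sketches; the depth grid and scoring choice are illustrative:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Search over candidate depths with 5-fold cross validation.
grid = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid={"max_depth": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```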