Random Forests Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video, we're going to learn about ensemble learning's random forest model. So let's go ahead and get started. In this notebook, we're going to describe the random forest model, how it works, and how it's made. Along the way, we'll mention how you can introduce random perturbations into decision trees in different ways. We'll then extend this idea to an even more randomly perturbed model called the extra trees model. And at the end, we'll demonstrate how both random forests and extra trees can be used to determine feature importance. We'll do this in a classification setting, but it can be extended to regression as well.

So the random forest model is an ensemble model that builds many different decision trees. Here, the ensemble is an ensemble of randomly different decision trees, and the trees are made different with a variety of random perturbations. We'll talk about what that means in a little bit, but the main takeaway answers the question: why is it called a random forest? Because it's an ensemble of decision tree models that are randomly perturbed.

Let's demonstrate some of the advantages of this with a very quick synthetic data set, the one used in our decision tree notebook. We've got some blue circles (the zeros) up above and some orange triangles (the ones) down here, and we can compare the decision boundaries produced on these data by a single decision tree of maximum depth two and a random forest of trees with maximum depth two. In sklearn, you can make a random forest classifier with the RandomForestClassifier class, which is stored in the ensemble module. So from sklearn.tree we'll import DecisionTreeClassifier as our comparison point, and from sklearn.ensemble we'll import RandomForestClassifier. The decision tree is one we've seen before: we make a decision tree of max depth two. Now let's see how to make a random forest. We set rf equal to RandomForestClassifier. You have to set the number of estimators, so why don't we choose 100 as the first argument. For the second argument we can put the maximum depth, which we want to be two. You can also choose the random state, and since this is a random process, I want you to be able to see the same thing that I see, so we'll do 614. Now we'll fit the models: tree.fit(X, y) and then rf.fit(X, y). The rf fit might take a little bit, because you are training 100 decision trees, but it's pretty fast since we only have a maximum depth of two. This next cell demonstrates the different decision boundaries. On the left, we have the decision tree, which is two orthogonal cuts. On the right, we have an average of different decision trees, which gives you a boundary closer to the actual one, but we're still making mistakes over here. We could even see what happens if we change this from 100 decision trees to 1,000 decision trees. OK.
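For reference, here's a minimal sketch of the comparison just described. The data-generating code is an assumption on my part (points in the unit square labeled by which side of the line y = x they fall on, mimicking the notebook's data), and the decision-boundary plotting is omitted; the model settings, 100 depth-two trees with random state 614, are the ones from the video.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Assumed synthetic data: orange triangles (1s) below the line y = x, blue circles (0s) above
rng = np.random.default_rng(614)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 1] < X[:, 0]).astype(int)

# A single decision tree of maximum depth two: two orthogonal cuts
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

# A random forest of 100 randomly perturbed depth-two trees;
# random_state fixes the randomness so reruns give the same forest
rf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=614)
rf.fit(X, y)

print("single tree accuracy :", tree.score(X, y))
print("random forest accuracy:", rf.score(X, y))
```

Going from 100 to 1,000 trees is just a matter of changing n_estimators.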
So once we increase the number of trees, we're more likely to randomly — and we'll see what this means in just a little bit — get decision trees that make the correct decision on the points over here. You can see that this one, for instance, still fails to cut those points in, but it produces a model that will perform better in general than a single decision tree on this particular problem.

So what does sklearn do when you try to train a random forest? How are we making all these different decision trees, and what do we mean by random perturbation? You can randomly make decision trees slightly different in a couple of ways, and the main one is by randomly sampling subsets of the training data. This was our training data up above in this example. What do I mean by randomly sampling subsets? You're going to go through all of these observations and, uniformly at random, take a point — any point, each having the same probability of being chosen — write it down, and put it in the training set, and you're going to do this with replacement. You then go back, all of the original points are still there, and you randomly take another point. This means you may take the same point that you took with the first draw; that can happen. But that's the whole idea: you take a subset of this data set by choosing points from it with replacement, and you train a decision tree on that sample. The number of points you choose can either be the same size as the original data set or it can be smaller. This is called bootstrap aggregating, or bagging: bootstrapping is where you take a random sample of your training data with replacement. So when we put 1,000 up above, this random sampling was done 1,000 times, and each of the decision trees in your random forest was trained on one of — in this example — 1,000 different training sets. They're different because you're randomly sampling the data. It's still the same training data, and it still operates on the assumption that your training data comes from some distribution out in the world that you've sampled it from, but you're getting slightly different data sets from one tree to the next. The idea is that this averaging process introduces some bias in order to get away from overfitting to any one particular training set. So that's the first way.

You can also set max_samples equal to a number; that's how you control the number of points you sample from the training set. I believe max_samples is by default set to the size of the data set, but we could say we want max_samples to be 60, which gives us another slightly different model — in this example, we still didn't get those points — or we could ask, what if I want max_samples to be 80? This is just changing the number of samples that we take from our training set in this random process. Here's a quick sketch of what this bootstrap sampling looks like.
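The manual resampling at the top of this sketch is purely illustrative — it isn't sklearn's internal code — and it reuses the assumed synthetic data from the earlier sketch; the max_samples=80 forest matches the setting mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same assumed synthetic setup as before
rng = np.random.default_rng(614)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 1] < X[:, 0]).astype(int)

# One bootstrap sample: draw len(X) indices uniformly at random WITH replacement,
# so some points appear more than once and others are left out entirely.
boot_idx = rng.integers(0, len(X), size=len(X))
X_boot, y_boot = X[boot_idx], y[boot_idx]

# In RandomForestClassifier, max_samples caps how many points go into each tree's
# bootstrap sample (by default each sample is the full size of the training set).
rf_80 = RandomForestClassifier(n_estimators=1000, max_depth=2,
                               max_samples=80, random_state=614)
rf_80.fit(X, y)
```

Each of the 1,000 trees inside rf_80 sees its own 80-point sample drawn this way.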
OK — and you might also be wondering about something I haven't touched on yet: how do the predictions actually get made? In this particular example, we had 1,000 decision trees, and each of those 1,000 decision trees makes its own classification for each point. You essentially just take a vote among the 1,000 trees: how many of you think this observation should be a zero? How many of you think it should be a one? Then you take the class with the highest number of votes. So, for instance, in the region that's been shaded orange, we have more votes for an orange triangle than for a blue circle, and in the region up here shaded blue, we have more votes for a blue circle than for an orange triangle.

So that's the first way: sampling subsets of the training data. In addition, a random forest will also randomly select predictors. In the decision tree, we ran the CART algorithm, which chooses the feature and the cut point that give the greatest reduction in impurity. But a random forest, depending on the hyperparameters you choose, will sometimes also do a random selection of predictors to run the CART algorithm on. In this instance we only have two predictors, so that probably isn't happening here. But when you have tens or hundreds of features, you may want to use the hyperparameter called max_features, which limits the number of features the CART algorithm can consider: when searching for a split, the tree randomly selects a subset of features to consider. Again, this can give you suboptimal trees — not the best tree — but you're introducing this bias in order to maybe lower the variance enough to give you a better model overall.

So those are the two main ways that random forests randomize things. This is probably the most hyperparameters of any model we've considered: not only do you have all the hyperparameters of a decision tree, like max_depth or min_samples_leaf, you also have the hyperparameters specific to random forests, like max_samples and max_features, and you can see more of them by going to the documentation. You may want to come up with some principled way to choose the hyperparameters, for instance by doing cross-validation with a grid search or something like this — there's a quick sketch of that below.
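Here's a rough sketch of what such a grid search could look like. The particular grid values are illustrative assumptions, not something from the video; the model and the assumed synthetic data are carried over from the earlier sketches.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Same assumed synthetic setup as before
rng = np.random.default_rng(614)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 1] < X[:, 0]).astype(int)

# Illustrative grid over a few random forest hyperparameters
param_grid = {
    "max_depth": [2, 4, 8],
    "max_features": ["sqrt", None],    # features considered when searching for a split
    "max_samples": [0.5, 0.8, None],   # fraction (or all) of the data in each bootstrap sample
}

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=614),
    param_grid,
    cv=5,                              # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_)
```

Whatever grid you use, the point is just to compare candidate settings with cross-validation rather than guessing.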
So that's the random forest model. You can make it even more random with something called extra trees. An extra trees classifier is an even more random random forest: it's an extension of the random forest. In addition to randomly selecting points and randomly selecting a handful of features, extra trees doesn't really run the CART algorithm — it just randomly chooses cut points. For instance, if your feature is on the interval from 0 to 1 like ours, the extra trees algorithm just randomly chooses the cut point for each split. So if it decides it wants to make a cut on X1, it will randomly choose a cut point for X1 as opposed to running the CART algorithm to find the best cut point. And this is nice because, remember, in the decision tree classifier notebook we talked about how the CART algorithm is greedy, meaning you may get a suboptimal cut, which gives you a worse tree than the best tree you could have gotten. This extra trees approach of taking random cuts could maybe give you that best tree that the CART algorithm wouldn't find. Because of this randomness in choosing a cut point, it's also faster; but again, because we're just randomly choosing a cut point, it has more bias. You can't really know ahead of time which one will work better, extra trees or random forests, so in essence you have to run both of them and do a cross-validation to compare the two.

In sklearn, extra trees is the ExtraTreesClassifier, also stored in the ensemble module. So we'd say from sklearn.ensemble import ExtraTreesClassifier, and then you can make a model object, which I've called et. So et is equal to ExtraTreesClassifier, and I'm going to copy and paste the exact same arguments I had for my random forest classifier so we have an apples-to-apples comparison. The shared random state won't make the two models identical, but we'll still fit the extra trees the same way. So we'll have 1,000 trees, each with a maximum depth of two, the maximum number of samples randomly taken from the training set will be 80, and then I need to fit it on X and y. I'm also going to set bootstrap equal to True; bootstrapping is what the random sampling is called, and extra trees doesn't bootstrap by default, so this just makes sure the random sampling happens. OK, so now here are the three decision boundaries. You can see that the decision tree is still the same as above, the random forest is slightly better, and the extra trees model, while it does have some incorrect classifications of the blue points, is much closer to the actual decision boundary, the line y = x.

So random forests and extra trees are nice because they can improve upon the performance of a single decision tree, as we showed above in a very nice example that's easy to visualize. But another nice feature of a random forest is that you can use the fact that these decision trees are different from one another to get a sense of how important each feature in your data set is to making the decisions. The way it does this is it averages the impurity reductions made by cuts along each feature across all of the decision trees. The feature with the largest average impurity reduction across all decision trees is the most important according to this feature importance score, and the one with the lowest is the least important. We'll implement this on the iris data set — the iris data we've seen in previous classification videos, with four measurements of different iris flowers. I'm going to make my train-test split, and then I can fit a random forest classifier with, let's say, 500 estimators and a maximum depth of four. In sklearn, once I have fitted my model, I can do forest.feature_importances_, and this gives the four feature importance scores. We can then make a nice data frame out of this, with the column names of X_train alongside the scores.
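Here's a sketch of that feature importance computation. Loading iris through sklearn.datasets and the particulars of the split are assumptions on my part (the video works with its own copy of the data), but the model settings match the ones mentioned: 500 trees with a maximum depth of four. The extra trees fit at the end anticipates the comparison discussed next.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Iris as a DataFrame so the feature importances can be paired with column names
iris = load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=614, stratify=iris.target
)

forest = RandomForestClassifier(n_estimators=500, max_depth=4, random_state=614)
forest.fit(X_train, y_train)

# One score per column, normalized to sum to one, sorted most to least important
importances = pd.DataFrame(
    {"feature": X_train.columns, "importance": forest.feature_importances_}
).sort_values("importance", ascending=False)
print(importances)

# Extra trees exposes the exact same attribute
et = ExtraTreesClassifier(n_estimators=500, max_depth=4, random_state=614)
et.fit(X_train, y_train)
print(et.feature_importances_)
```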
Sorting from most important to least important, the most important feature here appears to be petal length, followed by petal width — which makes sense, though I believe those two are related to one another — and then sepal length and sepal width. I believe these scores are normalized so that they sum to one, so this is essentially the normalized average of all the impurity improvements across all of the decision trees — here, the 500 in the random forest. This is a nice way to make your random forests a little more explainable, to say why the model is doing better and what the important features are for making these predictions. Similarly, extra trees has the same exact feature, because it's just a more random random forest. So here I'm fitting the same data with an extra trees model and showing you that it also has this feature_importances_ attribute, and, as should be expected, the results are similar to what you get for the random forest.

So in this video, we talked about random forests as ensemble models of decision trees. You've now seen how to implement them, and you know how the ensemble is made with these random perturbations and bootstrapping. You also saw the extra trees classifier, which is an even more random random forest, and you learned about feature importances, which are nice for explaining why a model is doing something and also for feature selection. I hope you enjoyed this video. I enjoyed having you go through it with me, and I hope to see you in the next video. Have a great rest of your day. Bye.