Hi, welcome back everybody. This is day 10 of the live lectures, so after today there are only two more. Today the plan is to wrap up the notebooks on classification. We just have one left there, called decision trees, which we'll start with. After that we'll start on ensemble learning. We're going to try to get all the way through bagging and pasting, and if we manage that, maybe we'll start on boosting. So the plan today is to wrap up classification and get started on ensemble learning, which I think will feel like a natural extension: as soon as we finish decision trees, we'll do random forests. Then tomorrow I'm going to try to get through all of boosting and then voter models, and on the last day we'll do a little bit of neural network content.

The first notebook, though, is in the classification folder: notebook number 9, decision trees. You might see things called tree-based methods; this is the foundational class of algorithms for those. They're all based on the idea of a decision tree, which we're going to look at in this notebook. It's a really basic idea that will let us build relatively powerful models that are pretty straightforward.

So let's say we have this hypothetical data set of zeros and ones. Notice we're back in 0/1 land, away from -1 and 1, but again, it wouldn't really matter in terms of actually solving the problem. I'm just pointing out that the labels are weird because data science was founded and conducted by a bunch of different subfields.

Okay, so we've got these blue circles and these orange triangles, and, ignoring for a moment that we live in a world where we have computers do our work for us, we could try to think of a reasonable decision rule for these data. If we look at it and think about it for a little bit, the simplest one would be: draw a vertical line at x1 equal to 1, classify everything on the left of that line as a 0 and everything on the right of that line as a 1. That would be a pretty good classifier, at least on this training data.

And if we were to document that in a diagram (I guess diagram is the correct word; apparently when I zoom in it makes the picture smaller, so maybe I'll do this), the way we could describe our thought process is: we put all the data into this sort of filtration system, and then we check whether x1 is less than 1. If it is, boom, we classify it as a 0, following the tree down the left-hand branch. And if x1 is greater than or equal to 1, we follow the right-hand branch and classify it as a 1.
For all intents and purposes, we could have put the "equal to" on the left-hand side instead; I just arbitrarily chose the right-hand side. So this is really all a decision tree is. We'll get a little more into it as we go through the notebook, but at a surface level you're just looking at the data and then making cut points. This line, x1 equals 1, is a cut point for our decision boundary, and after the cut point you say: everything over here gets classified as one thing and everything over there gets classified as another. That's really the essence of a decision tree.

To show you that that's what's going on, we're going to use the DecisionTreeClassifier object from sklearn, out of the box. I'm importing an extra thing up here called tree, which is going to allow me to plot the result, and this is not how you spell "decision"... there we go. So we're going to first import the decision tree classifier, and later you'll see why I've imported tree as its own thing. From sklearn.tree we will import DecisionTreeClassifier.

One thing you might notice, if you're paying attention: we had a K nearest neighbors classifier, a support vector classifier, and now a decision tree classifier. That's because all of these algorithms have regression counterparts; there's a decision tree regressor, a K nearest neighbors regressor, and a support vector regressor. If you're interested in learning about those, there is a notebook in the regression folder that covers those algorithms.

Okay, so we're going to start by making our object, DecisionTreeClassifier, and for this one that's all I'm going to put in, just the defaults. Then I fit it on X and y. You'll notice I'm storing this in a thing called fig; that's because it's going to allow us to plot the resulting tree. So here we can see we have this tree, and if we look up here it's pretty similar to our diagram.

Let's break down what the tree is telling us. Up top it has the decision rule, and that's: if our 0 column is less than or equal to 1.001. That's the decision rule scikit-learn gets from this training set. And I believe if it's true we go to the left, and if it's false we go to the right.

Okay, so we're going to break this down now. All of these boxes are nodes of the decision tree. Whenever you see "samples," that's the number of observations from whatever set you fit on, your training set in this instance, that are within that node. At the very beginning you start with all of the observations, so for us it's 200, but after the split we have 100 in each node. The next thing you'll see is this argument called value.
So value gives a list of how the observations in the node are split up across classes: 100 of my observations are of class 0 and the other 100 are of class 1. On the left-hand side we can see that all 100 of the node's observations are of class 0, and on the right-hand side all 100 of the node's observations are of class 1.

Then there's this thing called gini, which we'll talk about in a little bit. It's a way to measure the purity of the node; it basically gives you a sense of how much of the node is comprised of a single class.

Now, if we had to make additional splits, we would see decision rules at the top of each of these lower nodes as well. But because the decision tree has perfectly classified the training data after this point, we don't need to make any more splits.

All right, so are there any questions about the basic idea of a decision tree, or about what we saw in the sklearn picture?

Sorry, so the decision rule is found by sklearn itself, or do we provide it?

Yes, it's found by sklearn. Here we were able to see the rule just by looking at the picture because the data are pretty simple and straightforward, but in general the best rule isn't something you're going to be able to see just by looking. This was just to show you that the thought process we went through a minute ago is the same exact rule that sklearn discovered on its own.

Okay, so it does it by figuring out the best decision by doing something with the gini values?

Yeah. I believe it's pronounced "Gini," but I never feel confident about that. Yes, it uses that wrongness measure, or impurity measure, to decide where to make the cuts, in general. There are tweaks to it that you can make, but at its base form, that's what it does.

Thanks.

Any other questions before we move on?

Yeah, just following up on that: I'm still confused how it gets the decision. Like, what is it doing? For example, yesterday we looked at something like that, but is it...?

Yeah, so it's probably confusing because we have not explicitly covered how the decision is made yet. We are going to cover that, so hold on to the edge of your seats; you're going to find out, if that's what your question is.
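Before moving on, here is a minimal sketch recapping the fit-and-plot code walked through above. It assumes X and y are the notebook's training arrays (that naming is my assumption), and it uses tree.plot_tree, which is why the tree module was imported on its own.

import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Default settings, fit on the notebook's training data (X, y assumed here)
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X, y)

# plot_tree draws the fitted tree: each node shows its decision rule,
# samples, value (class counts), and the gini impurity discussed above
fig, ax = plt.subplots(figsize=(8, 6))
tree.plot_tree(tree_clf, ax=ax)
plt.show()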
All right, so let's talk about how wrongness, or impurity, is measured, and then how that's used to produce cut points.

The way it basically works is that at each node the decision tree measures how impure the node is, and in the decision tree sense that means how much of the node is not the majority class. Here, because we have a 50/50 split, it wants to make a cut; but in these two lower nodes, because each node has entirely a single class represented, it says, okay, I've classified this perfectly, I don't need to keep going. So now we need to talk about how it measures impurity, or wrongness.

The first measure, which is the default, is the Gini impurity. We're going to suppose in general that there are capital N target classes. In this example we had a 0 and a 1, but maybe you have three classes (0, 1, 2) or four (0, 1, 2, 3); in general you can have N target classes. In each node, the Gini impurity for class i estimates the probability that a randomly chosen sample of class i is incorrectly classified as not class i. That is the probability of selecting class i, which is p_i, the proportion of observations in the node of that class (in the previous example that's 0.5 for both classes in the top node), multiplied by 1 - p_i, which is the probability of randomly being classified as one of the other classes. That's for a single class within a single node; the total Gini impurity of the node is then the sum of those Gini impurities across all classes. It turns out you can simplify that to 1 minus the sum, from class 1 up to class N, of the p_i squared.

All right, so that's one measure, and it's the default. Another measure people use is called entropy. If you went through the multi-class classification metrics notebook on your own time you saw this already, but if you didn't, we'll introduce it now. The entropy for a single class, call it H_i, is minus the probability of selecting class i (the proportion of the node's observations that are of class i) times the log of that probability. Once again, for a single node you sum up all of those H_i, and here there isn't a nice little simplification like there is for the Gini impurity.

So you might be wondering which one to use. The Gini impurity is faster because it doesn't have to compute a logarithm, just some addition and multiplication, and most of the time they're comparable. So Gini is a good default, and then if you really want to, you can always compare it against entropy and see if you get better performance. So that's what's being measured here.
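To make those two measures concrete, here is a small sketch of my own (not the notebook's code) of the Gini impurity and entropy of a node, given the node's class proportions. The function names are just illustrative.

import numpy as np

def gini(p):
    # p: class proportions in a node; Gini = 1 - sum(p_i^2)
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # entropy = -sum(p_i * log(p_i)), skipping classes with p_i = 0
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(gini([0.5, 0.5]))     # 0.5  -> the root node in the plot above
print(gini([1.0, 0.0]))     # 0.0  -> the two pure leaf nodes
print(entropy([0.5, 0.5]))  # about 0.693 with the natural log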
And if you go through it, you can work out how it gets 0.5 in the top node and 0s in the bottom nodes.

So now that we have an idea of the two impurity measures you can use, at least in the sklearn decision tree, we can learn how it actually determines that decision rule on column 0 that we saw above. sklearn uses the classification and regression tree, or CART, algorithm, and we're going to break down how it works. For simplicity we're only going to use two classes, which saves me a little bit of talking, but in general it extends to however many classes you need.

We're going to suppose we have n observations of little m features. We start with what's known as the root node, the very top node that has all of the observations, and we want to produce a cut. You want to find a feature, which I'll denote with little k, and a split point t_k, that together produce the purest subsets, weighted by the number of samples in each subset produced by the cut.

For a single cut, the cost is: the number of observations on the left-hand side of the cut, divided by the total number of observations in the node, times whatever impurity measure you're using evaluated on that left-hand side; plus the same thing for the right, that is, the number of observations on the right-hand side of the cut, divided by the total number of observations, times the resulting impurity on the right-hand side.

You want to find a k and a t_k that minimize this, and the way that gets done is basically a search. You, well, not you, your computer, will search through all of the features, and for each feature do some sort of binary search that produces the optimal t_k. Then you go through all of those and choose the one with the smallest capital J; that's the one that produces the best split according to this rule.

After it does that for the top node, it does the same thing for the subsequent nodes that are produced, and it keeps going until all of the observations in the training set are perfectly classified or until it hits a stopping condition you've provided.
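Before getting to stopping conditions, here is a rough sketch of that CART cost for one candidate split; this is my own illustration rather than the notebook's code, and it assumes class labels are nonnegative integers.

import numpy as np

def cart_cost(x_k, y, t_k):
    # Weighted Gini impurity of the two children made by splitting feature x_k at t_k:
    # J(k, t_k) = (n_left / n) * G_left + (n_right / n) * G_right
    x_k, y = np.asarray(x_k), np.asarray(y)
    left, right = y[x_k <= t_k], y[x_k > t_k]

    def node_gini(labels):
        if len(labels) == 0:
            return 0.0
        p = np.bincount(labels) / len(labels)
        return 1.0 - np.sum(p ** 2)

    n = len(y)
    return len(left) / n * node_gini(left) + len(right) / n * node_gini(right)

CART tries many thresholds for each feature and keeps the (k, t_k) pair with the smallest cost.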
Common stopping conditions are things like reaching a maximum depth. Going back to this tree again, we would say that this is a decision tree with a maximum depth of 1, because the lowest it goes is a single cut. If it made a second cut after that, it would have a depth of 2, and as it keeps making cuts and going downward, that's the depth. So you can make it so that once your decision tree reaches a certain depth, it will not continue producing cuts.

You can also say: I don't want any node made by a cut to have fewer than some number of observations. If the tree were about to make a cut that would produce nodes below that limit, it would stop. Similarly, instead of a number of samples you can use a weight: I don't want you to produce a cut whose node holds less than, say, 5% of the training set. There are other arguments you can use as stopping conditions too; you can check out the documentation.

We'll show how to use max_depth in this example, and you can check out things like min_samples_leaf and min_weight_fraction_leaf on your own time.

So we're going back to the original data set. Oh, sorry, not the original data set; I believe this is an example we looked at... actually, this might be the first time we've seen this example. We have, again, some blue circles and some orange triangles, and we're trying to classify them. We're going to set up a decision tree classifier and change the maximum depth that's allowed, from a maximum depth of 1 all the way up to a maximum depth of 10. For each, we'll make the decision tree classifier, fit it, and show the resulting decision boundaries and how the depth impacts those boundaries.

With a maximum depth of 1, our decision tree is limited to a single cut, so this is what we get. Then you can see how the decision boundary changes as it's allowed to make two cuts, then three cuts, and as it's allowed more and more cuts it gets closer and closer to perfectly fitting the training data. Once these stop really changing, that's because it has reached a point where it's perfectly classifying; well, it looks like it did keep going here, but if you kept going it would reach a point where every training observation is perfectly classified and it wouldn't make any more cuts after that.

So maximum depth is something you would find with hyperparameter tuning, to figure out what works best for your generalization error, but it's a way to control how far down the tree is able to go.
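A minimal sketch of that loop, again assuming X and y are this example's training arrays, might look like the following; the training accuracy creeps toward 1 as the tree is allowed to grow deeper.

from sklearn.tree import DecisionTreeClassifier

for depth in range(1, 11):
    tree_clf = DecisionTreeClassifier(max_depth=depth)
    tree_clf.fit(X, y)
    # Training accuracy at each allowed depth (plotting the boundary is what the notebook does)
    print(depth, tree_clf.score(X, y))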
Okay, before we talk about advantages and disadvantages, this is maybe a good time for anyone who has questions.

Wouldn't there be a chance to miss good t_k's that way?

So I believe in the original formulation you go through each of the k features and then do something like a binary search from the minimum value to the maximum value of that feature; those are your endpoints. Maybe it's not exactly a binary search, more like a bisection method: you choose the midpoint, that gives you new endpoints, you choose the midpoint again, and you keep going until you get within a certain tolerance.

Yahweh is asking, can you comment more on the parameters, how do we determine them? There are a number of ways you can prevent your decision tree from overfitting to the training data. One of those is to set the maximum depth, which is what we did here. For us this was just demonstrating what maximum depth does, not how I would go about choosing it for this problem. In general, if you're trying to find the maximum depth that works best, you would do cross-validation or use a validation set: find the generalization error on all of those splits, take the average, and see which maximum depth value gives you the best average performance in cross-validation, or the best performance on a single validation set. That's maximum depth. You can also try things like min_samples_leaf, which sets the minimum number of samples allowed in each node made by a cut, or min_weight_fraction_leaf, and again you can find the best values for these through cross-validation or a validation set approach.

Melanie is asking, is there a restriction on what types of cuts are made? Every cut is an interval cut: it looks at a single feature and says everything below the threshold goes to the left, everything above goes to the right. With a numeric feature it's like cutting an interval in two, and with a binary feature it would be: if you're a 0 go to the left, if you're a 1 go to the right, if that makes sense.

And then Yahweh is saying, sorry, I meant the I_right and I_left parameters. So what that function is saying is: for a particular feature and a particular cut point, the cut produces observations that either go to the left of the cut or to the right of it. You figure out how many go to the left, and that count over the total is the weight here; then you multiply by the impurity measure on those left-hand observations, which could be either the Gini impurity we found above or the entropy, depending on which one you're using (by default it's the Gini impurity). Then the same thing for the right, but for the observations to the right of your cut point. So you don't choose these at all; the computer is doing it in the background.
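A bare-bones version of that cross-validation search for max_depth, as a sketch: I'm assuming the training arrays are called X_train and y_train and using accuracy for simplicity.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

for depth in range(1, 11):
    # 5-fold cross-validated accuracy for each candidate maximum depth
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth),
                             X_train, y_train, cv=5, scoring="accuracy")
    print(depth, np.mean(scores))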
So what are some advantages of decision trees over some of the other methods? They're very interpretable. You've maybe heard the term "black box" algorithm: something that takes in data, does some stuff you don't really get to see, and spits out an answer. This is known as a white box algorithm, because you can print out the decision tree and perfectly trace where things are going. They're pretty fast to fit and to make predictions with. And you don't really have to preprocess your data compared to some other methods; you don't need something like a standard scaler, because one feature being on a vastly different scale from another isn't going to impact the cut points — the cut points are found for each feature independently.

Some disadvantages: it's a greedy algorithm, so it's possible that the best decision tree would require what's known as a suboptimal cut. At each step you're choosing the cut that reduces the impurity the most at that step, but it's possible that a cut that didn't reduce it the most would give a better tree overall. It's also very eager to overfit on the training data, though we can try to control that with things like max_depth. Your boundaries are always going to be orthogonal, so in two dimensions that means these right-angle cuts, but we'll see an algorithm that helps get around that a little bit. And because they overfit, they're very sensitive to the training data: if we removed some of these points close to the border, it might drastically change the fit.

Okay, so that is a decision tree for classification. Again, if you want to learn about the regression version, check out the regression notebook; maybe I'll show you where that is real quick. It's number 10, regression versions of classification algorithms. There's a pre-recorded lecture, and it goes over the regression versions of K nearest neighbors, support vector machines, and decision trees. And, another advertisement for a different notebook: if you're interested in time series, you can now do tree-based forecasting, since you know what a decision tree is.

All right, but back to this live lecture: we're now going to move on to ensemble learning and work our way through the first three notebooks with the rest of the time we have today.

So Laura's asking, when would you use a decision tree over other methods? You may want to use a decision tree if some of the assumptions of other methods aren't holding. Logistic regression has some assumptions, things like LDA and QDA have some assumptions, so a decision tree may perform better there. It also doesn't take too long to fit, so that can be a benefit.
And then, ultimately, if it gives you a fit that performs better than the other models through something like cross-validation, that's when you'd want to use it.

Okay, so what is ensemble learning? The idea behind ensemble learning is a wisdom-of-the-crowd type thing. Up to this point we've learned individual algorithms that do a fit and then produce a prediction. The idea here is that instead of having any one algorithm make the final choice, we train a bunch of slightly different algorithms and then average their predictions together, and a lot of times that will outperform any individual prediction.

This is an appeal to the idea called the wisdom of the crowd: instead of going to the smartest or most expert people to make predictions about things, it can be useful to take a wide survey of people and average their answers together in some way, and a lot of times this can outperform any single expert. There's a nice illustration of this in an NPR story about guessing the weight of a cow. They went to a fair and asked a bunch of people, some of whom raised livestock, some of whom were little kids or adults making silly guesses, to guess the weight of this cow, and they found that on average the averaged guesses were better than the individual guesses of the best people, the people who raise livestock. That's the idea we're going to try to leverage with these ensemble learning methods.

In particular, we're going to cover random forests; that will be today, and it's an extension of the decision tree. More generally, a random forest is just a particular type of bagging or pasting algorithm. We're also going to talk about boosting, which will be tomorrow, and then, also tomorrow if there's time, voter models.

All right, so let's dive right into random forests. It's called a random forest because it's many trees: just like a real-life forest is made up of many different types of trees, random forests are made up of many different decision trees. These trees are made different through a variety of random perturbations to the single decision tree that we fit in the earlier notebook today. And how is it random? Well, we're going to see.

I think we're going to rely on the same, yep, the same data set we just looked at, this one right here that we fit with the decision tree. That's what we're going to use to demonstrate random forest classification. Okay, so I just want to do a quick comparison.
We're going to build a decision tree of maximum depth 2, and then we're going to build a random forest classifier made up of, I think it's 100, yes, 100 decision trees, each of maximum depth 2. And then we'll talk about how they're fit and what the differences are.

So from sklearn.tree we're going to import DecisionTreeClassifier, and then from sklearn.ensemble we're going to import RandomForestClassifier. The first part we just saw in the earlier notebook: it makes a single decision tree of maximum depth 2. And now we're going to do the same sort of thing, making a random forest classifier out of a bunch of decision trees, all of maximum depth 2.

So we're going to do rf equals RandomForestClassifier, and then we're going to set n_estimators; this determines the number of trees in our random forest, and for us let's just put in 100. Then we're going to set the maximum depth, and for us this is going to be 2, because we want an apples-to-apples comparison. And finally, a random forest is random, so I'm going to set a random state; that's just so that if you copy my code and run it yourself, it will match. Let's do 101.

Now I'm going to fit my models: rf.fit(X, y) and tree.fit(X, y). Then here we plot the decision boundaries along with the training data. The solid points are our training data, and the decision boundaries are the shaded regions. On the left is the single decision tree, and on the right is essentially an average of many different randomly perturbed decision trees. Again, we're going to talk about that randomness in just a second, but you can see how the averaged random decision trees tend to perform a little better than a single decision tree: the forest is able to classify these guys in the corner and get a little closer to the true decision boundary, which I think is y = x here.

Yahweh is asking, could ensemble models be used for a variety of models — decision trees, regression — or is it restricted to the same type of model? You can use a variety of different models. The first one we're learning, the random forest, is a common model type, but we're going to see soon, in later notebooks, how you can really use any algorithm you want as the base for an ensemble model.
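A rough sketch of that comparison, assuming X and y are this notebook's training arrays:

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=2)
rf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=101)

tree.fit(X, y)
rf.fit(X, y)

# Both fitted models can now be used to draw the decision boundaries being compared
print(tree.score(X, y), rf.score(X, y))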
Okay, so how do we plant a random forest? How do we get these random perturbations? There are a bunch of different ways, and we're going to talk about the main ones.

The first main one is that you get a bunch of different decision trees by randomly sampling training points and then training each decision tree on its random sample. What we mean by that: here we have the number of estimators equal to 100, so we have 100 different decision trees. For that particular random forest model, 100 different times, it looked at the training data, randomly sampled from that training set with replacement, and then used that randomly sampled training set to fit a decision tree. That way we get slightly different training sets, and remember, decision trees are sensitive to the training set, so those slightly different training sets can result in very different decision trees.

This random sampling is done by default, but if for some reason you wanted to, you could turn it off in the sklearn model by setting the bootstrap argument equal to False. It's True by default, and I think it would be weird to turn it off. The idea is that if your training set has n points, the algorithm will, by default in sklearn, randomly sample n points with replacement. But you can make it sample less than the entire data set by setting the max_samples argument; so if you had 200 training points, you could randomly sample 100 of them by setting max_samples equal to 100.

For the random forest, the default for n_estimators is 100 trees. That's what we used, we just set it explicitly, but you can set it to whatever you like: 500, 1,000. Roughly, the more trees you have, the more bias you introduce into the model, because you're averaging over a greater number of estimators. This process of randomly selecting training sets with replacement is known as bagging, which we'll talk about more generally in the next notebook.

So that's the first and main way randomness gets introduced into the decision trees. The next way people (or machines) introduce randomness into the trees of a random forest is that, in addition to randomly selecting your data with replacement, you can randomly select your features. Instead of going through the CART process of choosing the best feature via that bisection-style search and comparing the features, you just randomly select the predictor; some trees will end up with a good selection, some with a not-so-good selection.
Then every time you make a cut point, you randomly select the feature being cut, but the t_k is still found through that bisection-style search. That's the second way, and you can control the number of features that are considered. You can make it so the decision tree randomly selects from all of your features, or from a smaller sample of features, which is controlled with the max_features argument. max_features here is a number, so you could say, only look at 3 features for this data set; it will randomly choose 3 features and then, each time it makes a cut point, randomly choose from those 3.

I want to do a quick aside: this is a lot of hyperparameters. We've got all the hyperparameters from a decision tree, but now we've also got hyperparameters for the random forest, like max_samples, the number of decision trees to use, and the number of features to randomly sample from. It can be a lot, so put some thought into which hyperparameters you'd like to test when you're doing a hyperparameter tuning, and work from there.

Yahweh is asking, is the replacement taking from the rest of the data or from the training set? All of this is from the training set. When you call .fit on X train, it takes the samples, with replacement, from the training set; remember, we don't touch the test set until the end. The idea is that because it's with replacement, sometimes you'll get the more extreme observations and sometimes you won't, so you're simulating getting a bunch of different training sets. And since decision trees are very sensitive to the training set, you get a bunch of different decision trees to average over.

Someone is asking how cross-validation works with these models. It works the same way; it's just that now you have a lot more hyperparameters to try out than we've had before. You could try different numbers of estimators, different numbers of features to select from; it's just a lot to try. A good thing to point out, which we won't have time to cover in live lecture, is that there's a notebook in the supervised learning section about GridSearchCV. That's a nice tool from sklearn that lets you run a cross-validation with a single line or a few lines of code, as opposed to the for loops we've been writing, so it might be worth your time to look into. I think we may also talk about it in tomorrow's problem session.
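Just as a sketch of what that GridSearchCV usage looks like: the parameter grid below is hypothetical, and X_train and y_train are again my assumed names for the training data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid of random forest hyperparameters to cross-validate over
param_grid = {"n_estimators": [100, 500],
              "max_depth": [2, 5, 10],
              "max_features": [1, 2]}

grid = GridSearchCV(RandomForestClassifier(random_state=101),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)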
Melanie asked, what does it mean to sample n points with replacement? If the training set has n points, how is a random sample of n points different from the entire training set? It's because you're doing it with replacement. Say you have 5 points; my four fingers and my thumb are the 5 points. Every time it samples, the computer randomly chooses a point, writes it down, and then puts it back in. So when it goes to sample the second point, it could randomly select that same point again, because it's been replaced, or it could randomly choose one of the other 4 points. So even though you're getting the same size of sample, it's simulating getting a different sample from the training set, because you're not always going to have every point represented; it's random. Sometimes you'll have points that are repeated, and sometimes you'll have points that aren't chosen at all.

Someone is asking, aside from doing cross-validation to choose the hyperparameters, is there a math-based approach for their selection, or am I aware of any? Because the hyperparameters are chosen independently of the fitting process, I don't know of an algorithm you can run that will always tell you the best choice for max_depth or anything like that. I think you just have to do a cross-validation search, because the best choice for things like max_depth or the number of estimators depends on data you're not training your algorithm on; it depends on the generalization error.

Okay, are there any other questions? These are good questions.

Nathan asked: I think I understand that this is done to fight overfitting; is there an advantage to bootstrapping over K-fold, other than more flexibility with sample size? So, if I understand correctly — this random sampling with replacement is also known as bootstrapping — you're asking whether there's an advantage to doing bootstrapping here rather than K-fold. I think the reason to do bootstrapping here is that with K-fold you're pretty limited in the number of splits you can do, because data sets aren't infinitely big. It would be very difficult to do, say, 100-fold K-fold on this data set to get 100 different decision trees, and then your sample sizes would be pretty small.
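To make the with-replacement idea from a moment ago concrete, here is a tiny illustration of my own (not the notebook's) of one bootstrap sample of a five-point "training set":

import numpy as np

rng = np.random.default_rng(101)
training_indices = np.arange(5)        # our 5 "points"
bootstrap_sample = rng.choice(training_indices, size=5, replace=True)
print(bootstrap_sample)                # some indices repeat, some never appear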
Someone says: if bootstrap is False, wouldn't it use the same training data each time, resulting in a hundred trees with identical accuracy? Yeah, that's why I think it would be weird for people to turn it off, but people may have a reason to. I guess if you turned it off, you would then get your randomness from the random selection of features instead of from the bootstrapping.

Speaking of randomness, there's a whole family of tree algorithms: there are decision trees, there are random forests, and then on top of that there are the even more random "extra trees." In addition to randomly selecting features, extra trees also randomly select the cut point. So instead of doing a bisection-style search to choose the best cut point for the given feature, it just randomly says, okay, use this as the cut point, and moves on. These are faster than random forests, and they have more bias; however, remember from the bias-variance trade-off that sometimes adding some bias can give you improved performance. This was a while ago, before COVID, but one of our winning projects had an extra trees classifier as its best algorithm. So even though it seems weird — not only am I randomly selecting features, I'm also randomly selecting cut points — it can still outperform something that actually searched for the best cut point.

You can do this with sklearn using the extra trees classifier, so we'll import that now: from sklearn.ensemble we're going to import ExtraTreesClassifier. I'll make my model object — I think I called this et — set my number of estimators equal to 100 and my maximum depth equal to 2, and fit it: et.fit(X, y). Not "classify"... there we go.

And then here are the different grids, so you can see the extra trees model is pretty similar to the random forest, but slightly different, particularly down here in the bottom left-hand corner and also in the upper right-hand corner. Okay, so this is just to demonstrate. I do think the random forest or the extra trees, one of them, is probably better than the decision tree, but it's not to say that the extra trees is better than the random forest. It's just to show how you get different decision boundaries by introducing different types of randomness.
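Recapping the extra trees code just typed, as a sketch with the same assumed X and y:

from sklearn.ensemble import ExtraTreesClassifier

# Same interface as RandomForestClassifier; the split thresholds are chosen at random
et = ExtraTreesClassifier(n_estimators=100, max_depth=2)
et.fit(X, y)
print(et.score(X, y))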
Okay, so another nice thing. Decision trees are a white box algorithm; you can completely look at them. You lose that a little bit with the random forest or the extra trees, but they do have a way to look at feature importances, called the feature importance score. What it does is look at each of your features and basically average the reductions in impurity made across all the decision trees whenever that feature was used to make a cut point. So it looks at a feature, sees when it was used to make cut points in each decision tree, finds the reduction in impurity made after each of those cuts, and averages that across all the decision trees.

So here I build a random forest classifier on the iris training set — iris, not Irish — and then if you do, I guess I called it forest, forest.feature_importances_, you get this array of scores. Right now it's kind of hard to tell what this means, but the scores are for sepal length, sepal width, petal length, and petal width, in that order. So I can put it in a data frame to get a nicer-looking version. What this is saying is that there were much higher reductions in impurity when the cuts were made on petal length and petal width than when the cuts were made on sepal length and sepal width. So for this random forest, it appears that petal length and petal width were more important features for determining the type of iris than sepal length and sepal width. This is a way you can try to interpret what the model is picking up on when it makes its decisions. The same thing holds for extra trees; the extra trees model also has a feature importance score.

Okay. All right, are there any questions about random forest models?

Kerthon is asking, do these total to one? I believe it gets the scores and then normalizes them at the end so that they all add up to one. Yeah.

Yeah, so I have a question. Couldn't we use these importances from the feature_importances_ as a way of selecting the features for classification?

Yeah, you could try that. You could go back and build a random forest that only looks at petal width and petal length and then see if that outperforms the original.

Yeah, I'm trying to compare this to the cross-validation we were talking about earlier.

So you would still want to do cross-validation. These scores are computed on the training set, so they don't necessarily tell you that a model built with only petal length and petal width will outperform the model built with all the features in terms of the generalization error, if that makes sense, because these scores are being computed on the training set. Yeah.
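As a sketch of that iris example (assuming the iris training arrays are called X_train and y_train, with the columns in the order just mentioned, and using pandas for the nicer display):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)

# One importance score per feature; the scores are normalized to sum to 1
importances = pd.Series(forest.feature_importances_,
                        index=["sepal length", "sepal width",
                               "petal length", "petal width"])
print(importances.sort_values(ascending=False))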
Laura is asking, do random forests usually perform better than decision trees? I would say in general a random forest will outperform a single decision tree. That's not to say you can't produce a decision tree that outperforms a random forest, but on average the random forest would probably win.

Are there any other questions about these models? All right, looks like we're making good time today.

Okay, so we're going to build on this and look at what's known as bagging and pasting. These are a generalization of the concept behind random forests. Remember, with the random forest, as we just said, a lot of the randomness comes from this random sampling of the data. Bagging and pasting are the generalization: what if we took that process and used it for other models, to get, say, a bunch of K nearest neighbors models trained on different subsets of the training set?

The main difference between a bagging model and a pasting model is how the random sampling of the training data is done. If it's done with replacement, it's called bagging. I believe it's called bagging because it comes from bootstrap aggregating: the capital B, the "ag" from aggregating, and then "ing." So bagging is when you're sampling with replacement; remember, bootstrap is synonymous with sampling with replacement. Pasting is the process of randomly selecting subsets of the data without replacement. I'll point out this is different from cross-validation, because in cross-validation you only allow yourself to take observations from, say, 4 of the 5 folds for training, whereas in pasting you're just taking a different random sample each time, so there is very likely going to be overlap between the different samples. If this were a decision tree example, each decision tree would be trained on a different set, but there would be overlap.

Someone is asking, I'm still not sure about bagging; why would we want a training set where one point can be sampled more than once? The idea behind bagging is that you're more likely to be biased away from the extreme values of the distribution, so it's a way to introduce bias into the model and lower the variance that comes from overfitting on any particular observation. Bootstrapping is a pretty common practice in statistics, where it's your way of trying to simulate going out and collecting new data. You can't collect new data, so the best you can do is take random subsets of the data you already have, and this is done with replacement, the idea being that your sample of observations is serving as a surrogate for the actual distribution out there in the world.
And maybe to help clarify further — I'm not sure, maybe I'm over-explaining — if something shows up more than once over, say, 100 samples, it's probably because it's a pretty common value for that particular feature, and so it would be more likely to show up more than once even if we went out and took another fresh random sample from the world.

So, as a summary: bagging is what we did with the random forest, taking random samples of the training set with replacement. Pasting (this is apparently my random-sample motion) is taking random samples without replacement. And it's the same exact thing you can do for any model: we did it for a decision tree to get a random forest, but in theory you could do it for any model you like — K nearest neighbors, support vector machine, logistic regression, et cetera. If you want to try it, sklearn has a general bagging classifier model. It's called BaggingClassifier, but it's used to build both bagging and pasting models, the key being whether the bootstrap argument is set to True.

So we're going to come back once again to this exact same data set: we've got these blue circles, these orange triangles, and this dotted line representing the true decision boundary. First we're going to import the bagging model, so from sklearn.ensemble we'll import BaggingClassifier. Then we're also going to import the other thing we need; we're going to use this with K nearest neighbors, so from sklearn.neighbors we're going to import... nearest neighbors? Is that what I want to call it? Is it KNeighborsClassifier?

Yeah, I think it's KNeighborsClassifier.

All right, there we go. Thank you.

So we're going to build a bagging version of this and then a pasting version, to give you a sense of how you do it, and then we can compare the performance on this particular data. To build the bagging one, you put in BaggingClassifier, and the first thing you want to put in is the estimator. The estimator should be the model class you want to use, so we're going to do KNeighborsClassifier, and for simplicity — just as with decision trees in a random forest you might try different depths, here you might want to try different numbers of neighbors — we're going to say 4. I guess I put 4.
And then the number of estimators — this is the number of KNN models we would fit. 00:54:53.000 --> 00:54:57.000 So n_estimators, and let's say 50. 00:54:57.000 --> 00:55:03.000 I don't know, this is just to demonstrate; I'm not claiming this is the right number. 00:55:03.000 --> 00:55:12.000 Okay. Then we want to set max_samples. This is the maximum number of observations that we're going to sample whenever we do the bagging. 00:55:12.000 --> 00:55:30.000 And so how many observations do we have? 200. So let's just say 100 samples, and then, just to demonstrate, we'll set bootstrap — and because it's bagging we want bootstrap to be equal to True — and then a random_state, so if you run it, it's the same as if I run 00:55:30.000 --> 00:55:38.000 it. Let's do 345. Okay. So that's the bagging classifier version. 00:55:38.000 --> 00:55:44.000 It's gonna be a little confusing because it's the same sklearn class, but now we're gonna make a pasting classifier. 00:55:44.000 --> 00:55:59.000 So, yeah, BaggingClassifier, estimator equals KNeighborsClassifier(4), 00:55:59.000 --> 00:56:15.000 n_estimators will be the same, 50, and then — I guess we can just copy and paste this and change the last two arguments. 00:56:15.000 --> 00:56:22.000 There we go. Okay, and so for the pasting, bootstrap is set equal to False. 00:56:22.000 --> 00:56:30.000 That means we're doing it without replacement. And then I'm gonna use a different random state, so let's do, I don't know, 00:56:30.000 --> 00:56:45.000 95. And then we're going to compare it to just a single K nearest neighbors classifier. 00:56:45.000 --> 00:56:49.000 And so here we can see that, in this case, the bagging and the pasting did worse on the training set. 00:56:49.000 --> 00:57:02.000 That happens, but it's just to demonstrate the process of how you build a bagging model and how you build a pasting model. 00:57:02.000 --> 00:57:09.000 So that was with this K nearest neighbors setup. You could also try seeing how it changes if you choose a different number for max_samples. 00:57:09.000 --> 00:57:17.000 So what if we did 150 instead? 00:57:17.000 --> 00:57:23.000 Okay, and you can see how that changes it, and now we're even with or outperforming the single model. 00:57:23.000 --> 00:57:31.000 But again, remember it's not important how we do on a single training set; we would want to do something like cross-validation to see if it's actually better. 00:57:31.000 --> 00:57:36.000 Again, this is just to demonstrate. I'm not saying, okay, you always want to use the bagger, 00:57:36.000 --> 00:57:47.000 or you always want to use the paster. It's just to demonstrate that it's a model you can now have in your tool chest. 00:57:47.000 --> 00:58:05.000 So Sanjay asked why I use a different random state. I just think it's good practice to not always use the exact same random state, because it may produce this sort of weird behavior, like maybe that random state always 00:58:05.000 --> 00:58:15.000 makes the sample that somehow gives you the best performance. So, 00:58:15.000 --> 00:58:21.000 not that my choosing of these numbers is random in any way, but, you know, it's fine.
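Roughly what gets built here, as a sketch: X and y are assumed to be the training features and labels already defined in the notebook, and the hyperparameter values are just the ones spoken above, not a recommendation. Note that in scikit-learn versions before 1.2 the keyword is base_estimator rather than estimator, which comes up in a question a bit later.

```python
# Sketch of the bagging / pasting / single-model comparison, assuming X and y
# are the training features and labels from the notebook.
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Bagging: 50 KNN models, each trained on 100 observations drawn WITH replacement
bag = BaggingClassifier(estimator=KNeighborsClassifier(4),
                        n_estimators=50,
                        max_samples=100,
                        bootstrap=True,
                        random_state=345)

# Pasting: same setup, but the 100 observations are drawn WITHOUT replacement
paste = BaggingClassifier(estimator=KNeighborsClassifier(4),
                          n_estimators=50,
                          max_samples=100,
                          bootstrap=False,
                          random_state=95)

# A single KNN model for comparison
knn = KNeighborsClassifier(4)

for name, model in [("bagging", bag), ("pasting", paste), ("single knn", knn)]:
    model.fit(X, y)
    print(name, model.score(X, y))   # training accuracy only, just for demonstration
```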
00:58:21.000 --> 00:58:26.000 So Sanjay is then saying the comparison is not a proper one, but remember, we're not really making that comparison here. 00:58:26.000 --> 00:58:36.000 These would just be two completely different models, and ultimately the comparison we care about is in cross-validation or on a validation set, right? 00:58:36.000 --> 00:58:53.000 So we would then say, okay, maybe in cross-validation or on the validation set this pasting model performs better than the bagging classifier with this random state, and then that's our model, and that's what we care about. 00:58:53.000 --> 00:58:59.000 This is, again, just showing how to fit the models. 00:58:59.000 --> 00:59:02.000 I'm not saying that this pasting model is better than the bagging one. 00:59:02.000 --> 00:59:03.000 I'm not saying one is better than the other on the training set, which I ultimately don't care about. 00:59:03.000 --> 00:59:11.000 If I were then to go do a cross-validation, it's okay, because they're different. They are different models. 00:59:11.000 --> 00:59:18.000 They're not the same model, right? So it's okay if the training process is different, because that's the idea — they're different models. 00:59:18.000 --> 00:59:28.000 Melanie is asking: so if I'm understanding this correctly, you are just introducing different ways to use training data to train any type of model. 00:59:28.000 --> 00:59:34.000 Right, so this notebook is really about the fact that what we did with the random forest is not something that's unique to a decision tree. 00:59:34.000 --> 00:59:46.000 It's a process that you can use for any model that you might like. We did it with K nearest neighbors here, but we could do it with a support vector machine. 00:59:46.000 --> 01:00:09.000 We could do it with a logistic regression, whatever we would like. And so it's just demonstrating that one way to get an ensemble of algorithms is to take random samples of the training data, either with replacement, which is bagging, 01:00:09.000 --> 01:00:15.000 or without replacement, which is pasting. So that's the idea here. 01:00:15.000 --> 01:00:18.000 Now you know how to do that. I'm not sure how common it is out in the world to say, okay, 01:00:18.000 --> 01:00:36.000 let's try a bagging classifier version of this versus a pasting classifier version, but now you know how to do it; it's another tool in your tool chest. 01:00:36.000 --> 01:00:45.000 So why use a bagging or pasting model? Bagging or pasting, as I said earlier in the lecture, introduces bias into the model through this random selection. 01:00:45.000 --> 01:01:02.000 This is because of the extremes in your training set. Maybe you have a pesky outlier that seems to be ruining your model because the model is always trying to overfit on it. By doing this random sampling process, some of your subsets 01:01:02.000 --> 01:01:07.000 will have the outlier and some of your subsets won't have the outlier. And so this is why it might be useful to introduce this process.
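A minimal sketch of the comparison that actually matters here — cross-validation accuracy rather than training-set accuracy — reusing the bag, paste, and knn objects from the sketch above:

```python
# Compare the three models with 5-fold cross-validation instead of training accuracy.
from sklearn.model_selection import cross_val_score

for name, model in [("bagging", bag), ("pasting", paste), ("single knn", knn)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(name, scores.mean())
```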
01:01:07.000 --> 01:01:19.000 Because, through the averaging process, if this is truly an outlier, it probably won't show up very often in your randomly selected subsets. 01:01:19.000 --> 01:01:28.000 So your models will be trained a little bit better than if we were always using the data set that has that outlier. 01:01:28.000 --> 01:01:32.000 And so I think here I'm just making too big of a picture — I forgot to change the figure size. 01:01:32.000 --> 01:01:42.000 Let's do 6 comma 6. So here I think I've introduced a couple of orange triangles on the wrong side and blue circles on the other side. 01:01:42.000 --> 01:02:03.000 And then this is showing — well, let's hope it works — showing how the K nearest neighbors model maybe tries to overfit on these 2 blue circles being in there, but both the bagger and the paster ignore the fact that they're there, because sometimes in the random subset 01:02:03.000 --> 01:02:10.000 they get included and other times they don't. Whereas, I guess, these orange points here are too deep in the blue — 01:02:10.000 --> 01:02:19.000 too deep in the blue to be sampled out. 01:02:19.000 --> 01:02:24.000 Okay, so in general, why would you want to use bagging or pasting? 01:02:24.000 --> 01:02:33.000 In general, I think people use bagging as the default because they prefer to have the sampling with replacement, and pasting doesn't get used as much. 01:02:33.000 --> 01:02:52.000 I think the main reason is sample size. In order to be effective, pasting needs a really large data set, because with replacement you're getting closer to this idea of sampling from the true population that we talked about earlier with the question, and with 01:02:52.000 --> 01:03:04.000 pasting that kind of goes away: if you don't sample with replacement, you're more likely to choose those outliers once they're some of the only observations left. 01:03:04.000 --> 01:03:15.000 And then this is just pointing out that there are regression versions that you can use as well. There's a BaggingRegressor, where you could use things like linear regression or K nearest neighbors regression as your base estimator. 01:03:15.000 --> 01:03:23.000 And then here's a nice reference on bagging predictors that you can look at on your own. 01:03:23.000 --> 01:03:29.000 Okay. So, before we move on, are there any other questions about anything we've done so far today? 01:03:29.000 --> 01:03:43.000 We got through what I thought we would be able to get through, and so I wanna make sure we're clear on everything. 01:03:43.000 --> 01:03:48.000 So Jenya is saying, yeah, thank you, Keith. 01:03:48.000 --> 01:03:57.000 And then: I'm getting the error __init__() got an unexpected keyword argument 'estimator'. 01:03:57.000 --> 01:04:05.000 So that's probably because you have a different version of sklearn than I have, 01:04:05.000 --> 01:04:06.000 and hopefully Rohan's suggestion is able to help you with that error.
01:04:06.000 --> 01:04:15.000 So Rohan's suggestion was: if you're getting this error, try using base_estimator instead of estimator. 01:04:15.000 --> 01:04:24.000 Okay, are there any other questions? 01:04:24.000 --> 01:04:31.000 Okay. 01:04:31.000 --> 01:04:40.000 So we will start with boosting and we'll see how far we get. Oh, you know what? 01:04:40.000 --> 01:04:51.000 What might actually be better is to use this as an opportunity to finish the PCA stuff that we didn't have a chance to do yesterday, 01:04:51.000 --> 01:04:56.000 just to make sure it's covered today. 01:04:56.000 --> 01:05:03.000 Especially because there are some things that popped up during the problem session that I want to make sure are really clear for everybody. 01:05:03.000 --> 01:05:08.000 Okay. 01:05:08.000 --> 01:05:22.000 So the first thing I want to clear up — and it showed up because it seemed like a lot of people were a little bit confused when today's problem session asked you to subset the data based on y equals 0 and y equals 1. 01:05:22.000 --> 01:05:29.000 So let's imagine that we have these observations that we looked at yesterday. 01:05:29.000 --> 01:05:35.000 The idea behind PCA is that you want to take the data and then transform it into a different space. 01:05:35.000 --> 01:05:43.000 So here the reference is these features x1 and x2. PCA is essentially taking these features, 01:05:43.000 --> 01:05:54.000 combining them linearly, and then giving you a new set of features. But the key point to remember is that each observation is still the same observation. 01:05:54.000 --> 01:06:03.000 You're just looking at different variables. So for instance — I think I have a picture in here somewhere. 01:06:03.000 --> 01:06:09.000 So here's this picture where we've got this red X. This is one of our observations. 01:06:09.000 --> 01:06:13.000 And this is what it looks like when we have x1 and x2 as the features. But then, when we plot the PCA-transformed version, the red X is here. 01:06:13.000 --> 01:06:29.000 Now this is in the PCA space. So here the horizontal axis maybe should say 01:06:29.000 --> 01:06:36.000 PCA value 1, and the vertical one PCA value 2. And it's the same point, though. 01:06:36.000 --> 01:06:43.000 And so what this means, essentially, in terms of what you're interested in — 01:06:43.000 --> 01:06:53.000 I think I called this fit, right? Yes. Okay, so the rows of 01:06:53.000 --> 01:07:01.000 fit, which is what I called the PCA-transformed data, 01:07:01.000 --> 01:07:12.000 are the same observations as the rows of X. So X at 0 — here's X at 0 — 01:07:12.000 --> 01:07:25.000 represents the same observation as fit at 0. Okay, so these are just two different variables — 01:07:25.000 --> 01:07:37.000 well, I guess four different variables: these are the original x1 and x2, and these are PCA 1 and PCA 2, but they're for the same observation. 01:07:37.000 --> 01:07:45.000 These represent the same observation. 01:07:45.000 --> 01:07:59.000 So if, like in today's problem session, these points were colored according to some value — maybe some are colored blue and some are colored orange according to y equals 1 and y equals 0 — 01:07:59.000 --> 01:08:17.000 then those colors would follow these points down here.
So subsetting the rows where y equals 1 up here — you would do the same exact process to get the rows where y equals 1 in the PCA version. 01:08:17.000 --> 01:08:29.000 Okay. Alright, so are there any questions about that part before we continue on with the stuff that we didn't have time to finish yesterday? 01:08:29.000 --> 01:08:31.000 Yeah, a question. 01:08:31.000 --> 01:08:32.000 Yeah. 01:08:32.000 --> 01:08:35.000 [Participant] So this is looking like we're doing kind of a coordinate transformation. 01:08:35.000 --> 01:08:47.000 Would that be one way to think about it? But I'm curious, once we do this transformation, is there a one-to-one...? 01:08:47.000 --> 01:08:53.000 So in today's example there were not just 2, there were like 5 or 6 features, and then you transform them. 01:08:53.000 --> 01:09:02.000 I'm curious, when you go to PCA space, is there a one-to-one relationship, like PCA factor 0 is maybe compactness or something? 01:09:02.000 --> 01:09:09.000 Is there some interpretation, or is it a linear combination of the features? 01:09:09.000 --> 01:09:10.000 Yeah, so — 01:09:10.000 --> 01:09:13.000 okay. 01:09:13.000 --> 01:09:21.000 So remember, in PCA we're trying to find a vector w of weights with a norm of one 01:09:21.000 --> 01:09:29.000 such that the linear combination — the weighted sum of the features — has maximal variance. 01:09:29.000 --> 01:09:37.000 So let's say, for example — I don't remember how many features we had in the problem session — 01:09:37.000 --> 01:09:42.000 let's say we had 5. So we would have x1 and x2 all the way up to x5. 01:09:42.000 --> 01:09:51.000 And so then the first PCA direction would be whatever the weight vector for the first component is. 01:09:51.000 --> 01:10:13.000 So the first principal component value — sorry, using the first component vector, the vector form of w1 — would be w1,1·x1 + w1,2·x2 + ... + w1,5·x5, right? Does that make sense? 01:10:13.000 --> 01:10:21.000 And so that's how it relates back to the original data. So even if you have 5 dimensions, you can still get 01:10:21.000 --> 01:10:27.000 fewer than 5 PCA values. You can get as many as you want; 01:10:27.000 --> 01:10:31.000 it just depends on how many of these directions you're projecting onto. 01:10:31.000 --> 01:10:37.000 [Participant] If you go from 5 to 5, I presume that you'd have a bit of each of the components, 01:10:37.000 --> 01:10:41.000 and the weights would just kind of be... Okay. 01:10:41.000 --> 01:10:51.000 Yeah, and so basically the first one is — I don't know, I feel like I'm repeating myself, 01:10:51.000 --> 01:10:58.000 so maybe I'm not answering your question — but the first one is the direction in the original data with the biggest spread. 01:10:58.000 --> 01:10:59.000 And then for the next one you search all the orthogonal directions in the original data for the next biggest spread, and you just keep going. 01:10:59.000 --> 01:11:09.000 And what tends to happen is that a lot of times your data will have features that are correlated with one another. 01:11:09.000 --> 01:11:15.000 And so different component vectors, different w's, tend to pick up on different features. And that's gonna come up later in the notebook today.
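A small sketch of the two points above, with hypothetical X (features) and y (labels) arrays: the rows of the transformed data line up one-to-one with the rows of X, and each new feature is a weighted sum of the original features.

```python
# Sketch with hypothetical X and y: the PCA-transformed rows match the rows of X.
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
fit = pca.fit_transform(X)      # fit[i] is the same observation as X[i]

# so subsetting by label works the same way before and after the transform
X_class1, fit_class1 = X[y == 1], fit[y == 1]

# pca.components_[0] is the first weight vector w_1; because sklearn centers the
# data, the first PCA value of each row is (X - pca.mean_) @ pca.components_[0]
print(pca.components_)
print(np.allclose(fit[:, 0], (X - pca.mean_) @ pca.components_[0]))
```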
01:11:15.000 --> 01:11:28.000 That is helpful in determining what the first PCA component is picking up on, what the second PCA component is picking up on. 01:11:28.000 --> 01:11:30.000 And so — 01:11:30.000 --> 01:11:34.000 and — 01:11:34.000 --> 01:11:45.000 okay, so picking up on, I think, Sanjay's comment: rotating the physical axes to different axes. 01:11:45.000 --> 01:11:56.000 So in the sense of, when you're doing a PCA where the number of components is the same as the number of features, 01:11:56.000 --> 01:12:01.000 it is more or less just rotating it. I mean, that is all we really did here, right? 01:12:01.000 --> 01:12:10.000 So this was the original data, and you can kind of think of this as: we rotated it, and then, I guess, because our w2 is pointing downward here, 01:12:10.000 --> 01:12:39.000 we also had to flip it. Does that make sense? 01:12:39.000 --> 01:12:49.000 If you have 5 dimensions and you project down to 2, you're not going to have all of the information that was present in the original dimensions. 01:12:49.000 --> 01:13:04.000 But you're doing it in a way that, at least statistically, is capturing as much of the variance as possible — and in statistical terms, variance is sort of what we think of as information; the variance contains information about the distribution. 01:13:04.000 --> 01:13:22.000 So we're capturing as much of the variance as possible through this process. You're maybe losing some data, but it's possible that you're losing information that wasn't actually helping you make classifications, which 01:13:22.000 --> 01:13:29.000 is why it can be useful as a preprocessing step. 01:13:29.000 --> 01:13:41.000 Okay. So then, after explaining this, we talked about explained variance. The explained variance is giving you 01:13:41.000 --> 01:13:48.000 the variance from the original data set that's being explained by each dimension. So the variance along w1 is 80.45, 01:13:48.000 --> 01:13:59.000 and the variance along w2 is 4.25. What becomes more useful in determining a number of components is what's known as the explained variance ratio. 01:13:59.000 --> 01:14:13.000 This takes those explained variances and just divides them by the total variance of the data. 01:14:13.000 --> 01:14:26.000 In this case, because we have 2 dimensions and 2 components, these add up to one. But in general you'll be looking at fewer components than the number of features you have, 01:14:26.000 --> 01:14:47.000 and so the total sum will typically be less than one. Here we can see the cumulative explained variance ratio, and because there are only 2 components it goes from 0.95 to 1. The example that we then looked at was these Faces in the Wild images — and, you know, this was a 01:14:47.000 --> 01:14:53.000 very big rush at the end. These are pictures of various famous people's faces, 01:14:53.000 --> 01:15:12.000 each an 87 by 65 grayscale image. So each feature here is a pixel with a value from 0 to 255, 0 being complete black, 255 being complete white, and anything in between being some shade of gray.
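As a sketch, continuing with the hypothetical pca object fit above, these quantities live directly on the fitted PCA:

```python
# Explained-variance quantities on a fitted sklearn PCA object.
import numpy as np

print(pca.explained_variance_)        # variance along each component direction
print(pca.explained_variance_ratio_)  # each divided by the total variance

# cumulative explained variance ratio: what the explained variance curve plots
# against the number of components
print(np.cumsum(pca.explained_variance_ratio_))
```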
01:15:12.000 --> 01:15:22.000 And so this is a lot of data. If we wanted to try to build a classifier using this data to classify these people, 01:15:22.000 --> 01:15:30.000 it's a lot of data to try to fit an algorithm with. And so it might be useful to reduce the number of features through PCA. 01:15:30.000 --> 01:15:56.000 The question then becomes, well, how do I choose the number of components? You can run it through PCA and then look at what's known as the explained variance curve, which we will see in a second after the PCA is done being fit. 01:15:56.000 --> 01:16:05.000 So you can look at the explained variance curve, which plots the cumulative explained variance ratio against the number of PCA components. 01:16:05.000 --> 01:16:10.000 And typically what you'll try to look for is something called the elbow of the plot. 01:16:10.000 --> 01:16:17.000 The elbow is essentially where you start to see diminishing returns in explained variance from adding additional principal components. 01:16:17.000 --> 01:16:36.000 For us, that maybe occurs somewhere between 100 and 150. So you could look at it that way, or you could also say, I want to have at least 95% of the original variance, 01:16:36.000 --> 01:16:41.000 and then you can just put that in as the number of components and it will determine that for you 01:16:41.000 --> 01:16:52.000 without you having to fit all the different numbers of components. This will take a little bit to run, but when it's done running, I think it said it was something like 200 — 01:16:52.000 --> 01:17:00.000 yeah, 205 components captures 95% of the variance. 01:17:00.000 --> 01:17:20.000 So that's just covering everything we covered yesterday. Are there any other questions about the stuff we covered yesterday before I move on to new stuff? 01:17:20.000 --> 01:17:21.000 Okay. Oh. 01:17:21.000 --> 01:17:30.000 [Participant] So after using PCA on these images, can you re-plot the images again, and they'll be kind of blurred out, or...? 01:17:30.000 --> 01:17:38.000 Yeah, so this is an awesome question, because that's exactly what we're gonna do now. 01:17:38.000 --> 01:17:39.000 Perfect. 01:17:39.000 --> 01:17:45.000 Good foresight. So yeah, one way to think about PCA is as a data compression algorithm. 01:17:45.000 --> 01:17:55.000 So remember, when you have 2 vectors and — okay, 01:17:55.000 --> 01:17:56.000 let's start over. 01:17:56.000 --> 01:18:09.000 Suppose you have 2 perpendicular vectors u and v in R2. Then you can break any vector x in R2 down into the projection of x onto u plus the projection of x onto v. 01:18:09.000 --> 01:18:15.000 Remember, u and v here, I'm assuming, are perpendicular. And you can do the same thing in higher dimensions. 01:18:15.000 --> 01:18:24.000 Because PCA — assuming you have enough observations — gives you a full set 01:18:24.000 --> 01:18:29.000 of component vectors which are all orthogonal to one another, you can try to recreate the original observation by taking the sum of the projections onto the different w's.
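Writing that reconstruction idea out as a formula, in my own notation (with the feature means written as the centering term, which is how sklearn's PCA handles it, and w_1 through w_k the orthogonal component vectors):

$$
\hat{x} \;=\; \bar{x} \;+\; \sum_{j=1}^{k} \big((x - \bar{x}) \cdot w_j\big)\, w_j
$$

When k equals the number of features this recovers x exactly; for smaller k it gives the compressed approximation described next.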
01:18:29.000 --> 01:18:54.000 And so essentially you can think of getting an estimate of the original observation by taking the sum of the projections onto the different component vectors. 01:18:54.000 --> 01:19:04.000 Okay. And luckily, with these images, like Jacob was saying, you can actually do this, so I'm gonna fit the PCA once again. 01:19:04.000 --> 01:19:07.000 I'm gonna do this, and then we can go through and actually re-plot compressions of the images for different numbers of components. 01:19:07.000 --> 01:19:24.000 This will take a little bit, but once it's done running we can see, okay, for 10 components, 50 components, 100 components, and so on, the recreations of the original faces. 01:19:24.000 --> 01:19:36.000 And so here we have the quality of image we're getting with a 60.82% cumulative explained variance ratio, and so forth. 01:19:36.000 --> 01:19:51.000 And you can kind of see, all right, if I tried to run an algorithm using this 60.82% version, it's maybe capturing the existence of eye sockets and a nose and where the mouth is, but really not much else. 01:19:51.000 --> 01:19:52.000 And then you can see how you start to get improved recreations of the photos 01:19:52.000 --> 01:20:07.000 with higher explained variance. And so in this particular example we can try to recreate the data with this lower-resolution version from the PCA and actually see it. 01:20:07.000 --> 01:20:26.000 This gives us maybe some intuition — we can't plot it like this for the pumpkin seeds data that you worked on in the problem session, but it gives you a sense of what's going on: we're sort of getting a fuzzy picture of the original data, if that makes sense, and having more components gives you a 01:20:26.000 --> 01:20:33.000 better recreation. 01:20:33.000 --> 01:20:37.000 So the last thing we'll talk about today is how you can interpret PCA. I think this goes back to Walled's question. 01:20:37.000 --> 01:20:49.000 We're looking at a different data set — we went from synthetic data to faces to basketball, so this notebook has a lot of different data sets. 01:20:49.000 --> 01:20:55.000 So in basketball — this is the court, this is like half court, 01:20:55.000 --> 01:21:00.000 and this is where the basket is, underneath here. So you could, and the NBA does this, 01:21:00.000 --> 01:21:05.000 break down the basketball court according to these different regions. 01:21:05.000 --> 01:21:07.000 The NBA does this: 01:21:07.000 --> 01:21:10.000 what the NBA will do is use a computer to take the x-y position of each shot and put it into these different regions. 01:21:10.000 --> 01:21:20.000 And then you can look at who's shooting well from different regions. So here's an example where, years ago, for a project, 01:21:20.000 --> 01:21:31.000 I broke this down for different teams from 2000 to 2018. And in this example we are looking at just two seasons, 2000-2001 and 2018-2019, 01:21:31.000 --> 01:21:49.000 and then looking at the percentage of those teams' attempts from those different regions.
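A sketch of the compression-and-replot step being described, assuming faces is the (n_images, 87*65) pixel array from the notebook (that variable name is mine):

```python
# Compress the images with k principal components, then map back to pixel space
# with inverse_transform and re-plot the (blurrier) reconstructions.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

for k in [10, 50, 100]:
    pca = PCA(n_components=k).fit(faces)
    compressed = pca.transform(faces)                  # k values per image
    reconstructed = pca.inverse_transform(compressed)  # back to 87*65 pixels

    plt.imshow(reconstructed[0].reshape(87, 65), cmap="gray")
    plt.title(f"{k} components, "
              f"{pca.explained_variance_ratio_.sum():.2%} explained variance")
    plt.show()
```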
So for instance, in the 2000-2001 season the Atlanta Hawks took 27% of their shots from area one, and in comparison, in 2018- 01:21:49.000 --> 01:22:04.000 2019 they took 36% of their shots from there. Okay, and so one thing I did in this project, which I have a link to at the end, was run this data through PCA to try and see if I could see trends over time. 01:22:04.000 --> 01:22:12.000 And so you get something kind of like this for the whole data set, and this is what it looks like for just the smaller data set. 01:22:12.000 --> 01:22:21.000 Yeah, and what's sort of interesting is, it turns out these are the 2000-2001 teams, and the 2018-2019 ones tend to be over here. 01:22:21.000 --> 01:22:28.000 And so a natural question is, well, why are these two shot distributions — 01:22:28.000 --> 01:22:44.000 why are the shot distributions of teams in the year 2000 that different from the shot distributions in the year 2018? If you follow basketball you might be able to guess, but the idea is we can interpret what's going on by looking at the different weights from the component vectors. 01:22:44.000 --> 01:22:53.000 Remember, when we are fitting PCA we want to find a vector w, with norm equal to one, that maximizes the variance of the projection. 01:22:53.000 --> 01:23:02.000 And so if we look at the w's — the individual w values within the vectors, so w1, w2, all the way to wm — 01:23:02.000 --> 01:23:12.000 this allows us to see which features of the data set, which columns of our matrix X, are being more heavily weighted in the different directions. 01:23:12.000 --> 01:23:20.000 So here we can look at the principal components sorted by 01:23:20.000 --> 01:23:32.000 the values of the w's. So this is the w1 component vector, this is the w2 component vector, and the row labels are the different regions of the court that they correspond to. 01:23:32.000 --> 01:23:41.000 And what you can look at is which features tend to be the most negative and which features tend to be the most positive. 01:23:41.000 --> 01:23:57.000 And that tends to indicate, for instance, that the first principal component here is picking up on team shooting from zones 6, 5, 7, and 8 versus team shooting from zones like 11, 10, 14, 13, and 12. 01:23:57.000 --> 01:24:11.000 Now this is a little bit hard to interpret without the picture, but luckily a former participant, Patrick Valley, made these nice images for me, where they took the court and colored it according to the component values. 01:24:11.000 --> 01:24:17.000 So as you can see, zones 10, 12, 14, 13, and 11 — remember, those were our most positive here. 01:24:17.000 --> 01:24:27.000 If you know a little bit about basketball, these are where, if you shoot the ball, it's worth 3 points if you make it. And then the most purple parts are where we were most negative — 01:24:27.000 --> 01:24:36.000 so zones 6, 5, 7, and 8. This is what's known as the mid-range. 01:24:36.000 --> 01:24:42.000 And so it turns out, over time, what's happening is these teams over here on the left have higher values 01:24:42.000 --> 01:24:47.000 on those mid-range components. So they're over here on the left: teams in the year 2000 tended to shoot from there more. 01:24:47.000 --> 01:25:00.000 Whereas over time, teams started to realize that 3 points are worth more than 2 points if you make the basket.
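Before continuing with the trend itself, here is a sketch of the kind of loadings table just described; the names shot_pca and region_labels are hypothetical stand-ins for whatever the notebook calls the fitted PCA and the court-region columns:

```python
# Build a table of the first two component vectors, labeled by court region,
# and sort it to see which regions are weighted most negatively / positively.
import pandas as pd

loadings = pd.DataFrame(shot_pca.components_[:2].T,
                        index=region_labels,
                        columns=["w_1", "w_2"])

print(loadings.sort_values("w_1"))
```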
01:25:00.000 --> 01:25:05.000 And so teams in the 2018-2019 season tended to shoot from 3 more, and that's why they're over here on the right-hand side. 01:25:05.000 --> 01:25:16.000 Remember, the more positive values corresponded to these orangish regions, and the more negative values corresponded to these purple regions. 01:25:16.000 --> 01:25:23.000 So that's a way that you can interpret PCA in this particular problem, but it works in general. 01:25:23.000 --> 01:25:40.000 For instance, for that pumpkin seed data that you're looking at in the problem session, you could make a data frame just like this that looks at the weights on the different features, and then try to use it to interpret, okay, why does one observation tend to be on the 01:25:40.000 --> 01:25:48.000 left-hand side of the first PCA value versus the right-hand side of the first PCA value. 01:25:48.000 --> 01:25:54.000 So Melanie is asking, how do you interpret the second principal component in this case, or is it much less interesting? 01:25:54.000 --> 01:26:07.000 The way I interpreted it was as taking your shots in the paint versus not, but it's not as interpretable as the very clear first one. 01:26:07.000 --> 01:26:18.000 So not every component is always gonna be super interpretable. The first one had a really easy meaning that you can see very clearly if you're someone who watches basketball, whereas the second one is sort of just picking out taking shots 01:26:18.000 --> 01:26:25.000 in the paint. 01:26:25.000 --> 01:26:33.000 Yeah. If you're interested in learning more about this particular problem, I have a link to a blog post I made years ago, when I started my website, that dives into the data a little bit more. 01:26:33.000 --> 01:26:46.000 And then there are also some additional references if you want to learn more about PCA. I believe all these links still work, but 01:26:46.000 --> 01:26:50.000 sometimes people change the links over time. 01:26:50.000 --> 01:27:00.000 Okay, so I'm going to stop recording, but I'll stick around for questions if anybody has them.