WEBVTT 00:00:06.000 --> 00:00:22.000 Lecture for the 2023 May data science boot camp from the Erdős Institute. So today we're going to wrap up ensemble learning by learning about boosting and then at the very end voter models, and then tomorrow we'll do a brief introduction to neural networks. 00:00:22.000 --> 00:00:26.000 If you do the problem session tomorrow, you'll also do a brief introduction on your own within the groups to neural networks. 00:00:26.000 --> 00:00:30.000 So show up tomorrow if you want to start to learn a little bit about neural networks before the live lecture. 00:00:30.000 --> 00:00:45.000 So all of the notebooks we'll cover today are in the supervised learning folder, and then within the ensemble learning folder there, we're going to go through notebooks 4, 5, 6, 7, and 8. 00:00:45.000 --> 00:00:48.000 So this might seem like a lot, but some of these notebooks are kind of short, so there's time for questions. 00:00:48.000 --> 00:01:00.000 Probably the longest one is the XGBoost one, so once we get past that, it'll be smooth sailing. 00:01:00.000 --> 00:01:04.000 Okay, so the first thing we're going to talk about is in notebook number 4, boosting. 00:01:04.000 --> 00:01:16.000 So there's no coding here; it's just a conceptual notebook. One of the most successful approaches across all of ensemble learning is known as boosting. 00:01:16.000 --> 00:01:23.000 So boosting takes advantage of a concept known as weak learnability. Let me zoom in so maybe it's easier to read. 00:01:23.000 --> 00:01:25.000 So the way it works is it comes from a subfield of statistical learning, which is a particular branch of machine learning. 00:01:25.000 --> 00:01:46.000 It's a subfield called PAC learnability. So PAC here stands for probably approximately correct, which is a formal definition, almost like a delta-epsilon argument,
00:01:46.000 --> 00:01:52.000 if you remember your Calc One. So if you're interested in learning more about what probably approximately correct means, you can click on this link and learn more. 00:01:52.000 --> 00:02:01.000 For just a broad overview of what's going on, we're going to do a very brief dive into what the theory essentially is. 00:02:01.000 --> 00:02:09.000 And then if you're inclined, I have references throughout all the boosting notebooks that you might be interested in. 00:02:09.000 --> 00:02:15.000 So this idea for boosting comes from the notions of weak learners and strong learners. 00:02:15.000 --> 00:02:26.000 So as a vague definition, we would say that a statistical learning algorithm, which all of our algorithms up to this point are, is referred to as a weak learner 00:02:26.000 --> 00:02:33.000 if it does slightly better than random guessing. So think of it like we slightly outperform flipping a coin. 00:02:33.000 --> 00:02:41.000 Here "slightly" has a more formal meaning, but just take it vaguely as, okay, we're doing a little bit better than just random guessing. 00:02:41.000 --> 00:02:50.000 We would then say that an algorithm is called a strong learner if you could make it as close to the true relationship as you'd like, 00:02:50.000 --> 00:03:01.000 you know, assuming you have enough training observations and computational power. So obviously of these two, it's a lot easier to make a weak learner, right? 00:03:01.000 --> 00:03:06.000 So we can make a weak learner with a very simple algorithm. 00:03:06.000 --> 00:03:20.000 And it's much harder in general to make a strong learner, but it has been shown, at least theoretically, that if you can show that a particular problem is weakly learnable, meaning that a weak learner exists,
00:03:20.000 --> 00:03:30.000 so there is an algorithm out there that can be defined as a weak learner, then it is also strongly learnable, meaning that a strong learner exists. 00:03:30.000 --> 00:03:42.000 Now, does this mean that you're going to be able to find such algorithms? Not necessarily, but you know that if you can provide an algorithm that does at least a little bit better than random guessing, 00:03:42.000 --> 00:03:47.000 then in theory, somewhere out there in the world of algorithms should be something that is a strong learner. 00:03:47.000 --> 00:04:03.000 And so this theorem led to a bunch of ways to try and construct strong learners. And basically this is done by making an ensemble, so ensemble learning, an ensemble of weak learners. 00:04:03.000 --> 00:04:14.000 So basically the idea for all boosting algorithms is we're going to take a bunch of weak learners, find some creative way to combine them together, and then the hope is that we're going to get a strong learner out of that. 00:04:14.000 --> 00:04:23.000 So for us, a very common weak learner that we're going to use is known as a decision stump. 00:04:23.000 --> 00:04:27.000 So this is a decision tree with a single cut point, and they call it a stump, right, because it's just one decision. 00:04:27.000 --> 00:04:32.000 So it's like, you know, if you cut off the top of a tree, you're left with the stump. 00:04:32.000 --> 00:04:40.000 So we're going to use decision stumps. You can use other algorithms as your weak learner, but we're going to use decision stumps in these notebooks. 00:04:40.000 --> 00:04:45.000 So we're going to look at two specific boosting algorithms. The first one is called adaptive boosting, or AdaBoost. 00:04:45.000 --> 00:04:52.000 The second one is called gradient boosting. So we're actually going to look at two implementations of
00:04:52.000 --> 00:04:58.000 gradient boosting, one through scikit-learn and one through a different Python package that we'll touch on later. 00:04:58.000 --> 00:05:08.000 So before we dive into covering the specific algorithms, are there any questions on just the general idea behind boosting? 00:05:08.000 --> 00:05:19.000 Not the general idea, but are we going to be touching on XGBoost at all? 00:05:19.000 --> 00:05:20.000 Okay. 00:05:20.000 --> 00:05:26.000 Yep, yep. So that's the second gradient boosting notebook. Yeah, so in XGBoost, the X stands for extreme and the G stands for gradient. 00:05:26.000 --> 00:05:33.000 What's an example of a weak learner? Like, would a random forest be an example of a weak learner? 00:05:33.000 --> 00:05:56.000 Yeah, so the most common weak learner is a decision stump. So it's just like, you know, you look at the data, you make one cut point, and oftentimes that can be better than just flipping a coin. 00:05:56.000 --> 00:06:02.000 All right, so let's dive into some of these algorithms. The first one is called AdaBoost, 00:06:02.000 --> 00:06:17.000 which is short for adaptive boosting. And so the main idea with AdaBoost is you're going to train a series of weak learners, and then each subsequent learner is going to pay more attention to the things that the last ones got wrong. 00:06:17.000 --> 00:06:23.000 So that's kind of the idea, and here's more formally how it actually works. 00:06:23.000 --> 00:06:33.000 So the idea is nice; how does it actually work? So the very first weak learner, you're just going to train like you normally would train the algorithm. 00:06:33.000 --> 00:06:38.000 Now for us this is going to be a decision stump, but in essence it could be any algorithm you'd like to use. 00:06:38.000 --> 00:06:47.000 You want it to be an algorithm that is quick to train, because you're going to be training a lot of different algorithms as part of your AdaBoost.
00:06:47.000 --> 00:06:59.000 So after you train the first one, the general step is: for weak learner j, the weights on the training samples will be determined by the performance of the previous weak learner. 00:06:59.000 --> 00:07:11.000 So for instance, the second weak learner will pay attention to the training observations in a slightly different way than the previous weak learner, 00:07:11.000 --> 00:07:28.000 depending on how well it did. And then after you train capital W total weak learners, your final prediction will be made by performing a weighted vote among all of the weak learners you've trained, where the weighted vote is determined by each weak learner's accuracy. 00:07:28.000 --> 00:07:44.000 So the more accurate a weak learner is, the more of a vote it's going to get. So we're going to now dive into the formulas, but just remember the key: we're going to first train a regular algorithm like we would normally do. 00:07:44.000 --> 00:07:49.000 And then each subsequent algorithm is going to pay more attention to the points that were previously incorrect. 00:07:49.000 --> 00:07:59.000 And then at the very end, the way we finally make our prediction is by a weighted vote among all of the weak learners. 00:07:59.000 --> 00:08:14.000 So in general, let's assume we have an observation y, with the superscript (i) denoting the class of observation i. So if the tenth row of your data set was a one, 00:08:14.000 --> 00:08:23.000 y^(10) would be one. ŷ_j^(i) will be the prediction for observation i 00:08:23.000 --> 00:08:37.000 of weak learner j.
So, you know, through this we're going to train a bunch of weak learners; the particular prediction that weak learner j is making for observation i is ŷ_j^(i). 00:08:37.000 --> 00:08:47.000 And then w^(i) is the current weight assigned to observation i. So we're going to be going through an iterative process where we update the weights at each step. 00:08:47.000 --> 00:08:55.000 So in general, after you train the j-th weak learner, the learner's weighted error rate is calculated: the sum of the w^(i) over the observations it got wrong, divided by the sum of all the w^(i). 00:08:55.000 --> 00:09:03.000 And so this might look confusing, but basically what you're doing is all of your training observations are going to have weights assigned to them. 00:09:03.000 --> 00:09:10.000 And then in the numerator, you're going to sum up those weights whenever your prediction got something wrong. 00:09:10.000 --> 00:09:22.000 And then you're going to divide by the sum of all of the weights. You could then think of this instead as being one minus a weighted accuracy, where the points are weighted according to the w's. 00:09:22.000 --> 00:09:31.000 This is going to be denoted r_j. So the way we've defined r_j, this is going to be big if the j-th 00:09:31.000 --> 00:09:33.000 learner is bad, and it'll be small when the j-th learner is good, right? 00:09:33.000 --> 00:09:43.000 So why would it be big? If our j-th learner is bad, it's going to have a lot of incorrect predictions. 00:09:43.000 --> 00:09:53.000 So this ŷ is going to be not equal to the regular y, right, quite often. So that means the numerator will be big then. 00:09:53.000 --> 00:10:00.000 And it would be small if the prediction is equal to the actual value often, which means that the numerator would be small. 00:10:00.000 --> 00:10:10.000 Okay, another way to think of it is just this one minus weighted accuracy. So if you have a good weak learner, weighted accuracy is high, which means r_j will be small.
00:10:10.000 --> 00:10:18.000 And then if you have a bad weak learner, weighted accuracy will be bad, meaning r_j will be large. 00:10:18.000 --> 00:10:24.000 Okay, after you've calculated these r_j's, then you compute the weight assigned to that particular weak learner in the final voting process. 00:10:24.000 --> 00:10:42.000 This is determined to be its alpha_j, which is eta times the log, and I believe this is natural log, of (1 - r_j)/r_j. 00:10:42.000 --> 00:10:51.000 So eta is a learning rate; it's called the learning rate of the algorithm. It's a hyperparameter that you would set before you do any of the fitting. 00:10:51.000 --> 00:11:03.000 We can remember what we know about r_j. So alpha_j is small when r_j is large, and remember, r_j large is when the learner is bad. 00:11:03.000 --> 00:11:07.000 So that means bad weak learners will have a small vote, and good weak learners will have a larger vote, because alpha_j will be larger when r_j is small. 00:11:07.000 --> 00:11:21.000 And then after we calculate the r_j and the alpha_j, it's finally time to go through and update these weights. 00:11:21.000 --> 00:11:29.000 So you're going to update them in the following way: the weight stays the same if you correctly predicted, 00:11:29.000 --> 00:11:33.000 and you multiply it by e^(alpha_j) if you incorrectly predicted. This is assuming that the alpha_j's are greater than 0. 00:11:33.000 --> 00:11:45.000 In practical applications, they typically are, but, you know, theoretically, I don't think there's a guarantee that this is 00:11:45.000 --> 00:11:55.000 greater than 0. So I just wanted to point that out. So this is probably confusing, which is why I think it's really helpful to go through just a silly example. 00:11:55.000 --> 00:12:03.000 So here's our silly example. I've got three blue circles, labeled 1, 2, and 3, and then three orange triangles, labeled 4, 5, and 6.
00:12:03.000 --> 00:12:11.000 So for the very first weak learner, every observation has the same weight, so the same w^(i) of one sixth. 00:12:11.000 --> 00:12:19.000 And then let's just say our first decision stump, or whatever weak learner we use, gives us the following rule. 00:12:19.000 --> 00:12:27.000 So the blue shaded region over here means we would predict blue circle, and the orange shaded region means we would predict orange triangle. 00:12:27.000 --> 00:12:35.000 So 1 and 2 are correctly predicted, 4, 5, and 6 are correctly predicted, but 3 is incorrectly predicted. 00:12:35.000 --> 00:12:40.000 So what was the first step? We can go back: after we do that, we have to calculate r_1. 00:12:40.000 --> 00:12:48.000 So we go through for r_1, and it's going to be zeros everywhere except for 00:12:48.000 --> 00:12:56.000 the third entry. Why is that? Because we incorrectly predicted on observation 3, so that's a one over 6, and then the denominator just sums up to one because of our weights. 00:12:56.000 --> 00:13:09.000 So r_1 is just one over 6; that's what it simplifies to. After we calculate the r, we then go and calculate the alpha, which, with eta equal to 1, would give us a log of 5. 00:13:09.000 --> 00:13:15.000 And now we update our weights. So we correctly predicted 1, so the weight stays the same. 00:13:15.000 --> 00:13:19.000 We correctly predicted 2, so the weight stays the same, and then the same thing for 4, 5, and 6. 00:13:19.000 --> 00:13:32.000 The only one we were incorrect on was the third observation, so we're going to multiply the weight there by e^(log 5), because that's what alpha_1 was. 00:13:32.000 --> 00:13:37.000 And so now these are our new updated weights for the second weak learner. 00:13:37.000 --> 00:13:57.000 So let's say then the second weak learner goes and produces the following decision rule: in blue shaded regions we're predicting a blue circle, in orange shaded regions we're predicting an orange triangle.
00:13:57.000 --> 00:14:08.000 Here everything is correct except for 4, and so when we do the calculations for r_2, everything's going to be 0 except for observation 4. 00:14:08.000 --> 00:14:13.000 Alpha_2 is then updated accordingly, and the only weight that we have to change is the weight on observation 4. 00:14:13.000 --> 00:14:19.000 So the previous weight, one over 6, times e^(alpha_2), which is going to give you 3 over 2. 00:14:19.000 --> 00:14:27.000 So if we would stop here, we have two weak learners now; we could stop if we wanted to. 00:14:27.000 --> 00:14:37.000 Weak learner 1 would get a vote worth log of 5, and weak learner 2 would get a vote worth log of 9, and then we would keep going 00:14:37.000 --> 00:14:48.000 until we're happy. Typically this is decided by some sort of cross-validation. 00:14:48.000 --> 00:14:58.000 So someone is asking why the third term is 5 over 6. So the weight for observation 3 becomes 5 over 6 in this step, when we updated the weights. 00:14:58.000 --> 00:15:06.000 So remember, the first weak learner incorrectly predicted observation 3, and so according to the rules we set up here, 00:15:06.000 --> 00:15:12.000 that means the weights are updated to be the previous, or the current, weight times e^(alpha_j). 00:15:12.000 --> 00:15:22.000 So for us that would be alpha_1, and we calculated alpha_1 to be log of 5. And so then when you do one over 6 times e^(log 5), that gives us the 5 over 6. 00:15:22.000 --> 00:15:31.000 Does that make sense? Yeah, yep, it carries on to the next step. 00:15:31.000 --> 00:15:38.000 Any other questions on the setup or this toy example? 00:15:38.000 --> 00:15:47.000 Sorry, how exactly are you getting from the weights to actually creating this decision boundary? 00:15:47.000 --> 00:15:52.000 So this is just a made-up example, but the weights would be considered in the training process.
00:15:52.000 --> 00:15:54.000 So when the algorithm's being fit, the computer will apply the weights to the observations. 00:15:54.000 --> 00:16:01.000 Does that make sense? 00:16:01.000 --> 00:16:06.000 Hmm. 00:16:06.000 --> 00:16:14.000 Not quite sure. 00:16:14.000 --> 00:16:15.000 Oh, I see. 00:16:15.000 --> 00:16:18.000 Yeah, so this is a general weak learner, so we don't have an actual algorithm to fit in this toy example. 00:16:18.000 --> 00:16:24.000 Yeah, yeah, but in practice, the weights would be considered in whatever algorithm's being fit behind the scenes. 00:16:24.000 --> 00:16:29.000 But so, with AdaBoost, like, have we not talked about AdaBoost yet? Because I thought that was a specific weak learner. 00:16:29.000 --> 00:16:40.000 So AdaBoost is a process, a boosting algorithm, for trying to create a strong learner out of a series of weak learners. 00:16:40.000 --> 00:16:46.000 So this whole process that we've just looked at is doing AdaBoost up to two weak learners. 00:16:46.000 --> 00:17:02.000 So in general, the weak learner we would choose is maybe a decision stump. And so then the decision stump here would produce a cut like this, and then the second one would maybe produce a cut like this, because we've more heavily weighted observation 3. 00:17:02.000 --> 00:17:15.000 The decision stump follows the CART algorithm, and the weights would play into that, where certain observations maybe get a higher weight. 00:17:15.000 --> 00:17:16.000 Okay. 00:17:16.000 --> 00:17:21.000 I'm not entirely sure how it gets implemented in scikit-learn or something, but the weights do impact the fitting process of the weak learner. 00:17:21.000 --> 00:17:22.000 Alright, thanks. 00:17:22.000 --> 00:17:27.000 Yeah, Keithon asked: for subsequent weak learners, does the sum of the weights not equal one?
00:17:27.000 --> 00:17:31.000 No, so it doesn't have to equal one, because remember, when we calculate r_j, we're dividing by the sum of all the weights. 00:17:31.000 --> 00:17:43.000 So r_j will never be more than one, because we're dividing by the sum of all the current weights. 00:17:43.000 --> 00:17:50.000 So that's like the most important part. 00:17:50.000 --> 00:17:58.000 I'm sorry, so you said that the weights are going to be incorporated into whatever weak learner we choose. 00:17:58.000 --> 00:18:04.000 So if, for example, the weak learner is, say, a tree of one depth or whatever, 00:18:04.000 --> 00:18:18.000 the weights would come into play during the calculation of the information gain or something, and from that information gain, the way in which it forms the decision boundaries? 00:18:18.000 --> 00:18:19.000 Yes, yeah. 00:18:19.000 --> 00:18:26.000 Okay. 00:18:26.000 --> 00:18:35.000 Any other questions? 00:18:35.000 --> 00:18:45.000 Okay, so now let me go ahead and show you how to do this in scikit-learn, and now that we're not reading a bunch of things, I'm going to zoom out a little bit so my code isn't as big. 00:18:45.000 --> 00:18:53.000 So we're going to use a smaller version of this same data problem we've been looking at for all of our classifications. 00:18:53.000 --> 00:19:04.000 So we're going to look at it with classification, but remember, the boosting algorithms can be for classification or for regression, because our weak learners, right, can also be used for regression. 00:19:04.000 --> 00:19:08.000 So if you have a decision stump for classification, you can just as easily have a decision stump for regression. 00:19:08.000 --> 00:19:14.000 So we have fewer observations here because I want to make it easier to see what's going on.
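Before moving to scikit-learn, the weight-update arithmetic from the toy example can be replayed in a few lines. This is a minimal sketch of the r_j, alpha_j, and weight-update formulas described above, with each stump's mistakes hard-coded to match the slides (the function and variable names here are mine, not the notebook's):

```python
import numpy as np

eta = 1.0                                # the learning rate eta, set to 1 here
w = np.full(6, 1 / 6)                    # every observation starts at weight 1/6

def boost_round(w, wrong):
    """One AdaBoost round: weighted error r_j, vote weight alpha_j, new weights."""
    r = w[wrong].sum() / w.sum()         # weighted error rate r_j
    alpha = eta * np.log((1 - r) / r)    # vote weight alpha_j
    w = w.copy()
    w[wrong] *= np.exp(alpha)            # up-weight only the missed observations
    return r, alpha, w

# Weak learner 1 misclassifies observation 3 (index 2):
r1, alpha1, w = boost_round(w, wrong=[2])   # r1 = 1/6, alpha1 = log 5, w3 -> 5/6
# Weak learner 2 misclassifies observation 4 (index 3):
r2, alpha2, w = boost_round(w, wrong=[3])   # r2 = 1/10, alpha2 = log 9, w4 -> 3/2
print(r1, alpha1, r2, alpha2)
print(w)
```

Note that after the second round the weights sum to more than one, which is exactly the point raised in the question: only the ratio matters, since r_j divides by the current total.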
00:19:22.000 --> 00:19:26.000 So let's go ahead and show you how to make the decision stump, er, for the AdaBoost. 00:19:26.000 --> 00:19:37.000 So, ensemble: from sklearn.ensemble we'll import AdaBoostClassifier. 00:19:37.000 --> 00:19:45.000 And then I think I have a link to the documentation for this further up in the notebook. And then we also have to import what our base classifier is. 00:19:45.000 --> 00:19:50.000 So the base classifier, that's the weak learner we're using. 00:19:50.000 --> 00:19:56.000 We're going to use the decision stump, so from sklearn.tree 00:19:56.000 --> 00:20:03.000 we're going to import DecisionTreeClassifier. 00:20:03.000 --> 00:20:11.000 Okay, so here's how we set it up. I'm going to get to the for loop after I just show you the setup. 00:20:11.000 --> 00:20:18.000 So you call AdaBoostClassifier. The first argument is your weak learner, which for us is a decision stump, 00:20:18.000 --> 00:20:26.000 so that's the decision tree with a maximum depth of one. Then the number of estimators you're going to use, 00:20:26.000 --> 00:20:34.000 so that's the number of weak learners. In the previous toy example, we stopped with n_estimators equal to 2, but we could have kept going. 00:20:34.000 --> 00:20:40.000 Currently, in this code, I have it set equal to i, and I'll explain that in a second. 00:20:40.000 --> 00:20:45.000 Then I have this argument algorithm equals SAMME.R. So this is the algorithm used to fit 00:20:45.000 --> 00:20:57.000 the AdaBoost classifier. The reason I'm specifying this is the other one doesn't allow you to predict probabilities. 00:20:57.000 --> 00:21:06.000 So I wanted to point out that if you would like to be able to predict probabilities for your classification, 00:21:06.000 --> 00:21:17.000 you need to use SAMME.R.
The other option is SAMME, and you wouldn't be able to use predict_proba if you didn't use the one with the .R. 00:21:17.000 --> 00:21:22.000 And then I set a random state so that when I run the code and you run the code, it'll be the same. 00:21:22.000 --> 00:21:29.000 So what I'm doing here is I'm just showing you how the decision boundaries change when I use a different number of weak learners. 00:21:29.000 --> 00:21:32.000 So the first time through it will just be a single decision stump, the second time through it will be two decision stumps, and so on. 00:21:32.000 --> 00:21:42.000 Okay. 00:21:42.000 --> 00:21:49.000 So here we can slowly watch the process: as we add additional weak learners, how it changes our decision boundary. 00:21:49.000 --> 00:21:56.000 And you can see how it starts to react to getting things wrong. 00:21:56.000 --> 00:21:59.000 So for the first two, it doesn't really change much. 00:21:59.000 --> 00:22:11.000 But with the third one, these incorrect blue ones lead to this, you know, branch; then the fact that I have these incorrect orange ones over here leads to this adjustment. 00:22:11.000 --> 00:22:19.000 Then I incorrectly predicted that blue one, so that leads to this, and then after 5 we've correctly predicted everything in the training set. 00:22:19.000 --> 00:22:34.000 And so it stops making adjustments, because after that, with every subsequent weak learner, the overall algorithm will be predicting everything correctly. 00:22:34.000 --> 00:22:39.000 Okay, so that's AdaBoost. That's the theory, at least the setup of the theory; 00:22:39.000 --> 00:22:44.000 we haven't proven anything. And that's how to implement it in scikit-learn. 00:22:44.000 --> 00:22:52.000 So here are some good additional references on AdaBoost models.
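Putting the setup just described into a runnable form, here is a version-hedged sketch with synthetic stand-in data (the data, seed, and variable names are mine, not the notebook's). One caveat: in scikit-learn releases after this lecture, the `algorithm="SAMME.R"` option was deprecated and later removed, so it is omitted here, and the weak learner is passed positionally because its keyword name changed from `base_estimator` to `estimator` across versions:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's classification data.
rng = np.random.default_rng(216)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # the weak learner: a decision stump
    n_estimators=10,                      # W, the total number of weak learners
    learning_rate=1.0,                    # eta in the alpha_j formula
    random_state=216,
)
ada.fit(X, y)
print(ada.score(X, y))                    # training accuracy
print(ada.predict_proba(X[:2]))           # per-class probabilities
```

In practice you might put `n_estimators` and `learning_rate` together into a cross-validation search, since, as the learning-rate question later in the Q&A suggests, a smaller eta generally needs more weak learners to reach the same fit.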
These are pretty good, I think. 00:22:52.000 --> 00:22:59.000 You'd probably want a mathematical background if you're going to try and dive into them, but you can try reading them if you're interested but don't have a math background. 00:22:59.000 --> 00:23:10.000 Yeah. 00:23:10.000 --> 00:23:17.000 So just reading through the question. Okay, so Zach's asking: are there versions of boosting that change the employed weak learner methods 00:23:17.000 --> 00:23:24.000 as the number of guesses increases? So if I understand correctly, I believe, Zach, you are asking about a 00:23:24.000 --> 00:23:31.000 boosting algorithm that would maybe change from a decision stump to a different type of algorithm partway through. 00:23:31.000 --> 00:23:33.000 Is that what you're asking? 00:23:33.000 --> 00:23:43.000 Yeah, either a decision stump to a different type of algorithm, or increasing the number of branches. Just improving the learner as you guess more. 00:23:43.000 --> 00:23:54.000 So I don't think so, because, I mean, the overall idea is that you don't need a sophisticated algorithm to do it; the improvement comes from adding weak learners. 00:23:54.000 --> 00:24:02.000 So that's the idea. So I don't think there's ones that say, okay, I'm going to use a really bad algorithm, and then as I get closer, I'm going to employ a better algorithm, if that makes sense. 00:24:02.000 --> 00:24:13.000 So I think it's typically the case that you're always using the same base algorithm. 00:24:13.000 --> 00:24:14.000 Okay, thank you. 00:24:14.000 --> 00:24:15.000 Yep, yeah. And then Ramazan asked: is it common to try different learning rates? 00:24:15.000 --> 00:24:23.000 How does it play out with the number of estimators if we increase the learning rate? So that's a good question, 00:24:23.000 --> 00:24:28.000 because I think I forgot to specify the learning rate. So let's see if we can check the documentation,
00:24:28.000 --> 00:24:32.000 because I'm not sure what they call the learning rate. So here's the documentation. The learning rate defaults to one. 00:24:32.000 --> 00:24:44.000 In general, you might want to try, you know, putting this in a cross-validation loop with the number of estimators, 00:24:44.000 --> 00:24:55.000 so you can try different numbers of estimators and different learning rates. So let's see if I'm good at this in a live lecture setting: 00:24:55.000 --> 00:25:03.000 we can look at the setup and try and reason out what the learning rate would do. So the learning rate impacts... 00:25:03.000 --> 00:25:16.000 So let's see, the learning rate impacts alpha, and alpha is the weight of the vote. And so if eta is higher, that means alpha should be higher, which means that you're giving bigger votes. 00:25:16.000 --> 00:25:20.000 So I think your adjustments would be more drastic. Whereas if eta is lower, then everything gets a smaller 00:25:20.000 --> 00:25:34.000 vote, so maybe your adjustments would be less drastic. So I think it's a learning rate in the same sense as in gradient descent, where if it's bigger you're more likely to make bigger shifts, 00:25:34.000 --> 00:25:35.000 and if it's smaller, you're more likely to make smaller incremental changes and it would take you longer. 00:25:35.000 --> 00:25:54.000 So just like with any sort of algorithm with a learning rate, I think there's probably a trade-off that you need to try and balance with cross-validation or something. 00:25:54.000 --> 00:26:00.000 Yeah, so does that answer your question? 00:26:00.000 --> 00:26:09.000 Great. Jacob's asking: is this an example of a greedy algorithm? So, as a reminder for everybody, 00:26:09.000 --> 00:26:19.000 greedy algorithms are ones where at each step the algorithm makes the optimal decision,
so the decision at that step that leads to the greatest improvement. 00:26:19.000 --> 00:26:29.000 Here, I'm not entirely sure; I would have to double-check, but based on the updating rules, I don't know that this would be considered 00:26:29.000 --> 00:26:38.000 a greedy algorithm. Oh, genetic algorithm. That I am not so sure about. I read a book a long time ago that explained what a genetic algorithm was, 00:26:38.000 --> 00:26:53.000 and I don't remember, so I can't give you a good answer as to whether or not it's a genetic algorithm. 00:26:53.000 --> 00:27:02.000 And Sanjay is saying that they are not the same. 00:27:02.000 --> 00:27:08.000 All right, so that's AdaBoost. The next type of boosting is called gradient boosting, 00:27:08.000 --> 00:27:15.000 and then after this, we'll learn something called extreme gradient boosting. 00:27:15.000 --> 00:27:33.000 Okay, so AdaBoost basically paid extra attention to the things that were wrong. The way gradient boosting works is you're going to first build a weak learner, then you're going to build a second weak learner to train directly on the 00:27:33.000 --> 00:27:43.000 errors of the previous weak learner. So this is going to be easiest to explain in a regression formulation. 00:27:43.000 --> 00:27:51.000 It still works for classification; just like with all the other ensemble learning, you can do it for regression or classification. 00:27:51.000 --> 00:27:58.000 So we're going to do it in the regression setting because it's easier to write out, and then I think I have some references at the end if you're interested in looking at the classification version, 00:27:58.000 --> 00:28:08.000 or maybe it's in the practice problems, either one of those. So let's go through it. 00:28:08.000 --> 00:28:20.000 So here are the steps for gradient boosting. You first train a weak learner, in this regression setting a weak regression algorithm,
so let's say another decision stump, to predict y. 00:28:20.000 --> 00:28:37.000 And then this is called weak learner number one. Then you calculate the residuals, and I believe we covered this when we were doing regression two weeks ago, seems like forever ago now: you calculate the actual minus the predicted. 00:28:37.000 --> 00:28:41.000 And so here I'm going to use h_1(X) as notation to denote the prediction of weak learner one. 00:28:41.000 --> 00:28:56.000 So h_1 is going to be ŷ for the first weak learner. And then h_2 would be for the second, or not exactly ŷ, but it's going to be the prediction of the second weak learner, 00:28:56.000 --> 00:29:00.000 and we'll see why it's not exactly ŷ. Okay, so this is step one: 00:29:00.000 --> 00:29:04.000 you train a weak learner regression algorithm to predict y, then you calculate the error, 00:29:04.000 --> 00:29:14.000 so the actual minus the predicted. Then in general, for step j, you will train a weak learner to predict the residuals from step j minus one. 00:29:14.000 --> 00:29:40.000 So for instance, in step 2, instead of making a decision stump regressor that predicts y, you're going to make a decision stump regressor that predicts r_1, the residuals of the previous model. 00:29:40.000 --> 00:29:46.000 So then you're going to set h_j(X) 00:29:46.000 --> 00:29:55.000 to denote the predictions of the previous step's residuals. Then you calculate the residuals for this weak learner, 00:29:55.000 --> 00:30:07.000 so that would be the previous step's residuals minus the current prediction. And then you would stop, given that you're going to preset, okay, use this many, capital J, weak learners.
00:30:07.000 --> 00:30:18.000 And so now you might be wondering, well, how do I then get the prediction for y at a given step? You do that by summing up all of the different predictions you've made. 00:30:18.000 --> 00:30:27.000 So h_1(x) is the predicted value for y. h_2(x) is the predicted value for the residuals from the first weak learner. 00:30:27.000 --> 00:30:28.000 h_3(x) would be the predicted value for the residuals of the second weak learner, and so on. 00:30:28.000 --> 00:30:43.000 And so by summing all these up, you're trying to get closer and closer to the prediction of y, which I'm going to just call H(x) here. 00:30:43.000 --> 00:30:50.000 So before opening it up for questions, I'm going to show you a visualization of this to hopefully make it clear. 00:30:50.000 --> 00:30:56.000 So let's say we have this data, y and X, and I'm going to build, 00:30:56.000 --> 00:31:01.000 I'm not going to use the sklearn gradient boosting algorithm just yet, 00:31:01.000 --> 00:31:09.000 I'm using just a decision tree, to show you what's going on. So for the first weak learner I make my decision stump, 00:31:09.000 --> 00:31:21.000 a decision tree of depth one. Then I fit it on the X and y, I get my prediction and store it in h_1, and then I calculate the residuals and store them in r_1. 00:31:21.000 --> 00:31:33.000 And then here I'm just plotting both of those things. So on the left-hand side is going to be a plot of all the individual h_j's, and then on the right-hand side will be the running plot of the H's. 00:31:33.000 --> 00:31:39.000 So weak learner two is then fit on r_1, the residuals from step one. Those predictions are stored in h_2, and then I calculate the residuals from that step, 00:31:39.000 --> 00:31:51.000 r_2, so the errors on the previous model's residuals. 00:31:51.000 --> 00:32:00.000 Then weak learner three does a similar thing. So I fit the decision stump on the residuals from the previous step.
00:32:00.000 --> 00:32:08.000 Store the predictions and then calculate the residuals for that step. Okay, so here's what this looks like as a picture. 00:32:08.000 --> 00:32:15.000 So on the left-hand side of all these plots, I'm going to have the h_j along with the training data. 00:32:15.000 --> 00:32:21.000 So for the first row, that's h_1. So this is just the X and the y. 00:32:21.000 --> 00:32:27.000 And then h_1 is just a single cut point. 00:32:27.000 --> 00:32:34.000 And so if you're unfamiliar, a decision stump regressor will just take the bins and then average them. 00:32:34.000 --> 00:32:43.000 So everything to the left of the cut point is the average of all of these points, and then to the right of the cut point it's the average of all these points. 00:32:43.000 --> 00:32:49.000 So then on the right-hand side I'm going to have the running value of H, which remember is the sum of all the h_j. 00:32:49.000 --> 00:32:58.000 So for right now, we only have h_1, so H is equal to h_1. Now h_2 is trained on the residuals from the first step. 00:32:58.000 --> 00:33:04.000 So if we look at this, we can kind of see where that's coming from. So here we can see we've got all these ones up here. 00:33:04.000 --> 00:33:10.000 So this is what h_2 looks like, and this is the data that was used to train h_2. 00:33:10.000 --> 00:33:15.000 Now on the right-hand side here we've got H, which is equal to h_1 plus h_2, 00:33:15.000 --> 00:33:21.000 and then the original training data. 00:33:21.000 --> 00:33:29.000 Now we've got r_2, and we're training a new model that gives us h_3 to predict r_2. 00:33:29.000 --> 00:33:34.000 And now we've got the running sum of h_1, h_2, and h_3 on the right-hand side. 00:33:34.000 --> 00:33:40.000 And then finally this is where I stop. We've got h_4, which is trained on the residuals from the previous step.
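The residual-fitting loop being plotted here can be sketched from scratch in a few lines. This is my own minimal sketch, not the notebook's code: `fit_stump` and `gradient_boost` are hypothetical helper names, and the cut-point search is brute force just to keep it self-contained.

```python
# From-scratch gradient boosting with depth-1 stumps (learning rate of 1 for simplicity).

def fit_stump(x, y):
    """Fit a depth-1 regression tree: one cut point, mean of y on each side."""
    best = None
    order = sorted(range(len(x)), key=lambda i: x[i])
    for k in range(1, len(x)):
        cut = (x[order[k - 1]] + x[order[k]]) / 2
        left = [y[i] for i in range(len(x)) if x[i] <= cut]
        right = [y[i] for i in range(len(x)) if x[i] > cut]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y[i] - (ml if x[i] <= cut else mr)) ** 2 for i in range(len(x)))
        if best is None or sse < best[0]:
            best = (sse, cut, ml, mr)
    _, cut, ml, mr = best
    return lambda xi: ml if xi <= cut else mr

def gradient_boost(x, y, n_learners=20):
    """Iteratively fit stumps h_j to the residuals of the running sum H."""
    stumps = []
    resid = list(y)                                      # before any learner, the "residual" is y itself
    for _ in range(n_learners):
        h = fit_stump(x, resid)                          # weak learner h_j predicts the current residuals
        stumps.append(h)
        resid = [r - h(xi) for xi, r in zip(x, resid)]   # r_j = r_{j-1} - h_j(x)
    return lambda xi: sum(h(xi) for h in stumps)         # H(x) = sum of all the h_j(x)
```

Calling `gradient_boost` on a toy curve and comparing the MSE of `H` against a constant-mean baseline shows the residuals shrinking as stumps are added, which is exactly what the left/right plot pairs are showing.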
00:33:40.000 --> 00:33:52.000 And then here's the running sum h_1 plus h_2 plus h_3 plus h_4. So that's the idea behind gradient boosting. 00:33:52.000 --> 00:34:05.000 And again, this can be done for classification as well, but the setup is slightly different, right, because the residuals aren't exactly actual minus predicted. 00:34:05.000 --> 00:34:14.000 I either have a reference at the bottom of this or it's in the practice problems, I can't remember which. 00:34:14.000 --> 00:34:20.000 Okay, so I'm just checking the questions. 00:34:20.000 --> 00:34:25.000 So Keithon asked, this may be silly. It's not a silly question. 00:34:25.000 --> 00:34:30.000 What's the benefit of using a decision tree with AdaBoost over a random forest without AdaBoost, is it computation time? 00:34:30.000 --> 00:34:42.000 So those are two different approaches. The main reason we use the decision stump in AdaBoost and gradient boosting is because it's a weak learner that is easy and quick to train. 00:34:42.000 --> 00:34:45.000 We just need to find a single cut point. So random forests and gradient boosting are just different approaches. 00:34:45.000 --> 00:34:58.000 One may perform better for your particular problem. It just depends. 00:34:58.000 --> 00:35:15.000 And then thank you, Brooks, for your nice comment. Are there any other questions about gradient boosting? 00:35:15.000 --> 00:35:29.000 Okay, so how do we do this in sklearn, with the gradient boosting regressor in this case, or classifier? So we would do from sklearn 00:35:29.000 --> 00:35:43.000 import GradientBoostingRegressor. And we can note that we do not have to import the decision tree regressor because we did that earlier. 00:35:43.000 --> 00:35:51.000 We didn't have this in the previous setup because I didn't wanna, you know, make it even more confusing, but just like with AdaBoost, there's a learning rate.
00:35:51.000 --> 00:36:02.000 So the updates get multiplied by some learning rate, eta. So I believe the default is, well, we could just see what the default is. 00:36:02.000 --> 00:36:11.000 So the default is 0.1. And so a higher learning rate means that your adjustments are going to be made faster, and then a lower learning rate means your adjustments will be made more slowly. 00:36:11.000 --> 00:36:33.000 Which one is best depends upon, you know, your data. So what we're gonna do here is we're gonna show you two different gradient boosting regressors, one with a lower learning rate, one with a higher learning rate, and then you'll see the differences. 00:36:33.000 --> 00:36:39.000 So we're gonna use the same number of estimators here. So we're gonna have a gradient 00:36:39.000 --> 00:36:47.000 boosting regressor. Then we have to put in our base, which is our decision 00:36:47.000 --> 00:36:56.000 tree regressor with a maximum depth of one. We're gonna set the number of estimators 00:36:56.000 --> 00:37:02.000 equal to 10. So I just chose 10 here for demonstration purposes. It doesn't mean that's what you're gonna want to use all the time. 00:37:02.000 --> 00:37:16.000 And then my max depth, I already did that. Then, just to be clear, I'm gonna set my learning rate equal to 0.1 and then I'll make a note: 00:37:16.000 --> 00:37:26.000 note, this is the default value. Okay. Then I'm gonna slightly cheat and just copy and paste this so I don't have to type it all again. 00:37:26.000 --> 00:37:32.000 But now I'm gonna change it and make it a larger learning rate. I'm gonna set it to one. 00:37:32.000 --> 00:37:43.000 Oh, what did I do? 00:37:43.000 --> 00:37:58.000 Let's just check. 00:37:58.000 --> 00:38:01.000 Interesting. Alright, let's 00:38:01.000 --> 00:38:09.000 do another cheat and peek so I don't have to spend too much time on debugging. 00:38:09.000 --> 00:38:18.000 Oh, okay, awesome. So here's the difference. Why am I getting an error?
Gradient boosting uses decision trees. 00:38:18.000 --> 00:38:31.000 So I just have to put in a max depth. So in general, in theory you could use any weak learner you'd like, but by default it's always a decision tree, and then you just have to set the maximum depth. 00:38:31.000 --> 00:38:39.000 So I'm sorry if that was confusing. I just had a brain slip and forgot that they use decision trees by default. 00:38:39.000 --> 00:38:49.000 Hey, so there we go. So you just set the maximum depth like you would in any other decision tree or random forest. 00:38:49.000 --> 00:38:55.000 But now, unlike AdaBoost, in gradient boosting it's just always a decision stump, 00:38:55.000 --> 00:38:59.000 well, decision tree, you could change the depth to be more than one. Okay. So here's the difference between the two. 00:38:59.000 --> 00:39:16.000 So you can see how with a lower learning rate, we're more slowly fitting to the data, and with the higher learning rate we're maybe more likely to overfit on the data, like, quickly. 00:39:16.000 --> 00:39:18.000 And so that's the impact of the learning rate. It's just how much we're adjusting to the residuals. 00:39:18.000 --> 00:39:21.000 If we were to run this longer, so if we ran the first one longer, it would begin to look like this. 00:39:21.000 --> 00:39:35.000 We would just need more estimators. So I believe the preference in general is to use a slightly smaller learning rate and then just use more trees. 00:39:35.000 --> 00:39:45.000 Now that has the problem of a longer training time, but that's something you'll have to consider. 00:39:45.000 --> 00:39:58.000 Okay, so you can try and find a good number of estimators.
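The two fits being compared can be sketched like this. This is a hedged reconstruction, not the notebook's code: the toy data is my own, since the notebook's randomly generated data isn't shown, and note that sklearn's `GradientBoostingRegressor` takes `max_depth` directly rather than a tree object, which is what the error above was about.

```python
# Two gradient boosting regressors that differ only in learning rate.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))                  # stand-in toy data
y = np.sin(X.ravel()) + rng.normal(0, 0.2, size=100)

# learning_rate=0.1 is the default; stumps come from max_depth=1
gb_slow = GradientBoostingRegressor(max_depth=1, n_estimators=10, learning_rate=0.1)
gb_fast = GradientBoostingRegressor(max_depth=1, n_estimators=10, learning_rate=1.0)
gb_slow.fit(X, y)
gb_fast.fit(X, y)

# train_score_ holds the training loss after each of the 10 stages; the larger
# learning rate drives it down much faster after the same number of stumps.
print(gb_slow.train_score_[-1], gb_fast.train_score_[-1])
```

Plotting `gb_slow.predict` and `gb_fast.predict` over a grid reproduces the picture in the lecture: the small learning rate approaches the data slowly, the large one chases it quickly.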
So that can be optimized with cross-validation, or a lot of times you might use just a validation set. 00:39:58.000 --> 00:40:12.000 And the reason there is that it can take a long time to fit sometimes. And so if you don't want to have to wait and, say, fit 200 subsequent weak learners or 200 subsequent decision stumps 00:40:12.000 --> 00:40:24.000 five different times, you might just use a validation set, depending on how long it takes. So for us, because it's a lecture, I'm gonna use the validation set so it's a little bit faster. 00:40:24.000 --> 00:40:38.000 So here I'm getting a validation set. It was randomly generated data, so I can just randomly generate more data instead of having to actually do a train test split kind of thing. 00:40:38.000 --> 00:40:57.000 We're going to calculate the mean squared error on the validation set. Alright, so what we're gonna do is we're gonna go through 200 decision trees, or decision stumps, and then we're going to calculate the mean squared error on the validation set for each of the fitted weak 00:40:57.000 --> 00:41:05.000 learners. And the way that we can do this is gradient boosting has a method called staged_predict. 00:41:05.000 --> 00:41:14.000 So we're gonna do, maybe I'll do it separately so you can see. So we'll do gb.staged_predict, 00:41:14.000 --> 00:41:22.000 and we're gonna put in X_val.reshape(-1, 1). 00:41:22.000 --> 00:41:32.000 And so you can see this is a generator object. And so what this is doing is it's going to allow us to loop through each of the weak learners 00:41:32.000 --> 00:41:42.000 and then provide the prediction that you get from stopping at that point. And so it's gonna do this in what's known as a generator, which is something you have to iterate through. 00:41:42.000 --> 00:41:51.000 So we're gonna copy this and put it into a list comprehension. So we want the mean squared error
00:41:51.000 --> 00:42:02.000 of, we want y_val first, the true values, and then the predicted values for the predictions, 00:42:02.000 --> 00:42:05.000 and those come from staged_predict. 00:42:05.000 --> 00:42:14.000 Okay, so here we have our mean squared errors. And this would be the mean squared error if we stopped at a single decision tree, 00:42:14.000 --> 00:42:21.000 if we stopped at 2, if we stopped at 3, and so forth, all on the validation set. 00:42:21.000 --> 00:42:30.000 And so then what you would do is you could look at this. And I've plotted it here so you can see the MSE as a function of the number of weak learners, and then you'd find the one with the smallest 00:42:30.000 --> 00:42:43.000 value, the smallest MSE, again, on the validation set. And then you would say, okay, this will be the number of weak learners I would use. 00:42:43.000 --> 00:42:51.000 So that's 112 weak learners. And then you could retrain it. And then here's what it looks like on the training set. 00:42:51.000 --> 00:43:00.000 So retraining it, this is the model we get. Okay. All right, so. 00:43:00.000 --> 00:43:05.000 There's another way to do this called early stopping. So notice here that we had to go through and train about 90 more weak learners than we needed for the smallest value. 00:43:05.000 --> 00:43:19.000 And so one way that you can keep yourself from doing as many fits as we did there is doing what's known as early stopping. 00:43:19.000 --> 00:43:26.000 So if you include in sklearn an argument called warm_start and set it equal to True, 00:43:26.000 --> 00:43:41.000 this is going to allow you to implement early stopping. So how does early stopping work? Early stopping will, as you add a weak learner, keep track of what is my current best MSE. 00:43:41.000 --> 00:43:48.000 And then if I don't go below the current best for some set number of times in a row, so for us it's going to be 10.
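Both tuning approaches, scoring every stage with `staged_predict` and the `warm_start` early-stopping loop, can be sketched together. Toy data and variable names are my own stand-ins, not the notebook's:

```python
# Picking n_estimators on a validation set, two ways.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.2, size=200)
X_val = rng.uniform(0, 10, size=(80, 1))
y_val = np.sin(X_val.ravel()) + rng.normal(0, 0.2, size=80)

# Approach 1: fit all 200 stages up front, then score each stage on the
# validation set. staged_predict is a generator, one prediction array per stage.
gb = GradientBoostingRegressor(max_depth=1, n_estimators=200, learning_rate=0.1)
gb.fit(X, y)
mses = [mean_squared_error(y_val, pred) for pred in gb.staged_predict(X_val)]
best_n = int(np.argmin(mses)) + 1   # stages are 0-indexed, learner counts are not

# Approach 2: early stopping with warm_start, so each fit call only adds
# one new stump instead of refitting everything from scratch.
gb_es = GradientBoostingRegressor(max_depth=1, learning_rate=0.1, warm_start=True)
min_val_error = float("inf")        # infinity so the first fit always counts as a new best
rounds_since_best = 0
for n_estimators in range(1, 501):
    gb_es.n_estimators = n_estimators
    gb_es.fit(X, y)
    val_error = mean_squared_error(y_val, gb_es.predict(X_val))
    if val_error < min_val_error:
        min_val_error, rounds_since_best = val_error, 0
    else:
        rounds_since_best += 1
        if rounds_since_best == 10:  # 10 non-improving rounds in a row: stop early
            break
```

The first approach wastes fits past the minimum; the second stops shortly after the validation error flattens out, which is the trade-off discussed here.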
00:43:48.000 --> 00:43:58.000 If I don't outperform my current best MSE after 10 more weak learners, I'm going to stop early and not keep going. 00:43:58.000 --> 00:44:03.000 So here is the code where we implement that, and I'll walk us through it. So we set the warm_start argument 00:44:03.000 --> 00:44:09.000 equal to True. That's what's going to allow us to do what we're about to do. 00:44:09.000 --> 00:44:14.000 And then we set a minimum validation error. So here, this might look weird, I'm setting it to infinity, 00:44:14.000 --> 00:44:21.000 and we'll see why I'm doing that in a second. 00:44:21.000 --> 00:44:30.000 Now I'm providing a list where I'm going to keep track of my validation errors, and then I'm also keeping a counter that's going to count the number of times my error was higher than my minimum error. 00:44:30.000 --> 00:44:37.000 And then if this ever gets to 10, I'll stop early. I will just not do it anymore. 00:44:37.000 --> 00:44:46.000 So I'm then gonna loop through one to 500 and then train my gradient boosting tree to have that many weak learners. 00:44:46.000 --> 00:44:55.000 So what I'm gonna say is, each time through the loop, I set the number of estimators for my gradient boosting regressor to be n_estimators. 00:44:55.000 --> 00:45:11.000 So the first time through it would be one, then 2, then 3. I fit slash refit the model, so because I had this warm_start argument I'm able to do this, so it won't go all the way back to one and then, you know, refit one and 2 every time. 00:45:11.000 --> 00:45:24.000 Then I calculate the validation errors for training up to that point. I check to see if my current validation error is better than my absolute minimum that I have so far. 00:45:24.000 --> 00:45:34.000 I guess it could be a local minimum.
I just mean the smallest one I currently have. And if it does, I record that as the new minimum and then I reset my counter. 00:45:34.000 --> 00:45:41.000 And then if my counter ever gets to 10 times in a row, meaning, okay, I trained a new 00:45:41.000 --> 00:45:52.000 weak learner and my error was still higher than my current minimum value, then I'm going to increase my counter, and if I ever get to 10 I'm not going to do the loop anymore. 00:45:52.000 --> 00:46:06.000 So this is called early stopping, and this is just printing so we can see what we're doing. And so you can see that once we got to 122, we stopped. 00:46:06.000 --> 00:46:11.000 And so this is what it looks like here. So I think it's still 112, 00:46:11.000 --> 00:46:15.000 that's what we had before, right? Yeah. So once we got to 112, 10 times in a row 00:46:15.000 --> 00:46:22.000 we didn't outperform the MSE we had on the validation set at 112. 00:46:22.000 --> 00:46:30.000 So we stopped. Okay. 00:46:30.000 --> 00:46:31.000 Yeah. 00:46:31.000 --> 00:46:40.000 There was a question. At the very bottom of your code block, yeah, that one. 00:46:40.000 --> 00:46:48.000 You see the comment where it says if this is the fifth time in a row it has gone up. Should that be tenth? 00:46:48.000 --> 00:46:49.000 Yes. Thanks for pointing that out. 00:46:49.000 --> 00:46:52.000 Okay. Otherwise, I thought there was a divide-by-two thing going on that I was missing. 00:46:52.000 --> 00:46:57.000 No, no, no, yeah, it's just a comment that I missed in my editing. 00:46:57.000 --> 00:46:58.000 Cool. 00:46:58.000 --> 00:47:06.000 Yep. And then icon asked, I expected to loop over n_estimators in the previous example, line 12. 00:47:06.000 --> 00:47:11.000 So are you talking about 00:47:11.000 --> 00:47:17.000 when we did this staged_predict thing?
00:47:17.000 --> 00:47:28.000 Okay, so when you do staged_predict, when we fit the model, it fits it for the total number of estimators we chose, which for this was 200. 00:47:28.000 --> 00:47:36.000 And then staged_predict basically allows us to loop through each of those, at the training point that it was already at, 00:47:36.000 --> 00:47:44.000 if that makes sense. So the first entry would be: we've trained using only one weak learner, 00:47:44.000 --> 00:47:48.000 here's the prediction for that weak learner. So that's why it's called staged_predict. 00:47:48.000 --> 00:47:54.000 It's the prediction at all the different stages of the training. 00:47:54.000 --> 00:48:00.000 Yeah, are there any other questions about any of the stuff we just did? 00:48:00.000 --> 00:48:13.000 So warm_start, I believe, is what allows us to do the resetting of the number of estimators and then refitting the model. 00:48:13.000 --> 00:48:18.000 Oh, sorry. And that was in response to a question Clark had. I have to remember the people watching the recording won't see that. 00:48:18.000 --> 00:48:25.000 Clark asked, where are we using warm_start in the code? So here we set warm_start equal to True. 00:48:25.000 --> 00:48:43.000 And if I understand correctly, it's been a while since I wrote the notebook so I may be forgetting, but it's what allows us to reset the number of estimators and then fit the model to include the next number of estimators. Oh, thanks. 00:48:43.000 --> 00:48:44.000 Thanks. 00:48:44.000 --> 00:48:51.000 Yeah, just checked the documentation, that's correct for warm_start. 00:48:51.000 --> 00:49:01.000 Alright, any other questions? 00:49:01.000 --> 00:49:06.000 Okay. 00:49:06.000 --> 00:49:12.000 So you might be wondering why this is called gradient boosting. And so here's just a quick sort of explanation as to why it's called gradient boosting.
00:49:12.000 --> 00:49:21.000 It's not exactly because we're computing a gradient, but we'll see why. 00:49:21.000 --> 00:49:35.000 So let's say, you know, our current prediction for y-hat as we go through step J, I'm gonna call it capital H_J(x). 00:49:35.000 --> 00:49:45.000 And remember this is the sum of all the little h's. So to get the estimate of y at step J plus one, we're going to call this 00:49:45.000 --> 00:49:54.000 capital H_{J+1}. So this is, you know, hopefully approximating y. 00:49:54.000 --> 00:50:02.000 It's going to be capital H_J, so the previous prediction for y, plus the current little h_{J+1}. 00:50:02.000 --> 00:50:19.000 So remember that's for the residuals, right? So in turn, little h_{J+1} is an approximation of y minus capital H_J. 00:50:19.000 --> 00:50:25.000 And if you remember, for a regression problem we typically attempt to minimize the MSE of the estimate. 00:50:25.000 --> 00:50:29.000 So for simplicity, we could denote this as (1/n)(y minus capital H_J) squared. 00:50:29.000 --> 00:50:43.000 This is at the J plus one step. So if we took the negative gradient of this with respect to the estimate capital H_J, we end up with (2/n) 00:50:43.000 --> 00:50:48.000 times (y minus capital H_J), which from our earlier approximation is (2/n) times little h_{J+1}. 00:50:48.000 --> 00:51:02.000 So what we're saying here is gradient boosting is, roughly speaking, a gradient descent algorithm in some sense. 00:51:02.000 --> 00:51:08.000 So that's where the gradient part comes from in gradient boosting. 00:51:08.000 --> 00:51:14.000 Okay. 00:51:14.000 --> 00:51:21.000 So Zack is saying the gradient boosting method reminds me of a Taylor series. Is that an imagined connection?
00:51:21.000 --> 00:51:34.000 So, like with a Taylor series, you're approximating by adding additional polynomials, and here you're saying in some sense you're doing a similar thing, is that what you're trying to say? 00:51:34.000 --> 00:51:43.000 Yeah, you're adding additional higher-power polynomials, and the terms are decided by the derivatives. 00:51:43.000 --> 00:51:54.000 Yeah. And each one is a smaller correction. 00:51:54.000 --> 00:51:55.000 Okay. 00:51:55.000 --> 00:52:00.000 That's possible. I would have to sit down and think about it more than my brain is able to do right now. 00:52:00.000 --> 00:52:01.000 Yeah. 00:52:01.000 --> 00:52:04.000 Okay, okay. 00:52:04.000 --> 00:52:06.000 Okay. So with gradient boosting in mind, the next thing we're gonna learn is extreme 00:52:06.000 --> 00:52:15.000 gradient boosting, which is what XGBoost stands for. 00:52:15.000 --> 00:52:23.000 So, a reminder about gradient boosting: there's sklearn's gradient boosting regressor. 00:52:23.000 --> 00:52:30.000 This is, just like we said, iteratively training weak learners by using the next weak learner to predict the current weak learner's residuals, or errors. 00:52:30.000 --> 00:52:41.000 So what is XGBoost, if we already have a perfectly good implementation of gradient boosting? Why do we need another one? 00:52:41.000 --> 00:52:50.000 So XGBoost is a very popular package for gradient boosting in Python. It stands for extreme gradient boosting.
00:52:50.000 --> 00:53:08.000 This particular package, yeah, at least when I wrote this, it was used a lot in winning data science competitions, which is probably why it became so popular. Which I think is also why AdaBoost became so popular whenever it was introduced, because it was used to win a lot of Kaggle competitions, and typically whatever is 00:53:08.000 --> 00:53:21.000 winning the Kaggle competitions picks up in data science circles. So before we dive in, XGBoost is not a package that typically comes installed in, like, an Anaconda distribution of Python, I believe. 00:53:21.000 --> 00:53:33.000 So you'll need to install it. So to do that, you can follow instructions here for both the conda or the pip version, and it's been a while since we talked about installing a package. 00:53:33.000 --> 00:53:45.000 If you're unsure of how to install a Python package, I believe we have instructions on the data science boot camp website that you can get to through the first steps button. 00:53:45.000 --> 00:53:52.000 This is an outdated line, because I think the M1 is probably fine, but I now know that there's also an M2. 00:53:52.000 --> 00:54:01.000 So if you have an M2 chip, the standard instructions may not work for you. And also if you have an M1 chip in a Mac, it's possible that they don't work for you. 00:54:01.000 --> 00:54:14.000 So you will probably need to do a web search to find relevant instructions if you're unable to install and you think it's because you have either the Apple M2 chip or possibly the Apple M1 chip. 00:54:14.000 --> 00:54:23.000 So why might we use XGBoost? So XGBoost's code for fitting boosting models is faster than the sklearn version. 00:54:23.000 --> 00:54:30.000 It also tends to outperform the sklearn version. So it's a slight modification of the gradient boosting algorithm that I don't quite remember, because I never dove deeply into it.
00:54:30.000 --> 00:54:41.000 But you can check out the documentation if you'd like to see what they're doing to improve upon the standard gradient boosting algorithm. 00:54:41.000 --> 00:54:46.000 I believe there's an extra step: where we were sort of approximating with the first gradient, I think XGBoost is maybe approximating with the second derivative as well. 00:54:46.000 --> 00:55:01.000 But I'd have to dive into the documentation to remember. The big takeaway is that it's a faster version of gradient boosting that tends to outperform regular gradient boosting. 00:55:01.000 --> 00:55:11.000 It also offers the ability to train the model in parallel, which at the time of writing this notebook, sklearn did not do for regular gradient boosting. 00:55:11.000 --> 00:55:17.000 So that's another reason why you might want to use it. 00:55:17.000 --> 00:55:23.000 So we're gonna use the same exact data set from the previous notebook to show you how to do everything that we did in the last notebook using the XGBoost version of gradient boosting. 00:55:23.000 --> 00:55:41.000 They have a couple different ways to implement things. We're gonna do the way that is most similar to sklearn, but that's sort of just scratching the surface of the capabilities of what XGBoost can do. 00:55:41.000 --> 00:55:47.000 So if you're really interested in this and want to use it, I encourage you to dive into the documentation and look it up there. 00:55:47.000 --> 00:56:01.000 So we're gonna import it, and remember, this won't run if you haven't installed it. And then another thing we should check, it's been a while since I've updated: 00:56:01.000 --> 00:56:16.000 my version is 1.7.4, and they're probably beyond that. Let's see. 00:56:16.000 --> 00:56:19.000 I'm not seeing where the most recent version is, but they're probably beyond that. I think I installed this a little over two years ago.
00:56:19.000 --> 00:56:31.000 So if there's something in the code that's working for me but doesn't work for you, it's probably because there's a different version. 00:56:31.000 --> 00:56:32.000 Okay, so Brooke says he has 1.7.5, so maybe I'm not that far behind. 00:56:32.000 --> 00:56:38.000 Okay. 00:56:38.000 --> 00:56:50.000 So how do we create an XGBoost version? It basically follows the same workflow. This version of their algorithm follows the same workflow as sklearn. 00:56:50.000 --> 00:57:05.000 So you first create a model object. So you do xgboost.XGBRegressor, and I'll point out we could have just imported XGBRegressor directly, but I decided not to. 00:57:05.000 --> 00:57:12.000 I don't know why. So then we set the learning rate. And this is going to mimic 00:57:12.000 --> 00:57:25.000 this plot. So I'm just showing you how to make this plot, but now with XGBoost. So the learning rate will be 0.1, 00:57:25.000 --> 00:57:33.000 my maximum depth will be one, and then the number of estimators will be 10. 00:57:33.000 --> 00:57:42.000 Then you would just fit it like normal, so .fit(X.reshape(-1, 1), y). 00:57:42.000 --> 00:57:47.000 And then I'm gonna copy and paste just like I did in the last notebook, and then I'm going to change my learning rate from 0.1 to one. 00:57:47.000 --> 00:57:59.000 This will be my bigger learning rate, and then .fit(X.reshape(-1, 1), 00:57:59.000 --> 00:58:03.000 y). Okay. 00:58:03.000 --> 00:58:10.000 And here's what the plots look like. And so if we were to reference back, 00:58:10.000 --> 00:58:15.000 they're basically the same, right? They're slightly different, but they're more or less the same. 00:58:15.000 --> 00:58:24.000 So that's XGBoost versus sklearn. Okay. 00:58:24.000 --> 00:58:36.000 So.
A nice feature of XGBoost, in addition to the stuff I told you about before, that it's faster and typically has better accuracy or MSE, 00:58:36.000 --> 00:58:45.000 is that it will automatically record validation set performance as it's going, as opposed to needing to use something like staged_predict. 00:58:45.000 --> 00:58:51.000 So here I make my validation set. And what you'll do is you define the regressor like you would, 00:58:51.000 --> 00:59:04.000 and then when you call fit, in addition to X and y, you can provide this argument called eval_set. 00:59:04.000 --> 00:59:09.000 So e-v-a-l underscore set. And to that you provide a list. 00:59:09.000 --> 00:59:20.000 And in that list, you'll have tuples. Those tuples will have whatever sets you'd like to get the performance on. 00:59:20.000 --> 00:59:30.000 So since we only have a single validation set, you would do a tuple with first the features, followed by the y values. 00:59:30.000 --> 00:59:36.000 Okay. And then sometimes, you know, people maybe have more than one set they'd like to get their performance on, and then they would provide that, like you could provide 00:59:36.000 --> 00:59:45.000 another X, another y. Okay. 00:59:45.000 --> 00:59:48.000 But we only have the one. 00:59:48.000 --> 00:59:53.000 And so you can see here all of this is printed to the screen. We're provided the validation set RMSE. 00:59:53.000 --> 01:00:02.000 And now, you know, we have this on here, but you might reasonably be like, well, am I gonna have to go through and copy and paste everything? 01:00:02.000 --> 01:00:09.000 No, you're not. So you just do, what did I call this? xgb_reg, 01:00:09.000 --> 01:00:26.000 so xgb_reg.evals_result(), and this creates a dictionary. And within that dictionary is gonna be the stuff from all of your different validation sets, but because I only have a single validation set,
01:00:26.000 --> 01:00:30.000 I only have validation_0. 01:00:30.000 --> 01:00:36.000 Okay, so we can access the stuff there 01:00:36.000 --> 01:00:46.000 at validation_0. So this is also a dictionary. We can look at the keys for it. 01:00:46.000 --> 01:00:50.000 And the only key here is the RMSE. So we can get the RMSE 01:00:50.000 --> 01:01:02.000 by doing square brackets with a string. And so now we have the root mean squared errors for all of the different numbers of weak learners. 01:01:02.000 --> 01:01:04.000 And so we could use this in our plot. 01:01:04.000 --> 01:01:19.000 So here I've plotted that, and then I provide the minimum. Okay, so for this version of it, the minimum number is somewhere between 200 and 300. 01:01:19.000 --> 01:01:22.000 But you can kind of see there's really not that much of a difference in performance once you get past 100, but this is the minimum 01:01:22.000 --> 01:01:39.000 for the weak learners. Another nice thing is we don't have to do that weird warm_start loop thing. 01:01:39.000 --> 01:01:48.000 You can just provide a number of rounds for early stopping. So here I'm defining my XGBoost regressor again. 01:01:48.000 --> 01:01:55.000 But now when I call fit, I first put in my training data, which I just called X, 01:01:55.000 --> 01:02:04.000 .reshape(-1, 1), then I provide my y. Then you can provide this argument early_stopping 01:02:04.000 --> 01:02:22.000 _rounds, and we'll set this equal to 10 like we did in the last notebook. And when you do this you have to provide an evaluation set, and so I'm gonna cheat and just copy and paste the previous one. 01:02:22.000 --> 01:02:32.000 So, you have to provide at least one validation set.
I actually don't know what it would do if you provide more than one, like how it would determine early stopping from that. 01:02:32.000 --> 01:02:46.000 But we can scroll down and see that it stopped: instead of going all 500, it stopped at 229, and we can see that in this plot as well. 01:02:46.000 --> 01:02:56.000 And so then here I'm just redefining the model with the optimal number of weak learners according to this validation set. 01:02:56.000 --> 01:03:05.000 And then this is what it looks like. So this is really just a surface-level introduction to the XGBoost package, but I think it's good 01:03:05.000 --> 01:03:17.000 to look at and see, okay, it already has some nice features over scikit-learn: you don't have to do warm start for early stopping, 01:03:17.000 --> 01:03:21.000 you don't have to do staged_predict for the validation set. All of that is already built into just the regular model. 01:03:21.000 --> 01:03:33.000 And you can go to the documentation page and see all of these different tutorials; it looks like you can use a GPU, 01:03:33.000 --> 01:03:38.000 you can use something called PySpark. So this is maybe worth looking into if you would like to learn 01:03:38.000 --> 01:03:59.000 more about this particular model. 01:03:59.000 --> 01:04:12.000 Hi, I'm just confused about the n_estimators and these arguments that you're putting into your model. If it's a regressor, then, because I thought these represent, you know, the number of trees 01:04:12.000 --> 01:04:23.000 or the number of decision boundaries, how does that make sense in the context of a regressor? 01:04:23.000 --> 01:04:48.000 So a decision tree regressor still searches 01:04:48.000 --> 01:04:57.000 through the features. Here we only have a single feature, and it finds a cut point that minimizes something like the MSE instead of the impurity.
01:04:57.000 --> 01:05:12.000 And so here it's determined that the cut point should happen here, and then in any node, its prediction for that node will just be the average value of y within that node. 01:05:12.000 --> 01:05:23.000 So to the left of the cut point, we are predicting the average value of all of these observations, and then to the right of the cut point, we're taking the average value of all these observations. 01:05:23.000 --> 01:05:33.000 And so that's how a decision tree regressor works. Now you could include more depth if you'd like, but because this is a weak learner, we're just using one. 01:05:33.000 --> 01:05:38.000 And then the way this works is the number of estimators is the number of decision stumps you're using. 01:05:38.000 --> 01:05:43.000 So on the left is each individual decision stump, but on the right is our final prediction for y. 01:05:43.000 --> 01:05:55.000 And it's just the addition of all these decision stumps together. So at 2 weak learners, it's little h 1 plus little h 2. 01:05:55.000 --> 01:05:59.000 So it's this curve plus this curve, and that's how you end up with this. 01:05:59.000 --> 01:06:10.000 And then it just keeps going. So at the end, for this example, if we made it all the way, it would be h 1 plus h 2 plus h 3, 01:06:10.000 --> 01:06:22.000 all the way up to h 500. But here we stopped at 229 because of the early stopping. 01:06:22.000 --> 01:06:25.000 Thanks. 01:06:25.000 --> 01:06:39.000 Any other questions? 01:06:39.000 --> 01:06:44.000 Okay. 01:06:44.000 --> 01:06:58.000 So our last notebook is voter models, and this is another type of ensemble model. We're gonna go back to classification; I think I just like the visuals for classification a little bit better. 01:06:58.000 --> 01:07:10.000 So we're going back to our classification setting.
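Going back to the stump question for a moment: the answer above, scan cut points on the single feature, predict the mean of y on each side, keep the cut with the lowest squared error, can be sketched from scratch. This is a toy illustration with invented data, not the actual tree-fitting code scikit-learn or XGBoost uses.

```python
# A from-scratch sketch of a depth-1 regression tree (a "decision stump") on
# one feature: try each midpoint between sorted x values as a cut point,
# predict the mean of y on each side, and keep the cut with the lowest SSE.
def fit_stump(x, y):
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    for i in range(1, len(xs)):
        cut = (xs[i - 1] + xs[i]) / 2
        left, right = ys[:i], ys[i:]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, cut, lm, rm)
    _, cut, lm, rm = best
    return lambda v: lm if v < cut else rm  # predict a node mean on each side

# Toy data: low y values on the left, high y values on the right.
stump = fit_stump([1, 2, 3, 10, 11, 12], [1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
print(stump(2), stump(11))  # roughly the left mean and the right mean
```

A boosted ensemble then adds up many such stumps, h1 + h2 + ..., each one fit to what the previous ones got wrong.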
The idea behind a voter model is that you have a few models, maybe after you go through and fit a couple of different algorithms. 01:07:10.000 --> 01:07:19.000 So in this classification setting, let's say you have a logistic regression you're happy with, a k nearest neighbors you're happy with, 01:07:19.000 --> 01:07:27.000 a support vector machine you're happy with, and maybe a random forest model you're happy with. So the voting classifier 01:07:27.000 --> 01:07:36.000 is then going to combine all of these together, and to make its predictions, it's going to vote, 01:07:36.000 --> 01:07:42.000 using the individual classifiers as the voters. So for instance, a voting classifier of these 4 models will ask, for an individual observation: what does the logistic regression model say? 01:07:42.000 --> 01:07:57.000 What does the k nearest neighbors model say? What does the support vector machine model say? And what does the random forest model say? And then it will just take, 01:07:57.000 --> 01:08:03.000 in hard voting, which we'll talk about in a second, the majority class; and if there's a tie, I think it just randomly decides. 01:08:03.000 --> 01:08:18.000 So that's what's called a voting model. And then the regression version, I think, would just take the average of the predictions instead of doing a vote. 01:08:18.000 --> 01:08:27.000 So that's what we mean when we say voter models. Before I show you how to fit everything in scikit-learn, 01:08:27.000 --> 01:08:57.000 are there any questions about how a voter model works? Is that clear enough? Feel free to ask if it's still unclear and I can try to explain it again.
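The hard-voting step just described can be sketched in a few lines: each fitted model casts one vote per observation, and the majority label wins. The random tie-breaking below follows the "I think it just randomly decides" guess above; the library's actual tie-breaking rule may differ.

```python
from collections import Counter
import random

def hard_vote(votes):
    """Majority vote over one observation's predicted labels."""
    counts = Counter(votes)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    # Unique majority: return it.  Tie: pick a winner at random.
    return winners[0] if len(winners) == 1 else random.choice(winners)

# Say logistic regression predicts 1, knn predicts 0, svm 1, random forest 1:
print(hard_vote([1, 0, 1, 1]))  # prints 1
```

So the ensemble's prediction is just whichever class most of the base classifiers agreed on.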
01:08:59.000 --> 01:09:14.000 Yes, you would have to fit all these models, but the idea is that in the course of trying these different models, you would have already fit them, so you've already invested the time into checking these models out, 01:09:14.000 --> 01:09:21.000 so you've already spent the time fitting them, if that makes sense. You wouldn't, right out of the gate, 01:09:21.000 --> 01:09:25.000 say, alright, the first model I'm gonna try is a voter model that uses 5 different models. 01:09:25.000 --> 01:09:37.000 Typically a voter model would be something you try after you're already happy with a series of other models. 01:09:37.000 --> 01:09:47.000 Okay, so the reason why you might think this would be good is that these models are, in some sense, different from one another. 01:09:47.000 --> 01:09:54.000 And so it's possible that the errors they make are also different from one another. The types of errors the logistic regression model might be making are maybe different from the other 3. 01:09:54.000 --> 01:10:14.000 And so then the hope is that by combining them all together in a voter model, the individual errors that any one model makes will hopefully be canceled out by the voting process. 01:10:14.000 --> 01:10:20.000 So that's the hope of why we're doing a voting classifier. 01:10:20.000 --> 01:10:37.000 Okay. So again, I wanted to keep it simple with these algorithms and stick with one basic data set, where we've got the upper left hand corner of a square and then the bottom right hand corner of a square.
01:10:37.000 --> 01:10:45.000 So the way this works in scikit-learn is, just like Jacob alluded to with his question, 01:10:45.000 --> 01:10:54.000 you need both the base classifiers and the voting classifier object. So for us, we're gonna test my memory to import everything. 01:10:54.000 --> 01:11:04.000 So from sklearn.linear_model we'll import LogisticRegression. 01:11:04.000 --> 01:11:14.000 From sklearn.neighbors we'll import KNeighborsClassifier. 01:11:14.000 --> 01:11:26.000 From sklearn.svm we'll import LinearSVC. 01:11:26.000 --> 01:11:32.000 From sklearn 01:11:32.000 --> 01:11:44.000 .ensemble we'll import RandomForestClassifier. And then for the voting classifier, we would do from 01:11:44.000 --> 01:11:54.000 sklearn.ensemble import VotingClassifier. And then at the end, I'm just gonna import my accuracy score. 01:11:54.000 --> 01:11:57.000 So, 01:11:57.000 --> 01:12:02.000 from sklearn.metrics import accuracy_score. I'm sure watching me type all this was very exciting. 01:12:02.000 --> 01:12:18.000 Okay. So when you make a voter model, you do not have to train the individual models on their own. 01:12:18.000 --> 01:12:27.000 But what I'm doing here is I would like to show you a comparison between the individual models and the voter model. 01:12:27.000 --> 01:12:34.000 So what I'm gonna do is go through each of these 4 model types and make one of each. 01:12:34.000 --> 01:12:43.000 And then when you do the voter model, you have to make a fresh version, and hopefully that will become clear as I do it. 01:12:43.000 --> 01:12:52.000 So I'm first gonna start off by making model objects of all 4 model types, and I have my names laid out for me here. 01:12:52.000 --> 01:13:02.000 So I'm gonna do log equal to LogisticRegression, and I'll make a note: 01:13:02.000 --> 01:13:08.000 I've kept in the
01:13:08.000 --> 01:13:16.000 regularization. It doesn't necessarily matter; I've just decided to keep it in here for no particular reason. 01:13:16.000 --> 01:13:25.000 Okay. knn is gonna be equal to KNeighborsClassifier 01:13:25.000 --> 01:13:34.000 and let's do 7 neighbors. Again, this is just for demonstration purposes. 01:13:34.000 --> 01:13:44.000 Typically you would have gone through and done some sort of cross-validation for these models. 01:13:44.000 --> 01:13:53.000 svm is gonna be equal to LinearSVC, and let's say C equal to 10. 01:13:53.000 --> 01:14:03.000 And then finally, rf will be RandomForestClassifier; let's say max depth equal to 01:14:03.000 --> 01:14:10.000 3 and number of estimators equal to, I don't know, 250. Okay, so these are going to be just for comparison purposes. 01:14:10.000 --> 01:14:17.000 So I'm just going to keep these so I can finally compare them to the voting classifier. 01:14:17.000 --> 01:14:39.000 Now for the voting classifier, you have to put in fresh, never-been-fitted-before versions 01:14:39.000 --> 01:14:44.000 of the classifiers. So you're gonna put in a list, very similar to a pipeline. 01:14:44.000 --> 01:14:50.000 You put in a list of tuples. So the first one will be "log" and then 01:14:50.000 --> 01:14:54.000 a LogisticRegression. 01:14:54.000 --> 01:15:06.000 Next will be the "knn", and that will be a KNeighborsClassifier with 7 neighbors. 01:15:06.000 --> 01:15:14.000 Then we'll have the support vector machine, which will just be a LinearSVC with C equal to 10. 01:15:14.000 --> 01:15:29.000 Then we're gonna have the random forest, which again I'm gonna copy and paste. 01:15:29.000 --> 01:15:40.000 You'll also notice that I have this argument here, voting equals hard.
So this means that voting works the way you normally think of voting: just counting votes. 01:15:40.000 --> 01:15:48.000 Another argument you can have is voting equals soft, and I'll talk about that after we go through this example. 01:15:48.000 --> 01:16:04.000 Okay, so what this loop is gonna do is loop through these lists of the name and the model type; it will fit the model, predict with the model, and then print out the accuracy. Be mindful that this is the training set. 01:16:04.000 --> 01:16:09.000 So, you know, ultimately, if it does well here, it doesn't really mean much, right? We'd want to check 01:16:09.000 --> 01:16:15.000 performance with cross-validation. 01:16:15.000 --> 01:16:24.000 Okay, so now we'll just go through and compare. So this was the logistic regression. 01:16:24.000 --> 01:16:31.000 This was the random forest. 01:16:31.000 --> 01:16:38.000 This was the support vector machine. 01:16:38.000 --> 01:16:44.000 This was the k nearest neighbors. 01:16:44.000 --> 01:16:48.000 And this was the voting classifier. So you can kind of see, if we go through and compare, how the voting classifier's boundary is 01:16:48.000 --> 01:17:01.000 sort of determined by an averaging of the previous boundaries. So if you notice, in the random forest and the k nearest neighbors, this boundary 01:17:01.000 --> 01:17:05.000 was blue, right? It was blue. 01:17:05.000 --> 01:17:18.000 It was very blue, and so you can kind of see how that carries over from the other 2 models, and how some of the intrusions into the other side of the actual boundary, right, 01:17:18.000 --> 01:17:22.000 also come from these other models. So that's sort of the idea. So I see that I have a question: 01:17:22.000 --> 01:17:31.000 why do we have to put in classifiers that are not previously fit? Does the new fit not override the previous one? 01:17:31.000 --> 01:17:41.000 So I didn't want the fits to be overridden.
So in this particular example, I wanted these to be fit independently of the voter model, which wouldn't have made a difference for anything but the random forest classifier, right? 01:17:41.000 --> 01:17:51.000 But I wanted them to be fit independently. So I didn't want a fitted base model to be impacted when I then went to fit this model. 01:17:51.000 --> 01:18:02.000 So when I fit the voter model, it would then go through and refit all of these individual models. 01:18:02.000 --> 01:18:13.000 So if I had put log here instead of LogisticRegression, when I call fit on the voter model, it would have refit log. 01:18:13.000 --> 01:18:21.000 And I'm not entirely sure, I'd have to double check, whether I'm able to access the individual models within the voting classifier and get the predictions that way. 01:18:21.000 --> 01:18:31.000 I can't remember; I'd have to check. 01:18:31.000 --> 01:18:39.000 Okay, so I mentioned this hard versus soft. So hard voting, which is the argument that I had, 01:18:39.000 --> 01:18:51.000 is just how you think voting works: we count up the number of yeses and the number of nos, and whichever one has more is the winner. 01:18:51.000 --> 01:18:59.000 So that's what hard voting means. For example, if 3 out of 4 say we're gonna have a 1, that means you get a 1. 01:18:59.000 --> 01:19:03.000 Now if there's a tie, I'm pretty sure it's just randomly decided. 01:19:03.000 --> 01:19:16.000 So that's how I think ties work. And then the other option is voting equals soft, which means that the predictions are weighted according to the probabilities. 01:19:16.000 --> 01:19:29.000 So for instance, instead of doing a hard vote, for each possible class you're going to sum up the probabilities across the different voter models.
01:19:29.000 --> 01:19:33.000 So here it'd be from 1 to 4. 01:19:33.000 --> 01:19:43.000 And so you'd get the probability from logistic regression, the probability from k nearest neighbors, the probability from the support vector machine, and the probability from the random forest, and then whichever class has the highest sum of probabilities would be the class that gets predicted. 01:19:43.000 --> 01:19:58.000 In either case, you can also perform weighted voting, for which you have to provide an argument of weights. 01:19:58.000 --> 01:20:03.000 And so a typical thing you could do, I think, is, well, never mind, I actually don't know. 01:20:03.000 --> 01:20:09.000 But you could provide weights. So I think these would be weights on the individual models. 01:20:09.000 --> 01:20:10.000 So you could maybe do something like 01:20:10.000 --> 01:20:15.000 weighting by the performance. 01:20:15.000 --> 01:20:22.000 So if one of your models has, like, a 99% accuracy, maybe you weight it more highly than the others that are lower. 01:20:22.000 --> 01:20:34.000 As a quick note, if you use voting equals soft, it won't work if a particular algorithm doesn't have the ability to provide a probability. 01:20:34.000 --> 01:20:44.000 So I believe in this example, if we were to change this to voting equals soft, 01:20:44.000 --> 01:20:55.000 we should eventually get an error. And so why do we get an error?
That's because LinearSVC doesn't provide predict_proba with the default arguments. 01:20:55.000 --> 01:21:08.000 I know in the general support vector machine there is an argument for it; I'm not sure if there is one for LinearSVC, so I'd have to check the documentation to see what argument I'd want to use to make predict_proba available. 01:21:08.000 --> 01:21:18.000 Okay, so that's an important thing to remember with soft voting: if you want to use it, the algorithms you use as your voters had better have the ability to provide probabilities. 01:21:18.000 --> 01:21:38.000 For voter models for regression, you can do an ensemble of independent regression models, and then I think it's just the average of whatever the models predict, or a weighted average, depending on whether you're weighting by maybe inverse MSE or something like that. 01:21:38.000 --> 01:21:51.000 I want to point out for this one in particular: this does not mean you build a bunch of different linear regression models with slightly different features and then feed those into the voter model. 01:21:51.000 --> 01:22:00.000 Those would all still be pretty dependent and might make the same types of errors. The idea behind voter models is that you're getting models that are sort of fundamentally different from one another. 01:22:00.000 --> 01:22:09.000 If they're making predictions in the same type of way, the various models are probably gonna make the same types of errors, 01:22:09.000 --> 01:22:22.000 and in the voting, that's just gonna get compounded. The idea with the voter model is: this model makes these types of errors, this model makes slightly different types of errors, and the third model makes even other slightly different types of errors. 01:22:22.000 --> 01:22:30.000 And then the hope is that by voting together, the individual errors will get wiped out for an overall better performance.
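As a sketch of the two combination rules described above, soft (optionally weighted) voting for classification and inverse-MSE-weighted averaging for regression, here's a pure-Python illustration. The probabilities, predictions, and validation MSEs are invented numbers; this mirrors the idea, not scikit-learn's exact code.

```python
def soft_vote(prob_vectors, weights=None):
    """Sum each class's probability across models; predict the argmax class."""
    if weights is None:
        weights = [1.0] * len(prob_vectors)
    n_classes = len(prob_vectors[0])
    scores = [sum(w * p[k] for w, p in zip(weights, prob_vectors))
              for k in range(n_classes)]
    return scores.index(max(scores))

def average_predict(preds, val_mses=None):
    """Regression voter: plain average, or weight each model by 1/MSE."""
    if val_mses is None:
        return sum(preds) / len(preds)
    weights = [1.0 / m for m in val_mses]
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

# Four classifiers' [P(class 0), P(class 1)] for one observation:
probs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55], [0.48, 0.52]]
print(soft_vote(probs))                          # class 0: it wins on summed
                                                 # probability even though it
                                                 # loses the hard vote 3 to 1
print(soft_vote(probs, weights=[0.2, 1, 1, 1]))  # down-weighted -> class 1

# Three regressors' predictions, weighted by inverse validation MSE:
print(average_predict([10.0, 12.0, 11.0]))                   # plain mean, 11.0
print(average_predict([10.0, 12.0, 11.0], val_mses=[1, 4, 2]))
```

Notice that soft voting can disagree with hard voting when one model is very confident, which is exactly why the probability estimates have to be available.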
01:22:30.000 --> 01:22:41.000 So for example, you could do a voter model with linear regression, k nearest neighbors regression, a support vector regressor, and a random forest regressor. Okay. 01:22:41.000 --> 01:22:47.000 So I see I have some questions. 01:22:47.000 --> 01:22:50.000 Okay, so the only question is from Pedro, who is asking: could you use boosting with voter models? 01:22:50.000 --> 01:22:58.000 Yes, you could put an AdaBoost classifier in. 01:22:58.000 --> 01:23:13.000 You could do that as well; just be mindful that it might make similar predictions to a decision tree or a random forest, because it's using a decision tree as its base. 01:23:13.000 --> 01:23:28.000 Are there other questions about voter models? 01:23:28.000 --> 01:23:34.000 Okay, so that's it for ensemble learning. Before we sign off for today, 01:23:34.000 --> 01:23:42.000 I wanna make a quick note. We're just gonna have the last lecture day tomorrow. 01:23:42.000 --> 01:23:50.000 And for this last day, we're gonna do neural network stuff. So the package we're using is the package I know, which is Keras. 01:23:50.000 --> 01:23:57.000 There are other packages, but we're gonna use Keras. This is also not installed 01:23:57.000 --> 01:24:02.000 by default, so I just want you to be aware: if you want to be ready for the problem session tomorrow, problem session 11, and for live lecture day 12, 01:24:02.000 --> 01:24:13.000 you want to go through and make sure you have Keras installed and can import things. So this could be a good one to look at. 01:24:13.000 --> 01:24:17.000 As you're getting ready, try to follow these instructions to make sure you have it installed.
01:24:17.000 --> 01:24:29.000 Keras is also a package within TensorFlow, so you may have to install it through TensorFlow depending on your computer. 01:24:29.000 --> 01:24:30.000 Okay, so that's gonna be it for today. I will hang around for any questions. 01:24:30.000 --> 01:24:46.000 Remember to check out this Keras thing for tomorrow. But if not, I will see you tomorrow, and have a great rest of your Wednesday.