WEBVTT 00:00:06.000 --> 00:00:22.000 Lecture for the 2023 May data science boot camp from the Erdős Institute. So today we're going to wrap up ensemble learning by learning about boosting and then at the very end voter models, and then tomorrow we'll do a brief introduction to neural networks. 00:00:22.000 --> 00:00:26.000 If you do the problem session tomorrow, you'll also do a brief introduction on your own within the groups to neural networks. 00:00:26.000 --> 00:00:30.000 So show up tomorrow if you want to start to learn a little bit about neural networks before the live lecture. 00:00:30.000 --> 00:00:45.000 So all of the notebooks we'll cover today are in the supervised learning folder, and then within the ensemble learning folder there, we're going to go through notebooks 4, 5, 6, 7, and 8. 00:00:45.000 --> 00:00:48.000 So this might seem like a lot, but some of these notebooks are kind of short, so there's time for questions. 00:00:48.000 --> 00:01:00.000 Probably the longest one is the XGBoost one, so once we get past that, it'll be smooth sailing. 00:01:00.000 --> 00:01:04.000 Okay, so the first thing we're going to talk about is in notebook number 4, boosting. 00:01:04.000 --> 00:01:16.000 So there's no coding here; it's just a conceptual notebook. One of the most successful approaches across all of ensemble learning is known as boosting. 00:01:16.000 --> 00:01:23.000 So boosting takes advantage of a concept known as weak learnability. Let me zoom in so maybe it's easier to read. 00:01:23.000 --> 00:01:25.000 So the way it works is it comes from a subfield of statistical learning, which is a particular branch of machine learning. 00:01:25.000 --> 00:01:46.000 It's a subfield called PAC learnability. So PAC here stands for probably approximately correct, which is a formal definition, almost like a delta-epsilon argument,
00:01:46.000 --> 00:01:52.000 if you remember your Calc One. So if you're interested in learning more about what probably approximately correct means, you can click on this link and learn more. 00:01:52.000 --> 00:02:01.000 For just a broad overview of what's going on, we're going to do a very brief dive into what the theory essentially is. 00:02:01.000 --> 00:02:09.000 And then if you're inclined, I have references throughout all the boosting notebooks that you might be interested in. 00:02:09.000 --> 00:02:15.000 So this idea for boosting comes from the notions of weak learners and strong learners. 00:02:15.000 --> 00:02:26.000 So as a vague definition, we would say that a statistical learning algorithm, which all of our algorithms up to this point are, is referred to as a weak learner 00:02:26.000 --> 00:02:33.000 if it does slightly better than random guessing. So think of it like we slightly outperform flipping a coin. 00:02:33.000 --> 00:02:41.000 Here "slightly" has a more formal meaning, but just take it vaguely as, okay, we're doing a little bit better than just random guessing. 00:02:41.000 --> 00:02:50.000 We would then say that an algorithm is called a strong learner if you could make it as close to the true relationship as you'd like, 00:02:50.000 --> 00:03:01.000 you know, assuming you have enough training observations and computational power. So obviously of these two, it's a lot easier to make a weak learner, right? 00:03:01.000 --> 00:03:06.000 So we can make a weak learner with a very simple algorithm. 00:03:06.000 --> 00:03:20.000 And it's much harder in general to make a strong learner, but it has been shown, at least theoretically, that if you can show that a particular problem is weakly learnable, meaning that a weak learner exists,
00:03:20.000 --> 00:03:30.000 so there is an algorithm out there that can be defined as a weak learner, then it is also strongly learnable, meaning that a strong learner exists. 00:03:30.000 --> 00:03:42.000 Now, does this mean that you're going to be able to find such algorithms? Not necessarily, but you know that if you can provide an algorithm that does at least a little bit better than random guessing, 00:03:42.000 --> 00:03:47.000 then in theory, somewhere out there in the world of algorithms should be something that is a strong learner. 00:03:47.000 --> 00:04:03.000 And so this theorem led to a bunch of ways to try and construct strong learners. And basically this is done by making an ensemble, so ensemble learning, an ensemble of weak learners. 00:04:03.000 --> 00:04:14.000 So basically the idea for all boosting algorithms is we're going to take a bunch of weak learners, find some creative way to combine them together, and then the hope is that we're going to get a strong learner out of that. 00:04:14.000 --> 00:04:23.000 So for us, a very common weak learner that we're going to use is known as a decision stump. 00:04:23.000 --> 00:04:27.000 So this is a decision tree with a single cut point, and they call it a stump, right, because it's just one decision. 00:04:27.000 --> 00:04:32.000 So it's like, you know, if you cut off the top of a tree, you're left with the stump. 00:04:32.000 --> 00:04:40.000 So we're going to use decision stumps. You can use other algorithms as your weak learner, but we're going to use decision stumps in these notebooks. 00:04:40.000 --> 00:04:45.000 So we're going to look at two specific boosting algorithms. The first one is called adaptive boosting, or AdaBoost. 00:04:45.000 --> 00:04:52.000 The second one is called gradient boosting. So we're actually going to look at two implementations of
00:04:52.000 --> 00:04:58.000 gradient boosting, one through scikit-learn and one through a different Python package that we'll touch on later. 00:04:58.000 --> 00:05:08.000 So before we dive into covering the specific algorithms, are there any questions on just the general idea behind boosting? 00:05:08.000 --> 00:05:19.000 Not the general idea, but are we going to be touching on XGBoost at all? 00:05:19.000 --> 00:05:20.000 Okay. 00:05:20.000 --> 00:05:26.000 Yep, yep. So that's the second gradient boosting notebook. Yeah, so in XGBoost, the X stands for extreme and the G stands for gradient. 00:05:26.000 --> 00:05:33.000 What's an example of a weak learner? Like, would a random forest be an example of a weak learner? 00:05:33.000 --> 00:05:56.000 Yeah, so the most common weak learner is a decision stump. So it's just like, you know, you look at the data, you make one cut point, and oftentimes that can be better than just flipping a coin. 00:05:56.000 --> 00:06:02.000 All right, so let's dive into some of these algorithms. The first one is called AdaBoost, 00:06:02.000 --> 00:06:17.000 which is short for adaptive boosting. And so the main idea with AdaBoost is you're going to train a series of weak learners, and then each subsequent learner is going to pay more attention to the things that the last ones got wrong. 00:06:17.000 --> 00:06:23.000 So that's kind of the idea, and here's more formally how it actually works. 00:06:23.000 --> 00:06:33.000 So the idea is nice; how does it actually work? So the very first weak learner, you're just going to train like you normally would train the algorithm. 00:06:33.000 --> 00:06:38.000 Now for us this is going to be a decision stump, but in essence it could be any algorithm you'd like to use. 00:06:38.000 --> 00:06:47.000 You want it to be an algorithm that is quick to train, because you're going to be training a lot of different algorithms as part of your AdaBoost.
00:06:47.000 --> 00:06:59.000 So after you train the first one, the general step is: for weak learner j, the weights on the training samples will be determined by the performance of the previous weak learner. 00:06:59.000 --> 00:07:11.000 So for instance, the second weak learner will pay attention to the training observations in a slightly different way than the previous weak learner, 00:07:11.000 --> 00:07:28.000 depending on how well it did. And then after you train capital W total weak learners, your final prediction will be made by performing a weighted vote among all of the weak learners you've trained, where the weighted vote is determined by each weak learner's accuracy. 00:07:28.000 --> 00:07:44.000 So the more accurate a weak learner is, the more of a vote it's going to get. So we're going to now dive into the formulas, but just remember the key: we're going to first train a regular algorithm like we would normally do. 00:07:44.000 --> 00:07:49.000 And then each subsequent algorithm is going to pay more attention to the points that were previously incorrect. 00:07:49.000 --> 00:07:59.000 And then at the very end, the way we finally make our prediction is by a weighted vote among all of the weak learners. 00:07:59.000 --> 00:08:14.000 So in general, let's assume we have an observation y, with the superscript (i) denoting the class of observation i. So if the tenth row of your data set was a one, 00:08:14.000 --> 00:08:23.000 y^(10) would be one. ŷ_j^(i) will be the prediction for observation i 00:08:23.000 --> 00:08:37.000 of weak learner j.
So, you know, through this we're going to train a bunch of weak learners; the particular prediction that weak learner j is making for observation i is ŷ_j^(i). 00:08:37.000 --> 00:08:47.000 And then w^(i) is the current weight assigned to observation i. So we're going to be going through an iterative process where we update the weights at each step. 00:08:47.000 --> 00:08:55.000 So in general, after you train the j-th weak learner, the learner's weighted error rate is calculated: the sum of the w^(i) over the observations it got wrong, divided by the sum of all the w^(i). 00:08:55.000 --> 00:09:03.000 And so this might look confusing, but basically what you're doing is all of your training observations are going to have weights assigned to them. 00:09:03.000 --> 00:09:10.000 And then in the numerator, you're going to sum up those weights whenever your prediction got something wrong. 00:09:10.000 --> 00:09:22.000 And then you're going to divide by the sum of all of the weights. You could then think of this instead as being one minus a weighted accuracy, where the points are weighted according to the w's. 00:09:22.000 --> 00:09:31.000 This is going to be denoted r_j. So the way we've defined r_j, this is going to be big if the j-th 00:09:31.000 --> 00:09:33.000 learner is bad, and it'll be small when the j-th learner is good, right? 00:09:33.000 --> 00:09:43.000 So why would it be big? If our j-th learner is bad, it's going to have a lot of incorrect predictions. 00:09:43.000 --> 00:09:53.000 So this ŷ is going to be not equal to the regular y, right, quite often. So that means the numerator will be big then. 00:09:53.000 --> 00:10:00.000 And it would be small if the prediction is equal to the actual value often, which means that the numerator would be small. 00:10:00.000 --> 00:10:10.000 Okay, another way to think of it is just this one minus weighted accuracy. So if you have a good weak learner, weighted accuracy is high, which means r_j will be small.
00:10:10.000 --> 00:10:18.000 And then if you have a bad weak learner, weighted accuracy will be bad, meaning r_j will be large. 00:10:18.000 --> 00:10:24.000 Okay, after you've calculated these r_j's, then you compute the weight assigned to that particular weak learner in the final voting process. 00:10:24.000 --> 00:10:42.000 This is determined to be its alpha_j, which is eta times the log, and I believe this is natural log, of (1 - r_j)/r_j. 00:10:42.000 --> 00:10:51.000 So eta is a learning rate; it's called the learning rate of the algorithm. It's a hyperparameter that you would set before you do any of the fitting. 00:10:51.000 --> 00:11:03.000 We can remember what we know about r_j. So alpha_j is small when r_j is large, and remember, r_j large is when the learner is bad. 00:11:03.000 --> 00:11:07.000 So that means bad weak learners will have a small vote, and good weak learners will have a larger vote, because alpha_j will be larger when r_j is small. 00:11:07.000 --> 00:11:21.000 And then after we calculate the r_j and the alpha_j, it's finally time to go through and update these weights. 00:11:21.000 --> 00:11:29.000 So you're going to update them in the following way: the weight stays the same if you correctly predicted, 00:11:29.000 --> 00:11:33.000 and you multiply it by e^(alpha_j) if you incorrectly predicted. This is assuming that the alpha_j's are greater than 0. 00:11:33.000 --> 00:11:45.000 In practical applications, they typically are, but, you know, theoretically, I don't think there's a guarantee that this is 00:11:45.000 --> 00:11:55.000 greater than 0. So I just wanted to point that out. So this is probably confusing, which is why I think it's really helpful to go through just a silly example. 00:11:55.000 --> 00:12:03.000 So here's our silly example. I've got three blue circles, labeled 1, 2, and 3, and then three orange triangles, labeled 4, 5, and 6.
00:12:03.000 --> 00:12:11.000 So for the very first weak learner, every observation has the same weight, so the same w^(i) of one sixth. 00:12:11.000 --> 00:12:19.000 And then let's just say our first decision stump, or whatever weak learner we use, gives us the following rule. 00:12:19.000 --> 00:12:27.000 So the blue shaded region over here means we would predict blue circle, and the orange shaded region means we would predict orange triangle. 00:12:27.000 --> 00:12:35.000 So 1 and 2 are correctly predicted, 4, 5, and 6 are correctly predicted, but 3 is incorrectly predicted. 00:12:35.000 --> 00:12:40.000 So what was the first step? We can go back: after we do that, we have to calculate r_1. 00:12:40.000 --> 00:12:48.000 So we go through for r_1, and it's going to be zeros everywhere except for 00:12:48.000 --> 00:12:56.000 the third entry. Why is that? Because we incorrectly predicted on observation 3, so that's a one over 6, and then the denominator just sums up to one because of our weights. 00:12:56.000 --> 00:13:09.000 So r_1 is just one over 6; that's what it simplifies to. After we calculate the r, we then go and calculate the alpha, which, with eta equal to 1, would give us a log of 5. 00:13:09.000 --> 00:13:15.000 And now we update our weights. So we correctly predicted 1, so the weight stays the same. 00:13:15.000 --> 00:13:19.000 We correctly predicted 2, so the weight stays the same, and then the same thing for 4, 5, and 6. 00:13:19.000 --> 00:13:32.000 The only one we were incorrect on was the third observation, so we're going to multiply the weight there by e^(log 5), because that's what alpha_1 was. 00:13:32.000 --> 00:13:37.000 And so now these are our new updated weights for the second weak learner. 00:13:37.000 --> 00:13:57.000 So let's say then the second weak learner goes and produces the following decision rule: in blue shaded regions we're predicting a blue circle, in orange shaded regions we're predicting an orange triangle.
00:13:57.000 --> 00:14:08.000 Here everything is correct except for 4, and so when we do the calculations for r_2, everything's going to be 0 except for observation 4. 00:14:08.000 --> 00:14:13.000 Alpha_2 is then updated accordingly, and the only weight that we have to change is the weight on observation 4. 00:14:13.000 --> 00:14:19.000 So the previous weight, one over 6, times e^(alpha_2), which is going to give you 3 over 2. 00:14:19.000 --> 00:14:27.000 So if we would stop here, we have two weak learners now; we could stop if we wanted to. 00:14:27.000 --> 00:14:37.000 Weak learner 1 would get a vote worth log of 5, and weak learner 2 would get a vote worth log of 9, and then we would keep going 00:14:37.000 --> 00:14:48.000 until we're happy. Typically this is decided by some sort of cross-validation. 00:14:48.000 --> 00:14:58.000 So someone is asking why the third term is 5 over 6. So the weight for observation 3 becomes 5 over 6 in this step, when we updated the weights. 00:14:58.000 --> 00:15:06.000 So remember, the first weak learner incorrectly predicted observation 3, and so according to the rules we set up here, 00:15:06.000 --> 00:15:12.000 that means the weights are updated to be the previous, or the current, weight times e^(alpha_j). 00:15:12.000 --> 00:15:22.000 So for us that would be alpha_1, and we calculated alpha_1 to be log of 5. And so then when you do one over 6 times e^(log 5), that gives us the 5 over 6. 00:15:22.000 --> 00:15:31.000 Does that make sense? Yeah, yep, it carries on to the next step. 00:15:31.000 --> 00:15:38.000 Any other questions on the setup or this toy example? 00:15:38.000 --> 00:15:47.000 Sorry, how exactly are you getting from the weights to actually creating this decision boundary? 00:15:47.000 --> 00:15:52.000 So this is just a made-up example, but the weights would be considered in the training process.
00:15:52.000 --> 00:15:54.000 So when the algorithm's being fit, the computer will apply the weights to the observations. 00:15:54.000 --> 00:16:01.000 Does that make sense? 00:16:01.000 --> 00:16:06.000 Hmm. 00:16:06.000 --> 00:16:14.000 Not quite sure. 00:16:14.000 --> 00:16:15.000 Oh, I see. 00:16:15.000 --> 00:16:18.000 Yeah, so this is a general weak learner, so we don't have an actual algorithm to fit in this toy example. 00:16:18.000 --> 00:16:24.000 Yeah, yeah, but in practice, the weights would be considered in whatever algorithm's being fit behind the scenes. 00:16:24.000 --> 00:16:29.000 But so, with AdaBoost, like, have we not talked about AdaBoost yet? Because I thought that was a specific weak learner. 00:16:29.000 --> 00:16:40.000 So AdaBoost is a process, a boosting algorithm, for trying to create a strong learner out of a series of weak learners. 00:16:40.000 --> 00:16:46.000 So this whole process that we've just looked at is doing AdaBoost up to two weak learners. 00:16:46.000 --> 00:17:02.000 So in general, the weak learner we would choose is maybe a decision stump. And so then the decision stump here would produce a cut like this, and then the second one would maybe produce a cut like this, because we've more heavily weighted observation 3. 00:17:02.000 --> 00:17:15.000 The decision stump follows the CART algorithm, and the weights would play into that, where certain observations maybe get a higher weight. 00:17:15.000 --> 00:17:16.000 Okay. 00:17:16.000 --> 00:17:21.000 I'm not entirely sure how it gets implemented in scikit-learn or something, but the weights do impact the fitting process of the weak learner. 00:17:21.000 --> 00:17:22.000 Alright, thanks. 00:17:22.000 --> 00:17:27.000 Yeah, Keithon asked: for subsequent weak learners, does the sum of the weights not equal one?
00:17:27.000 --> 00:17:31.000 No, so it doesn't have to equal one, because remember, when we calculate r_j, we're dividing by the sum of all the weights. 00:17:31.000 --> 00:17:43.000 So r_j will never be more than one, because we're dividing by the sum of all the current weights. 00:17:43.000 --> 00:17:50.000 So that's like the most important part. 00:17:50.000 --> 00:17:58.000 I'm sorry, so you said that the weights are going to be incorporated into whatever weak learner we choose. 00:17:58.000 --> 00:18:04.000 So if, for example, the weak learner is, say, a tree of one depth or whatever, 00:18:04.000 --> 00:18:18.000 the weights would come into play during the calculation of the information gain or something, and from that information gain, the way in which it forms the decision boundaries? 00:18:18.000 --> 00:18:19.000 Yes, yeah. 00:18:19.000 --> 00:18:26.000 Okay. 00:18:26.000 --> 00:18:35.000 Any other questions? 00:18:35.000 --> 00:18:45.000 Okay, so now let me go ahead and show you how to do this in scikit-learn, and now that we're not reading a bunch of things, I'm going to zoom out a little bit so my code isn't as big. 00:18:45.000 --> 00:18:53.000 So we're going to use a smaller version of this same data problem we've been looking at for all of our classifications. 00:18:53.000 --> 00:19:04.000 So we're going to look at it with classification, but remember, the boosting algorithms can be for classification or for regression, because our weak learners, right, can also be used for regression. 00:19:04.000 --> 00:19:08.000 So if you have a decision stump for classification, you can just as easily have a decision stump for regression. 00:19:08.000 --> 00:19:14.000 So we have fewer observations here because I want to make it easier to see what's going on.
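Before moving to scikit-learn, the weight-update arithmetic from the toy example can be replayed in a few lines. This is a minimal sketch of the r_j, alpha_j, and weight-update formulas described above, with each stump's mistakes hard-coded to match the slides (the function and variable names here are mine, not the notebook's):

```python
import numpy as np

eta = 1.0                                # the learning rate eta, set to 1 here
w = np.full(6, 1 / 6)                    # every observation starts at weight 1/6

def boost_round(w, wrong):
    """One AdaBoost round: weighted error r_j, vote weight alpha_j, new weights."""
    r = w[wrong].sum() / w.sum()         # weighted error rate r_j
    alpha = eta * np.log((1 - r) / r)    # vote weight alpha_j
    w = w.copy()
    w[wrong] *= np.exp(alpha)            # up-weight only the missed observations
    return r, alpha, w

# Weak learner 1 misclassifies observation 3 (index 2):
r1, alpha1, w = boost_round(w, wrong=[2])   # r1 = 1/6, alpha1 = log 5, w3 -> 5/6
# Weak learner 2 misclassifies observation 4 (index 3):
r2, alpha2, w = boost_round(w, wrong=[3])   # r2 = 1/10, alpha2 = log 9, w4 -> 3/2
print(r1, alpha1, r2, alpha2)
print(w)
```

Note that after the second round the weights sum to more than one, which is exactly the point raised in the question: only the ratio matters, since r_j divides by the current total.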
00:19:22.000 --> 00:19:26.000 So let's go ahead and show you how to make the decision stump, er, for the AdaBoost. 00:19:26.000 --> 00:19:37.000 So, ensemble: from sklearn.ensemble we'll import AdaBoostClassifier. 00:19:37.000 --> 00:19:45.000 And then I think I have a link to the documentation for this further up in the notebook. And then we also have to import what our base classifier is. 00:19:45.000 --> 00:19:50.000 So the base classifier, that's the weak learner we're using. 00:19:50.000 --> 00:19:56.000 We're going to use the decision stump, so from sklearn.tree 00:19:56.000 --> 00:20:03.000 we're going to import DecisionTreeClassifier. 00:20:03.000 --> 00:20:11.000 Okay, so here's how we set it up. I'm going to get to the for loop after I just show you the setup. 00:20:11.000 --> 00:20:18.000 So you call AdaBoostClassifier. The first argument is your weak learner, which for us is a decision stump, 00:20:18.000 --> 00:20:26.000 so that's the decision tree with a maximum depth of one. Then the number of estimators you're going to use, 00:20:26.000 --> 00:20:34.000 so that's the number of weak learners. In the previous toy example, we stopped with n_estimators equal to 2, but we could have kept going. 00:20:34.000 --> 00:20:40.000 Currently, in this code, I have it set equal to i, and I'll explain that in a second. 00:20:40.000 --> 00:20:45.000 Then I have this argument algorithm equals SAMME.R. So this is the algorithm used to fit 00:20:45.000 --> 00:20:57.000 the AdaBoost classifier. The reason I'm specifying this is the other one doesn't allow you to predict probabilities. 00:20:57.000 --> 00:21:06.000 So I wanted to point out that if you would like to be able to predict probabilities for your classification, 00:21:06.000 --> 00:21:17.000 you need to use SAMME.R.
The other option is SAMME, and you wouldn't be able to use predict_proba if you didn't use the one with the .R. 00:21:17.000 --> 00:21:22.000 And then I set a random state so that when I run the code and you run the code, it'll be the same. 00:21:22.000 --> 00:21:29.000 So what I'm doing here is I'm just showing you how the decision boundaries change when I use a different number of weak learners. 00:21:29.000 --> 00:21:32.000 So the first time through it will just be a single decision stump, the second time through it will be two decision stumps, and so on. 00:21:32.000 --> 00:21:42.000 Okay. 00:21:42.000 --> 00:21:49.000 So here we can slowly watch the process: as we add additional weak learners, how it changes our decision boundary. 00:21:49.000 --> 00:21:56.000 And you can see how it starts to react to getting things wrong. 00:21:56.000 --> 00:21:59.000 So for the first two, it doesn't really change much. 00:21:59.000 --> 00:22:11.000 But with the third one, these incorrect blue ones lead to this, you know, branch; then the fact that I have these incorrect orange ones over here leads to this adjustment. 00:22:11.000 --> 00:22:19.000 Then I incorrectly predicted that blue one, so that leads to this, and then after 5 we've correctly predicted everything in the training set. 00:22:19.000 --> 00:22:34.000 And so it stops making adjustments, because after that, with every subsequent weak learner, the overall algorithm will be predicting everything correctly. 00:22:34.000 --> 00:22:39.000 Okay, so that's AdaBoost. That's the theory, at least the setup of the theory; 00:22:39.000 --> 00:22:44.000 we haven't proven anything. And that's how to implement it in scikit-learn. 00:22:44.000 --> 00:22:52.000 So here are some good additional references on AdaBoost models.
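Putting the setup just described into a runnable form, here is a version-hedged sketch with synthetic stand-in data (the data, seed, and variable names are mine, not the notebook's). One caveat: in scikit-learn releases after this lecture, the `algorithm="SAMME.R"` option was deprecated and later removed, so it is omitted here, and the weak learner is passed positionally because its keyword name changed from `base_estimator` to `estimator` across versions:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's classification data.
rng = np.random.default_rng(216)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # the weak learner: a decision stump
    n_estimators=10,                      # W, the total number of weak learners
    learning_rate=1.0,                    # eta in the alpha_j formula
    random_state=216,
)
ada.fit(X, y)
print(ada.score(X, y))                    # training accuracy
print(ada.predict_proba(X[:2]))           # per-class probabilities
```

In practice you might put `n_estimators` and `learning_rate` together into a cross-validation search, since, as the learning-rate question later in the Q&A suggests, a smaller eta generally needs more weak learners to reach the same fit.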
These are pretty good, I think. 00:22:52.000 --> 00:22:59.000 You'd probably want a mathematical background if you're going to try and dive into them, but you can try reading them if you're interested but don't have a math background. 00:22:59.000 --> 00:23:10.000 Yeah. 00:23:10.000 --> 00:23:17.000 So just reading through the question. Okay, so Zach's asking: are there versions of boosting that change the employed weak learner methods 00:23:17.000 --> 00:23:24.000 as the number of guesses increases? So if I understand correctly, I believe, Zach, you are asking about a 00:23:24.000 --> 00:23:31.000 boosting algorithm that would maybe change from a decision stump to a different type of algorithm partway through. 00:23:31.000 --> 00:23:33.000 Is that what you're asking? 00:23:33.000 --> 00:23:43.000 Yeah, either a decision stump to a different type of algorithm, or increasing the number of branches. Just improving the learner as you guess more. 00:23:43.000 --> 00:23:54.000 So I don't think so, because, I mean, the overall idea is that you don't need a sophisticated algorithm to do it; the improvement comes from adding weak learners. 00:23:54.000 --> 00:24:02.000 So that's the idea. So I don't think there's ones that say, okay, I'm going to use a really bad algorithm, and then as I get closer, I'm going to employ a better algorithm, if that makes sense. 00:24:02.000 --> 00:24:13.000 So I think it's typically the case that you're always using the same base algorithm. 00:24:13.000 --> 00:24:14.000 Okay, thank you. 00:24:14.000 --> 00:24:15.000 Yep, yeah. And then Ramazan asked: is it common to try different learning rates? 00:24:15.000 --> 00:24:23.000 How does it play out with the number of estimators if we increase the learning rate? So that's a good question, 00:24:23.000 --> 00:24:28.000 because I think I forgot to specify the learning rate. So let's see if we can check the documentation,
00:24:28.000 --> 00:24:32.000 because I'm not sure what they call the learning rate. So here's the documentation. The learning rate defaults to one. 00:24:32.000 --> 00:24:44.000 In general, you might want to try, you know, putting this in a cross-validation loop with the number of estimators, 00:24:44.000 --> 00:24:55.000 so you can try different numbers of estimators and different learning rates. So let's see if I'm good at this in a live lecture setting: 00:24:55.000 --> 00:25:03.000 we can look at the setup and try and reason out what the learning rate would do. So the learning rate impacts... 00:25:03.000 --> 00:25:16.000 So let's see, the learning rate impacts alpha, and alpha is the weight of the vote. And so if eta is higher, that means alpha should be higher, which means that you're giving bigger votes. 00:25:16.000 --> 00:25:20.000 So I think your adjustments would be more drastic. Whereas if eta is lower, then everything gets a smaller 00:25:20.000 --> 00:25:34.000 vote, so maybe your adjustments would be less drastic. So I think it's a learning rate in the same sense as in gradient descent, where if it's bigger you're more likely to make bigger shifts, 00:25:34.000 --> 00:25:35.000 and if it's smaller, you're more likely to make smaller incremental changes and it would take you longer. 00:25:35.000 --> 00:25:54.000 So just like with any sort of algorithm with a learning rate, I think there's probably a trade-off that you need to try and balance with cross-validation or something. 00:25:54.000 --> 00:26:00.000 Yeah, so does that answer your question? 00:26:00.000 --> 00:26:09.000 Great. Jacob's asking: is this an example of a greedy algorithm? So, as a reminder for everybody, 00:26:09.000 --> 00:26:19.000 greedy algorithms are ones where at each step the algorithm makes the optimal decision,
so the decision at that step that leads to the greatest improvement. 00:26:19.000 --> 00:26:29.000 Here, I'm not entirely sure; I would have to double-check, but based on the updating rules, I don't know that this would be considered 00:26:29.000 --> 00:26:38.000 a greedy algorithm. Oh, genetic algorithm. That I am not so sure about. I read a book a long time ago that explained what a genetic algorithm was, 00:26:38.000 --> 00:26:53.000 and I don't remember, so I can't give you a good answer as to whether or not it's a genetic algorithm. 00:26:53.000 --> 00:27:02.000 And Sanjay is saying that they are not the same. 00:27:02.000 --> 00:27:08.000 All right, so that's AdaBoost. The next type of boosting is called gradient boosting, 00:27:08.000 --> 00:27:15.000 and then after this, we'll learn something called extreme gradient boosting. 00:27:15.000 --> 00:27:33.000 Okay, so AdaBoost basically paid extra attention to the things that were wrong. The way gradient boosting works is you're going to first build a weak learner, then you're going to build a second weak learner to train directly on the 00:27:33.000 --> 00:27:43.000 errors of the previous weak learner. So this is going to be easiest to explain in a regression formulation. 00:27:43.000 --> 00:27:51.000 It still works for classification; just like with all the other ensemble learning, you can do it for regression or classification. 00:27:51.000 --> 00:27:58.000 So we're going to do it in the regression setting because it's easier to write out, and then I think I have some references at the end if you're interested in looking at the classification version, 00:27:58.000 --> 00:28:08.000 or maybe it's in the practice problems, either one of those. So let's go through it. 00:28:08.000 --> 00:28:20.000 So here are the steps for gradient boosting. You first train a weak learner, in this regression setting a weak regression algorithm,
so let's say another decision stump, to predict y. 00:28:20.000 --> 00:28:37.000 And then this is called weak learner number one. Then you calculate the residuals, and I believe we covered this when we were doing regression two weeks ago, seems like forever ago now: you calculate the actual minus the predicted. 00:28:37.000 --> 00:28:41.000 And so here I'm going to use h_1(X) as notation to denote the prediction of weak learner one. 00:28:41.000 --> 00:28:56.000 So h_1 is going to be ŷ for the first weak learner. And then h_2 would be for the second, or not exactly ŷ, but it's going to be the prediction of the second weak learner, 00:28:56.000 --> 00:29:00.000 and we'll see why it's not exactly ŷ. Okay, so this is step one: 00:29:00.000 --> 00:29:04.000 you train a weak learner regression algorithm to predict y, then you calculate the error, 00:29:04.000 --> 00:29:14.000 so the actual minus the predicted. Then in general, for step j, you will train a weak learner to predict the residuals from step j minus one. 00:29:14.000 --> 00:29:40.000 So for instance, in step 2, instead of making a decision stump regressor that predicts y, you're going to make a decision stump regressor that predicts r_1, the residuals of the previous model. 00:29:40.000 --> 00:29:46.000 So then you're going to set h_j(X) 00:29:46.000 --> 00:29:55.000 to denote the predictions of the previous step's residuals. Then you calculate the residuals for this weak learner, 00:29:55.000 --> 00:30:07.000 so that would be the previous step's residuals minus the current prediction. And then you would stop, given that you're going to preset, okay, use this many, capital J, weak learners.
00:30:07.000 --> 00:30:18.000 And so now you might be wondering, well, how do I then get the prediction for y at a given step? You do that by summing up all of the different predictions you've made. 00:30:18.000 --> 00:30:27.000 So h_1(x) is the predicted value for y. h_2(x) is the predicted value for the residuals from the first weak learner. 00:30:27.000 --> 00:30:28.000 h_3(x) would be the predicted value for the residuals of the second weak learner, and so on. 00:30:28.000 --> 00:30:43.000 And so by summing all these up, you're trying to get closer and closer to the prediction of y, which I'm going to just call H(x) here. 00:30:43.000 --> 00:30:50.000 So before opening it up for questions, I'm going to show you a visualization of this to hopefully make it clear. 00:30:50.000 --> 00:30:56.000 So let's say we have this data, y and X, and I'm going to build, 00:30:56.000 --> 00:31:01.000 I'm not going to use the sklearn gradient boosting algorithm just yet, 00:31:01.000 --> 00:31:09.000 I'm using just a decision tree, to show you what's going on. So for the first weak learner I make my decision stump, 00:31:09.000 --> 00:31:21.000 a decision tree of depth one. Then I fit it on the X and y, I get my prediction and store it in h_1, and then I calculate the residuals and store them in r_1. 00:31:21.000 --> 00:31:33.000 And then here I'm just plotting both of those things. So on the left-hand side is going to be a plot of all the individual h_j's, and then on the right-hand side will be the running plot of the H's. 00:31:33.000 --> 00:31:39.000 So weak learner two is then fit on r_1, the residuals from step one. Those predictions are stored in h_2, and then I calculate the residuals from that step, 00:31:39.000 --> 00:31:51.000 r_2, so the errors on the previous model's residuals. 00:31:51.000 --> 00:32:00.000 Then weak learner three does a similar thing. So I fit the decision stump on the residuals from the previous step.
00:32:00.000 --> 00:32:08.000 Store the predictions and then calculate the residuals for that step. Okay, so here's what this looks like as a picture. 00:32:08.000 --> 00:32:15.000 So on the left-hand side of all these plots, I'm going to have the h_j along with the training data. 00:32:15.000 --> 00:32:21.000 So for the first row, that's h_1. So this is just the X and the y. 00:32:21.000 --> 00:32:27.000 And then h_1 is just a single cut point. 00:32:27.000 --> 00:32:34.000 And so if you're unfamiliar, a decision stump regressor will just take the bins and then average them. 00:32:34.000 --> 00:32:43.000 So everything to the left of the cut point is the average of all of these points, and then to the right of the cut point it's the average of all these points. 00:32:43.000 --> 00:32:49.000 So then on the right-hand side I'm going to have the running value of H, which remember is the sum of all the h_j. 00:32:49.000 --> 00:32:58.000 So for right now, we only have h_1, so H is equal to h_1. Now h_2 is trained on the residuals from the first step. 00:32:58.000 --> 00:33:04.000 So if we look at this, we can kind of see where that's coming from. So here we can see we've got all these ones up here. 00:33:04.000 --> 00:33:10.000 So this is what h_2 looks like, and this is the data that was used to train h_2. 00:33:10.000 --> 00:33:15.000 Now on the right-hand side here we've got H, which is equal to h_1 plus h_2, 00:33:15.000 --> 00:33:21.000 and then the original training data. 00:33:21.000 --> 00:33:29.000 Now we've got r_2, and we're training a new model that gives us h_3 to predict r_2. 00:33:29.000 --> 00:33:34.000 And now we've got the running sum of h_1, h_2, and h_3 on the right-hand side. 00:33:34.000 --> 00:33:40.000 And then finally this is where I stop. We've got h_4, which is trained on the residuals from the previous step.
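The residual-fitting loop being plotted here can be sketched from scratch in a few lines. This is my own minimal sketch, not the notebook's code: `fit_stump` and `gradient_boost` are hypothetical helper names, and the cut-point search is brute force just to keep it self-contained.

```python
# From-scratch gradient boosting with depth-1 stumps (learning rate of 1 for simplicity).

def fit_stump(x, y):
    """Fit a depth-1 regression tree: one cut point, mean of y on each side."""
    best = None
    order = sorted(range(len(x)), key=lambda i: x[i])
    for k in range(1, len(x)):
        cut = (x[order[k - 1]] + x[order[k]]) / 2
        left = [y[i] for i in range(len(x)) if x[i] <= cut]
        right = [y[i] for i in range(len(x)) if x[i] > cut]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y[i] - (ml if x[i] <= cut else mr)) ** 2 for i in range(len(x)))
        if best is None or sse < best[0]:
            best = (sse, cut, ml, mr)
    _, cut, ml, mr = best
    return lambda xi: ml if xi <= cut else mr

def gradient_boost(x, y, n_learners=20):
    """Iteratively fit stumps h_j to the residuals of the running sum H."""
    stumps = []
    resid = list(y)                                      # before any learner, the "residual" is y itself
    for _ in range(n_learners):
        h = fit_stump(x, resid)                          # weak learner h_j predicts the current residuals
        stumps.append(h)
        resid = [r - h(xi) for xi, r in zip(x, resid)]   # r_j = r_{j-1} - h_j(x)
    return lambda xi: sum(h(xi) for h in stumps)         # H(x) = sum of all the h_j(x)
```

Calling `gradient_boost` on a toy curve and comparing the MSE of `H` against a constant-mean baseline shows the residuals shrinking as stumps are added, which is exactly what the left/right plot pairs are showing.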
00:33:40.000 --> 00:33:52.000 And then here's the running sum h_1 plus h_2 plus h_3 plus h_4. So that's the idea behind gradient boosting. 00:33:52.000 --> 00:34:05.000 And again, this can be done for classification as well, but the setup is slightly different, right, because the residuals aren't exactly actual minus predicted. 00:34:05.000 --> 00:34:14.000 I either have a reference at the bottom of this or it's in the practice problems, I can't remember which. 00:34:14.000 --> 00:34:20.000 Okay, so I'm just checking the questions. 00:34:20.000 --> 00:34:25.000 So Keithon asked, this may be silly. It's not a silly question. 00:34:25.000 --> 00:34:30.000 What's the benefit of using a decision tree with AdaBoost over a random forest without AdaBoost, is it computation time? 00:34:30.000 --> 00:34:42.000 So those are two different approaches. The main reason we use the decision stump in AdaBoost and gradient boosting is because it's a weak learner that is easy and quick to train. 00:34:42.000 --> 00:34:45.000 We just need to find a single cut point. So random forests and gradient boosting are just different approaches. 00:34:45.000 --> 00:34:58.000 One may perform better for your particular problem. It just depends. 00:34:58.000 --> 00:35:15.000 And then thank you, Brooks, for your nice comment. Are there any other questions about gradient boosting? 00:35:15.000 --> 00:35:29.000 Okay, so how do we do this in sklearn, with the gradient boosting regressor in this case, or classifier? So we would do from sklearn 00:35:29.000 --> 00:35:43.000 import GradientBoostingRegressor. And we can note that we do not have to import the decision tree regressor because we did that earlier. 00:35:43.000 --> 00:35:51.000 We didn't have this in the previous setup because I didn't wanna, you know, make it even more confusing, but just like with AdaBoost, there's a learning rate.
00:35:51.000 --> 00:36:02.000 So the updates get multiplied by some learning rate, eta. So I believe the default is, well, we could just see what the default is. 00:36:02.000 --> 00:36:11.000 So the default is 0.1. And so a higher learning rate means that your adjustments are going to be made faster, and then a lower learning rate means your adjustments will be made more slowly. 00:36:11.000 --> 00:36:33.000 Which one is best depends upon, you know, your data. So what we're gonna do here is we're gonna show you two different gradient boosting regressors, one with a lower learning rate, one with a higher learning rate, and then you'll see the differences. 00:36:33.000 --> 00:36:39.000 So we're gonna use the same number of estimators here. So we're gonna have a gradient 00:36:39.000 --> 00:36:47.000 boosting regressor. Then we have to put in our base, which is our decision 00:36:47.000 --> 00:36:56.000 tree regressor with a maximum depth of one. We're gonna set the number of estimators 00:36:56.000 --> 00:37:02.000 equal to 10. So I just chose 10 here for demonstration purposes. It doesn't mean that's what you're gonna want to use all the time. 00:37:02.000 --> 00:37:16.000 And then my max depth, I already did that. Then, just to be clear, I'm gonna set my learning rate equal to 0.1 and then I'll make a note: 00:37:16.000 --> 00:37:26.000 note, this is the default value. Okay. Then I'm gonna slightly cheat and just copy and paste this so I don't have to type it all again. 00:37:26.000 --> 00:37:32.000 But now I'm gonna change it and make it a larger learning rate. I'm gonna set it to one. 00:37:32.000 --> 00:37:43.000 Oh, what did I do? 00:37:43.000 --> 00:37:58.000 Let's just check. 00:37:58.000 --> 00:38:01.000 Interesting. Alright, let's 00:38:01.000 --> 00:38:09.000 do another cheat and peek so I don't have to spend too much time on debugging. 00:38:09.000 --> 00:38:18.000 Oh, okay, awesome. So here's the difference. Why am I getting an error?
Gradient boosting uses decision trees. 00:38:18.000 --> 00:38:31.000 So I just have to put in a max depth. So in general, in theory you could use any weak learner you'd like, but by default it's always a decision tree, and then you just have to set the maximum depth. 00:38:31.000 --> 00:38:39.000 So I'm sorry if that was confusing. I just had a brain slip and forgot that they use decision trees by default. 00:38:39.000 --> 00:38:49.000 Hey, so there we go. So you just set the maximum depth like you would in any other decision tree or random forest. 00:38:49.000 --> 00:38:55.000 But now, unlike AdaBoost, in gradient boosting it's just always a decision stump, 00:38:55.000 --> 00:38:59.000 well, decision tree, you could change the depth to be more than one. Okay. So here's the difference between the two. 00:38:59.000 --> 00:39:16.000 So you can see how with a lower learning rate, we're more slowly fitting to the data, and with the higher learning rate we're maybe more likely to overfit on the data, like, quickly. 00:39:16.000 --> 00:39:18.000 And so that's the impact of the learning rate. It's just how much we're adjusting to the residuals. 00:39:18.000 --> 00:39:21.000 If we were to run this longer, so if we ran the first one longer, it would begin to look like this. 00:39:21.000 --> 00:39:35.000 We would just need more estimators. So I believe the preference in general is to use a slightly smaller learning rate and then just use more trees. 00:39:35.000 --> 00:39:45.000 Now that has the problem of a longer training time, but that's something you'll have to consider. 00:39:45.000 --> 00:39:58.000 Okay, so you can try and find a good number of estimators.
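The two fits being compared can be sketched like this. This is a hedged reconstruction, not the notebook's code: the toy data is my own, since the notebook's randomly generated data isn't shown, and note that sklearn's `GradientBoostingRegressor` takes `max_depth` directly rather than a tree object, which is what the error above was about.

```python
# Two gradient boosting regressors that differ only in learning rate.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))                  # stand-in toy data
y = np.sin(X.ravel()) + rng.normal(0, 0.2, size=100)

# learning_rate=0.1 is the default; stumps come from max_depth=1
gb_slow = GradientBoostingRegressor(max_depth=1, n_estimators=10, learning_rate=0.1)
gb_fast = GradientBoostingRegressor(max_depth=1, n_estimators=10, learning_rate=1.0)
gb_slow.fit(X, y)
gb_fast.fit(X, y)

# train_score_ holds the training loss after each of the 10 stages; the larger
# learning rate drives it down much faster after the same number of stumps.
print(gb_slow.train_score_[-1], gb_fast.train_score_[-1])
```

Plotting `gb_slow.predict` and `gb_fast.predict` over a grid reproduces the picture in the lecture: the small learning rate approaches the data slowly, the large one chases it quickly.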
So that can be optimized with cross-validation, or a lot of times you might use just a validation set. 00:39:58.000 --> 00:40:12.000 And the reason there is that it can take a long time to fit sometimes. And so if you don't want to have to wait and, say, fit 200 subsequent weak learners or 200 subsequent decision stumps 00:40:12.000 --> 00:40:24.000 five different times, you might just use a validation set, depending on how long it takes. So for us, because it's a lecture, I'm gonna use the validation set so it's a little bit faster. 00:40:24.000 --> 00:40:38.000 So here I'm getting a validation set. It was randomly generated data, so I can just randomly generate more data instead of having to actually do a train test split kind of thing. 00:40:38.000 --> 00:40:57.000 We're going to calculate the mean squared error on the validation set. Alright, so what we're gonna do is we're gonna go through 200 decision trees, or decision stumps, and then we're going to calculate the mean squared error on the validation set for each of the fitted weak 00:40:57.000 --> 00:41:05.000 learners. And the way that we can do this is gradient boosting has a method called staged_predict. 00:41:05.000 --> 00:41:14.000 So we're gonna do, maybe I'll do it separately so you can see. So we'll do gb.staged_predict, 00:41:14.000 --> 00:41:22.000 and we're gonna put in X_val.reshape(-1, 1). 00:41:22.000 --> 00:41:32.000 And so you can see this is a generator object. And so what this is doing is it's going to allow us to loop through each of the weak learners 00:41:32.000 --> 00:41:42.000 and then provide the prediction that you get from stopping at that point. And so it's gonna do this in what's known as a generator, which is something you have to iterate through. 00:41:42.000 --> 00:41:51.000 So we're gonna copy this and put it into a list comprehension. So we want the mean squared error
00:41:51.000 --> 00:42:02.000 of, we want y_val first, the true values, and then the predicted values for the predictions, 00:42:02.000 --> 00:42:05.000 and those come from staged_predict. 00:42:05.000 --> 00:42:14.000 Okay, so here we have our mean squared errors. And this would be the mean squared error if we stopped at a single decision tree, 00:42:14.000 --> 00:42:21.000 if we stopped at 2, if we stopped at 3, and so forth, all on the validation set. 00:42:21.000 --> 00:42:30.000 And so then what you would do is you could look at this. And I've plotted it here so you can see the MSE as a function of the number of weak learners, and then you'd find the one with the smallest 00:42:30.000 --> 00:42:43.000 value, the smallest MSE, again, on the validation set. And then you would say, okay, this will be the number of weak learners I would use. 00:42:43.000 --> 00:42:51.000 So that's 112 weak learners. And then you could retrain it. And then here's what it looks like on the training set. 00:42:51.000 --> 00:43:00.000 So retraining it, this is the model we get. Okay. All right, so. 00:43:00.000 --> 00:43:05.000 There's another way to do this called early stopping. So notice here that we had to go through and train about 90 more weak learners than we needed for the smallest value. 00:43:05.000 --> 00:43:19.000 And so one way that you can keep yourself from doing as many fits as we did there is doing what's known as early stopping. 00:43:19.000 --> 00:43:26.000 So if you include in sklearn an argument called warm_start and set it equal to True, 00:43:26.000 --> 00:43:41.000 this is going to allow you to implement early stopping. So how does early stopping work? Early stopping will, as you add a weak learner, keep track of what is my current best MSE. 00:43:41.000 --> 00:43:48.000 And then if I don't go below the current best for some set number of times in a row, so for us it's going to be 10.
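Both tuning approaches, scoring every stage with `staged_predict` and the `warm_start` early-stopping loop, can be sketched together. Toy data and variable names are my own stand-ins, not the notebook's:

```python
# Picking n_estimators on a validation set, two ways.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.2, size=200)
X_val = rng.uniform(0, 10, size=(80, 1))
y_val = np.sin(X_val.ravel()) + rng.normal(0, 0.2, size=80)

# Approach 1: fit all 200 stages up front, then score each stage on the
# validation set. staged_predict is a generator, one prediction array per stage.
gb = GradientBoostingRegressor(max_depth=1, n_estimators=200, learning_rate=0.1)
gb.fit(X, y)
mses = [mean_squared_error(y_val, pred) for pred in gb.staged_predict(X_val)]
best_n = int(np.argmin(mses)) + 1   # stages are 0-indexed, learner counts are not

# Approach 2: early stopping with warm_start, so each fit call only adds
# one new stump instead of refitting everything from scratch.
gb_es = GradientBoostingRegressor(max_depth=1, learning_rate=0.1, warm_start=True)
min_val_error = float("inf")        # infinity so the first fit always counts as a new best
rounds_since_best = 0
for n_estimators in range(1, 501):
    gb_es.n_estimators = n_estimators
    gb_es.fit(X, y)
    val_error = mean_squared_error(y_val, gb_es.predict(X_val))
    if val_error < min_val_error:
        min_val_error, rounds_since_best = val_error, 0
    else:
        rounds_since_best += 1
        if rounds_since_best == 10:  # 10 non-improving rounds in a row: stop early
            break
```

The first approach wastes fits past the minimum; the second stops shortly after the validation error flattens out, which is the trade-off discussed here.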
00:43:48.000 --> 00:43:58.000 If I don't outperform my current best MSE after 10 more weak learners, I'm going to stop early and not keep going. 00:43:58.000 --> 00:44:03.000 So here is the code where we implement that, and I'll walk us through it. So we set the warm_start argument 00:44:03.000 --> 00:44:09.000 equal to True. That's what's going to allow us to do what we're about to do. 00:44:09.000 --> 00:44:14.000 And then we set a minimum validation error. So here, this might look weird, I'm setting it to infinity, 00:44:14.000 --> 00:44:21.000 and we'll see why I'm doing that in a second. 00:44:21.000 --> 00:44:30.000 Now I'm providing a list where I'm going to keep track of my validation errors, and then I'm also keeping a counter that's going to count the number of times my error was higher than my minimum error. 00:44:30.000 --> 00:44:37.000 And then if this ever gets to 10, I'll stop early. I will just not do it anymore. 00:44:37.000 --> 00:44:46.000 So I'm then gonna loop through one to 500 and then train my gradient boosting tree to have that many weak learners. 00:44:46.000 --> 00:44:55.000 So what I'm gonna say is, each time through the loop, I set the number of estimators for my gradient boosting regressor to be n_estimators. 00:44:55.000 --> 00:45:11.000 So the first time through it would be one, then 2, then 3. I fit slash refit the model, so because I had this warm_start argument I'm able to do this, so it won't go all the way back to one and then, you know, refit one and 2 every time. 00:45:11.000 --> 00:45:24.000 Then I calculate the validation errors for training up to that point. I check to see if my current validation error is better than my absolute minimum that I have so far. 00:45:24.000 --> 00:45:34.000 I guess it could be a local minimum.
I just mean the smallest one I currently have. And if it does, I record that as the new minimum and then I reset my counter. 00:45:34.000 --> 00:45:41.000 And then if my counter ever gets to 10 times in a row, meaning, okay, I trained a new 00:45:41.000 --> 00:45:52.000 weak learner and my error was still higher than my current minimum value, then I'm going to increase my counter, and if I ever get to 10 I'm not going to do the loop anymore. 00:45:52.000 --> 00:46:06.000 So this is called early stopping, and this is just printing so we can see what we're doing. And so you can see that once we got to 122, we stopped. 00:46:06.000 --> 00:46:11.000 And so this is what it looks like here. So I think it's still 112, 00:46:11.000 --> 00:46:15.000 that's what we had before, right? Yeah. So once we got to 112, 10 times in a row 00:46:15.000 --> 00:46:22.000 we didn't outperform the MSE we had on the validation set at 112. 00:46:22.000 --> 00:46:30.000 So we stopped. Okay. 00:46:30.000 --> 00:46:31.000 Yeah. 00:46:31.000 --> 00:46:40.000 There was a question. At the very bottom of your code block, yeah, that one. 00:46:40.000 --> 00:46:48.000 You see the comment where it says if this is the fifth time in a row it has gone up. Should that be tenth? 00:46:48.000 --> 00:46:49.000 Yes. Thanks for pointing that out. 00:46:49.000 --> 00:46:52.000 Okay. Otherwise, I thought there was a divide-by-two thing going on that I was missing. 00:46:52.000 --> 00:46:57.000 No, no, no, yeah, it's just a comment that I missed in my editing. 00:46:57.000 --> 00:46:58.000 Cool. 00:46:58.000 --> 00:47:06.000 Yep. And then icon asked, I expected to loop over n_estimators in the previous example, line 12. 00:47:06.000 --> 00:47:11.000 So are you talking about 00:47:11.000 --> 00:47:17.000 when we did this staged_predict thing?
00:47:17.000 --> 00:47:28.000 Okay, so when you do staged_predict, when we fit the model, it fits it for the total number of estimators we chose, which for this was 200. 00:47:28.000 --> 00:47:36.000 And then staged_predict basically allows us to loop through each of those, at the training point that it was already at, 00:47:36.000 --> 00:47:44.000 if that makes sense. So the first entry would be: we've trained using only one weak learner, 00:47:44.000 --> 00:47:48.000 here's the prediction for that weak learner. So that's why it's called staged_predict. 00:47:48.000 --> 00:47:54.000 It's the prediction at all the different stages of the training. 00:47:54.000 --> 00:48:00.000 Yeah, are there any other questions about any of the stuff we just did? 00:48:00.000 --> 00:48:13.000 So warm_start, I believe, is what allows us to do the resetting of the number of estimators and then refitting the model. 00:48:13.000 --> 00:48:18.000 Oh, sorry. And that was in response to a question Clark had. I have to remember the people watching the recording won't see that. 00:48:18.000 --> 00:48:25.000 Clark asked, where are we using warm_start in the code? So here we set warm_start equal to True. 00:48:25.000 --> 00:48:43.000 And if I understand correctly, it's been a while since I wrote the notebook so I may be forgetting, but it's what allows us to reset the number of estimators and then fit the model to include the next number of estimators. Oh, thanks. 00:48:43.000 --> 00:48:44.000 Thanks. 00:48:44.000 --> 00:48:51.000 Yeah, just checked the documentation, that's correct for warm_start. 00:48:51.000 --> 00:49:01.000 Alright, any other questions? 00:49:01.000 --> 00:49:06.000 Okay. 00:49:06.000 --> 00:49:12.000 So you might be wondering why this is called gradient boosting. And so here's just a quick sort of explanation as to why it's called gradient boosting.
00:49:12.000 --> 00:49:21.000 It's not exactly because we're computing a gradient, but we'll see why. 00:49:21.000 --> 00:49:35.000 So let's say, you know, our current prediction for y-hat as we go through step J, I'm gonna call it capital H_J(x). 00:49:35.000 --> 00:49:45.000 And remember this is the sum of all the little h's. So to get the estimate of y at step J plus one, we're going to call this 00:49:45.000 --> 00:49:54.000 capital H_{J+1}. So this is, you know, hopefully approximating y. 00:49:54.000 --> 00:50:02.000 It's going to be capital H_J, so the previous prediction for y, plus the current little h_{J+1}. 00:50:02.000 --> 00:50:19.000 So remember that's for the residuals, right? So in turn, little h_{J+1} is an approximation of y minus capital H_J. 00:50:19.000 --> 00:50:25.000 And if you remember, for a regression problem we typically attempt to minimize the MSE of the estimate. 00:50:25.000 --> 00:50:29.000 So for simplicity, we could denote this as (1/n)(y minus capital H_J) squared. 00:50:29.000 --> 00:50:43.000 This is at the J plus one step. So if we took the negative gradient of this with respect to the estimate capital H_J, we end up with (2/n) 00:50:43.000 --> 00:50:48.000 times (y minus capital H_J), which from our earlier approximation is (2/n) times little h_{J+1}. 00:50:48.000 --> 00:51:02.000 So what we're saying here is gradient boosting is, roughly speaking, a gradient descent algorithm in some sense. 00:51:02.000 --> 00:51:08.000 So that's where the gradient part comes from in gradient boosting. 00:51:08.000 --> 00:51:14.000 Okay. 00:51:14.000 --> 00:51:21.000 So Zack is saying the gradient boosting method reminds me of a Taylor series. Is that an imagined connection?
00:51:21.000 --> 00:51:34.000 So, like with a Taylor series, you're approximating by adding additional polynomials, and here you're saying in some sense you're doing a similar thing, is that what you're trying to say? 00:51:34.000 --> 00:51:43.000 Yeah, you're adding additional higher-power polynomials, and the terms are decided by the derivatives. 00:51:43.000 --> 00:51:54.000 Yeah. And each one is a smaller correction. 00:51:54.000 --> 00:51:55.000 Okay. 00:51:55.000 --> 00:52:00.000 That's possible. I would have to sit down and think about it more than my brain is able to do right now. 00:52:00.000 --> 00:52:01.000 Yeah. 00:52:01.000 --> 00:52:04.000 Okay, okay. 00:52:04.000 --> 00:52:06.000 Okay. So with gradient boosting in mind, the next thing we're gonna learn is extreme 00:52:06.000 --> 00:52:15.000 gradient boosting, which is what XGBoost stands for. 00:52:15.000 --> 00:52:23.000 So, a reminder about gradient boosting: there's sklearn's gradient boosting regressor. 00:52:23.000 --> 00:52:30.000 This is, just like we said, iteratively training weak learners by using the next weak learner to predict the current weak learner's residuals, or errors. 00:52:30.000 --> 00:52:41.000 So what is XGBoost, if we already have a perfectly good implementation of gradient boosting? Why do we need another one? 00:52:41.000 --> 00:52:50.000 So XGBoost is a very popular package for gradient boosting in Python. It stands for extreme gradient boosting.
00:52:50.000 --> 00:53:08.000 This particular package, yeah, at least when I wrote this, it was used a lot in winning data science competitions, which is probably why it became so popular. Which I think is also why AdaBoost became so popular whenever it was introduced, because it was used to win a lot of Kaggle competitions, and typically whatever is 00:53:08.000 --> 00:53:21.000 winning the Kaggle competitions picks up in data science circles. So before we dive in, XGBoost is not a package that typically comes installed in, like, an Anaconda distribution of Python, I believe. 00:53:21.000 --> 00:53:33.000 So you'll need to install it. So to do that, you can follow instructions here for both the conda or the pip version, and it's been a while since we talked about installing a package. 00:53:33.000 --> 00:53:45.000 If you're unsure of how to install a Python package, I believe we have instructions on the data science boot camp website that you can get to through the first steps button. 00:53:45.000 --> 00:53:52.000 This is an outdated line, because I think the M1 is probably fine, but I now know that there's also an M2. 00:53:52.000 --> 00:54:01.000 So if you have an M2 chip, the standard instructions may not work for you. And also if you have an M1 chip in a Mac, it's possible that they don't work for you. 00:54:01.000 --> 00:54:14.000 So you will probably need to do a web search to find relevant instructions if you're unable to install and you think it's because you have either the Apple M2 chip or possibly the Apple M1 chip. 00:54:14.000 --> 00:54:23.000 So why might we use XGBoost? So XGBoost's code for fitting boosting models is faster than the sklearn version. 00:54:23.000 --> 00:54:30.000 It also tends to outperform the sklearn version. So it's a slight modification of the gradient boosting algorithm that I don't quite remember, because I never dove deeply into it.
00:54:30.000 --> 00:54:41.000 But you can check out the documentation if you'd like to see what they're doing to improve upon the standard gradient boosting algorithm. 00:54:41.000 --> 00:54:46.000 I believe there's an extra step: where we were sort of approximating with the first gradient, I think XGBoost is maybe approximating with the second derivative as well. 00:54:46.000 --> 00:55:01.000 But I'd have to dive into the documentation to remember. The big takeaway is that it's a faster version of gradient boosting that tends to outperform regular gradient boosting. 00:55:01.000 --> 00:55:11.000 It also offers the ability to train the model in parallel, which at the time of writing this notebook, sklearn did not do for regular gradient boosting. 00:55:11.000 --> 00:55:17.000 So that's another reason why you might want to use it. 00:55:17.000 --> 00:55:23.000 So we're gonna use the same exact data set from the previous notebook to show you how to do everything that we did in the last notebook using the XGBoost version of gradient boosting. 00:55:23.000 --> 00:55:41.000 They have a couple different ways to implement things. We're gonna do the way that is most similar to sklearn, but that's sort of just scratching the surface of the capabilities of what XGBoost can do. 00:55:41.000 --> 00:55:47.000 So if you're really interested in this and want to use it, I encourage you to dive into the documentation and look it up there. 00:55:47.000 --> 00:56:01.000 So we're gonna import it, and remember, this won't run if you haven't installed it. And then another thing we should check, it's been a while since I've updated: 00:56:01.000 --> 00:56:16.000 my version is 1.7.4, and they're probably beyond that. Let's see. 00:56:16.000 --> 00:56:19.000 I'm not seeing where the most recent version is, but they're probably beyond that. I think I installed this a little over two years ago.
00:56:19.000 --> 00:56:31.000 So if there's something in the code that's working for me but doesn't work for you, it's probably because there's a different version. 00:56:31.000 --> 00:56:32.000 Okay, so Brooke says he has 1.7.5, so maybe I'm not that far behind. 00:56:32.000 --> 00:56:38.000 Okay. 00:56:38.000 --> 00:56:50.000 So how do we create an XGBoost version? It basically follows the same workflow. This version of their algorithm follows the same workflow as sklearn. 00:56:50.000 --> 00:57:05.000 So you first create a model object. So you do xgboost.XGBRegressor, and I'll point out we could have just imported XGBRegressor directly, but I decided not to. 00:57:05.000 --> 00:57:12.000 I don't know why. So then we set the learning rate. And this is going to mimic 00:57:12.000 --> 00:57:25.000 this plot. So I'm just showing you how to make this plot, but now with XGBoost. So the learning rate will be 0.1, 00:57:25.000 --> 00:57:33.000 my maximum depth will be one, and then the number of estimators will be 10. 00:57:33.000 --> 00:57:42.000 Then you would just fit it like normal, so .fit(X.reshape(-1, 1), y). 00:57:42.000 --> 00:57:47.000 And then I'm gonna copy and paste just like I did in the last notebook, and then I'm going to change my learning rate from 0.1 to one. 00:57:47.000 --> 00:57:59.000 This will be my bigger learning rate, and then .fit(X.reshape(-1, 1), 00:57:59.000 --> 00:58:03.000 y). Okay. 00:58:03.000 --> 00:58:10.000 And here's what the plots look like. And so if we were to reference back, 00:58:10.000 --> 00:58:15.000 they're basically the same, right? They're slightly different, but they're more or less the same. 00:58:15.000 --> 00:58:24.000 So that's XGBoost versus sklearn. Okay. 00:58:24.000 --> 00:58:36.000 So.
A nice feature of XGBoost, in addition to the stuff I told you about before, that it's faster and typically has better accuracy or MSE, 00:58:36.000 --> 00:58:45.000 is that it will automatically record validation set performance as it's going, as opposed to needing to use something like staged_predict. 00:58:45.000 --> 00:58:51.000 So here I make my validation set. And what you'll do is you define the regressor like you would, 00:58:51.000 --> 00:59:04.000 and then when you call fit, in addition to X and y, you can provide this argument called eval_set. 00:59:04.000 --> 00:59:09.000 So e-v-a-l underscore set. And to that you provide a list. 00:59:09.000 --> 00:59:20.000 And in that list, you'll have tuples. Those tuples will have whatever sets you'd like to get the performance on. 00:59:20.000 --> 00:59:30.000 So since we only have a single validation set, you would do a tuple with first the features, followed by the y values. 00:59:30.000 --> 00:59:36.000 Okay. And then sometimes, you know, people maybe have more than one set they'd like to get their performance on, and then they would provide that, like you could provide 00:59:36.000 --> 00:59:45.000 another X, another y. Okay. 00:59:45.000 --> 00:59:48.000 But we only have the one. 00:59:48.000 --> 00:59:53.000 And so you can see here all of this is printed to the screen. We're provided the validation set RMSE. 00:59:53.000 --> 01:00:02.000 And now, you know, we have this on here, but you might reasonably be like, well, am I gonna have to go through and copy and paste everything? 01:00:02.000 --> 01:00:09.000 No, you're not. So you just do, what did I call this? xgb_reg, 01:00:09.000 --> 01:00:26.000 so xgb_reg.evals_result(), and this creates a dictionary. And within that dictionary is gonna be the stuff from all of your different validation sets, but because I only have a single validation set,
01:00:26.000 --> 01:00:30.000 I only have validation_0. 01:00:30.000 --> 01:00:36.000 Okay, so we can access the stuff there 01:00:36.000 --> 01:00:46.000 at validation_0. So this is also a dictionary. We can look at the keys for it. 01:00:46.000 --> 01:00:50.000 And the only key here is the RMSE. So we can get the RMSE 01:00:50.000 --> 01:01:02.000 by doing square brackets with a string. And so now we have the root mean squared errors for all of the different numbers of weak learners. 01:01:02.000 --> 01:01:04.000 And so we could use this in our plot. 01:01:04.000 --> 01:01:19.000 So here I've plotted that, and then I provide the minimum. Okay, so for this version of it, the minimum number is somewhere between 200 and 300. 01:01:19.000 --> 01:01:22.000 But you can kind of see there's really not that much of a difference in performance once you get past 100, but this is the minimum 01:01:22.000 --> 01:01:39.000 for the weak learners. Another nice thing is we don't have to do that weird warm_start loop thing. 01:01:39.000 --> 01:01:48.000 You can just provide a number of rounds for early stopping. So here I'm defining my XGBoost regressor again. 01:01:48.000 --> 01:01:55.000 But now when I call fit, I first put in my training data, which I just called X, 01:01:55.000 --> 01:02:04.000 .reshape(-1, 1), then I provide my y. Then you can provide this argument early_stopping 01:02:04.000 --> 01:02:22.000 _rounds, and we'll set this equal to 10 like we did in the last notebook. And when you do this you have to provide an evaluation set, and so I'm gonna cheat and just copy and paste the previous one. 01:02:22.000 --> 01:02:32.000 So, you have to provide at least one validation set.
I actually don't know what it would do if you provide more than one, like how it would determine early stopping from that. 01:02:32.000 --> 01:02:46.000 But we can scroll down and see that it stopped: instead of going all 500, it stopped at 229, and we can see that in this plot as well. 01:02:46.000 --> 01:02:56.000 And so then here I'm just redefining the model with the optimal number of weak learners according to this validation set. 01:02:56.000 --> 01:03:05.000 And then this is what it looks like. So this is really just a surface-level introduction to the XGBoost package, but I think it's good 01:03:05.000 --> 01:03:17.000 to look at and see, okay, it already has some nice features over scikit-learn: you don't have to do warm start for early stopping, 01:03:17.000 --> 01:03:21.000 you don't have to do staged_predict for the validation set. All of that is already built into just the regular model. 01:03:21.000 --> 01:03:33.000 And you can go to the documentation page and see all of these different tutorials; it looks like you can use a GPU, 01:03:33.000 --> 01:03:38.000 you can use something called PySpark. So this is maybe worth looking into if you would like to learn 01:03:38.000 --> 01:03:59.000 more about this particular model. 01:03:59.000 --> 01:04:12.000 Hi, I'm just confused about the n_estimators and these arguments that you're putting into your model. If it's a regressor, then, because I thought these represent, you know, the number of trees 01:04:12.000 --> 01:04:23.000 or the number of decision boundaries, how does that make sense in the context of a regressor? 01:04:23.000 --> 01:04:48.000 So a decision tree regressor still searches 01:04:48.000 --> 01:04:57.000 through the features. Here we only have a single feature, and it finds a cut point that minimizes something like the MSE instead of the impurity.
01:04:57.000 --> 01:05:12.000 And so here it's determined that the cut point should happen here, and then in any node, its prediction for that node will just be the average value of y within that node. 01:05:12.000 --> 01:05:23.000 So to the left of the cut point, we are predicting the average value of all of these observations, and then to the right of the cut point, we're taking the average value of all these observations. 01:05:23.000 --> 01:05:33.000 And so that's how a decision tree regressor works. Now you could include more depth if you'd like, but because this is a weak learner, we're just using one. 01:05:33.000 --> 01:05:38.000 And then the way this works is the number of estimators is the number of decision stumps you're using. 01:05:38.000 --> 01:05:43.000 So on the left is each individual decision stump, but on the right is our final prediction for y. 01:05:43.000 --> 01:05:55.000 And it's just the addition of all these decision stumps together. So at 2 weak learners, it's little h 1 plus little h 2. 01:05:55.000 --> 01:05:59.000 So it's this curve plus this curve, and that's how you end up with this. 01:05:59.000 --> 01:06:10.000 And then it just keeps going. So at the end, for this example, if we made it all the way, it would be h 1 plus h 2 plus h 3, 01:06:10.000 --> 01:06:22.000 all the way up to h 500. But here we stopped at 229 because of the early stopping. 01:06:22.000 --> 01:06:25.000 Thanks. 01:06:25.000 --> 01:06:39.000 Any other questions? 01:06:39.000 --> 01:06:44.000 Okay. 01:06:44.000 --> 01:06:58.000 So our last notebook is voter models, and this is another type of ensemble model. We're gonna go back to classification; I think I just like the visuals for classification a little bit better. 01:06:58.000 --> 01:07:10.000 So we're going back to our classification setting.
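Going back to the stump question for a moment: the answer above, scan cut points on the single feature, predict the mean of y on each side, keep the cut with the lowest squared error, can be sketched from scratch. This is a toy illustration with invented data, not the actual tree-fitting code scikit-learn or XGBoost uses.

```python
# A from-scratch sketch of a depth-1 regression tree (a "decision stump") on
# one feature: try each midpoint between sorted x values as a cut point,
# predict the mean of y on each side, and keep the cut with the lowest SSE.
def fit_stump(x, y):
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    for i in range(1, len(xs)):
        cut = (xs[i - 1] + xs[i]) / 2
        left, right = ys[:i], ys[i:]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, cut, lm, rm)
    _, cut, lm, rm = best
    return lambda v: lm if v < cut else rm  # predict a node mean on each side

# Toy data: low y values on the left, high y values on the right.
stump = fit_stump([1, 2, 3, 10, 11, 12], [1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
print(stump(2), stump(11))  # roughly the left mean and the right mean
```

A boosted ensemble then adds up many such stumps, h1 + h2 + ..., each one fit to what the previous ones got wrong.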
The idea behind a voter model is that you have a few models, maybe after you go through and fit a couple of different algorithms. 01:07:10.000 --> 01:07:19.000 So in this classification setting, let's say you have a logistic regression you're happy with, a k nearest neighbors you're happy with, 01:07:19.000 --> 01:07:27.000 a support vector machine you're happy with, and maybe a random forest model you're happy with. So the voting classifier 01:07:27.000 --> 01:07:36.000 is then going to combine all of these together, and to make its predictions, it's going to vote, 01:07:36.000 --> 01:07:42.000 using the individual classifiers as the voters. So for instance, a voting classifier of these 4 models will ask, for an individual observation: what does the logistic regression model say? 01:07:42.000 --> 01:07:57.000 What does the k nearest neighbors model say? What does the support vector machine model say? And what does the random forest model say? And then it will just take, 01:07:57.000 --> 01:08:03.000 in hard voting, which we'll talk about in a second, the majority class; and if there's a tie, I think it just randomly decides. 01:08:03.000 --> 01:08:18.000 So that's what's called a voting model. And then the regression version, I think, would just take the average of the predictions instead of doing a vote. 01:08:18.000 --> 01:08:27.000 So that's what we mean when we say voter models. Before I show you how to fit everything in scikit-learn, 01:08:27.000 --> 01:08:57.000 are there any questions about how a voter model works? Is that clear enough? Feel free to ask if it's still unclear and I can try to explain it again.
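The hard-voting step just described can be sketched in a few lines: each fitted model casts one vote per observation, and the majority label wins. The random tie-breaking below follows the "I think it just randomly decides" guess above; the library's actual tie-breaking rule may differ.

```python
from collections import Counter
import random

def hard_vote(votes):
    """Majority vote over one observation's predicted labels."""
    counts = Counter(votes)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    # Unique majority: return it.  Tie: pick a winner at random.
    return winners[0] if len(winners) == 1 else random.choice(winners)

# Say logistic regression predicts 1, knn predicts 0, svm 1, random forest 1:
print(hard_vote([1, 0, 1, 1]))  # prints 1
```

So the ensemble's prediction is just whichever class most of the base classifiers agreed on.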
01:08:59.000 --> 01:09:14.000 Yes, you would have to fit all these models, but the idea is that in the course of trying these different models, you would have already fit them, so you've already invested the time into checking these models out, 01:09:14.000 --> 01:09:21.000 so you've already spent the time fitting them, if that makes sense. You wouldn't, right out of the gate, 01:09:21.000 --> 01:09:25.000 say, alright, the first model I'm gonna try is a voter model that uses 5 different models. 01:09:25.000 --> 01:09:37.000 Typically a voter model would be something you try after you're already happy with a series of other models. 01:09:37.000 --> 01:09:47.000 Okay, so the reason why you might think this would be good is that these models are, in some sense, different from one another. 01:09:47.000 --> 01:09:54.000 And so it's possible that the errors they make are also different from one another. The types of errors the logistic regression model might be making are maybe different from the other 3. 01:09:54.000 --> 01:10:14.000 And so then the hope is that by combining them all together in a voter model, the individual errors that any one model makes will hopefully be canceled out by the voting process. 01:10:14.000 --> 01:10:20.000 So that's the hope of why we're doing a voting classifier. 01:10:20.000 --> 01:10:37.000 Okay. So again, I wanted to keep it simple with these algorithms and stick with one basic data set, where we've got the upper left hand corner of a square and then the bottom right hand corner of a square.
01:10:37.000 --> 01:10:45.000 So the way this works in scikit-learn is, just like Jacob alluded to with his question, 01:10:45.000 --> 01:10:54.000 you need both the base classifiers and the voting classifier object. So for us, we're gonna test my memory to import everything. 01:10:54.000 --> 01:11:04.000 So from sklearn.linear_model we'll import LogisticRegression. 01:11:04.000 --> 01:11:14.000 From sklearn.neighbors we'll import KNeighborsClassifier. 01:11:14.000 --> 01:11:26.000 From sklearn.svm we'll import LinearSVC. 01:11:26.000 --> 01:11:32.000 From sklearn 01:11:32.000 --> 01:11:44.000 .ensemble we'll import RandomForestClassifier. And then for the voting classifier, we would do from 01:11:44.000 --> 01:11:54.000 sklearn.ensemble import VotingClassifier. And then at the end, I'm just gonna import my accuracy score. 01:11:54.000 --> 01:11:57.000 So, 01:11:57.000 --> 01:12:02.000 from sklearn.metrics import accuracy_score. I'm sure watching me type all this was very exciting. 01:12:02.000 --> 01:12:18.000 Okay. So when you make a voter model, you do not have to train the individual models on their own. 01:12:18.000 --> 01:12:27.000 But what I'm doing here is I would like to show you a comparison between the individual models and the voter model. 01:12:27.000 --> 01:12:34.000 So what I'm gonna do is go through each of these 4 model types and make one of each. 01:12:34.000 --> 01:12:43.000 And then when you do the voter model, you have to make a fresh version, and hopefully that will become clear as I do it. 01:12:43.000 --> 01:12:52.000 So I'm first gonna start off by making model objects of all 4 model types, and I have my names laid out for me here. 01:12:52.000 --> 01:13:02.000 So I'm gonna do log equal to LogisticRegression, and I'll make a note: 01:13:02.000 --> 01:13:08.000 I've kept in the
01:13:08.000 --> 01:13:16.000 regularization. It doesn't necessarily matter; I've just decided to keep it in here for no particular reason. 01:13:16.000 --> 01:13:25.000 Okay. knn is gonna be equal to KNeighborsClassifier 01:13:25.000 --> 01:13:34.000 and let's do 7 neighbors. Again, this is just for demonstration purposes. 01:13:34.000 --> 01:13:44.000 Typically you would have gone through and done some sort of cross-validation for these models. 01:13:44.000 --> 01:13:53.000 svm is gonna be equal to LinearSVC, and let's say C equal to 10. 01:13:53.000 --> 01:14:03.000 And then finally, rf will be RandomForestClassifier; let's say max depth equal to 01:14:03.000 --> 01:14:10.000 3 and number of estimators equal to, I don't know, 250. Okay, so these are going to be just for comparison purposes. 01:14:10.000 --> 01:14:17.000 So I'm just going to keep these so I can finally compare them to the voting classifier. 01:14:17.000 --> 01:14:39.000 Now for the voting classifier, you have to put in fresh, never-been-fitted-before versions 01:14:39.000 --> 01:14:44.000 of the classifiers. So you're gonna put in a list, very similar to a pipeline. 01:14:44.000 --> 01:14:50.000 You put in a list of tuples. So the first one will be "log" and then 01:14:50.000 --> 01:14:54.000 a LogisticRegression. 01:14:54.000 --> 01:15:06.000 Next will be the "knn", and that will be a KNeighborsClassifier with 7 neighbors. 01:15:06.000 --> 01:15:14.000 Then we'll have the support vector machine, which will just be a LinearSVC with C equal to 10. 01:15:14.000 --> 01:15:29.000 Then we're gonna have the random forest, which again I'm gonna copy and paste. 01:15:29.000 --> 01:15:40.000 You'll also notice that I have this argument here, voting equals hard.
So this means that voting works the way you normally think of voting: just counting votes. 01:15:40.000 --> 01:15:48.000 Another argument you can have is voting equals soft, and I'll talk about that after we go through this example. 01:15:48.000 --> 01:16:04.000 Okay, so what this loop is gonna do is loop through these lists of the name and the model type; it will fit the model, predict with the model, and then print out the accuracy. Be mindful that this is the training set. 01:16:04.000 --> 01:16:09.000 So, you know, ultimately, if it does well here, it doesn't really mean much, right? We'd want to check 01:16:09.000 --> 01:16:15.000 performance with cross-validation. 01:16:15.000 --> 01:16:24.000 Okay, so now we'll just go through and compare. So this was the logistic regression. 01:16:24.000 --> 01:16:31.000 This was the random forest. 01:16:31.000 --> 01:16:38.000 This was the support vector machine. 01:16:38.000 --> 01:16:44.000 This was the k nearest neighbors. 01:16:44.000 --> 01:16:48.000 And this was the voting classifier. So you can kind of see, if we go through and compare, how the voting classifier's boundary is 01:16:48.000 --> 01:17:01.000 sort of determined by an averaging of the previous boundaries. So if you notice, in the random forest and the k nearest neighbors, this boundary 01:17:01.000 --> 01:17:05.000 was blue, right? It was blue. 01:17:05.000 --> 01:17:18.000 It was very blue, and so you can kind of see how that carries over from the other 2 models, and how some of the intrusions into the other side of the actual boundary, right, 01:17:18.000 --> 01:17:22.000 also come from these other models. So that's sort of the idea. So I see that I have a question: 01:17:22.000 --> 01:17:31.000 why do we have to put in classifiers that are not previously fit? Does the new fit not override the previous one? 01:17:31.000 --> 01:17:41.000 So I didn't want the fits to be overridden.
So in this particular example, I wanted these to be fit independently of the voter model, which wouldn't have made a difference for anything but the random forest classifier, right? 01:17:41.000 --> 01:17:51.000 But I wanted them to be fit independently. So I didn't want a fitted base model to be impacted when I then went to fit this model. 01:17:51.000 --> 01:18:02.000 So when I fit the voter model, it would then go through and refit all of these individual models. 01:18:02.000 --> 01:18:13.000 So if I had put log here instead of LogisticRegression, when I call fit on the voter model, it would have refit log. 01:18:13.000 --> 01:18:21.000 And I'm not entirely sure, I'd have to double check, whether I'm able to access the individual models within the voting classifier and get the predictions that way. 01:18:21.000 --> 01:18:31.000 I can't remember; I'd have to check. 01:18:31.000 --> 01:18:39.000 Okay, so I mentioned this hard versus soft. So hard voting, which is the argument that I had, 01:18:39.000 --> 01:18:51.000 is just how you think voting works: we count up the number of yeses and the number of nos, and whichever one has more is the winner. 01:18:51.000 --> 01:18:59.000 So that's what hard voting means. For example, if 3 out of 4 say we're gonna have a 1, that means you get a 1. 01:18:59.000 --> 01:19:03.000 Now if there's a tie, I'm pretty sure it's just randomly decided. 01:19:03.000 --> 01:19:16.000 So that's how I think ties work. And then the other option is voting equals soft, which means that the predictions are weighted according to the probabilities. 01:19:16.000 --> 01:19:29.000 So for instance, instead of doing a hard vote, for each possible class you're going to sum up the probabilities across the different voter models.
01:19:29.000 --> 01:19:33.000 So here it'd be from 1 to 4. 01:19:33.000 --> 01:19:43.000 And so you'd get the probability from logistic regression, the probability from k nearest neighbors, the probability from the support vector machine, and the probability from the random forest, and then whichever class has the highest sum of probabilities would be the class that gets predicted. 01:19:43.000 --> 01:19:58.000 In either case, you can also perform weighted voting, for which you have to provide an argument of weights. 01:19:58.000 --> 01:20:03.000 And so a typical thing you could do, I think, is, well, never mind, I actually don't know. 01:20:03.000 --> 01:20:09.000 But you could provide weights. So I think these would be weights on the individual models. 01:20:09.000 --> 01:20:10.000 So you could maybe do something like 01:20:10.000 --> 01:20:15.000 weighting by the performance. 01:20:15.000 --> 01:20:22.000 So if one of your models has, like, a 99% accuracy, maybe you weight it more highly than the others that are lower. 01:20:22.000 --> 01:20:34.000 As a quick note, if you use voting equals soft, it won't work if a particular algorithm doesn't have the ability to provide a probability. 01:20:34.000 --> 01:20:44.000 So I believe in this example, if we were to change this to voting equals soft, 01:20:44.000 --> 01:20:55.000 we should eventually get an error. And so why do we get an error?
That's because LinearSVC doesn't provide predict_proba with the default arguments. 01:20:55.000 --> 01:21:08.000 I know in the general support vector machine there is an argument for it; I'm not sure if there is one for LinearSVC, so I'd have to check the documentation to see what argument I'd want to use to make predict_proba available. 01:21:08.000 --> 01:21:18.000 Okay, so that's an important thing to remember with soft voting: if you want to use it, the algorithms you use as your voters had better have the ability to provide probabilities. 01:21:18.000 --> 01:21:38.000 For voter models for regression, you can do an ensemble of independent regression models, and then I think it's just the average of whatever the models predict, or a weighted average, depending on whether you're weighting by maybe inverse MSE or something like that. 01:21:38.000 --> 01:21:51.000 I want to point out for this one in particular: this does not mean you build a bunch of different linear regression models with slightly different features and then feed those into the voter model. 01:21:51.000 --> 01:22:00.000 Those would all still be pretty dependent and might make the same types of errors. The idea behind voter models is that you're getting models that are sort of fundamentally different from one another. 01:22:00.000 --> 01:22:09.000 If they're making predictions in the same type of way, the various models are probably gonna make the same types of errors, 01:22:09.000 --> 01:22:22.000 and in the voting, that's just gonna get compounded. The idea with the voter model is: this model makes these types of errors, this model makes slightly different types of errors, and the third model makes even other slightly different types of errors. 01:22:22.000 --> 01:22:30.000 And then the hope is that by voting together, the individual errors will get wiped out for an overall better performance.
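As a sketch of the two combination rules described above, soft (optionally weighted) voting for classification and inverse-MSE-weighted averaging for regression, here's a pure-Python illustration. The probabilities, predictions, and validation MSEs are invented numbers; this mirrors the idea, not scikit-learn's exact code.

```python
def soft_vote(prob_vectors, weights=None):
    """Sum each class's probability across models; predict the argmax class."""
    if weights is None:
        weights = [1.0] * len(prob_vectors)
    n_classes = len(prob_vectors[0])
    scores = [sum(w * p[k] for w, p in zip(weights, prob_vectors))
              for k in range(n_classes)]
    return scores.index(max(scores))

def average_predict(preds, val_mses=None):
    """Regression voter: plain average, or weight each model by 1/MSE."""
    if val_mses is None:
        return sum(preds) / len(preds)
    weights = [1.0 / m for m in val_mses]
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

# Four classifiers' [P(class 0), P(class 1)] for one observation:
probs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55], [0.48, 0.52]]
print(soft_vote(probs))                          # class 0: it wins on summed
                                                 # probability even though it
                                                 # loses the hard vote 3 to 1
print(soft_vote(probs, weights=[0.2, 1, 1, 1]))  # down-weighted -> class 1

# Three regressors' predictions, weighted by inverse validation MSE:
print(average_predict([10.0, 12.0, 11.0]))                   # plain mean, 11.0
print(average_predict([10.0, 12.0, 11.0], val_mses=[1, 4, 2]))
```

Notice that soft voting can disagree with hard voting when one model is very confident, which is exactly why the probability estimates have to be available.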
01:22:30.000 --> 01:22:41.000 So for example, you could do a voter model with linear regression, k nearest neighbors regression, a support vector regressor, and a random forest regressor. Okay. 01:22:41.000 --> 01:22:47.000 So I see I have some questions. 01:22:47.000 --> 01:22:50.000 Okay, so the only question is from Pedro, who is asking: could you use boosting with voter models? 01:22:50.000 --> 01:22:58.000 Yes, you could put an AdaBoost classifier in. 01:22:58.000 --> 01:23:13.000 You could do that as well; just be mindful that it might make similar predictions to a decision tree or a random forest, because it's using a decision tree as its base. 01:23:13.000 --> 01:23:28.000 Are there other questions about voter models? 01:23:28.000 --> 01:23:34.000 Okay, so that's it for ensemble learning. Before we sign off for today, 01:23:34.000 --> 01:23:42.000 I wanna make a quick note. We're just gonna have the last lecture day tomorrow. 01:23:42.000 --> 01:23:50.000 And for this last day, we're gonna do neural network stuff. So the package we're using is the package I know, which is Keras. 01:23:50.000 --> 01:23:57.000 There are other packages, but we're gonna use Keras. This is also not installed 01:23:57.000 --> 01:24:02.000 by default, so I just want you to be aware: if you want to be ready for the problem session tomorrow, problem session 11, and for live lecture day 12, 01:24:02.000 --> 01:24:13.000 you want to go through and make sure you have Keras installed and can import things. So this could be a good one to look at. 01:24:13.000 --> 01:24:17.000 As you're getting ready, try to follow these instructions to make sure you have it installed.
01:24:17.000 --> 01:24:29.000 Keras is also a package within TensorFlow, so you may have to install it through TensorFlow depending on your computer. 01:24:29.000 --> 01:24:30.000 Okay, so that's gonna be it for today. I will hang around for any questions. 01:24:30.000 --> 01:24:46.000 Remember to check out this Keras thing for tomorrow. But if not, I will see you tomorrow, and have a great rest of your Wednesday.