AdaBoost Video Lecture Transcript. This transcript was automatically generated, so there may be discrepancies between the video and the text. Hi, everybody. Welcome back. In this video we're going to continue learning about ensemble learning with an algorithm known as adaptive boosting, or AdaBoost. Let me go ahead and share my Jupyter notebook and we'll get started. So AdaBoost, as I just said, is short for adaptive boosting. It trains a series of weak learners sequentially, where each subsequent classifier or regressor pays more attention to the training instances that were misclassified by its predecessors. So the way this works is we're going to have a sequence of decision stumps in this notebook. Decision stump one will learn something. Decision stump two will look at the things decision stump one got wrong and focus on them a little bit more, and it keeps going in that way. So in this notebook we were going to discuss the theoretical foundation of the AdaBoost algorithm, but I think we mainly covered that in the previous video, so I'm going to get rid of this point; we're not going to do that because we covered it in the boosting video. We are going to demonstrate how to implement AdaBoost in sklearn, and we'll provide references at the end for you to learn more about AdaBoost. And maybe what I should add, because another thing we are going to do, is show how AdaBoost works. There we go. OK, so how does it work? AdaBoost, again, short for adaptive boosting, builds a hopefully strong learner with an iterative sequence of weak learners. Here's a more formal setup, and if this is hard to follow, we have some nice pictures below where I break it down step by step. So in AdaBoost you're going to have a sequence of weak learners. The first weak learner is trained on the training data like it normally would be.
So for a decision stump, it's going to run the CART algorithm, and we're probably going to get pretty bad decisions off of that. After you train the first one, the weak learner at step j in the sequence will be trained using weights on the training sample determined by the performance of weak learner j - 1. So basically the training sample is going to have weighted instances, and the higher the weight, the more often that observation has been misclassified, or has had a very bad regression prediction, by the previous weak learners, so weak learner j will pay more attention to it. After you train W total weak learners, you make a final prediction by doing a weighted vote among all of the weak learners, where the voting weight is determined by each particular weak learner's accuracy. That's the case for classification; in regression it's determined by the mean squared error. So we're going to explicitly set this up and see what we mean; this is more of an overview. We're going to let y^(i) denote the class of observation i, ŷ_j^(i) denote the prediction on observation i of weak learner j, and w^(i) denote the current weight assigned to observation i for whatever classifier you're training. So after you have trained the j-th classifier, the learner's weighted error rate is calculated. This weighted error rate is r_j, and it's this very hard to say out loud sum, but basically it's one minus the weighted accuracy of the classifier. Remember, regular accuracy is just the total number you get right divided by the total number of observations. In weighted accuracy, each observation has a weight associated with it.
So each observation you get correct contributes that observation's weight to the upper sum, and the lower sum is just there so that the weighted accuracy is between zero and one. So this r_j is a measure of the error rate of the classifier at step j. It's large when you have a bad weak learner: the worse the weak learner at step j, the larger the value of r_j. And r_j is small if you have a good weak learner. Why is that? Because if you have a good weak learner, your weighted accuracy is closer to one. So the next step, after you have these r_j's, is to compute the weight assigned to the weak learners. Remember, we said that after we've trained however many we want to train, decisions of the AdaBoost algorithm are determined by a weighted vote. So how do we get the weights for each classifier in the sequence? Well, we calculate them using these r_j's. The weight associated with weak learner j is denoted alpha_j, and it's a times the log of (1 - r_j) / r_j. This a is a learning rate, a hyperparameter that you have to set. Again, the best hyperparameter value is probably going to be decided with cross-validation; there's no silver bullet to tell you which one you should choose for every instance. So this alpha_j, the weight that the vote of weak learner j gets, is small when you have a bad weak learner (when r_j is large) and large when you have a good weak learner (when r_j is small). And then after that, we update the training sample weights for weak learner j + 1. The weight on observation i stays the same if weak learner j got it correct, and it's inflated by an exponential factor if it got it incorrect. OK.
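To make these update rules concrete, here's a minimal Python sketch of the three formulas we just walked through (the function names here are my own, not sklearn's internals):

```python
import numpy as np

def weighted_error(y_true, y_pred, w):
    # r_j: the sum of the weights of the misclassified observations,
    # divided by the total weight so that r_j lands between 0 and 1
    miss = (y_true != y_pred)
    return w[miss].sum() / w.sum()

def learner_weight(r_j, a=1.0):
    # alpha_j = a * log((1 - r_j) / r_j), where a is the learning rate
    return a * np.log((1 - r_j) / r_j)

def update_weights(y_true, y_pred, w, alpha_j):
    # correct predictions keep their weight; incorrect ones
    # are inflated by a factor of exp(alpha_j)
    w_new = w.copy()
    miss = (y_true != y_pred)
    w_new[miss] *= np.exp(alpha_j)
    return w_new
```

For example, with six observations each weighted 1/6 and a stump that misses exactly one of them, `weighted_error` returns 1/6 and `learner_weight` returns log(5), which matches the toy example below.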
And so this increases, as I said, the weight on incorrectly predicted samples, assuming your alpha_j's are greater than zero, which is usually a safe assumption. OK, so that was a lot of formulas, and maybe it didn't make a lot of sense, so let's look at a toy example. Let's hypothetically say I have this data with six observations: three of them are blue circles and they tend to live up here, and three of them are orange triangles and they tend to live down here. At the start we train our first weak learner like it's just a regular decision stump, so each of these observations has a weight of 1/6, and we train our decision stump on it. Maybe it produces something like this, where to the left, in this shaded blue region, we classify as a blue circle, meaning observations 1 and 2 are correct, and to the right we shade as orange, meaning we would predict an orange triangle, so observations 4, 5, and 6 are correct, but observation 3 is incorrect. So what do we do? After we get our predictions, we have to go through and calculate r_1, which we can then use to calculate the weight that weak learner 1 gets in the final decision. So for r_1, we have to go through: which ones did we get correct, and which did we get incorrect? Things that we got correct contribute zero to the numerator of r_1. So 1 contributes a 0, 2 contributes a 0, and 4, 5, and 6 contribute 0s. Observation 3 is the one we got incorrect, so in the numerator we contribute the weight associated with observation 3 at this step, which is just 1/6. And in the denominator we sum up the weights of all the examples, so we end up with (1/6) / 1.
So r_1 is just 1/6, and then we can calculate alpha_1 just by plugging in 1/6, and we end up with log of 5. Now we have to update the weights. All of the weights stay the same except the one associated with the one incorrect point, w^(3). We update according to the rule above, which gives observation 3 a new weight of 5/6. So now we use these new weights, and maybe the second learner, due to this increased weight, produces a cut point like this, where I've correctly classified observation 3 but incorrectly classified observation 4. So we can calculate r_2 and alpha_2 in much the same way. For r_2, we look at this picture and see that observations 1, 2, 3, 5, and 6 are correct, so the sum in the numerator has zeros in the 1, 2, 3, 5, and 6 spots. The weight of observation 4 from the previous step was 1/6, so that's in the numerator. But now our denominator changes, because remember, the weight of observation 3 is different. If you work it out, this gives r_2 the value of 1/10. You can then calculate alpha_2 using the formula, and we end up with log of 9. Then we update the weights. The weights for 1, 2, 3, 5, and 6 stay the same; notice that observation 3 carries over that updated weight from its earlier incorrect classification until we get it wrong again. And observation 4 gets an updated weight because we got it wrong this time, and this updated weight is 3/2. So if we were to stop here, we would have an ensemble, a sequence, of two weak learners: weak learner 1 would get a voting weight of log of 5, and weak learner 2 would get a voting weight of log of 9. And what we can tell is that we would get 1, 2, 5, and 6 correct, because they're correct for both of them.
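We can check the round-two numbers from this toy example numerically; here's a quick sketch of that arithmetic:

```python
import numpy as np

# weights after round 1: observations 1, 2, 4, 5, 6 keep 1/6,
# observation 3 was inflated to 5/6
w = np.array([1/6, 1/6, 5/6, 1/6, 1/6, 1/6])

# round 2: only observation 4 (index 3) is misclassified
r2 = w[3] / w.sum()               # (1/6) / (10/6) = 1/10
alpha2 = np.log((1 - r2) / r2)    # log(9)
w4_new = w[3] * np.exp(alpha2)    # (1/6) * 9 = 3/2

print(r2, alpha2, w4_new)
```

Note how the inflated weight on observation 3 shows up in the denominator, which is why r_2 comes out to 1/10 rather than 1/6.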
And then, for observations 3 and 4, it depends on the vote. Let's see: log of 9 is bigger than log of 5, so weak learner 2's vote outweighs weak learner 1's. So if we stopped at two weak learners, observation 3 would be correct and observation 4 would be incorrect. But if we kept going, these updated weights would probably fix it. OK, so that's sort of an illustration. Now that we have that in mind, why don't we show how to use this in sklearn. For a classifier, it's called AdaBoostClassifier, and it's stored in sklearn.ensemble; if you want to do regression, it's called AdaBoostRegressor. We're going to use it on this dataset, which we used a lot in our classification examples, where we use the line y = x as a decision boundary: below, to the right, we have 1s, and above, to the left, we have 0s. As our weak learner we are going to use the decision stump. So let's first import AdaBoost: from sklearn.ensemble import AdaBoostClassifier. Then we also need to import our base weak learner: from sklearn.tree import DecisionTreeClassifier. And now, instead of you watching me build a tree by typing the code, I've already typed the code, and the whole point is we're going to go through a loop where I sequentially add in more weak learners. The first time through the loop, our AdaBoost classifier will have one weak learner, and then we're going to draw the decision boundary that comes from that. The second time through the loop, I'm going to make an AdaBoost classifier with two weak learners and draw the associated decision boundary; then an AdaBoost classifier with three weak learners and its associated decision boundary; and so on. We're going to go through the loop six times.
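Before moving on to sklearn, here's a small sketch of that final weighted vote for the two-stump toy ensemble, just to confirm the conclusion we reached:

```python
import numpy as np

alphas = np.array([np.log(5), np.log(9)])  # voting weights of the two stumps
# stump predictions on the six observations (class 0 = blue, 1 = orange)
preds = np.array([
    [0, 0, 1, 1, 1, 1],   # stump 1: gets observation 3 wrong
    [0, 0, 0, 0, 1, 1],   # stump 2: gets observation 4 wrong
])
truth = np.array([0, 0, 0, 1, 1, 1])

# for each observation, sum the alphas of the learners voting for each class
votes_for_1 = (alphas[:, None] * preds).sum(axis=0)
votes_for_0 = (alphas[:, None] * (1 - preds)).sum(axis=0)
ensemble = (votes_for_1 > votes_for_0).astype(int)

# since log(9) > log(5), stump 2's vote wins on observations 3 and 4,
# so observation 4 is the ensemble's only miss
print(ensemble)
```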
So by the end, we'll be drawing a decision boundary associated with an AdaBoost classifier that has six weak learners in the sequence. To make an AdaBoost classifier, and we'll go through it right now together, it's this highlighted chunk of code, you call AdaBoostClassifier, and the first argument you give, which I'm now outlining with my cursor, is the weak learner you want to use. It doesn't have to be a decision stump; you could use something else, but for this problem I'm going to use a decision stump, so I call DecisionTreeClassifier and set max_depth equal to 1: that's a decision stump. There we go. Then you set the number of estimators you want; this is the number of weak learners you want in your sequence. Remember, for this one, as I loop through, i is going to be the number of weak learners, so I just set n_estimators equal to i. Then algorithm='SAMME.R': this is the algorithm that sklearn runs to fit the model, and there are two options, 'SAMME.R' and 'SAMME'. The reason I chose 'SAMME.R' is that it allows for the calculation of probabilities: I could make an AdaBoost classifier, fit it, and then run ada_clf.predict_proba to get the probabilities. If you use 'SAMME', you're going to end up with a hard classification of 0 or 1 for binary, and then whatever your multiclass options are. So 'SAMME.R' gives you the probabilities associated with each observation. And then I set a random_state, because I believe there's a random guess at the beginning and then sort of a gradient-descent-type thing going on in the background, I think. OK, so now that we're done with that explanation, let's look at the results. Here, one weak learner and two weak learners look pretty similar.
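The loop I just described looks roughly like this (a sketch: the toy data generation here is my own stand-in for the notebook's dataset, and I leave `algorithm` at its default, since recent scikit-learn versions deprecate that parameter):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# toy data: label 1 below/right of the line y = x, label 0 above/left
rng = np.random.default_rng(216)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] > X[:, 1]).astype(int)

for i in range(1, 7):
    ada_clf = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1),  # decision stump weak learner
        n_estimators=i,                        # i weak learners in the sequence
        random_state=216,
    )
    ada_clf.fit(X, y)
    # in the notebook we draw the decision boundary at this point;
    # here I just print the training accuracy instead
    print(i, ada_clf.score(X, y))
```

After fitting, `ada_clf.predict_proba(X)` returns the class probabilities for each observation, which is what the soft (real-valued) variant of the algorithm buys you.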
But when we get down to three, we now have this additional cut that's trying to accommodate the things we kept getting wrong above, right? We kept getting these four blue ones wrong; now, with three weak learners, it tries to correct what it did wrong before and moves the cut point over to the left. ... And now the only thing we're left predicting incorrectly on the training data is this lone blue one. ... And after five, we stabilize, because we've correctly predicted everything. OK. And as you might notice, when you increase the number of weak learners you use, you tend to overfit. If we had more observations in here, we would start to overfit a lot more, and that's why we end up with this very blocky boundary: it does sort of the bare minimum it needs to do to correctly classify all of the observations in the training set. So you can mess around with the number of estimators you use. You can also do something called early stopping, which we'll see in our gradient boosting notebook, which is notebook number six in the ensemble learning folder. But this is the AdaBoost algorithm. We showed you the theoretical setup behind it, and we showed you how to implement it with sklearn. You know, the correct number of estimators, the length of the sequence you're using, probably depends upon the data set you're fitting. OK, so here are some nice references to learn more. I guess this first one is for bagging; I should probably move that over to the bagging notebook, but the next three are all about boosting. So in your version of the notebook, this bagging one will have moved over to the bagging and pasting notebook, but these last three are good references to learn more about AdaBoost and boosting in general. OK.
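On that early stopping point: scikit-learn's AdaBoost estimators expose staged predictions, which let you score the ensemble after each weak learner without refitting, one simple way to choose the sequence length. A sketch, with my own hypothetical data setup:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# hypothetical data on the same y = x boundary as before
rng = np.random.default_rng(440)
X = rng.uniform(0, 1, size=(400, 2))
y = (X[:, 0] > X[:, 1]).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=440)

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=440
)
ada_clf.fit(X_train, y_train)

# staged_score yields the validation accuracy after each weak learner,
# so we can see where adding more estimators stops helping
val_scores = list(ada_clf.staged_score(X_val, y_val))
best_n = int(np.argmax(val_scores)) + 1
print(best_n, max(val_scores))
```

Here `best_n` is the sequence length with the highest validation accuracy; in practice you'd then refit (or truncate) with that many estimators.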
So I hope you enjoyed learning about AdaBoost. I enjoyed having you watch this video, and I can't wait to see you next time. Have a great rest of your day. Bye.