Gradient Boosting Video Lecture Transcript. This transcript was automatically generated, so there may be discrepancies between the video and the text. Hi, everybody. Welcome back. In this video, we're going to learn about gradient boosting, another boosting algorithm in our ensemble learning section. ... So in this notebook, we're going to learn the concept behind gradient boosting and see why it's called gradient boosting; that part will happen at the end. We'll demonstrate the method with some nice plots, we'll implement the algorithm and learn along the way, and we'll learn about something known as early stopping. So gradient boosting is another boosting technique where we iteratively build an ensemble of weak learners to hopefully create a strong learner. The approach for this boosting algorithm is to directly train weak learner J+1 to model weak learner J's errors. So essentially, you look at the residuals, whatever the errors were from the previous weak learner, and instead of training weak learner J+1 on the original targets, you train weak learner J+1 to model those errors. In a regression setting, which is what we're going to work in, you start off with a single weak learner regression algorithm, once again a decision stump, to predict y. This is weak learner one. Weak learner one just does a regular regression problem: you fit the model to predict y using X, and then you calculate the residuals. We're going to denote the predictions from weak learner one as h_1(X). So you calculate the residuals of weak learner one, which are the actual values,
y, minus the predicted values, which I'm calling h_1(X); why we're using h_1 will become clear soon. So this is, remember, y-hat, the prediction of the first weak learner, and these r_1's are the residuals from weak learner one's prediction. Now for step J, which is any step after step one, you train a weak learner to predict the residuals from step J-1, which, following the notation from above, will be r_{J-1}. So for instance, the second weak learner is going to predict r_1: you use the features X to build a regression model predicting r_1. Then you set h_J(X) to be the fitted value, the predicted value, of whatever you were predicting. For us, because we are predicting the residuals from step J-1, those are the predicted values for the residuals, so r-hat_{J-1}, where again the subscript denotes which step of the weak learner sequence we're on. Then you calculate the residuals for this weak learner, which are the previous step's residuals minus your predicted values at this step: r_J = r_{J-1} - r-hat_{J-1}. So r_J is the residuals of the residuals, and if you kept going and worked it all the way out, it's like the residuals of the residuals of the residuals, and so forth. You stop when J+1 finally equals capital J, a predetermined stopping point that you set before running the algorithm. Then at the end, the prediction for y, which is what we're ultimately interested in, at step J is found by taking H(X), the sum of h_1 all the way up to h_J. So all these h_j's that we've been calculating: at the end, when we make a prediction of y, we just sum them all up.
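To make those steps concrete, here is a minimal by-hand sketch of the algorithm just described, fitting decision stumps to successive residuals. The synthetic sine data is an assumption for illustration, not the notebook's actual data set.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data (a stand-in for the notebook's data set)
rng = np.random.default_rng(216)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 200)

# Step 1: fit weak learner 1 (a decision stump) to predict y
stumps = []
tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
stumps.append(tree)
residuals = y - tree.predict(X)  # r_1 = y - h_1(X)

# Steps 2..J: fit each new stump to the previous step's residuals
J = 4
for _ in range(J - 1):
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    stumps.append(tree)
    residuals = residuals - tree.predict(X)  # r_J = r_{J-1} - r_hat_{J-1}

# Final prediction: H(X) = h_1(X) + h_2(X) + ... + h_J(X)
H = sum(tree.predict(X) for tree in stumps)
```

Because each stump is fit to whatever error is left over, the training error of the running sum H can only go down (or stay flat) as J grows.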
So h_1 was the original prediction for y, h_2 was the prediction of the residuals from the first weak learner, h_3 would be the prediction of the residuals from the second weak learner, and so forth. When you add those all together, the hope is that the sum will be close to the actual value for any given value of X. It may make sense to visualize this. Here's the data set: we have X on the horizontal axis and y on the vertical. We can see this isn't a linear relationship, and maybe, if we were doing what we learned in the regression notebook, we might fit a polynomial regression to this. What we're going to do next is run the sequence of steps we laid out above, step one through step J, and then show you: what does it look like when I have h_1? What does it look like when I have h_2? And on the right panel, we'll show what h_1 + h_2 looks like. We're going to do this by hand before we show you how to use GradientBoostingRegressor in sklearn, using just a single DecisionTreeRegressor object each time. So we use the first tree, which is technically just a stump because we set a max depth of one; we fit it on the original data set, then we get our predictions h_1 on the original data set, which we use to calculate our residuals. This is the first step. On the left, we plot the original training data along with h_1, and on the right it'll be the same exact plot to start; the idea is that on the right-hand side we're going to plot the running prediction H, while on the left-hand side we're going to plot h_1, h_2, h_3, and h_4 individually. For the second step of the gradient boosting algorithm, we now fit a decision stump to predict the residuals from the first step, and we get h_2, which are those predictions.
Then we calculate the residuals for this step, which are the previous step's residuals minus our current prediction for those residuals. So on the second row, we have h_2 plotted on the left and H, which is the sum of h_1 and h_2, plotted on the right. Weak learner three works the same way: we fit a decision stump to predict the residuals from the second weak learner and then calculate those residuals. Again, we plot h_3 on the left and H, the running sum h_1 + h_2 + h_3, on the right. And then the same thing for weak learner four. So we're going to run this now. On the left, we have each h_j with its training data, and on the right, we have H and the original training data. For the first step, both plots are the same, because h_1 was trained by regressing y onto X, and at this point H is just equal to h_1. Then we calculated the residuals, so in the left-hand plot we now see r_1 plotted, and looking at the residuals we'd expect that we do reasonably well on the right-hand side of the data but very poorly on the left-hand side. So h_2 actually has a flat line here, and the cut point happens over on the left-hand side to accommodate these residuals that grow as we get further and further to the left. When we add that in, H = h_1 + h_2, we actually get a somewhat decent prediction. OK, so on this right-hand plot we have y on the vertical and X on the horizontal. The next step takes the residuals from this step, which on the left-hand side are plotted on the vertical axis, with X once again on the horizontal, and tries to predict those residuals.
Then on the right-hand side, we can see how that gets added into the running sum for H: h_1 + h_2 + h_3. And finally, we'll have r_3 on the left-hand side, on the vertical axis, being predicted with h_4, and the running sum for H is now h_1 + h_2 + h_3 + h_4, which, if we kept going, would start to more and more resemble a sort of blocky approximation of the data. So that's the idea behind gradient boosting. How do we implement it in sklearn? Well, we use the GradientBoostingRegressor. So from sklearn.ensemble we import GradientBoostingRegressor, and I'll point out that while we are using this for regression, there's also a GradientBoostingClassifier, which can be used for classification problems. Now, unlike with AdaBoost, I don't think we set a base estimator here, and we can check the documentation; maybe it makes sense to go down to the bottom and look at the examples. ... When they implement it there, they don't pass in a decision tree. So for sklearn's GradientBoostingRegressor, unlike its AdaBoostRegressor, we don't set a base regression object: it uses decision trees by default, and I don't know that there's a way to change that without a deeper dive into the documentation. So GradientBoostingRegressor automatically uses a series of decision trees, in contrast to AdaBoostRegressor, which will take in any base estimator you'd like to use. By default, these decision trees are actually not stumps; they have a maximum depth of three, but you can change that to a maximum depth of one, which is what we're going to do.
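A minimal sketch of that setup, using max_depth=1 so each tree is a stump. The synthetic data here is an assumption standing in for the notebook's data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data standing in for the notebook's data set
rng = np.random.default_rng(216)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 200)

# max_depth=1 makes each tree a decision stump (the default is max_depth=3);
# n_estimators controls how many trees are fit in sequence (default 100)
gb = GradientBoostingRegressor(n_estimators=100, max_depth=1)
gb.fit(X, y)
pred = gb.predict(X)
```

After fitting, the individual trees live in `gb.estimators_`, one per boosting step.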
The number of trees is determined by the n_estimators argument, which has a default value of 100. It also has a learning_rate argument. Now, this learning rate was not part of our setup above, but I believe it's related to multiplying the residual predictions by the learning rate. We can go back to the documentation and see what it says, although I question how useful that will be. OK: "learning_rate shrinks the contribution of each tree by learning_rate." So, for instance, in the example above we implicitly used a learning rate of one, because each h had a contribution of one, but the learning rate multiplies this. With a learning rate of alpha, H would be alpha times h_1 plus alpha times h_2 plus alpha times h_3 and so forth, and that would in turn impact how we calculate the residuals at each step. So you can control how much of a contribution each weak learner has in the final prediction using this learning rate, and we'll quickly demonstrate its impact. We'll call GradientBoostingRegressor with n_estimators equal to 10, just for this one, and a maximum depth equal to one, so they're decision stumps; the only difference between the two models is the learning rate. For the first one, labeled small rate, we set the learning rate equal to 0.1, and for the second one, labeled large rate, we use a learning rate of one. Then I just plot. On the left is what you get with the learning rate of 0.1. You can see how the smaller learning rate is less overfit to the data; remember in the original fit we had this very large jump, but the smaller learning rate lessens the contribution of each weak learner.
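The comparison just described might look like this. The data is again an assumed synthetic stand-in, and the labels small_rate/large_rate mirror the plot labels in the notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(216)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 200)

# Same ensemble size and stump depth; only the learning rate differs
small_rate = GradientBoostingRegressor(
    n_estimators=10, max_depth=1, learning_rate=0.1).fit(X, y)
large_rate = GradientBoostingRegressor(
    n_estimators=10, max_depth=1, learning_rate=1.0).fit(X, y)

# The smaller rate shrinks each stump's contribution, so after only 10
# trees it has fit the training data less closely than the larger rate
mse_small = float(np.mean((y - small_rate.predict(X)) ** 2))
mse_large = float(np.mean((y - large_rate.predict(X)) ** 2))
```

With a learning rate of 1.0, each stump takes a full step toward the residuals, so the training error drops faster; with 0.1, each step is damped and the fit stays smoother.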
With the larger learning rate, though, you can start to see some overfitting occurring. So there's a tradeoff on the size of the learning rate, just like with an AdaBoost classifier: the larger the learning rate, the more quickly you'll overfit. It's often preferable to use a smaller learning rate and let the algorithm train more and more weak learners. So with a small learning rate, you can just increase the number of estimators and let it train for longer; for instance, this 0.1 model, if we let it run longer, would eventually approach what you see with the learning rate of one. Another way to control underfitting and overfitting is by changing the number of estimators. So here we're going to show a feature called staged_predict. What I did here is generate a validation set; because the initial data was randomly generated, I don't have to make a split, I can just draw another quick sample, as if I were going out into the world and collecting more data. What we're going to do is track the mean squared error of the regression from the GradientBoostingRegressor. We're going to have it fit 200 weak learners in sequence, and then I'll show you how we can track the mean squared error at each step along the way. For each weak learner that gets added, we can get its predictions, which are what we were plotting over here on the right, record the mean squared error from them, and then essentially go through and see where the mean squared error was smallest; that's how long we'd want to train the final fitted model.
And that's what we would use in the actual model we're going to implement, say, in some production environment. So how can we do that? Well, we can use something called staged_predict. First, let's demonstrate what it does: gb.staged_predict(X.reshape(-1, 1)). What we see here is a generator object, which we can think of as just something you loop through. And when we have something we're going to loop through, and we'd like to record everything from that loop, we can use a list comprehension. Another way to think about it: what if we did a for loop? For each thing in the generator, we can print out the thing, and we can see that what it's giving us is the predictions. And actually, we want not X but X_val: the predictions on the validation set. What the loop goes through is all the different ensemble sizes. The first entry printed is the predictions when we only have a single weak learner; the next one, now highlighted, is all the predictions when we have two weak learners; the next is the predictions with three weak learners, and so forth. OK, so we can use a list comprehension to store the mean squared error on the validation set for each stage: mean_squared_error of the actual validation values and the predicted validation values, for predicted in gb.staged_predict(X_val.reshape(-1, 1)), where gb is our gradient boosting regression object. So here are the mean squared errors for all of those predictions I printed out above. For instance, at index zero we have a mean squared error of 2.25, and after 200 weak learners we have a mean squared error of 1.125.
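Putting the staged_predict piece together, the pattern looks something like this. The data and the variable names (X_val, val_errors) are assumptions mirroring the walkthrough, not the notebook's exact values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic training data, plus a freshly drawn validation sample
rng = np.random.default_rng(216)
X = np.linspace(-3, 3, 300)
y = np.sin(X) + rng.normal(0, 0.3, 300)
X_val = rng.uniform(-3, 3, 100)
y_val = np.sin(X_val) + rng.normal(0, 0.3, 100)

gb = GradientBoostingRegressor(n_estimators=200, max_depth=1, learning_rate=0.1)
gb.fit(X.reshape(-1, 1), y)

# staged_predict returns a generator: the validation predictions after
# 1 weak learner, then 2, then 3, ..., up to n_estimators
val_errors = [mean_squared_error(y_val, pred)
              for pred in gb.staged_predict(X_val.reshape(-1, 1))]

# argmin is 0-indexed, so add 1 to get the best number of weak learners
best_n = int(np.argmin(val_errors)) + 1
```

On the lecture's data the minimum landed at 112 weak learners; on other random draws best_n will of course land somewhere else.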
Then here I just plot that. Here's the validation error on the vertical axis and the number of weak learners in our ensemble on the horizontal, and we can see that the minimum occurs somewhere between 100 and 125, looks like around 112. What is the exact value? It's exactly 112. So the minimum of these MSEs on the validation set occurs at the 112th weak learner added. OK, so that's the best number of weak learners. If we were doing this in practice, we'd say, OK, let's train everything from the beginning, train all the way through the 112th weak learner, and stop there; that's what the best, lowest MSE looks like. ... Now, maybe you don't want to go through all of that. Maybe it takes a decent amount of time to train through 200 weak learners, and so you don't want to train them all and then look back. Instead, you can implement something called early stopping. Early stopping essentially does the same thing, but it tracks the minimum as it goes: if it goes, say, 10 steps without getting lower than whatever that minimum value was, it stops and tells you, here's where I stopped, and the minimum that caused me to stop happened at 112. OK, so that's the idea of early stopping: instead of running for however long you planned, you run until you hit a minimum, and after 10 steps, or however many steps, of not going lower than that minimum, the algorithm stops and returns what has already been fit. We can implement this with something called a warm start. And so why do we do this?
Well, if you use the warm_start argument, which is just a True or False, this forces sklearn to keep all the older trees when the fit method is called again. What do we mean by that? We're going to show you. So we have a GradientBoostingRegressor with maximum depth one, and I've set warm_start equal to True. Then I set a minimum validation error, and so that it will always be beaten no matter what the error is, I set it to infinity: as long as the error is a finite number, it will be better. This is just an initial value that's guaranteed to be beaten. Then I'm going to record my validation errors in a list, and I keep track of whether or not my error went up. So I'm going to check: did my error go up? And in this example, once the error has gone up, or at least not improved, 10 times in a row, that's when I'm going to stop. ... What I do now is set up a loop that may potentially go through 500 times, and I train the number of weak learners I want at each pass. How can I do that? Well, GradientBoostingRegressor, as well as the classifier, has this feature where you can set gb.n_estimators, where gb is the variable for my gradient boosting regressor. Note I did not set n_estimators at the beginning; inside the loop, I set it equal to the number of estimators I want, then I fit the model and append the validation error, that is, the error on the validation set for that number of estimators: the true validation values against the predicted values from that number of estimators. Then what do I do? I check: is the validation error I just recorded lower than my minimum? The first time through, that's guaranteed, because I initially set the minimum to infinity and hopefully I get a finite number, right?
And so if it is, I record the new minimum validation error and then reset the counter. Why reset the counter? Well, if I have a new minimum, that means my error did not go up in the previous step; it went down. If instead the error went up (say I hit a new low and then in the next step it goes up even a little bit), I add one to the count of times it went up, and once I hit 10 times in a row, I say, all right, I'm no longer going to go through this loop. That's the early stopping. And as we go through, we see weak learner one, weak learner two, it keeps going and going, and then it stops at training weak learner 122, which means my early stopping triggered 10 steps after 112. And here I just plot that. So this is the same plot as before, but we can see that we now stop at step 122, because for 10 steps the error went up, or at least did not get lower than the minimum at step 112; instead of continuing to train all the way up to 200, I stopped there. So this is early stopping. We did it here with gradient boosting, but you could also implement something like this with AdaBoost.
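The early-stopping loop just described can be sketched like this. The synthetic data, the 500-round cap, and the patience of 10 are assumptions mirroring the walkthrough.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(216)
X = rng.uniform(-3, 3, 300).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 300)
X_val = rng.uniform(-3, 3, 100).reshape(-1, 1)
y_val = np.sin(X_val).ravel() + rng.normal(0, 0.3, 100)

# warm_start=True keeps the already-fitted trees when fit is called again
gb = GradientBoostingRegressor(max_depth=1, warm_start=True)

min_val_error = np.inf  # initial value guaranteed to be beaten
error_going_up = 0      # consecutive steps without a new minimum
val_errors = []

for n_estimators in range(1, 501):
    gb.n_estimators = n_estimators  # grow the ensemble by one tree
    gb.fit(X, y)
    val_error = mean_squared_error(y_val, gb.predict(X_val))
    val_errors.append(val_error)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0  # new minimum found, so reset the counter
    else:
        error_going_up += 1
        if error_going_up == 10:  # 10 steps with no improvement: stop
            break
```

On data like this, the loop will typically break well before 500 rounds; the exact stopping point depends on the random draw.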
So we're just adding the new h onto the running sum. Alternatively, you can rearrange and see that little h_{J+1} is approximately equal to y minus the previous running sum: h_{J+1}(X) ≈ y - H_J(X). OK. For a regression problem, you typically try to minimize the mean squared error of this estimate, which for simplicity you can denote as something like L = (y - H_J(X))^2 at the J+1 step. And if you take the negative gradient of this with respect to H_J, what you end up with is proportional to y - H_J(X), which is exactly the thing we predict at the J+1 step. So gradient boosting is, in some sense, roughly equivalent to doing a gradient descent algorithm, and that's why it's called gradient boosting. OK, so I believe what happened is this approach came first, sequentially predicting the residuals and keeping a running sum to predict the actual value, and then it was shown that this is roughly the same as following the gradient; that's where the name comes from. All right, so that's it for this notebook. I hope you enjoyed learning about gradient boosting as well as early stopping. In the next notebook, number seven, we'll learn about XGBoost, which is a nice package set up to run gradient boosting in a very optimized way. I look forward to showing you that video. I hope you enjoyed this video, because I enjoyed having you watch it, and I will see you next time. Have a great rest of your day. Bye.