Gradient Descent Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.
12:04:18 Hi, everybody! Welcome back. In this video, we're going to go over supervised learning using gradient descent.
12:04:27 So in this video in particular, we will talk about a sort of multi-purpose algorithm that's often used to find coefficient estimates for other, more specific algorithms.
12:04:37 So gradient descent is a very general approach. It gets used all the time.
12:04:42 We're going to remind ourselves of a little bit of calculus as to why gradient descent gets used, and then we'll have an example where we implement gradient descent by hand in code.
12:04:54 And then we'll show you scikit-learn's SGD, stochastic gradient descent, model.
12:05:03 So remember from calculus: if we have a differentiable function,
12:05:07 so a function that has derivatives, f, from n-dimensional space to R,
12:05:14 think of, like, modeling something we'd like to predict,
12:05:19 y, by modeling it using different features. Then the gradient is given by the vector of the partial derivatives of f with respect to the different dimensions. So it is a known fact from calculus, probably Calculus 3, that for any point x*, the direction of the gradient vector of f
12:05:44 evaluated at x* is the direction in which f
12:05:48 most rapidly increases, and there's a proof here.
12:05:51 So that means also that the direction of the vector negative of the gradient at x* is the direction in which f most rapidly decreases. So we're going to exploit this latter fact
12:06:03 in order to build an algorithm to optimize coefficient estimates.
12:06:10 So remember, for instance, with regression, we were trying to minimize this loss function: one over n times the sum of squared errors, the mean squared error.
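As a quick numerical check of that latter fact, here's a sketch using a made-up quadratic function; the function, point, and step size are illustrative choices, not from the video:

```python
import numpy as np

# A simple differentiable function f : R^n -> R and its gradient.
# f(x) = sum(x_i^2), so grad f(x) = 2x. Both are made up for illustration.
def f(x):
    return np.sum(x ** 2)

def grad_f(x):
    return 2 * x

x_star = np.array([3.0, -4.0])   # an arbitrary point
step = 0.1                        # a small step size

# Moving along -grad f (steepest descent) lowers f;
# moving along +grad f (steepest ascent) raises it.
downhill = f(x_star - step * grad_f(x_star))
uphill = f(x_star + step * grad_f(x_star))
print(f(x_star), downhill, uphill)
```

Any small enough step against the gradient decreases f, which is exactly the property the algorithm below exploits.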
12:06:20 We were trying to minimize the mean squared error. So while we had an actual explicit formula, the normal equations, there are reasons why you wouldn't want to use the normal equations.
12:06:34 So for example, the normal equations involve taking the inverse of a matrix, and taking the inverse of a matrix can be very computationally expensive. So if we had a different way that works just as well, it may be preferable to use that different way, and that different way here is going to
12:06:50 be gradient descent. So how do I use gradient descent to minimize this function?
12:06:57 You first make an initial guess for the coefficient beta hat; we're gonna call this beta star.
12:07:04 Usually this is just a random estimate, so some random estimate, beta star.
12:07:10 You then calculate the gradient at beta star, and then you update your guess by taking the current guess minus a little multiple of the gradient at beta star.
12:07:26 And so you'll repeat these steps 2 and 3, where you calculate the gradient and then make your adjustment, until either you've done it a certain number of times, like maybe 5,000 or 50,000, or the difference between the update and the old value is less than some tolerance, so maybe 0.0001
12:07:43 or something. Alpha here is known as the learning rate.
12:07:48 This is a hyperparameter that you'd have to set before running the algorithm.
12:07:53 So we're going to implement this for linear regression
12:07:55 by hand. So here I am generating some data. As a reminder,
12:08:00 here's the gradient of that loss function. So here's my data.
12:08:06 Very, very nice data for linear regression: a very easy, clear relationship.
12:08:11 It should be easy to fit. So for my learning rate, I'm gonna choose 0.05,
12:08:16 say, and for my initial guess, I'm just going to do np.random.random((2, 1)). For my tolerance
12:08:26 I'm gonna do 0.00001.
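The steps just described can be sketched in a few lines. This uses a toy one-dimensional loss made up for illustration, f(beta) = (beta - 3)^2, rather than the regression loss from the video:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy loss f(beta) = (beta - 3)^2, whose gradient is 2 * (beta - 3).
# Chosen only to illustrate the update loop; the minimizer is beta = 3.
def gradient(beta):
    return 2 * (beta - 3)

alpha = 0.05          # learning rate (a hyperparameter you set up front)
tol = 1e-5            # stopping tolerance
max_iter = 50_000     # cap on the number of updates

beta_star = rng.random()   # step 1: a random initial guess
for _ in range(max_iter):
    # steps 2-3: current guess minus a small multiple of the gradient
    new_beta = beta_star - alpha * gradient(beta_star)
    if abs(new_beta - beta_star) < tol:   # stop once updates barely move
        beta_star = new_beta
        break
    beta_star = new_beta

print(beta_star)   # close to the minimizer, 3
```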
And then I'm gonna do 50,000 as the maximum number of iterations.
12:08:36 So my new beta star is going to be beta star, my current guess, minus alpha times
12:08:44 my gradient, which is 2 divided by... let's just write down what it was.
12:08:54 To make it easier for myself, I'm just gonna define n up top.
12:08:58 So n = 200.
12:09:02 So this will just make it easier for typing the formula.
12:09:06 So n is 200. So 2 divided by n, times X
12:09:11 .transpose().dot(X.dot(beta_star)
12:09:23 minus y.reshape(-1, 1)). Okay? And so then this will take that new guess,
12:09:33 it'll check to see if the norm of the difference is less than the tolerance, and if it is, then that's beta hat.
12:09:41 Otherwise we update.
12:09:44 Okay, so we went through. We could run it again if we wanted.
12:09:48 Okay. And then we can just compare it to what we'd get with the normal equations.
12:09:52 So slightly different, but basically the same. If we set a different tolerance,
12:10:00 for example, we'd be closer. We could probably get even closer if we did that.
12:10:06 Okay, so that's gradient descent. We've implemented it.
12:10:11 It can be implemented for regression problems, classification problems, anywhere
12:10:14 you have a cost function that you can write down the gradient for. Even in situations where you can't write down a gradient, there are tricks. But that's gradient descent.
12:10:24 So what are some common adjustments that are made to base-level gradient
12:10:28 descent?
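Here is a minimal, self-contained version of the by-hand implementation narrated above. The data-generating step is an assumption (the video's exact data isn't shown in the transcript), but the learning rate, tolerance, iteration cap, and gradient formula follow the video:

```python
import numpy as np

rng = np.random.default_rng(216)

# Assumed data: a constant column of ones plus one feature, so beta
# has two entries. The true coefficients [2, 5] are made up here.
n = 200
x = rng.random((n, 1))
X = np.hstack([np.ones((n, 1)), x])
y = 2 + 5 * x + 0.1 * rng.normal(size=(n, 1))

alpha = 0.05        # learning rate
tol = 0.00001       # stopping tolerance
max_iter = 50_000   # maximum number of iterations

beta_star = rng.random((2, 1))   # random initial guess
for _ in range(max_iter):
    # gradient of the MSE loss: (2/n) X^T (X beta - y)
    grad = (2 / n) * X.transpose().dot(X.dot(beta_star) - y.reshape(-1, 1))
    new_beta = beta_star - alpha * grad
    # stop once the norm of the update is below the tolerance
    if np.linalg.norm(new_beta - beta_star) < tol:
        beta_star = new_beta
        break
    beta_star = new_beta

# Compare with the normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_normal = np.linalg.solve(X.T.dot(X), X.T.dot(y))
print(beta_star.flatten(), beta_normal.flatten())
```

As in the video, the two answers are slightly different but basically the same, and a tighter tolerance pushes them closer together.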
The first is known as mini-batch gradient descent. So this involves randomly splitting your training set into smaller subsets known as mini-batches, and then what you would do is calculate the gradient on those different mini-batches, cycling your way
12:10:44 through until you get all the way through, and that's known as an epoch. And then you'll just start all the way back over and work your way through the mini-batches again.
12:10:52 So mini-batch gradient descent is used to cut down on computation time.
12:10:57 There's also something called stochastic gradient descent, where instead of using a constant learning rate like we did above, you would use a random learning rate.
12:11:06 The idea here is that a lot of cost functions are not nice cost functions: they have very bumpy surfaces, which means that you could get stuck in a local minimum
12:11:16 if you have a constant learning rate, whereas if you have a stochastic or random learning rate, you may randomly work your way out of that local minimum and get closer to the global minimum. So stochastic gradient descent can
12:11:34 be used. This is actually an algorithm that's implemented in scikit-learn, with SGDRegressor for regression and SGDClassifier for classification.
12:11:47 So we'll go to the regressor documentation.
12:11:51 So you can see here, there are two things we're going to point out,
12:11:56 or I guess just one thing we're going to point out.
12:11:58 I wanna note that by default there's a penalty,
12:12:01 and penalty here means a regularization term. So by default, this is doing ridge regression instead of just classic, regular linear regression.
12:12:11 So what we're going to do is set the penalty to None to do regular regression,
12:12:14 just so we can compare it with what we had. We're also going to set the max_iter and the tolerance to be the same as what we did, to try and make it more comparable.
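A minimal sketch of mini-batch gradient descent as just described, reusing the same assumed linear-regression setup as before; the batch size and epoch count here are illustrative choices, not from the video:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed data, as before: constant column plus one feature,
# with made-up true coefficients [2, 5].
n, batch_size, n_epochs = 200, 20, 200
x = rng.random((n, 1))
X = np.hstack([np.ones((n, 1)), x])
y = 2 + 5 * x + 0.1 * rng.normal(size=(n, 1))

alpha = 0.05
beta = rng.random((2, 1))

for epoch in range(n_epochs):
    # Randomly split the training set into mini-batches...
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # ...and take a gradient step on each mini-batch in turn.
        grad = (2 / len(idx)) * Xb.T.dot(Xb.dot(beta) - yb)
        beta -= alpha * grad
    # One full cycle through all the mini-batches is an epoch;
    # then we reshuffle and work through them again.

print(beta.flatten())   # roughly [2, 5]
```

Each step only touches `batch_size` rows instead of all `n`, which is where the computational savings come from on large datasets.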
12:12:23 So from sklearn.linear_model, we're going to import SGDRegressor.
12:12:32 Then I'm gonna make my model object. So reg is equal to SGD-
12:12:39 Regressor. I'm gonna do penalty equal to None. fit_intercept...
12:12:49 I don't need that, because my data already has the constant column,
12:12:54 I have a constant column of ones, so fit_intercept equals False. My tolerance is that, and then let's go ahead and check what's the wording for max
12:13:02 iter. It's just max_iter. max_
12:13:07 iter equals max_iter, and then I'm gonna fit the model: reg.fit(X, y).
12:13:15 And then reg.coef_. Okay. And so we can see here, if we go back up,
12:13:21 not exactly the same, but comparable, basically the same. Okay?
12:13:25 So that's gradient descent: the theory, how to implement it for linear regression using NumPy,
12:13:32 and then how to implement it for regression using scikit-learn.
12:13:36 So I hope you enjoyed learning about gradient descent, and you got a better understanding of what's going on when we say gradient descent.
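Putting those scikit-learn steps together, here's a runnable sketch. The data is illustrative, built with an explicit column of ones so that fit_intercept=False matches the video's setup. One caveat: very old scikit-learn versions spell the no-penalty option penalty='none' rather than penalty=None:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Illustrative data with an explicit constant column of ones,
# mirroring the by-hand setup (made-up true coefficients [2, 5]).
n = 200
x = rng.random((n, 1))
X = np.hstack([np.ones((n, 1)), x])
y = (2 + 5 * x + 0.1 * rng.normal(size=(n, 1))).ravel()

tol = 0.00001
max_iter = 50_000

# penalty=None turns off the default ridge regularization;
# fit_intercept=False because X already contains the ones column.
reg = SGDRegressor(penalty=None, fit_intercept=False,
                   tol=tol, max_iter=max_iter)
reg.fit(X, y)
print(reg.coef_)   # roughly [2, 5]
```

As in the video, the fitted coefficients aren't exactly what the by-hand loop or the normal equations give, but they're comparable.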