Logistic Regression Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

12:17:30 Hi, everybody! Welcome back. In this video we continue to learn about classification by talking about the logistic regression algorithm. 12:17:39 Let me go ahead and share my Jupyter notebook, and we'll get started. 12:17:41 So in this notebook, we're going to learn about the logistic regression algorithm. 12:17:44 We'll show you how to interpret the output of that algorithm, 12:17:48 talk about things like classification cutoffs, and then talk about predicting probabilities instead of a hard and fast classification.

12:17:58 So in this notebook, we're going to be using logistic regression for binary classification. 12:18:04 Now, binary classification problems are those that have two classes. 12:18:11 Typically these are coded as 0 or 1; you'll also see them coded as -1 and 1. 12:18:17 Normally, the class denoted as 1 is the thing we want to identify. 12:18:21 So this could, for instance, be somebody that has a certain type of disease, or somebody that qualifies for a loan, something like that. 12:18:27 Logistic regression can be adapted to a multi-class classification problem, 12:18:34 but we'll talk about that in a different notebook.

12:18:35 So logistic regression is called logistic regression because it is a regression algorithm from the statistical point of view. 12:18:43 We are regressing something onto something else. You may see various machine learning cheat sheets 12:18:51 and those sorts of things that say this is classification; really, it is a regression algorithm that can be used for classification problems. 12:18:58 The reason it can be used for classification problems is that what we're regressing is a probability; we're regressing a probability onto some features. 12:19:06 And when you have the probability of something being a 1, you can turn that into a classification algorithm. 12:19:11 So just be aware: it is a statistical regression, but it is used for classification purposes from time to time.

12:19:19 So we're going to look at this data set called "random binary". 12:19:25 It has a single feature, and it has whether each observation is class 0 or class 1. While we're looking at this plot, note that although the vertical axis currently says "class", this axis label could just as easily say "probability the observation 12:19:42 is 1". Why is that? Well, because we know the labels for this data: the probability that all of these observations up here at the top are 1 is equal to 1, 12:19:52 and the probability that all of these observations down here at the bottom are 1 is 0. Why? 12:19:58 Because we know the labels. The labels down here are 0, 12:20:01 so they have no chance of being 1. The labels up here are 1, so they are certain to be 1; 12:20:07 that's the only thing they can possibly be. So the idea here, then, is that 12:20:13 instead of doing a regular linear regression, we're going to do a regression where we think of regressing the probability that an observation is 1 given that you observe this feature. 12:20:27 So we would write this as the conditional probability that little y is equal to 1 12:20:33 given the features you observe, and we'll denote this more compactly as little p of capital X: p(X) = P(y = 1 | X).
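The "random binary" data set itself isn't included with this transcript, so here is a minimal synthetic stand-in just to make the plot described above concrete. The generating process, seed, and variable names here are invented for illustration and are not the lecture's actual data.

```python
# Synthetic stand-in for the "random binary" data described above:
# one feature, and 0/1 labels that are more likely to be 1 for larger feature values.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(216)
X = rng.uniform(-3, 3, size=200)          # single feature
p = 1 / (1 + np.exp(-2 * X))              # true probability of being class 1
y = rng.binomial(1, p)                    # 0/1 labels drawn from that probability

plt.scatter(X, y, alpha=0.5)
plt.xlabel("feature")
plt.ylabel("class (could also read: probability the observation is 1)")
plt.show()
```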
And so what we want, instead of a straight-line function, is a function that looks something like this: 12:20:44 it sort of stays as close to 0 as it can until it has to slowly climb up to 1. 12:20:50 The way to do this is to take a functional form that is the sigmoid curve. 12:20:55 This is what that curve looks like, 1/(1 + e^(-x)), and you can see how this would quite nicely fit on top of our data. 12:21:05 So if we made this our p(X), and put it on top of our data, adjusted appropriately, that would be nice. 12:21:13 And so for logistic regression, the function that we're trying to fit is p(X) = 1/(1 + e^(-Xβ)), 12:21:21 where X is our matrix of features and β is a vector of coefficients. 12:21:28 This vector may include a constant β₀, in which case X would have a column of ones. 12:21:34 The model is fit with maximum likelihood estimation, which we're not going to talk about; 12:21:41 we're just going to dive straight into how to fit it with sklearn.

12:21:43 So in sklearn you fit this model with LogisticRegression, found in the linear_model subpackage. I wanted to take a quick second to look at the documentation here: you'll see the first argument is penalty='l2'. 12:21:59 What this means is that by default, the sklearn logistic regression model is using L2 12:22:08 regularization, so it's doing something like ridge regression as opposed to plain logistic regression. 12:22:14 This isn't a bad thing; it's something you may want to use, and it may improve your algorithm, but what we're going to do is make sure we fit it without the penalty, 12:22:26 so we're looking at classical logistic regression.

12:22:28 So what we're going to do is from sklearn.linear_model import LogisticRegression. 12:22:39 Now we're going to make our model object, LogisticRegression(penalty=None), and then we're going to fit the model: log_reg.fit(X_train.reshape(-1, 1), y_train), 12:22:52 where X_train is what I called the features. 12:23:17 Okay, I think I need to change this; 12:23:21 I was looking at an older picture, and for your version of the notebook I'll update it to the current one. 12:23:26 This should just be penalty equal to the Python 12:23:30 None object. Then you can use predict just like before: log_reg.predict(X_train.reshape(-1, 1)). 12:23:37 So here are all our predictions. 12:23:44 But remember, I said what we're regressing here is a probability, 12:23:47 so how am I getting these zeros and ones? The way that's happening is it's applying a cutoff. 12:23:54 If we want the thing the regression is actually estimating, the probabilities, we would call predict_proba: log_reg.predict_proba(X_train.reshape(-1, 1)). 12:24:05 The 0 column here is the probability that each observation is a 0, and the value in the 1 column gives the probability that the observation is a 1. 12:24:20 Okay, and so here's what that curve looks like: 12:24:24 we're plotting the predicted probability as a function of the feature, along with the training data, 12:24:31 and you can see that it fits the training data quite well.
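Here is a minimal sketch of those calls, assuming X_train (the single feature) and y_train (the 0/1 labels) have already been created; the train/test split itself isn't shown in the transcript. Note that on older scikit-learn versions the no-penalty option was spelled as the string 'none' rather than the Python None object.

```python
# Minimal sketch of the fitting steps described above.
# Assumes X_train (1-D array of the single feature) and y_train (0/1 labels) already exist.
from sklearn.linear_model import LogisticRegression

# penalty=None turns off the default L2 regularization, giving classical logistic regression
log_reg = LogisticRegression(penalty=None)
log_reg.fit(X_train.reshape(-1, 1), y_train)

# Hard 0/1 predictions (these use a 0.5 probability cutoff behind the scenes)
y_pred = log_reg.predict(X_train.reshape(-1, 1))

# Predicted probabilities: column 0 is P(y = 0 | X), column 1 is P(y = 1 | X)
y_prob = log_reg.predict_proba(X_train.reshape(-1, 1))
```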
12:24:35 So how do I go from these probabilities to a classification? 12:24:39 The default that you get from sklearn is using a cutoff of 0.5: 12:24:44 if the probability that it's a 1 is greater than 0.5, it's classified as a 1; 12:24:49 if it's less than 0.5, it's classified as a 0. 12:24:53 And we can kind of check that here. We've got 99%, 12:24:58 so that's a 1; 99%, so that's a 1; 12:25:02 0.006, and so that's a 0. 12:25:12 So that's how that works.

So how, in general, can we control this? 12:25:16 Well, we can set this cutoff ourselves. The default is 0.5, but a different cutoff might work better, who knows? 12:25:22 So if we set the cutoff to be, for instance, 0.7, we can get the predictions: we take log_reg.predict_proba(X_train.reshape(-1, 1)), 12:25:43 keep the 1 column, and then, if a value is greater than or equal to 0.7, we classify it as a 1. 12:25:53 Oh, sorry, we're doing that in the next step. 12:25:59 Okay, and then in the next step, we'll do 12:26:03 1 * (y_prob >= 0.7). 12:26:10 What is wrong with my syntax? 12:26:23 Oh, okay, I'm missing a parenthesis; that's why. 12:26:28 Okay. So if we had a cutoff of 0.7, where everything whose probability of being 1 is greater than or equal to 0.7 gets classified as a 1, we would get a training accuracy of 0.9325. What we can do now is show how this accuracy changes on the training set as a function of the 12:26:46 cutoff. You can see we quickly get to a pretty high accuracy even with a very small probability cutoff, and we can see how the choice of cutoff 12:26:58 impacts the accuracy you get on the training set. 12:27:02 It would also impact the test set, but what we're showing here is the training set. 12:27:08 So again, in practice, if you're trying to tune a model, you would use either a validation set or cross-validation; 12:27:17 this was just to demonstrate the cutoff in a quick and easy way.
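A minimal sketch of that cutoff logic, again assuming the log_reg, X_train, and y_train from the fit above; the cutoff grid at the end is just one way to set up the accuracy-versus-cutoff comparison.

```python
# Sketch of classifying with a custom probability cutoff instead of the default 0.5.
# Assumes log_reg, X_train, and y_train from the fit above.
import numpy as np
from sklearn.metrics import accuracy_score

# Probability that each training observation is class 1
y_prob = log_reg.predict_proba(X_train.reshape(-1, 1))[:, 1]

# Classify as 1 when that probability is at least 0.7 (multiplying by 1 turns booleans into 0/1)
y_pred_07 = 1 * (y_prob >= 0.7)
print(accuracy_score(y_train, y_pred_07))

# Training accuracy as a function of the cutoff. In practice you'd tune a cutoff with a
# validation set or cross-validation rather than the training set.
cutoffs = np.linspace(0, 1, 101)
accuracies = [accuracy_score(y_train, 1 * (y_prob >= c)) for c in cutoffs]
```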
12:27:22 So I also said we were going to learn how to interpret logistic regression. 12:27:26 Remember that the statistical model we're using is that the probability that y equals 1, 12:27:32 given the features, is 1/(1 + e^(-Xβ)). 12:27:35 With a little bit of algebraic manipulation, you can show that 12:27:40 this is the same as the log of p(X) divided by (1 - p(X)) being equal to Xβ: log(p(X)/(1 - p(X))) = Xβ. 12:27:50 This quantity p(X)/(1 - p(X)) is known as the odds of the event 12:27:54 y = 1, so the statistical model for logistic regression is a linear model for the log 12:28:01 odds of being class 1. This will allow us to interpret the coefficients of the model. 12:28:06 If we look at the model we just fit, the 12:28:11 log odds are β₀ + β₁x, 12:28:15 so the odds given x, if we exponentiate both sides, are some constant times e^(β₁x), 12:28:19 where the constant is something we're not going to care about. 12:28:24 So if we increase x by one unit, we can see how our odds change by just working this out and plugging things in: the odds given x = d + 1 divided by the odds given x = d works out to 12:28:41 multiplying our odds by a factor of e^(β₁). So this is what allows us to see the impact of the coefficients: 12:28:49 you can say that for every one-unit increase in the feature x, you're multiplying your odds by a factor of e to the β, whatever that coefficient is. 12:28:57 So we can show you this: we can get these coefficients just like with linear regression, 12:29:02 with .coef_ 12:29:05 and then, apparently, indexing [0][0]. And so then we can interpret it: 12:29:12 a 0.1-unit increase in our feature multiplies the odds of being classified as a 1 by 10.1, which is pretty big.

So here are some assumptions behind the algorithm. 12:29:23 While we were explaining the concept of logistic regression, we didn't mention any assumptions; that's because the assumptions are a little lighter than with linear regression. 12:29:32 The observations have to be independent. When you have multiple predictors, it's good for them not to be correlated with one another. 12:29:40 We're assuming that the log odds are linearly dependent on the predictors; 12:29:43 that's the part we just talked about. And then you usually want to have a larger data set, not too small of a data set, 12:29:50 if you're using logistic regression. We didn't check these here because the data was randomly generated, 12:29:57 but in the real world you may want to check these sorts of things. 12:30:00 Again, as with predictive modeling in general, as long as the cross-validation and those sorts of things look good, 12:30:07 we tend not to care as much about these assumptions. 12:30:10 Okay, so that's it. I hope you enjoyed learning about logistic regression. 12:30:14 I enjoyed teaching you about it, and I hope to see you in the next video.
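For reference, here is a minimal sketch of the coefficient interpretation discussed in this lecture, again assuming the fitted log_reg from earlier; the exact numbers will depend on the data.

```python
# Sketch of reading off and interpreting the fitted coefficient.
# Assumes the fitted log_reg from earlier.
import numpy as np

beta_1 = log_reg.coef_[0][0]           # slope on the single feature
beta_0 = log_reg.intercept_[0]         # intercept

# Multiplicative change in the odds of being class 1 for a one-unit increase in the feature
odds_factor_per_unit = np.exp(beta_1)

# And for a 0.1-unit increase, as discussed above
odds_factor_per_tenth = np.exp(0.1 * beta_1)
print(odds_factor_per_unit, odds_factor_per_tenth)
```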