Okay, so I'm gonna start recording now.

Hi, everybody! Welcome back! This is lecture number 8 of the May 2023 data science boot camp. Today we're gonna continue with classification. I guess we started with it in yesterday's lecture, but we're going to follow up with that today. So let me go ahead and get my chat situated.

Yesterday we talked about the adjustments you have to make to data splits for classification, namely that you have to do a stratified split, stratifying on the outcome that you're trying to predict, the y. Today we're going to start diving into algorithms and then performance metrics for classification problems. We're going to start with notebook number 2 and try to work our way to notebook number 6. I don't know how far we'll get, but it's conceivable that we would be able to at least start notebook number 6 today.

So the first algorithm that we're going to learn for classification is called k nearest neighbors. We're going to introduce what this is and give you an idea of how it looks, and we'll talk about our very first classification performance metric; we'll learn more soon. And then we're going to learn about the Iris data set, which is a very popular data set used for classification algorithms, both in teaching and as a benchmark.

So let's dive right into what the algorithm is. For k nearest neighbors — that's what the K and the NN stand for — the way that you make predictions from the training set is relatively straightforward. You first select a number K, so this is another hyperparameter that you can choose. Then you input a point that you would like to predict on, x star. I'm not sure if I said it, but here we're again in the situation where we have a matrix of features X and some outputs that we'd like to predict, y. Here the outputs are going to be classes: in the specific examples we'll look at below in the pictures it's binary classification, and in the Iris data set it's multi-class classification. The features are just a matrix and can be categorical or continuous. And then for us, x star is a particular observation. So imagine that we have a new observation that we'd like to predict on — you input your x star.

Yeah, Matthew? Sorry to call on you — would you mind zooming in a little bit? Yeah, thank you. Okay.

So you input the point you're trying to predict, and then what the algorithm is going to do is try to find the K closest points to what you've input within the training set. So inside the training set, it's going to calculate the distance from this input to all the points in the training set.
Then it's going to find the K that are closest — remember, K is a hyperparameter that you choose ahead of time. It does this just by calculating the distances, which can take a long time if you have a very large training set. The categories of each of the nearest neighbors are then tabulated, meaning they're just counted up: you want to see how many are of class 0, how many of class 1, how many of class 2, etc. We can think of the number of neighbors of each class as votes for that class, and the category with the most votes is what is predicted for x star. Anytime there's a tie between 2 or more categories, the prediction is chosen randomly from the tied classes. So if you had a tie between zeros and ones, essentially you're just flipping a coin to decide whether it's 0 or 1.

So this is a lot of words; I think it's easier to understand what's going on with pictures. Imagine we're in a setting where we have 2 features that are both continuous variables, and we are setting K equal to 4. The black X is going to represent where we're trying to predict, the red circles are one class, and the green triangles are another class. So if your black X was here, its 4 closest neighbors are these red circles. The way that I drew this was with Google Slides, so it's not exact — not a straight line from the center of the X to the center of the red circle — so just imagine it is, for the purposes of understanding what's going on. These 4 points are the closest, and because all 4 of the neighbors are red circles, the algorithm would predict that the X would also have to be a red circle; that's what the algorithm would guess.

And again, remember, these are the training points. So let's now say that the thing we're trying to predict is placed here in the data space. Now it would count up: 3 of my 4 closest neighbors are green triangles, one of my 4 closest neighbors is a red circle, so 3 out of 4 is a majority, and the green triangle would be my prediction.

And then the final example case we'll look at is this situation where, of my 4 neighbors, I'm evenly split between red circles and green triangles. So this is a tie, and in this situation my algorithm would just randomly choose between a red circle and a green triangle. With unweighted voting everybody gets an equal vote, so there's no reason to prefer the red circle or the green triangle — it will just randomly choose.

So in this example we implicitly used what's known as Euclidean distance.
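For reference, here is a minimal sketch of that prediction procedure with unweighted votes and Euclidean distance, assuming NumPy arrays; the function and variable names are just illustrative, and note that np.argmax breaks ties by picking the lowest class label rather than choosing at random as described above.

```python
import numpy as np

def knn_predict(X_train, y_train, x_star, k=4):
    # Euclidean distance from x_star to every training point
    dists = np.sqrt(((X_train - x_star) ** 2).sum(axis=1))

    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]

    # tabulate the neighbors' classes (one unweighted vote each)
    votes = np.bincount(y_train[nearest])

    # the class with the most votes is the prediction
    return votes.argmax()
```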
If that sounds weird, think of the distance formula you learned in high school or junior high — I forget when you first learn it — where it's basically just the square root of the sum of the squares of the differences. So just your typical distance metric. But you could use any distance metric you like; you'll get slightly different results each time. That's another thing you can choose about the algorithm and see if it gives you better performance on your data set.

I also mentioned that in these examples we're using equally weighted votes — the way you'd think of casting a vote in, not every American state anymore, but a lot of American states, where everybody's vote counts the same. You could also weight the votes. A very standard way of weighting the votes is by the inverse of the distance, so points that are closer to where you're trying to predict have a bigger weight in the vote. So, for instance, here — it's kind of hard to tell from this, they do look similar — but if, for instance, the 2 green triangles were closer than the 2 red circles, it's possible that it wouldn't be a tie with a weighted vote.

Okay, so before we show you how to do this in sklearn, I'm gonna pause to see questions. Okay, so Pedro's question was: only numbers are counted, not distances? And Pedro said that I answered that earlier, so I just wanted to make it clear for those of you watching later that don't have access to the chat. Okay, are there any other questions about the theory, like the setup of the algorithm?

Okay. And then maybe I'll make one more note. Remember, we talked a lot about the supervised learning framework, where you assume y is equal to f(X) plus epsilon. That framework is still working in the background here. The difference with this approach, compared to other models we've learned so far, is that we don't have an explicit functional form that we're trying to estimate. We're taking what's known as a non-parametric approach — we're just not going to have a function we're trying to estimate — but there still is, in the background, this assumption that y is some function of X plus error.

Okay. So to see this in action with sklearn, we are going to use the Iris data set, which we actually talked about during our data collection lecture. If you want to look at it, this is it on the UC Irvine Machine Learning Repository. So an iris is a type of flower, and in this data set there are 3 types of irises: a setosa, a versicolor, and a virginica. Each observation has these 4 measurements — sepal length, sepal width, petal length, and petal width — and we're going to use these to try and predict the class of the iris.
So let's get back. We don't have to go to the machine learning archive and download it or anything — the data set is inside sklearn. So you're gonna say from sklearn.datasets import load_iris, then we're going to run load_iris, and this will just load it and we can take a look at what it looks like after. So, for instance, iris here is, I believe, a dictionary-like object. Here we have the data, which serves as the features, and then after that we have the target, which is 0, 1, and 2. The zeros are the setosas, the ones are the versicolors, and the twos are the virginicas, and then we have additional information about the data set.

So if we want to look at it, I turned this into a data frame just to make it easier to look at. Here are the first 5 rows, and I guess here's a sample — maybe let's stick with the sample. So here's a random sample: we've got sepal length, sepal width, petal length, petal width, and then iris class, which is an integer. We can always go back and remind ourselves — I think it's setosa, versicolor, and then virginica.

Okay, so we're gonna make our train test split, just to get some practice with doing stratified train test splits. We still run train_test_split in exactly the same way as before, but now we add this extra argument of stratify, where the thing I'm stratifying on is the class of my iris. This will make sure that the training set and the test set have relatively equal splits between zeros, ones, and twos. Okay, and then here are the first 5 observations of my training set.

So to get a sense of what this looks like, I've gone and made a plot. Here we're plotting sepal width against sepal length, and we've got our blue circles, which are the zeros, our orange or yellow triangles, which are the ones, and then our green X's, which are the twos. Okay? So this is a subset of the data space — sepal length against sepal width — that we're gonna be looking at to try and train our k nearest neighbors algorithm. So, are there any questions about the data?

Alright, so in sklearn you can build a k nearest neighbors classifier model with KNeighborsClassifier, and the documentation can be found here. You might be wondering: why is it KNeighborsClassifier? Well, just like you have the classifier, you can do sort of the same process and make regression models, where you'll take the average value of the observations that are your K closest neighbors. So that's the idea there — you have a classifier version, and then you also have a regression version.
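Before moving on — the loading and splitting steps just described look roughly like this. This is a sketch assuming pandas and sklearn; the column name iris_class follows the lecture, while the test size and random state are placeholders.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['iris_class'] = iris.target   # 0 = setosa, 1 = versicolor, 2 = virginica

# stratify on the class so train and test keep roughly equal class proportions
iris_train, iris_test = train_test_split(iris_df,
                                         test_size=0.2,
                                         shuffle=True,
                                         random_state=123,
                                         stratify=iris_df['iris_class'])
```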
So — I'm pretty sure, let's just say almost all — almost all of our classification algorithms have a regression counterpart. After we're done with the classification stuff, there is a notebook in the regression folder that goes through all the similarities and shows you how to do them with regression, so feel free to check that out if you'd like to see other regression models besides linear regression.

Okay. So this is stored in the neighbors module of sklearn — you can tell that by looking at the link, sklearn.neighbors. So from sklearn.neighbors we're going to import KNeighborsClassifier.

Then — we are so familiar with this, but maybe we need a refresher because we've been doing time series for a couple of days — remember from linear regression, the pattern is you make your model object. So KNeighborsClassifier, and then for the number of neighbors you input a positive integer; for this example we're going to choose K equals 5. Then we will fit the model on the training set, so iris_train. And even though I pictured just two features, we can go ahead and just put all 4 in. So I think it's — is it pedal or petal? P-E-T-A — okay, petal length, petal width. And then we put our y, iris_train.target — right, iris_class, not target — the iris_class column.

And then we'll do predict, so .predict, and we're just gonna do this on the training set. Okay. So here we can see what's going on: k nearest neighbors isn't actually fitting anything in the sense that we're not estimating any parameters. All we're doing is storing the training set as a part of the model object, and then when you call predict, that's actually where all the work gets done. When you call predict, you have to calculate all of those distances and then make the prediction based on the voting procedure.

So Zack's asking: what helps decide the recommended number of neighbors? Naively, I would have expected greater than 10; here I just chose 5. So it's a hyperparameter, and just like with every other hyperparameter, you do some sort of cross-validation or use a validation set, and you would set up a grid of values — for K it would just be a list. You could go from K equals 1 all the way up to K equals 50 or more, depending on the size of your data set, and then see which one gives you the best cross-validation metric.

So maybe this is a good lead-in: one validation metric that you'll use for classification problems is called accuracy.
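A sketch of the fit-and-predict pattern being typed out above, assuming the iris_train DataFrame from the split earlier; the feature column names follow sklearn's load_iris feature_names.

```python
from sklearn.neighbors import KNeighborsClassifier

features = ['sepal length (cm)', 'sepal width (cm)',
            'petal length (cm)', 'petal width (cm)']

knn = KNeighborsClassifier(n_neighbors=5)   # K = 5

# "fitting" just stores the training data inside the model object
knn.fit(iris_train[features], iris_train['iris_class'])

# the distance calculations and voting all happen at prediction time
y_train_pred = knn.predict(iris_train[features])
```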
So, accuracy measures the proportion of all the predictions that you made that are correct, and we could define this by hand. To do it by hand, we're going to define a function accuracy that takes in the true values along with your predicted values, and then to get the accuracy you want to see how many of those values you predicted correctly, divided by the total number of observations — the total number of predictions you made. And so then we could call accuracy on the true values — again, we're gonna use the training set, so iris_train.iris_class.values — and then the prediction. And I guess I probably don't need .values, but I'll just leave it there.

And so on the training set we got 0.983 repeating, so like 98 and a third percent accuracy. It's gonna look like this, but as a percentage — so that's the accuracy.

So that's one way. I also want to point out that you don't have to define your own accuracy, and we'll learn this in the next notebook, but sklearn, just like it has mean squared error, has a number of classification metrics that you can just use. The one for accuracy is from sklearn.metrics, and it's called accuracy_score. So we could redo this whole thing and just say accuracy_score — it works the same way, where you put in the true values followed by the predicted values. Oh no, what did I do — not score, accuracy_score — there we go. And I have a link for the documentation in the code here, so we could go here, and there you go: you put in the true followed by the predicted. Alright, and then you can learn more about the other arguments on your own time.

Okay. So before I show you this last bit, are there any questions about anything so far? I know k nearest neighbors is pretty straightforward for an algorithm, but I just want to make sure there's room for anybody who has questions to ask questions.

I had a question. So, if I understand the algorithm correctly, you basically pick a random point in your data set and then you look at things that are closest to it, and then if it's, for example, surrounded by red points, you would label that point as red, and if it's surrounded by green points, you would label that point as green — that's kind of the general idea?
Yeah — so, it's not a randomly selected point. In the visual description I had up here, these red and green points are the points in your training set. If we had only used these 2 features, it would look like this for us — but we used all 4 features. And the points that we're trying to predict on — they look random here because I'm just showing you an example of the different outcomes — but these points, like the black X's in our example down below, would be, for the Iris data, the observations we're trying to predict on. So you're going to input — you have in your head, or in your computer, a list of values that you'd like to get predictions for, and the black X's are those values that you want predicted.

Right, so I guess the point I was trying to get at is: the algorithm is assuming that you're surrounded by similar things — like, if you're red, you'd be surrounded by red things, if you're green, by green things?

Yeah. So the algorithm, when it makes a prediction, doesn't know what your label is — it's trying to guess that. And it does that by looking at your K closest neighbors in the training set and then counting up what class shows up the most in those neighbors.

Right. So would it work on data that's not clustered together?

Yeah, so if — let's say, for these 2 features — the red points and the green points were overlapping with one another, it wouldn't work very well, exactly, yeah. But then, you know, you'd have to do some sort of pre-processing if that situation happened. I don't think there are any classification algorithms that would work well on that kind of data, if the 2 classes are indistinguishable from each other in the data set.

Hmm, okay. Thanks.

Yeah.

So Kirtha is asking: seems like there would be a normal Gaussian-type distribution for K — it would have to have an upper limit beyond which accuracy would drop; is that true? So, once you get to — if K is just the cardinality of the training set, it's just going to be predicting the majority class every time, which is a baseline model. I don't know that, if you did cross-validation for the accuracy, it would follow some sort of Gaussian distribution; I don't think there's any theorem or theory that says it has to follow any particular distribution for every problem.

Any other questions?

Okay, so before we move on to the next notebook, I wanna talk about this feature called predict_proba. So this algorithm — when you call .predict, it just makes a prediction of the class.
This is not always advantageous. Sometimes, instead of making a hard prediction, you want to get a probability — a predicted probability of it being that class. So for almost all of the algorithms that we'll learn, you can take the model variable — for us it was knn — and call .predict_proba (proba here stands for probability), and then you input your data set. What gets returned is an array where each row corresponds to an observation, and each column corresponds to the algorithm's estimated probability that that observation is a member of that class.

So, for instance, in this zeroth row, the algorithm is predicting that this observation has a 0% probability of being class 0, a 1 — or 100% — probability of being class 1, and a 0% probability of being class 2. And you can scroll through the rest and see that a lot of these are being predicted as 100% one of the classes.

The way that this works for k nearest neighbors is that this probability is just the fraction of the neighbors that are of a class. So, for instance, what this is telling us is that all 5 of this observation's neighbors are of class 2. Here, 4 of this observation's neighbors are of class 2, and one of them is of class 1, and so forth — that's what's going on here. And here's another example that's slightly different: here 2 of the 5 are of class 2, and 3 of the 5 are of class 1. If you were doing weighted voting, it would be the fraction of the weights instead of just the fraction of the neighbors.

So sometimes you want to have probabilities instead of a hard cutoff, and we'll see some examples as to why we want that in the coming notebooks. Alright, any questions about the probability stuff?

Okay. So the next notebook we're gonna look at is notebook number 3 in classification — we looked at notebook number 2, now we're on notebook number 3. This notebook's called the confusion matrix. In the last notebook we talked about accuracy, but as you're gonna see, that's not always the one you wanna go with in terms of performance metrics. So in this notebook we're gonna introduce a number of different metrics, show you something called the confusion matrix — and really it's maybe more about how we can get confused with all these different metrics than about the algorithms getting confused — and then we'll give you a link to a useful summary table to help you keep all this straight.

So in the k nearest neighbors notebook we defined accuracy, which is the number of correct predictions you make divided by the total number of predictions you make. Sometimes this can be a misleading metric.
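For reference, the accuracy computation and the predict_proba call from notebook 2 look roughly like this, continuing with the knn model, features list, and iris_train assumed above (a sketch, not necessarily the notebook's exact code).

```python
import numpy as np
from sklearn.metrics import accuracy_score

def accuracy(y_true, y_pred):
    # fraction of predictions that match the true labels
    return np.sum(y_true == y_pred) / len(y_true)

accuracy(iris_train['iris_class'].values, y_train_pred)        # by hand
accuracy_score(iris_train['iris_class'].values, y_train_pred)  # sklearn: true values first, then predicted

# one row per observation, one column per class; for unweighted KNN each entry
# is just the fraction of the K neighbors belonging to that class
knn.predict_proba(iris_train[features])
```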
For instance — going back to why accuracy can be misleading — if you have a data set where the vast majority of your observations are of one class and very few of the observations are of the other class, you can misleadingly get what seems to be a really good algorithm with very silly models. Let's say, for instance, I had a data set where 10% of my data was of class 1 and 90% of my data was of class 0, and class 1 turned out to be an infectious disease — or any kind of disease that is deadly but is treatable if you detect it in time. We could get a very silly algorithm that has 90% accuracy if we just say: okay, no matter what you show me, classify it as a 0. Based on this class distribution, we know that this algorithm should be about 90% accurate. And if you were to tell somebody that — like, if somebody got a 90% on a test, they'd think that's really good. But here it's misleading, because we're 90% accurate but we haven't identified any of the ones. So if this was the sort of thing where we wanted to use it to try and detect this deadly disease that we could treat and cure if we knew about it, then this is a terrible model.

And so that's why we want to start developing some additional metrics for classification problems, to give us a sense of the ways in which our models are correct. With regression and time series we relied heavily on the mean squared error; here we have to be a little more careful about what metrics we use, because different metrics tell us different ways that our models are correct or incorrect.

So we're gonna work in the world of binary classification, where you have 2 classes that are, depending on the algorithm you're working with, 0 and 1 or negative 1 and 1. I say that because algorithms are developed by different academic fields: in fields like statistics and probability your 2 classes tend to be 0 and 1, but in fields like computer science the 2 classes tend to be negative 1 and 1. For the confusion matrix setup the labels don't really matter, but we're gonna keep it as 0 and 1.

So the confusion matrix — you set it up in the following way: the rows represent the actual classes of your observations, and the columns represent what your algorithm predicts. So you can go through the different entries. For things that are actually a 0 that your algorithm predicts as a 0, those are called the true negatives, or TNs. For things that are actually zeros that your algorithm predicts to be ones, those are false positives, because you're falsely predicting a positive case.
For things that are actually ones that are predicted to be zeros, those are called false negatives, because you're falsely predicting that it's not a one. And then finally, for things that are actually ones that you predict are ones, those are called true positives, because you're correctly predicting that they are, in fact, a positive case — it's true that your prediction of a positive is correct.

So what is actually contained in these entries? I just went over what the names mean, but when you do it, what's actually shown in there? These are the counts. Basically, for all the zeros, every observation that you correctly predict as a 0 gets counted in here. So let's say you had 30 total zeros and you predicted 20 of them correctly as 0 — this would have a 20 in it, and therefore this would have a 10 in it. So the entries of the confusion matrix count up the number of each type of classification that you just made.

Oh, and also, if you're working in a public health or stats-y field, these are sometimes referred to as contingency tables, as opposed to confusion matrices — so if you're familiar with contingency tables, it's the same concept. If you're familiar with frequentist statistics, you can think of false negatives as type 2 errors and false positives as type 1 errors; if you're not familiar with frequentist statistics, then you don't have to worry about that. It's just sort of trying to relate this to all the different backgrounds everybody has.

Okay. But before we dive into how to do this with sklearn, and then how to derive metrics from this, are there any questions about just the definition of the confusion matrix?

Okay. So, what are some metrics derived from the confusion matrix? The confusion matrix is great — it can give you a sense of how your algorithm is going wrong — but typically people will want to see metrics, a single number like accuracy, as opposed to trying to digest the entire matrix at the same time. So we're gonna show 6 different metrics that tend to be popular when looking at the performance of an algorithm. It gets really confusing, because a lot of these are more or less the same metric, or slightly different, and then they have different names, and so on — on top of the formula you have to remember the name. So to me that's where the confusion part comes in. I think it's called the confusion matrix because it's trying to measure what your algorithm is confusing, but for me it's just that it's always a lot of vocab to remember.
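To make the "counts" point concrete, here is a small sketch of how the four entries could be tallied by hand for a binary 0/1 problem; the toy arrays and names are purely illustrative, not the notebook's code.

```python
import numpy as np

# toy actual and predicted labels, just for illustration
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives  -> 3
fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives (type 1 errors) -> 1
fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives (type 2 errors) -> 1
tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives  -> 3
```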
Okay, so the first 2 that we're gonna talk about are called precision and recall. Using the entries of the confusion matrix, precision is the true positives divided by the true positives plus the false positives. You want to think of this as: out of all the things that you predicted to be class 1, what fraction were actually class 1? So we're adding along the class-1 column of the confusion matrix and getting the fraction that are correctly predicted. You can think of this as: how much should I trust my algorithm when it says something is class 1?

Another metric that gets looked at at the same time as precision is called the recall. Here the numerator is the same, but now the denominator is summing along the bottom row. So here you're saying: out of all of my actual positives, what fraction of them did I correctly predict? You can think of this as the probability the algorithm correctly detects a class 1 data point — sort of a conditional thing: given that the observation is actually a 1, what is the probability I predicted it to be a 1?

So we're gonna show you how you can calculate this using sklearn. We're going to use the Iris data set, but we're going to make a slight tweak to it, because remember, the Iris data set has 3 possible classes. We're going to turn this into a binary classification problem by saying I only want to predict virginicas. So instead of using the fact that I have 3 different classes, I'm gonna reduce everything to a virginica-or-not classifier: I'm gonna start off by labeling everything as a 0, and then locate the observations that are virginicas and replace that column with a 1.

Okay, so just to make it very clear, here we can see — no, I didn't include the target column; maybe it makes better sense to — well, there should be a target column. Yeah, I don't know why that is. Oh, it's because I said iris — okay, so if we do iris.target. So here's iris.target, and then we can compare it to y. And you can see that y is 0 everywhere the target is a 0 or a 1, because these are not the virginicas, and the y that we're interested in is a 1 everywhere the target is a 2, because virginica was coded as a 2. So I hope I didn't just make it more confusing than it needed to be.
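The relabeling step just described could look something like this — a sketch assuming the iris object from load_iris earlier; the variable name y is illustrative.

```python
import numpy as np

# start off with everything labeled 0, then mark the virginicas (target == 2) as class 1
y = np.zeros(len(iris.target), dtype=int)
y[iris.target == 2] = 1
```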
The basic idea is that these are binary classification metrics, and I just wanted to take a data set we were familiar with and turn it into a binary classification problem. It could have been either of the other 2 classes; I just chose virginica.

Okay. So now I make a train test split, and then — we saw this in the last notebook — I just fit and get the predictions on the training set for a k nearest neighbors classifier with K equals 5. The value of K doesn't really matter here, because I'm just trying to demonstrate how to calculate performance metrics; if I wanted to build a model, I would do cross-validation or something.

Okay. So the first thing we're going to do is show you a quick way to calculate the confusion matrix using sklearn. From sklearn.metrics we're going to import confusion_matrix, and then, just like with MSE and with accuracy_score, you call confusion_matrix, and the first thing you input is the actual values, y_train, followed by the predicted values, which I've stored in a variable y_train_pred. So this tells us that we have 77 true negatives, 38 true positives, 3 false positives, and 2 false negatives.

So we could then calculate the recall and the precision using the formulas. Remember, the recall is true positives divided by false negatives plus true positives — and I want a comment there — and the precision is true positives divided by false positives plus true positives. Oh, and I actually think I wanna move this down here — here we go. Okay, so we have a 95% training recall and a 92.68% training precision.

In a vacuum these are sort of meaningless to us in terms of knowing whether or not this is a good model; they become useful when we compare them to other models. Typically you want to have high precision and high recall, and you want to choose a metric and then choose the model that has the best value of the metric you're looking for.

Alternatively, instead of calculating them by hand like I did here, sklearn has functions called precision_score and recall_score that calculate the precision and recall for you. So you would do from sklearn.metrics import precision_score, recall_score, and then you just use the functions like we've seen before: precision_score, and then we want y_train, y_train_pred — but that actually should go down here, because this is where I wanted the precision.
And then this one should be the recall. Okay, and we can see that the formula we did by hand gives the same thing as the one from sklearn.

Okay, are there any questions about precision and recall before we move on to the next metrics?

Okay. So the next set of metrics are called the rates — the various rates. We have 4 of them, and it's basically all 4 entries of the confusion matrix, but as a rate: we have the true positive rate, false positive rate, true negative rate, and false negative rate. Basically, what you're getting is a series of conditional probability estimates.

I think here I have a slight typo in the table: where I say "true positive" here, I'm pretty sure I mean "actual" instead of "true," so let me change that, because it is slightly misleading. Let's say "actual" here. Okay.

So: given that an observation is actually positive, what is the probability that we correctly predict it as positive? That's the true positive rate — and note, this is the exact same thing as recall, same formula. Given that our observation is actually positive, what is the probability that we incorrectly predict it as a negative? That's the false negative rate. Then we have the other 2: given that an observation is an actual negative, what is the probability that we correctly predict it as a negative? That's the true negative rate. And given that an observation is actually negative, what is the probability that we incorrectly predict it as a positive? That's known as the false positive rate.

The formulas for these are given below: you take the numerator, and then you divide by the total of the row that numerator is found in. So true positives divided by all actual positives, false negatives divided by all actual positives, true negatives divided by all actual negatives, and so forth. Other than the true positive rate, which is the same as recall, these are ones you have to calculate by hand using the confusion_matrix function. So here I've calculated all of the rates for our training set.

Okay. And then the last 2 that we're gonna look at — just to introduce them to you — are sensitivity and specificity. These have a long history of use in the field of public health when it comes to understanding the performance of various screening and diagnostic tests. The sensitivity of a classifier is the probability that it correctly identifies a positive observation — so once again, this is the exact same thing as the true positive rate and recall. And then the other is specificity.
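Recapping the metric code described so far — a sketch that assumes y_train and y_train_pred from the binary virginica problem above; note that sklearn's confusion_matrix returns the entries in [[TN, FP], [FN, TP]] order for 0/1 labels.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

tn, fp, fn, tp = confusion_matrix(y_train, y_train_pred).ravel()

precision = tp / (tp + fp)   # of everything predicted positive, the fraction that really is positive
recall    = tp / (tp + fn)   # of the actual positives, the fraction we caught

# the sklearn equivalents take the true values first, then the predictions
precision_score(y_train, y_train_pred)
recall_score(y_train, y_train_pred)

# the four rates, computed from the same entries
tpr = tp / (tp + fn)   # true positive rate = recall
fnr = fn / (tp + fn)   # false negative rate
tnr = tn / (tn + fp)   # true negative rate
fpr = fp / (tn + fp)   # false positive rate
```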
Specificity is the probability that your classifier correctly identifies a negative observation, so this is the same as the true negative rate. Sensitivity is TP over TP plus FN; specificity is TN over TN plus FP. Sensitivity you can calculate either by hand using confusion_matrix or with recall_score, but specificity — to my knowledge (they may have added one since), there is no function like a specificity score in sklearn — you'd have to compute by hand.

So this is a lot — I'm very aware that it's a lot. You know, I usually don't remember any of these formulas other than that precision is the column and recall is the row, so I always have to look them up to remind myself before I try to use them. So to make it easier for you guys, I've created a confusion matrix cheat sheet, which should be in the repository; if it's not, you can let me know, and I'll make sure to upload it later tonight. It has the picture of the confusion matrix, and then the different metrics with their names and formulas on the right-hand side. There are some on here that we did not formally introduce — we didn't talk about the total error rate, and we didn't talk about the type 1 error rate or type 2 error rate explicitly — but you can look at them here and see the formulas.

This is just useful to have: even if you know a lot about data science, it's really easy to forget what these formulas are, and if you're going into a job interview or something, it's useful to use this to brush up on what they are, because you may be asked, "so, what's the difference between precision and recall?" And if you can't remember, it's just good to know it, and then you can reason out what the difference is from looking at the formula. I don't think any job interview is gonna penalize you if you draw the confusion matrix to reason out what it is. So this is there so you don't have to open the Jupyter notebook every time; you can just refer back to the cheat sheet.

In the real world this is really important. A lot of the problems we've been working on aren't real-world problems that have an impact on anybody, because we're just trying to learn, but in the real world you want to give the metrics that you use really careful consideration. The metric you use is going to help determine which model is seen as best, so you can have models that have really good accuracy.
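A quick sketch of that last point: sensitivity can come straight from recall_score, while specificity has to be assembled from the confusion matrix entries (again assuming the y_train and y_train_pred variables from above).

```python
from sklearn.metrics import confusion_matrix, recall_score

sensitivity = recall_score(y_train, y_train_pred)             # same as the true positive rate / recall

tn, fp, fn, tp = confusion_matrix(y_train, y_train_pred).ravel()
specificity = tn / (tn + fp)                                  # true negative rate, computed by hand
```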
If what your problem requires is precision or recall, though, having the best accuracy does not necessarily give you the best precision or the best recall. As an example, public health will often focus on sensitivity and specificity, because they can be translated into real-world health impacts. Let's say, in the case of a deadly disease for which we have successful treatment regimens, we may want to have high sensitivity. We may opt for high specificity if the disease or condition in question does not tend to cause severe outcomes and the test or treatment is highly invasive — so maybe this condition or disease just causes a mild inconvenience, like a common cold tends to do, but the test we would have to give to check whether or not a person has that disease or condition is very invasive; then we may opt for one that has high specificity.

I guess the takeaway here is not necessarily to remember the public health stuff, but to consider what the impact of the different types of incorrect and correct predictions is for your particular model, and then choose the metric that will help either maximize that impact if it's positive, or minimize it if it's negative, when choosing your classification algorithms.

Another thing I'll stress is that a lot of times in business settings people like to have these metrics because you can readily interpret them in terms of outcomes. There are other metrics that people will look at, which we'll talk about maybe a little bit later in notebook number 5. I personally appreciate these ones because I can interpret what they mean in terms of the actual real-world problem. There are other metrics that try to take these and combine them all into one super-metric that gives you some kind of average; those are hard to interpret in terms of the impact on the real-world problem. I think a lot of people want to know the impact in terms of the real world, so I would encourage you to first appeal to the ones where you can disentangle what they mean for the actual impacts of your problem, instead of all-encompassing metrics that just say "well, this model is the best because this number is lowest" when you can't really interpret that number in terms of the real problem.

Okay. So before moving on to the next notebook, are there any questions about these metrics?

Okay, so now we're gonna learn a different algorithm called logistic regression. This is probably familiar to a large number of you, because it's very commonly taught in statistics courses. So, as the thing that we'll start off by talking about:
this is technically a form of statistical regression — that's where the "regression" comes from. It's part of a framework called generalized linear models. So it is a regression statistical model, but it's not used to solve regression supervised learning problems, because the outcome that you're trying to predict or understand is not a numeric outcome but a binary outcome. So: it is a regression statistical algorithm, and it is used for classification problems in supervised learning.

This can cause a lot of friction between statisticians and machine learning people. I'm sure, if you've been trying to learn data science, you've seen those massive cheat sheets — I remember, growing up in high school, the kids that would go "oh, I get a cheat sheet" and then try to cram literally everything from the course onto one sheet of paper; you've probably seen those sorts of things online about data science — and they say logistic regression is a classification algorithm. This causes a lot of friction with stats people. So just be aware: it's technically a statistical regression algorithm, but it gets used in classification problems. I don't go hard either way; just be understanding of what it actually is.

So, based on the fact that it is a statistical regression model, we need to be regressing onto some sort of continuous measure — the thing we're trying to understand in a regression problem needs to be a continuous measure of some kind. So how do we get that from a binary 0/1 problem? Well, let's look at an example. To make it really nice for us, I just have this random data that gives a binary problem; I'm going to make a train test split, and then we'll look at the data.

So here the data has a single feature, and I've plotted the class of the observations on the vertical axis. It looks like things that are closer to 0 on the feature tend to be classified as a 0, and then, as you start to get larger, most of your observations tend to be of class 1. And so, while this vertical axis says "class" as the label, we could have very easily said "probability that the observation is a one." Why could we say that? We know the labels for all of these observations: we know that anything down here is a class 0 observation, and we know that everything up here is a class 1 observation. So the probability that any one of these points up here is class 1 — well, that's equal to one. The probability up here is one, and down here the probability that any of these observations is class 1 is 0, because we know for a fact that they are not class 1.
So that's the idea here: we're replacing the 0/1 label as a class and thinking of it instead as the probability that it is class 1. That's the continuous measure that we're going to regress on.

In the linear regression setting, the functional form that we assumed was a linear combination of terms. Here, for logistic regression, we're going to use a sigmoidal curve, which in general, for a single variable, looks like 1 over (1 plus e to the negative x). And here's just what this is — this is plotting that function; this little x is not our data, this is just plotting the function 1/(1 + e^(-x)). Okay? And you can see how this shape of function fits how we might want to model this probability — you can see how the sigmoid curve fits the shape of our data very nicely, and why we might be inclined to use it.

So for us, the function that we're gonna try to estimate is going to be little p of X — and I think I forgot to say this, but little p of X is going to be the probability that y is equal to 1 given the features that we've observed; it's the probability that y equals 1 conditional on the features you're observing. Okay, so this little p of X — we're gonna assume that it follows a function of the form 1 over (1 plus e to the negative X times beta). Beta, just like in linear regression, is a column vector of coefficients, and X is a matrix of features where I've included a column of ones at the front.

In general, this model is fit using the statistical method of maximum likelihood estimation. For a derivation of how that works, you can check out the practice problems; we're not gonna go over it here, because we're gonna use sklearn's logistic regression model, and it's a little more complicated to fit this than it was for the linear regression stuff, at least to write it out and go over it.

Okay. So sklearn has LogisticRegression as a model object, as we might assume. We're gonna say from sklearn.linear_model — that's where it's stored, in linear_model — import LogisticRegression. Okay, and now we're gonna go to the model object's documentation. I wanna point something out: if you look here, there's this argument penalty='l2'. If you remember back to when we learned linear regression last week — or it might have actually just been Monday this week — l2 was the ridge regression norm. So what this is saying is that by default, sklearn uses the ridge regression version of logistic regression.
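For reference, the sigmoid curve being plotted a moment ago looks like this — a quick sketch assuming NumPy and matplotlib, not necessarily the notebook's exact code.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 200)
sigmoid = 1 / (1 + np.exp(-x))   # the general single-variable sigmoid, 1 / (1 + e^(-x))

plt.plot(x, sigmoid)
plt.xlabel('x')
plt.ylabel('1 / (1 + exp(-x))')
plt.show()
```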
00:54:33.000 --> 00:54:43.000 And so, because I just want to demonstrate regular logistic regression, we're going to set the penalty equal to None. 00:54:43.000 --> 00:54:44.000 And so this is going to go back to regular, old-school 00:54:44.000 --> 00:54:51.000 logistic regression. Just be aware that by default 00:54:51.000 --> 00:54:59.000 scikit-learn implements regularized logistic regression. 00:54:59.000 --> 00:55:08.000 Okay. So when I define my model object, I'm going to call LogisticRegression with the input penalty=None. 00:55:08.000 --> 00:55:16.000 And if you haven't seen None before, this is a special Python object. 00:55:16.000 --> 00:55:20.000 It literally just means nothing, like there is no argument. 00:55:20.000 --> 00:55:21.000 It's different from the string of the word 00:55:21.000 --> 00:55:29.000 "none". Okay? So then I'm going to fit my model on X_train — 00:55:29.000 --> 00:55:33.000 and this is a one-dimensional vector or array, 00:55:33.000 --> 00:55:40.000 so I have to use reshape — followed by y_train. 00:55:40.000 --> 00:55:45.000 "LogisticRegression object is not callable." 00:55:45.000 --> 00:55:52.000 Let's just copy and paste. 00:55:52.000 --> 00:55:57.000 Oh, I forgot — that's it, that's why. Got it. 00:55:57.000 --> 00:56:05.000 Here we go. And so, just like with K nearest neighbors, we can call .predict, 00:56:05.000 --> 00:56:08.000 and we input 00:56:08.000 --> 00:56:11.000 our features, so you can see now we've got zeros and ones. 00:56:11.000 --> 00:56:14.000 But remember, I said we're trying to model a probability. 00:56:14.000 --> 00:56:18.000 So what are these zeros and ones? 00:56:18.000 --> 00:56:24.000 How do I get the probability? Well, just like with the predict_proba that we talked about for K nearest neighbors. 00:56:24.000 --> 00:56:37.000 So we take log_reg, and instead of predict we do predict_proba — I don't really know how you're supposed to say it — and then we input our features. 00:56:37.000 --> 00:56:56.000 And now you can see we have 2 columns. The 0 column here is the probability that the observation is of class 0, and the 1 column here is the probability that it's of class 1. 00:56:56.000 --> 00:57:02.000 And so we can use this predict_proba to plot the fitted logistic regression model. 00:57:02.000 --> 00:57:03.000 And so we have our training data as blue circles, and our red dotted line is the fit of the model. 00:57:03.000 --> 00:57:18.000 So this represents the probability that y equals 1 given x, as fit by the logistic regression model on the training data. 00:57:18.000 --> 00:57:28.000 Okay. 00:57:28.000 --> 00:57:39.000 So, Keira Thon, the reason that you're — so, "I get this error when I run .fit: ValueError: logistic regression supports only penalties l1, l2, 00:57:39.000 --> 00:57:43.000 elasticnet, and none" — lowercase, as a string. 00:57:43.000 --> 00:57:48.000 So the reason that you get this is that you have an earlier version of scikit- 00:57:48.000 --> 00:57:52.000 learn. In the earlier versions of scikit-learn, they used the string "none". 00:57:52.000 --> 00:58:08.000 They later updated that to be the Python object None, with a capital N and not as a string. So if you're getting that error, you have to actually use the string "none".
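[Editor's sketch] For reference, here is a minimal sketch of the fitting step described above. The synthetic data and the variable names X_train and y_train are illustrative assumptions; the notebook's actual data generation is not reproduced here.

# Minimal sketch of fitting unregularized logistic regression in scikit-learn.
# X_train / y_train are assumed: a single continuous feature and 0/1 labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(216)
X_train = rng.uniform(0, 1, 200)                    # single feature
y_train = 1 * (rng.uniform(0, 1, 200) < X_train)    # larger x -> more likely class 1

# penalty=None turns off the default L2 (ridge) regularization;
# on older scikit-learn versions use the string 'none' instead.
log_reg = LogisticRegression(penalty=None)
log_reg.fit(X_train.reshape(-1, 1), y_train)

print(log_reg.predict(X_train.reshape(-1, 1))[:5])        # hard 0/1 labels
print(log_reg.predict_proba(X_train.reshape(-1, 1))[:5])  # column 0 = P(y=0), column 1 = P(y=1)

The fitted curve itself is the sigmoid p(x) = 1 / (1 + e^(-(b0 + b1*x))) with the estimated coefficients, which is what predict_proba is evaluating.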
00:58:08.000 --> 00:58:23.000 And so this is just a good place to remind everybody: a lot of times, if you try to run the code that I run and it doesn't work, first check for a typo, and then, if you can't see a typo, it's probably because the version of the package that I'm using is different 00:58:23.000 --> 00:58:24.000 from the version of the package that you're using. 00:58:24.000 --> 00:58:34.000 So check what version you have, and then see if you can find that version's documentation. 00:58:34.000 --> 00:58:45.000 Yahweh is asking: is it worth understanding the overlapping region between the two classes? So, 00:58:45.000 --> 00:58:58.000 it's possible that if this were a real-world data set, you might be interested in whether there's something about this region of the feature that explains why we see this overlap. 00:58:58.000 --> 00:59:02.000 But this is synthetic data; it's not real. 00:59:02.000 --> 00:59:07.000 It's just randomly generated. The overlap is there because I didn't want a very easy problem where you just put a dividing line at, like, 0.6 or something. 00:59:07.000 --> 00:59:21.000 So that's why. I would say that the amount of overlap probably determines the steepness — 00:59:21.000 --> 00:59:24.000 is that the right word? — the steepness of the curve. 00:59:24.000 --> 00:59:34.000 If there were very little overlap, there would probably be a much steeper transition, and if there were more overlap it would be less steep. 00:59:34.000 --> 00:59:42.000 So that's what I would say: in a real-world problem you might be interested in understanding 00:59:42.000 --> 00:59:49.000 whether there's something about the observations and the feature — some actual phenomenon going on — 00:59:49.000 --> 00:59:53.000 that explains why there's overlap here. 00:59:53.000 --> 01:00:03.000 Brantley's asking: it seems strange that logistic regression is regularized by default in scikit-learn — do you know why that is? I don't know why that is; I kind of thought it was strange too. And it's also this weird thing: 01:00:03.000 --> 01:00:18.000 this is probably my sixth or seventh time teaching this notebook, and it wasn't until about my third time that somebody asked — the curve used to look really weird compared to the data, because it was regularized. 01:00:18.000 --> 01:00:22.000 Somebody finally asked, and I looked at the documentation, and 01:00:22.000 --> 01:00:25.000 it was imposing that regularization. And so that was why. 01:00:25.000 --> 01:00:27.000 I don't think a lot of people know that; it doesn't seem natural to me that you would impose the regularization from the get- 01:00:27.000 --> 01:00:39.000 go, so I don't know why. I'm sure that they had some reason. 01:00:39.000 --> 01:00:52.000 So it's also a good message that you should read the documentation of the functions that you're trying to use, because they might not be working in the way that you're assuming they're working. 01:00:52.000 --> 01:01:00.000 Okay, so earlier in this notebook, we talked about how .predict automatically gives you zeros and ones. 01:01:00.000 --> 01:01:03.000 And you're like, well, what the heck's going on here?
01:01:03.000 --> 01:01:19.000 I thought this was supposed to give me probabilities. What's happening is, it goes through all of the observations and checks which of the 2 columns has a probability greater than a half, and then, whatever column that is, it assigns the corresponding class. So 01:01:19.000 --> 01:01:23.000 if column 1 has probability greater than a half, it assigns a 1; 01:01:23.000 --> 01:01:31.000 if column 0 has probability greater than a half, it assigns a 0. And so that's what's going on, which means 01:01:31.000 --> 01:01:37.000 you can set different probability cutoffs yourself. So one thing you might end up doing — 01:01:37.000 --> 01:01:57.000 almost all of the scikit-learn classification algorithms set this cutoff of 0.5 as the default, or, if it's multi-class, whichever class gets the majority — is to play around with setting different probability cutoffs. Maybe I'm going to be a little bit 01:01:57.000 --> 01:02:16.000 stingier and say I only want to call observations a 1 if they have a probability of, like, 0.7 — you really want to be sure those observations are a 1 before you say the classification is a 1. So you can set your own cutoffs, and then that will alter 01:02:16.000 --> 01:02:20.000 things like the accuracy or the precision or the recall. 01:02:20.000 --> 01:02:28.000 And this cutoff becomes a different type of thing that you can tune with 01:02:28.000 --> 01:02:32.000 cross-validation. So, for instance, we could set a different cutoff. 01:02:32.000 --> 01:02:38.000 Maybe we want to be a little bit stingier about what we call a 1, and we set it to be 0.63. 01:02:38.000 --> 01:02:45.000 And so then we can say: all right, give me all of the observations where my predicted probability — 01:02:45.000 --> 01:02:50.000 so I'm going to first store all my predicted probabilities in an array, so I can keep using them, 01:02:50.000 --> 01:02:51.000 calling predict_proba on X_train.reshape(-1, 1). 01:02:51.000 --> 01:03:04.000 And I'm just going to take the 1 column, the column that corresponds to class 1. Then, to get my predictions, 01:03:04.000 --> 01:03:14.000 I'm going to say 1 times (the probability greater than or equal to the cutoff). 01:03:14.000 --> 01:03:32.000 What this does is produce an array of Trues and Falses, and then multiplying it by 1 changes the array to zeros and ones, and then I calculate the accuracy. So if I have a cutoff of 0.63, my training accuracy 01:03:32.000 --> 01:03:34.000 is now 92.75%. And we can play around with this and just see, okay, what if I make it 0.43? 01:03:34.000 --> 01:03:42.000 And you can see that at 0.43 01:03:42.000 --> 01:03:47.000 I have a slightly lower accuracy. So let's go back to 0.63. 01:03:47.000 --> 01:03:54.000 And so what you can do — I'm doing it with the training set, but in practice you would use a validation set or cross-validation — 01:03:54.000 --> 01:04:01.000 is change the cutoff to different values, and then see how that impacts whatever metric you're using. 01:04:01.000 --> 01:04:07.000 So here, for instance, is the training accuracy — again, I'm just using the training set because it's easiest; in practice
01:04:07.000 --> 01:04:15.000 you want to use cross-validation or a validation set — and you can see how the cutoff impacts the accuracy. 01:04:15.000 --> 01:04:23.000 And if this were a cross-validation or a validation set, you would maybe want to choose the cutoff that has the highest accuracy, 01:04:23.000 --> 01:04:27.000 if that is the metric you've decided to go with. 01:04:27.000 --> 01:04:40.000 Okay, are there any questions about the probability cutoff? 01:04:40.000 --> 01:04:46.000 Okay, so we're going to quickly go through how to interpret logistic regression. 01:04:46.000 --> 01:04:50.000 Just like you can interpret linear regression by looking at the coefficients, 01:04:50.000 --> 01:04:56.000 you can also somewhat interpret logistic regression by again looking at the coefficients. 01:04:56.000 --> 01:05:05.000 So if you do some rearranging and a little bit of algebra, you can find that 01:05:05.000 --> 01:05:06.000 log(p(X) / (1 - p(X))) = Xβ. 01:05:06.000 --> 01:05:15.000 The expression p(X) / (1 - p(X)) 01:05:15.000 --> 01:05:25.000 is the odds of the event y = 1: the probability of an event divided by the probability of the event not happening is known as the odds, and for us the event is y = 1. 01:05:25.000 --> 01:05:40.000 I guess it would technically be the conditional odds, conditional on the features. So the statistical model for logistic regression is a linear model of the log odds of being class 1. 01:05:40.000 --> 01:05:44.000 And so this allows you to interpret the coefficients of the model. 01:05:44.000 --> 01:06:02.000 So if you look at the model we just fit, we have that the log odds equal β0 + β1·x, or rather the odds given x equal some constant C times e^(β1·x), where x is our feature and C is some constant that we're ultimately 01:06:02.000 --> 01:06:09.000 not going to care about. This allows you to interpret what a one-unit increase in the feature does to the odds of being class 1. 01:06:09.000 --> 01:06:19.000 And if you go through this, you can see that for every one-unit increase in your feature, 01:06:19.000 --> 01:06:24.000 your odds get multiplied 01:06:24.000 --> 01:06:28.000 by e^(β1). 01:06:28.000 --> 01:06:32.000 Okay. And so we're going to go through and show you how to do this. 01:06:32.000 --> 01:06:36.000 And also, just another note about the penalty: 01:06:36.000 --> 01:06:37.000 this only works if your penalty is equal to None, 01:06:37.000 --> 01:06:46.000 so you have to be doing plain logistic regression, not regularized logistic regression. 01:06:46.000 --> 01:06:52.000 So here's our coefficient — you access it again with .coef_. 01:06:52.000 --> 01:06:59.000 Our coefficient is 23.12, and so then we can interpret the coefficient 01:06:59.000 --> 01:07:09.000 just like I said. Instead of a one-unit increase — one unit is the whole span of our feature — 01:07:09.000 --> 01:07:11.000 we're going to go off a 0.1-unit increase. 01:07:11.000 --> 01:07:16.000 So for a 0.1-unit increase in our feature, 01:07:16.000 --> 01:07:22.000 our odds are multiplied by a factor of about 10.1. 01:07:22.000 --> 01:07:26.000 So finally, we have some assumptions for the algorithm. 01:07:26.000 --> 01:07:32.000 We didn't mention any of these earlier because the data was generated to follow these assumptions.
01:07:32.000 --> 01:07:36.000 The first assumption is that your samples need to be independent. 01:07:36.000 --> 01:07:39.000 If you use multiple predictors, you don't want them to be correlated. 01:07:39.000 --> 01:07:50.000 Similarly to linear regression, the assumption is that your log odds are a linear function of your data. 01:07:50.000 --> 01:07:53.000 And then, finally, you typically want to have a larger data set. 01:07:53.000 --> 01:08:14.000 If you have a really small data set — and again, this depends on the number of features you're including in your model, that sort of thing — logistic regression won't be the best model for your data. 01:08:14.000 --> 01:08:27.000 So Yahweh is asking whether odds here means odds in the everyday sense. With the proliferation of sports gambling, you've probably heard odds in some sort of 01:08:27.000 --> 01:08:34.000 advertisement recently, so odds here works essentially the way you probably assume odds work in the real world. 01:08:34.000 --> 01:08:35.000 The odds are the probability that something will happen 01:08:35.000 --> 01:08:50.000 divided by the probability that it will not happen. It's supposed to give you some sense of how likely something is to happen compared to not happening. So this expression right here, 01:08:50.000 --> 01:08:56.000 the p divided by (1 - p) — those are odds. 01:08:56.000 --> 01:09:13.000 So if something has 2-to-1 odds, it's 2 times more likely to happen than not happen. 01:09:13.000 --> 01:09:15.000 Okay. 01:09:15.000 --> 01:09:20.000 So we'll see how far we get through this. I expect that we'll be able to finish the diagnostic curves. 01:09:20.000 --> 01:09:30.000 I don't know that we'll be able to start notebook number 6 today, but I'm pretty happy with our progress, given the number of lectures we have left. 01:09:30.000 --> 01:09:34.000 So we're going to use this data set that we just used for logistic regression. 01:09:34.000 --> 01:09:40.000 I'm just going through, refitting everything and reminding ourselves of what we literally just looked at. 01:09:40.000 --> 01:09:44.000 And let's double-check — okay, good, penalty was None. 01:09:44.000 --> 01:09:57.000 So just like we had different metrics, those metrics also allow us to build different curves that we can use to diagnose our model and compare it to other models. 01:09:57.000 --> 01:10:06.000 So remember our confusion matrix — and we literally just talked about this, which is really nice about being able to do this notebook today: 01:10:06.000 --> 01:10:11.000 we talked about how different probability cutoffs lead to different predictions. 01:10:11.000 --> 01:10:12.000 Right. And so basically every probability cutoff we could choose 01:10:12.000 --> 01:10:28.000 is going to give us a different confusion matrix, right? So, for instance — again, this is all on the training data, just for the simplicity of 01:10:28.000 --> 01:10:42.000 the lecture — if we chose a probability cutoff of 0.4, our confusion matrix would look like this, versus a probability cutoff of 0.6 gives us a confusion matrix that looks like this.
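[Editor's sketch] As a hedged illustration of that last point, here is a minimal sketch of comparing confusion matrices at two cutoffs, reusing the log_reg, X_train, and y_train assumed in the earlier sketch (not the notebook's actual code).

# Different probability cutoffs give different hard predictions,
# and therefore different confusion matrices (training data only, for simplicity).
from sklearn.metrics import confusion_matrix

y_prob = log_reg.predict_proba(X_train.reshape(-1, 1))[:, 1]  # P(y = 1) for each observation

for cutoff in (0.4, 0.6):
    y_pred = 1 * (y_prob >= cutoff)      # True/False -> 1/0
    print(f"cutoff = {cutoff}")
    print(confusion_matrix(y_train, y_pred))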
01:10:42.000 --> 01:10:56.000 And so basically, this means there's a wide range of possible precisions, possible recalls, possible specificities, possible sensitivities that we could get just by choosing different probability cutoffs. 01:10:56.000 --> 01:11:02.000 So this allows us to develop a series of curves that we can use to look at differences and compare models, 01:11:02.000 --> 01:11:07.000 and give us a feel for: okay, if we chose this cutoff versus that cutoff, what precision and recall are possible? 01:11:07.000 --> 01:11:17.000 So the first type of curve that you may have heard of before is called the precision- 01:11:17.000 --> 01:11:21.000 recall curve. So the precision-recall curve 01:11:21.000 --> 01:11:28.000 is going to plot the precision on the vertical axis and the recall on the horizontal axis. 01:11:28.000 --> 01:11:33.000 If you're doing this on your own, you could try it as an exercise; 01:11:33.000 --> 01:11:42.000 I'm just going to do it for you. So we're going to import our precision_score and our recall_score. 01:11:42.000 --> 01:11:52.000 And then what I'm going to do is go through an array of cutoffs and get the precision and the recall for each of those cutoffs. 01:11:52.000 --> 01:11:53.000 So I've got the logistic — or did I store this? 01:11:53.000 --> 01:12:05.000 I did not, so let's go ahead; I'm going to copy this and add an extra line here. 01:12:05.000 --> 01:12:14.000 So my probabilities I'm going to store in this array, and then for my cutoffs I'm going to do 1 times 01:12:14.000 --> 01:12:24.000 (y_prob greater than or equal to the cutoff). So each time through the loop I'm looping through this array of possible probability cutoffs, 01:12:24.000 --> 01:12:30.000 then I'm finding what the predictions would be if I used that cutoff, and now I'm going to track the resulting precision and recall from that. 01:12:30.000 --> 01:12:40.000 So precision_scores.append(precision_score(...)), and then I need my y — 01:12:40.000 --> 01:12:50.000 again, this is on the training data, just for the simplicity of the lecture — and then the predictions. 01:12:50.000 --> 01:13:03.000 And then my recalls: rec_scores.append(recall_score(y_train, ...)) with the predicted values — 01:13:03.000 --> 01:13:08.000 not scores, just score. 01:13:08.000 --> 01:13:12.000 Okay. And so here you can see what a precision-recall 01:13:12.000 --> 01:13:19.000 curve looks like. It's supposed to give you a sense of what precision and recall combinations are available. 01:13:19.000 --> 01:13:32.000 And there is a trade-off between precision and recall: typically, if you try to raise your precision, in general that will lead to a lowering of your recall, and vice versa. 01:13:32.000 --> 01:13:38.000 So it's not always possible to get a perfect precision and a perfect recall; by raising one you tend to lower the other, and vice versa. 01:13:38.000 --> 01:13:42.000 And so the perfect classifier would have a precision- 01:13:42.000 --> 01:13:50.000 recall curve that hugs the upper right-hand corner of the plot. 01:13:50.000 --> 01:13:56.000 A perfect classifier would be one where you can get 100% precision and 100% recall. 01:13:56.000 --> 01:14:02.000 So any time you predict positive, it is actually positive, 01:14:02.000 --> 01:14:04.000 and you're able to capture all of the actual positives. 01:14:04.000 --> 01:14:10.000 That's the idea here. Okay?
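[Editor's sketch] Here is a minimal sketch of the precision-recall loop just described, reusing y_prob and y_train from the earlier sketches; the cutoff grid is an illustrative assumption, not the notebook's exact values.

# Sweep probability cutoffs, record precision and recall at each one,
# then plot recall (horizontal) against precision (vertical). Training data only, as in the lecture.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score

cutoffs = np.arange(0.01, 1.0, 0.01)
precision_scores, rec_scores = [], []

for cutoff in cutoffs:
    y_pred = 1 * (y_prob >= cutoff)
    precision_scores.append(precision_score(y_train, y_pred, zero_division=0))
    rec_scores.append(recall_score(y_train, y_pred))

plt.plot(rec_scores, precision_scores)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()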
And so basically, the idea is you're going to want to have some sort of trade-off in mind — like, okay, I would rather have a higher precision. 01:14:10.000 --> 01:14:23.000 But there may be some sort of business implication of choosing one value versus the other. 01:14:23.000 --> 01:14:38.000 It really depends on the problem. From the project you're working on, you may be able to say: okay, if I have a recall or precision of this, it has this implication for the real-world problem I'm solving. 01:14:38.000 --> 01:14:48.000 And so you might have some sort of limit: okay, I don't want to go below this recall, or I don't want to go below this precision, because of X, Y, and Z 01:14:48.000 --> 01:15:08.000 in the business problem you're looking at. So these sorts of curves allow you to see the possible precision and recall scores for any given classifier you're looking at, and then you can plot multiple curves for multiple different models to see if one model could give you a 01:15:08.000 --> 01:15:12.000 better recall for an equivalent precision, or something like that. 01:15:12.000 --> 01:15:27.000 Okay, so are there any questions about the precision-recall curve? 01:15:27.000 --> 01:15:33.000 Right. Another curve that tends to get looked at, and that you may have heard of before 01:15:33.000 --> 01:15:45.000 if you tried to learn classification before the boot camp, is called the receiver operating characteristic, or ROC, curve. These curves arose in World War II 01:15:45.000 --> 01:15:49.000 as a way to aid operators of radar receivers in detecting enemy objects on the battlefield. 01:15:49.000 --> 01:15:52.000 So I think that's where the ROC name comes from. 01:15:52.000 --> 01:15:54.000 I've also been told by some friends of mine that ROC actually didn't have a meaning for a long time. 01:15:54.000 --> 01:16:00.000 I'm not sure which story is true. 01:16:00.000 --> 01:16:05.000 What this does is plot the true positive rates against the false positive rates for various cutoff values. 01:16:05.000 --> 01:16:13.000 So here's just a reminder of what these are. 01:16:13.000 --> 01:16:18.000 So the true 01:16:18.000 --> 01:16:21.000 positive rate is TP / (TP + FN). 01:16:21.000 --> 01:16:30.000 It estimates the probability that you predict a 1 given that it's actually a 1. 01:16:30.000 --> 01:16:37.000 And then the false positive rate estimates the probability that you predict a 1 given that it's actually a 0. 01:16:37.000 --> 01:16:41.000 So let's see what I wrote before — and I think this makes sense. 01:16:41.000 --> 01:16:47.000 Okay, so one way to think of these metrics: imagine you're in oncology. 01:16:47.000 --> 01:16:59.000 Sometimes, if you're somebody who has a tumor — a collection of potentially cancerous cells — you'll have surgery to remove that tumor. And so the goal of this surgery, right, 01:16:59.000 --> 01:17:10.000 is to maximize the number, or proportion, of cancer cells that are removed and minimize the number of normal cells that are removed. 01:17:10.000 --> 01:17:15.000 And so we can think of the removal of cancerous cells as a true positive, 01:17:15.000 --> 01:17:16.000 and the removal of normal cells as a false positive.
01:17:16.000 --> 01:17:40.000 And so your goal with your classifier, or in this surgery, is to remove as much of the cancer as you can — predict as many actual 1s as you can — while limiting the number of normal cells accidentally classified as 1s in this oncology example. So the idea is that we typically want 01:17:40.000 --> 01:17:43.000 to maximize our TPR while minimizing our FPR. 01:17:43.000 --> 01:17:44.000 Once again, it turns out that it's not always possible to increase one without decreasing the other. 01:17:44.000 --> 01:17:56.000 So there are 2 ways to do this. You can just do a for loop like we did before. 01:17:56.000 --> 01:18:03.000 So you would calculate the confusion matrix — and I'm going to get rid of that 01:18:03.000 --> 01:18:14.000 and put in my probabilities from before — you compute the confusion matrix with the actual values and the predicted values. 01:18:14.000 --> 01:18:17.000 And then, just for some simplicity of the formulas 01:18:17.000 --> 01:18:21.000 I'm doing, I'm extracting the different values here. 01:18:21.000 --> 01:18:25.000 So the confusion matrix at (0, 1) gives the false positives, the confusion matrix at (1, 0) gives the false negatives, 01:18:25.000 --> 01:18:26.000 and then the true positives are the confusion matrix 01:18:26.000 --> 01:18:34.000 at (1, 1). So remember, the TPR is TP 01:18:34.000 --> 01:18:41.000 divided by (FN + TP), and then the false positive rate is FP 01:18:41.000 --> 01:18:46.000 divided by (TN + FP). 01:18:46.000 --> 01:18:53.000 Okay. And so then you plot the true positive rate against the false positive rate. 01:18:53.000 --> 01:19:00.000 And typically what you'll see along with your curve is this dotted line — it doesn't necessarily have to be dotted — 01:19:00.000 --> 01:19:12.000 the line y = x. This is a reference line that often gets plotted on these ROC curves. 01:19:12.000 --> 01:19:22.000 This line is supposed to represent what you would get if your algorithm just did random guessing. 01:19:22.000 --> 01:19:29.000 And so ideally, you want to be above that diagonal, in the upper left-hand triangle. 01:19:29.000 --> 01:19:41.000 You want to be above that line, because otherwise your algorithm is no better than random guessing. 01:19:41.000 --> 01:19:54.000 The best algorithm you could get would be one that has a point at the (0, 1) mark, because then it is possible for your algorithm to have 01:19:54.000 --> 01:19:55.000 a false positive rate of 0 and a true positive rate of 1. 01:19:55.000 --> 01:20:05.000 So once again, you can use this curve to choose cutoffs — it shows you the trade-off for your algorithm between true positive rate 01:20:05.000 --> 01:20:29.000 and false positive rate. You can use it to compare what's possible for different algorithms, and maybe see if one algorithm allows you to get a true positive rate at a false positive rate that you're willing to accept, based on the project you're working on. And also, occasionally 01:20:29.000 --> 01:20:35.000 people will use the area under these curves as a metric on its own. 01:20:35.000 --> 01:20:48.000 I would encourage you not to use that as your metric, mainly because it doesn't 01:20:48.000 --> 01:20:52.000 tell you why your area is high — it doesn't tell you if it's because you're able to get better true positives.
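[Editor's sketch] For reference, a minimal sketch of the for-loop ROC computation just described, again reusing y_prob and y_train from the earlier sketches; the cutoff grid is an illustrative assumption.

# True positive rate and false positive rate at each cutoff,
# read off the confusion matrix exactly as described above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

tprs, fprs = [], []
for cutoff in np.arange(0.01, 1.0, 0.01):
    y_pred = 1 * (y_prob >= cutoff)
    cm = confusion_matrix(y_train, y_pred)
    tn, fp, fn, tp = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
    tprs.append(tp / (tp + fn))          # TPR = TP / (TP + FN)
    fprs.append(fp / (fp + tn))          # FPR = FP / (FP + TN)

plt.plot(fprs, tprs)
plt.plot([0, 1], [0, 1], "r--")          # y = x reference line: random guessing
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()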
01:20:52.000 --> 01:21:00.000 You know what I mean — you're not able to get a good sense of what it's actually telling you; you're obfuscating some of the information that the curve gives you 01:21:00.000 --> 01:21:12.000 by just looking at the area. 01:21:12.000 --> 01:21:19.000 Yeah, so Lara just asked: would you compare the curves visually to determine which model is best, or is there a quantitative measure? 01:21:19.000 --> 01:21:25.000 So typically what people have done in the past is they'll look at the area under the curve, and then they'll say, oh, this one has a higher area, 01:21:25.000 --> 01:21:37.000 therefore it must be better. One reason I don't like that is that by having just a single measure, you're sort of obscuring why the area is better: 01:21:37.000 --> 01:21:38.000 is it because it tends to have better true positives, or does it tend to have better false positives? 01:21:38.000 --> 01:21:48.000 So, if I had the choice, I would just look at the curves and then select based off the curves. 01:21:48.000 --> 01:21:58.000 I'm not a big fan of metrics that try to take a bunch of different metrics and then combine them together into some sort of super- 01:21:58.000 --> 01:22:13.000 metric for comparison purposes. I think it's better to understand the real-world implications; you can't really take the area under the curve and translate it into what it means in terms of the real-world problem. 01:22:13.000 --> 01:22:14.000 Okay, so that was how to do it with a for loop. 01:22:14.000 --> 01:22:25.000 This is such a popular thing that scikit-learn has a function for it; it's called roc_curve. 01:22:25.000 --> 01:22:32.000 And so roc_curve takes in the true values followed by the predicted probabilities, 01:22:32.000 --> 01:22:39.000 so y_prob. 01:22:39.000 --> 01:22:48.000 And you'll see it returns 3 things. The first thing it returns is an array of the false positive rates. 01:22:48.000 --> 01:22:55.000 The second thing it returns is an array of the true positive rates. 01:22:55.000 --> 01:23:02.000 And then the last thing it returns is an array of the probability cutoffs. 01:23:02.000 --> 01:23:05.000 Now, there's a slight difference in the cutoffs. 01:23:05.000 --> 01:23:06.000 You'll notice that the 0th entry is greater than 1. 01:23:06.000 --> 01:23:21.000 The 0th entry in the cutoffs is a special thing that I would have to check the documentation to remind myself what it does, so if you're interested, you can click on it here and read through the returns. 01:23:21.000 --> 01:23:29.000 Okay, so there it is. I think the 0th entry is just the maximum possible score plus 1 — 01:23:29.000 --> 01:23:33.000 the maximum predicted probability, plus 1. 01:23:33.000 --> 01:23:39.000 So that's the 0th entry of the cutoffs. 01:23:39.000 --> 01:23:48.000 Okay? And so then you can plot it just like we did above, okay? 01:23:48.000 --> 01:24:07.000 And then one reason why it looks different is that scikit-learn automatically includes (0, 0) as an entry, whereas when we did it as a for loop that didn't show up. And so that's why this one goes all the way down to (0, 0) and ours did not. 01:24:07.000 --> 01:24:12.000 The very last chart type, and I think we have just enough time to explain it, 01:24:12.000 --> 01:24:16.000 is called the gains and lift charts. This is sort of a weird-sounding chart,
01:24:16.000 --> 01:24:28.000 but just bear with me while I explain it. It's used a lot, I believe, in marketing and advertising. 01:24:28.000 --> 01:24:37.000 So the basic idea is, you take your observations and arrange them in terms of descending predicted probability. 01:24:37.000 --> 01:24:48.000 So you look at the probability that each observation is class 1, arrange your observations in that order, and then what you'll do is plot the true positive rate of your algorithm 01:24:48.000 --> 01:24:54.000 if you were to classify only the vth upper percentile of predicted probabilities as a 1. 01:24:54.000 --> 01:24:58.000 So basically, let's just assume for the sake of argument you had a hundred observations 01:24:58.000 --> 01:25:03.000 you're predicting on. Then, if you wanted to look at 01:25:03.000 --> 01:25:11.000 the twentieth upper percentile, you would take the top 20 observations — 01:25:11.000 --> 01:25:12.000 the 20 observations with the highest probability of being a 1 — classify those as 1, and then calculate your true positive rate. 01:25:12.000 --> 01:25:19.000 You do this for every possible percentile, and then you plot the curve that goes along with it. 01:25:19.000 --> 01:25:29.000 And the idea of why you would ever do this: a lot of times in advertising and marketing, 01:25:29.000 --> 01:25:35.000 you have a limited amount of funds, so you can't advertise to everybody, because you just don't have the money to do that. 01:25:35.000 --> 01:25:51.000 So what you're going to do is allocate your advertising budget to market to v percent of your potential customers, and you want to do this in a way that maximizes the number of people who would see your ad and then become a customer, or whatever process you're 01:25:51.000 --> 01:26:09.000 modeling. So if you take class 1 to be someone who will become a customer after seeing an ad, and class 0 to be someone who is not going to become a customer, then by only marketing to the people who fall in the top v percent of predicted probabilities, you're only marketing to the people that you think 01:26:09.000 --> 01:26:20.000 are most likely to become a customer. And so the gains chart allows you to see this true positive rate as a function of the percent of observations you've classified — 01:26:20.000 --> 01:26:24.000 as a function of v. And similar to the ROC curve, you typically plot a baseline, which is the line y = x, which is what you'd get by just randomly picking 01:26:24.000 --> 01:26:50.000 v percent. The lift chart is then basically taking the gains chart and dividing the line you get from doing this process by the random-guessing line, to give you a sense of the lift the algorithm is giving you over just randomly advertising to people. And in order to do this, you can use 01:26:50.000 --> 01:26:54.000 either pandas' quantile function or NumPy's quantile function. 01:26:54.000 --> 01:27:07.000 These allow you to take in an array of probabilities and get the quantile that represents the upper vth percentile — that may have sounded weird, 01:27:07.000 --> 01:27:17.000 so here I'm just making an array of my predicted probabilities 01:27:17.000 --> 01:27:24.000 — the actual values compared with the probabilities that I got from my algorithm.
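[Editor's sketch] Here is a minimal sketch of the gains and lift construction just described — sort by predicted probability, classify the top v fraction as 1, and record the true positive rate for each v — again reusing y_prob and y_train from the earlier sketches; the percentile grid is an illustrative assumption, not the notebook's exact code.

# Gains chart: TPR as a function of the fraction v of observations classified as 1,
# where the cutoff at each v is the (1 - v) quantile of the predicted probabilities.
import numpy as np
import matplotlib.pyplot as plt

vs = np.arange(0.01, 1.01, 0.01)
gains = []
for v in vs:
    cutoff = np.quantile(y_prob, 1 - v)                # upper vth percentile of predicted probabilities
    y_pred = 1 * (y_prob >= cutoff)
    gains.append(y_pred[y_train == 1].sum() / (y_train == 1).sum())   # true positive rate

plt.plot(vs, gains, label="gains")
plt.plot([0, 1], [0, 1], "r--", label="random guessing")
plt.xlabel("Fraction of observations classified as 1")
plt.ylabel("True positive rate")
plt.legend()
plt.show()

# Lift = gains divided by the random-guessing baseline v.
lift = np.array(gains) / vs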
01:27:24.000 --> 01:27:30.000 And then what I'm going to do is a loop where I start at 1 and work my way down to 0, 01:27:30.000 --> 01:27:40.000 and then I calculate the quantile for each entry — so the first entry will be 1, then 0.99, then 0.98, then 0.97, and so forth. 01:27:40.000 --> 01:27:51.000 Okay, so here we can see the first 5 upper probability quantiles. 01:27:51.000 --> 01:28:13.000 Now that I have those, I can write a loop where I go through and calculate my predictions, using those quantiles as my cutoffs, and then calculate my true positive rates as I go through; and then here I'm just making my lists for the lift plot, okay? 01:28:13.000 --> 01:28:18.000 So here's what shows up for this particular model: 01:28:18.000 --> 01:28:32.000 my gains plot looks like this. It gives you a sense that, ideally, the perfect algorithm would be a line that goes straight up 01:28:32.000 --> 01:28:36.000 until it hits — the fraction of actual 1s, I think, is where it would hit — 01:28:36.000 --> 01:28:37.000 that would be a perfect algorithm, and then the lift would correspond to that. 01:28:37.000 --> 01:28:56.000 So this gives you a sense of how well your algorithm does versus random guessing when following this sort of procedure. 01:28:56.000 --> 01:28:57.000 So — I know we're over time — 01:28:57.000 --> 01:28:59.000 I know a lot of times people like to have 01:28:59.000 --> 01:29:08.000 a one-size-fits-all approach: 01:29:08.000 --> 01:29:16.000 you see this type of problem, and you apply this approach. But choosing the metrics or the diagnostic curves you're going to use for choosing your classification algorithm is not always a plug-and-chug type of 01:29:16.000 --> 01:29:39.000 approach. You have to put some thought into what the implications are for your real-world problem — if this metric is low, what does it mean in terms of the things you're trying to classify? So this is a situation where it's helpful to put in some thought 01:29:39.000 --> 01:29:46.000 using the actual real-world context of your problem, and then translate what those metrics would mean in a business setting. 01:29:46.000 --> 01:29:59.000 Maybe there are actual costs — financial costs — to getting something wrong in a certain way; and in public health settings, there are costs in terms of people's quality of life or lifespan. 01:29:59.000 --> 01:30:07.000 So it's important to put careful thought into choosing diagnostic curves and performance metrics. 01:30:07.000 --> 01:30:09.000 Okay. So for the sake of time, I'm going to go ahead and stop the recording, and I'll hang back for about 5 to 10 minutes to answer questions. 01:30:09.000 --> 01:30:22.000 I hope you enjoyed today's lecture, and I hope you have a great weekend.