Okay, so I'm gonna start recording now.

Hi, everybody! Welcome back! This is lecture number 8 of the May 2023 data science boot camp. Today we're gonna continue with classification. I guess we started with it in yesterday's lecture, but we're going to follow up with that today. So let me go ahead and get my chat situated.

Yesterday we talked about the adjustments you have to make to data splits for classification, namely that you have to do a stratified split, stratifying on the outcome that you're trying to predict, the y. Today we're going to start diving into algorithms and then performance metrics for classification problems. We're going to start with notebook number 2 and try to work our way to notebook number 6. I don't know how far we'll get, but it's conceivable that we would be able to at least start notebook number 6 today.

So the first algorithm that we're going to learn for classification is called k nearest neighbors. We're going to introduce what this is and give you an idea of how it looks, and we'll talk about our very first classification performance metric; we'll learn more soon. And then we're going to learn about the Iris data set, which is a very popular data set used for classification algorithms, both in teaching and as a benchmark.

So let's dive right into what the algorithm is. For k nearest neighbors — that's what the K and the NN stand for — the way that you make predictions from the training set is relatively straightforward. You first select a number K, so this is another hyperparameter that you can choose. Then you input a point that you would like to predict on, x star. I'm not sure if I said it, but here we're again in the situation where we have a matrix of features X and some outputs that we'd like to predict, y. Here the outputs are going to be classes: in the specific examples we'll look at below in the pictures it's binary classification, and in the Iris data set it's multi-class classification. The features are just a matrix and can be categorical or continuous. And then for us, x star is a particular observation. So imagine that we have a new observation that we'd like to predict on — you input your x star.

Yeah, Matthew? Sorry to call on you — would you mind zooming in a little bit? Yeah, thank you. Okay.

So you input the point you're trying to predict, and then what the algorithm is going to do is try to find the K closest points to what you've input within the training set. So inside the training set, it's going to calculate the distance from this input to all the points in the training set.
Then it's going to find the K that are closest — remember, K is a hyperparameter that you choose ahead of time. It does this just by calculating the distances, which can take a long time if you have a very large training set. The categories of each of the nearest neighbors are then tabulated, meaning they're just counted up: you want to see how many are of class 0, how many of class 1, how many of class 2, etc. We can think of the number of neighbors of each class as votes for that class, and the category with the most votes is what is predicted for x star. Anytime there's a tie between 2 or more categories, the prediction is chosen randomly from the tied classes. So if you had a tie between zeros and ones, essentially you're just flipping a coin to decide whether it's 0 or 1.

So this is a lot of words; I think it's easier to understand what's going on with pictures. Imagine we're in a setting where we have 2 features that are both continuous variables, and we are setting K equal to 4. The black X is going to represent where we're trying to predict, the red circles are one class, and the green triangles are another class. So if your black X was here, its 4 closest neighbors are these red circles. The way that I drew this was with Google Slides, so it's not exact — not a straight line from the center of the X to the center of the red circle — so just imagine it is, for the purposes of understanding what's going on. These 4 points are the closest, and because all 4 of the neighbors are red circles, the algorithm would predict that the X would also have to be a red circle; that's what the algorithm would guess.

And again, remember, these are the training points. So let's now say that the thing we're trying to predict is placed here in the data space. Now it would count up: 3 of my 4 closest neighbors are green triangles, one of my 4 closest neighbors is a red circle, so 3 out of 4 is a majority, and the green triangle would be my prediction.

And then the final example case we'll look at is this situation where, of my 4 neighbors, I'm evenly split between red circles and green triangles. So this is a tie, and in this situation my algorithm would just randomly choose between a red circle and a green triangle. With unweighted voting everybody gets an equal vote, so there's no reason to prefer the red circle or the green triangle — it will just randomly choose.

So in this example we implicitly used what's known as Euclidean distance.
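For reference, here is a minimal sketch of that prediction procedure with unweighted votes and Euclidean distance, assuming NumPy arrays; the function and variable names are just illustrative, and note that np.argmax breaks ties by picking the lowest class label rather than choosing at random as described above.

```python
import numpy as np

def knn_predict(X_train, y_train, x_star, k=4):
    # Euclidean distance from x_star to every training point
    dists = np.sqrt(((X_train - x_star) ** 2).sum(axis=1))

    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]

    # tabulate the neighbors' classes (one unweighted vote each)
    votes = np.bincount(y_train[nearest])

    # the class with the most votes is the prediction
    return votes.argmax()
```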
If that sounds weird, think of the distance formula you learned in high school or junior high — I forget when you first learn it — where it's basically just the square root of the sum of the squares of the differences. So just your typical distance metric. But you could use any distance metric you like; you'll get slightly different results each time. That's another thing you can choose about the algorithm and see if it gives you better performance on your data set.

I also mentioned that in these examples we're using equally weighted votes — the way you'd think of casting a vote in, not every American state anymore, but a lot of American states, where everybody's vote counts the same. You could also weight the votes. A very standard way of weighting the votes is by the inverse of the distance, so points that are closer to where you're trying to predict have a bigger weight in the vote. So, for instance, here — it's kind of hard to tell from this, they do look similar — but if, for instance, the 2 green triangles were closer than the 2 red circles, it's possible that it wouldn't be a tie with a weighted vote.

Okay, so before we show you how to do this in sklearn, I'm gonna pause to see questions. Okay, so Pedro's question was: only numbers are counted, not distances? And Pedro said that I answered that earlier, so I just wanted to make it clear for those of you watching later that don't have access to the chat. Okay, are there any other questions about the theory, like the setup of the algorithm?

Okay. And then maybe I'll make one more note. Remember, we talked a lot about the supervised learning framework, where you assume y is equal to f(X) plus epsilon. That framework is still working in the background here. The difference with this approach, compared to other models we've learned so far, is that we don't have an explicit functional form that we're trying to estimate. We're taking what's known as a non-parametric approach — we're just not going to have a function we're trying to estimate — but there still is, in the background, this assumption that y is some function of X plus error.

Okay. So to see this in action with sklearn, we are going to use the Iris data set, which we actually talked about during our data collection lecture. If you want to look at it, this is it on the UC Irvine Machine Learning Repository. So an iris is a type of flower, and in this data set there are 3 types of irises: a setosa, a versicolor, and a virginica. Each observation has these 4 measurements — sepal length, sepal width, petal length, and petal width — and we're going to use these to try and predict the class of the iris.
So let's get back. We don't have to go to the machine learning archive and download it or anything — the data set is inside sklearn. So you're gonna say from sklearn.datasets import load_iris, then we're going to run load_iris, and this will just load it and we can take a look at what it looks like after. So, for instance, iris here is, I believe, a dictionary-like object. Here we have the data, which serves as the features, and then after that we have the target, which is 0, 1, and 2. The zeros are the setosas, the ones are the versicolors, and the twos are the virginicas, and then we have additional information about the data set.

So if we want to look at it, I turned this into a data frame just to make it easier to look at. Here are the first 5 rows, and I guess here's a sample — maybe let's stick with the sample. So here's a random sample: we've got sepal length, sepal width, petal length, petal width, and then iris class, which is an integer. We can always go back and remind ourselves — I think it's setosa, versicolor, and then virginica.

Okay, so we're gonna make our train test split, just to get some practice with doing stratified train test splits. We still run train_test_split in exactly the same way as before, but now we add this extra argument of stratify, where the thing I'm stratifying on is the class of my iris. This will make sure that the training set and the test set have relatively equal splits between zeros, ones, and twos. Okay, and then here are the first 5 observations of my training set.

So to get a sense of what this looks like, I've gone and made a plot. Here we're plotting sepal width against sepal length, and we've got our blue circles, which are the zeros, our orange or yellow triangles, which are the ones, and then our green X's, which are the twos. Okay? So this is a subset of the data space — sepal length against sepal width — that we're gonna be looking at to try and train our k nearest neighbors algorithm. So, are there any questions about the data?

Alright, so in sklearn you can build a k nearest neighbors classifier model with KNeighborsClassifier, and the documentation can be found here. You might be wondering: why is it KNeighborsClassifier? Well, just like you have the classifier, you can do sort of the same process and make regression models, where you'll take the average value of the observations that are your K closest neighbors. So that's the idea there — you have a classifier version, and then you also have a regression version.
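Before moving on — the loading and splitting steps just described look roughly like this. This is a sketch assuming pandas and sklearn; the column name iris_class follows the lecture, while the test size and random state are placeholders.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['iris_class'] = iris.target   # 0 = setosa, 1 = versicolor, 2 = virginica

# stratify on the class so train and test keep roughly equal class proportions
iris_train, iris_test = train_test_split(iris_df,
                                         test_size=0.2,
                                         shuffle=True,
                                         random_state=123,
                                         stratify=iris_df['iris_class'])
```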
So — I'm pretty sure, let's just say almost all — almost all of our classification algorithms have a regression counterpart. After we're done with the classification stuff, there is a notebook in the regression folder that goes through all the similarities and shows you how to do them with regression, so feel free to check that out if you'd like to see other regression models besides linear regression.

Okay. So this is stored in the neighbors module of sklearn — you can tell that by looking at the link, sklearn.neighbors. So from sklearn.neighbors we're going to import KNeighborsClassifier.

Then — we are so familiar with this, but maybe we need a refresher because we've been doing time series for a couple of days — remember from linear regression, the pattern is you make your model object. So KNeighborsClassifier, and then for the number of neighbors you input a positive integer; for this example we're going to choose K equals 5. Then we will fit the model on the training set, so iris_train. And even though I pictured just two features, we can go ahead and just put all 4 in. So I think it's — is it pedal or petal? P-E-T-A — okay, petal length, petal width. And then we put our y, iris_train.target — right, iris_class, not target — the iris_class column.

And then we'll do predict, so .predict, and we're just gonna do this on the training set. Okay. So here we can see what's going on: k nearest neighbors isn't actually fitting anything in the sense that we're not estimating any parameters. All we're doing is storing the training set as a part of the model object, and then when you call predict, that's actually where all the work gets done. When you call predict, you have to calculate all of those distances and then make the prediction based on the voting procedure.

So Zack's asking: what helps decide the recommended number of neighbors? Naively, I would have expected greater than 10; here I just chose 5. So it's a hyperparameter, and just like with every other hyperparameter, you do some sort of cross-validation or use a validation set, and you would set up a grid of values — for K it would just be a list. You could go from K equals 1 all the way up to K equals 50 or more, depending on the size of your data set, and then see which one gives you the best cross-validation metric.

So maybe this is a good lead-in: one validation metric that you'll use for classification problems is called accuracy.
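A sketch of the fit-and-predict pattern being typed out above, assuming the iris_train DataFrame from the split earlier; the feature column names follow sklearn's load_iris feature_names.

```python
from sklearn.neighbors import KNeighborsClassifier

features = ['sepal length (cm)', 'sepal width (cm)',
            'petal length (cm)', 'petal width (cm)']

knn = KNeighborsClassifier(n_neighbors=5)   # K = 5

# "fitting" just stores the training data inside the model object
knn.fit(iris_train[features], iris_train['iris_class'])

# the distance calculations and voting all happen at prediction time
y_train_pred = knn.predict(iris_train[features])
```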
So, accuracy measures the proportion of all the predictions that you made that are correct, and we could define this by hand. To do it by hand, we're going to define a function accuracy that takes in the true values along with your predicted values, and then to get the accuracy you want to see how many of those values you predicted correctly, divided by the total number of observations — the total number of predictions you made. And so then we could call accuracy on the true values — again, we're gonna use the training set, so iris_train.iris_class.values — and then the prediction. And I guess I probably don't need .values, but I'll just leave it there.

And so on the training set we got 0.983 repeating, so like 98 and a third percent accuracy. It's gonna look like this, but as a percentage — so that's the accuracy.

So that's one way. I also want to point out that you don't have to define your own accuracy, and we'll learn this in the next notebook, but sklearn, just like it has mean squared error, has a number of classification metrics that you can just use. The one for accuracy is from sklearn.metrics, and it's called accuracy_score. So we could redo this whole thing and just say accuracy_score — it works the same way, where you put in the true values followed by the predicted values. Oh no, what did I do — not score, accuracy_score — there we go. And I have a link for the documentation in the code here, so we could go here, and there you go: you put in the true followed by the predicted. Alright, and then you can learn more about the other arguments on your own time.

Okay. So before I show you this last bit, are there any questions about anything so far? I know k nearest neighbors is pretty straightforward for an algorithm, but I just want to make sure there's room for anybody who has questions to ask questions.

I had a question. So, if I understand the algorithm correctly, you basically pick a random point in your data set and then you look at things that are closest to it, and then if it's, for example, surrounded by red points, you would label that point as red, and if it's surrounded by green points, you would label that point as green — that's kind of the general idea?
Yeah — so, it's not a randomly selected point. In the visual description I had up here, these red and green points are the points in your training set. If we had only used these 2 features, it would look like this for us — but we used all 4 features. And the points that we're trying to predict on — they look random here because I'm just showing you an example of the different outcomes — but these points, like the black X's in our example down below, would be, for the Iris data, the observations we're trying to predict on. So you're going to input — you have in your head, or in your computer, a list of values that you'd like to get predictions for, and the black X's are those values that you want predicted.

Right, so I guess the point I was trying to get at is: the algorithm is assuming that you're surrounded by similar things — like, if you're red, you'd be surrounded by red things, if you're green, by green things?

Yeah. So the algorithm, when it makes a prediction, doesn't know what your label is — it's trying to guess that. And it does that by looking at your K closest neighbors in the training set and then counting up what class shows up the most in those neighbors.

Right. So would it work on data that's not clustered together?

Yeah, so if — let's say, for these 2 features — the red points and the green points were overlapping with one another, it wouldn't work very well, exactly, yeah. But then, you know, you'd have to do some sort of pre-processing if that situation happened. I don't think there are any classification algorithms that would work well on that kind of data, if the 2 classes are indistinguishable from each other in the data set.

Hmm, okay. Thanks.

Yeah.

So Kirtha is asking: seems like there would be a normal Gaussian-type distribution for K — it would have to have an upper limit beyond which accuracy would drop; is that true? So, once you get to — if K is just the cardinality of the training set, it's just going to be predicting the majority class every time, which is a baseline model. I don't know that, if you did cross-validation for the accuracy, it would follow some sort of Gaussian distribution; I don't think there's any theorem or theory that says it has to follow any particular distribution for every problem.

Any other questions?

Okay, so before we move on to the next notebook, I wanna talk about this feature called predict_proba. So this algorithm — when you call .predict, it just makes a prediction of the class.
This is not always advantageous. Sometimes, instead of making a hard prediction, you want to get a probability — a predicted probability of it being that class. So for almost all of the algorithms that we'll learn, you can take the model variable — for us it was knn — and call .predict_proba (proba here stands for probability), and then you input your data set. What gets returned is an array where each row corresponds to an observation, and each column corresponds to the algorithm's estimated probability that that observation is a member of that class.

So, for instance, in this zeroth row, the algorithm is predicting that this observation has a 0% probability of being class 0, a 1 — or 100% — probability of being class 1, and a 0% probability of being class 2. And you can scroll through the rest and see that a lot of these are being predicted as 100% one of the classes.

The way that this works for k nearest neighbors is that this probability is just the fraction of the neighbors that are of a class. So, for instance, what this is telling us is that all 5 of this observation's neighbors are of class 2. Here, 4 of this observation's neighbors are of class 2, and one of them is of class 1, and so forth — that's what's going on here. And here's another example that's slightly different: here 2 of the 5 are of class 2, and 3 of the 5 are of class 1. If you were doing weighted voting, it would be the fraction of the weights instead of just the fraction of the neighbors.

So sometimes you want to have probabilities instead of a hard cutoff, and we'll see some examples as to why we want that in the coming notebooks. Alright, any questions about the probability stuff?

Okay. So the next notebook we're gonna look at is notebook number 3 in classification — we looked at notebook number 2, now we're on notebook number 3. This notebook's called the confusion matrix. In the last notebook we talked about accuracy, but as you're gonna see, that's not always the one you wanna go with in terms of performance metrics. So in this notebook we're gonna introduce a number of different metrics, show you something called the confusion matrix — and really it's maybe more about how we can get confused with all these different metrics than about the algorithms getting confused — and then we'll give you a link to a useful summary table to help you keep all this straight.

So in the k nearest neighbors notebook we defined accuracy, which is the number of correct predictions you make divided by the total number of predictions you make. Sometimes this can be a misleading metric.
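For reference, the accuracy computation and the predict_proba call from notebook 2 look roughly like this, continuing with the knn model, features list, and iris_train assumed above (a sketch, not necessarily the notebook's exact code).

```python
import numpy as np
from sklearn.metrics import accuracy_score

def accuracy(y_true, y_pred):
    # fraction of predictions that match the true labels
    return np.sum(y_true == y_pred) / len(y_true)

accuracy(iris_train['iris_class'].values, y_train_pred)        # by hand
accuracy_score(iris_train['iris_class'].values, y_train_pred)  # sklearn: true values first, then predicted

# one row per observation, one column per class; for unweighted KNN each entry
# is just the fraction of the K neighbors belonging to that class
knn.predict_proba(iris_train[features])
```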
For instance — going back to why accuracy can be misleading — if you have a data set where the vast majority of your observations are of one class and very few of the observations are of the other class, you can misleadingly get what seems to be a really good algorithm with very silly models. Let's say, for instance, I had a data set where 10% of my data was of class 1 and 90% of my data was of class 0, and class 1 turned out to be an infectious disease — or any kind of disease that is deadly but is treatable if you detect it in time. We could get a very silly algorithm that has 90% accuracy if we just say: okay, no matter what you show me, classify it as a 0. Based on this class distribution, we know that this algorithm should be about 90% accurate. And if you were to tell somebody that — like, if somebody got a 90% on a test, they'd think that's really good. But here it's misleading, because we're 90% accurate but we haven't identified any of the ones. So if this was the sort of thing where we wanted to use it to try and detect this deadly disease that we could treat and cure if we knew about it, then this is a terrible model.

And so that's why we want to start developing some additional metrics for classification problems, to give us a sense of the ways in which our models are correct. With regression and time series we relied heavily on the mean squared error; here we have to be a little more careful about what metrics we use, because different metrics tell us different ways that our models are correct or incorrect.

So we're gonna work in the world of binary classification, where you have 2 classes that are, depending on the algorithm you're working with, 0 and 1 or negative 1 and 1. I say that because algorithms are developed by different academic fields: in fields like statistics and probability your 2 classes tend to be 0 and 1, but in fields like computer science the 2 classes tend to be negative 1 and 1. For the confusion matrix setup the labels don't really matter, but we're gonna keep it as 0 and 1.

So the confusion matrix — you set it up in the following way: the rows represent the actual classes of your observations, and the columns represent what your algorithm predicts. So you can go through the different entries. For things that are actually a 0 that your algorithm predicts as a 0, those are called the true negatives, or TNs. For things that are actually zeros that your algorithm predicts to be ones, those are false positives, because you're falsely predicting a positive case.
For things that are actually ones that are predicted to be zeros, those are called false negatives, because you're falsely predicting that it's not a one. And then finally, for things that are actually ones that you predict are ones, those are called true positives, because you're correctly predicting that they are, in fact, a positive case — it's true that your prediction of a positive is correct.

So what is actually contained in these entries? I just went over what the names mean, but when you do it, what's actually shown in there? These are the counts. Basically, for all the zeros, every observation that you correctly predict as a 0 gets counted in here. So let's say you had 30 total zeros and you predicted 20 of them correctly as 0 — this would have a 20 in it, and therefore this would have a 10 in it. So the entries of the confusion matrix count up the number of each type of classification that you just made.

Oh, and also, if you're working in a public health or stats-y field, these are sometimes referred to as contingency tables, as opposed to confusion matrices — so if you're familiar with contingency tables, it's the same concept. If you're familiar with frequentist statistics, you can think of false negatives as type 2 errors and false positives as type 1 errors; if you're not familiar with frequentist statistics, then you don't have to worry about that. It's just sort of trying to relate this to all the different backgrounds everybody has.

Okay. But before we dive into how to do this with sklearn, and then how to derive metrics from this, are there any questions about just the definition of the confusion matrix?

Okay. So, what are some metrics derived from the confusion matrix? The confusion matrix is great — it can give you a sense of how your algorithm is going wrong — but typically people will want to see metrics, a single number like accuracy, as opposed to trying to digest the entire matrix at the same time. So we're gonna show 6 different metrics that tend to be popular when looking at the performance of an algorithm. It gets really confusing, because a lot of these are more or less the same metric, or slightly different, and then they have different names, and so on — on top of the formula you have to remember the name. So to me that's where the confusion part comes in. I think it's called the confusion matrix because it's trying to measure what your algorithm is confusing, but for me it's just that it's always a lot of vocab to remember.
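To make the "counts" point concrete, here is a small sketch of how the four entries could be tallied by hand for a binary 0/1 problem; the toy arrays and names are purely illustrative, not the notebook's code.

```python
import numpy as np

# toy actual and predicted labels, just for illustration
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives  -> 3
fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives (type 1 errors) -> 1
fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives (type 2 errors) -> 1
tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives  -> 3
```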
Okay, so the first 2 that we're gonna talk about are called precision and recall. Using the entries of the confusion matrix, precision is the true positives divided by the true positives plus the false positives. You want to think of this as: out of all the things that you predicted to be class 1, what fraction were actually class 1? So we're adding along the class-1 column of the confusion matrix and getting the fraction that are correctly predicted. You can think of this as: how much should I trust my algorithm when it says something is class 1?

Another metric that gets looked at at the same time as precision is called the recall. Here the numerator is the same, but now the denominator is summing along the bottom row. So here you're saying: out of all of my actual positives, what fraction of them did I correctly predict? You can think of this as the probability the algorithm correctly detects a class 1 data point — sort of a conditional thing: given that the observation is actually a 1, what is the probability I predicted it to be a 1?

So we're gonna show you how you can calculate this using sklearn. We're going to use the Iris data set, but we're going to make a slight tweak to it, because remember, the Iris data set has 3 possible classes. We're going to turn this into a binary classification problem by saying I only want to predict virginicas. So instead of using the fact that I have 3 different classes, I'm gonna reduce everything to a virginica-or-not classifier: I'm gonna start off by labeling everything as a 0, and then locate the observations that are virginicas and replace that column with a 1.

Okay, so just to make it very clear, here we can see — no, I didn't include the target column; maybe it makes better sense to — well, there should be a target column. Yeah, I don't know why that is. Oh, it's because I said iris — okay, so if we do iris.target. So here's iris.target, and then we can compare it to y. And you can see that y is 0 everywhere the target is a 0 or a 1, because these are not the virginicas, and the y that we're interested in is a 1 everywhere the target is a 2, because virginica was coded as a 2. So I hope I didn't just make it more confusing than it needed to be.
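The relabeling step just described could look something like this — a sketch assuming the iris object from load_iris earlier; the variable name y is illustrative.

```python
import numpy as np

# start off with everything labeled 0, then mark the virginicas (target == 2) as class 1
y = np.zeros(len(iris.target), dtype=int)
y[iris.target == 2] = 1
```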
The basic idea is that these are binary classification metrics, and I just wanted to take a data set we were familiar with and turn it into a binary classification problem. It could have been either of the other 2 classes; I just chose virginica.

Okay. So now I make a train test split, and then — we saw this in the last notebook — I just fit and get the predictions on the training set for a k nearest neighbors classifier with K equals 5. The value of K doesn't really matter here, because I'm just trying to demonstrate how to calculate performance metrics; if I wanted to build a model, I would do cross-validation or something.

Okay. So the first thing we're going to do is show you a quick way to calculate the confusion matrix using sklearn. From sklearn.metrics we're going to import confusion_matrix, and then, just like with MSE and with accuracy_score, you call confusion_matrix, and the first thing you input is the actual values, y_train, followed by the predicted values, which I've stored in a variable y_train_pred. So this tells us that we have 77 true negatives, 38 true positives, 3 false positives, and 2 false negatives.

So we could then calculate the recall and the precision using the formulas. Remember, the recall is true positives divided by false negatives plus true positives — and I want a comment there — and the precision is true positives divided by false positives plus true positives. Oh, and I actually think I wanna move this down here — here we go. Okay, so we have a 95% training recall and a 92.68% training precision.

In a vacuum these are sort of meaningless to us in terms of knowing whether or not this is a good model; they become useful when we compare them to other models. Typically you want to have high precision and high recall, and you want to choose a metric and then choose the model that has the best value of the metric you're looking for.

Alternatively, instead of calculating them by hand like I did here, sklearn has functions called precision_score and recall_score that calculate the precision and recall for you. So you would do from sklearn.metrics import precision_score, recall_score, and then you just use the functions like we've seen before: precision_score, and then we want y_train, y_train_pred — but that actually should go down here, because this is where I wanted the precision.
And then this one should be the recall. Okay, and we can see that the formula we did by hand gives the same thing as the one from sklearn.

Okay, are there any questions about precision and recall before we move on to the next metrics?

Okay. So the next set of metrics are called the rates — the various rates. We have 4 of them, and it's basically all 4 entries of the confusion matrix, but as a rate: we have the true positive rate, false positive rate, true negative rate, and false negative rate. Basically, what you're getting is a series of conditional probability estimates.

I think here I have a slight typo in the table: where I say "true positive" here, I'm pretty sure I mean "actual" instead of "true," so let me change that, because it is slightly misleading. Let's say "actual" here. Okay.

So: given that an observation is actually positive, what is the probability that we correctly predict it as positive? That's the true positive rate — and note, this is the exact same thing as recall, same formula. Given that our observation is actually positive, what is the probability that we incorrectly predict it as a negative? That's the false negative rate. Then we have the other 2: given that an observation is an actual negative, what is the probability that we correctly predict it as a negative? That's the true negative rate. And given that an observation is actually negative, what is the probability that we incorrectly predict it as a positive? That's known as the false positive rate.

The formulas for these are given below: you take the numerator, and then you divide by the total of the row that numerator is found in. So true positives divided by all actual positives, false negatives divided by all actual positives, true negatives divided by all actual negatives, and so forth. Other than the true positive rate, which is the same as recall, these are ones you have to calculate by hand using the confusion_matrix function. So here I've calculated all of the rates for our training set.

Okay. And then the last 2 that we're gonna look at — just to introduce them to you — are sensitivity and specificity. These have a long history of use in the field of public health when it comes to understanding the performance of various screening and diagnostic tests. The sensitivity of a classifier is the probability that it correctly identifies a positive observation — so once again, this is the exact same thing as the true positive rate and recall. And then the other is specificity.
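Recapping the metric code described so far — a sketch that assumes y_train and y_train_pred from the binary virginica problem above; note that sklearn's confusion_matrix returns the entries in [[TN, FP], [FN, TP]] order for 0/1 labels.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

tn, fp, fn, tp = confusion_matrix(y_train, y_train_pred).ravel()

precision = tp / (tp + fp)   # of everything predicted positive, the fraction that really is positive
recall    = tp / (tp + fn)   # of the actual positives, the fraction we caught

# the sklearn equivalents take the true values first, then the predictions
precision_score(y_train, y_train_pred)
recall_score(y_train, y_train_pred)

# the four rates, computed from the same entries
tpr = tp / (tp + fn)   # true positive rate = recall
fnr = fn / (tp + fn)   # false negative rate
tnr = tn / (tn + fp)   # true negative rate
fpr = fp / (tn + fp)   # false positive rate
```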
Specificity is the probability that your classifier correctly identifies a negative observation, so this is the same as the true negative rate. Sensitivity is TP over TP plus FN; specificity is TN over TN plus FP. Sensitivity you can calculate either by hand using confusion_matrix or with recall_score, but specificity — to my knowledge (they may have added one since), there is no function like a specificity score in sklearn — you'd have to compute by hand.

So this is a lot — I'm very aware that it's a lot. You know, I usually don't remember any of these formulas other than that precision is the column and recall is the row, so I always have to look them up to remind myself before I try to use them. So to make it easier for you guys, I've created a confusion matrix cheat sheet, which should be in the repository; if it's not, you can let me know, and I'll make sure to upload it later tonight. It has the picture of the confusion matrix, and then the different metrics with their names and formulas on the right-hand side. There are some on here that we did not formally introduce — we didn't talk about the total error rate, and we didn't talk about the type 1 error rate or type 2 error rate explicitly — but you can look at them here and see the formulas.

This is just useful to have: even if you know a lot about data science, it's really easy to forget what these formulas are, and if you're going into a job interview or something, it's useful to use this to brush up on what they are, because you may be asked, "so, what's the difference between precision and recall?" And if you can't remember, it's just good to know it, and then you can reason out what the difference is from looking at the formula. I don't think any job interview is gonna penalize you if you draw the confusion matrix to reason out what it is. So this is there so you don't have to open the Jupyter notebook every time; you can just refer back to the cheat sheet.

In the real world this is really important. A lot of the problems we've been working on aren't real-world problems that have an impact on anybody, because we're just trying to learn, but in the real world you want to give the metrics that you use really careful consideration. The metric you use is going to help determine which model is seen as best, so you can have models that have really good accuracy.
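A quick sketch of that last point: sensitivity can come straight from recall_score, while specificity has to be assembled from the confusion matrix entries (again assuming the y_train and y_train_pred variables from above).

```python
from sklearn.metrics import confusion_matrix, recall_score

sensitivity = recall_score(y_train, y_train_pred)             # same as the true positive rate / recall

tn, fp, fn, tp = confusion_matrix(y_train, y_train_pred).ravel()
specificity = tn / (tn + fp)                                  # true negative rate, computed by hand
```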
If what your problem requires is precision or recall, though, having the best accuracy does not necessarily give you the best precision or the best recall. As an example, public health will often focus on sensitivity and specificity, because they can be translated into real-world health impacts. Let's say, in the case of a deadly disease for which we have successful treatment regimens, we may want to have high sensitivity. We may opt for high specificity if the disease or condition in question does not tend to cause severe outcomes and the test or treatment is highly invasive — so maybe this condition or disease just causes a mild inconvenience, like a common cold tends to do, but the test we would have to give to check whether or not a person has that disease or condition is very invasive; then we may opt for one that has high specificity.

I guess the takeaway here is not necessarily to remember the public health stuff, but to consider what the impact of the different types of incorrect and correct predictions is for your particular model, and then choose the metric that will help either maximize that impact if it's positive, or minimize it if it's negative, when choosing your classification algorithms.

Another thing I'll stress is that a lot of times in business settings people like to have these metrics because you can readily interpret them in terms of outcomes. There are other metrics that people will look at, which we'll talk about maybe a little bit later in notebook number 5. I personally appreciate these ones because I can interpret what they mean in terms of the actual real-world problem. There are other metrics that try to take these and combine them all into one super-metric that gives you some kind of average; those are hard to interpret in terms of the impact on the real-world problem. I think a lot of people want to know the impact in terms of the real world, so I would encourage you to first appeal to the ones where you can disentangle what they mean for the actual impacts of your problem, instead of all-encompassing metrics that just say "well, this model is the best because this number is lowest" when you can't really interpret that number in terms of the real problem.

Okay. So before moving on to the next notebook, are there any questions about these metrics?

Okay, so now we're gonna learn a different algorithm called logistic regression. This is probably familiar to a large number of you, because it's very commonly taught in statistics courses. So, as the thing that we'll start off by talking about:
this is technically a form of statistical regression — that's where the "regression" comes from. It's part of a framework called generalized linear models. So it is a regression statistical model, but it's not used to solve regression supervised learning problems, because the outcome that you're trying to predict or understand is not a numeric outcome but a binary outcome. So: it is a regression statistical algorithm, and it is used for classification problems in supervised learning.

This can cause a lot of friction between statisticians and machine learning people. I'm sure, if you've been trying to learn data science, you've seen those massive cheat sheets — I remember, growing up in high school, the kids that would go "oh, I get a cheat sheet" and then try to cram literally everything from the course onto one sheet of paper; you've probably seen those sorts of things online about data science — and they say logistic regression is a classification algorithm. This causes a lot of friction with stats people. So just be aware: it's technically a statistical regression algorithm, but it gets used in classification problems. I don't go hard either way; just be understanding of what it actually is.

So, based on the fact that it is a statistical regression model, we need to be regressing onto some sort of continuous measure — the thing we're trying to understand in a regression problem needs to be a continuous measure of some kind. So how do we get that from a binary 0/1 problem? Well, let's look at an example. To make it really nice for us, I just have this random data that gives a binary problem; I'm going to make a train test split, and then we'll look at the data.

So here the data has a single feature, and I've plotted the class of the observations on the vertical axis. It looks like things that are closer to 0 on the feature tend to be classified as a 0, and then, as you start to get larger, most of your observations tend to be of class 1. And so, while this vertical axis says "class" as the label, we could have very easily said "probability that the observation is a one." Why could we say that? We know the labels for all of these observations: we know that anything down here is a class 0 observation, and we know that everything up here is a class 1 observation. So the probability that any one of these points up here is class 1 — well, that's equal to one. The probability up here is one, and down here the probability that any of these observations is class 1 is 0, because we know for a fact that they are not class 1.
So that's the idea here: we're replacing the 0/1 label as a class and thinking of it instead as the probability that it is class 1. That's the continuous measure that we're going to regress on.

In the linear regression setting, the functional form that we assumed was a linear combination of terms. Here, for logistic regression, we're going to use a sigmoidal curve, which in general, for a single variable, looks like 1 over (1 plus e to the negative x). And here's just what this is — this is plotting that function; this little x is not our data, this is just plotting the function 1/(1 + e^(-x)). Okay? And you can see how this shape of function fits how we might want to model this probability — you can see how the sigmoid curve fits the shape of our data very nicely, and why we might be inclined to use it.

So for us, the function that we're gonna try to estimate is going to be little p of X — and I think I forgot to say this, but little p of X is going to be the probability that y is equal to 1 given the features that we've observed; it's the probability that y equals 1 conditional on the features you're observing. Okay, so this little p of X — we're gonna assume that it follows a function of the form 1 over (1 plus e to the negative X times beta). Beta, just like in linear regression, is a column vector of coefficients, and X is a matrix of features where I've included a column of ones at the front.

In general, this model is fit using the statistical method of maximum likelihood estimation. For a derivation of how that works, you can check out the practice problems; we're not gonna go over it here, because we're gonna use sklearn's logistic regression model, and it's a little more complicated to fit this than it was for the linear regression stuff, at least to write it out and go over it.

Okay. So sklearn has LogisticRegression as a model object, as we might assume. We're gonna say from sklearn.linear_model — that's where it's stored, in linear_model — import LogisticRegression. Okay, and now we're gonna go to the model object's documentation. I wanna point something out: if you look here, there's this argument penalty='l2'. If you remember back to when we learned linear regression last week — or it might have actually just been Monday this week — l2 was the ridge regression norm. So what this is saying is that by default, sklearn uses the ridge regression version of logistic regression.
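For reference, the sigmoid curve being plotted a moment ago looks like this — a quick sketch assuming NumPy and matplotlib, not necessarily the notebook's exact code.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 200)
sigmoid = 1 / (1 + np.exp(-x))   # the general single-variable sigmoid, 1 / (1 + e^(-x))

plt.plot(x, sigmoid)
plt.xlabel('x')
plt.ylabel('1 / (1 + exp(-x))')
plt.show()
```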
00:54:33.000 --> 00:54:43.000 And so, because I just want to demonstrate regular logistic regression, we're going to set the penalty equal to None. 00:54:43.000 --> 00:54:44.000 And so this is going to go back to regular, old-school 00:54:44.000 --> 00:54:51.000 logistic regression. Just be aware that by default 00:54:51.000 --> 00:54:59.000 scikit-learn implements regularized logistic regression. 00:54:59.000 --> 00:55:08.000 Okay. So when I define my model object, I'm going to call LogisticRegression with the input penalty=None. 00:55:08.000 --> 00:55:16.000 And if you haven't seen None before, this is a special Python object. 00:55:16.000 --> 00:55:20.000 It literally just means nothing, like there is no argument. 00:55:20.000 --> 00:55:21.000 It's different from the string of the word 00:55:21.000 --> 00:55:29.000 "none". Okay? So then I'm going to fit my model on X_train — 00:55:29.000 --> 00:55:33.000 and this is a one-dimensional vector or array, 00:55:33.000 --> 00:55:40.000 so I have to use reshape — followed by y_train. 00:55:40.000 --> 00:55:45.000 "LogisticRegression object is not callable." 00:55:45.000 --> 00:55:52.000 Let's just copy and paste. 00:55:52.000 --> 00:55:57.000 Oh, I forgot — that's it, that's why. Got it. 00:55:57.000 --> 00:56:05.000 Here we go. And so, just like with K nearest neighbors, we can call .predict, 00:56:05.000 --> 00:56:08.000 and we input 00:56:08.000 --> 00:56:11.000 our features, so you can see now we've got zeros and ones. 00:56:11.000 --> 00:56:14.000 But remember, I said we're trying to model a probability. 00:56:14.000 --> 00:56:18.000 So what are these zeros and ones? 00:56:18.000 --> 00:56:24.000 How do I get the probability? Well, just like with the predict_proba that we talked about for K nearest neighbors. 00:56:24.000 --> 00:56:37.000 So we take log_reg, and instead of predict we do predict_proba — I don't really know how you're supposed to say it — and then we input our features. 00:56:37.000 --> 00:56:56.000 And now you can see we have 2 columns. The 0 column here is the probability that the observation is of class 0, and the 1 column here is the probability that it's of class 1. 00:56:56.000 --> 00:57:02.000 And so we can use this predict_proba to plot the fitted logistic regression model. 00:57:02.000 --> 00:57:03.000 And so we have our training data as blue circles, and our red dotted line is the fit of the model. 00:57:03.000 --> 00:57:18.000 So this represents the probability that y equals 1 given x, as fit by the logistic regression model on the training data. 00:57:18.000 --> 00:57:28.000 Okay. 00:57:28.000 --> 00:57:39.000 So, Keira Thon, the reason that you're — so, "I get this error when I run .fit: ValueError: logistic regression supports only penalties l1, l2, 00:57:39.000 --> 00:57:43.000 elasticnet, and none" — lowercase, as a string. 00:57:43.000 --> 00:57:48.000 So the reason that you get this is that you have an earlier version of scikit- 00:57:48.000 --> 00:57:52.000 learn. In the earlier versions of scikit-learn, they used the string "none". 00:57:52.000 --> 00:58:08.000 They later updated that to be the Python object None, with a capital N and not as a string. So if you're getting that error, you have to actually use the string "none".
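[Editor's sketch] For reference, here is a minimal sketch of the fitting step described above. The synthetic data and the variable names X_train and y_train are illustrative assumptions; the notebook's actual data generation is not reproduced here.

# Minimal sketch of fitting unregularized logistic regression in scikit-learn.
# X_train / y_train are assumed: a single continuous feature and 0/1 labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(216)
X_train = rng.uniform(0, 1, 200)                    # single feature
y_train = 1 * (rng.uniform(0, 1, 200) < X_train)    # larger x -> more likely class 1

# penalty=None turns off the default L2 (ridge) regularization;
# on older scikit-learn versions use the string 'none' instead.
log_reg = LogisticRegression(penalty=None)
log_reg.fit(X_train.reshape(-1, 1), y_train)

print(log_reg.predict(X_train.reshape(-1, 1))[:5])        # hard 0/1 labels
print(log_reg.predict_proba(X_train.reshape(-1, 1))[:5])  # column 0 = P(y=0), column 1 = P(y=1)

The fitted curve itself is the sigmoid p(x) = 1 / (1 + e^(-(b0 + b1*x))) with the estimated coefficients, which is what predict_proba is evaluating.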
00:58:08.000 --> 00:58:23.000 And so this is just a good place to remind everybody: a lot of times, if you try to run the code that I run and it doesn't work, first check for a typo, and then, if you can't see a typo, it's probably because the version of the package that I'm using is different 00:58:23.000 --> 00:58:24.000 from the version of the package that you're using. 00:58:24.000 --> 00:58:34.000 So check what version you have, and then see if you can find that version's documentation. 00:58:34.000 --> 00:58:45.000 Yahweh is asking: is it worth understanding the overlapping region between the two classes? So, 00:58:45.000 --> 00:58:58.000 it's possible that if this were a real-world data set, you might be interested in whether there's something about this region of the feature that explains why we see this overlap. 00:58:58.000 --> 00:59:02.000 But this is synthetic data; it's not real. 00:59:02.000 --> 00:59:07.000 It's just randomly generated. The overlap is there because I didn't want a very easy problem where you just put a dividing line at, like, 0.6 or something. 00:59:07.000 --> 00:59:21.000 So that's why. I would say that the amount of overlap probably determines the steepness — 00:59:21.000 --> 00:59:24.000 is that the right word? — the steepness of the curve. 00:59:24.000 --> 00:59:34.000 If there were very little overlap, there would probably be a much steeper transition, and if there were more overlap it would be less steep. 00:59:34.000 --> 00:59:42.000 So that's what I would say: in a real-world problem you might be interested in understanding 00:59:42.000 --> 00:59:49.000 whether there's something about the observations and the feature — some actual phenomenon going on — 00:59:49.000 --> 00:59:53.000 that explains why there's overlap here. 00:59:53.000 --> 01:00:03.000 Brantley's asking: it seems strange that logistic regression is regularized by default in scikit-learn — do you know why that is? I don't know why that is; I kind of thought it was strange too. And it's also this weird thing: 01:00:03.000 --> 01:00:18.000 this is probably my sixth or seventh time teaching this notebook, and it wasn't until about my third time that somebody asked — the curve used to look really weird compared to the data, because it was regularized. 01:00:18.000 --> 01:00:22.000 Somebody finally asked, and I looked at the documentation, and 01:00:22.000 --> 01:00:25.000 it was imposing that regularization. And so that was why. 01:00:25.000 --> 01:00:27.000 I don't think a lot of people know that; it doesn't seem natural to me that you would impose the regularization from the get- 01:00:27.000 --> 01:00:39.000 go, so I don't know why. I'm sure that they had some reason. 01:00:39.000 --> 01:00:52.000 So it's also a good message that you should read the documentation of the functions that you're trying to use, because they might not be working in the way that you're assuming they're working. 01:00:52.000 --> 01:01:00.000 Okay, so earlier in this notebook, we talked about how .predict automatically gives you zeros and ones. 01:01:00.000 --> 01:01:03.000 And you're like, well, what the heck's going on here?
01:01:03.000 --> 01:01:19.000 I thought this was supposed to give me probabilities. What's happening is, it goes through all of the observations and checks which of the 2 columns has a probability greater than a half, and then, whatever column that is, it assigns the corresponding class. So 01:01:19.000 --> 01:01:23.000 if column 1 has probability greater than a half, it assigns a 1; 01:01:23.000 --> 01:01:31.000 if column 0 has probability greater than a half, it assigns a 0. And so that's what's going on, which means 01:01:31.000 --> 01:01:37.000 you can set different probability cutoffs yourself. So one thing you might end up doing — 01:01:37.000 --> 01:01:57.000 almost all of the scikit-learn classification algorithms set this cutoff of 0.5 as the default, or, if it's multi-class, whichever class gets the majority — is to play around with setting different probability cutoffs. Maybe I'm going to be a little bit 01:01:57.000 --> 01:02:16.000 stingier and say I only want to call observations a 1 if they have a probability of, like, 0.7 — you really want to be sure those observations are a 1 before you say the classification is a 1. So you can set your own cutoffs, and then that will alter 01:02:16.000 --> 01:02:20.000 things like the accuracy or the precision or the recall. 01:02:20.000 --> 01:02:28.000 And this cutoff becomes a different type of thing that you can tune with 01:02:28.000 --> 01:02:32.000 cross-validation. So, for instance, we could set a different cutoff. 01:02:32.000 --> 01:02:38.000 Maybe we want to be a little bit stingier about what we call a 1, and we set it to be 0.63. 01:02:38.000 --> 01:02:45.000 And so then we can say: all right, give me all of the observations where my predicted probability — 01:02:45.000 --> 01:02:50.000 so I'm going to first store all my predicted probabilities in an array, so I can keep using them, 01:02:50.000 --> 01:02:51.000 calling predict_proba on X_train.reshape(-1, 1). 01:02:51.000 --> 01:03:04.000 And I'm just going to take the 1 column, the column that corresponds to class 1. Then, to get my predictions, 01:03:04.000 --> 01:03:14.000 I'm going to say 1 times (the probability greater than or equal to the cutoff). 01:03:14.000 --> 01:03:32.000 What this does is produce an array of Trues and Falses, and then multiplying it by 1 changes the array to zeros and ones, and then I calculate the accuracy. So if I have a cutoff of 0.63, my training accuracy 01:03:32.000 --> 01:03:34.000 is now 92.75%. And we can play around with this and just see, okay, what if I make it 0.43? 01:03:34.000 --> 01:03:42.000 And you can see that at 0.43 01:03:42.000 --> 01:03:47.000 I have a slightly lower accuracy. So let's go back to 0.63. 01:03:47.000 --> 01:03:54.000 And so what you can do — I'm doing it with the training set, but in practice you would use a validation set or cross-validation — 01:03:54.000 --> 01:04:01.000 is change the cutoff to different values, and then see how that impacts whatever metric you're using. 01:04:01.000 --> 01:04:07.000 So here, for instance, is the training accuracy — again, I'm just using the training set because it's easiest; in practice
01:04:07.000 --> 01:04:15.000 you want to use cross-validation or a validation set — and you can see how the cutoff impacts the accuracy. 01:04:15.000 --> 01:04:23.000 And if this were a cross-validation or a validation set, you would maybe want to choose the cutoff that has the highest accuracy, 01:04:23.000 --> 01:04:27.000 if that is the metric you've decided to go with. 01:04:27.000 --> 01:04:40.000 Okay, are there any questions about the probability cutoff? 01:04:40.000 --> 01:04:46.000 Okay, so we're going to quickly go through how to interpret logistic regression. 01:04:46.000 --> 01:04:50.000 Just like you can interpret linear regression by looking at the coefficients, 01:04:50.000 --> 01:04:56.000 you can also somewhat interpret logistic regression by again looking at the coefficients. 01:04:56.000 --> 01:05:05.000 So if you do some rearranging and a little bit of algebra, you can find that 01:05:05.000 --> 01:05:06.000 log(p(X) / (1 - p(X))) = Xβ. 01:05:06.000 --> 01:05:15.000 The expression p(X) / (1 - p(X)) 01:05:15.000 --> 01:05:25.000 is the odds of the event y = 1: the probability of an event divided by the probability of the event not happening is known as the odds, and for us the event is y = 1. 01:05:25.000 --> 01:05:40.000 I guess it would technically be the conditional odds, conditional on the features. So the statistical model for logistic regression is a linear model of the log odds of being class 1. 01:05:40.000 --> 01:05:44.000 And so this allows you to interpret the coefficients of the model. 01:05:44.000 --> 01:06:02.000 So if you look at the model we just fit, we have that the log odds equal β0 + β1·x, or rather the odds given x equal some constant C times e^(β1·x), where x is our feature and C is some constant that we're ultimately 01:06:02.000 --> 01:06:09.000 not going to care about. This allows you to interpret what a one-unit increase in the feature does to the odds of being class 1. 01:06:09.000 --> 01:06:19.000 And if you go through this, you can see that for every one-unit increase in your feature, 01:06:19.000 --> 01:06:24.000 your odds get multiplied 01:06:24.000 --> 01:06:28.000 by e^(β1). 01:06:28.000 --> 01:06:32.000 Okay. And so we're going to go through and show you how to do this. 01:06:32.000 --> 01:06:36.000 And also, just another note about the penalty: 01:06:36.000 --> 01:06:37.000 this only works if your penalty is equal to None, 01:06:37.000 --> 01:06:46.000 so you have to be doing plain logistic regression, not regularized logistic regression. 01:06:46.000 --> 01:06:52.000 So here's our coefficient — you access it again with .coef_. 01:06:52.000 --> 01:06:59.000 Our coefficient is 23.12, and so then we can interpret the coefficient 01:06:59.000 --> 01:07:09.000 just like I said. Instead of a one-unit increase — one unit is the whole span of our feature — 01:07:09.000 --> 01:07:11.000 we're going to go off a 0.1-unit increase. 01:07:11.000 --> 01:07:16.000 So for a 0.1-unit increase in our feature, 01:07:16.000 --> 01:07:22.000 our odds are multiplied by a factor of about 10.1. 01:07:22.000 --> 01:07:26.000 So finally, we have some assumptions for the algorithm. 01:07:26.000 --> 01:07:32.000 We didn't mention any of these earlier because the data was generated to follow these assumptions.
01:07:32.000 --> 01:07:36.000 The first assumption is that your samples need to be independent. 01:07:36.000 --> 01:07:39.000 If you use multiple predictors, you don't want them to be correlated. 01:07:39.000 --> 01:07:50.000 Similarly to linear regression, the assumption is that your log odds are a linear function of your data. 01:07:50.000 --> 01:07:53.000 And then, finally, you typically want to have a larger data set. 01:07:53.000 --> 01:08:14.000 If you have a really small data set — and again, this depends on the number of features you're including in your model, that sort of thing — logistic regression won't be the best model for your data. 01:08:14.000 --> 01:08:27.000 So Yahweh is asking whether odds here means odds in the everyday sense. With the proliferation of sports gambling, you've probably heard odds in some sort of 01:08:27.000 --> 01:08:34.000 advertisement recently, so odds here works essentially the way you probably assume odds work in the real world. 01:08:34.000 --> 01:08:35.000 The odds are the probability that something will happen 01:08:35.000 --> 01:08:50.000 divided by the probability that it will not happen. It's supposed to give you some sense of how likely something is to happen compared to not happening. So this expression right here, 01:08:50.000 --> 01:08:56.000 the p divided by (1 - p) — those are odds. 01:08:56.000 --> 01:09:13.000 So if something has 2-to-1 odds, it's 2 times more likely to happen than not happen. 01:09:13.000 --> 01:09:15.000 Okay. 01:09:15.000 --> 01:09:20.000 So we'll see how far we get through this. I expect that we'll be able to finish the diagnostic curves. 01:09:20.000 --> 01:09:30.000 I don't know that we'll be able to start notebook number 6 today, but I'm pretty happy with our progress, given the number of lectures we have left. 01:09:30.000 --> 01:09:34.000 So we're going to use this data set that we just used for logistic regression. 01:09:34.000 --> 01:09:40.000 I'm just going through, refitting everything and reminding ourselves of what we literally just looked at. 01:09:40.000 --> 01:09:44.000 And let's double-check — okay, good, penalty was None. 01:09:44.000 --> 01:09:57.000 So just like we had different metrics, those metrics also allow us to build different curves that we can use to diagnose our model and compare it to other models. 01:09:57.000 --> 01:10:06.000 So remember our confusion matrix — and we literally just talked about this, which is really nice about being able to do this notebook today: 01:10:06.000 --> 01:10:11.000 we talked about how different probability cutoffs lead to different predictions. 01:10:11.000 --> 01:10:12.000 Right. And so basically every probability cutoff we could choose 01:10:12.000 --> 01:10:28.000 is going to give us a different confusion matrix, right? So, for instance — again, this is all on the training data, just for the simplicity of 01:10:28.000 --> 01:10:42.000 the lecture — if we chose a probability cutoff of 0.4, our confusion matrix would look like this, versus a probability cutoff of 0.6 gives us a confusion matrix that looks like this.
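[Editor's sketch] As a hedged illustration of that last point, here is a minimal sketch of comparing confusion matrices at two cutoffs, reusing the log_reg, X_train, and y_train assumed in the earlier sketch (not the notebook's actual code).

# Different probability cutoffs give different hard predictions,
# and therefore different confusion matrices (training data only, for simplicity).
from sklearn.metrics import confusion_matrix

y_prob = log_reg.predict_proba(X_train.reshape(-1, 1))[:, 1]  # P(y = 1) for each observation

for cutoff in (0.4, 0.6):
    y_pred = 1 * (y_prob >= cutoff)      # True/False -> 1/0
    print(f"cutoff = {cutoff}")
    print(confusion_matrix(y_train, y_pred))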
01:10:42.000 --> 01:10:56.000 And so basically, this means there's a wide range of possible precisions, possible recalls, possible specificities, possible sensitivities that we could get just by choosing different probability cutoffs. 01:10:56.000 --> 01:11:02.000 So this allows us to develop a series of curves that we can use to look at differences and compare models, 01:11:02.000 --> 01:11:07.000 and give us a feel for: okay, if we chose this cutoff versus that cutoff, what precision and recall are possible? 01:11:07.000 --> 01:11:17.000 So the first type of curve that you may have heard of before is called the precision- 01:11:17.000 --> 01:11:21.000 recall curve. So the precision-recall curve 01:11:21.000 --> 01:11:28.000 is going to plot the precision on the vertical axis and the recall on the horizontal axis. 01:11:28.000 --> 01:11:33.000 If you're doing this on your own, you could try it as an exercise; 01:11:33.000 --> 01:11:42.000 I'm just going to do it for you. So we're going to import our precision_score and our recall_score. 01:11:42.000 --> 01:11:52.000 And then what I'm going to do is go through an array of cutoffs and get the precision and the recall for each of those cutoffs. 01:11:52.000 --> 01:11:53.000 So I've got the logistic — or did I store this? 01:11:53.000 --> 01:12:05.000 I did not, so let's go ahead; I'm going to copy this and add an extra line here. 01:12:05.000 --> 01:12:14.000 So my probabilities I'm going to store in this array, and then for my cutoffs I'm going to do 1 times 01:12:14.000 --> 01:12:24.000 (y_prob greater than or equal to the cutoff). So each time through the loop I'm looping through this array of possible probability cutoffs, 01:12:24.000 --> 01:12:30.000 then I'm finding what the predictions would be if I used that cutoff, and now I'm going to track the resulting precision and recall from that. 01:12:30.000 --> 01:12:40.000 So precision_scores.append(precision_score(...)), and then I need my y — 01:12:40.000 --> 01:12:50.000 again, this is on the training data, just for the simplicity of the lecture — and then the predictions. 01:12:50.000 --> 01:13:03.000 And then my recalls: rec_scores.append(recall_score(y_train, ...)) with the predicted values — 01:13:03.000 --> 01:13:08.000 not scores, just score. 01:13:08.000 --> 01:13:12.000 Okay. And so here you can see what a precision-recall 01:13:12.000 --> 01:13:19.000 curve looks like. It's supposed to give you a sense of what precision and recall combinations are available. 01:13:19.000 --> 01:13:32.000 And there is a trade-off between precision and recall: typically, if you try to raise your precision, in general that will lead to a lowering of your recall, and vice versa. 01:13:32.000 --> 01:13:38.000 So it's not always possible to get a perfect precision and a perfect recall; by raising one you tend to lower the other, and vice versa. 01:13:38.000 --> 01:13:42.000 And so the perfect classifier would have a precision- 01:13:42.000 --> 01:13:50.000 recall curve that hugs the upper right-hand corner of the plot. 01:13:50.000 --> 01:13:56.000 A perfect classifier would be one where you can get 100% precision and 100% recall. 01:13:56.000 --> 01:14:02.000 So any time you predict positive, it is actually positive, 01:14:02.000 --> 01:14:04.000 and you're able to capture all of the actual positives. 01:14:04.000 --> 01:14:10.000 That's the idea here. Okay?
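[Editor's sketch] Here is a minimal sketch of the precision-recall loop just described, reusing y_prob and y_train from the earlier sketches; the cutoff grid is an illustrative assumption, not the notebook's exact values.

# Sweep probability cutoffs, record precision and recall at each one,
# then plot recall (horizontal) against precision (vertical). Training data only, as in the lecture.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score

cutoffs = np.arange(0.01, 1.0, 0.01)
precision_scores, rec_scores = [], []

for cutoff in cutoffs:
    y_pred = 1 * (y_prob >= cutoff)
    precision_scores.append(precision_score(y_train, y_pred, zero_division=0))
    rec_scores.append(recall_score(y_train, y_pred))

plt.plot(rec_scores, precision_scores)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()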
And so basically, the idea is you're going to want to have some sort of trade-off in mind — like, okay, I would rather have a higher precision. 01:14:10.000 --> 01:14:23.000 But there may be some sort of business implication of choosing one value versus the other. 01:14:23.000 --> 01:14:38.000 It really depends on the problem. From the project you're working on, you may be able to say: okay, if I have a recall or precision of this, it has this implication for the real-world problem I'm solving. 01:14:38.000 --> 01:14:48.000 And so you might have some sort of limit: okay, I don't want to go below this recall, or I don't want to go below this precision, because of X, Y, and Z 01:14:48.000 --> 01:15:08.000 in the business problem you're looking at. So these sorts of curves allow you to see the possible precision and recall scores for any given classifier you're looking at, and then you can plot multiple curves for multiple different models to see if one model could give you a 01:15:08.000 --> 01:15:12.000 better recall for an equivalent precision, or something like that. 01:15:12.000 --> 01:15:27.000 Okay, so are there any questions about the precision-recall curve? 01:15:27.000 --> 01:15:33.000 Right. Another curve that tends to get looked at, and that you may have heard of before 01:15:33.000 --> 01:15:45.000 if you tried to learn classification before the boot camp, is called the receiver operating characteristic, or ROC, curve. These curves arose in World War II 01:15:45.000 --> 01:15:49.000 as a way to aid operators of radar receivers in detecting enemy objects on the battlefield. 01:15:49.000 --> 01:15:52.000 So I think that's where the ROC name comes from. 01:15:52.000 --> 01:15:54.000 I've also been told by some friends of mine that ROC actually didn't have a meaning for a long time. 01:15:54.000 --> 01:16:00.000 I'm not sure which story is true. 01:16:00.000 --> 01:16:05.000 What this does is plot the true positive rates against the false positive rates for various cutoff values. 01:16:05.000 --> 01:16:13.000 So here's just a reminder of what these are. 01:16:13.000 --> 01:16:18.000 So the true 01:16:18.000 --> 01:16:21.000 positive rate is TP / (TP + FN). 01:16:21.000 --> 01:16:30.000 It estimates the probability that you predict a 1 given that it's actually a 1. 01:16:30.000 --> 01:16:37.000 And then the false positive rate estimates the probability that you predict a 1 given that it's actually a 0. 01:16:37.000 --> 01:16:41.000 So let's see what I wrote before — and I think this makes sense. 01:16:41.000 --> 01:16:47.000 Okay, so one way to think of these metrics: imagine you're in oncology. 01:16:47.000 --> 01:16:59.000 Sometimes, if you're somebody who has a tumor — a collection of potentially cancerous cells — you'll have surgery to remove that tumor. And so the goal of this surgery, right, 01:16:59.000 --> 01:17:10.000 is to maximize the number, or proportion, of cancer cells that are removed and minimize the number of normal cells that are removed. 01:17:10.000 --> 01:17:15.000 And so we can think of the removal of cancerous cells as a true positive, 01:17:15.000 --> 01:17:16.000 and the removal of normal cells as a false positive.
01:17:16.000 --> 01:17:40.000 And so your goal with your classifier, or in this surgery, is to remove as much of the cancer as you can — predict as many actual 1s as you can — while limiting the number of normal cells accidentally classified as 1s in this oncology example. So the idea is that we typically want 01:17:40.000 --> 01:17:43.000 to maximize our TPR while minimizing our FPR. 01:17:43.000 --> 01:17:44.000 Once again, it turns out that it's not always possible to increase one without decreasing the other. 01:17:44.000 --> 01:17:56.000 So there are 2 ways to do this. You can just do a for loop like we did before. 01:17:56.000 --> 01:18:03.000 So you would calculate the confusion matrix — and I'm going to get rid of that 01:18:03.000 --> 01:18:14.000 and put in my probabilities from before — you compute the confusion matrix with the actual values and the predicted values. 01:18:14.000 --> 01:18:17.000 And then, just for some simplicity of the formulas 01:18:17.000 --> 01:18:21.000 I'm doing, I'm extracting the different values here. 01:18:21.000 --> 01:18:25.000 So the confusion matrix at (0, 1) gives the false positives, the confusion matrix at (1, 0) gives the false negatives, 01:18:25.000 --> 01:18:26.000 and then the true positives are the confusion matrix 01:18:26.000 --> 01:18:34.000 at (1, 1). So remember, the TPR is TP 01:18:34.000 --> 01:18:41.000 divided by (FN + TP), and then the false positive rate is FP 01:18:41.000 --> 01:18:46.000 divided by (TN + FP). 01:18:46.000 --> 01:18:53.000 Okay. And so then you plot the true positive rate against the false positive rate. 01:18:53.000 --> 01:19:00.000 And typically what you'll see along with your curve is this dotted line — it doesn't necessarily have to be dotted — 01:19:00.000 --> 01:19:12.000 the line y = x. This is a reference line that often gets plotted on these ROC curves. 01:19:12.000 --> 01:19:22.000 This line is supposed to represent what you would get if your algorithm just did random guessing. 01:19:22.000 --> 01:19:29.000 And so ideally, you want to be above that diagonal, in the upper left-hand triangle. 01:19:29.000 --> 01:19:41.000 You want to be above that line, because otherwise your algorithm is no better than random guessing. 01:19:41.000 --> 01:19:54.000 The best algorithm you could get would be one that has a point at the (0, 1) mark, because then it is possible for your algorithm to have 01:19:54.000 --> 01:19:55.000 a false positive rate of 0 and a true positive rate of 1. 01:19:55.000 --> 01:20:05.000 So once again, you can use this curve to choose cutoffs — it shows you the trade-off for your algorithm between true positive rate 01:20:05.000 --> 01:20:29.000 and false positive rate. You can use it to compare what's possible for different algorithms, and maybe see if one algorithm allows you to get a true positive rate at a false positive rate that you're willing to accept, based on the project you're working on. And also, occasionally 01:20:29.000 --> 01:20:35.000 people will use the area under these curves as a metric on its own. 01:20:35.000 --> 01:20:48.000 I would encourage you not to use that as your metric, mainly because it doesn't 01:20:48.000 --> 01:20:52.000 tell you why your area is high — it doesn't tell you if it's because you're able to get better true positives.
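[Editor's sketch] For reference, a minimal sketch of the for-loop ROC computation just described, again reusing y_prob and y_train from the earlier sketches; the cutoff grid is an illustrative assumption.

# True positive rate and false positive rate at each cutoff,
# read off the confusion matrix exactly as described above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

tprs, fprs = [], []
for cutoff in np.arange(0.01, 1.0, 0.01):
    y_pred = 1 * (y_prob >= cutoff)
    cm = confusion_matrix(y_train, y_pred)
    tn, fp, fn, tp = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
    tprs.append(tp / (tp + fn))          # TPR = TP / (TP + FN)
    fprs.append(fp / (fp + tn))          # FPR = FP / (FP + TN)

plt.plot(fprs, tprs)
plt.plot([0, 1], [0, 1], "r--")          # y = x reference line: random guessing
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()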
01:20:52.000 --> 01:21:00.000 You know what I mean — you're not able to get a good sense of what it's actually telling you; you're obfuscating some of the information that the curve gives you 01:21:00.000 --> 01:21:12.000 by just looking at the area. 01:21:12.000 --> 01:21:19.000 Yeah, so Lara just asked: would you compare the curves visually to determine which model is best, or is there a quantitative measure? 01:21:19.000 --> 01:21:25.000 So typically what people have done in the past is they'll look at the area under the curve, and then they'll say, oh, this one has a higher area, 01:21:25.000 --> 01:21:37.000 therefore it must be better. One reason I don't like that is that by having just a single measure, you're sort of obscuring why the area is better: 01:21:37.000 --> 01:21:38.000 is it because it tends to have better true positives, or does it tend to have better false positives? 01:21:38.000 --> 01:21:48.000 So, if I had the choice, I would just look at the curves and then select based off the curves. 01:21:48.000 --> 01:21:58.000 I'm not a big fan of metrics that try to take a bunch of different metrics and then combine them together into some sort of super- 01:21:58.000 --> 01:22:13.000 metric for comparison purposes. I think it's better to understand the real-world implications; you can't really take the area under the curve and translate it into what it means in terms of the real-world problem. 01:22:13.000 --> 01:22:14.000 Okay, so that was how to do it with a for loop. 01:22:14.000 --> 01:22:25.000 This is such a popular thing that scikit-learn has a function for it; it's called roc_curve. 01:22:25.000 --> 01:22:32.000 And so roc_curve takes in the true values followed by the predicted probabilities, 01:22:32.000 --> 01:22:39.000 so y_prob. 01:22:39.000 --> 01:22:48.000 And you'll see it returns 3 things. The first thing it returns is an array of the false positive rates. 01:22:48.000 --> 01:22:55.000 The second thing it returns is an array of the true positive rates. 01:22:55.000 --> 01:23:02.000 And then the last thing it returns is an array of the probability cutoffs. 01:23:02.000 --> 01:23:05.000 Now, there's a slight difference in the cutoffs. 01:23:05.000 --> 01:23:06.000 You'll notice that the 0th entry is greater than 1. 01:23:06.000 --> 01:23:21.000 The 0th entry in the cutoffs is a special thing that I would have to check the documentation to remind myself what it does, so if you're interested, you can click on it here and read through the returns. 01:23:21.000 --> 01:23:29.000 Okay, so there it is. I think the 0th entry is just the maximum possible score plus 1 — 01:23:29.000 --> 01:23:33.000 the maximum predicted probability, plus 1. 01:23:33.000 --> 01:23:39.000 So that's the 0th entry of the cutoffs. 01:23:39.000 --> 01:23:48.000 Okay? And so then you can plot it just like we did above, okay? 01:23:48.000 --> 01:24:07.000 And then one reason why it looks different is that scikit-learn automatically includes (0, 0) as an entry, whereas when we did it as a for loop that didn't show up. And so that's why this one goes all the way down to (0, 0) and ours did not. 01:24:07.000 --> 01:24:12.000 The very last chart type, and I think we have just enough time to explain it, 01:24:12.000 --> 01:24:16.000 is called the gains and lift charts. This is sort of a weird-sounding chart,
01:24:16.000 --> 01:24:28.000 but just bear with me while I explain it. It's used a lot, I believe, in marketing and advertising. 01:24:28.000 --> 01:24:37.000 So the basic idea is, you take your observations and arrange them in terms of descending predicted probability. 01:24:37.000 --> 01:24:48.000 So you look at the probability that each observation is class 1, arrange your observations in that order, and then what you'll do is plot the true positive rate of your algorithm 01:24:48.000 --> 01:24:54.000 if you were to classify only the vth upper percentile of predicted probabilities as a 1. 01:24:54.000 --> 01:24:58.000 So basically, let's just assume for the sake of argument you had a hundred observations 01:24:58.000 --> 01:25:03.000 you're predicting on. Then, if you wanted to look at 01:25:03.000 --> 01:25:11.000 the twentieth upper percentile, you would take the top 20 observations — 01:25:11.000 --> 01:25:12.000 the 20 observations with the highest probability of being a 1 — classify those as 1, and then calculate your true positive rate. 01:25:12.000 --> 01:25:19.000 You do this for every possible percentile, and then you plot the curve that goes along with it. 01:25:19.000 --> 01:25:29.000 And the idea of why you would ever do this: a lot of times in advertising and marketing, 01:25:29.000 --> 01:25:35.000 you have a limited amount of funds, so you can't advertise to everybody, because you just don't have the money to do that. 01:25:35.000 --> 01:25:51.000 So what you're going to do is allocate your advertising budget to market to v percent of your potential customers, and you want to do this in a way that maximizes the number of people who would see your ad and then become a customer, or whatever process you're 01:25:51.000 --> 01:26:09.000 modeling. So if you take class 1 to be someone who will become a customer after seeing an ad, and class 0 to be someone who is not going to become a customer, then by only marketing to the people who fall in the top v percent of predicted probabilities, you're only marketing to the people that you think 01:26:09.000 --> 01:26:20.000 are most likely to become a customer. And so the gains chart allows you to see this true positive rate as a function of the percent of observations you've classified — 01:26:20.000 --> 01:26:24.000 as a function of v. And similar to the ROC curve, you typically plot a baseline, which is the line y = x, which is what you'd get by just randomly picking 01:26:24.000 --> 01:26:50.000 v percent. The lift chart is then basically taking the gains chart and dividing the line you get from doing this process by the random-guessing line, to give you a sense of the lift the algorithm is giving you over just randomly advertising to people. And in order to do this, you can use 01:26:50.000 --> 01:26:54.000 either pandas' quantile function or NumPy's quantile function. 01:26:54.000 --> 01:27:07.000 These allow you to take in an array of probabilities and get the quantile that represents the upper vth percentile — that may have sounded weird, 01:27:07.000 --> 01:27:17.000 so here I'm just making an array of my predicted probabilities 01:27:17.000 --> 01:27:24.000 — the actual values compared with the probabilities that I got from my algorithm.
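[Editor's sketch] Here is a minimal sketch of the gains and lift construction just described — sort by predicted probability, classify the top v fraction as 1, and record the true positive rate for each v — again reusing y_prob and y_train from the earlier sketches; the percentile grid is an illustrative assumption, not the notebook's exact code.

# Gains chart: TPR as a function of the fraction v of observations classified as 1,
# where the cutoff at each v is the (1 - v) quantile of the predicted probabilities.
import numpy as np
import matplotlib.pyplot as plt

vs = np.arange(0.01, 1.01, 0.01)
gains = []
for v in vs:
    cutoff = np.quantile(y_prob, 1 - v)                # upper vth percentile of predicted probabilities
    y_pred = 1 * (y_prob >= cutoff)
    gains.append(y_pred[y_train == 1].sum() / (y_train == 1).sum())   # true positive rate

plt.plot(vs, gains, label="gains")
plt.plot([0, 1], [0, 1], "r--", label="random guessing")
plt.xlabel("Fraction of observations classified as 1")
plt.ylabel("True positive rate")
plt.legend()
plt.show()

# Lift = gains divided by the random-guessing baseline v.
lift = np.array(gains) / vs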
01:27:24.000 --> 01:27:30.000 And then what I'm going to do is a loop where I start at 1 and work my way down to 0, 01:27:30.000 --> 01:27:40.000 and then I calculate the quantile for each entry — so the first entry will be 1, then 0.99, then 0.98, then 0.97, and so forth. 01:27:40.000 --> 01:27:51.000 Okay, so here we can see the first 5 upper probability quantiles. 01:27:51.000 --> 01:28:13.000 Now that I have those, I can write a loop where I go through and calculate my predictions, using those quantiles as my cutoffs, and then calculate my true positive rates as I go through; and then here I'm just making my lists for the lift plot, okay? 01:28:13.000 --> 01:28:18.000 So here's what shows up for this particular model: 01:28:18.000 --> 01:28:32.000 my gains plot looks like this. It gives you a sense that, ideally, the perfect algorithm would be a line that goes straight up 01:28:32.000 --> 01:28:36.000 until it hits — the fraction of actual 1s, I think, is where it would hit — 01:28:36.000 --> 01:28:37.000 that would be a perfect algorithm, and then the lift would correspond to that. 01:28:37.000 --> 01:28:56.000 So this gives you a sense of how well your algorithm does versus random guessing when following this sort of procedure. 01:28:56.000 --> 01:28:57.000 So — I know we're over time — 01:28:57.000 --> 01:28:59.000 I know a lot of times people like to have 01:28:59.000 --> 01:29:08.000 a one-size-fits-all approach: 01:29:08.000 --> 01:29:16.000 you see this type of problem, and you apply this approach. But choosing the metrics or the diagnostic curves you're going to use for choosing your classification algorithm is not always a plug-and-chug type of 01:29:16.000 --> 01:29:39.000 approach. You have to put some thought into what the implications are for your real-world problem — if this metric is low, what does it mean in terms of the things you're trying to classify? So this is a situation where it's helpful to put in some thought 01:29:39.000 --> 01:29:46.000 using the actual real-world context of your problem, and then translate what those metrics would mean in a business setting. 01:29:46.000 --> 01:29:59.000 Maybe there are actual costs — financial costs — to getting something wrong in a certain way; and in public health settings, there are costs in terms of people's quality of life or lifespan. 01:29:59.000 --> 01:30:07.000 So it's important to put careful thought into choosing diagnostic curves and performance metrics. 01:30:07.000 --> 01:30:09.000 Okay. So for the sake of time, I'm going to go ahead and stop the recording, and I'll hang back for about 5 to 10 minutes to answer questions. 01:30:09.000 --> 01:30:22.000 I hope you enjoyed today's lecture, and I hope you have a great weekend.