Diagnostic Curves Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. This is another classification video: we're going to talk about diagnostic curves. So far we've seen metrics that are individual statistics; these are going to be curves that allow us to look at a wide array of possibilities at once. We know a couple of different algorithms now, and if you watched the logistic regression video, we talked about probability cutoffs and how they let you get a whole range of different classifiers from the same model. We're essentially going to build off of that and discuss how different probability cutoffs produce different classification metrics, which you can then use to make various diagnostic curves, including the precision-recall curve, the ROC or receiver operating characteristic curve, and then two types of charts called the gains and lift charts. We'll be doing this using sklearn and matplotlib.

...

In this notebook, I'm going to use the same data that I used in my Logistic Regression notebook, this random binary data, and we're just going to quickly fit a logistic regression model to it after making a train test split. So here's my data, here's my model fit. This is all code that should be familiar to you: we imported LogisticRegression, we did a train test split, and then we fit the logistic regression model on that data.

In this notebook, all of my curves are going to be made on the training data, because it's easiest for me to implement the curves without having to think about making validation sets or how to work in cross-validation. In practice, you would probably want to do something with a validation set, or with cross-validation: if, say, you did five-fold cross-validation, you could plot all five of the resulting curves for a particular type of curve and get an idea of the spread you're looking at, thinking of it as a range of different curves rather than just one single curve.

As we mentioned in the Logistic Regression notebook, if we choose different probability cutoffs, we also get different confusion matrices: each probability cutoff corresponds to a different prediction rule, so different zeros and different ones. Because of that, we ultimately get a spectrum of confusion matrices corresponding to these different probability cutoffs. So I'm going to import the confusion matrix, and then we have two examples: one with a cutoff of 0.4, so around here, and one with a cutoff of 0.6, which is right here. When we have a cutoff of 0.4, our confusion matrix looks like this: we've got 190 true negatives, 23 false positives, 6 false negatives, and 181 true positives. When we increase the cutoff, we make it harder to be classified as a one; we need a higher level of probability to consider something a one. We increase the number of true negatives and decrease the number of false positives, but at the same time we increase the number of false negatives and decrease the number of true positives.
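Roughly, the setup looks something like the sketch below. The data here is a synthetic stand-in rather than the random binary data from the Logistic Regression notebook, so the printed confusion matrices won't match the numbers above; the variable names (X_train, y_train, log_reg) simply mirror the ones used in the video.

```python
# A minimal sketch of the setup: fit a logistic regression, then compare the
# confusion matrices produced by two different probability cutoffs.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# stand-in data: a 1-D feature with two overlapping classes (not the video's data set)
rng = np.random.default_rng(440)
X = np.concatenate([rng.normal(0, 1, 250), rng.normal(2, 1, 250)])
y = np.concatenate([np.zeros(250, dtype=int), np.ones(250, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=440, stratify=y)

log_reg = LogisticRegression()
log_reg.fit(X_train.reshape(-1, 1), y_train)

# probability of being class 1 for each training observation
p_hat = log_reg.predict_proba(X_train.reshape(-1, 1))[:, 1]

# different cutoffs give different prediction rules, hence different confusion matrices
for cutoff in (0.4, 0.6):
    y_pred = 1 * (p_hat >= cutoff)
    print("cutoff =", cutoff)
    print(confusion_matrix(y_train, y_pred))
```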
So we can see how there are tradeoffs between choosing different values for the cutoff. For instance, if we are most interested in not falsely flagging somebody as being class one, maybe we'd want the higher cutoff, while if we were more interested in making sure we capture all of the people who are of class one, we could maybe use the earlier cutoff. So it might be nice to see what's possible for a given algorithm, for different performance metrics, by plotting the outcomes of a wide array of cutoffs instead of looking at them one by one. That's the idea behind these diagnostic curves.

The first one we're going to look at is called the precision-recall curve. It plots precision on the vertical axis and recall on the horizontal axis. Recall from our Confusion Matrix notebook that precision estimates the probability that an observation is actually of class one when our algorithm says it is; in conditional probability terms, this is the probability that y is actually equal to one given that y hat is equal to one. Recall flips the two: it estimates the probability that your algorithm classifies something as a one when it actually is a one, that is, the probability that your prediction is one given that the observation is actually one.

One way to think about this is in terms of a web search. You go onto the web, whatever your favorite web browser is, and you search for something. The fraction of results that show up that are actually relevant to your search is a measure of precision, whereas the fraction of all possible relevant results that you manage to capture is measuring the recall. We're going to come back to this when we try to interpret the precision-recall curve, but that's a nice way to think of precision and recall.

So we're going to see how to generate this in Python. First we can do it with sklearn's precision_score and recall_score in a for loop. This is one of those times where you can try the exercise on your own by filling in some of the missing code in the next two chunks: you can pause the video and code it up yourself, you can watch me and code along, or you can watch me and come back to code it later. The first thing we need to do is import recall_score and precision_score.

...

Then we have various cutoffs going from 0.1 all the way up to 0.9749, I think. What we're going to do is loop through these cutoffs and record the precision score and the recall score at each one. Our prediction has to be log_reg.predict_proba(X_train.reshape(-1, 1)), take the class one column, check whether it's greater than or equal to the cutoff, and then multiply by one to turn the booleans into zeros and ones. Then we can call precision_scores.append with the precision_score function on y_train and the prediction, and similarly recall_scores.append with recall_score on y_train and the prediction. And here's where we plot the curve: recall scores on the horizontal axis, precision scores on the vertical.

OK, so this is what the curve looks like for this particular algorithm. We have a precision of one for many values of the recall.
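Here's a minimal sketch of that loop, assuming the log_reg, X_train, and y_train objects from the earlier sketch; the cutoff grid here is illustrative rather than the exact one used in the video.

```python
# Sweep a grid of probability cutoffs and record precision and recall at each one.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score

cutoffs = np.arange(0.01, 1.0, 0.01)  # illustrative cutoff grid

precision_scores = []
recall_scores = []

p_hat = log_reg.predict_proba(X_train.reshape(-1, 1))[:, 1]

for cutoff in cutoffs:
    prediction = 1 * (p_hat >= cutoff)
    # zero_division=1 matches the "precision is 1 when nothing is predicted positive" convention
    precision_scores.append(precision_score(y_train, prediction, zero_division=1))
    recall_scores.append(recall_score(y_train, prediction))

# recall on the horizontal axis, precision on the vertical axis
plt.plot(recall_scores, precision_scores)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```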
But then once we hit a recall a little bit below 0.9, we start to go down in precision, until with a recall of one we have a precision of about 0.85. If we had, say, a perfect algorithm that was able to split the data perfectly into zeros and ones, we would expect our precision-recall curve to hug this upper right-hand corner, so we would essentially have a blue line that traces out the upper right-hand corner of the unit square. This usually isn't going to happen in practice unless you have a data set that can be separated perfectly, which isn't standard.

So we can see, as we talked about, that as you start to increase recall there comes a point where the precision starts to go down, and that's because in practice there is a trade-off between precision and recall. One way to think about this is to go back to our web search result. How would you increase recall? Well, you need to return a higher number of the total possible relevant posts. But usually when you return more posts, you're more likely to also return non-relevant posts. And if you're returning more posts in total, some of which are now more likely to be non-relevant, that's going to decrease your precision, which again measures how many of the posts I returned are relevant to my search term.

We can also see this from the precision and recall formulas: precision = TP / (TP + FP) and recall = TP / (TP + FN). If we have a cutoff probability of zero, everything is classified as a one, so recall is one and precision will be whatever percentage of the observations are of class one. The denominator of recall, TP + FN, is just the number of actual class one observations, so it is not impacted by the cutoff; it stays the same no matter what. However, the denominator of precision, TP + FP, is impacted by the cutoff, because it is determined entirely by the algorithm's output. Since the number of true positives decreases as the cutoff increases (we're less likely to classify anything as a one), the recall must decrease toward zero. Conversely, precision tends to go up, because as you increase the cutoff you're less likely to catch false positives, which means that of the things you label positive, more of them are likely to be true positives. And in the edge case where zero observations are predicted positive, we usually define precision to be one.

In summary, the precision-recall curve gives you a way to select a probability cutoff that strikes a good balance between precision and recall, because typically there is a trade-off between increasing your probability of being correct when you say something is a positive and actually capturing all the true positives.
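As a quick sanity check on the extreme-cutoff behavior described above, you can compute precision and recall at the two extremes directly, again reusing p_hat and y_train from the earlier sketches (the exact printed values depend on the stand-in data, not the video's).

```python
# Precision and recall at the two extreme cutoffs.
from sklearn.metrics import precision_score, recall_score

pred_all_ones = 1 * (p_hat >= 0.0)    # cutoff of 0: everything is classified as a one
print(precision_score(y_train, pred_all_ones))  # the fraction of observations that are class one
print(recall_score(y_train, pred_all_ones))     # 1.0

pred_none = 1 * (p_hat >= 1.01)       # cutoff above 1: nothing is classified as a one
print(precision_score(y_train, pred_none, zero_division=1))  # defined to be 1 by convention
print(recall_score(y_train, pred_none))                      # 0.0
```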
...

The second type of curve we're going to talk about is the receiver operating characteristic curve, or the ROC curve, which is what we'll call it from now on because it's easier. This curve actually came about in World War II as a way to aid operators of radar receivers that were detecting enemy objects on the battlefield. What it does is plot the true positive rate on the vertical axis against the false positive rate on the horizontal axis for various cutoff values. The curve for a specific algorithm is then often compared to a theoretical ROC curve, which is: what if I just randomly guessed? If you take the average curve you get over various randomly generated test sets, this turns into the line y = x. So you typically plot the ROC curve along with a dotted line for y = x, which gives you a comparison point to random guessing.

...

The true positive rate is again measuring the probability that an observation is classified as a one given that it actually is a one, so this highlighted probability, and the false positive rate gives you the probability that an observation is classified as a one when it's actually a zero, so this highlighted probability. I like to think a nice analogy for these two metrics, and for the tradeoff between them (because there's a tradeoff here as well), comes from oncology. Let's say you're a doctor and you have a patient with a tumor; sometimes they'll have surgery to remove that tumor, which again is a collection of cancer cells. If you're not familiar with this, cancer is bad in general; there may be some weird fringe case where it's OK to have the cancer cells there, I don't know, but maybe. So the goal of these surgeries is to remove all of the cancerous cells that you can while not removing too many healthy cells, because healthy cells are good to have. Essentially, you want to maximize the amount of cancer cells you cut out and minimize the amount of normal cells that are removed. We can think of the true positive rate as the fraction of cancer cells we remove and the false positive rate as the fraction of normal cells that are removed. Ideally, in this setting, we would want to remove all the cancer cells, for a true positive rate of one, and leave all the normal cells, for a false positive rate of zero.

So we can implement the ROC curve again using sklearn and a for loop. Here sklearn is just going to be giving us the confusion matrix, and the for loop is going to operate in much the same way as the last one. Again, this is a good opportunity to practice, so if you want to code it up on your own, feel free to do so, or you can watch me, or you can come back and fill it in after I've done it.

...

I believe we imported confusion_matrix earlier. Yep. So we can just call confusion_matrix.

...

We want to put our true values first, followed by our predicted values. The true negatives are the confusion matrix at [0, 0], the false positives are at [0, 1], the false negatives are at [1, 0], and the true positives are at [1, 1]. Then for the true positive rate, true positives go on the top and true positives plus false negatives go on the bottom, whereas for the false positive rate, false positives go on the top and false positives plus true negatives go on the bottom.

...
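A sketch of that loop might look like the following, reusing p_hat and y_train from the earlier sketches; the cutoff grid and plot labels are illustrative.

```python
# Hand-rolled ROC curve: compute TPR and FPR from the confusion matrix at each cutoff.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cutoffs = np.arange(0.0, 1.01, 0.01)

tprs = []
fprs = []

for cutoff in cutoffs:
    prediction = 1 * (p_hat >= cutoff)
    # labels=[0, 1] keeps the matrix 2x2 even if every prediction is the same class
    cm = confusion_matrix(y_train, prediction, labels=[0, 1])
    tn, fp = cm[0, 0], cm[0, 1]
    fn, tp = cm[1, 0], cm[1, 1]
    tprs.append(tp / (tp + fn))   # true positive rate
    fprs.append(fp / (fp + tn))   # false positive rate

plt.plot(fprs, tprs, label="logistic regression")
plt.plot([0, 1], [0, 1], "k--", label="random guessing")  # baseline y = x
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```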
And then we can run the plotting code, which just plots the two: true positive rate on the vertical axis, false positive rate on the horizontal axis. This is what our ROC curve looks like. We can see here that the tradeoff between the two comes in sort of the opposite direction: as you decrease the false positive rate, you will in general also tend to decrease the true positive rate. Remember, decreasing the false positive rate is a good thing; decreasing the true positive rate is a bad thing. So the ROC curve gives you an idea of what false positive rate you could get with this algorithm while also maintaining a high true positive rate. For instance, we might be interested in this area over here, where we're trying to see how far we can push the false positive rate down before we reach a point where we're no longer willing to reduce the true positive rate.

An ideal curve, the perfect classifier, would be a curve that hugs the upper left-hand corner; sometimes it's just visualized as a single dot at the point (0, 1), the classifier that achieves a false positive rate of zero and a true positive rate of one at the same time. Your goal here is to at least have a classifier that is above and to the left of this dotted line. If your curve falls below and to the right of the dotted line, that indicates you could do better by just randomly guessing the observations, and so your classifier is doing something not good.

So again, as I said before, we showed how to do this in sklearn, and there is a trade-off between the two. If we think about this in our cancer surgery example, maybe your surgeon gets way too ambitious while targeting cancer cells, so they're very aggressive in what cells they cut out. Well, if they're very aggressive in what cells they cut out, they're probably going to get more cancer cells, a.k.a. have a high true positive rate, but they're also more likely to accidentally remove normal cells, meaning they'd have a high false positive rate. Conversely, if you go the other way, maybe you're too cautious because you don't want to remove any normal cells. Then you would have a low false positive rate, which is good, right? We're not removing any of our healthy cells. But at the same time, you're probably going to miss more of the cancer cells than you would like to, so a lower true positive rate.

Also, before we move on: we did this by hand in a for loop, but there's also a function in sklearn's metrics module called roc_curve, so we would do from sklearn.metrics import roc_curve. What roc_curve returns is all three of the false positive rates, the true positive rates, and the probability cutoffs that were used to generate those rates, in that order: false positive rates first, then true positive rates, then cutoffs, if you put in the true values and then the probability of being class one.
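A minimal sketch of that call, assuming the log_reg, X_train, and y_train objects from the earlier sketches:

```python
# sklearn's built-in ROC helper returns (fpr, tpr, thresholds), in that order.
from sklearn.metrics import roc_curve

fprs_skl, tprs_skl, roc_cutoffs = roc_curve(
    y_train,                                              # true labels first
    log_reg.predict_proba(X_train.reshape(-1, 1))[:, 1]   # probability of class 1
)
```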
So, for instance, we call fprs, tprs, cutoffs = roc_curve with y_train, the true values, first, and then log_reg.predict(X_train.reshape(-1, 1)) and the class one column... hmm, "too many indices for array"... oh, predict_proba.

...

And we can see that they're probably not identical, but they're close to identical.

OK, the final types of charts we're going to look at are called gains and lift charts. This kind of comes more from the business and marketing side of things. These curves try to give you an estimate of the true positive rate of your algorithm if you were to take the upper v-th percentile of predicted probabilities and classify them all as one. Remember, we're getting probabilities here, so you can think of this as ranking the observations from highest probability of being one down to lowest probability of being one; you take the upper v percent of that, classify those as ones, and classify the rest as zeros.

You might think this is a weird way to do things, but the idea comes from marketing. Maybe you only have the funds in your advertising budget to market to v percent of your potential customers. Let's say you have 100 possible customers and you only have the money to send ads to 20 of them. You'd want to make sure those 20 ads go to the people in that 100 who are most likely to actually respond to the ads in a positive way. In this setting, something being predicted to be class one represents a person who is likely to become a customer after seeing your ad, and someone of class zero is someone who would not become a customer after seeing your ad. So if you arrange it so that you only market to those in the top v percent of all of your predicted probabilities, you're only marketing to those who have a high probability of being a class one person, someone who will become your customer according to your algorithm.

What the gains chart does is plot the true positive rate on the vertical axis and the percent of observations you've predicted as a one on the horizontal axis. So, for instance, if you only take the top 1%, it gives you the true positive rate of the top 1%. This is similar to the ROC curve in the sense that we usually plot a baseline curve for comparison, where the baseline is that you just randomly select v percent of your observations to be of class one. An accompaniment to the gains chart is called a lift chart, where you essentially take the ratio of your algorithm's gains curve to the random-guessing baseline, and that's what gets plotted.

It might be easiest to show this in Python. What we're going to do is use pandas' quantile function: you put in the quantile you want, and it returns the value that corresponds to it. So let's say you want the people in the upper 1%; you would put in 0.99 (the 99th percentile), and it would give you the number where 99% of people have a probability at or below that number. Here I'm making a data frame that has the actual y value along with the predicted probability, and then I do a list comprehension to arrange the probabilities that give me the different cutoffs. For instance, these are the first five upper probability quantiles: 100% of the people are at or below this one, 99% of the people have a probability at or below this one, and so forth. Then this next chunk goes through those cutoffs we've just generated, gets the TPRs, and calculates the lift curve, which again is just the true positive rate of your algorithm divided by random guessing, which is the line y = x.
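A rough sketch of both charts, under the same assumptions as the earlier sketches (the column names, percentile grid, and plot labels here are illustrative, not the exact code from the video):

```python
# Gains chart: TPR when the top v percent of predicted probabilities are called ones.
# Lift chart: that TPR divided by the random-guessing baseline, the line y = x.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"y": y_train, "prob": p_hat})

# fraction of observations we're willing to classify as one (the "top v percent")
vs = np.arange(0.01, 1.01, 0.01)

# the probability cutoff that keeps exactly the top v percent
cutoffs = [df["prob"].quantile(1 - v) for v in vs]

tprs = []
for cutoff in cutoffs:
    pred = 1 * (df["prob"] >= cutoff)
    tp = np.sum((pred == 1) & (df["y"] == 1))
    fn = np.sum((pred == 0) & (df["y"] == 1))
    tprs.append(tp / (tp + fn))

# gains chart with the y = x baseline
plt.plot(vs, tprs, label="logistic regression")
plt.plot([0, 1], [0, 1], "k--", label="random guessing")
plt.xlabel("Fraction classified as one")
plt.ylabel("True positive rate")
plt.legend()
plt.show()

# lift chart: ratio of the algorithm's gains to random guessing
plt.plot(vs, np.array(tprs) / vs)
plt.xlabel("Fraction classified as one")
plt.ylabel("Lift")
plt.show()
```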
...

All this other stuff is just plotting that makes my plot look nice. The blue solid line is my gains plot for the logistic regression, which we can compare to the baseline of the line y = x, which is just randomly sending ads to people. We can see that we have a pretty steady increase in true positive rate until we hit a little bit above 0.4, then it gets kind of bumpy, and then, if we had the funds to market to, say, 55% of the people in our customer database, it looks like we should be able to get close to 100% of the people who would become our customers. The way to read the lift plot is to say: OK, let's say we can only market to 20% of people. You go up here to the blue line and you can see that, according to our algorithm, if we use it to choose who we market to, we can do about two times better at converting customers than just randomly choosing people.

OK, so you might be wondering: we have all these nice curves, we have all these nice metrics. What can I do? What's the right way to do this? How can I do this in an algorithmic way? Unfortunately, the answer is: it's tough. There isn't one set way to answer every question. There's no one metric that is the best metric and no one curve that's the best curve; it really depends on the problem you're working on. For instance, if you're in a world where you only care about getting the people that are ones, like this marketing world, and you don't really care if people who are never going to become your customers get a bunch of spam messages from you, you probably just care about the gains plot. If you're in the world of medicine, you probably also care about getting things wrong the other way, right? So it just depends.

And again, as I said, we only used the training set here; that was for convenience. In practice, we could make these plots with a validation set, or we could make them multiple times with, say, five-fold cross-validation splits and then think about averaging them in some sense.

OK, so that's it for this notebook. I hope you enjoyed seeing all these nice curves and seeing how you can use them to think about tradeoffs for a particular algorithm given a probability cutoff. I hope to see you in the next video. Have a great rest of your day. Bye.