Multiclass Classification Metrics Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video, we're going to talk about multiclass classification metrics. We already know some classification metrics, which we covered in our diagnostic curves and confusion matrix notebooks; here we're just going to introduce a couple more that might be useful if you're working on a multiclass classification problem. So in this notebook, we'll discuss how you can assess a multiclass classifier, expanding the confusion matrix in the process to account for more than two classes, and then see how you might use that in your analysis of a model. At the end, we'll introduce a concept known as cross entropy.

...

The easiest approach, and maybe the silliest sounding, is to simply turn your multiclass problem into a binary problem somehow. This might seem foolish: why would I ever want to take my multiclass problem and turn it into a binary classification problem? Well, there are a number of situations where you might want to do this. Maybe you're working on a project where there are lots of classes, but you're really only interested in a handful of them, say one, two, or maybe three, perhaps because these are the most important classes to get right, or because, in a business setting, observations of those classes incur the greatest cost to the business. One such example is the Cleveland heart disease data set in the UC Irvine repository. It contains data on patients from the Cleveland Clinic, including each patient's heart disease status. That status is integer valued from 0, meaning no presence, up to 4, and heart disease is considered present for anybody with a 1, 2, 3, or 4. So while you may want to build a classifier that can specifically distinguish the values 1 through 4, ultimately it may be most important to determine whether or not somebody has heart disease at all. You might combine classes 1 through 4 into a single "heart disease" class and then make sure you have the highest sensitivity, specificity, or what have you for that binary problem instead of the multiclass problem.
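Here is a minimal sketch of what that collapsing step might look like in code. The column name "status" below is just a placeholder and is not necessarily what the UCI files use, and the values are made up for illustration:

```python
import pandas as pd

# Hypothetical heart disease status values; in practice you would load the
# Cleveland data from the UCI repository, and the column name may differ.
heart = pd.DataFrame({"status": [0, 2, 1, 0, 4, 3, 0]})

# Collapse classes 1-4 into a single "has heart disease" class,
# turning the five-class problem into a binary one.
heart["has_disease"] = (heart["status"] > 0).astype(int)

print(heart)
```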
...

Another way you can check how your multiclass model is doing is to look at a multiclass extension of the confusion matrix. We did talk about this in the confusion matrix notebook, and it's relatively straightforward. Before, when we had two classes, it was a 2-by-2 matrix with rows and columns labeled 0 and 1; the rows are the actual class and the columns are the predicted class. Now you just extend that: if you have, let's say, capital C classes, you have a C-by-C matrix, where for each entry the row tells you the actual value and the column tells you what you predicted. So, for instance, one entry counts the cases where you predicted a 1 when it's actually a 0, and if you kept going all the way across the row, you'd reach the cases where you predicted class C when it's actually a 0. This is useful because it allows you to see specifically how you're getting things wrong.

Here's an example with the iris data set. So here's our iris data: we've got our zeros, our ones, and our twos, and I'm going to quickly build an LDA model on this; we went over LDA in the Bayes-based classifiers notebook. Then we're going to import the confusion matrix and use it to assess how this model does on all of the observations. So from sklearn.metrics we import confusion_matrix, and then we just create our confusion matrix. The process is exactly the same as in the last video: we put the true values first, here the training set labels, and then the model's predictions on the features it was built upon.

...

And now we can see a nice pandas DataFrame visualization of it. We have our actual 0, 1, 2 as the rows and our predicted 0, 1, 2 as the columns. For class 0, we did perfectly, and it looks like we've got two misclassifications involving classes 1 and 2. The way we can see this is that of the 40 observations that are actually class 1, 38 end up correctly predicted, but two of them end up predicted as class 2. A nice feature of this, compared to just looking at whether or not observations were predicted correctly, is that we can see our model is confusing the ones with the twos, which shows up in the lower right portion of the matrix. And if we go back up to our plot, we can see why that might be: the observations getting misclassified are probably the ones around the area I'm hovering over with my mouse, where we've got a couple of orange triangles pretty close to the green squares. OK, so a confusion matrix can be useful in a multiclass problem because it allows you to see what things you're getting wrong, and maybe that lets you build some extensions onto your model to handle those cases. Or, in this instance, if we only had these two features, petal length and petal width, it would be difficult, right? It just looks like sometimes a type 1 iris is pretty close to a type 2 iris in terms of these measurements.
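Here is a minimal sketch of the steps just described, assuming the LDA model is fit on the two petal features. To keep it self-contained it fits and evaluates on the full iris data set rather than the train/test split used in the video, so the counts will differ from the 40s quoted above:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

# Load the iris data and keep the petal length and petal width columns.
X, y = load_iris(return_X_y=True)
X = X[:, 2:]

# Fit an LDA model, as in the Bayes-based classifiers notebook.
lda = LinearDiscriminantAnalysis().fit(X, y)

# True values first, then the predictions made on the same features.
conf_mat = confusion_matrix(y, lda.predict(X))

# Wrap the matrix in a DataFrame so the actual/predicted labels are visible.
print(pd.DataFrame(conf_mat,
                   index=["actual 0", "actual 1", "actual 2"],
                   columns=["predicted 0", "predicted 1", "predicted 2"]))
```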
A final metric that we might think about for multiclass problems is called cross entropy. This is sometimes known as the log loss metric, and we'll see why in a second. Cross entropy measures how well two probability distributions align. In a lot of these problems, what these algorithms are trying to do is approximate a conditional probability distribution: what is the probability that Y equals a given class, given the features we observed? So our goal is to produce an estimated probability distribution, one that matches up closely with the true real-world probability distribution, and the point of cross entropy is to measure how close we got. Our observed data are usually 0/1 labels, so the observed "probability" of being each class is either 0 or 1, and we're going to use cross entropy to measure how closely our estimates match that. For instance, in LDA we made probability distribution assumptions, namely that the features follow a normal distribution within each class, and cross entropy lets us see how closely the resulting estimates line up with what we actually observed. That's the idea. So here we're going to take the observed distribution to be a set of indicator functions: random variables that are 0 or 1 depending on the value of the class c.

So y_c, for a class c, is also denoted 1_{Y = c}; this is the indicator of whether or not Y is equal to the category c. For instance, if c were 2, this would indicate whether or not that observation is equal to a 2. For each observation i, we then compute the negative of the sum from c = 1 up to C of y_{c,i} times the logarithm, and I believe here we're just using the natural log, of p_{c,i}. Here y_{c,i} is the indicator for observation i, capital C is the total number of possible classes, and p_{c,i} is the probability we estimate of observation i being of class c. So essentially, y_{c,i} is the actual value, which will be either a 0 or a 1 depending on what class observation i is, and we multiply it by the log of the probability we estimate. This expression is the cross entropy for a single observation; the total cross entropy for your model is what you get when you sum over all of the observations.

We can do this either by hand, which maybe will help us understand the formula, or in sklearn. The first thing we need to do is generate the y_{c,i}, and we can just use pd.get_dummies to do that. Here we can see we have three columns because we have three classes, and, for instance, the first observation is of class 1, the second observation is of class 2, and if we keep going, the fifth observation is of class 0. Now we can generate the p_{c,i} by calling predict_proba on the LDA model. It doesn't always have to be LDA; I'm just using it because that was the notebook that came right before this one.

...

Now, remember, the formula involves y_{c,i} times the log of p_{c,i}, and that's this part. Then we need to take the sum of each row. So, for example, along the first row, the products involving log p_{0,0} and log p_{2,0} get multiplied by zero, and we're just left with log p_{1,0}, because observation 0 is of class 1. So we're taking these indicator columns, multiplying them by the logs, summing within each row, and negating, which gives an array with the cross entropy of each individual observation. Then, to get the total cross entropy, we sum all of those individual per-observation values we just calculated. So we do one more sum, and this is the cross entropy for that LDA model.

Maybe you don't want to code that up every time, and you probably don't, so let's see how to do this in sklearn. In sklearn we use the function log_loss: from sklearn.metrics we import log_loss, and we can use it similarly to most of the metrics in sklearn. We first put the true values, then the predicted probabilities, which were the p_{c,i}, then the labels for the three classes, which for us were 0, 1, 2. And finally we set normalize equal to False. Why do we put this in? It's to make sure the log loss formula matches the one we used above: if normalize is equal to True, sklearn's function returns the average over the observations rather than the sum, which is slightly different from the function we showed. I just want to be consistent with the function I'm showing; if you're interested in knowing more about what normalize does, I encourage you to go to the documentation.
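Continuing the sketch from above (so lda, X, and y are the objects defined there; the actual variable names in the notebook may differ), the by-hand computation and the sklearn version might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss

# Indicators y_{c,i}: one column per class, a 1 in the column of the actual
# class and 0s elsewhere.
y_indicators = pd.get_dummies(y).to_numpy().astype(float)

# Estimated probabilities p_{c,i} from the fitted model.
probs = lda.predict_proba(X)

# Per-observation cross entropy: -sum over c of y_{c,i} * log(p_{c,i}).
# Multiplying by the indicators keeps only the log probability of the
# actual class in each row.
per_obs = -np.sum(y_indicators * np.log(probs), axis=1)

# Total cross entropy: sum over all observations.
print(per_obs.sum())

# The same thing with sklearn: true labels, predicted probabilities, the
# class labels, and normalize=False so we get the sum rather than the mean.
print(log_loss(y, probs, labels=[0, 1, 2], normalize=False))
```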
OK. And here we can see that out to maybe 10 decimal places the two values are the same, and after that they start to depart a little bit; that's probably just because sklearn adds and multiplies in a slightly different way than we did.

Now, I've introduced this thing called cross entropy. Maybe it sounds cool because it borrows a term from physics, and physics sometimes has cool-sounding things, but we don't really have a feel yet for what it means. We did talk about how it measures the overlap between the actual distribution we observe and the probability distribution we estimated, but we can get a sense of what's going on by looking at the formula. The formula above is the cross entropy for a single observation, that is, what each observation contributes to the overall sum. We know that the only term in that sum that contributes anything is the one corresponding to the class that observation i actually belongs to; let's call that class l. In that case, the sum turns out to be just -log(p_{l,i}), because y_{c,i} is zero for every class except l, where it is one. So when p_{l,i} is close to one, -log(p_{l,i}) is close to zero, and conversely, as the probability gets smaller and smaller, this negative log gets larger and larger. Since we want to assign observation i to its actual class, what we want is a value of p_{l,i} that is as close to one as possible. The closer to one each of these p_{l,i} values is, for each observation, the closer to zero our cross entropy is. So what we're saying is: the more weight we put on the correct class, the smaller our cross entropy gets, and a good model is one with a low cross entropy. Essentially, we can think of it this way: cross entropy punishes models that are confidently incorrect. Why? Because such a model has a very low value of p_{l,i}: being confidently incorrect means assigning a lot of probability to the incorrect class, which means that p_{l,i}, the probability assigned to the correct class, is close to zero.

OK, so that's going to be it for this notebook. We talked about some different multiclass classification metrics. Essentially, you can look at binary versions of your problem, you can examine the multiclass confusion matrix to get a sense of what's going on with your model and exactly how it's getting things incorrect, and cross entropy can be a useful metric for measuring the overlap between the actual distribution and the estimated distribution. So I hope you enjoyed this video. I enjoyed having you watch it. I can't wait to see you in the next video. Have a great rest of your day. Bye.