Bayes-Based Classifiers I Video Lecture Transcript. This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this notebook we're going to continue learning classification algorithms. This will be the first in a two video series on Bayes-based classifiers. So this is talking about Bayes' rule, which we will review in this video and notebook, and we'll also introduce a classification model framework for the three models that we consider in this notebook. We use the iris data set to help us demonstrate the different models, and these can be multiclass classification models. The three models we're going to consider are linear discriminant analysis, an extension of that which is quadratic discriminant analysis, and essentially an extension of both of those, the most general, naive Bayes. Let me go ahead and import our stuff.

We're going to start off with a quick review of what Bayes' rule says. We're going to assume that we have some probability space, capital Omega, which is visualized down here as a rectangle. The first thing we need to know if we're going to review Bayes' rule is: what is conditional probability? So let's say that we have events A and B, with the probability of B being nonzero, so it has a positive probability, because essentially we want to divide by it. We define the probability of A conditional on B as, this is capital P for probability, A vertical line B, where the vertical line is taken to mean "given" or "conditional on." The probability of A conditional on the event B happening is defined to be the probability of A intersection B, so the probability of both events happening, divided by the probability of B. And this can be visualized like so: we have our probability space Omega, event A is the circle on the left, event B is the circle on the right, and we can see that the probability of A given that B has happened has to come from this green shaded-in region. If B has happened, the only way that A can also happen is in the overlap area. So if you just know that event B has happened and you want to know the probability that A has also happened, well, it's the fraction of B that is represented by the green shaded overlap region, if that makes sense. Essentially, once we know B has happened, you can restrict yourself to the part of the probability space that is just B, and then you just have to look at the fraction that's made up of event A, which is this middle region.

We also need to remind ourselves of the law of total probability. So if B_1, B_2, all the way up to B_n are disjoint events, meaning that B_i intersection B_j is equal to the empty event as long as i and j are not the same, and we also have that the union of all of these disjoint events is equal to the whole probability space, then it must be that the probability of any event A is just equal to the sum of the probabilities of the intersections of A with all of those events. As a visualization of this, let's say we have B_1 through B_12 that segment the space in this sort of weird way, and then A is the blue circle in the middle, where we can see that B_1 through B_12 partition up A as well. And so we can get the probability of A just by adding up the different shaded pieces, sort of like a puzzle. OK.
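For reference, here are the two facts just described, written out as formulas. These are just the standard statements, in the same notation used above:

% Conditional probability of A given B (requires P(B) > 0)
P(A \mid B) = \frac{P(A \cap B)}{P(B)}

% Law of total probability: if B_1, ..., B_n are disjoint and their union is \Omega,
% then for any event A
P(A) = \sum_{i=1}^{n} P(A \cap B_i) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)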
So that's a visualization of the law of total probability. Bayes' rule combines the definition of conditional probability with the law of total probability, like so. For events A and B, if we assume that the probability of B is not equal to zero, Bayes' rule, which is also called the Bayes-Price theorem, says that the probability of A conditional on the event B happening is the fraction: the probability of B conditional on the event A happening, times the probability of A, divided by the probability of B. Really, this is just a double application of conditional probability. Up top would have been the probability of A intersection B, but then we can use conditional probability a second time to substitute in this part that's been highlighted. A lot of times, and in this notebook I believe we will also do this when we set up our three classifiers, this is taken a step further, where the probability of B is broken up according to the law of total probability to be the probability of B intersection A plus the probability of B intersection A complement, so all the events in the space that are not in A, and then we apply conditional probability again. So ultimately, one version of Bayes' rule that you often see is: the probability of A given B is the probability of B given A times the probability of A, divided by the probability of B given A times the probability of A plus the probability of B given A complement times the probability of A complement. If you'd like a nice visualization of this, there's a nice blog post I found a couple of years ago that I think illustrates it well.

So we did all this work reminding ourselves of conditional probability and Bayes' rule. Why are we doing that? Well, in the three classifiers that we're going to learn in this notebook, we use Bayes' rule to set up the classification. Let's say that we have m features collected in a matrix X and an output variable y that can take on any of capital C possible classes. When we had logistic regression, C was equal to two, and we modeled the conditional probability that y was equal to one given some value for our features in order to make predictions. Bayes' rule takes this expression, the probability that y is equal to class c given that X equals x star, and rewrites it in general so that the probability that y is equal to class c, or category c, given the features is equal to pi_c times f_c(x star), divided by a denominator that comes from breaking up the probability space according to the law of total probability. The numerator is the same as this part here in Bayes' rule: pi_c is the probability that y is equal to class c, and f_c(x star) is the conditional probability that X is equal to x star given that y is equal to c. Pi_c is often called the prior probability, and in practice we usually estimate it using just the fractions in our training set: all right, 30% of them are class one, 30% are class two, and so forth. For the f_c we have to make some assumption. So as I was saying, for the pi_c's you usually just use the fractions from the training set, whereas when it comes to the f_c's, which are these probability density functions, these depend upon the data and we have to make a bunch of different assumptions.
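Written out, the version of Bayes' rule described above and the classification setup it leads to look like this, using the pi_c and f_c notation from the notebook:

% Bayes' rule, with the denominator expanded by the law of total probability
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
            = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^{c})\,P(A^{c})}

% Classification setup: y takes one of C classes, X collects the m features
P(Y = c \mid X = x^{*}) = \frac{\pi_{c}\, f_{c}(x^{*})}{\sum_{k=1}^{C} \pi_{k}\, f_{k}(x^{*})},
\qquad \pi_{c} = P(Y = c), \qquad f_{c}(x^{*}) = P(X = x^{*} \mid Y = c)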
And so basically, the three different classifiers that I said we're going to learn make different assumptions on these f_c's, and that's how we get the different classifiers. But the basic setup — where did I go? — the basic setup is this: the probability that y is equal to c given the features is equal to this fraction. We estimate pi_c, like I said, and then we make some assumptions on the little f's that allow us to come up with different types of models. We're going to show you how to do this with the iris data set that we've looked at a few times before. All I've done is just load the data, and here I'm visualizing it: here's petal length on the vertical axis, petal width on the horizontal. We've got a bunch of class zeros, a bunch of class ones, and a bunch of class twos. OK? So we're going to use these three different classifiers and demonstrate them on this data set, and all three of these classifiers use the Bayes' rule setup of the classification problem that we have now highlighted up above.

The first classifier that we're going to look at in this notebook is called linear discriminant analysis, or LDA. LDA is actually a somewhat ambiguous abbreviation or acronym in machine learning and data science: LDA also stands for latent Dirichlet allocation, which is used a lot in natural language processing models. But here in this series of notebooks, we're going to use LDA to refer to linear discriminant analysis. In LDA, the assumption that we make on the little f's is that X given Y equals c, so the conditional distribution of X given that Y is equal to category c, is a Gaussian; it's normally distributed. What this means does depend upon the number of features, so we're going to start off with a single feature, m equals one, because we can demonstrate that in a nice way, and then we'll expand to m being greater than one. With a single feature, this assumption boils down to the density function being the Gaussian density function given here. I'm not going to say it all with words, but this is a Gaussian density function with a mean that is class dependent, so you'll have different values for the mean depending on what class you have, which you can estimate using the formula we give below. And then we're going to assume that all of the classes have the same standard deviation sigma; here it's denoted sigma_c, but in LDA we assume that all of the sigma_c's are the same sigma. Then, if you plug this into that Bayes setup we had before, this is what we get for the probability that Y equals c given X, and we estimate it in the following way: the estimated mean for class c is given by one over n_c times the sum of the x_i, where the x_i are the observations in your training set that belong to class c, and then this is essentially the pooled sample variance over the training set as well.
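Written out, the single feature assumption and the estimates just described look like the following. The exact normalization of the variance estimate isn't read out in full above, so the pooled 1/(n − C) version below is the usual convention and is an assumption about what the notebook uses:

% Single-feature LDA: each class-conditional density is Gaussian with its own mean
% and a shared standard deviation sigma
f_{c}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - \mu_{c})^{2}}{2\sigma^{2}} \right)

% Estimates from the training set (n_c = number of training observations in class c,
% n = total number of training observations, C = number of classes)
\hat{\mu}_{c} = \frac{1}{n_{c}} \sum_{i:\, y_{i} = c} x_{i},
\qquad
\hat{\sigma}^{2} = \frac{1}{n - C} \sum_{c=1}^{C} \sum_{i:\, y_{i} = c} \left( x_{i} - \hat{\mu}_{c} \right)^{2}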
OK, so how does this make classifications? Well, you determine what class you're going to classify an observation as using what is called the discriminant function — this is why it's called linear discriminant analysis, because we look at a discriminant function. You go through all of your classes, you find the one that has the largest probability, and that's the class you determine the observation to be. So if you have three classes and, let's say, class two has the highest probability out of all of these, equal to 0.4, then once you go through the estimates, you would say that the observation you're looking at is of class two. Now, you can actually do some arithmetic and algebra, and maybe a little bit of calculus, trig, whatever, do some math, and you can come up with an expression that helps you determine what class something is without having to go through and compute all the probabilities. This is called the discriminant function. The discriminant function is given by x times mu_c over sigma squared, minus mu_c squared over two sigma squared, plus the log of pi_c. You would estimate this using the estimates we got above, and whichever one of the capital C classes has the highest of these is the one that you classify the observation as. So why is this called linear discriminant analysis? Because it is linear in x — remember, we're working with a single feature here. So that's what we've got going on.

So we are going to apply this to our iris data. The first thing we're going to do is show how to use linear discriminant analysis in sklearn, and then we're going to do a nice illustration of what's going on here, step by step. First we're going to import it: from sklearn.discriminant_analysis we import LinearDiscriminantAnalysis. Then we make our model object, just like we've done all along — with linear discriminant analysis, the hardest part of this model in sklearn is just spelling it all out — and then we fit it. So lda.fit... let's see, what did I choose? I'm just looking. OK, was it petal width or petal length? We will use petal length as our feature. And what did I do for my data split? X_train, so let me just check real quick... OK. So lda.fit on X_train's petal length column, with .values.reshape(-1, 1), and then y_train. And then once again we have this nice predict_proba; just demonstrating, all of these are the same, so copy and paste. OK. And so here is the model's estimate of the probability that it's class zero, here's the model's estimate of the probability that it's class one, and here's the model's estimate of the probability that it's class two. And here it looks like this one has an 82 or 83% chance, according to the model, of being class one.

OK, so now we're going to go through all the stuff we talked about estimating and demonstrate what's going on, to give you a better understanding. Here is that discriminant function, delta_c; I've just coded it up. It takes in an x, it takes in an estimate for the mu, it takes in an estimate for the sigma, and then it takes in an estimate for the pi, all of which will be class dependent. And what we have to do for all of these is first estimate the class-dependent averages, the mu_c's, and the way you do that is you just take the average petal length for each of the three classes. And then we go through and estimate the common variance according to the formula that I gave up above. So now we've got those estimates, and now I'm going to make a bunch of plots; a rough sketch of what this might look like in code is given below.
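Here is a rough sketch of the sklearn workflow and the by-hand estimates just described. The data loading, the split parameters, the column name petal_length, and the sample value 4.2 are all illustrative assumptions and may not match the notebook's own setup:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the iris data into a DataFrame; rename the column so it reads like the lecture
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"petal length (cm)": "petal_length"})
X_train, X_test, y_train, y_test = train_test_split(
    df[["petal_length"]], df["target"],
    test_size=0.2, stratify=df["target"], random_state=440
)

# Fit LDA on the single feature and look at some predicted class probabilities
lda = LinearDiscriminantAnalysis()
lda.fit(X_train.values.reshape(-1, 1), y_train)
print(lda.predict_proba(X_test.values.reshape(-1, 1))[:5])

# The discriminant function delta_c(x) = x*mu_c/sigma^2 - mu_c^2/(2*sigma^2) + log(pi_c)
# (sigma2 is the common variance estimate, pi is the prior for the class)
def delta(x, mu, sigma2, pi):
    return x * mu / sigma2 - mu**2 / (2 * sigma2) + np.log(pi)

# Class-dependent means, priors, and the common (pooled) variance from the training set
classes = np.sort(y_train.unique())
mus = {c: X_train.loc[y_train == c, "petal_length"].mean() for c in classes}
pis = {c: (y_train == c).mean() for c in classes}
n, C = len(y_train), len(classes)
sigma2_hat = sum(
    ((X_train.loc[y_train == c, "petal_length"] - mus[c]) ** 2).sum() for c in classes
) / (n - C)

# Classify a single petal length value by taking the class with the largest discriminant
x_star = 4.2
print(max(classes, key=lambda c: delta(x_star, mus[c], sigma2_hat, pis[c])))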
So this first plot: these are the actual distributions of petal length for all the observations in the iris training set. Over here is the distribution of petal length, the histogram, for class zero; here's the orange histogram, with the hatches going down and to the right, for class one; and then the last one is for class two. So these are the actual sample distributions. In LDA, we're assuming that these follow normal distributions with means dependent on the class and a common standard deviation. That's what we estimated — we estimated the means and the common standard deviation here — and those estimates allow us to say, well, this is what we're assuming the true distributions are: these three normal distributions. So these are the fitted normal distributions that we're assuming for the LDA model.

Now we can use these distributions to estimate the three discriminant lines: delta zero, delta one, and delta two. Delta zero is blue, delta one is orange and a dotted line, and delta two is a green dash-dot line. The determination of what the model predicts is just given by whichever delta is highest. From zero to about 2.8 or 2.9, the delta zero hat line is highest, so the model would classify everything there as a zero. From 2.9 all the way up to, it looks like, 4.8, the delta one hat line is the highest, and so that's where the model would predict observations to be a one. And then finally, from there all the way up to seven, the green line, the delta two hat line, is the largest, so the model would predict that the observation is a two. And here are the observations that go along with that: predicted iris class on the vertical, petal length on the horizontal. OK, so hopefully that helped illustrate what's going on with a single feature.

The idea is exactly the same with multiple features, but it's harder to visualize. For multiple features, instead of assuming that the distribution follows a normal distribution, you assume that it follows a multivariate normal — essentially a normal distribution in each direction. For instance, here is a picture from Wikipedia of a two-dimensional normal distribution: this Y is normally distributed, this X is normally distributed, and together this is a multivariate normal distribution in two dimensions. OK. And again, we're assuming that the mean is dependent upon the class you're looking at, a.k.a. this mu_c, and then we're also assuming a common covariance matrix regardless of class, so the covariance between any of the features is the same regardless of what class you're looking at. Once again, you end up with a linear discriminant — here's the function; it's linear in x. So again, the LDA classifier chooses the class c with the highest estimated delta_c.

Making this in sklearn is exactly the same as in the single feature case. Here I use petal width and petal length because I want to be able to plot: petal width is on the horizontal, petal length on the vertical. The blue dots are training zeros, the orange triangles are training ones, the green X's are training twos. And then these shaded-in regions, if you can see them, show the decision boundaries. In the blue shaded area, delta zero, the discriminant function for class zero, is highest; the discriminant function for class one is highest in the orange shaded region; and finally, the discriminant function for class two is highest in the upper right-hand corner, in the green shaded region. A rough sketch of fitting the two-feature version and drawing these regions is given below. OK.
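As a rough sketch of the two-feature version — here fitting on the full iris data rather than the notebook's train/test split, with column choices and plotting details that are illustrative rather than the notebook's exact code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two features this time: petal width and petal length
iris = load_iris(as_frame=True)
X = iris.frame[["petal width (cm)", "petal length (cm)"]].values
y = iris.frame["target"].values

# Fitting with two features looks exactly the same as with one
lda2 = LinearDiscriminantAnalysis()
lda2.fit(X, y)

# Evaluate the fitted classifier on a grid of points to shade the decision regions
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 0.2, X[:, 0].max() + 0.2, 300),
    np.linspace(X[:, 1].min() - 0.2, X[:, 1].max() + 0.2, 300),
)
preds = lda2.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, preds, alpha=0.2)  # shaded regions: predicted class
plt.scatter(X[:, 0], X[:, 1], c=y)      # the observations, colored by true class
plt.xlabel("petal width (cm)")
plt.ylabel("petal length (cm)")
plt.show()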
So now we've reviewed Bayes' rule, and we've gone over one assumption, where we assume that these class-conditional distributions follow multivariate normal distributions with a common covariance. That's linear discriminant analysis, because when you make that assumption, you end up with a discriminant function that's linear in x. OK, so that's it for this first video. The next video will finish out the notebook and talk about, as you can see, quadratic discriminant analysis and then the more general naive Bayes. All right, I hope to see you in the next video, where we keep learning about naive Bayes and Bayes-based classifiers. Bye.