Bayes-Based Classifiers II
Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, welcome back. In this video we finish up our work on Bayes-based classifiers. Let me go ahead and share my notebook. If you're starting over fresh, meaning you watched the previous video and are now coming back later after maybe having closed your notebook, make sure you go through and rerun the notebook all the way from the beginning until you get to the point right above quadratic discriminant analysis. You're going to need some of the code we ran above in order for the bottom half to work properly.

So, as we were saying, let me scroll back up. I'm not going to rerun anything, just show you what we were talking about. We have this nice result called Bayes' rule, which is essentially this: the probability of A given B is equal to the probability of B given A times the probability of A, divided by the probability of B. We then sometimes expand the bottom using the law of total probability, which is pictured right here. We can use this to set up a classification problem where the probability that Y equals c, given the features that you observe, is equal to this fraction, which I'm not going to say out loud. Here pi_c is the probability that Y is equal to c in general, regardless of the features, and f_c(x*) is the probability density function, assuming it exists (which we are going to assume for all of our models), evaluated at X = x*. We talked about linear discriminant analysis, which assumes these densities are multivariate normals with class-dependent means, which we estimate, and a common covariance matrix (in this example, with a single feature, a common variance sigma). We saw that this gives a linear discriminant function, which is visualized with these linear decision boundaries, so you're just drawing straight lines in your data space.

The next extension of this idea is known as quadratic discriminant analysis. Here you drop the assumption that you have a common covariance matrix. Before, when we wrote the part that's highlighted, only the mean mu had the subscript for the class, meaning we had a specific, different mean for each class. Now we're also going to assume that each of the covariance matrices depends on the class. So that's a multivariate normal distribution whose two parameters both depend entirely on what the class of Y is. If you perform QDA and work it all out, you get this as a discriminant function, which again I'm not going to say out loud, but you can read it if you'd like; it's a mouthful to try and say it all. In matrix algebra speak, which maybe isn't familiar to everyone (it wasn't familiar to me for a long time), the part I've highlighted here would be considered the squared, the quadratic part. What I'm now highlighting is my discriminant function when I make this assumption on the little f's from Bayes' rule above. It is now a quadratic function in X, which is why it's called quadratic discriminant analysis. We're going to do quadratic discriminant analysis on this same data set, and we will then compare the decision boundaries that show up from LDA and from QDA.
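For reference, here is a sketch of the formulas being pointed to on screen, written in the notation used above (pi_c for the class prior, f_c for the class-conditional density). The exact symbols in the notebook may differ slightly.

```latex
% Bayes' rule set up as a classifier: posterior probability of class c
% given the observed features x*.
P(Y = c \mid X = x^*) = \frac{\pi_c \, f_c(x^*)}{\sum_{k=1}^{C} \pi_k \, f_k(x^*)}

% QDA: assume X \mid Y = c \sim N(\mu_c, \Sigma_c) with a class-specific
% covariance matrix. Taking logs and dropping terms common to every class
% gives the quadratic discriminant function
\delta_c(x) = -\tfrac{1}{2}(x - \mu_c)^\top \Sigma_c^{-1} (x - \mu_c)
              - \tfrac{1}{2}\log\lvert\Sigma_c\rvert + \log \pi_c
```

The term that is quadratic in x, -1/2 x^T Sigma_c^{-1} x, is why the resulting decision boundaries are curves rather than straight lines.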
So QDA is implemented in sklearn in just as straightforward a way as LDA was. I don't know if that's the proper sentence, but that's what I wanted to say. From sklearn.discriminant_analysis we import QuadraticDiscriminantAnalysis, then we make our QDA object by just calling QuadraticDiscriminantAnalysis(). Then we fit it: qda.fit() with the petal width and petal length columns of X_train (as .values) and y_train. And make sure we run all of this code; we don't just type it, we run it as well. The next chunk makes a nice grid, which allows me to shade the feature space, and then we plot the decision boundaries: essentially we go through the grid and plot the predictions for LDA, and then we do the same thing for QDA.

So here we have LDA and its decision boundary on the left. On the right we have QDA and its decision boundary, with petal width on the horizontal axis and petal length on the vertical axis for both. We can see the big difference between QDA and LDA: QDA gives us nonlinear boundaries in general. You can see that here because they're curved, they're not lines. That's a nice feature of QDA; it's more flexible, and it allows you to classify more flexible data sets better. But it can also be a weakness, because as you can see QDA is predisposed to overfit the data a little bit. So I would say, and again I'm just observing the data we have, that out of the two, LDA may ultimately generalize better to future observations, because QDA fits the training points relatively closely. If you look at the region that would be classified as a one, QDA fits the training points pretty closely over here and sort of envelops them. So, for instance, if we were then to go out and measure future irises and one landed just a little bit out here, it would be classified as a two even though it might be a one. You have to think about that when you choose the model, so it can be helpful to plot what the model is doing in addition to looking at metrics.

Another reason QDA is maybe not always the best choice is that when you make this assumption, you need a lot of data points to be able to estimate capital C different covariance matrices. Remember, each of the classes (and we're assuming we have capital C of them) gets its own unique covariance matrix, Sigma_c, and each of those needs to be estimated. A covariance matrix can be quite big and needs a lot of data to estimate, depending on the number of features you have. If you only had one or two features, it might not be so bad, but we may work on problems where we have tens of features, and if you're working on a problem with hundreds of features, QDA may just not be possible for you. So that's something to consider: quadratic discriminant analysis may have higher variance than LDA, so it may overfit your data, and you also need a lot of training points to fit it in comparison to LDA, where we only have to estimate the covariance matrix once.

The last model we'll work on in this notebook is maybe the most general of these.
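Here is a minimal sketch of the QDA fit and the LDA-versus-QDA decision-boundary plot described above. The variable names (X_train, y_train, xx, yy, grid), the use of train_test_split, and the random seed are assumptions about the notebook rather than a copy of it; the notebook itself works from a pandas DataFrame with petal width and petal length columns.

```python
# Sketch: fit LDA and QDA on two iris features and shade their decision regions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

# iris features: petal width (column 3) and petal length (column 2)
iris = load_iris()
X = iris.data[:, [3, 2]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=216, stratify=y)  # seed is arbitrary here

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

# grid over the feature space so we can shade each model's predicted region
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - .2, X[:, 0].max() + .2, 300),
                     np.linspace(X[:, 1].min() - .5, X[:, 1].max() + .5, 300))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)
for ax, model, title in zip(axes, [lda, qda], ["LDA", "QDA"]):
    preds = model.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, preds, alpha=.2)      # shaded decision regions
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor="k")
    ax.set_title(title)
    ax.set_xlabel("petal width (cm)")
axes[0].set_ylabel("petal length (cm)")
plt.show()
```

With this two-feature iris example, the LDA panel shows straight-line boundaries while the QDA panel shows curved ones, which is the comparison discussed above.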
It's the naive Bayes classifier. Maybe "most general" is not the best way to describe it, but it is a different extension of this idea. Remember, we're using Bayes' rule to make these estimates, and in order to do that we need to estimate the capital C class priors, the pi_c's. As we mentioned in the previous video, this is relatively straightforward: we just take the percentage of each class in the training set, regardless of features. Then we have to estimate C M-dimensional density functions; those are the little f_c's. They are density functions, we have to be able to estimate them, and there are capital C of them. In linear discriminant analysis and in quadratic discriminant analysis we made a very strong assumption on their form, namely that they were multivariate normals. We relaxed it a little bit with QDA by allowing them to have different covariance matrices, but we were still making a very specific assumption on the form to make the estimation a little easier on us.

The way that naive Bayes approaches this hard estimation problem is to make a different kind of assumption. These density functions are multivariate density functions, meaning they're joint probability densities. With naive Bayes we instead assume that all of our features are independent of one another. That assumption allows us to break up f_c into a product of M (remember, we have M features) single-variable density functions, which changes our probability setup into this. Before we were estimating multivariate normals, which are joint densities; these are a product of M univariate densities, which can be easier to estimate in general and maybe allows us some flexibility. In theory, this would allow us to estimate one type of distribution for X_1, the first feature, a second type of distribution for the second feature, and so on up to an Mth type of distribution for the Mth feature, whereas with LDA and QDA we're limited to the multivariate normal.

What's commonly done with naive Bayes for quantitative variables is to assume a normal distribution, which is different from what we do for LDA and QDA, because here we're assuming all of these variables are independent of one another. So if X_1 is quantitative, we would estimate a single normal distribution with its own mean and its own variance (not a covariance matrix, just a variance) for X_1. But if, say, X_2 were categorical, we could use something like a Bernoulli distribution, which is just a coin toss, and estimate the value of p accordingly. So again, the difference between naive Bayes and LDA or QDA is that here we take a multivariate distribution, or density function, and break it up into a product of M independent single-variable distributions. You might ask yourself whether independence is a good assumption. It's not always a good assumption, in the sense that some of the features may actually be related to one another; maybe X_1 and X_2 do have a strong relationship. But even if this assumption doesn't hold, you can sometimes get decent or good classifiers using naive Bayes.
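To make that factorization concrete, here is a sketch of the naive Bayes setup in the notation used above (M features, f_{c,j} for the density of feature j within class c); the symbols on screen may differ slightly.

```latex
% Naive Bayes independence assumption: within each class c, the joint
% density factors into a product of M univariate densities.
f_c(x^*) = \prod_{j=1}^{M} f_{c,j}(x_j^*)

% Plugging this into the Bayes' rule classifier from before gives
P(Y = c \mid X = x^*)
  = \frac{\pi_c \prod_{j=1}^{M} f_{c,j}(x_j^*)}
         {\sum_{k=1}^{C} \pi_k \prod_{j=1}^{M} f_{k,j}(x_j^*)}
```

Because each univariate density is estimated separately, you only need enough data to pin down a handful of parameters per feature per class, rather than a full covariance matrix.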
Getting a good classifier out of naive Bayes is particularly likely if you don't have enough data to reasonably estimate the distributions we use in LDA and QDA. Remember, we talked about how with QDA you need a lot of data to estimate the covariance matrices. Here, the independence assumption reduces the amount of data you need, because you're estimating a product of single-variable distributions instead. One way to think about naive Bayes is that, compared to LDA or QDA, it's kind of like adding bias to the model. In terms of the bias-variance tradeoff, we can think of naive Bayes as adding bias, and maybe by adding that bias we reduce the variance enough to get a better prediction.

Naive Bayes can be implemented in sklearn with the naive_bayes module, which has a couple of different types of naive Bayes models. The one we're going to work with is Gaussian naive Bayes, because all of our features are quantitative: we're essentially going to assume that petal length and petal width each follow their own independent normal distribution. You can find out about this specific class here and the module in general here. Now I'm going to show you how to fit a Gaussian naive Bayes model object from sklearn on the iris data set (a sketch of this code is also included at the end of the transcript). From sklearn.naive_bayes we import GaussianNB, then we make the model; I believe I called it nb, so nb is equal to GaussianNB(). Then nb.fit() with the petal width and petal length columns of X_train (as .values) and y_train. And now this is going to make the exact same plot that we did above, but it'll plot all three of those models.

So we have LDA over here on the left, QDA in the middle, and naive Bayes on the right. We can kind of see that naive Bayes, in this particular example, is sort of a happy medium in some sense between LDA and QDA. I think personally it will generalize a little better than QDA, which is kind of what we said for LDA, right? But it also allows for nonlinear decision boundaries like QDA, which LDA does not.

OK, so that is it for this notebook. We've gone over quite a bit across two videos. We reviewed Bayes' rule; we talked about LDA and discriminant functions, because we used Bayes' rule to set up a classification problem; we demonstrated what LDA does on a single variable; we extended LDA with QDA and got some nonlinear boundaries; and then we took a slightly different approach by assuming independence of all of our features to use naive Bayes as an estimator. OK, so that's it for our Bayes-based classifiers. I hope you enjoyed learning about these different approaches. I enjoyed teaching them to you, and I hope to see you in the next video. Have a great rest of your day. Bye.
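As promised above, here is a minimal sketch of the Gaussian naive Bayes fit and the three-panel comparison plot. It reuses the lda, qda, X_train, y_train, xx, yy, and grid objects from the earlier sketch; those names, and the column choice, are assumptions about the notebook rather than a copy of it.

```python
# Sketch: fit Gaussian naive Bayes and compare all three decision boundaries.
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB().fit(X_train, y_train)

fig, axes = plt.subplots(1, 3, figsize=(18, 5), sharex=True, sharey=True)
for ax, model, title in zip(axes, [lda, qda, nb],
                            ["LDA", "QDA", "Naive Bayes"]):
    preds = model.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, preds, alpha=.2)      # shaded decision regions
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor="k")
    ax.set_title(title)
    ax.set_xlabel("petal width (cm)")
axes[0].set_ylabel("petal length (cm)")
plt.show()
```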