On with classification. We have three notebooks that I want to try to get through today, and all three of them are long, so we're going to do our best. I will stop for questions, but I'll also try to move things along, because it's a lot of content. We're in classification, and we're going to start with notebook number 6, Bayes-based classifiers, and from there move on to support vector machines. We are going to skip notebook number 7; I encourage you to go through it on your own time, especially if you're doing a multi-class classification problem for your project. And after that, we'll leave the world of supervised learning for a brief notebook on unsupervised learning to end today.

So let's talk about Bayes-based classifiers. Here, "Bayes" refers to Bayes' rule from probability theory, so I want to start by reviewing it. Some of you may not remember Bayes' rule, which is perfectly fine if you're not using it regularly, or maybe you've never been introduced to it before. To understand the idea of Bayes' rule, there are two things you need to remember, or to learn if you never have. The first is the definition of conditional probability.

Suppose we have two events, A and B. If you're new to probability theory, think of an event like tossing a coin and seeing how it lands: one event would be that the coin lands tails, the other that it lands heads, and that's essentially all the possibilities. In theory the coin could land on its side so that it's neither heads nor tails, but I'm going to say that has probability 0 in the real world, probably.

So if you have these two events, and the probability of B is not 0, then the probability of event A happening conditional on the fact that event B has happened — written P(A|B), where the vertical line means "given" or "conditional on" B — is the probability of their intersection divided by the probability of B: P(A|B) = P(A ∩ B) / P(B).

One way to visualize this is with a picture. If all the possible outcomes are represented by a rectangle, all the situations where A happens are this left circle, and all the situations where B happens are this right circle, then the probability of A conditional on B is the green region where the two circles intersect, divided by the entirety of B. That's because the intersection is the only part of the space where B happens and A also happens.

The other concept that gets used a lot alongside Bayes' theorem is the law of total probability.
Say we have a sequence of disjoint events B_1, B_2, ..., meaning that the intersection between any two of these events is the empty set, and that the union of all of them covers the entire space. One way to visualize it: here's that rectangle from before — this is called the event space — and you have this sequence of smaller events B_i such that when you take the union of all of them, you get the entire space. Then it holds that the probability of any other event A is equal to the sum of the probabilities of the intersections of A with those B_i: P(A) = Σ_i P(A ∩ B_i). In other words, if you have a set of events that segment the space, then the probability of any event is just the sum of the probabilities of that event intersected with the different segments. Here is a nice graphic visualizing that (well, I think it's nice, I made it): the circle represents event A, and all these different weird-looking pieces represent the different B_i. You can see that if you take the intersection of A with the different pieces and add them all up, you get A back.

Bayes' rule is sort of a combination of these. The main part of Bayes' rule, also known as the Bayes–Price theorem, is that the probability of A conditional on B is equal to the probability of B conditional on A, times the probability of A, divided by the probability of B: P(A|B) = P(B|A) P(A) / P(B). This part on its own has nothing to do with the law of total probability; it's just reworking the definition of conditional probability — you rewrite the intersection as a different conditional probability. The part where the law of total probability comes in is that it's often advantageous to rewrite the denominator: you know that the probability of B is equal to the probability of B intersect A plus the probability of B intersect A complement (all the outcomes where A doesn't hold), and then you can once again use the definition of conditional probability to rewrite it as P(B) = P(B|A) P(A) + P(B|A^c) P(A^c).

Regardless of whether you remember any of the classifiers we're about to learn today, Bayes' rule is just a useful thing to know for data science or data analyst interviews. If you get to a company that does screener questions over the phone or on Zoom, there's a whole class of questions that are basically "apply Bayes' rule and you can get the right answer." So this is a good thing to know, both the original version and the version that takes advantage of the law of total probability.
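As a concrete illustration of that interview-style use of Bayes' rule, here is a small worked sketch; the screening-test setup and every number in it are made up for illustration, not taken from the lecture.

```python
# Hypothetical screening-test example (all numbers made up for illustration).
# A = "has the condition", B = "test comes back positive".
p_A = 0.01             # P(A): prior probability of the condition
p_B_given_A = 0.95     # P(B | A)
p_B_given_notA = 0.05  # P(B | A complement)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' rule: P(A|B) = P(B|A)P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))  # about 0.161
```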
And it's a good thing to practice: if you just do a web search for Bayes' rule practice interview problems, you'll find a whole bunch of them. This is just a good thing to know, regardless of whether you remember the different classifiers we're going to learn today. And, assuming this link still exists (I forgot to check), there's a nice visualization picturing what Bayes' rule looks like in this kind of setting; I figured it was easier to give you the link than to try to draw it myself. Brooks is telling me that the link still works — thanks, Brooks. I wrote this notebook a few years ago, so it's always a gamble if I forget to check whether a link still goes where I think it goes.

Okay, so now that we remember, or have learned, Bayes' rule, we can see how to use it for classification. For simplicity of setup, imagine a space where you have two possible classes, but in general we have our matrix of features X and an output variable y that can take on capital C possible categories. Just as a reminder, when we had C = 2, we were trying to predict the probability that y equals 1 given the features, meaning X = x*, where x* is a particular set of features for which we'd like to know: given these features, what's the probability that my output is 1?

You can use Bayes' rule to rewrite this. It might look different from the setup above, because that was for discrete events and here we have to account for the fact that some things may not be discrete. You can rewrite the probability that y equals c, given the features, as π_c times f_c evaluated at x*, divided by the sum over l = 1 to capital C of π_l times f_l evaluated at x*. Here little c denotes the particular category that y is equal to; in the binary case it might be 1.

So what are all these different π's and f's? π_c is known as a prior probability: the probability that a randomly chosen observation comes from the c-th class, ignoring the features. In practice, you typically just estimate this using the training set. If your training set had a 50-50 split, both of your π's would be estimated at 0.5; if it had a 60-40 split, π_0 would be estimated at 0.6 and π_1 at 0.4.
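Here is a minimal sketch of estimating the priors from class proportions; the y_train values below are made up, and in the notebook you would use the actual training labels.

```python
import pandas as pd

# Made-up training labels; in the notebook this would be y_train.
y_train = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 1, 0])

# pi_c is estimated by the fraction of training observations in class c.
pi_hat = y_train.value_counts(normalize=True).sort_index()
print(pi_hat)  # a 60-40 split gives pi_0 = 0.6 and pi_1 = 0.4
```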
Here we can think of f_c as the probability density function of observing x*: the probability of observing x* given that we know the category is y = c. Writing it that way sort of assumes X is a qualitative variable, but you could rewrite it; I just wanted to show how it connects back to Bayes' rule. So we can see this is Bayes' rule: the probability of X given y, times the probability of y, divided by the probability of X — that's what we've done here.

The three different types of classifiers we're going to learn in this notebook are all basically just giving us ways to estimate the f's. We already have a way to estimate the π's: we use whatever fraction of each category exists in our training set. Now we have to figure out how to estimate these conditional probability densities. The reason we have three different algorithms is that we're going to look at three different assumptions on f, and those result in different algorithms. Before we dive into the three algorithms, are there any questions on Bayes' rule or the setup for these classifiers?

Okay, so to help demonstrate these three algorithms, we're going to return to the iris data set. We're going to plot it, and I just want to make sure we have a reminder: we're in classification world, and when we do classification we want to stratify our data splits. That means including the stratify argument when we do train test splits and, if we're doing cross-validation, using stratified K-fold. Just a reminder, because we learned this stuff last week, we've had a whole weekend, maybe we had some fun and forgot.

Here's the data we're looking at, plotting petal length against petal width. We've got y = 0, y = 1, y = 2, and each observation represents a different type of iris: I believe setosa are the blue circles, versicolor are the orange triangles, and virginica are the green X's. Our goal is to build classifiers that make these classifications using this Bayes' rule setup — for us, the probability that y = 0 given X, the probability that y = 1 given X, and the probability that y = 2 given X.

The one we'll spend the longest on today is linear discriminant analysis, abbreviated LDA. As a quick note, LDA is an ambiguous acronym, so if someone asks "are you going to use LDA for this?", be aware of what setting you're in: LDA is also the abbreviation for latent Dirichlet allocation, which is from NLP.
But for us, since we're just doing classification, when we say LDA we mean linear discriminant analysis.

The model assumption made for LDA is that the distribution of X conditional on y = c is Gaussian, and what that means depends on the number of features. For illustrative purposes we're going to look at a single feature, and then I'll present the extension; I'm doing the single-feature case to help you build intuition for what's going on, but in general you can have more than one feature. For a single feature, this assumption gives you a normal density function where, in principle, both the mean and the standard deviation depend on the class. But in linear discriminant analysis this σ_c is assumed to be the same for all the classes: the distributions are allowed different means for the different classes, but we assume all of the classes share the same standard deviation.

When you do this, you can rewrite your P(y = c | X = x) from the expression above, and if you want to estimate μ_c and σ, you use the formulas I've provided here: the class-dependent sample means for the μ_c, and then, I believe it's the pooled standard deviation formula, for the shared standard deviation.

When we make classifications in the multi-class setting, you typically go with whichever probability is largest: whichever of our three classes has the largest estimated probability that y equals c is the one you predict. Using some algebra and some logarithm manipulations, you can show that choosing the class with the largest probability is the same as choosing the class c for which the discriminant function is largest. The discriminant function δ_c(x) is equal to x times the mean for that particular class divided by the variance, minus the mean for that class squared divided by two times the variance, plus the log of the prior probability of being in class c — using the estimates from above. This is known as the discriminant function for linear discriminant analysis.

We're going to go through how to fit this in sklearn and then do a step-by-step breakdown of what's going on. sklearn has linear discriminant analysis stored in sklearn.discriminant_analysis, and the class is just called LinearDiscriminantAnalysis — that was a mouthful. So we would do from sklearn.discriminant_analysis import LinearDiscriminantAnalysis; then, to save myself typing, I'm just going to copy and paste that here. Then we're going to go ahead and fit the data, so we do LDA.fit.
And did I make it X_train and y_train? I sure did. So we do X_train, y_train. And just like logistic regression and k-nearest neighbors, predict_proba works the same way, so we do LDA.predict_proba(X_train). The 0 column gives the y = 0 class probabilities, the 1 column the y = 1 class probabilities, and the 2 column the y = 2 class probabilities.

Before we do a step-by-step breakdown of what LDA is doing, are there any questions about the sklearn code or the setup of LDA?

"I'll just ask a quick question. For the classification, you go with the class where the discriminant function is largest?"

Sorry, give me one second. Say that again?

"You choose the class c for which the discriminant function is the largest, right? But then there's going to be another function to figure out the other classes, right?"

So this posterior probability is what you're estimating, and the rule is that for each observation we assign the class where that expression is largest. As we'll show in a little bit, that's equivalent to choosing the class where the discriminant function is largest.

"Oh, okay."

That's why it's called linear discriminant analysis: the discriminant function is linear in the features.

"That makes sense."

Yep. And I think the main idea is that dealing with the discriminant function is easier to handle, easier to get your head around, than dealing with the posterior expression directly.

Ramazan is asking: does GaussianNB give the same results? That's a different model we're going to learn, and it does give different results, as we'll see later in this notebook. They're two different models; both are based on this Bayes' rule rework, but they make different assumptions.

Okay, I think this will now get closer to Brooks' question. Here is a function I've made that takes in your feature value, your μ-hat, your σ-hat, and your estimated prior probability of being in class c. What we're going to do is look at the discriminant function for a single feature. We fit with multiple features above, and I think I meant to do this with just a single feature, so let me redo this real quick, because I just wanted petal length. You might be wondering why we're doing this — don't worry, it will make sense when I get through, if you're confused.
So I meant to do this with a single feature — let me refit and then redo this part. I just wanted to make an apples-to-apples comparison of doing it by hand versus the sklearn version. Basically, I've got my discriminant as a function here: it takes in a value of x, an estimated value for the mean (which, remember, depends on the class), an estimated value for the sigma (which does not depend on the class — it's the same for all three), and the estimated prior probability for each class. Here I calculate the estimates for those means, and here I calculate the estimate for the variance. Then I go through and plot.

We're just going to walk through what's being assumed. I'm doing it with a single feature because I can plot and visualize that, whereas in higher dimensions it gets harder. These are the actual sample distributions from the training set — histograms of petal length, with petal length on the horizontal axis — for y = 0, y = 1, and y = 2. What LDA assumes is that all of these are normal distributions, and these are the normal distributions we fit using the training data: they all have the same variance (or standard deviation, whichever you like), but they have different means. That's what we get from fitting on the training data.

From this we get the resulting discriminant lines — the fitted δ_c's — and the class we predict is the one whose line is highest. Here the solid blue line is the discriminant function for class 0: everywhere from a petal length of a little less than 3 and to the left, we would predict class 0; in the small region from a little less than 3 to around a little less than 5, we would predict class 1; and from that point onward to the right we would predict class 2. That's because the corresponding discriminant function is the largest in each of those regions. Then we can see the corresponding predictions for each petal length: the class 0 predictions are where the blue line is on top, the class 1 predictions are where the orange dotted line is on top, and the class 2 predictions are where the green dash-dot line is on top.

So that's the idea with LDA. Are there any questions on this hopefully less confusing breakdown, now that we're through it? One note: I originally fit this using all four columns of the training set, so if you're following along, you have to go back and re-do the fit with just petal length, and then when you come to run the predict again it will work.
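As a companion to that walkthrough, here is a hedged sketch of the single-feature, by-hand fit. The column name 'petal_length' and the variables X_train and y_train are assumptions about the notebook's setup, and the discriminant follows the δ_c formula from above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumed setup: X_train is a DataFrame with a 'petal_length' column,
# y_train holds the classes 0, 1, 2.
x = X_train['petal_length'].values
classes = np.unique(y_train)

# Estimates: class priors, class means, and one pooled (shared) variance.
pi_hat = {c: np.mean(y_train == c) for c in classes}
mu_hat = {c: x[y_train == c].mean() for c in classes}
n = len(x)
sigma2_hat = sum(((x[y_train == c] - mu_hat[c]) ** 2).sum() for c in classes) / (n - len(classes))

def discriminant(x_star, c):
    # delta_c(x) = x * mu_c / sigma^2 - mu_c^2 / (2 sigma^2) + log(pi_c)
    return (x_star * mu_hat[c] / sigma2_hat
            - mu_hat[c] ** 2 / (2 * sigma2_hat)
            + np.log(pi_hat[c]))

# Predict the class whose discriminant is largest, then compare with sklearn.
by_hand = np.array([max(classes, key=lambda c: discriminant(xi, c)) for xi in x])

lda = LinearDiscriminantAnalysis().fit(X_train[['petal_length']], y_train)
agreement = np.mean(by_hand == lda.predict(X_train[['petal_length']]))
print(agreement)  # expect 1.0, or very close (the variance estimates can differ slightly)
```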
Jonathan is asking: so the fitted normals aren't really from the data points themselves, but from the group variance and the per-class means computed from the data? To get these fitted normals, you calculate the means for each individual class using the training data, and then you compute the pooled standard deviation estimate from above. So that's the fitting procedure: they are coming from the data — we're using the data to estimate the distributions.

Are there any other questions? Ernesto is asking what happens if your points are far from the training distributions — then this wouldn't be a good model. Right: in general, if you're fitting a model that has certain assumptions and your data egregiously violates those assumptions, it's probably not going to be a good model. Okay, to keep things moving along, since we have a lot to get through today, I'll hold any other questions until later.

That was seeing it for a single feature. In general, for multiple features, the assumption is that the conditional distribution is a multivariate normal with a class-dependent mean vector — you have means in each of the individual components — and a single shared covariance matrix regardless of class. As an example, here's what a bivariate normal distribution looks like; try to imagine this in higher dimensions. If you'd like to go through it yourself, these are the formulas: this is f_c(x), and this is the resulting discriminant. For more than one dimension I'm not going to plot the discriminants like we just did, but I will plot the fitted model, which we saw earlier. I'm restricting to petal width and petal length just so I can show you what the classification regions look like. This is the training set: the training observations are the outlined points — blue circles, orange triangles, green X's — and the shaded regions show what the algorithm would predict in each region. Everything in the blue shaded region gets predicted as 0, everything in the orange shaded region as 1, and everything in the green shaded region as 2. The other areas are white just because of the range of the plot; if I extended the predictions to those regions they would also be shaded, so just ignore the parts of the plot that are white.
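If you want to reproduce shaded prediction regions like these on your own, here is one hedged way to do it with a prediction grid; the column names and plotting choices are assumptions, not the notebook's exact code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumed setup: two columns, petal width and petal length.
features = ['petal_width', 'petal_length']
lda = LinearDiscriminantAnalysis().fit(X_train[features].values, y_train)

# Build a grid over the plotted range and predict the class at every grid point.
w = np.linspace(X_train[features[0]].min() - 0.5, X_train[features[0]].max() + 0.5, 300)
l = np.linspace(X_train[features[1]].min() - 0.5, X_train[features[1]].max() + 0.5, 300)
WW, LL = np.meshgrid(w, l)
region = lda.predict(np.c_[WW.ravel(), LL.ravel()]).reshape(WW.shape)

# Shade the predicted regions, then overlay the training observations.
plt.contourf(WW, LL, region, alpha=0.2)
plt.scatter(X_train[features[0]], X_train[features[1]], c=y_train)
plt.xlabel('petal width')
plt.ylabel('petal length')
plt.show()
```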
We should also point out that linear discriminant analysis results in linear decision boundaries. The edges of these shaded regions are what are known as decision boundaries — basically the boundaries between where different classes are predicted. A linear decision boundary doesn't always work well, so we're going to learn two additional types of algorithms in this notebook that make different assumptions on the f's and allow for nonlinear decision boundaries.

The first of these is quadratic discriminant analysis. The assumptions are the same except for one very crucial detail — let me zoom in so you can see this formula — the covariance here is also assumed to be class dependent. In linear discriminant analysis we assumed the same covariance matrix for all of the classes; in quadratic discriminant analysis we assume different covariances depending on which class you're looking at. When you're doing quadratic discriminant analysis, this is your discriminant function, and it's quadratic in the features.

I'm not going to go step by step like I did for LDA, because the process is similar; we would just now estimate the covariance for each individual class. The fitting process in sklearn is the same: we import QuadraticDiscriminantAnalysis from sklearn.discriminant_analysis, and once again I'll copy and paste to save myself time. Then we fit, QDA.fit, and once again I'm restricting myself to petal width and petal length — and I'd better put those in a list or I'll get an error. Now I'm going to plot the linear discriminant analysis on the left-hand side and the quadratic discriminant analysis on the right-hand side. The left is the same picture as before, but on the right you can see how quadratic discriminant analysis allows for nonlinear decision boundaries.

Just because it's not linear doesn't mean it's better, though. As you can see, the blue and orange regions are vastly limited, and most of the plot would be predicted as iris type 2. It's hard to tell without additional observations, but it seems unlikely that iris type 2 would take over so much of the plot in the real world. It's a different, more complex model — remembering our bias-variance trade-off notebook, more complex in this case because we're allowing different covariance matrices depending on the class — and it tends to maybe overfit the data.
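For completeness, here is a minimal sketch of the QDA fit just dictated; X_train, y_train, and the two column names are assumptions about the notebook's variables.

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Same interface as LDA; the difference is the class-dependent covariance assumption.
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train[['petal_width', 'petal_length']], y_train)

# Class probabilities work the same way as with LDA's predict_proba.
qda.predict_proba(X_train[['petal_width', 'petal_length']])
```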
But if you have data that very clearly does have a nonlinear decision boundary, you may want to use a model that allows for that.

Okay, I see I have a question. Ernesto is asking: can you provide an answer with these classifiers stating some level of confidence? For example, if a data point is classified as y = 2 but is also close to the boundary with y = 1, would you be able to state some sort of error associated with that prediction from the model? It probably is possible to provide some sort of confidence interval or prediction interval. I'm not sure how frequently that sort of thing is implemented in industry, or if people just look at the probability. I think you probably could get some sort of interval, but I'm not sure how to do it without diving into the literature a little further, so if you're interested in doing that, you'd have to look into it.

Brooks is saying: isn't predict_proba a kind of confidence? predict_proba gives a point estimate of the probability of being in a given class, and using that point estimate you could provide a confidence interval around it. It is a measure of confidence in the everyday, two-people-talking sense, not the statistical concept of confidence: if you see something with a much higher estimated probability, you personally are more confident that it's correctly assigned to that class than for one with a lower probability, which is slightly different from the statistical concept of confidence.

People might be wondering when to use LDA versus QDA. LDA works better than QDA for smaller data sets — why is that? With QDA you have to estimate more parameters: the assumption in LDA is that the covariance matrix is the same for all the classes, so you only have to estimate one set of covariances, while in QDA you have to estimate many more, because each class gets its own covariance matrix. You also may think the data can be separated linearly, meaning a linear decision boundary, and if that's the case you should probably just use LDA. QDA may give you a better fit if you have a very large data set, and especially if you have data that you think is not separable by a linear boundary.

Another way to get a nonlinear decision boundary is what's known as the naive Bayes classifier. I believe this was Ramazan's question earlier; we're going to come back to that now.
Instead of making an explicit assumption on the functional form of the f_c — this is the most general form of naive Bayes — we make an assumption about how the f_c factors. f_c is a joint probability density, and the assumption for naive Bayes is that, while we still have a joint density, all of the individual features are independent of one another. That means the joint density can be broken down into a product of individual univariate densities, one for each feature. Typically, in implementations, for continuous features you assume something like a Gaussian, and for categorical features you assume something like a Bernoulli, so zeros and ones.

That's the idea: the big assumption for naive Bayes is that you're — maybe naively, because it might not be a good assumption — assuming each of the features is independent of the others, which lets you rewrite the density as that product. The point is that it's much easier to estimate individual univariate densities than a giant joint distribution.

You might be thinking this seems like a pretty strong assumption that probably isn't going to hold. One reason naive Bayes can work better than either LDA or QDA is that making this assumption actually introduces a lot of bias into the model, and — remember the trade-off — you can sometimes get better performance because you're increasing bias in a way that decreases the variance enough that the generalization error tends to go down.

I think I said this already, but typically what gets assumed is that a quantitative feature follows a normal distribution and a categorical feature a Bernoulli. Bernoulli, if you haven't heard the term before, is just the name for a coin toss; in this case a biased coin toss whose value of p is the proportion of observations.

Ramazan has a question: would Bayes models give confidence intervals? I thought we usually just interpret the posterior distribution, as opposed to the confidence intervals of a frequentist approach. My guess — I don't know for sure — is that you could do the Bayesian statistics approach and get, I forget what those are called, credible intervals or something, and there probably is also a frequentist approach that would let you get confidence intervals on different things.
Because I think LDA was developed before Bayesian statistics became a big thing, I would imagine there is a way to get classical confidence intervals or prediction intervals. Again, I don't know for sure, but I think there's probably a way to do it.

Okay, so how can we implement this? We can do it with GaussianNB. One big downside of sklearn's naive Bayes implementations is that all of your variables have to be the same type of feature: they either all have to be continuous or all have to be categorical, which is sort of a downside — at least as of last year; they may have updated it this past year. You can't have a situation where some columns are continuous and some are categorical; I'm not sure they have a version of the model that incorporates both. Our data has all continuous features, so we're going to use the GaussianNB model, which assumes that each of the individual distributions is Gaussian.

So we say from sklearn.naive_bayes import GaussianNB, and then, just like before, we make the model — copy, paste — nb = GaussianNB(), and then we fit the model: .fit on X_train with petal width and petal length, and then y_train. Okay, good.

Hmm — why did that not work? Maybe let's do .values. "No, the issue, Matt, is that your figure code has size equal to the letter s in every place. If you define a value for s before you plot, then it works." Okay, I see — this is what happens when I make these edits a month before while trying to get through the rest of my day; you do that to yourself. There we go.

So we can sort of see the difference in the boundary: the naive Bayes is probably an improvement over the QDA, a little bit less overfitting to the training set. But we still don't know whether it's going to give the best performance; we would have to do something like a stratified cross-validation. And I will say, thanks Brooks for pointing that out — you saved me a lot of time trying to figure out what was wrong.
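Putting the dictated pieces together, here is a minimal sketch of the GaussianNB fit; the column and variable names are again assumptions about the notebook.

```python
from sklearn.naive_bayes import GaussianNB

# Naive Bayes with a Gaussian assumed for each feature within each class.
nb = GaussianNB()
nb.fit(X_train[['petal_width', 'petal_length']], y_train)

# Probabilities and hard predictions behave like the other classifiers.
nb.predict_proba(X_train[['petal_width', 'petal_length']])
nb.predict(X_train[['petal_width', 'petal_length']])
```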
Okay, so maybe I'll pause for just one question, because we're really rushed for time today.

"I have a question. Is there a particular class of problems which is more suited for Bayes-based classifiers than KNN, for example?"

If you have data that tends to fit the assumptions of the different models, then they would probably perform well. I don't have a good sense of "oh, you always want to use this when you're doing image classification" or something like that, so I would say you just have to look at some of the data and get a sense. I think ultimately people tend to just fit it, in sort of a cross-validation approach, see whether it performed better than the other ones, and then maybe go back and check: okay, do the assumptions seem to be egregiously broken? Is it okay if this is the best one?

"And for this model, the main assumption is about X given y?"

So for LDA and QDA it's different types of Gaussian: for LDA it's Gaussian with the same covariance, for QDA it's Gaussian with different covariances for the classes. For naive Bayes the main assumption is independence, and in this particular naive Bayes, that the features are independent and each of the distributions is Gaussian.

"Thanks."

Yep. Awesome. Okay, so the next type of model we're going to learn about today — hopefully not taking too long, because I want to get to principal components analysis — is the support vector machine. We're going to go step by step through the way support vector machines were built up over time and see the development. We're breaking this into two things: linear support vector machines and then more general support vector machines.

Linear support vector machines are designed for data sets that are linearly separable. Remember, "linear" means the decision boundary is linear, so linear support vector machines produce linear decision boundaries. What do we mean by linearly separable? It's data whose classes can be separated by a hyperplane: in two dimensions, think of drawing a line; in three dimensions, a plane; and in higher dimensions, it's separating with an (n−1)-dimensional subspace — if you're in R^n, a hyperplane is an (n−1)-dimensional subspace.

There are two types of linear support vector machines, and the first one developed was the maximal margin classifier. Here's some phony data I've generated. Support vector machines were, to my knowledge, developed by the computer science community, and there the two classes are typically −1 and 1, whereas in our other classification settings it's been 0 and 1.
For us, we're going to go with the formulation that comes from the computer science side, because it allows the formulas to work out more nicely than with 0 and 1; just know that we have two classes, −1 and 1.

The question is: if I were to try to come up with a rule to separate these, what would I do? Well, you could just draw a line separating them. But then the question becomes: which line is the best line? Here are three different lines that separate the data — maybe it doesn't look like it because the edges touch, but the center is where the data is — a solid black line, a blue dotted line, and a red dash-dot line. The idea behind the maximal margin classifier is that the line that performs best, or generalizes best, is the one that is as far away from the training data as possible. Here we might say the red dash-dot line isn't very good, because over here we're likely to have a +1 cross over it, and over here a −1, and the opposite is true for the blue dotted line. With the black line, we have maximized the distance between all of our observed training points and the dividing line. So that's the idea behind a maximal margin classifier: you try to maximize the distance from the points to the decision boundary, and that distance is known as the margin — which is why it's "maximal margin."

Here is the setup; we're not going to dwell too much on it. M, I believe, is the margin distance. You might be thinking this looks like just the formula for a hyperplane, and it is: X times β equals 0 is the equation that defines the hyperplane. What you're basically saying is that you want a hyperplane such that all of your points fall outside a margin of M on either side of the hyperplane.

Okay, so we're going to go through and show how to do this in sklearn. For linear support vector machines for classification it's LinearSVC — support vector classifier. So from sklearn.svm, which is where they're stored, we import LinearSVC. We're going to make our model, and for now I'm going to put in something called capital C and ignore it; I'll touch on it a little later in the notebook. So max_margin = LinearSVC with C equal to 1000 — again, I'll talk about that in a bit — and I'm going to increase the maximum number of iterations. Notice there's a difference in syntax here.
In sklearn, max_iter has the underscore, whereas in statsmodels it did not. Then max_margin — I'm going to fit with X and y; I think it's just X here, not X_train — oh, I forgot the .fit. Okay.

So here is the line that is the decision boundary, the solid black line — slightly different from the black line we had above. The black dotted lines represent the margin, and here we calculate the margin as the minimal distance from the hyperplane to the points. The points touching the margin — about four blue dots and two orange triangles — are what are known as the support vectors. That's why it's called a support vector machine: the support vectors are the observations closest to the decision boundary, touching the margin. They're called support vectors because if I were to move any one of them, the decision boundary and the margin would likely change. If you're coming from mathematics, this has nothing to do with the concept of the support of a function — which is sort of annoying as a mathematician, because I spent a very long time trying to see if there was a connection; there isn't. They're just called the support because, in some sense, these points support the decision boundary: moving them would change it.

So that's the maximal margin classifier. You can see how this might not be the best classifier for every problem. For instance, what if we had a situation like this, where it's essentially the same data, but a couple of observations from each class are commingled, so they're not perfectly linearly separable? In the original example we could draw a line between the two classes and separate them perfectly; here we cannot, but we could draw the same exact line as before and do what I would say is an okay job of separating them: three misclassifications for the orange triangles and two for the blue circles, which is probably not so bad. So the idea is: what if we allow some of our training points to cross over the margin, and even the decision boundary if they need to? We're going to make our margin "soft."
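Here is a minimal sketch consolidating the near-maximal-margin fit dictated above; the synthetic X and y are stand-ins for the notebook's phony linearly separable data, not the actual arrays.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in for the notebook's phony, linearly separable data:
# class -1 centered at (-2, -2) and class +1 centered at (2, 2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 2)), rng.normal(2, 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A very large C leaves almost no budget for crossovers,
# so this behaves like the maximal margin classifier.
max_margin = LinearSVC(C=1000, max_iter=100_000)
max_margin.fit(X, y)

# The fitted hyperplane: coef_ . x + intercept_ = 0 is the decision boundary.
print(max_margin.coef_, max_margin.intercept_)
print(max_margin.predict([[-2, -2], [2, 2]]))  # expect [-1, 1]
```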
So it's still a linear support vector machine, but now we're going to allow points to cross over if they need to. The idea that gets switched is that you're doing the exact same problem as before, but now we multiply the margin by (1 − ε_i), where the ε_i determine a budget for how much we allow points to cross over both the margin and the decision boundary. The ε_i are determined in the following way: if the training point is on the correct side of the margin — in this example, the correct sides are over here and over here — then ε_i is 0. If it's on the wrong side of the margin but on the correct side of the hyperplane, it's between 0 and 1; that would be, say, an orange triangle on this side of the decision boundary but on the other side of the margin. And ε_i is greater than 1 if observation i is on the completely wrong side of the hyperplane, meaning it's being misclassified.

This is where that thing I called C comes in. The larger C is, the less wiggle room you have for things being on the wrong side; the smaller C is, the more you allow things to cross over the margin, and even the hyperplane altogether. It's written like this because, while the traditional setup uses something like alpha, this corresponds with how it works in sklearn: a larger C means a smaller budget for crossing over, and a smaller C means you're more likely to let things cross over.

To get a sense for how that works, what we're doing here is going through different values of C, fitting a support vector machine, and showing how the decision boundary and the margin change. This first one — I've changed the points slightly so you can actually see the boundary — is with C equal to 10, a larger value, meaning a smaller budget for crossovers, so it's pretty close to the original decision boundary. Then as C gets smaller, we'll see the margin get wider and the boundary start to shift. The smaller C is, remember, the more wiggle room we have for points to be on different sides.

So C is a hyperparameter, and just like ridge regression with alpha, or k-nearest neighbors with k, the value of C that works best for your problem can be determined with hyperparameter tuning and cross-validation. What we would probably do is set up a grid of different values of C, along the lines of the sketch below.
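Here is a hedged sketch of that tuning using GridSearchCV; the grid of C values is illustrative, and X and y are assumed to be the arrays being fit (for example, the synthetic data from the earlier sketch).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Illustrative grid of candidate C values (not taken from the notebook).
param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000]}

# With a classifier, cv=5 uses stratified 5-fold splits by default;
# scoring='accuracy' matches the "average cross-validation accuracy" idea.
grid = GridSearchCV(LinearSVC(max_iter=100_000), param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_)                    # the C with the best average CV accuracy
print(grid.cv_results_['mean_test_score'])  # the average accuracy for each C
```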
Like we kind of did already: for each value of C, find the average cross-validation accuracy, say, and then choose the one with the best average cross-validation accuracy.

Okay, so that's linear support vector classifiers. The first one was developed for data that are linearly separable; the second for data that are close to linearly separable. They're the same thing in sklearn — the only difference is what value of C you use. So what we saw was a relaxation from needing to be completely linearly separable to being close to linearly separable. The next natural extension is: what if you're not at all linearly separable — what if you have a nonlinear decision boundary? That's the idea behind general support vector machines.

Here's some data: a two-dimensional example where, no matter how you draw it, there's no line that would come close to correctly classifying these points. And here's an even simpler one-dimensional example, where there's nowhere we could place a dividing point that would do a good job. Ideally we'd want to classify the points in the middle — between about −0.3 and 0.3 — as 1, and everything outside of that as −1. But the two types of support vector machines we just learned can't do that with the data as it is.

So what can we do? There's a process known as lifting. In a situation like this, if we take both the original feature and its square as a new data set — one feature is the original x1 and the second feature is the square of x1 — we can now draw a linear decision boundary: a straight line that divides the two classes. So let's do LinearSVC — I'll make C a little bigger, say 10 — with a larger max_iter, and fit it on what I called X_new, and y. Okay, so now I can perfectly separate these two; this is the decision boundary I've created using this approach.

And that's really the idea in general. For instance, we could do this by hand with the previous 2D example — a paraboloid would probably work — but in general you're not always going to be able to say, "all right, I'm going to find the perfect combination of nonlinear transformations of the original data that will let me use a linear support vector classifier."
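Here is a hedged sketch of that lifting step for the one-dimensional example; the data are made up in the spirit of the plot, and the name X_new matches the notebook only by assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up 1D data in the spirit of the plot: label 1 near zero, -1 farther out.
x1 = np.linspace(-1, 1, 21)
y = np.where(np.abs(x1) <= 0.3, 1, -1)

# Lift: one column is the original x1, the other is x1 squared.
X_new = np.column_stack([x1, x1 ** 2])

# In the lifted space a straight line separates the two classes.
svm = LinearSVC(C=10, max_iter=100_000).fit(X_new, y)
print(svm.score(X_new, y))  # should be 1.0: the classes are separable after lifting
```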
00:57:16.000 --> 00:57:24.000 But luckily, there's a trick that does it for you. And the trick is known as the kernel trick. 00:57:24.000 --> 00:57:26.000 And so 00:57:26.000 --> 00:57:41.000 we have this concept known as a kernel function. The idea behind these kernel functions is they're going to allow us to fit a linear support vector machine in a higher dimensional space, maybe even an infinite dimensional one, 00:57:41.000 --> 00:57:48.000 and then project the resulting decision boundary down into the original space. So, to get there, we're gonna have to review some concepts. 00:57:48.000 --> 00:58:02.000 Again, this is a little bit mathematical. So if you're not a math person, just try and hang on and take away the bigger picture of lifting to a higher space and then going back down. 00:58:02.000 --> 00:58:10.000 So we need to know something about inner products. When you fit — we didn't really go over how to fit the algorithm, 00:58:10.000 --> 00:58:19.000 I just showed you the problem and then didn't say much — to fit the algorithm, what's going on in the background is that some inner products are being calculated. 00:58:19.000 --> 00:58:28.000 And so what is an inner product? Well, for support vector machines, it's basically just taking the corresponding components of the two vectors, multiplying them, and then adding the results together. 00:58:28.000 --> 00:58:47.000 Okay, so this gets calculated in the background when the SVM is getting fit. So suppose we have a function that's going to take our data and lift it into a higher dimension, which I'm gonna call phi. 00:58:47.000 --> 00:58:52.000 So what we would do when we're fitting our SVM 00:58:52.000 --> 00:58:59.000 is we would need to calculate this inner product in the lifted space. So as an example, 00:58:59.000 --> 00:59:05.000 consider this phi that takes in two-dimensional data and produces three-dimensional data like so. 00:59:05.000 --> 00:59:15.000 Okay, so (x1, x2) goes to a three-dimensional vector: x1 squared, square root of 2 times x1 times x2, and then x2 squared. 00:59:15.000 --> 00:59:21.000 Then, if we're fitting this, we would need to go through the process of computing an inner product between the higher dimensional vectors. 00:59:21.000 --> 00:59:27.000 So if we did that for phi(a) and phi(b), this would give us a1 squared times b1 squared — that's the first component, right? 00:59:27.000 --> 00:59:36.000 And then the next component would be 2 times a1 times b1 times a2 times b2. So that's the multiplying of the second components of each. 00:59:36.000 --> 00:59:47.000 And then the last term is plus a2 squared times b2 squared. But it turns out that we can rewrite that whole sum as (a1 times b1 plus a2 times b2) squared, 00:59:47.000 --> 00:59:56.000 and this is just the inner product of a and b, squared. So it's the inner product in the lower dimensional space, right? 00:59:56.000 --> 01:00:11.000 So what if we had something like this, where, yes, we have a higher dimensional transformation, but it turns out the inner product in the higher dimensional space is just some function of the inner product in the original data space? 01:00:11.000 --> 01:00:19.000 I keep moving my arms out like this today, so I'm very animated. And so this is the idea of a kernel function. 01:00:19.000 --> 01:00:29.000 So if you have a map that goes from the lower dimensional space to the higher dimensional space.
Like the example we just saw taking one dimension and squaring it to get a second dimension. 01:00:29.000 --> 01:00:37.000 If we have this map that lifts our space into a higher dimensional space, and that map also has a function K 01:00:37.000 --> 01:00:43.000 such that the inner product in the higher dimensional space is just a function of the lower dimensional data, 01:00:43.000 --> 01:00:52.000 then we would say that our map has a kernel function. Okay, so in this example, our kernel function is the square of the inner product. 01:00:52.000 --> 01:01:12.000 So that's the idea here: general support vector machines apply these maps that have kernel functions. The maps allow us to lift the data into a higher space, but the inner product in that higher space is just calculated in the lower dimensional space because of this kernel function idea. 01:01:12.000 --> 01:01:27.000 So that's known as the kernel trick. The 4 most common kernels, and the ones that you can implement in sklearn, are the linear kernel, which is not really doing what we just said, but it's what we kind of used for the last 2, right? 01:01:27.000 --> 01:01:34.000 So it's just the inner product of a and b. The polynomial kernel, which is what we used in our previous example, 01:01:34.000 --> 01:01:43.000 is gamma times the inner product of a and b, plus r, all of that raised to the degree d. 01:01:43.000 --> 01:01:51.000 Gamma and r here are more hyperparameters which you would tune, and d is the degree of the polynomial. 01:01:51.000 --> 01:02:03.000 So for us, we would have used degree 2. Then we've got the Gaussian radial basis function kernel, the Gaussian RBF kernel. 01:02:03.000 --> 01:02:11.000 It's given by this expression, where the norm here is the Euclidean norm. And then finally what's known as a sigmoid kernel, which is the hyperbolic tangent of gamma times the inner product plus r. 01:02:11.000 --> 01:02:24.000 The default in sklearn is the radial basis function kernel. So if you're doing this, you'd probably try that first, and if it doesn't work, you can try the other ones and play around. 01:02:24.000 --> 01:02:33.000 But this is probably your biggest go-to one. So we're going to show how to implement this in sklearn. 01:02:33.000 --> 01:02:40.000 So from sklearn.svm, we're gonna import SVC, all capital. 01:02:40.000 --> 01:02:52.000 So, support vector classifier. We're gonna go back to our one dimensional example. And so we're gonna set this up, and remember I said we used a kernel, 01:02:52.000 --> 01:02:56.000 a polynomial kernel. So we would do SVC, kernel equal to the string 'poly' for polynomial. 01:02:56.000 --> 01:03:12.000 We want to specify the degree, so we want a degree 2 polynomial, and then I'm just setting C to be that, and then I'll increase the max_iter just to be safe. 01:03:12.000 --> 01:03:17.000 So we're gonna fit it. I think I called it maybe just X, is that what I have? 01:03:17.000 --> 01:03:24.000 Yes, just X. 01:03:24.000 --> 01:03:30.000 You've gotta reshape, that's what that is. 01:03:30.000 --> 01:03:37.000 Okay. And so now here I'll plot that decision boundary. So the big X's represent the boundary. 01:03:37.000 --> 01:03:38.000 So to the left and the right of the big X's, I'm classifying as 01:03:38.000 --> 01:03:51.000 — sorry, outside the big X's is negative one, the blue circles, and inside is one, the orange triangles.
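Here is a cleaned-up sketch of what that live-typed cell is doing, with a made-up 1-D toy dataset standing in for the notebook's X and y (so the variable names and values are illustrative, not the notebook's).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in for the notebook's 1-D data:
# class 1 in the middle, class -1 on the outside
rng = np.random.default_rng(216)
x = rng.uniform(-1, 1, size=200)
y = np.where(np.abs(x) < 0.3, 1, -1)

# SVC expects a 2-D feature array, hence the reshape
X = x.reshape(-1, 1)

# Polynomial kernel of degree 2, roughly mirroring the lecture's settings
poly_svc = SVC(kernel="poly", degree=2, C=10, max_iter=100_000)
poly_svc.fit(X, y)

# The fitted boundary shows up as two cut points in the original 1-D space
grid = np.linspace(-1, 1, 401).reshape(-1, 1)
preds = poly_svc.predict(grid)
print(grid[np.where(np.diff(preds) != 0)].ravel())  # approximate boundary locations
```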
01:03:51.000 --> 01:04:02.000 So here we're going back to that other example. And here we can show how, like I said earlier, you could do something where you make what's known as a paraboloid. 01:04:02.000 --> 01:04:08.000 So that is also a polynomial kernel, 01:04:08.000 --> 01:04:13.000 and it's degree 2. 01:04:13.000 --> 01:04:19.000 And then I guess I put C equals 10 here. And then the other one we could try is that default one, 01:04:19.000 --> 01:04:27.000 the RBF kernel. So you would just do SVC, and because RBF is the default, you would just put in the value of C that you want to use. 01:04:27.000 --> 01:04:32.000 So here I'll just use 10 for simplicity. 01:04:32.000 --> 01:04:39.000 And then here we've plotted the decision boundary as the solid line, and the dotted line is sort of 01:04:39.000 --> 01:04:48.000 equivalent to the margin. So here you can see with the polynomial kernel, I get these lines — it's still sort of a linear decision boundary in some sense. 01:04:48.000 --> 01:04:54.000 And then with the RBF, you get what looks almost like a Gaussian distribution. 01:04:54.000 --> 01:05:00.000 And so which one works best is something that you would want to try with cross validation. 01:05:00.000 --> 01:05:08.000 Maybe you want to try both and then see which one generalizes better. Yeah, and so here are some references. 01:05:08.000 --> 01:05:15.000 I didn't do a lot of the mathematical details for this one, so I've provided some references here. 01:05:15.000 --> 01:05:22.000 They're all chapters of Elements of Statistical Learning, which I believe I have linked to the free web PDF of. 01:05:22.000 --> 01:05:49.000 Okay, so maybe if there are a few questions we can try and answer them, and then we'll get to principal components analysis. 01:05:49.000 --> 01:05:50.000 Okay, yeah, yeah. 01:05:50.000 --> 01:05:55.000 Can I ask a question? Sorry, I'm a little confused about these kernels and the inner product. 01:05:55.000 --> 01:06:06.000 So the inner products come into the second part of the lecture. I'm a little confused why you still need these. 01:06:06.000 --> 01:06:34.000 Is it because you're going into higher dimensions, is that why? 01:06:34.000 --> 01:06:40.000 So when you lift the data, the inner products would in principle be computed in that higher dimensional space rather than the one dimensional space. But that can be really cost prohibitive. So the larger the higher dimensional space that you lift into, the longer it would take to calculate the inner product, right? 01:06:40.000 --> 01:06:53.000 Because you have more dimensions. And especially, there are some cases like the RBF, which are lifting to what's known as an infinite dimensional space, 01:06:53.000 --> 01:07:05.000 a Hilbert space, where it wouldn't be possible for your computer to calculate the inner product in that way, right? 01:07:05.000 --> 01:07:16.000 So kernel functions are a trick where, if the mapping — the function that goes from the lower dimensional space to the higher dimensional space — 01:07:16.000 --> 01:07:26.000 has this kernel function, where the inner product up there is just a function of the original data, then you don't have to compute the inner product up there. 01:07:26.000 --> 01:07:33.000 You can compute the inner product in the original space. So that's the idea. 01:07:33.000 --> 01:07:34.000 Okay, thanks. 01:07:34.000 --> 01:07:41.000 Yeah.
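To see the point of that answer numerically, here is a tiny check (my own illustration, not from the notebook) that the degree-2 polynomial map phi from earlier really does satisfy phi(a) · phi(b) = (a · b)^2, so the 3-D inner product can be computed entirely in 2-D.

```python
import numpy as np

def phi(v):
    """The lifting map from the lecture: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

a = np.array([1.5, -2.0])
b = np.array([0.3, 4.0])

inner_lifted = phi(a) @ phi(b)        # inner product computed in the 3-D space
inner_kernel = (a @ b) ** 2           # same quantity via the kernel, all in 2-D

print(inner_lifted, inner_kernel)     # both print 57.0025
np.testing.assert_allclose(inner_lifted, inner_kernel)
```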
So someone asked, what type of classification problem would we use SVM for? 01:07:41.000 --> 01:07:57.000 So you can try it on any classification problem. There are some things to consider, like the fitting time — I don't have offhand, given the number of features and observations, 01:07:57.000 --> 01:07:59.000 the formula for how long it would take, but I know that's something you want to consider. 01:07:59.000 --> 01:08:17.000 You can just try it on a classification problem, and if it doesn't work well in the cross validation, or if you have other considerations, like it takes too long to fit or something like that, then you wouldn't use it, 01:08:17.000 --> 01:08:30.000 that sort of thing. And then Keithon's asking: in poly support vector classification, is it simply considering 2 boundaries instead of one, where they're both linear boundaries? 01:08:30.000 --> 01:08:32.000 So, 01:08:32.000 --> 01:08:37.000 for these problems it's still only computing a single one, right? So, 01:08:37.000 --> 01:08:46.000 here was when we did the polynomial kernel, right? And we got what looks like 2 different boundaries in the original space. 01:08:46.000 --> 01:08:52.000 But remember what's going on is, in the higher space, because we've lifted it, we can now draw a single boundary. 01:08:52.000 --> 01:09:00.000 And then what happens is it gets projected back down. So it goes backwards through the mapping. 01:09:00.000 --> 01:09:09.000 So this line will get projected back down into the one dimension, where it appears as 2 decision points. 01:09:09.000 --> 01:09:30.000 That's the idea, and it's the same idea here, where with the polynomial kernel, what's happening is these points are getting transformed through what's known as a paraboloid, then you're slicing it with a plane and bringing it back down. 01:09:30.000 --> 01:09:34.000 Okay. 01:09:34.000 --> 01:09:37.000 So in this notebook, we're taking an aside — so if you're trying to keep track of where this is, 01:09:37.000 --> 01:09:45.000 it is in the unsupervised learning folder. This is going to be the only notebook we cover out of there in live lecture. 01:09:45.000 --> 01:09:51.000 So it's principal components analysis. The reason we're doing this is it's probably one of the biggest 01:09:51.000 --> 01:10:00.000 dimension reduction techniques you're gonna want to know. Sometimes you'll have data that's really high dimensional, and for various reasons you want to 01:10:00.000 --> 01:10:11.000 reduce the dimension. So maybe you have too much data to fit your algorithm in an efficient time. 01:10:11.000 --> 01:10:16.000 Maybe you want to get rid of some noise in the data, and you want to recover features that are not noisy and are actually giving you signal. 01:10:16.000 --> 01:10:30.000 So for all of these reasons, you might want to apply various dimension reduction techniques. The one we're going to cover is PCA, or principal components analysis. 01:10:30.000 --> 01:10:39.000 So this is a technique — it's probably the most popular dimension reduction technique in data science — and we're gonna go through it. 01:10:39.000 --> 01:10:46.000 For the sake of time, we're not gonna cover the entire notebook. There's probably one section that I'm going to skip, just for the sake of time.
01:10:46.000 --> 01:10:54.000 But remember, these are already pre-recorded, so if you want to come back and go through the part that I end up skipping, you're more than welcome to do that 01:10:54.000 --> 01:11:24.000 on your own time. Okay. So I'm going to take a drink of water and then we'll go through this. 01:11:25.000 --> 01:11:32.000 So when you're reducing the dimension, you are losing what's known as information. 01:11:32.000 --> 01:11:37.000 And the idea behind PCA, or most of these techniques, is you want to reduce the dimension in a way that retains as much of the important information as you can. 01:11:37.000 --> 01:11:53.000 So the way that PCA tackles this is very statistical in nature. There's this idea in statistics that the information of a data set is located within the data set's variance. 01:11:53.000 --> 01:11:59.000 And so PCA looks to reduce the dimension of a data set by projecting the data from the higher dimensional space to a lower dimensional space while capturing as much of the original variance as possible. 01:11:59.000 --> 01:12:08.000 So your original data set has some variance, and what PCA tries to do is capture as much of that variance as it can 01:12:08.000 --> 01:12:16.000 while giving you a lower dimensional data set. The way it does this is through an optimization problem: 01:12:16.000 --> 01:12:27.000 it's trying to produce projections in a way that maximizes the variance of those projections. 01:12:27.000 --> 01:12:32.000 So here's the heuristic algorithm. The first step is to center your data 01:12:32.000 --> 01:12:35.000 so it has 0 mean. You can do this without really changing the data — you're just subtracting off the mean — and it's done for convenience; 01:12:35.000 --> 01:12:43.000 it makes the formulation easier. 01:12:43.000 --> 01:12:52.000 Next, you find — and this isn't you, this is the computer — you find the direction in space along which projections have the highest variance. 01:12:52.000 --> 01:13:03.000 This is going to be your first principal component. Then, among all the other directions — so if you have an m dimensional data set, so m features, right, 01:13:03.000 --> 01:13:06.000 after the first principal component you'll have m minus 1 possible directions left — 01:13:06.000 --> 01:13:16.000 the next step after finding the first principal component is you want to find the direction that is orthogonal, 01:13:16.000 --> 01:13:28.000 so at a right angle, to the first principal component and that maximizes the variance. This is your second principal component, and you keep doing this until you get through as many principal components as you can. 01:13:28.000 --> 01:13:33.000 So every time you find a new direction, it's orthogonal to all the previous directions. 01:13:33.000 --> 01:13:47.000 Okay, so we're gonna look at a somewhat silly example, a 2D example, and we'll see why it's silly — because you might be wondering, why would I want to reduce the dimension of a two-dimensional data set, it's already pretty small. 01:13:47.000 --> 01:13:58.000 So here I have 2 dimensions, x1 and x2. In unsupervised learning — we didn't talk about this — we don't have any y's, we have no outputs we're trying to predict. 01:13:58.000 --> 01:14:11.000 We are just trying to look at an X matrix. So, a matrix X.
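Here is a minimal NumPy sketch of those heuristic steps, on made-up data (my own illustration, not the notebook's code): center the data, then take the covariance matrix's eigenvectors, ordered by eigenvalue, as the successive orthogonal max-variance directions. This leans on the eigenvector connection that gets derived a little later in the lecture.

```python
import numpy as np

rng = np.random.default_rng(216)
# Hypothetical 2-D data with one dominant direction of spread
X = rng.multivariate_normal(mean=[2.0, -1.0], cov=[[3.0, 1.4], [1.4, 1.0]], size=300)

# Step 1: center the data so each column has mean 0
Xc = X - X.mean(axis=0)

# Steps 2+: the max-variance directions are the covariance matrix's eigenvectors,
# taken in decreasing order of eigenvalue (they are automatically orthogonal)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order].T            # row 0 = first principal component

# Project the centered data onto the components to get the transformed data
X_pca = Xc @ components.T
print(components)
print(X_pca[:3])
```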
So in the context of the supervised learning we've been doing up to this point, that X matrix might be our matrix of features. 01:14:11.000 --> 01:14:17.000 So the idea with PCA is, I don't care about what y is, I'm just trying to get something out of the X. 01:14:17.000 --> 01:14:26.000 And so here we just have X, with x2 plotted against x1. 01:14:26.000 --> 01:14:31.000 So Jacob is asking: when you say scale the features so they have mean 0, does this include the one hot encoded features, or can you just use a standard scaler on every single feature? 01:14:31.000 --> 01:14:47.000 So that's a good question, Jacob. Here I'm just implicitly assuming we have continuous features for the setup, because it's easier to look at. 01:14:47.000 --> 01:14:54.000 In general, you do not want to scale one hot encoded features. 01:14:54.000 --> 01:15:02.000 You never wanna scale those; you always keep them as zeros and ones. And you can apply PCA to a matrix of zeros and ones and it's fine. 01:15:02.000 --> 01:15:11.000 But for this particular setup, to get the intuition behind PCA, I'm just restricting myself to continuous features. 01:15:11.000 --> 01:15:19.000 Okay, so remember I said for PCA the idea is we want to find the direction of maximal variance. 01:15:19.000 --> 01:15:29.000 And variance means spread. So of the directions in this plane, the one where the data has the greatest spread is the one that I'm tracing out 01:15:29.000 --> 01:15:41.000 with my mouse, my cursor. Okay. And then what's left after that is just going to be the orthogonal direction, which is the one I'm currently tracing out. 01:15:41.000 --> 01:15:44.000 So let's go through 01:15:44.000 --> 01:15:53.000 and see this by doing PCA. PCA is stored in a sub-package of sklearn called decomposition. 01:15:53.000 --> 01:15:57.000 So from sklearn.decomposition 01:15:57.000 --> 01:16:03.000 we're going to import PCA, all capitals. 01:16:03.000 --> 01:16:14.000 Then we're gonna make our PCA object. So we're gonna do lowercase pca equals capital PCA, and the number we put in is the number of dimensions we want to project down to. 01:16:14.000 --> 01:16:21.000 So for us, that will be 2. And then to fit the data, you call pca.fit, 01:16:21.000 --> 01:16:25.000 and there's no y here, so you just put in X. 01:16:25.000 --> 01:16:34.000 Okay. All right, so now this is a function that will take in our fitted PCA and then draw the vectors. 01:16:34.000 --> 01:16:43.000 So what does that mean? These solid black lines are the directions in 01:16:43.000 --> 01:16:51.000 your data space that we would project onto. 01:16:51.000 --> 01:16:55.000 So this long black line represents the vector that we're going to project onto for the first PCA direction. 01:16:55.000 --> 01:16:59.000 And this shorter black line 01:16:59.000 --> 01:17:07.000 represents the direction of the second principal component, the second PCA direction. Okay, so the way PCA takes your original data 01:17:07.000 --> 01:17:18.000 and produces new transformed data is it's going to take each of these observations, project it first onto the longer vector, and then project it onto the shorter vector.
01:17:18.000 --> 01:17:27.000 And then those projections will give us the coordinates for the new data. So how do we get that with sklearn? We do 01:17:27.000 --> 01:17:34.000 X_fit equals pca.transform. So this works an awful lot like a scaler object: you have fit and then you have transform. 01:17:34.000 --> 01:17:45.000 So you call pca.transform and we input the X. And this is maybe bad notation — I call it fit just because that's what I've always done. 01:17:45.000 --> 01:18:03.000 So this is what's known as the PCA transformed data. These represent the same observations that you see here, but now they've been projected onto a different space — if this were higher dimensional data it would be projected down onto 2, but since it's 2 to 2, it's the same number of 01:18:03.000 --> 01:18:06.000 dimensions. But we'll see what's going on in a second. 01:18:06.000 --> 01:18:15.000 So I've got a question here. Okay, so I believe I've answered the question. 01:18:15.000 --> 01:18:19.000 The question was: why do we use PCA here, don't we already have 2 dimensional data? 01:18:19.000 --> 01:18:23.000 Is this why you said you are just showing how it operates in a simple example? Yep, that's exactly it. 01:18:23.000 --> 01:18:31.000 It's just showing it in a way that I can visualize, so you see what's going on. 01:18:31.000 --> 01:18:39.000 Okay. So we are going to look at the maximal variance formulation of PCA just so we can see what's going on. 01:18:39.000 --> 01:18:48.000 It makes the setup easier. There are other ways to set up PCA that are in the practice problems for the PCA notebooks. 01:18:48.000 --> 01:18:57.000 So let's suppose that we have n observations of m features, x1 through xm. Each of these is an n by 1 vector 01:18:57.000 --> 01:19:02.000 containing the observations of that feature. So again, I'm assuming that they all have mean 0. 01:19:02.000 --> 01:19:12.000 We could do this simply by subtracting off the mean and our data would be fine. Okay, so we're going to find the first principal component, 01:19:12.000 --> 01:19:18.000 and then you can extrapolate to get the next however many components. 01:19:18.000 --> 01:19:26.000 So we're gonna set up X as an n by m matrix where each column is one of the features. 01:19:26.000 --> 01:19:31.000 And then our goal is to find a weight vector w such that the norm of that vector is one and the variance of the projection is as big as possible. 01:19:31.000 --> 01:19:41.000 The projection here is the w's times the x's, or X times w as a linear algebra expression. 01:19:41.000 --> 01:19:46.000 So you want to maximize this variance. The variance of X times w is equal to this, 01:19:46.000 --> 01:19:58.000 and because we've centered the columns, this is what the variance is equal to. 01:19:58.000 --> 01:20:09.000 Then you can pull the weight vector outside, and what you're left with is w transpose times the covariance matrix of X times w. 01:20:09.000 --> 01:20:20.000 So then what we're trying to optimize is this w transpose Sigma w, and now we're constraining ourselves to the fact that w transpose w minus one has to be 0. 01:20:20.000 --> 01:20:40.000 Where does this come from? That's just saying the norm has to be equal to one.
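Written out, the problem described over the last couple of minutes is the following (my LaTeX transcription of the spoken formulas, with Sigma denoting the sample covariance matrix of the centered X):

```latex
\max_{w}\; \operatorname{Var}(Xw) \;=\; w^{\top} \Sigma\, w
\quad \text{subject to} \quad w^{\top} w = 1 .
```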
So if you go back to Calc 3, you can use Lagrange multipliers and find out that maximizing the variance in this constrained optimization problem means you want the w such that 01:20:40.000 --> 01:20:57.000 Sigma w is equal to lambda w. So this is a standard eigenvalue setup. The principal components are going to be eigenvectors corresponding to the eigenvalues of the covariance matrix. 01:20:57.000 --> 01:21:04.000 So the first PC, the one that maximizes the variance the most, is going to be the 01:21:04.000 --> 01:21:22.000 eigenvector corresponding to the largest eigenvalue. So if you're not a math person and all those words just sounded like a foreign language to you, just remember: it's the direction that maximizes the variance of the projection. 01:21:22.000 --> 01:21:28.000 That's all that matters. Okay. So as a quick aside before we go back to 01:21:28.000 --> 01:21:39.000 implementing this stuff: you typically need to scale your data before you fit the PCA model in practice. 01:21:39.000 --> 01:21:53.000 If one of your columns has a much larger scale than the other columns, because we're maximizing variance, what tends to happen is that PCA just picks up on that much larger scale column instead of doing what we would like it to do. 01:21:53.000 --> 01:22:10.000 So it's a common approach to run the data through the StandardScaler object first, accounting for the fact that you may have to ignore categorical variables if you have those. 01:22:10.000 --> 01:22:22.000 Okay, so I talked about this w. The w's are known as the component vectors, and we can actually get the components out with .components_. 01:22:22.000 --> 01:22:28.000 So we're going to do pca.components_. 01:22:28.000 --> 01:22:35.000 Okay, and so here — 01:22:35.000 --> 01:22:44.000 sorry, I was trying to read what I was writing earlier. So here we have a numpy array, and within that array we have 2 vectors. 01:22:44.000 --> 01:22:56.000 The first one is w1: this is the eigenvector corresponding to the largest eigenvalue of the covariance matrix, 01:22:56.000 --> 01:23:05.000 so the first PCA component. And then the second entry is the second PCA component. Okay, so we can store these in variables, 01:23:05.000 --> 01:23:10.000 w1 and w2. 01:23:10.000 --> 01:23:15.000 And so now what I'm gonna go ahead and do — I've plotted these vectors. 01:23:15.000 --> 01:23:22.000 So the long vector represents w1, the shorter vector represents w2, and I've got this red X. 01:23:22.000 --> 01:23:24.000 This red X is one of the data points that I would like to get transformed through this PCA projection. 01:23:24.000 --> 01:23:36.000 And so you see these red dotted lines that trace along the vectors. 01:23:36.000 --> 01:23:46.000 The horizontal position in the PCA projected space is going to be given by the distance from 0 to this point 01:23:46.000 --> 01:24:03.000 on w1, so it'll be this length. And then the vertical axis position of this red X in the PCA space is going to be the distance from 0 to this point on the vector w2. 01:24:03.000 --> 01:24:10.000 And so now we can see this is what it looks like. So here my red X is at that new position; 01:24:10.000 --> 01:24:16.000 now we're in the PCA projected space.
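Here is a small sketch of that by-hand projection on made-up data (again my own illustration, not the notebook's code): take a point, dot the centered point with each row of pca.components_, and compare with what pca.transform gives. The two should agree up to floating point noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(216)
# Hypothetical 2-D data standing in for the notebook's X
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.4], [1.4, 1.0]], size=300)

pca = PCA(2)
pca.fit(X)

w1, w2 = pca.components_          # each row is one component vector

# Pick one observation (the "red X") and project it by hand.
# PCA subtracts the fitted mean before projecting, so we do the same.
point = X[0]
by_hand = np.array([(point - pca.mean_) @ w1, (point - pca.mean_) @ w2])

# sklearn's version of the same thing
via_sklearn = pca.transform(point.reshape(1, -1))[0]

print(by_hand)
print(via_sklearn)                # should match up to floating point precision
```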
So the vectors have been brought over just to help as a reference point. 01:24:16.000 --> 01:24:25.000 So here is the PCA transformed value that was done by hand using the projection 01:24:25.000 --> 01:24:33.000 that we found, and the blue circle is what you'd get with sklearn. They're slightly different, probably because of 01:24:33.000 --> 01:24:36.000 different levels of precision, but they're essentially the same, just a very slight difference. 01:24:36.000 --> 01:24:45.000 Okay. So because we only have 3 minutes left, I'm not gonna pause for questions just yet. 01:24:45.000 --> 01:24:58.000 I do want to get through this idea of explained variance. So remember we said we want to preserve as much of the variance of the original data as possible. 01:24:58.000 --> 01:25:11.000 We can find how much of it we've captured with explained variance. So we do .explained_variance_, and what this tells you is a breakdown of the variance of X times w. 01:25:11.000 --> 01:25:19.000 Yeah, it's a breakdown of the variance of X times w. So the variance of X times w1 is 80.45, 01:25:19.000 --> 01:25:33.000 and the variance of X times w2 is 4.25. This becomes much more interpretable if you look at what's known as the explained variance ratio. 01:25:33.000 --> 01:25:41.000 And so you can see here that we have 0.949. 01:25:41.000 --> 01:25:57.000 So this is saying that basically 95% of the original data set's variance is captured in the first principal component direction, and then about 5% of the original variance is captured in the second principal component's direction. 01:25:57.000 --> 01:26:05.000 So that's the idea of the explained variance ratio. 01:26:05.000 --> 01:26:19.000 Okay, so I wanted to keep going — I just wanted to see how far we got. So what we're gonna do is use this data set of images of famous people's faces, and then use it to demonstrate the idea of the explained variance curve. 01:26:19.000 --> 01:26:20.000 So this is a lot of data. 01:26:20.000 --> 01:26:31.000 Each image is 87 by 65 pixels, which is 5,655 features. 01:26:31.000 --> 01:26:36.000 And so we're going to run the data through PCA. 01:26:36.000 --> 01:26:44.000 And here, when you have image data, you tend to just scale it. So each pixel is represented in grayscale, 01:26:44.000 --> 01:26:54.000 represented by a value from 0 to 255, where 255 means the most light it can show and 0 means a completely black pixel. 01:26:54.000 --> 01:26:58.000 With these types of problems, it's standard to scale it with what's known as min-max scaling, 01:26:58.000 --> 01:27:02.000 so the minimum value goes to 0 and the maximum value goes to 1. So here I have not specified a number of components. 01:27:02.000 --> 01:27:21.000 When you do this, what it does is it just fits as many components as it can. And after it's done fitting, which maybe will take a little bit, we'll see how many components were fit by looking at the shape of components_. 01:27:21.000 --> 01:27:31.000 And then after we've done that, we'll talk about the idea of the explained variance curve. 01:27:31.000 --> 01:27:37.000 So while that was fitting, someone had the question: it seems as if the variance in the second component is small. 01:27:37.000 --> 01:27:45.000 So in this previous example, yes, it was small. So now — 95% of the variance was captured by the first direction, right?
01:27:45.000 --> 01:27:51.000 And because this is a 2 dimensional problem, that just means everything that's left has to go in the second one. 01:27:51.000 --> 01:27:56.000 Okay. 01:27:56.000 --> 01:28:07.000 All right, so now that we've fit this, let's see what the shape is. So this is telling us that we have 3,023 — not features, 01:28:07.000 --> 01:28:16.000 PCA components, right? Yes. Okay, so 01:28:16.000 --> 01:28:25.000 oftentimes you may be wondering, how many components should I use? If you're trying to do something like visualization, maybe you're just gonna look at 2 components and be happy. 01:28:25.000 --> 01:28:36.000 If you're trying to find a way to reduce the dimension of the data, like we're trying to here, while still maintaining as much variance as you can, you may want to look at the explained variance curve. 01:28:36.000 --> 01:28:47.000 So the explained variance curve plots the number of principal components you've used against the cumulative explained variance ratio. 01:28:47.000 --> 01:28:56.000 So you take this explained variance ratio that we looked at above, right, and the cumulative explained variance ratio 01:28:56.000 --> 01:29:03.000 is just the cumulative sum of all the entries in there. 01:29:03.000 --> 01:29:09.000 For this earlier example it would look kind of silly, because we only have 2: it would be the same value here and then it would be 1 here. 01:29:09.000 --> 01:29:28.000 So by plotting that, it gives you a sense of, okay, if I have, say, a hundred principal components, that means I'm capturing about 90% of the original data set's variance. And what you can do is sometimes you'll look for an elbow in the curve — and we say elbow 01:29:28.000 --> 01:29:37.000 because you can imagine putting your elbow up like this, if you look at the picture of me right now — and the elbow is sort of where the 01:29:37.000 --> 01:29:46.000 increases in variance start to become so small that it's not worth the extra cost of either keeping the component or computing the component. 01:29:46.000 --> 01:29:54.000 So for us, maybe the elbow would be at about 0.9. It's this idea of diminishing returns for keeping additional components. 01:29:54.000 --> 01:30:03.000 Another thing you could do is just say, well, I wanna keep 95% of the original variance, and then just set it to be 95%. 01:30:03.000 --> 01:30:13.000 So when you do that, you can set the number of components equal to a fraction 01:30:13.000 --> 01:30:23.000 instead of an actual number like 10 or 15 or whatever. Okay. 01:30:23.000 --> 01:30:32.000 All right, so based on the time, I'm gonna ask that you finish the rest of the PCA notebook on your own. 01:30:32.000 --> 01:30:39.000 Yeah, I think just finish it on your own time. We got to the most important stuff, which is the model 01:30:39.000 --> 01:30:56.000 and the explained variance ratio. For the rest of the stuff, I would encourage you to go through it on your own, either just by reading through it or by watching the pre-recorded lecture. 01:30:56.000 --> 01:31:05.000 Okay. So I will stop recording, and then I'll hang around for any questions that people have.
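As a closing sketch of those last two ideas — the cumulative explained variance curve and passing a fraction as the number of components — here is a hedged example on the LFW faces data mentioned above. The fetch_lfw_people settings, the MinMaxScaler step, and the plotting details are my assumptions about what the notebook does, not a copy of it.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Faces data; min_faces_per_person and resize are guesses at the notebook's settings
faces = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

# Min-max scale the pixel values so they run from 0 to 1, as described in the lecture
X = MinMaxScaler().fit_transform(faces.data)

# Fit with no n_components so sklearn keeps every component it can
pca = PCA()
pca.fit(X)

# The explained variance curve: components kept vs. cumulative explained variance ratio
cum_ratio = np.cumsum(pca.explained_variance_ratio_)
plt.plot(np.arange(1, len(cum_ratio) + 1), cum_ratio)
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance ratio")
plt.show()

# Or skip the elbow-hunting entirely and ask for 95% of the variance directly
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)
print(X_reduced.shape)  # second entry = number of components needed for 95%
```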