On with classification. We have three notebooks that I want to try to get through today, and all three of them are long, so we're going to do our best. I will stop for questions, but I'll also try to move things along, because it's a lot of content. We're in classification, and we're going to start with notebook number 6, Bayes-based classifiers, and from there move on to support vector machines. We are going to skip notebook number 7; I encourage you to go through it on your own time, especially if you're doing a multi-class classification problem for your project. And after that, we'll leave the world of supervised learning for a brief notebook on unsupervised learning to end today.

So let's talk about Bayes-based classifiers. Here, "Bayes" refers to Bayes' rule from probability theory, so I want to start by reviewing it. Some of you may not remember Bayes' rule, which is perfectly fine if you're not using it regularly, or maybe you've never been introduced to it before. To understand the idea of Bayes' rule, there are two things you need to remember, or to learn if you never have. The first is the definition of conditional probability.

Suppose we have two events, A and B. If you're new to probability theory, think of an event like tossing a coin and seeing how it lands: one event would be that the coin lands tails, the other that it lands heads, and that's essentially all the possibilities. In theory the coin could land on its side so that it's neither heads nor tails, but I'm going to say that has probability 0 in the real world, probably.

So if you have these two events, and the probability of B is not 0, then the probability of event A happening conditional on the fact that event B has happened — written P(A|B), where the vertical line means "given" or "conditional on" B — is the probability of their intersection divided by the probability of B: P(A|B) = P(A ∩ B) / P(B).

One way to visualize this is with a picture. If all the possible outcomes are represented by a rectangle, all the situations where A happens are this left circle, and all the situations where B happens are this right circle, then the probability of A conditional on B is the green region where the two circles intersect, divided by the entirety of B. That's because the intersection is the only part of the space where B happens and A also happens.

The other concept that gets used a lot alongside Bayes' theorem is the law of total probability.
Say we have a sequence of disjoint events B_1, B_2, ..., meaning that the intersection between any two of these events is the empty set, and that the union of all of them covers the entire space. One way to visualize it: here's that rectangle from before — this is called the event space — and you have this sequence of smaller events B_i such that when you take the union of all of them, you get the entire space. Then it holds that the probability of any other event A is equal to the sum of the probabilities of the intersections of A with those B_i: P(A) = Σ_i P(A ∩ B_i). In other words, if you have a set of events that segment the space, then the probability of any event is just the sum of the probabilities of that event intersected with the different segments. Here is a nice graphic visualizing that (well, I think it's nice, I made it): the circle represents event A, and all these different weird-looking pieces represent the different B_i. You can see that if you take the intersection of A with the different pieces and add them all up, you get A back.

Bayes' rule is sort of a combination of these. The main part of Bayes' rule, also known as the Bayes–Price theorem, is that the probability of A conditional on B is equal to the probability of B conditional on A, times the probability of A, divided by the probability of B: P(A|B) = P(B|A) P(A) / P(B). This part on its own has nothing to do with the law of total probability; it's just reworking the definition of conditional probability — you rewrite the intersection as a different conditional probability. The part where the law of total probability comes in is that it's often advantageous to rewrite the denominator: you know that the probability of B is equal to the probability of B intersect A plus the probability of B intersect A complement (all the outcomes where A doesn't hold), and then you can once again use the definition of conditional probability to rewrite it as P(B) = P(B|A) P(A) + P(B|A^c) P(A^c).

Regardless of whether you remember any of the classifiers we're about to learn today, Bayes' rule is just a useful thing to know for data science or data analyst interviews. If you get to a company that does screener questions over the phone or on Zoom, there's a whole class of questions that are basically "apply Bayes' rule and you can get the right answer." So this is a good thing to know, both the original version and the version that takes advantage of the law of total probability.
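As a concrete illustration of that interview-style use of Bayes' rule, here is a small worked sketch; the screening-test setup and every number in it are made up for illustration, not taken from the lecture.

```python
# Hypothetical screening-test example (all numbers made up for illustration).
# A = "has the condition", B = "test comes back positive".
p_A = 0.01             # P(A): prior probability of the condition
p_B_given_A = 0.95     # P(B | A)
p_B_given_notA = 0.05  # P(B | A complement)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' rule: P(A|B) = P(B|A)P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))  # about 0.161
```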
And it's a good thing to practice: if you just do a web search for Bayes' rule practice interview problems, you'll find a whole bunch of them. This is just a good thing to know, regardless of whether you remember the different classifiers we're going to learn today. And, assuming this link still exists (I forgot to check), there's a nice visualization picturing what Bayes' rule looks like in this kind of setting; I figured it was easier to give you the link than to try to draw it myself. Brooks is telling me that the link still works — thanks, Brooks. I wrote this notebook a few years ago, so it's always a gamble if I forget to check whether a link still goes where I think it goes.

Okay, so now that we remember, or have learned, Bayes' rule, we can see how to use it for classification. For simplicity of setup, imagine a space where you have two possible classes, but in general we have our matrix of features X and an output variable y that can take on capital C possible categories. Just as a reminder, when we had C = 2, we were trying to predict the probability that y equals 1 given the features, meaning X = x*, where x* is a particular set of features for which we'd like to know: given these features, what's the probability that my output is 1?

You can use Bayes' rule to rewrite this. It might look different from the setup above, because that was for discrete events and here we have to account for the fact that some things may not be discrete. You can rewrite the probability that y equals c, given the features, as π_c times f_c evaluated at x*, divided by the sum over l = 1 to capital C of π_l times f_l evaluated at x*. Here little c denotes the particular category that y is equal to; in the binary case it might be 1.

So what are all these different π's and f's? π_c is known as a prior probability: the probability that a randomly chosen observation comes from the c-th class, ignoring the features. In practice, you typically just estimate this using the training set. If your training set had a 50-50 split, both of your π's would be estimated at 0.5; if it had a 60-40 split, π_0 would be estimated at 0.6 and π_1 at 0.4.
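Here is a minimal sketch of estimating the priors from class proportions; the y_train values below are made up, and in the notebook you would use the actual training labels.

```python
import pandas as pd

# Made-up training labels; in the notebook this would be y_train.
y_train = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 1, 0])

# pi_c is estimated by the fraction of training observations in class c.
pi_hat = y_train.value_counts(normalize=True).sort_index()
print(pi_hat)  # a 60-40 split gives pi_0 = 0.6 and pi_1 = 0.4
```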
Here we can think of f_c as the probability density function of observing x*: the probability of observing x* given that we know the category is y = c. Writing it that way sort of assumes X is a qualitative variable, but you could rewrite it; I just wanted to show how it connects back to Bayes' rule. So we can see this is Bayes' rule: the probability of X given y, times the probability of y, divided by the probability of X — that's what we've done here.

The three different types of classifiers we're going to learn in this notebook are all basically just giving us ways to estimate the f's. We already have a way to estimate the π's: we use whatever fraction of each category exists in our training set. Now we have to figure out how to estimate these conditional probability densities. The reason we have three different algorithms is that we're going to look at three different assumptions on f, and those result in different algorithms. Before we dive into the three algorithms, are there any questions on Bayes' rule or the setup for these classifiers?

Okay, so to help demonstrate these three algorithms, we're going to return to the iris data set. We're going to plot it, and I just want to make sure we have a reminder: we're in classification world, and when we do classification we want to stratify our data splits. That means including the stratify argument when we do train test splits and, if we're doing cross-validation, using stratified K-fold. Just a reminder, because we learned this stuff last week, we've had a whole weekend, maybe we had some fun and forgot.

Here's the data we're looking at, plotting petal length against petal width. We've got y = 0, y = 1, y = 2, and each observation represents a different type of iris: I believe setosa are the blue circles, versicolor are the orange triangles, and virginica are the green X's. Our goal is to build classifiers that make these classifications using this Bayes' rule setup — for us, the probability that y = 0 given X, the probability that y = 1 given X, and the probability that y = 2 given X.

The one we'll spend the longest on today is linear discriminant analysis, abbreviated LDA. As a quick note, LDA is an ambiguous acronym, so if someone asks "are you going to use LDA for this?", be aware of what setting you're in: LDA is also the abbreviation for latent Dirichlet allocation, which is from NLP.
But for us, since we're just doing classification, when we say LDA we mean linear discriminant analysis.

The model assumption made for LDA is that the distribution of X conditional on y = c is Gaussian, and what that means depends on the number of features. For illustrative purposes we're going to look at a single feature, and then I'll present the extension; I'm doing the single-feature case to help you build intuition for what's going on, but in general you can have more than one feature. For a single feature, this assumption gives you a normal density function where, in principle, both the mean and the standard deviation depend on the class. But in linear discriminant analysis this σ_c is assumed to be the same for all the classes: the distributions are allowed different means for the different classes, but we assume all of the classes share the same standard deviation.

When you do this, you can rewrite your P(y = c | X = x) from the expression above, and if you want to estimate μ_c and σ, you use the formulas I've provided here: the class-dependent sample means for the μ_c, and then, I believe it's the pooled standard deviation formula, for the shared standard deviation.

When we make classifications in the multi-class setting, you typically go with whichever probability is largest: whichever of our three classes has the largest estimated probability that y equals c is the one you predict. Using some algebra and some logarithm manipulations, you can show that choosing the class with the largest probability is the same as choosing the class c for which the discriminant function is largest. The discriminant function δ_c(x) is equal to x times the mean for that particular class divided by the variance, minus the mean for that class squared divided by two times the variance, plus the log of the prior probability of being in class c — using the estimates from above. This is known as the discriminant function for linear discriminant analysis.

We're going to go through how to fit this in sklearn and then do a step-by-step breakdown of what's going on. sklearn has linear discriminant analysis stored in sklearn.discriminant_analysis, and the class is just called LinearDiscriminantAnalysis — that was a mouthful. So we would do from sklearn.discriminant_analysis import LinearDiscriminantAnalysis; then, to save myself typing, I'm just going to copy and paste that here. Then we're going to go ahead and fit the data, so we do LDA.fit.
And did I make it X_train and y_train? I sure did. So we do X_train, y_train. And just like logistic regression and k-nearest neighbors, predict_proba works the same way, so we do LDA.predict_proba(X_train). The 0 column gives the y = 0 class probabilities, the 1 column the y = 1 class probabilities, and the 2 column the y = 2 class probabilities.

Before we do a step-by-step breakdown of what LDA is doing, are there any questions about the sklearn code or the setup of LDA?

"I'll just ask a quick question. For the classification, you go with the class where the discriminant function is largest?"

Sorry, give me one second. Say that again?

"You choose the class c for which the discriminant function is the largest, right? But then there's going to be another function to figure out the other classes, right?"

So this posterior probability is what you're estimating, and the rule is that for each observation we assign the class where that expression is largest. As we'll show in a little bit, that's equivalent to choosing the class where the discriminant function is largest.

"Oh, okay."

That's why it's called linear discriminant analysis: the discriminant function is linear in the features.

"That makes sense."

Yep. And I think the main idea is that dealing with the discriminant function is easier to handle, easier to get your head around, than dealing with the posterior expression directly.

Ramazan is asking: does GaussianNB give the same results? That's a different model we're going to learn, and it does give different results, as we'll see later in this notebook. They're two different models; both are based on this Bayes' rule rework, but they make different assumptions.

Okay, I think this will now get closer to Brooks' question. Here is a function I've made that takes in your feature value, your μ-hat, your σ-hat, and your estimated prior probability of being in class c. What we're going to do is look at the discriminant function for a single feature. We fit with multiple features above, and I think I meant to do this with just a single feature, so let me redo this real quick, because I just wanted petal length. You might be wondering why we're doing this — don't worry, it will make sense when I get through, if you're confused.
So I meant to do this with a single feature — let me refit and then redo this part. I just wanted to make an apples-to-apples comparison of doing it by hand versus the sklearn version. Basically, I've got my discriminant as a function here: it takes in a value of x, an estimated value for the mean (which, remember, depends on the class), an estimated value for the sigma (which does not depend on the class — it's the same for all three), and the estimated prior probability for each class. Here I calculate the estimates for those means, and here I calculate the estimate for the variance. Then I go through and plot.

We're just going to walk through what's being assumed. I'm doing it with a single feature because I can plot and visualize that, whereas in higher dimensions it gets harder. These are the actual sample distributions from the training set — histograms of petal length, with petal length on the horizontal axis — for y = 0, y = 1, and y = 2. What LDA assumes is that all of these are normal distributions, and these are the normal distributions we fit using the training data: they all have the same variance (or standard deviation, whichever you like), but they have different means. That's what we get from fitting on the training data.

From this we get the resulting discriminant lines — the fitted δ_c's — and the class we predict is the one whose line is highest. Here the solid blue line is the discriminant function for class 0: everywhere from a petal length of a little less than 3 and to the left, we would predict class 0; in the small region from a little less than 3 to around a little less than 5, we would predict class 1; and from that point onward to the right we would predict class 2. That's because the corresponding discriminant function is the largest in each of those regions. Then we can see the corresponding predictions for each petal length: the class 0 predictions are where the blue line is on top, the class 1 predictions are where the orange dotted line is on top, and the class 2 predictions are where the green dash-dot line is on top.

So that's the idea with LDA. Are there any questions on this hopefully less confusing breakdown, now that we're through it? One note: I originally fit this using all four columns of the training set, so if you're following along, you have to go back and re-do the fit with just petal length, and then when you come to run the predict again it will work.
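As a companion to that walkthrough, here is a hedged sketch of the single-feature, by-hand fit. The column name 'petal_length' and the variables X_train and y_train are assumptions about the notebook's setup, and the discriminant follows the δ_c formula from above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumed setup: X_train is a DataFrame with a 'petal_length' column,
# y_train holds the classes 0, 1, 2.
x = X_train['petal_length'].values
classes = np.unique(y_train)

# Estimates: class priors, class means, and one pooled (shared) variance.
pi_hat = {c: np.mean(y_train == c) for c in classes}
mu_hat = {c: x[y_train == c].mean() for c in classes}
n = len(x)
sigma2_hat = sum(((x[y_train == c] - mu_hat[c]) ** 2).sum() for c in classes) / (n - len(classes))

def discriminant(x_star, c):
    # delta_c(x) = x * mu_c / sigma^2 - mu_c^2 / (2 sigma^2) + log(pi_c)
    return (x_star * mu_hat[c] / sigma2_hat
            - mu_hat[c] ** 2 / (2 * sigma2_hat)
            + np.log(pi_hat[c]))

# Predict the class whose discriminant is largest, then compare with sklearn.
by_hand = np.array([max(classes, key=lambda c: discriminant(xi, c)) for xi in x])

lda = LinearDiscriminantAnalysis().fit(X_train[['petal_length']], y_train)
agreement = np.mean(by_hand == lda.predict(X_train[['petal_length']]))
print(agreement)  # expect 1.0, or very close (the variance estimates can differ slightly)
```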
Jonathan is asking: so the fitted normals aren't really from the data points themselves, but from the group variance and the per-class means computed from the data? To get these fitted normals, you calculate the means for each individual class using the training data, and then you compute the pooled standard deviation estimate from above. So that's the fitting procedure: they are coming from the data — we're using the data to estimate the distributions.

Are there any other questions? Ernesto is asking what happens if your points are far from the training distributions — then this wouldn't be a good model. Right: in general, if you're fitting a model that has certain assumptions and your data egregiously violates those assumptions, it's probably not going to be a good model. Okay, to keep things moving along, since we have a lot to get through today, I'll hold any other questions until later.

That was seeing it for a single feature. In general, for multiple features, the assumption is that the conditional distribution is a multivariate normal with a class-dependent mean vector — you have means in each of the individual components — and a single shared covariance matrix regardless of class. As an example, here's what a bivariate normal distribution looks like; try to imagine this in higher dimensions. If you'd like to go through it yourself, these are the formulas: this is f_c(x), and this is the resulting discriminant. For more than one dimension I'm not going to plot the discriminants like we just did, but I will plot the fitted model, which we saw earlier. I'm restricting to petal width and petal length just so I can show you what the classification regions look like. This is the training set: the training observations are the outlined points — blue circles, orange triangles, green X's — and the shaded regions show what the algorithm would predict in each region. Everything in the blue shaded region gets predicted as 0, everything in the orange shaded region as 1, and everything in the green shaded region as 2. The other areas are white just because of the range of the plot; if I extended the predictions to those regions they would also be shaded, so just ignore the parts of the plot that are white.
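If you want to reproduce shaded prediction regions like these on your own, here is one hedged way to do it with a prediction grid; the column names and plotting choices are assumptions, not the notebook's exact code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumed setup: two columns, petal width and petal length.
features = ['petal_width', 'petal_length']
lda = LinearDiscriminantAnalysis().fit(X_train[features].values, y_train)

# Build a grid over the plotted range and predict the class at every grid point.
w = np.linspace(X_train[features[0]].min() - 0.5, X_train[features[0]].max() + 0.5, 300)
l = np.linspace(X_train[features[1]].min() - 0.5, X_train[features[1]].max() + 0.5, 300)
WW, LL = np.meshgrid(w, l)
region = lda.predict(np.c_[WW.ravel(), LL.ravel()]).reshape(WW.shape)

# Shade the predicted regions, then overlay the training observations.
plt.contourf(WW, LL, region, alpha=0.2)
plt.scatter(X_train[features[0]], X_train[features[1]], c=y_train)
plt.xlabel('petal width')
plt.ylabel('petal length')
plt.show()
```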
We should also point out that linear discriminant analysis results in linear decision boundaries. The edges of these shaded regions are what are known as decision boundaries — basically the boundaries between where different classes are predicted. A linear decision boundary doesn't always work well, so we're going to learn two additional types of algorithms in this notebook that make different assumptions on the f's and allow for nonlinear decision boundaries.

The first of these is quadratic discriminant analysis. The assumptions are the same except for one very crucial detail — let me zoom in so you can see this formula — the covariance here is also assumed to be class dependent. In linear discriminant analysis we assumed the same covariance matrix for all of the classes; in quadratic discriminant analysis we assume different covariances depending on which class you're looking at. When you're doing quadratic discriminant analysis, this is your discriminant function, and it's quadratic in the features.

I'm not going to go step by step like I did for LDA, because the process is similar; we would just now estimate the covariance for each individual class. The fitting process in sklearn is the same: we import QuadraticDiscriminantAnalysis from sklearn.discriminant_analysis, and once again I'll copy and paste to save myself time. Then we fit, QDA.fit, and once again I'm restricting myself to petal width and petal length — and I'd better put those in a list or I'll get an error. Now I'm going to plot the linear discriminant analysis on the left-hand side and the quadratic discriminant analysis on the right-hand side. The left is the same picture as before, but on the right you can see how quadratic discriminant analysis allows for nonlinear decision boundaries.

Just because it's not linear doesn't mean it's better, though. As you can see, the blue and orange regions are vastly limited, and most of the plot would be predicted as iris type 2. It's hard to tell without additional observations, but it seems unlikely that iris type 2 would take over so much of the plot in the real world. It's a different, more complex model — remembering our bias-variance trade-off notebook, more complex in this case because we're allowing different covariance matrices depending on the class — and it tends to maybe overfit the data.
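For completeness, here is a minimal sketch of the QDA fit just dictated; X_train, y_train, and the two column names are assumptions about the notebook's variables.

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Same interface as LDA; the difference is the class-dependent covariance assumption.
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train[['petal_width', 'petal_length']], y_train)

# Class probabilities work the same way as with LDA's predict_proba.
qda.predict_proba(X_train[['petal_width', 'petal_length']])
```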
But if you have data that very clearly does have a nonlinear decision boundary, you may want to use a model that allows for that.

Okay, I see I have a question. Ernesto is asking: can you provide an answer with these classifiers stating some level of confidence? For example, if a data point is classified as y = 2 but is also close to the boundary with y = 1, would you be able to state some sort of error associated with that prediction from the model? It probably is possible to provide some sort of confidence interval or prediction interval. I'm not sure how frequently that sort of thing is implemented in industry, or if people just look at the probability. I think you probably could get some sort of interval, but I'm not sure how to do it without diving into the literature a little further, so if you're interested in doing that, you'd have to look into it.

Brooks is saying: isn't predict_proba a kind of confidence? predict_proba gives a point estimate of the probability of being in a given class, and using that point estimate you could provide a confidence interval around it. It is a measure of confidence in the everyday, two-people-talking sense, not the statistical concept of confidence: if you see something with a much higher estimated probability, you personally are more confident that it's correctly assigned to that class than for one with a lower probability, which is slightly different from the statistical concept of confidence.

People might be wondering when to use LDA versus QDA. LDA works better than QDA for smaller data sets — why is that? With QDA you have to estimate more parameters: the assumption in LDA is that the covariance matrix is the same for all the classes, so you only have to estimate one set of covariances, while in QDA you have to estimate many more, because each class gets its own covariance matrix. You also may think the data can be separated linearly, meaning a linear decision boundary, and if that's the case you should probably just use LDA. QDA may give you a better fit if you have a very large data set, and especially if you have data that you think is not separable by a linear boundary.

Another way to get a nonlinear decision boundary is what's known as the naive Bayes classifier. I believe this was Ramazan's question earlier; we're going to come back to that now.
Instead of making an explicit assumption on the functional form of the f_c — this is the most general form of naive Bayes — we make an assumption about how the f_c factors. f_c is a joint probability density, and the assumption for naive Bayes is that, while we still have a joint density, all of the individual features are independent of one another. That means the joint density can be broken down into a product of individual univariate densities, one for each feature. Typically, in implementations, for continuous features you assume something like a Gaussian, and for categorical features you assume something like a Bernoulli, so zeros and ones.

That's the idea: the big assumption for naive Bayes is that you're — maybe naively, because it might not be a good assumption — assuming each of the features is independent of the others, which lets you rewrite the density as that product. The point is that it's much easier to estimate individual univariate densities than a giant joint distribution.

You might be thinking this seems like a pretty strong assumption that probably isn't going to hold. One reason naive Bayes can work better than either LDA or QDA is that making this assumption actually introduces a lot of bias into the model, and — remember the trade-off — you can sometimes get better performance because you're increasing bias in a way that decreases the variance enough that the generalization error tends to go down.

I think I said this already, but typically what gets assumed is that a quantitative feature follows a normal distribution and a categorical feature a Bernoulli. Bernoulli, if you haven't heard the term before, is just the name for a coin toss; in this case a biased coin toss whose value of p is the proportion of observations.

Ramazan has a question: would Bayes models give confidence intervals? I thought we usually just interpret the posterior distribution, as opposed to the confidence intervals of a frequentist approach. My guess — I don't know for sure — is that you could do the Bayesian statistics approach and get, I forget what those are called, credible intervals or something, and there probably is also a frequentist approach that would let you get confidence intervals on different things.
Because I think LDA was developed before Bayesian statistics became a big thing, I would imagine there is a way to get classical confidence intervals or prediction intervals. Again, I don't know for sure, but I think there's probably a way to do it.

Okay, so how can we implement this? We can do it with GaussianNB. One big downside of sklearn's naive Bayes implementations is that all of your variables have to be the same type of feature: they either all have to be continuous or all have to be categorical, which is sort of a downside — at least as of last year; they may have updated it this past year. You can't have a situation where some columns are continuous and some are categorical; I'm not sure they have a version of the model that incorporates both. Our data has all continuous features, so we're going to use the GaussianNB model, which assumes that each of the individual distributions is Gaussian.

So we say from sklearn.naive_bayes import GaussianNB, and then, just like before, we make the model — copy, paste — nb = GaussianNB(), and then we fit the model: .fit on X_train with petal width and petal length, and then y_train. Okay, good.

Hmm — why did that not work? Maybe let's do .values. "No, the issue, Matt, is that your figure code has size equal to the letter s in every place. If you define a value for s before you plot, then it works." Okay, I see — this is what happens when I make these edits a month before while trying to get through the rest of my day; you do that to yourself. There we go.

So we can sort of see the difference in the boundary: the naive Bayes is probably an improvement over the QDA, a little bit less overfitting to the training set. But we still don't know whether it's going to give the best performance; we would have to do something like a stratified cross-validation. And I will say, thanks Brooks for pointing that out — you saved me a lot of time trying to figure out what was wrong.
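Putting the dictated pieces together, here is a minimal sketch of the GaussianNB fit; the column and variable names are again assumptions about the notebook.

```python
from sklearn.naive_bayes import GaussianNB

# Naive Bayes with a Gaussian assumed for each feature within each class.
nb = GaussianNB()
nb.fit(X_train[['petal_width', 'petal_length']], y_train)

# Probabilities and hard predictions behave like the other classifiers.
nb.predict_proba(X_train[['petal_width', 'petal_length']])
nb.predict(X_train[['petal_width', 'petal_length']])
```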
Okay, so maybe I'll pause for just one question, because we're really rushed for time today.

"I have a question. Is there a particular class of problems which is more suited for Bayes-based classifiers than KNN, for example?"

If you have data that tends to fit the assumptions of the different models, then they would probably perform well. I don't have a good sense of "oh, you always want to use this when you're doing image classification" or something like that, so I would say you just have to look at some of the data and get a sense. I think ultimately people tend to just fit it, in sort of a cross-validation approach, see whether it performed better than the other ones, and then maybe go back and check: okay, do the assumptions seem to be egregiously broken? Is it okay if this is the best one?

"And for this model, the main assumption is about X given y?"

So for LDA and QDA it's different types of Gaussian: for LDA it's Gaussian with the same covariance, for QDA it's Gaussian with different covariances for the classes. For naive Bayes the main assumption is independence, and in this particular naive Bayes, that the features are independent and each of the distributions is Gaussian.

"Thanks."

Yep. Awesome. Okay, so the next type of model we're going to learn about today — hopefully not taking too long, because I want to get to principal components analysis — is the support vector machine. We're going to go step by step through the way support vector machines were built up over time and see the development. We're breaking this into two things: linear support vector machines and then more general support vector machines.

Linear support vector machines are designed for data sets that are linearly separable. Remember, "linear" means the decision boundary is linear, so linear support vector machines produce linear decision boundaries. What do we mean by linearly separable? It's data whose classes can be separated by a hyperplane: in two dimensions, think of drawing a line; in three dimensions, a plane; and in higher dimensions, it's separating with an (n−1)-dimensional subspace — if you're in R^n, a hyperplane is an (n−1)-dimensional subspace.

There are two types of linear support vector machines, and the first one developed was the maximal margin classifier. Here's some phony data I've generated. Support vector machines were, to my knowledge, developed by the computer science community, and there the two classes are typically −1 and 1, whereas in our other classification settings it's been 0 and 1.
For us, we're going to go with the formulation that comes from the computer science side, because it allows the formulas to work out more nicely than with 0 and 1; just know that we have two classes, −1 and 1.

The question is: if I were to try to come up with a rule to separate these, what would I do? Well, you could just draw a line separating them. But then the question becomes: which line is the best line? Here are three different lines that separate the data — maybe it doesn't look like it because the edges touch, but the center is where the data is — a solid black line, a blue dotted line, and a red dash-dot line. The idea behind the maximal margin classifier is that the line that performs best, or generalizes best, is the one that is as far away from the training data as possible. Here we might say the red dash-dot line isn't very good, because over here we're likely to have a +1 cross over it, and over here a −1, and the opposite is true for the blue dotted line. With the black line, we have maximized the distance between all of our observed training points and the dividing line. So that's the idea behind a maximal margin classifier: you try to maximize the distance from the points to the decision boundary, and that distance is known as the margin — which is why it's "maximal margin."

Here is the setup; we're not going to dwell too much on it. M, I believe, is the margin distance. You might be thinking this looks like just the formula for a hyperplane, and it is: X times β equals 0 is the equation that defines the hyperplane. What you're basically saying is that you want a hyperplane such that all of your points fall outside a margin of M on either side of the hyperplane.

Okay, so we're going to go through and show how to do this in sklearn. For linear support vector machines for classification it's LinearSVC — support vector classifier. So from sklearn.svm, which is where they're stored, we import LinearSVC. We're going to make our model, and for now I'm going to put in something called capital C and ignore it; I'll touch on it a little later in the notebook. So max_margin = LinearSVC with C equal to 1000 — again, I'll talk about that in a bit — and I'm going to increase the maximum number of iterations. Notice there's a difference in syntax here.
In sklearn, max_iter has the underscore, whereas in statsmodels it did not. Then max_margin — I'm going to fit with X and y; I think it's just X here, not X_train — oh, I forgot the .fit. Okay.

So here is the line that is the decision boundary, the solid black line — slightly different from the black line we had above. The black dotted lines represent the margin, and here we calculate the margin as the minimal distance from the hyperplane to the points. The points touching the margin — about four blue dots and two orange triangles — are what are known as the support vectors. That's why it's called a support vector machine: the support vectors are the observations closest to the decision boundary, touching the margin. They're called support vectors because if I were to move any one of them, the decision boundary and the margin would likely change. If you're coming from mathematics, this has nothing to do with the concept of the support of a function — which is sort of annoying as a mathematician, because I spent a very long time trying to see if there was a connection; there isn't. They're just called the support because, in some sense, these points support the decision boundary: moving them would change it.

So that's the maximal margin classifier. You can see how this might not be the best classifier for every problem. For instance, what if we had a situation like this, where it's essentially the same data, but a couple of observations from each class are commingled, so they're not perfectly linearly separable? In the original example we could draw a line between the two classes and separate them perfectly; here we cannot, but we could draw the same exact line as before and do what I would say is an okay job of separating them: three misclassifications for the orange triangles and two for the blue circles, which is probably not so bad. So the idea is: what if we allow some of our training points to cross over the margin, and even the decision boundary if they need to? We're going to make our margin "soft."
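Here is a minimal sketch consolidating the near-maximal-margin fit dictated above; the synthetic X and y are stand-ins for the notebook's phony linearly separable data, not the actual arrays.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in for the notebook's phony, linearly separable data:
# class -1 centered at (-2, -2) and class +1 centered at (2, 2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 2)), rng.normal(2, 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A very large C leaves almost no budget for crossovers,
# so this behaves like the maximal margin classifier.
max_margin = LinearSVC(C=1000, max_iter=100_000)
max_margin.fit(X, y)

# The fitted hyperplane: coef_ . x + intercept_ = 0 is the decision boundary.
print(max_margin.coef_, max_margin.intercept_)
print(max_margin.predict([[-2, -2], [2, 2]]))  # expect [-1, 1]
```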
So it's still a linear support vector machine, but now we're going to allow points to cross over if they need to. The idea that gets switched is that you're doing the exact same problem as before, but now we multiply the margin by (1 − ε_i), where the ε_i determine a budget for how much we allow points to cross over both the margin and the decision boundary. The ε_i are determined in the following way: if the training point is on the correct side of the margin — in this example, the correct sides are over here and over here — then ε_i is 0. If it's on the wrong side of the margin but on the correct side of the hyperplane, it's between 0 and 1; that would be, say, an orange triangle on this side of the decision boundary but on the other side of the margin. And ε_i is greater than 1 if observation i is on the completely wrong side of the hyperplane, meaning it's being misclassified.

This is where that thing I called C comes in. The larger C is, the less wiggle room you have for things being on the wrong side; the smaller C is, the more you allow things to cross over the margin, and even the hyperplane altogether. It's written like this because, while the traditional setup uses something like alpha, this corresponds with how it works in sklearn: a larger C means a smaller budget for crossing over, and a smaller C means you're more likely to let things cross over.

To get a sense for how that works, what we're doing here is going through different values of C, fitting a support vector machine, and showing how the decision boundary and the margin change. This first one — I've changed the points slightly so you can actually see the boundary — is with C equal to 10, a larger value, meaning a smaller budget for crossovers, so it's pretty close to the original decision boundary. Then as C gets smaller, we'll see the margin get wider and the boundary start to shift. The smaller C is, remember, the more wiggle room we have for points to be on different sides.

So C is a hyperparameter, and just like ridge regression with alpha, or k-nearest neighbors with k, the value of C that works best for your problem can be determined with hyperparameter tuning and cross-validation. What we would probably do is set up a grid of different values of C, along the lines of the sketch below.
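Here is a hedged sketch of that tuning using GridSearchCV; the grid of C values is illustrative, and X and y are assumed to be the arrays being fit (for example, the synthetic data from the earlier sketch).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Illustrative grid of candidate C values (not taken from the notebook).
param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000]}

# With a classifier, cv=5 uses stratified 5-fold splits by default;
# scoring='accuracy' matches the "average cross-validation accuracy" idea.
grid = GridSearchCV(LinearSVC(max_iter=100_000), param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_)                    # the C with the best average CV accuracy
print(grid.cv_results_['mean_test_score'])  # the average accuracy for each C
```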
Like we kind of did already: for each value of C, find the average cross-validation accuracy, say, and then choose the one with the best average cross-validation accuracy.

Okay, so that's linear support vector classifiers. The first one was developed for data that are linearly separable; the second for data that are close to linearly separable. They're the same thing in sklearn — the only difference is what value of C you use. So what we saw was a relaxation from needing to be completely linearly separable to being close to linearly separable. The next natural extension is: what if you're not at all linearly separable — what if you have a nonlinear decision boundary? That's the idea behind general support vector machines.

Here's some data: a two-dimensional example where, no matter how you draw it, there's no line that would come close to correctly classifying these points. And here's an even simpler one-dimensional example, where there's nowhere we could place a dividing point that would do a good job. Ideally we'd want to classify the points in the middle — between about −0.3 and 0.3 — as 1, and everything outside of that as −1. But the two types of support vector machines we just learned can't do that with the data as it is.

So what can we do? There's a process known as lifting. In a situation like this, if we take both the original feature and its square as a new data set — one feature is the original x1 and the second feature is the square of x1 — we can now draw a linear decision boundary: a straight line that divides the two classes. So let's do LinearSVC — I'll make C a little bigger, say 10 — with a larger max_iter, and fit it on what I called X_new, and y. Okay, so now I can perfectly separate these two; this is the decision boundary I've created using this approach.

And that's really the idea in general. For instance, we could do this by hand with the previous 2D example — a paraboloid would probably work — but in general you're not always going to be able to say, "all right, I'm going to find the perfect combination of nonlinear transformations of the original data that will let me use a linear support vector classifier."
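Here is a hedged sketch of that lifting step for the one-dimensional example; the data are made up in the spirit of the plot, and the name X_new matches the notebook only by assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up 1D data in the spirit of the plot: label 1 near zero, -1 farther out.
x1 = np.linspace(-1, 1, 21)
y = np.where(np.abs(x1) <= 0.3, 1, -1)

# Lift: one column is the original x1, the other is x1 squared.
X_new = np.column_stack([x1, x1 ** 2])

# In the lifted space a straight line separates the two classes.
svm = LinearSVC(C=10, max_iter=100_000).fit(X_new, y)
print(svm.score(X_new, y))  # should be 1.0: the classes are separable after lifting
```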
00:57:16.000 --> 00:57:24.000 But luckily, there's a trick that does it for you. And the trick is known as the kernel trick. 00:57:24.000 --> 00:57:26.000 And so 00:57:26.000 --> 00:57:41.000 we have this concept known as a kernel function. The idea behind these kernel functions is they're going to allow us to fit a linear support vector machine in a higher dimensional space, maybe even an infinite dimensional one, 00:57:41.000 --> 00:57:48.000 and then project the resulting decision boundary down into the original space. So, to get there, we're gonna have to review some concepts. 00:57:48.000 --> 00:58:02.000 Again, this is a little bit mathematical. So if you're not a math person, just try and hang on and take away the bigger picture of lifting to a higher space and then going back down. 00:58:02.000 --> 00:58:10.000 So we need to know something about inner products. When you fit — we didn't really go over how to fit the algorithm, 00:58:10.000 --> 00:58:19.000 I just showed you the problem and then didn't say much — to fit the algorithm, what's going on in the background is that some inner products are being calculated. 00:58:19.000 --> 00:58:28.000 And so what is an inner product? Well, for support vector machines, it's basically just taking the corresponding components of the two vectors, multiplying them, and then adding the results together. 00:58:28.000 --> 00:58:47.000 Okay, so this gets calculated in the background when the SVM is getting fit. So suppose we have a function that's going to take our data and lift it into a higher dimension, which I'm gonna call phi. 00:58:47.000 --> 00:58:52.000 So what we would do when we're fitting our SVM 00:58:52.000 --> 00:58:59.000 is we would need to calculate this inner product in the lifted space. So as an example, 00:58:59.000 --> 00:59:05.000 consider this phi that takes in two-dimensional data and produces three-dimensional data like so. 00:59:05.000 --> 00:59:15.000 Okay, so (x1, x2) goes to a three-dimensional vector: x1 squared, square root of 2 times x1 times x2, and then x2 squared. 00:59:15.000 --> 00:59:21.000 Then, if we're fitting this, we would need to go through the process of computing an inner product between the higher dimensional vectors. 00:59:21.000 --> 00:59:27.000 So if we did that for phi(a) and phi(b), this would give us a1 squared times b1 squared — that's the first component, right? 00:59:27.000 --> 00:59:36.000 And then the next component would be 2 times a1 times b1 times a2 times b2. So that's the multiplying of the second components of each. 00:59:36.000 --> 00:59:47.000 And then the last term is plus a2 squared times b2 squared. But it turns out that we can rewrite that whole sum as (a1 times b1 plus a2 times b2) squared, 00:59:47.000 --> 00:59:56.000 and this is just the inner product of a and b, squared. So it's the inner product in the lower dimensional space, right? 00:59:56.000 --> 01:00:11.000 So what if we had something like this, where, yes, we have a higher dimensional transformation, but it turns out the inner product in the higher dimensional space is just some function of the inner product in the original data space? 01:00:11.000 --> 01:00:19.000 I keep moving my arms out like this today, so I'm very animated. And so this is the idea of a kernel function. 01:00:19.000 --> 01:00:29.000 So if you have a map that goes from the lower dimensional space to the higher dimensional space.
Like the example we just saw taking one dimension and squaring it to get a second dimension. 01:00:29.000 --> 01:00:37.000 If we have this map that lifts our space into a higher dimensional space, and that map also has a function K 01:00:37.000 --> 01:00:43.000 such that the inner product in the higher dimensional space is just a function of the lower dimensional data, 01:00:43.000 --> 01:00:52.000 then we would say that our map has a kernel function. Okay, so in this example, our kernel function is the square of the inner product. 01:00:52.000 --> 01:01:12.000 So that's the idea here: general support vector machines apply these maps that have kernel functions. The maps allow us to lift the data into a higher space, but the inner product in that higher space is just calculated in the lower dimensional space because of this kernel function idea. 01:01:12.000 --> 01:01:27.000 So that's known as the kernel trick. The 4 most common kernels, and the ones that you can implement in sklearn, are the linear kernel, which is not really doing what we just said, but it's what we kind of used for the last 2, right? 01:01:27.000 --> 01:01:34.000 So it's just the inner product of a and b. The polynomial kernel, which is what we used in our previous example, 01:01:34.000 --> 01:01:43.000 is gamma times the inner product of a and b, plus r, all of that raised to the degree d. 01:01:43.000 --> 01:01:51.000 Gamma and r here are more hyperparameters which you would tune, and d is the degree of the polynomial. 01:01:51.000 --> 01:02:03.000 So for us, we would have used degree 2. Then we've got the Gaussian radial basis function kernel, the Gaussian RBF kernel. 01:02:03.000 --> 01:02:11.000 It's given by this expression, where the norm here is the Euclidean norm. And then finally what's known as a sigmoid kernel, which is the hyperbolic tangent of gamma times the inner product plus r. 01:02:11.000 --> 01:02:24.000 The default in sklearn is the radial basis function kernel. So if you're doing this, you'd probably try that first, and if it doesn't work, you can try the other ones and play around. 01:02:24.000 --> 01:02:33.000 But this is probably your biggest go-to one. So we're going to show how to implement this in sklearn. 01:02:33.000 --> 01:02:40.000 So from sklearn.svm, we're gonna import SVC, all capital. 01:02:40.000 --> 01:02:52.000 So, support vector classifier. We're gonna go back to our one dimensional example. And so we're gonna set this up, and remember I said we used a kernel, 01:02:52.000 --> 01:02:56.000 a polynomial kernel. So we would do SVC, kernel equal to the string 'poly' for polynomial. 01:02:56.000 --> 01:03:12.000 We want to specify the degree, so we want a degree 2 polynomial, and then I'm just setting C to be that, and then I'll increase the max_iter just to be safe. 01:03:12.000 --> 01:03:17.000 So we're gonna fit it. I think I called it maybe just X, is that what I have? 01:03:17.000 --> 01:03:24.000 Yes, just X. 01:03:24.000 --> 01:03:30.000 You've gotta reshape, that's what that is. 01:03:30.000 --> 01:03:37.000 Okay. And so now here I'll plot that decision boundary. So the big X's represent the boundary. 01:03:37.000 --> 01:03:38.000 So to the left and the right of the big X's, I'm classifying as 01:03:38.000 --> 01:03:51.000 — sorry, outside the big X's is negative one, the blue circles, and inside is one, the orange triangles.
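Here is a cleaned-up sketch of what that live-typed cell is doing, with a made-up 1-D toy dataset standing in for the notebook's X and y (so the variable names and values are illustrative, not the notebook's).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in for the notebook's 1-D data:
# class 1 in the middle, class -1 on the outside
rng = np.random.default_rng(216)
x = rng.uniform(-1, 1, size=200)
y = np.where(np.abs(x) < 0.3, 1, -1)

# SVC expects a 2-D feature array, hence the reshape
X = x.reshape(-1, 1)

# Polynomial kernel of degree 2, roughly mirroring the lecture's settings
poly_svc = SVC(kernel="poly", degree=2, C=10, max_iter=100_000)
poly_svc.fit(X, y)

# The fitted boundary shows up as two cut points in the original 1-D space
grid = np.linspace(-1, 1, 401).reshape(-1, 1)
preds = poly_svc.predict(grid)
print(grid[np.where(np.diff(preds) != 0)].ravel())  # approximate boundary locations
```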
01:03:51.000 --> 01:04:02.000 So here we're going back to that other example. And here we can show how, like I said earlier, you could do something where you make what's known as a paraboloid. 01:04:02.000 --> 01:04:08.000 So that is also a polynomial kernel, 01:04:08.000 --> 01:04:13.000 and it's degree 2. 01:04:13.000 --> 01:04:19.000 And then I guess I put C equals 10 here. And then the other one we could try is that default one, 01:04:19.000 --> 01:04:27.000 the RBF kernel. So you would just do SVC, and because RBF is the default, you would just put in the value of C that you want to use. 01:04:27.000 --> 01:04:32.000 So here I'll just use 10 for simplicity. 01:04:32.000 --> 01:04:39.000 And then here we've plotted the decision boundary as the solid line, and the dotted line is sort of 01:04:39.000 --> 01:04:48.000 equivalent to the margin. So here you can see with the polynomial kernel, I get these lines — it's still sort of a linear decision boundary in some sense. 01:04:48.000 --> 01:04:54.000 And then with the RBF, you get what looks almost like a Gaussian distribution. 01:04:54.000 --> 01:05:00.000 And so which one works best is something that you would want to try with cross validation. 01:05:00.000 --> 01:05:08.000 Maybe you want to try both and then see which one generalizes better. Yeah, and so here are some references. 01:05:08.000 --> 01:05:15.000 I didn't do a lot of the mathematical details for this one, so I've provided some references here. 01:05:15.000 --> 01:05:22.000 They're all chapters of Elements of Statistical Learning, which I believe I have linked to the free web PDF of. 01:05:22.000 --> 01:05:49.000 Okay, so maybe if there are a few questions we can try and answer them, and then we'll get to principal components analysis. 01:05:49.000 --> 01:05:50.000 Okay, yeah, yeah. 01:05:50.000 --> 01:05:55.000 Can I ask a question? Sorry, I'm a little confused about these kernels and the inner product. 01:05:55.000 --> 01:06:06.000 So the inner products come into the second part of the lecture. I'm a little confused why you still need these. 01:06:06.000 --> 01:06:34.000 Is it because you're going into higher dimensions, is that why? 01:06:34.000 --> 01:06:40.000 So when you lift the data, the inner products would in principle be computed in that higher dimensional space rather than the one dimensional space. But that can be really cost prohibitive. So the larger the higher dimensional space that you lift into, the longer it would take to calculate the inner product, right? 01:06:40.000 --> 01:06:53.000 Because you have more dimensions. And especially, there are some cases like the RBF, which are lifting to what's known as an infinite dimensional space, 01:06:53.000 --> 01:07:05.000 a Hilbert space, where it wouldn't be possible for your computer to calculate the inner product in that way, right? 01:07:05.000 --> 01:07:16.000 So kernel functions are a trick where, if the mapping — the function that goes from the lower dimensional space to the higher dimensional space — 01:07:16.000 --> 01:07:26.000 has this kernel function, where the inner product up there is just a function of the original data, then you don't have to compute the inner product up there. 01:07:26.000 --> 01:07:33.000 You can compute the inner product in the original space. So that's the idea. 01:07:33.000 --> 01:07:34.000 Okay, thanks. 01:07:34.000 --> 01:07:41.000 Yeah.
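To see the point of that answer numerically, here is a tiny check (my own illustration, not from the notebook) that the degree-2 polynomial map phi from earlier really does satisfy phi(a) · phi(b) = (a · b)^2, so the 3-D inner product can be computed entirely in 2-D.

```python
import numpy as np

def phi(v):
    """The lifting map from the lecture: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

a = np.array([1.5, -2.0])
b = np.array([0.3, 4.0])

inner_lifted = phi(a) @ phi(b)        # inner product computed in the 3-D space
inner_kernel = (a @ b) ** 2           # same quantity via the kernel, all in 2-D

print(inner_lifted, inner_kernel)     # both print 57.0025
np.testing.assert_allclose(inner_lifted, inner_kernel)
```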
So someone asked, what type of classification problem would we use SVM for? 01:07:41.000 --> 01:07:57.000 So you can try it on any classification problem. There are some things to consider, like the fitting time — I don't have offhand, given the number of features and observations, 01:07:57.000 --> 01:07:59.000 the formula for how long it would take, but I know that's something you want to consider. 01:07:59.000 --> 01:08:17.000 You can just try it on a classification problem, and if it doesn't work well in the cross validation, or if you have other considerations, like it takes too long to fit or something like that, then you wouldn't use it, 01:08:17.000 --> 01:08:30.000 that sort of thing. And then Keithon's asking: in poly support vector classification, is it simply considering 2 boundaries instead of one, where they're both linear boundaries? 01:08:30.000 --> 01:08:32.000 So, 01:08:32.000 --> 01:08:37.000 for these problems it's still only computing a single one, right? So, 01:08:37.000 --> 01:08:46.000 here was when we did the polynomial kernel, right? And we got what looks like 2 different boundaries in the original space. 01:08:46.000 --> 01:08:52.000 But remember what's going on is, in the higher space, because we've lifted it, we can now draw a single boundary. 01:08:52.000 --> 01:09:00.000 And then what happens is it gets projected back down. So it goes backwards through the mapping. 01:09:00.000 --> 01:09:09.000 So this line will get projected back down into the one dimension, where it appears as 2 decision points. 01:09:09.000 --> 01:09:30.000 That's the idea, and it's the same idea here, where with the polynomial kernel, what's happening is these points are getting transformed through what's known as a paraboloid, then you're slicing it with a plane and bringing it back down. 01:09:30.000 --> 01:09:34.000 Okay. 01:09:34.000 --> 01:09:37.000 So in this notebook, we're taking an aside — so if you're trying to keep track of where this is, 01:09:37.000 --> 01:09:45.000 it is in the unsupervised learning folder. This is going to be the only notebook we cover out of there in live lecture. 01:09:45.000 --> 01:09:51.000 So it's principal components analysis. The reason we're doing this is it's probably one of the biggest 01:09:51.000 --> 01:10:00.000 dimension reduction techniques you're gonna want to know. Sometimes you'll have data that's really high dimensional, and for various reasons you want to 01:10:00.000 --> 01:10:11.000 reduce the dimension. So maybe you have too much data to fit your algorithm in an efficient time. 01:10:11.000 --> 01:10:16.000 Maybe you want to get rid of some noise in the data, and you want to recover features that are not noisy and are actually giving you signal. 01:10:16.000 --> 01:10:30.000 So for all of these reasons, you might want to apply various dimension reduction techniques. The one we're going to cover is PCA, or principal components analysis. 01:10:30.000 --> 01:10:39.000 So this is a technique — it's probably the most popular dimension reduction technique in data science — and we're gonna go through it. 01:10:39.000 --> 01:10:46.000 For the sake of time, we're not gonna cover the entire notebook. There's probably one section that I'm going to skip, just for the sake of time.
01:10:46.000 --> 01:10:54.000 But remember, these are already pre-recorded, so if you want to come back and go through the part that I end up skipping, you're more than welcome to do that 01:10:54.000 --> 01:11:24.000 on your own time. Okay. So I'm going to take a drink of water and then we'll go through this. 01:11:25.000 --> 01:11:32.000 So when you're reducing the dimension, you are losing what's known as information. 01:11:32.000 --> 01:11:37.000 And the idea behind PCA, or most of these techniques, is you want to reduce the dimension in a way that retains as much of the important information as you can. 01:11:37.000 --> 01:11:53.000 So the way that PCA tackles this is very statistical in nature. There's this idea in statistics that the information of a data set is located within the data set's variance. 01:11:53.000 --> 01:11:59.000 And so PCA looks to reduce the dimension of a data set by projecting the data from the higher dimensional space to a lower dimensional space while capturing as much of the original variance as possible. 01:11:59.000 --> 01:12:08.000 So your original data set has some variance, and what PCA tries to do is capture as much of that variance as it can 01:12:08.000 --> 01:12:16.000 while giving you a lower dimensional data set. The way it does this is through an optimization problem: 01:12:16.000 --> 01:12:27.000 it's trying to produce projections in a way that maximizes the variance of those projections. 01:12:27.000 --> 01:12:32.000 So here's the heuristic algorithm. The first step is to center your data 01:12:32.000 --> 01:12:35.000 so it has 0 mean. You can do this without really changing the data — you're just subtracting off the mean — and it's done for convenience; 01:12:35.000 --> 01:12:43.000 it makes the formulation easier. 01:12:43.000 --> 01:12:52.000 Next, you find — and this isn't you, this is the computer — you find the direction in space along which projections have the highest variance. 01:12:52.000 --> 01:13:03.000 This is going to be your first principal component. Then, among all the other directions — so if you have an m dimensional data set, so m features, right, 01:13:03.000 --> 01:13:06.000 after the first principal component you'll have m minus 1 possible directions left — 01:13:06.000 --> 01:13:16.000 the next step after finding the first principal component is you want to find the direction that is orthogonal, 01:13:16.000 --> 01:13:28.000 so at a right angle, to the first principal component and that maximizes the variance. This is your second principal component, and you keep doing this until you get through as many principal components as you can. 01:13:28.000 --> 01:13:33.000 So every time you find a new direction, it's orthogonal to all the previous directions. 01:13:33.000 --> 01:13:47.000 Okay, so we're gonna look at a somewhat silly example, a 2D example, and we'll see why it's silly — because you might be wondering, why would I want to reduce the dimension of a two-dimensional data set, it's already pretty small. 01:13:47.000 --> 01:13:58.000 So here I have 2 dimensions, x1 and x2. In unsupervised learning — we didn't talk about this — we don't have any y's, we have no outputs we're trying to predict. 01:13:58.000 --> 01:14:11.000 We are just trying to look at an X matrix. So, a matrix X.
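Here is a minimal NumPy sketch of those heuristic steps, on made-up data (my own illustration, not the notebook's code): center the data, then take the covariance matrix's eigenvectors, ordered by eigenvalue, as the successive orthogonal max-variance directions. This leans on the eigenvector connection that gets derived a little later in the lecture.

```python
import numpy as np

rng = np.random.default_rng(216)
# Hypothetical 2-D data with one dominant direction of spread
X = rng.multivariate_normal(mean=[2.0, -1.0], cov=[[3.0, 1.4], [1.4, 1.0]], size=300)

# Step 1: center the data so each column has mean 0
Xc = X - X.mean(axis=0)

# Steps 2+: the max-variance directions are the covariance matrix's eigenvectors,
# taken in decreasing order of eigenvalue (they are automatically orthogonal)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order].T            # row 0 = first principal component

# Project the centered data onto the components to get the transformed data
X_pca = Xc @ components.T
print(components)
print(X_pca[:3])
```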
So in the context of the supervised learning we've been doing up to this point, that X matrix might be our matrix of features. 01:14:11.000 --> 01:14:17.000 So the idea with PCA is, I don't care about what y is, I'm just trying to get something out of the X. 01:14:17.000 --> 01:14:26.000 And so here we just have X, with x2 plotted against x1. 01:14:26.000 --> 01:14:31.000 So Jacob is asking: when you say scale the features so they have mean 0, does this include the one hot encoded features, or can you just use a standard scaler on every single feature? 01:14:31.000 --> 01:14:47.000 So that's a good question, Jacob. Here I'm just implicitly assuming we have continuous features for the setup, because it's easier to look at. 01:14:47.000 --> 01:14:54.000 In general, you do not want to scale one hot encoded features. 01:14:54.000 --> 01:15:02.000 You never wanna scale those; you always keep them as zeros and ones. And you can apply PCA to a matrix of zeros and ones and it's fine. 01:15:02.000 --> 01:15:11.000 But for this particular setup, to get the intuition behind PCA, I'm just restricting myself to continuous features. 01:15:11.000 --> 01:15:19.000 Okay, so remember I said for PCA the idea is we want to find the direction of maximal variance. 01:15:19.000 --> 01:15:29.000 And variance means spread. So of the directions in this plane, the one where the data has the greatest spread is the one that I'm tracing out 01:15:29.000 --> 01:15:41.000 with my mouse, my cursor. Okay. And then what's left after that is just going to be the orthogonal direction, which is the one I'm currently tracing out. 01:15:41.000 --> 01:15:44.000 So let's go through 01:15:44.000 --> 01:15:53.000 and see this by doing PCA. PCA is stored in a sub-package of sklearn called decomposition. 01:15:53.000 --> 01:15:57.000 So from sklearn.decomposition 01:15:57.000 --> 01:16:03.000 we're going to import PCA, all capitals. 01:16:03.000 --> 01:16:14.000 Then we're gonna make our PCA object. So we're gonna do lowercase pca equals capital PCA, and the number we put in is the number of dimensions we want to project down to. 01:16:14.000 --> 01:16:21.000 So for us, that will be 2. And then to fit the data, you call pca.fit, 01:16:21.000 --> 01:16:25.000 and there's no y here, so you just put in X. 01:16:25.000 --> 01:16:34.000 Okay. All right, so now this is a function that will take in our fitted PCA and then draw the vectors. 01:16:34.000 --> 01:16:43.000 So what does that mean? These solid black lines are the directions in 01:16:43.000 --> 01:16:51.000 your data space that we would project onto. 01:16:51.000 --> 01:16:55.000 So this long black line represents the vector that we're going to project onto for the first PCA direction. 01:16:55.000 --> 01:16:59.000 And this shorter black line 01:16:59.000 --> 01:17:07.000 represents the direction of the second principal component, the second PCA direction. Okay, so the way PCA takes your original data 01:17:07.000 --> 01:17:18.000 and produces new transformed data is it's going to take each of these observations, project it first onto the longer vector, and then project it onto the shorter vector.
01:17:18.000 --> 01:17:27.000 And then those projections will give us the coordinates for the new data. So how do we get that with sklearn? We do 01:17:27.000 --> 01:17:34.000 X_fit equals pca.transform. So this works an awful lot like a scaler object: you have fit and then you have transform. 01:17:34.000 --> 01:17:45.000 So you call pca.transform and we input the X. And this is maybe bad notation — I call it fit just because that's what I've always done. 01:17:45.000 --> 01:18:03.000 So this is what's known as the PCA transformed data. These represent the same observations that you see here, but now they've been projected onto a different space — if this were higher dimensional data it would be projected down onto 2, but since it's 2 to 2, it's the same number of 01:18:03.000 --> 01:18:06.000 dimensions. But we'll see what's going on in a second. 01:18:06.000 --> 01:18:15.000 So I've got a question here. Okay, so I believe I've answered the question. 01:18:15.000 --> 01:18:19.000 The question was: why do we use PCA here, don't we already have 2 dimensional data? 01:18:19.000 --> 01:18:23.000 Is this why you said you are just showing how it operates in a simple example? Yep, that's exactly it. 01:18:23.000 --> 01:18:31.000 It's just showing it in a way that I can visualize, so you see what's going on. 01:18:31.000 --> 01:18:39.000 Okay. So we are going to look at the maximal variance formulation of PCA just so we can see what's going on. 01:18:39.000 --> 01:18:48.000 It makes the setup easier. There are other ways to set up PCA that are in the practice problems for the PCA notebooks. 01:18:48.000 --> 01:18:57.000 So let's suppose that we have n observations of m features, x1 through xm. Each of these is an n by 1 vector 01:18:57.000 --> 01:19:02.000 containing the observations of that feature. So again, I'm assuming that they all have mean 0. 01:19:02.000 --> 01:19:12.000 We could do this simply by subtracting off the mean and our data would be fine. Okay, so we're going to find the first principal component, 01:19:12.000 --> 01:19:18.000 and then you can extrapolate to get the next however many components. 01:19:18.000 --> 01:19:26.000 So we're gonna set up X as an n by m matrix where each column is one of the features. 01:19:26.000 --> 01:19:31.000 And then our goal is to find a weight vector w such that the norm of that vector is one and the variance of the projection is as big as possible. 01:19:31.000 --> 01:19:41.000 The projection here is the w's times the x's, or X times w as a linear algebra expression. 01:19:41.000 --> 01:19:46.000 So you want to maximize this variance. The variance of X times w is equal to this, 01:19:46.000 --> 01:19:58.000 and because we've centered the columns, this is what the variance is equal to. 01:19:58.000 --> 01:20:09.000 Then you can pull the weight vector outside, and what you're left with is w transpose times the covariance matrix of X times w. 01:20:09.000 --> 01:20:20.000 So then what we're trying to optimize is this w transpose Sigma w, and now we're constraining ourselves to the fact that w transpose w minus one has to be 0. 01:20:20.000 --> 01:20:40.000 Where does this come from? That's just saying the norm has to be equal to one.
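Written out, the problem described over the last couple of minutes is the following (my LaTeX transcription of the spoken formulas, with Sigma denoting the sample covariance matrix of the centered X):

```latex
\max_{w}\; \operatorname{Var}(Xw) \;=\; w^{\top} \Sigma\, w
\quad \text{subject to} \quad w^{\top} w = 1 .
```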
So if you go back to Calc 3, you can use Lagrange multipliers and find out that maximizing the variance in this constrained optimization problem means you want the w such that 01:20:40.000 --> 01:20:57.000 Sigma w is equal to lambda w. So this is a standard eigenvalue setup. The principal components are going to be eigenvectors corresponding to the eigenvalues of the covariance matrix. 01:20:57.000 --> 01:21:04.000 So the first PC, the one that maximizes the variance the most, is going to be the 01:21:04.000 --> 01:21:22.000 eigenvector corresponding to the largest eigenvalue. So if you're not a math person and all those words just sounded like a foreign language to you, just remember: it's the direction that maximizes the variance of the projection. 01:21:22.000 --> 01:21:28.000 That's all that matters. Okay. So as a quick aside before we go back to 01:21:28.000 --> 01:21:39.000 implementing this stuff: you typically need to scale your data before you fit the PCA model in practice. 01:21:39.000 --> 01:21:53.000 If one of your columns has a much larger scale than the other columns, because we're maximizing variance, what tends to happen is that PCA just picks up on that much larger scale column instead of doing what we would like it to do. 01:21:53.000 --> 01:22:10.000 So it's a common approach to run the data through the StandardScaler object first, accounting for the fact that you may have to ignore categorical variables if you have those. 01:22:10.000 --> 01:22:22.000 Okay, so I talked about this w. The w's are known as the component vectors, and we can actually get the components out with .components_. 01:22:22.000 --> 01:22:28.000 So we're going to do pca.components_. 01:22:28.000 --> 01:22:35.000 Okay, and so here — 01:22:35.000 --> 01:22:44.000 sorry, I was trying to read what I was writing earlier. So here we have a numpy array, and within that array we have 2 vectors. 01:22:44.000 --> 01:22:56.000 The first one is w1: this is the eigenvector corresponding to the largest eigenvalue of the covariance matrix, 01:22:56.000 --> 01:23:05.000 so the first PCA component. And then the second entry is the second PCA component. Okay, so we can store these in variables, 01:23:05.000 --> 01:23:10.000 w1 and w2. 01:23:10.000 --> 01:23:15.000 And so now what I'm gonna go ahead and do — I've plotted these vectors. 01:23:15.000 --> 01:23:22.000 So the long vector represents w1, the shorter vector represents w2, and I've got this red X. 01:23:22.000 --> 01:23:24.000 This red X is one of the data points that I would like to get transformed through this PCA projection. 01:23:24.000 --> 01:23:36.000 And so you see these red dotted lines that trace along the vectors. 01:23:36.000 --> 01:23:46.000 The horizontal position in the PCA projected space is going to be given by the distance from 0 to this point 01:23:46.000 --> 01:24:03.000 on w1, so it'll be this length. And then the vertical axis position of this red X in the PCA space is going to be the distance from 0 to this point on the vector w2. 01:24:03.000 --> 01:24:10.000 And so now we can see this is what it looks like. So here my red X is at that new position; 01:24:10.000 --> 01:24:16.000 now we're in the PCA projected space.
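Here is a small sketch of that by-hand projection on made-up data (again my own illustration, not the notebook's code): take a point, dot the centered point with each row of pca.components_, and compare with what pca.transform gives. The two should agree up to floating point noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(216)
# Hypothetical 2-D data standing in for the notebook's X
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.4], [1.4, 1.0]], size=300)

pca = PCA(2)
pca.fit(X)

w1, w2 = pca.components_          # each row is one component vector

# Pick one observation (the "red X") and project it by hand.
# PCA subtracts the fitted mean before projecting, so we do the same.
point = X[0]
by_hand = np.array([(point - pca.mean_) @ w1, (point - pca.mean_) @ w2])

# sklearn's version of the same thing
via_sklearn = pca.transform(point.reshape(1, -1))[0]

print(by_hand)
print(via_sklearn)                # should match up to floating point precision
```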
So the vectors have been brought over just to help as a reference point. 01:24:16.000 --> 01:24:25.000 So here is the PCA transformed value that was done by hand using the projection 01:24:25.000 --> 01:24:33.000 that we found, and the blue circle is what you'd get with sklearn. They're slightly different, probably because of 01:24:33.000 --> 01:24:36.000 different levels of precision, but they're essentially the same, just a very slight difference. 01:24:36.000 --> 01:24:45.000 Okay. So because we only have 3 minutes left, I'm not gonna pause for questions just yet. 01:24:45.000 --> 01:24:58.000 I do want to get through this idea of explained variance. So remember we said we want to preserve as much of the variance of the original data as possible. 01:24:58.000 --> 01:25:11.000 We can find how much of it we've captured with explained variance. So we do .explained_variance_, and what this tells you is a breakdown of the variance of X times w. 01:25:11.000 --> 01:25:19.000 Yeah, it's a breakdown of the variance of X times w. So the variance of X times w1 is 80.45, 01:25:19.000 --> 01:25:33.000 and the variance of X times w2 is 4.25. This becomes much more interpretable if you look at what's known as the explained variance ratio. 01:25:33.000 --> 01:25:41.000 And so you can see here that we have 0.949. 01:25:41.000 --> 01:25:57.000 So this is saying that basically 95% of the original data set's variance is captured in the first principal component direction, and then about 5% of the original variance is captured in the second principal component's direction. 01:25:57.000 --> 01:26:05.000 So that's the idea of the explained variance ratio. 01:26:05.000 --> 01:26:19.000 Okay, so I wanted to keep going — I just wanted to see how far we got. So what we're gonna do is use this data set of images of famous people's faces, and then use it to demonstrate the idea of the explained variance curve. 01:26:19.000 --> 01:26:20.000 So this is a lot of data. 01:26:20.000 --> 01:26:31.000 Each image is 87 by 65 pixels, which is 5,655 features. 01:26:31.000 --> 01:26:36.000 And so we're going to run the data through PCA. 01:26:36.000 --> 01:26:44.000 And here, when you have image data, you tend to just scale it. So each pixel is represented in grayscale, 01:26:44.000 --> 01:26:54.000 represented by a value from 0 to 255, where 255 means the most light it can show and 0 means a completely black pixel. 01:26:54.000 --> 01:26:58.000 With these types of problems, it's standard to scale it with what's known as min-max scaling, 01:26:58.000 --> 01:27:02.000 so the minimum value goes to 0 and the maximum value goes to 1. So here I have not specified a number of components. 01:27:02.000 --> 01:27:21.000 When you do this, what it does is it just fits as many components as it can. And after it's done fitting, which maybe will take a little bit, we'll see how many components were fit by looking at the shape of components_. 01:27:21.000 --> 01:27:31.000 And then after we've done that, we'll talk about the idea of the explained variance curve. 01:27:31.000 --> 01:27:37.000 So while that was fitting, someone had the question: it seems as if the variance in the second component is small. 01:27:37.000 --> 01:27:45.000 So in this previous example, yes, it was small. So now — 95% of the variance was captured by the first direction, right?
01:27:45.000 --> 01:27:51.000 And because this is a 2 dimensional problem, that just means everything that's left has to go in the second one. 01:27:51.000 --> 01:27:56.000 Okay. 01:27:56.000 --> 01:28:07.000 All right, so now that we've fit this, let's see what the shape is. So this is telling us that we have 3,023 — not features, 01:28:07.000 --> 01:28:16.000 PCA components, right? Yes. Okay, so 01:28:16.000 --> 01:28:25.000 oftentimes you may be wondering, how many components should I use? If you're trying to do something like visualization, maybe you're just gonna look at 2 components and be happy. 01:28:25.000 --> 01:28:36.000 If you're trying to find a way to reduce the dimension of the data, like we're trying to here, while still maintaining as much variance as you can, you may want to look at the explained variance curve. 01:28:36.000 --> 01:28:47.000 So the explained variance curve plots the number of principal components you've used against the cumulative explained variance ratio. 01:28:47.000 --> 01:28:56.000 So you take this explained variance ratio that we looked at above, right, and the cumulative explained variance ratio 01:28:56.000 --> 01:29:03.000 is just the cumulative sum of all the entries in there. 01:29:03.000 --> 01:29:09.000 For this earlier example it would look kind of silly, because we only have 2: it would be the same value here and then it would be 1 here. 01:29:09.000 --> 01:29:28.000 So by plotting that, it gives you a sense of, okay, if I have, say, a hundred principal components, that means I'm capturing about 90% of the original data set's variance. And what you can do is sometimes you'll look for an elbow in the curve — and we say elbow 01:29:28.000 --> 01:29:37.000 because you can imagine putting your elbow up like this, if you look at the picture of me right now — and the elbow is sort of where the 01:29:37.000 --> 01:29:46.000 increases in variance start to become so small that it's not worth the extra cost of either keeping the component or computing the component. 01:29:46.000 --> 01:29:54.000 So for us, maybe the elbow would be at about 0.9. It's this idea of diminishing returns for keeping additional components. 01:29:54.000 --> 01:30:03.000 Another thing you could do is just say, well, I wanna keep 95% of the original variance, and then just set it to be 95%. 01:30:03.000 --> 01:30:13.000 So when you do that, you can set the number of components equal to a fraction 01:30:13.000 --> 01:30:23.000 instead of an actual number like 10 or 15 or whatever. Okay. 01:30:23.000 --> 01:30:32.000 All right, so based on the time, I'm gonna ask that you finish the rest of the PCA notebook on your own. 01:30:32.000 --> 01:30:39.000 Yeah, I think just finish it on your own time. We got to the most important stuff, which is the model 01:30:39.000 --> 01:30:56.000 and the explained variance ratio. For the rest of the stuff, I would encourage you to go through it on your own, either just by reading through it or by watching the pre-recorded lecture. 01:30:56.000 --> 01:31:05.000 Okay. So I will stop recording, and then I'll hang around for any questions that people have.
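As a closing sketch of those last two ideas — the cumulative explained variance curve and passing a fraction as the number of components — here is a hedged example on the LFW faces data mentioned above. The fetch_lfw_people settings, the MinMaxScaler step, and the plotting details are my assumptions about what the notebook does, not a copy of it.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Faces data; min_faces_per_person and resize are guesses at the notebook's settings
faces = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

# Min-max scale the pixel values so they run from 0 to 1, as described in the lecture
X = MinMaxScaler().fit_transform(faces.data)

# Fit with no n_components so sklearn keeps every component it can
pca = PCA()
pca.fit(X)

# The explained variance curve: components kept vs. cumulative explained variance ratio
cum_ratio = np.cumsum(pca.explained_variance_ratio_)
plt.plot(np.arange(1, len(cum_ratio) + 1), cum_ratio)
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance ratio")
plt.show()

# Or skip the elbow-hunting entirely and ask for 95% of the variance directly
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)
print(X_reduced.shape)  # second entry = number of components needed for 95%
```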