Principal Components Analysis I
Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we'll learn about a dimension reduction technique called PCA, which is short for principal components analysis. This will be the first in a series of videos on this topic. Let me go ahead and share my Jupyter notebook and we can get started.

Throughout this series of videos, and in this particular notebook, which is stored in the unsupervised learning lectures under dimension reduction, we will learn the concept behind PCA as well as the mathematics that underlies it. We'll show you how to implement PCA in sklearn, we'll demonstrate things like explained variance and the explained variance curve, we'll introduce the Labeled Faces in the Wild dataset, and then we'll show you how you can interpret the results of PCA using an NBA dataset. So there are a couple of different datasets, which is why we have a couple of different videos on this subject.

In any sort of dimension reduction algorithm or procedure, you're going to lose some information in the dataset. There's no way to take a very high dimensional dataset and reduce it to a smaller dimension without losing at least a little bit of the information stored within it. So typically these algorithms and methods try to reduce the data's dimension in a way that retains as much information as possible. What does that mean? Well, what that is taken to mean typically depends on the approach.

PCA tries to tackle this problem in a very statistical manner. What do I mean by that? Well, in statistics there's the idea that the information of a dataset is stored within that dataset's variance; the information the data gives you is found within the variance of that data. PCA takes advantage of that: it looks to reduce the dimension of a dataset by projecting (and here we mean the linear algebra sense of the word) the data onto a lower dimensional space while trying to capture as much of the original variance in the dataset as possible. In terms of an optimization problem from math or statistics, your goal is to project into a lower dimensional space while maximizing variance.

Heuristically, it works like so. Step one: you center your data so that each feature has mean zero. Maybe in your original dataset each of the columns has a non-zero mean; you can just subtract off the mean and still have essentially the same dataset. This is really just done for convenience, since the problem is a lot more complicated to look at if the data doesn't have mean zero. Step two: you find the direction in space, so think of that as a vector, along which projections have the highest variance. That is, if I take this high dimensional data and project it onto this vector, the projections will have the highest variance possible. This is the first principal component. Step three: you find the direction orthogonal to that one that also maximizes the variance. This is the second principal component. So essentially think of this as doing the same thing as in the previous step, but now you need to limit yourself to directions that are orthogonal to the first one.
And then you continue in this way, where the k-th principal component is the direction spanned by the vector that maximizes the variance while still being orthogonal to the previous k - 1 components. That might sound hard, and it may be hard to visualize if you're a visual learner, so I think it's useful to look at a picture, and we're going to look at this 2D example.

OK, so here's x1 and here's x2. If we're thinking of maximal variance, variance means spread, and the data spreads a lot along the axis that my mouse is tracing out. And in two dimensions, once you have one direction, your other direction is set by the orthogonality restriction, so the other direction would just be this one. So we have this data, and we're going to demonstrate how to use sklearn's PCA to give you a better visual sense of what's going on when you do PCA.

First we're going to import PCA; it's stored in the subpackage called decomposition, so from sklearn.decomposition we import PCA. PCA is sort of both a model and a transformer, if that makes sense, and we'll see what we mean by that in a second. The first thing we need to do is make a PCA object, so little pca is equal to capital PCA, and the first argument you put in is the number of dimensions you'd like to project onto. We're going to choose two. It might seem silly because we already have two dimensional data, but we'll see why we choose two in a second; it's just to visualize what's going on. Then we can fit the data just like we would an algorithm or a scaler: we just call pca.fit, and I think I stored the data in capital X.

OK. Now, this is what you should see if the last thing you ran was a .fit. This next cell is a function that's going to draw those direction vectors that I was talking about in the heuristic. So I'm going to plot the scatter plot that we just saw, but on top of that I'm going to draw the vectors that PCA found.

So here's x1, here's x2, and these are our points that we just ran through PCA up above with sklearn's PCA object. This long vector that I'm tracing out with my mouse right now is the first principal component direction; it's the direction along which, if we were to project onto it, we would have the maximum variance possible. And this is the second principal component vector which, again, is more or less chosen for us in two dimensions once we have the first one, by the orthogonality constraint.

The vectors drawn above are called the component vectors, as I think I just said without telling you first. When we want to get the transformed version of the data, which we're going to show you in a second, that is, the PCA transformed version, these are the vectors we project onto: we take the scalar projection of each point onto the vectors, and that gives us our new coordinates in the PCA transformed space. So let's visualize this. We said PCA is kind of like a model and kind of like a transformer, so we'll see what we mean by that now. So these fits are going to be the projected values that we're talking about.
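Before we look at the transform step, here is a minimal sketch of the import-and-fit sequence just described. The blob data generated here is made up purely for illustration; it is not the exact dataset shown in the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up 2D data with more spread in one direction than the other,
# standing in for the notebook's example dataset.
rng = np.random.default_rng(216)
X = rng.multivariate_normal(mean=[0, 0],
                            cov=[[3.0, 1.2],
                                 [1.2, 0.8]],
                            size=200)

# Make the PCA object, asking to project onto two dimensions,
# then fit it just like a scaler or a model.
pca = PCA(2)
pca.fit(X)

# The fitted direction (component) vectors, one per row.
print(pca.components_)
```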
You call pca, and instead of .predict, because we're not predicting anything, we call .transform, just like with a StandardScaler. This takes the fitted PCA, which is what we fit above, takes in X, and transforms it, meaning it takes all of these points and projects them down onto the corresponding vectors. That gives us our new, PCA transformed data, as if we've applied PCA to it.

So this is the PCA transformed data, and this is what it looks like. What I've plotted on the horizontal axis is called the first PCA value; that's the projection of this point onto this vector, so here would be the value for that one. And then the second PCA value is the projection of this point onto this vector. OK. We'll have a nice plot that traces that out step by step, but first we're going to give you an overview of the mathematics that makes this work. Remember, it's OK if you're not comfortable with the mathematics, or frankly not interested in it, but there are going to be people who watch these videos and go through these lectures who are interested, so we are going to talk about the math behind PCA for just a little bit.

To understand the math, we're going to consider the maximal variance formulation of the problem, meaning we're going to optimize by finding the vector such that, if we were to project onto it, we would get maximum variance. There are equivalent formulations that we go over in the practice problems notebook, which you can find in that folder.

Suppose we have n observations of m features, x_1 through x_m. These are all n by 1 vectors containing the observations of the m features. For ease of notation, we're again going to assume that these have already been centered to have arithmetic mean zero. We're going to show you how to find the first principal component, and then the other ones can be found in a similar fashion.

Let X be an n by m feature matrix, so n observations on the rows and m features on the columns. What we want to do is find a weight vector, or projection vector, w = (w_1, ..., w_m), with the norm of w constrained to be one (we make this constraint because otherwise we could just choose a vector with a very large norm, and that would give us a large variance), so that the variance of w_1 x_1 + w_2 x_2 + ... + w_m x_m, which is equal to the variance of X w, is maximized. Remember we talked about taking a point and projecting it: X w is the vector of scalar projections, so it takes each of the points and projects it onto the vector w, and essentially what we're doing is maximizing the variance of all those projections. Because we've centered the columns of X to have mean zero, the variance of X w is just equal to the expected value of w^T X^T X w; this is a fact from statistics, probability theory, and linear algebra. The w's themselves are fixed numbers, meaning they're not random, so you can pull them out of the expected value, and what you're left with is w^T times the covariance matrix of X times w. Why the covariance matrix? Well, that's because we have centered the columns of X.
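Written out, the setup just described is the following constrained maximization; this is only a restatement of the spoken derivation above, with Sigma denoting the covariance matrix of the centered X.

```latex
% First principal component as a constrained maximization problem,
% where X is the centered n-by-m feature matrix and \Sigma is its
% covariance matrix (up to the usual 1/n versus 1/(n-1) convention).
\[
  \mathbf{w}_1
  \;=\;
  \operatorname*{arg\,max}_{\lVert \mathbf{w} \rVert = 1}
  \operatorname{Var}\!\left( X \mathbf{w} \right)
  \;=\;
  \operatorname*{arg\,max}_{\lVert \mathbf{w} \rVert = 1}
  \mathbf{w}^{\top} \Sigma \, \mathbf{w} .
\]
```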
OK. So essentially what we've got here is that we're trying to optimize some function f(w) = w^T Sigma w, which is a function of w because, remember, w is the thing we can change while Sigma is fixed. We're trying to make this as large as possible, but we're constrained by the fact that w^T w - 1, the squared norm of w minus one, has to equal zero; that's the constraint that the norm is equal to one. If you remember your Calc III, this is a Lagrange multiplier problem. You can go through the Lagrange multiplier setup and show that the derivative of all of this with respect to w ends up being 2 Sigma w - 2 lambda w, where lambda is the Lagrange multiplier. When you set this equal to zero and solve, you end up with the covariance matrix times w equal to lambda times w, and if you've taken linear algebra, this is your standard eigenvalue setup. So it turns out that the vector w that maximizes the variance has to be an eigenvector. And which eigenvector is it? It's the one that corresponds to the largest eigenvalue of the covariance matrix of X. This is known as the first principal component. If you go through this process and find the remaining principal components, it turns out that they are the eigenvectors corresponding to the remaining eigenvalues of Sigma in decreasing order; the second principal component vector is the eigenvector corresponding to the second largest eigenvalue, and so on. And because Sigma is an m by m real, symmetric, positive semi-definite matrix, it has m eigenvalues (assuming n is greater than m) with orthogonal eigenvectors.

So that's the math behind it. If you weren't interested in that, I hope you fast-forwarded through it, so that now you're listening to me tell you that we're done with the math.

In practice (in the derivation, X could have been anything) you usually have to scale the data first. Why do we have to scale it? Well, say you have a feature matrix where one of the columns is on the order of hundreds of thousands and everything else is in the tens or ones. Then what PCA actually picks up on is that the column with the much larger scale is the one that maximizes the variance. By applying StandardScaler first, you make sure that you're actually uncovering which direction is the variance maximizing one, and not just the one that corresponds to the column with the larger scale.

So we did a lot of math and we gave you a heuristic; let's try to combine them now and give you a better sense of what's going on. Again, we're going to do this with sklearn's PCA. Those little w vectors that we talked about in the math are stored in pca.components_. So here is pca.components_; they're called this because they're the component vectors.

OK, so this is w1, if I'm reading it correctly, and this is w2. w1 is the one corresponding to the largest eigenvalue; that's the longer of the two vectors. w2 is the shorter of the two vectors.

Now, this is the original plot, with the original data faded to give a better sense of the vectors. The long vector is w1, the shorter vector is w2. And then, for instance, this red X here is one of our observations; how do we get its PCA transformed version?
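Before answering that, here is a quick numerical sanity check of the eigenvector claim from the math above. This is a sketch that reuses the X and pca objects from the earlier sketch; in a real application you would typically run the data through StandardScaler first, as just discussed.

```python
import numpy as np

# Covariance matrix of the centered data.
X_centered = X - X.mean(axis=0)
Sigma = np.cov(X_centered, rowvar=False)

# Eigendecomposition of the symmetric matrix Sigma; eigh returns
# eigenvalues in ascending order, so reverse to put the largest first.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]

# The rows of pca.components_ should match these eigenvectors
# (printed as rows below), possibly up to a sign flip.
print(eigenvectors.T)
print(pca.components_)
```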
Well, you take the projection of this point onto w1, which is this red dotted line here, and then you take the projection of the point onto w2, which is this red dotted line right here, and then this figure rotates things around, so this is the corresponding point in the projected data. This red X was found by doing a hand calculation following the mathematics up above, while the blue dot is what you get with sklearn. They're slightly off from one another; that's just a difference in the computational approach. I believe sklearn probably does something slightly different, which is why there's a slight discrepancy, but this is the idea.

All right, so that's going to be the end of this first video. In the next video we'll talk about something called explained variance, which allows you to get an idea of how many dimensions you should project down to. I hope you enjoyed learning about the idea behind PCA as well as seeing the math behind how the idea is implemented. I hope to see you in the next video, where we talk about the explained variance curve, if you continue on with this PCA series. All right, have a great rest of your day. Bye.
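As a postscript, here is roughly what that hand calculation looks like in code: a sketch reusing the X and pca objects from the earlier sketches, with the transformed coordinates computed as scalar projections onto the component vectors.

```python
import numpy as np

# Center the data using the column means stored by the fitted PCA object.
X_centered = X - pca.mean_

# Scalar projections of each centered point onto w1 and w2
# (the rows of pca.components_).
manual_transform = X_centered @ pca.components_.T

# sklearn's transform should agree with this up to small
# floating point differences.
sklearn_transform = pca.transform(X)
print(np.allclose(manual_transform, sklearn_transform))
```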