Principal Components Analysis II
Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, welcome back. This is our second video in a series of videos on PCA. Let me go ahead and share my Jupyter notebook and we'll pick up where we left off in PCA I. In PCA I we introduced the idea behind PCA: we talked about the heuristic behind how we get these vectors and then project onto them to get the transformed data, we talked about the math behind it, and then we showed how the math is implemented by hand. For each observation, we find the vectors that maximize the variance, and then, to get the PCA-transformed version, we project onto them with the scalar projection (because these are unit vectors). That gives the PCA-transformed data.

OK, so now we're going to talk about explained variance. If you have taken a break between watching PCA I and PCA II, make sure you have rerun all of the code up to this point before continuing on. For each weight vector w, we call the variance of Xw the explained variance due to the principal component w. The projection Xw gives a vector of data points, and we can compute its variance; that variance is what we call the explained variance due to the principal component w. We think of it, in some sense, as the variance of X, the original data, explained by the direction w. In sklearn this is accessible with explained_variance_. So, for instance, here's pca.explained_variance_. ... We had two components: the first had an explained variance of 80.45 and the second 4.25. It can also be useful to get what's known as the explained variance ratio, which turns each explained variance into a fraction: add the explained variances together to get the total variance of the data, and then, for instance, 80.45 is about 95% of the variance in the data while 4.25 is about 5%.

To understand why this is useful, we're going to look at a data set called Labeled Faces in the Wild. This data set has a bunch of images of people, stored as 87 by 65 pixel grayscale images. You can find the original data set at this link, but sklearn has a nice version of it. If this is your first time running this chunk of code (the sklearn.datasets import), it might take a little bit of time, so feel free to pause the video while the data is loading if you're trying to go through this step by step. What we're going to do first is show the data. ... So here is an example of the data: 1, 2, 3, 4, 5... here are ten faces. These are supposed to be notable figures; the data set uses publicly available images zoomed in on the faces, and this is a grayscale version. So here's Winona Ryder (if you've seen Stranger Things, I think that's the most recent thing she's been in), here's Colin Powell, and I can't say I necessarily know the rest of these people, but these are the types of images in this data set. We're going to use them to demonstrate the explained variance features of PCA. Each of these, as I said, is a grayscale pixelated image with 87 by 65 pixels. Now, 87 by 65 is 5,655 features.
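If it helps to see this step in code, here is a minimal sketch of loading and displaying the faces with sklearn's fetch_lfw_people. The arguments below (min_faces_per_person, resize) are assumptions chosen to roughly match the 87 by 65 images described in the video, not necessarily the notebook's exact call.

```python
# A sketch of loading and displaying Labeled Faces in the Wild with sklearn.
# The fetch_lfw_people arguments are guesses chosen to roughly match the
# 87 x 65 images described in the video; the notebook's call may differ.
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people

faces = fetch_lfw_people(min_faces_per_person=50, resize=0.7)

print(faces.images.shape)  # (n_samples, 87, 65) with these settings
print(faces.data.shape)    # (n_samples, 5655), one column per pixel

# Display the first ten faces with their names.
fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for ax, img, target in zip(axes.ravel(), faces.images, faces.target):
    ax.imshow(img, cmap="gray")
    ax.set_title(faces.target_names[target], fontsize=8)
    ax.axis("off")
plt.tight_layout()
plt.show()
```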
Each feature is a pixel in the image that goes from 0, which I believe is either totally black or totally white (I can't remember which; this corner here would be a 0), all the way up to 255, the maximum possible value. So this is a lot of features. If we wanted to do something like build a classifier using this data, some of the methods we have learned in this course might struggle to run in a timely fashion with this much data, particularly if you're just using it as an opportunity to learn how the models work. So it might be of interest to compress the number of features we need, that is, to do a dimension reduction. How can we decide what a good number is, though? Do we go down to 10? Down to 100? What about 1,000 dimensions? What's the correct number of dimensions to reduce down to? Well, that's where the explained variance ratio can be helpful.

So we're going to apply PCA to this Labeled Faces in the Wild data set. We're first going to scale the data: instead of a standard scaler, we're going to do what's called min-max scaling. This min-max scaling sends the minimum value, 0, to 0 and the maximum value, 255, to 1, and to do that we just divide by 255. This is a very standard approach when you have grayscale images: you scale by the maximum value of the grayscale, and then everything goes from 0 to 1. Now we're going to go through and fit. If we don't specify the number of components, that is, the number of dimensions we want to reduce down to, PCA just fits the maximum possible number of components. ... OK, this might take a little bit, but once it runs it'll print out the shape of the components, which tells us how many components we have, that is, what the dimension of the final data set is. That's this first entry: we get 3,023 as the number of dimensions, I believe... or it could be 5,655. Actually, never mind, I think it's 5,655. OK.

The explained variance curve is a way to look at how much additional variance you get by adding one more component, one more dimension. Here's what I mean. In this curve we plot the number of principal components on the horizontal axis against something called the cumulative explained variance ratio. What is the cumulative explained variance ratio? It's this highlighted part: you call NumPy's cumulative sum function and pass in the explained variance ratio. Remember, pca.explained_variance_ratio_ is the array of the fraction of the total variance that each component adds, so by taking a cumulative sum you get the total fraction of the variance accounted for by the number of components you have. For instance, here, with 100 components it looks like we get about 90% of the original variance in the data set. So this is a way to look at a curve of the cumulative explained variance ratio and see how it changes as we increase the number of dimensions we reduce down to.
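Continuing that sketch, here is a hedged version of the steps just described: min-max scaling, fitting PCA with the maximum number of components, and plotting the cumulative explained variance ratio. It assumes the faces variable from the loading sketch above; the plotting details are illustrative rather than a copy of the notebook.

```python
# Continuing the sketch above: min-max scale the pixels, fit PCA with the
# maximum number of components, and plot the cumulative explained variance ratio.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X = faces.data

# Min-max scaling as described in the video: grayscale pixels run from 0 to 255,
# so dividing by the maximum maps them onto [0, 1]. (Some versions of the sklearn
# loader already return values on [0, 1]; dividing by X.max() covers both cases.)
X_scaled = X / X.max()

# With no n_components argument, PCA keeps the maximum possible number of
# components, min(n_samples, n_features).
pca = PCA()
pca.fit(X_scaled)
print(pca.components_.shape)
print(pca.explained_variance_[:5])        # explained variance of the first few components
print(pca.explained_variance_ratio_[:5])  # the same values as fractions of the total variance

# Cumulative explained variance ratio: the fraction of the total variance
# captured by the first k components, for each k.
cum_ratio = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cum_ratio) + 1), cum_ratio)
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance ratio")
plt.show()
```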
Typically, the way we use this is to look for what's known as an elbow in the curve. It's called an elbow because, if you look at my little image in the upper right-hand corner, the curve kind of bends like an elbow. Essentially, we're looking for the point where adding an additional dimension provides diminishing returns, where it might be more work for the algorithm to compute that dimension than the additional variance is worth. For instance, here it looks like an elbow occurs around 100: that's where the curve goes from a rapid increase to a much slower one. So maybe 100 is a reasonable number. That's how we use the cumulative explained variance ratio: we made the plot, looked through it, and settled on 100.

Another way you could do it is to say: I want to make sure that however many dimensions I compress down to, they preserve some fixed fraction of the original variance. You can do that with PCA in sklearn by setting the number of components not to an integer but to a fraction. If we set it to 0.95, this gives us the number of components needed to capture at least 95% of the original variance. Again, this will take a little bit, but we'll look at the shape of the explained variance ratio to tell us how many components this took. Here we can see it took 205 components to get 95% of the original variance.

Another way of thinking about explained variance, and about how many components we need to project down to, is to think of PCA as a way to compress the data. What do we mean by that? One neat way to see it is to think of the PCA components as a way to reconstruct the original data set. A quick reminder from linear algebra: if you're in R^2 and you have two vectors u and v that are perpendicular to one another, then any vector x in R^2 is equal to the projection of x onto u plus the projection of x onto v, because u and v are perpendicular. Essentially, what this says is: think of your horizontal axis, and here's a vector represented by my pencil (here we go, I'll stick it in my watch). The vector from here to the tip of my pencil is given by how much we have to travel horizontally plus how much we have to travel vertically. The projection onto my thumb would be the projection onto u, the projection onto my pointer finger would be the projection onto v, and my pencil is the sum of those two projections. This idea extends to more dimensions.

So the idea for PCA is this: if we have some observation x*, then for a principal component vector w_l we know, from what we derived above, that the projection of x* onto the l-th principal component is given by (x* · w_l) w_l. But remember, because w_l is a unit vector, the scalar x* · w_l is just equal to... let me remind myself of my own notation... it's just equal to x̃*_l, where x̃*_l denotes the l-th principal value for the observation x*. OK.
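Here is a short sketch of both ideas just described: passing a fraction as n_components so sklearn keeps enough components for at least 95% of the variance, and checking the projection identity numerically. It assumes the X_scaled array from the earlier sketch; the choice of observation to check is arbitrary.

```python
# A sketch of asking sklearn's PCA for enough components to keep at least 95% of
# the variance, plus a numerical check of the projection identity described above.
# X_scaled is the scaled face data from the earlier sketch.
import numpy as np
from sklearn.decomposition import PCA

pca_95 = PCA(n_components=0.95)
X_pca = pca_95.fit_transform(X_scaled)

print(pca_95.n_components_)                    # number of components needed for 95%
print(pca_95.explained_variance_ratio_.sum())  # should be at least 0.95

# Projection identity: for an observation x_star, the scalar x_star . w_l (after
# centering, since sklearn's PCA subtracts the mean) is exactly the l-th principal
# value, i.e. the l-th entry of the transformed observation.
x_star = X_scaled[0]
w_1 = pca_95.components_[0]
print(np.isclose((x_star - pca_95.mean_) @ w_1, X_pca[0, 0]))
```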
So you find the principal value and multiply it by the principal component vector, and that is the projection of x* onto that vector. You can then reconstruct, or rather approximate, x* by summing over the first L principal components: x* ≈ x̃*_1 w_1 + x̃*_2 w_2 + ... + x̃*_L w_L, where I'm taking L to be the number of principal components we keep. So you can think of it like this: when we compress down to L dimensions, this sum is an approximation of the original x*. The nice feature of this image data set is that we can use this to judge how good a given number of dimensions is by looking at the images it produces. So what we're going to do, once this is done fitting, is plot the original image on the left and then show how the image changes as we increase the number of dimensions we project down to. This dot product that we do right here is what gives us those reconstructions. ...

OK. Maybe I can zoom out so you can see all four of them, and I'll move my little image over here for now. We can see here that this is the original image, and with 10 components, which is about 61% of the original variance, it does not look at all like Winona Ryder; it kind of looks like a horrific monster, same with some of these other ones. But as we start to increase the number of components, until we get to, say, about 90%, you can start to see some of Winona Ryder, some of Néstor Kirchner (if I'm reading this correctly), Tony Blair, and David Beckham. And then if I go to 500, it actually looks pretty close to the original image. So with 500 components we get relatively close to the original image. You could use this in practice: say we're working on a project where we're trying to predict which one of these people an image shows. We could then see why there might be such a huge difference between a model's performance with 100 components and its performance with 500 components, because these reconstructions look much more like the original images than those ones do.

All right, so that's it for this video, where we've talked about the explained variance ratio to help us decide how many components to project down to. We also learned about this reconstruction approximation, which can give us a sense of how many components we should project down to in cases where we can make sense of it, like with image data. We're going to have one more PCA video, where we show you how to interpret the output of PCA in a way that might be useful for explaining to others what's going on. So I hope you enjoyed learning more about PCA. I enjoyed teaching you about PCA, and I hope to see you in the next video. Bye.
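As a companion to the reconstruction comparison walked through above, here is a sketch of rebuilding one face from its first L principal components and plotting it next to the original. It reuses the faces, X_scaled, and full pca objects from the earlier sketches; the face index and the component counts (10, 100, 500) are illustrative choices.

```python
# A sketch of the reconstruction comparison: rebuild a face from its first L
# principal components and compare with the original. Reuses the full PCA fit
# (pca), the scaled data (X_scaled), and faces from the earlier sketches.
import matplotlib.pyplot as plt

i = 0                               # which face to reconstruct (arbitrary choice)
h, w = faces.images.shape[1:3]      # image height and width (87 x 65 here)
X_full = pca.transform(X_scaled)    # principal values for every observation

fig, axes = plt.subplots(1, 4, figsize=(12, 4))
axes[0].imshow(X_scaled[i].reshape(h, w), cmap="gray")
axes[0].set_title("Original")

for ax, L in zip(axes[1:], [10, 100, 500]):
    # x ~ mean + sum over the first L components of (principal value_l) * w_l
    recon = pca.mean_ + X_full[i, :L] @ pca.components_[:L]
    ax.imshow(recon.reshape(h, w), cmap="gray")
    ax.set_title(f"{L} components")

for ax in axes:
    ax.axis("off")
plt.show()
```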