Principal Components Analysis III Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video, we finish our brief series on PCA with PCA III. In the previous two videos, we talked about the heuristics and mathematics behind the PCA dimension reduction technique, and in the second video we looked at the explained variance ratio and the cumulative explained variance curve, as well as a way to think of PCA as compressing the data and approximating the original version of the data.
And now, in this last video, we'll talk about interpreting PCA: how you can take the output of a PCA fit and try to interpret it, using some basketball data. If you took a break between PCA II and PCA III, or if you skipped PCA II and went straight from PCA I to PCA III, go back and rerun all of the code chunks, or make sure all the code chunks were run, from the beginning of the notebook down to the Interpreting PCA section.
And then we will be ready to go. OK. So to show you how you can use the component vectors, w, to interpret PCA, we're going to use this data set on basketball shot distributions. In particular, this data has shot distributions from the 2000-2001 NBA season and the 2018-2019 NBA season, where each team for a particular season has its distribution of total shots taken.
So remember basketball: you've got your ball, you've got to shoot it, and you want to get it in the hoop. A team can shoot from one of these 15 zones on the basketball court, where the hoop is here in zone one, or maybe here in zone two. I think it's in zone one. So the idea in this data set is that you have a bunch of different teams for these two seasons, and it gives you a percentage breakdown of where all their shots came from.
So for instance, if zone 13 had a value of 0.20, that would mean 20% of their total shots taken over that season came from zone 13. This data is stored in the NBA teams shots CSV file in the data folder, and here's what it looks like. We have a row for each team for each season, and these first two rows are the Atlanta Hawks. We can see that in 2000-2001, 27% of their shots were from zone one, this little circle down here, whereas in 2018-2019, 36% of their shots were from zone one.
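To make the meaning of a row concrete, here is a minimal sketch of how raw shot counts turn into the kind of shot distribution described above. The counts are made up for illustration and are not taken from the lecture's data file.

```python
import numpy as np

# Hypothetical shot counts for one team-season across the 15 court zones
# (invented numbers, not from the lecture's CSV file).
shot_counts = np.array([1800, 300, 250, 280, 240, 200, 180, 190, 210, 400,
                        350, 500, 1400, 450, 250])

# Convert counts to a shot distribution: each entry is the fraction of the
# team's total shots that came from that zone, so the entries sum to 1.
shot_distribution = shot_counts / shot_counts.sum()

# Zone 13 sits at index 12 because Python indexing starts at 0.
zone_13_share = shot_distribution[12]
print(f"Zone 13 share of total shots: {zone_13_share:.2f}")
```

With these invented counts, zone 13 accounts for 1400 of 7000 shots, which is the 0.20 value used in the example above.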
So I think it's completely conceivable that somebody might want to see if they could understand the changes in shot distribution from 2000-2001 to 2018-2019 by using PCA. It's 15 variables, which is not as bad as our Labeled Faces in the Wild data set in terms of algorithms, compute times, or memory storage.
But it is still somewhat difficult to understand what's going on simultaneously in 15 variables. Maybe there's a way we could get one or two variables that would be more informative to the average basketball stats viewer. And so the idea here is we're going to run this data through PCA to see if that's possible.
So this is an example where we're going to standard scale the data first. We import our standard scaler. I'm going to make the scaler, then the PCA. Then I'm going to scale my data, fit the PCA, and get the transformed data. OK. So this should be old hat for us by PCA III. Now, let's go ahead and plot the results. This is the first PCA direction on the horizontal axis.
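The scale-then-fit steps just described can be sketched roughly as follows. This is not the notebook's code: the data here is a synthetic stand-in (random shot distributions), the variable names are my own, and keeping two components is an assumption based on the two-axis plot.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the shot data: 60 team-seasons, 15 zones,
# each row a random distribution that sums to 1 (not the real NBA data).
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(15), size=60)

# Make the scaler, then the PCA (two components is an assumption here).
scaler = StandardScaler()
pca = PCA(n_components=2)

# Scale the data, then fit the PCA and get the transformed data.
X_scaled = scaler.fit_transform(X)
fit = pca.fit_transform(X_scaled)

print(fit.shape)              # one row per team-season, one column per component
print(pca.components_.shape)  # one component vector per row, one entry per zone
```

Each column of `fit` is one axis of the scatter plot, and each row of `pca.components_` is one of the component vectors used for interpretation later.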
The vertical axis is the second PCA direction, and you see this blob. You take this blob and you go show it to your favorite NBA stat head or your favorite NBA watcher. Maybe you have a mom or a dad or a parent who's really into the NBA, and you go show this to them. You're really excited, and they're like, well, what the heck does that mean? What is first PCA dimension, second PCA dimension?
I don't understand any of this. Well, that's where this interpretation stuff is going to be nice. We don't know what this means with respect to actual basketball yet, but we can use the component vectors to try to figure this out. If we remember from our mathematical formulation, for the component vector w, with entries w_1 all the way through w_m and the constraint that it was a unit vector, we wanted to maximize that variance, right?
And so as we look at this, we can see that the larger in magnitude an individual component of w is, the more impact a particular feature in the feature matrix has. For instance, if w_1 is super large, that means X_1 has a bigger impact on this variance. OK. And so the idea here is we can look at our component vectors and see which entries are largest in magnitude, positive or negative, to help us see which features have the biggest impact on the variance.
And then we can use that as a proxy for what the PCA is picking up on or detecting, to help us see what was most important in increasing the variance for that particular component. The idea here is that for large values of w_i, if you were to take an observation and increase the corresponding value of X_i, you would move its placement on the chart: for the first PCA component, either right or left, or for the second PCA component, either up or down.
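Here is a small sketch of that last point. An observation's position along a component is the dot product of w with the (centered) observation, so bumping feature i by some amount delta moves the observation along that component by delta times w_i. The data is synthetic, and the component vector comes from a plain NumPy SVD as a stand-in for a fitted PCA.

```python
import numpy as np

# Synthetic data with four features of very different spread (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) * np.array([3.0, 1.0, 0.5, 0.2])
X_centered = X - X.mean(axis=0)

# The rows of Vt are unit-length component vectors; the first row is w for
# the first principal component (up to sign).
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
w1 = Vt[0]

# Take one observation and bump the feature with the largest-magnitude weight.
x = X_centered[0]
i = int(np.argmax(np.abs(w1)))
delta = 2.0
x_bumped = x.copy()
x_bumped[i] += delta

# The change in the first-component score equals delta * w1[i].
shift = x_bumped @ w1 - x @ w1
print(f"shift along component one: {shift:+.4f}")
print(f"delta * w1[i]:             {delta * w1[i]:+.4f}")
```

Bumping any other feature by the same delta would move the point by no more than this, which is why the largest-magnitude entries of w are the ones worth looking at first.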
So here are the component vectors, and here's what they look like. These are sorted from most negative to most positive component one value. But this might be hard to interpret. You go show this to your parent, who is not a very data person but is maybe a big basketball fan, and they're still not going to understand it.
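A table like the one just described could be built along these lines. The component values below are random stand-ins for a fitted 2 x 15 component matrix, and the zone labels are hypothetical names, so only the sorting logic carries over.

```python
import numpy as np

# Stand-in for a fitted 2 x 15 component matrix: two orthonormal rows built
# from a random matrix (not the lecture's actual component vectors).
rng = np.random.default_rng(2)
n_zones = 15
Q, _ = np.linalg.qr(rng.normal(size=(n_zones, n_zones)))
components = Q[:, :2].T

# Hypothetical feature names, one per court zone.
zone_names = [f"zone_{k}" for k in range(1, n_zones + 1)]

# Sort the zones from most negative to most positive component one value.
order = np.argsort(components[0])
sorted_rows = [(zone_names[k], components[0, k], components[1, k]) for k in order]

print(f"{'zone':>8}  {'comp 1':>7}  {'comp 2':>7}")
for name, c1, c2 in sorted_rows:
    print(f"{name:>8}  {c1:+7.3f}  {c2:+7.3f}")
```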
So sometimes it's useful to make heat maps. And in this instance, we have this really nice feature that all the variables correspond to an actual physical zone in a graphic that you can show them. This was made by a former boot camp member and current Erdős Institute member, Patrick Valley. If you click here, you can go to his LinkedIn.
He made these nice graphics for us off of an older version of the notebook, where we just had the colors on a bar heat map. Here's what they look like in the basketball context. The scale goes from about negative 0.3 at the bottom up to a little bit above 0.6 at the top, and the graphics are segmented off as first component, second component.
So in the first component, we can see that the areas that are the most positive, the ones with the brightest orange or yellow, are regions 10, 11, 12, 13 and 14, and then also maybe region one. We could see that from the chart as well, but maybe it's easier to see with this visualization.
So if you're familiar with basketball, you might understand what this means. If you're not as familiar with basketball: these regions that are colored here in orange, and the slightly salmony color for zone one, if that's the right color, are the regions that stand out if you compute the expected points scored per shot for each region of the court.
If you look at the work of Kirk Goldsberry, these are the regions where, if you take a shot, your expected value is greater than one point, whereas in these other regions that are dark purple or dark blue, your expectation is less than one. OK. And so essentially basketball experts, statisticians, and data people have shown that it's much more lucrative, from a points-getting perspective, to shoot from these regions out here, which are worth three points, or from this region, which is right under the hoop.
These are the more efficient zones to shoot from. And this is really more of a recent trend. If you were to color these points by the season they occurred in, all of the points over here to the left occurred in 2000-2001, and all of the points over here on the right occurred in 2018-2019.
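To make the expected-value argument concrete, here is a tiny worked example. The expected points for a shot is the shot's point value times the probability of making it. The make probabilities below are illustrative round numbers that I chose, not figures from the lecture or from Goldsberry's work.

```python
def expected_points(shot_value, make_probability):
    """Expected points for one shot: point value times chance of making it."""
    return shot_value * make_probability

# Illustrative (made-up) make probabilities for three kinds of shots.
three_pointer = expected_points(3, 0.36)  # shot from beyond the arc
at_the_rim = expected_points(2, 0.60)     # shot right under the hoop
midrange = expected_points(2, 0.40)       # long two-point shot

print(f"three-pointer: {three_pointer:.2f} expected points")
print(f"at the rim:    {at_the_rim:.2f} expected points")
print(f"midrange:      {midrange:.2f} expected points")
```

Under these assumed percentages, the three-pointer and the shot at the rim are each worth more than one expected point, while the midrange shot is worth less than one, which is the pattern the heat map colors are pointing at.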
And so what was going on is that people started to realize that it's much more efficient to shoot threes. Over time, they started shooting more from outside, in these orangeish regions, and less from what's known as the midrange, these dark purple regions. OK. And then the second component looks like it's picking up on whether or not a team shoots more from this region here, which is known as the paint.
OK. So this is a way that we can interpret this. For instance, suppose I had a team that I did not run through this, but that I would like to project onto the PCA. You could say that if that hypothetical team took 100% of their shots from zone 12, they would be to the very far right on this chart. And then, looking at this, maybe slightly negative on the second component.
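Projecting a team that was not part of the fit can be sketched as below: reuse the already-fitted scaler and PCA and call `transform` on the new row. The training data here is synthetic, so where the hypothetical team lands in this sketch says nothing about where it would land on the lecture's chart.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the shot data used to fit the scaler and the PCA.
rng = np.random.default_rng(3)
X = rng.dirichlet(np.ones(15), size=60)

scaler = StandardScaler().fit(X)
pca = PCA(n_components=2).fit(scaler.transform(X))

# Hypothetical team that takes 100% of its shots from zone 12
# (index 11, since Python indexing starts at 0).
new_team = np.zeros((1, 15))
new_team[0, 11] = 1.0

# Apply the same scaling, then project onto the fitted components.
new_team_scores = pca.transform(scaler.transform(new_team))
print(new_team_scores)
```

The first entry of `new_team_scores` is the team's horizontal position on the chart and the second is its vertical position. Note that the scaler is only applied here, not refit, so the new team is placed in the same coordinates as the original points.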
So someone who hypothetically takes all of their shots from zone 12 might be down here. OK. So that's a nice feature of PCA: we can interpret the components. If you're really interested in learning more about this particular data project, I did write a blog post about this years ago, and I've put a link to that blog post here. I also have a lot of nice references that are PCA-specific; any number of these go through PCA material.
And then this University of Waterloo Matrix Cookbook goes through and shows you how to do derivatives with matrices, which I find useful. OK. So I hope you enjoyed learning about how you might interpret PCA. And whether or not you're a basketball fan, I hope you found this implementation of PCA, and how to interpret it, interesting.
All right. So that's it for PCA. I hope you enjoyed learning about it, and I hope to see you in our next video. Have a great rest of your day. Bye.