t-SNE Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we'll learn about another dimension reduction technique called t-SNE, which stands for t-distributed stochastic neighbor embedding. Dimension reduction is sometimes performed mainly for data visualization purposes, and that is the case with t-SNE. So in this notebook we'll introduce t-SNE, I'll try to explain the mathematics behind it, which is slightly more dense than with PCA, and then we'll show you how to implement it in sklearn. I'm also going to go through and clear my kernel so that we start with a fresh notebook.

t-distributed stochastic neighbor embedding, or t-SNE as I'm going to call it from now on, reduces the dimension of a set of m features X, typically down to two or maybe three dimensions for the purposes of data visualization. A primary goal of t-SNE is to ensure that points that are close to one another in the higher dimensional space remain close to one another in the lower dimensional projection. That is the main thing we're trying to preserve, so much so that points that weren't particularly close to one another in the higher dimensional space may end up very far apart in the lower dimensional space. That is really the whole idea behind it.

Now we'll give you a sense of the math behind it. We're not going to derive anything the way we did in the PCA video, if you watched that, but we will give the basic outline of how the math works. If you're interested in learning more in depth, I have a link here to the original t-SNE paper by Laurens van der Maaten and Geoffrey Hinton (hopefully I said those names correctly).

The idea for the math behind t-SNE is that you consider each pair of points. In this setup, little x denotes points in the original data space, so think of them as rows of the feature matrix X, and little y denotes points in the lower dimensional space. So x_i is the ith observation in the original data set, the ith row of the matrix X, and y_i is its transformed version, the version of observation i that has been projected down to two or three dimensions.

For any pair of points x_i, x_j in the original data space, you take their Euclidean distance (the distance between points you learned in high school, or maybe even elementary school) and turn it into a conditional probability p_{j|i}. The idea is that you imagine a Gaussian, that is, a regular normal distribution, centered at x_i, and you compare the Gaussian weight assigned to x_j against the sum of the Gaussian weights assigned to all the other points. In the precise mathematical formula, the numerator is the density function of a normal distribution and the denominator is a sum of such density functions. In essence, we think of p_{j|i} as the probability that x_i would choose x_j as its neighbor.
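For reference, here is that conditional probability written out. This is my reconstruction of the formula shown on screen, matching the form given in the van der Maaten and Hinton paper, where sigma_i is a bandwidth attached to the point x_i (more on how it gets chosen in a moment):

$$
p_{j \mid i} \;=\; \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}
$$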
So the closer x_j is to x_i, the larger p_{j|i} is, and the further x_j is from x_i, the smaller p_{j|i} is. For notation's sake, we take p_{i|i} to be equal to zero.

Then, as I was saying at the beginning, for every point x_i in the higher dimensional space there has to be a lower dimensional counterpart y_i that x_i gets mapped to. So whatever process we come up with, x_i has to get pushed down to y_i. In a similar fashion to p_{j|i}, we have q_{j|i}, the probability that y_i chooses y_j as a neighbor, and the paper sets up a precise mathematical formula for it. The reason this is called t-distributed stochastic neighbor embedding is that these distributions, the numerators and denominators here, come from the density function of a Student's t-distribution, which is described at this Wikipedia link.

The idea is that if a pair of points x_i and x_j would have a high probability of considering themselves neighbors in the original data space, you want to choose the y's in such a way that q_{j|i} is also a high value. To make sure that this is the case, that whenever p_{j|i} is large, q_{j|i} is large as well, the algorithm uses a form of gradient descent to minimize a cost function that measures the difference between these two distributions, namely the Kullback-Leibler divergence. The optimal y_i's, the ones for which the q_{j|i}'s end up as close as they can possibly get to the p_{j|i}'s, are the points spit out by the algorithm; these are the projections down into the lower dimensional space. (I'll write these formulas out below, right before we get to the demo.)

Now, the sigma_i from the earlier formula is determined by a hyperparameter of t-SNE called the perplexity. Smaller perplexity values tend to focus on more local behavior, so closer neighbors, while larger perplexity values place a greater focus on more global behavior. In their paper, van der Maaten and Hinton suggest that values from 5 to 50 tend to work well, but the perplexity value you choose can impact the results of t-SNE greatly. We'll touch on this more thoroughly in the corresponding practice problems notebook: if you go to the practice problems folder, there is a problem in the t-SNE notebook that goes into the effect perplexity has on the output of the algorithm in more detail.

We're going to demonstrate how t-SNE is implemented in sklearn by performing it on something called the MNIST data set; there's a link about it here. The MNIST data set is a set of pixelated hand-drawn images of the digits 0 through 9. They were hand drawn by people, then scanned and pixelated. In the sklearn version we're going to look at, these are eight by eight grayscale images. So, for instance, here's an eight by eight grayscale image of a zero, and if we chose a different one, say index 89, we get what I believe is a three. That's what this data set looks like.
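Before we get to the demo, here for reference are the other two formulas just described, again my reconstruction matching the description above. (In the paper these are actually written in symmetrized, joint form as p_{ij} and q_{ij}, but the structure is the same.) The low-dimensional affinity uses a Student's t kernel with one degree of freedom,

$$
q_{j \mid i} \;=\; \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq i} \left(1 + \lVert y_i - y_k \rVert^2\right)^{-1}},
$$

and the y's are chosen by gradient descent to minimize the Kullback-Leibler divergence between the two sets of probabilities,

$$
C \;=\; \sum_i \mathrm{KL}\left(P_i \,\Vert\, Q_i\right) \;=\; \sum_i \sum_j p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}}.
$$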
So what we're going to do with this digits data set is essentially compare t-SNE to PCA, to give you a reference point for what t-SNE does. The idea is that this is a somewhat high dimensional data set, at least high in the sense that we can't directly visualize a data set with 64 features. We're going to run it through t-SNE, and what we're hoping t-SNE does well is group together the digits that are similar looking. For instance, if all the zeros look roughly the same, we should expect to see a cluster of zeros in two dimensions.

For comparison purposes, we'll run the data through PCA first and show you what you get there. This image here is what you get with PCA, and each of the different markers and colors represents a different digit. For instance, the zeros are these blue circles up here. For the most part a lot of the points do tend to be close to one another, but there's a ton of overlap; at the ones, for instance, it's visually hard to see anything, and there's a big blur of digits in the middle that are maybe similar looking to one another. Also, suppose we wanted to use this for some kind of prediction algorithm, say to help the postal service read envelopes. This wouldn't be great. Granted, we would probably use more than two dimensions in that setting, but for visualization purposes it is difficult to distinguish the different types of digits using PCA.

Now we can do this with t-SNE using the TSNE object from sklearn, which is stored in the manifold subpackage. So here we import it: from sklearn.manifold import TSNE. We can make a TSNE object as follows. We call TSNE, and the first argument we put in is the number of components, which we're going to set equal to two. Then we have to input an argument called init, which stands for initialization. For the algorithm to work, you have to have an initial guess as a starting point for the gradient descent, and setting init="random" makes that starting point randomly generated; the gradient descent then runs from there. We also need learning_rate="auto". Gradient descent has something called the learning rate, and setting learning_rate="auto" lets the algorithm choose a learning rate automatically rather than you feeding it one.

So now we've set up a TSNE object, and we can use it in much the same way as PCA or a StandardScaler: you call fit_transform. The fit step determines the p's and the q's, and the transform step spits out the projected data.

Looking at the result, we can see that the grouping of the digits is much better in comparison to PCA. We have our distinct zero cloud, our distinct four cloud (these are the purple stars), and our distinct two cloud (the green triangles). Now, we might also notice that there is still some intermingling: there are some nines with the threes, some nines with the eights, some nines with the fives, and even a separate little group of nines over here, far away from the bigger cluster of nines. There might be reasons for this, which I'll come back to right after the code sketch below.
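Here is a minimal sketch of roughly the code used in this comparison, assuming the sklearn digits data; the variable names and the plotting details are my own choices, not necessarily the ones in the actual notebook.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

digits = load_digits()  # hand-drawn digits as 8x8 grayscale images, flattened to 64 features

# PCA projection down to 2 dimensions, for comparison
pca_proj = PCA(n_components=2).fit_transform(digits.data)

# t-SNE projection down to 2 dimensions, with a random initialization
# and an automatically chosen learning rate
tsne = TSNE(n_components=2, init="random", learning_rate="auto")
tsne_proj = tsne.fit_transform(digits.data)

# Plot the two projections side by side, colored by the true digit label
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
ax1.scatter(pca_proj[:, 0], pca_proj[:, 1], c=digits.target, cmap="tab10", s=10)
ax1.set_title("PCA")
ax2.scatter(tsne_proj[:, 0], tsne_proj[:, 1], c=digits.target, cmap="tab10", s=10)
ax2.set_title("t-SNE")
plt.show()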
So why the rogue nines? People don't always draw nines the same way, and maybe these particular nines look closer to the typical five or the typical eight than to what seems to be the common pattern for a nine. You could go through, look for these observations that seem to be rogue digits, and see what's going on with them; you might find that they really do look closer to a five than to your standard nine. Or it's possible that the data set is mislabeled and these actually are fives mislabeled as nines.

Now that we've seen how cool t-SNE is (I think t-SNE is one of those algorithms that gives you really nice images to look at), we're going to see some of its disadvantages: reasons you might want to do something other than t-SNE, or reasons to be careful interpreting these cool images.

So here are some disadvantages of t-SNE. First, the S stands for stochastic, and stochastic essentially means random, which means your results will not always be the same. For instance, in this loop I go through five times and run the t-SNE algorithm on the same data set, and you'll see that you get different results each time. You get roughly the same clusters, but their positions in the space are not the same. For instance, the green cluster, which I believe is the twos, lands in a somewhat different location every time. You can set a random state to ensure that you get the same output each time you run it.

Second, you can't use t-SNE to make predictions on new data, because unlike PCA there's no procedure to directly map new points into the lower dimensional space using the fitted t-SNE. For instance, if someone then gave me a newly hand-drawn five, pixelated in the same way, I couldn't run it through the t-SNE we just fit and get a point for that five in the lower dimensional space. There is a paper from van der Maaten that might be useful if you're interested in seeing whether there's a way to do that, but as it stands, in the way it's presented here, the original t-SNE was not set up to do such a thing.

Third, you shouldn't make any inferences about the distance between clusters. For instance, this group of twos being so far away from this group of fours doesn't necessarily mean the twos are especially different from the fours; it's just the way it worked out in the algorithm.

And as a general warning, t-SNE results shouldn't be used as statistical evidence or proof of anything. These are usually just nice visualizations, and they can be the result of the randomness in the algorithm. Why? Well, sometimes t-SNE can produce clusters on data that are not actually clustered in the original data space, and usually that's the result of an ill-suited perplexity value. One way to try to be robust against this is to run your data through t-SNE a few different times, trying different perplexities along the way, and see which clusters tend to persist versus which clusters are only there for a small handful of perplexity values and absent for most others.
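Here is a minimal sketch of that kind of robustness check, again assuming the sklearn digits data from the earlier sketch; the particular perplexity values are just illustrative choices, not recommendations.

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

digits = load_digits()

# Re-run t-SNE with a few different perplexity values and compare the pictures;
# clusters that only appear for one perplexity value deserve some suspicion.
perplexities = [5, 30, 50]
fig, axes = plt.subplots(1, len(perplexities), figsize=(18, 6))

for ax, perp in zip(axes, perplexities):
    proj = TSNE(n_components=2, perplexity=perp,
                init="random", learning_rate="auto").fit_transform(digits.data)
    ax.scatter(proj[:, 0], proj[:, 1], c=digits.target, cmap="tab10", s=10)
    ax.set_title(f"perplexity = {perp}")

plt.show()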
One of the two researchers whose names are on the original t-SNE paper also has a very useful frequently asked questions post on his website, which I've linked to here; it may be useful if you're interested in learning more about t-SNE.

So in this notebook we learned about t-SNE and showed how good it can be for making fun cluster images and for data visualization. We also talked about some of the limitations of t-SNE as an algorithm for other purposes. I hope you enjoyed learning about t-SNE. I do think it is a very cool algorithm for visualizations, and it can sometimes give you a sense of clusterings that may exist in the data set you're looking at, which can feed into another unsupervised learning task called clustering, which we learn about in other videos and lecture notebooks in this content. So I hope you enjoyed learning about this, I enjoyed teaching you about it, and I hope to see you in the next video. Bye.