K-Means Clustering Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi everybody. In this video we're going to learn a clustering algorithm known as k-means clustering. Now, while we still have a little k here, it's actually not at all related to k-nearest neighbors; this method doesn't have anything to do with k-nearest neighbors other than that they both have the letter k in them. In this notebook specifically, we'll lay out the idea of how the k-means clustering algorithm works, then we'll demonstrate it on some synthetic data, just because that's the easiest way to give you a sense of how the algorithm behaves, and then we'll discuss different ways you can choose the algorithm's k parameter, which determines the number of clusters that are detected.

So let's suppose we have n observations of m features, stored in a matrix X, so n rows by m columns. Maybe we suspect that this set of data points can be segmented into groups, and maybe we think it's actually k groups. The k-means algorithm gives us a way to attempt to find the groups that are in our data. With k-means clustering, the k stands for the number of clusters you're going to end up with, and you choose it ahead of time: maybe it's five, maybe it's 17, whatever k you think will work best, and we'll talk about how to choose it in a little bit. The algorithm is usually described as a four-step process, though each step hides a bit of work.

Step 1 (or step 0, even): you select k initial points, sometimes randomly, or you can specify them by hand, as a first guess at the k groups' centroids. The centroid is just the average position of all the points in a cluster. This first guess isn't actually the average position of whatever clusters we'll end up with; we just need to start somewhere.

Step 2: you group all n of your points according to which centroid is closest. That is, you compute the distance from each point to each of the k centroids, see which one is smallest, and assign the initial clustering accordingly. For instance, if x_1 is closest to centroid 2, which you randomly chose at the beginning, then x_1 is assigned to cluster 2; if x_3 is closest to centroid 4, then x_3 is assigned to cluster 4.

Step 3: you then use whatever clustering you ended up with in step 2 to recalculate the k centroids. You'll have k clusters, and within each one you compute the average value of the points: the average of the first feature, the average of the second feature, and so on. Those averages are the new centroids.

Step 4: you repeat steps 2 and 3, regrouping the points and recomputing the centroids, over and over again until you reach a point at which no observations change groups. Eventually you'll settle down to a stable clustering where regrouping no longer changes anything, and at that point you stop. I believe you can also put in a stopping criterion of simply halting after some number of iterations, but in practice you usually just run it until it converges to a stable clustering.
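To make those steps concrete, here is a minimal NumPy sketch of the loop just described. The function name, the random initialization scheme, and the convergence check are my own illustrative choices rather than code from the notebook.

```python
import numpy as np

def kmeans_by_hand(X, k, max_iter=100, seed=216):
    """Bare-bones sketch of the four-step procedure described above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k observations to serve as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)

    for _ in range(max_iter):
        # Step 2: compute the distance from every point to every centroid
        # and assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)

        # Step 4: stop once no observation changes groups.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 3: recompute each centroid as the mean of its assigned points
        # (this simple sketch assumes no cluster ever ends up empty).
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    return labels, centroids
```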
We're going to do this by hand first, just to demonstrate it; eventually, when you actually implement it in practice, you'll use sklearn. So, by hand: I have this image, where this is my data set, and I've got all these points down here. If you look at how the data was generated, we can see that we should in essence have three clusters, so for our k we're going to choose three, because I knew ahead of time that three was the correct number. We'll see later on in the notebook how you can go about choosing k when you don't know it ahead of time.

...

The first step, remember, was a random choice, so here I'm just randomly assigning three points to be my centroids. I then go through and calculate the distances from each point to each centroid: the zeroth column holds the distance from each observation to centroid 0, the next column the distance from each observation to centroid 1, and the last column the distance from each observation to centroid 2. Then I assign the clusters accordingly, so whichever centroid you're closest to, that's your cluster. And once I have the new clusters, I make a new centroid guess by taking the average value of the points within each cluster.

...

So here we have all these points; remember, this is what the data looks like. The red X's are the current centroid guesses, these three points, and all of the points in our data set are colored as blue circles, green triangles, or black plus symbols according to which centroid they're closest to. If you're a blue circle, you're closest to centroid 0, which I believe is this one; if you're a green triangle, you're closest to centroid 1, which I believe is this one; and if you're a black plus, you're closest to initial centroid 2, which is this one. So we can see here sort of where we're coming from.

...

Then, once we have these clusterings, the orange stars represent what the next centroid guess would be, because these are the average values of the points within the initial clustering. For instance, this star here is the average of all of the black pluses, this star is the average of all of the blue circles, and this star is the average of all of the green triangles. So if we go through and do another pass, we have to go through all the points again, calculate the distances to the orange stars, and then reassign our clusterings accordingly.

...

All right, so the orange stars of the previous picture now become the red X's of the second round. The second time through, this was centroid 1, this was centroid 0, and this was centroid 2. And you can see that the second guess of centroid 2 doesn't change at all from the first guess (beyond the initial one), so cluster 2 is set; the change was in cluster 0 and cluster 1. Remember, cluster 0 used to have some points over here, but now they've been reassigned to cluster 1 because they were closer to this red X. We would keep doing this process until we reach a point at which no more points would be reassigned if we kept going. I believe we have now reached that point: I don't see any points changing allegiances to different clusters now. So I believe if we did this one more time, that would be our last time through, and we would end up with the clusters that we see below.

...
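As an aside, the data for this demonstration was generated before the video. If you want a comparable synthetic data set to experiment with yourself, sklearn's make_blobs produces data in the same spirit; the exact parameters below are my own guesses, not the ones used in the notebook.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Three roughly round, well-separated clusters in two dimensions,
# similar in spirit to the data plotted above (parameters are illustrative).
X, true_labels = make_blobs(n_samples=300, centers=3,
                            cluster_std=1.0, random_state=440)

plt.scatter(X[:, 0], X[:, 1])
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()
```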
So that's how you do it by hand, and that's annoying to program yourself, so in practice you can just use sklearn's KMeans object. From sklearn.cluster you import KMeans (capital K, capital M), and just like we've made PCA objects for dimension reduction, or model objects for regression, classification, or ensemble learning, you do the same thing here: you make a KMeans object, and you just have to choose the value of k, which I'm going to choose as 3. Then we fit the model, so you call .fit and pass in the data X. And then, this is closer to classification or regression than to a transform: if you've watched the PCA or t-SNE videos, note that here you call .predict on the KMeans object rather than .transform, and that gives you the clusters, which I'll store in a variable.

OK, so the labeling has changed. That's something to be aware of with k-means: the actual label names, 0, 1, and 2, are now different from what we had above, but they still give the same clusters; each one is just called something different. We can also get the centroids of the clusters with the attribute .cluster_centers_, and we can demonstrate that as its own thing: if I call kmeans.cluster_centers_ after it's been fit, that gives us the point for each centroid. So this is centroid 0, this is centroid 1, and this is centroid 2.

...
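Here are the KMeans calls just described, gathered in one place. This is a sketch of the workflow rather than the notebook's exact code; in particular, the random_state is my addition for reproducibility.

```python
from sklearn.cluster import KMeans

# Make the KMeans object, choosing k = 3 clusters.
kmeans = KMeans(n_clusters=3, random_state=440)

# Fit on the data X, then get the cluster labels with .predict.
kmeans.fit(X)
clusters = kmeans.predict(X)

# The fitted centroids are stored in the .cluster_centers_ attribute,
# one row per centroid.
print(kmeans.cluster_centers_)
```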
So, as I've been saying since the beginning, how do you choose k? How did I know it was three? Well, I chose three because I knew it from the problem setup; it was synthetic data. But in practice, how do you tell what k should be? There are two different approaches, which is to say two different types of metrics; you can look at one or both of them, and both are in standard use for choosing the number of clusters. I'm going to load a data set X where I know what the number of clusters is supposed to be, because I generated it, but you didn't see the generation process, so you don't know: should this be three clusters, or maybe five, or maybe four? I know, but you don't. So we're going to show you two methods of trying to choose the value of k; which one you use, or whether you use both, is up to you.

The first method is similar to what we did with PCA: it's called the elbow method. What is the elbow method? You're going to have a metric, which here turns out to be called inertia (below we'll have a second metric called the silhouette score). You compute the inertia and plot it as a function of the number of clusters: you compute the outcome of different clusterings, k = 1, k = 2, k = 3, and so on, and then plot the inertia against the number of clusters you used. Then, just like with PCA, you look for the elbow in the plot, the point where you start to get diminishing returns from increasing the number of clusters.

For a given clustering with k clusters, the inertia is defined to be

inertia = Σ_{i=1}^{n} d(x_i, c^{(i)})^2,

the sum over all of your observations of the squared distance between observation x_i and the centroid of the cluster to which it is assigned. So you have k clusters, observation i has to belong to one of them, and c^{(i)} denotes the centroid of the cluster to which observation i is assigned. The distance d here could be any distance metric you like; I believe it's standard to use Euclidean distance, but it could be any other kind of distance you prefer.

In essence, you want the inertia to be low, because a low value means all of the points are relatively close to their centroids. But there's a caveat: you don't want to just choose the value of k that minimizes the inertia, because you could get an inertia equal to zero by setting k = n, so that each point is its own cluster, and that's not useful. So there's a trade-off between a low inertia and a value of k that's reasonable as a clustering. What you do is go through and compute the inertia for different values of k and, as I said, look for an elbow in the plot. So we're going to go through values of k from 1 through 11, apply k-means to the data set I'm highlighting with my mouse now, and record the inertia, which you can get from the KMeans object after it's been fit

...

by calling kmeans.inertia_. Now I'll plot the inertia on the vertical axis against k on the horizontal, and again, remember, we're looking for that elbow. From this plot it looks like you start to get diminishing returns in decreasing inertia at maybe four or five, so three, four, or five are candidates here. I might hone in on four or five, because the drops from 3 to 4 and from 4 to 5 are still sizable, while the decreases from 5 to 6 onward are much smaller. So it's probably going to be either four or five.
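For reference, the loop that produces an elbow plot like the one just described looks roughly like this; it's a sketch rather than the notebook's exact code, and the random_state is my own addition.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 12)  # k = 1 through 11
inertias = []

for k in ks:
    # Fit k-means for each candidate k and record the resulting inertia.
    kmeans = KMeans(n_clusters=k, random_state=440).fit(X)
    inertias.append(kmeans.inertia_)

plt.plot(ks, inertias, "o-")
plt.xlabel("number of clusters, k")
plt.ylabel("inertia")
plt.show()
```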
We can also look at the silhouette method; this is the second method. The inertia method gives you a choice between four and five, and you could go ahead and plot the clusterings from both and see which one you like better. That's not an algorithmic approach, and it's more subjective, but this is unsupervised learning, and sometimes it's a little more subjective than, say, supervised learning can be. The second method is to look at the silhouette score; with it you can make dedicated silhouette plots, or you can make an inertia-style plot with the silhouette score as well.

The silhouette score for a given observation i is defined to be

s_i = (b_i - a_i) / max(a_i, b_i).

So what are a_i and b_i? a_i is the average distance from observation i to all the other points in its assigned cluster: you look at all the points within the cluster that observation i belongs to, calculate the distance from i to each of them, and take the average. b_i is the average distance from observation i to all the observations in the next closest cluster, that is, the nearest cluster that i is not assigned to. So b_i is the average of those between-cluster distances, a_i is the average of the within-cluster distances, and you look at b_i minus a_i divided by the maximum of the two.

Each silhouette score is a metric for a single observation i. For a given clustering, if you want the overall (average) silhouette score, you just take the average of the individual observations' silhouette scores.

The silhouette score is available as a metric in sklearn, so you can do from sklearn.metrics import silhouette_score, silhouette_samples. silhouette_score is for the actual clustering itself, so you get back a single number, while silhouette_samples gives you the individual metric for each observation.

OK, so here's a sample clustering. For the cluster-wide score you call silhouette_score,

...

and you put in the data first and then the clusters second.

...

(At this point I hit a couple of errors. I paused the recording and found out I was simply misspelling "silhouette" in both silhouette_score and silhouette_samples; with the spelling fixed, it works.) So here is the single silhouette score for this clustering, where I fit four clusters. I can also show the silhouette samples; I'll just copy and paste to save myself further spelling embarrassment. Again, you put in the data first and then the clusters second.

...

And here's what you get: each observation has its own silhouette score. And remember, if I take the mean of these samples, I should get back the overall score (I wasn't entirely sure that's exactly what it's doing, but it checks out): I get the average value of the score if I just take the mean of the samples.
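Pulling those silhouette calls together in one place, here's a sketch of what this part of the notebook is doing; the k = 4 matches the example above, and the random_state is my own addition.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Fit a candidate clustering; four clusters here, as in the example above.
clusters = KMeans(n_clusters=4, random_state=440).fit_predict(X)

# silhouette_score: a single number for the whole clustering
# (the data go in first, then the cluster labels).
overall = silhouette_score(X, clusters)

# silhouette_samples: one silhouette score per observation.
per_point = silhouette_samples(X, clusters)

# The mean of the per-observation scores matches the overall score.
print(overall, per_point.mean())
```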
Once again, for the overall score you could make another elbow-style plot, this time with the silhouette score.

...

And you can kind of see here that the silhouette score goes up, dips back down at four, and is largest at five. With the silhouette score, you want it to be high. Why do you want it to be high? Well, if it's large, that means each observation is farther from the points outside its cluster than from the points within its cluster; if it's small, that means the within-cluster distances are getting close to the between-cluster distances. Remember that with inertia we thought it should be maybe four or five, maybe three; the silhouette score maybe confirms for us that it should be five, because that was the value of k with the largest silhouette score.

You can also look at what's known as a silhouette diagram. This plots the distribution of silhouette scores for each cluster, along with the sample average score. So for the clusterings with k = 3, 4, 5, and 6, each of these plots has the silhouette score on the horizontal axis and the cluster label on the vertical axis, and this vertical red dotted line gives the average score for the clustering. (I'll include a rough sketch of how you could build one of these panels yourself at the end of this section.) The way to interpret these plots is, for instance, here you have the silhouette scores for one of the clusters in the k = 3 clustering. What you're looking for is for the vast majority of the points in each cluster to have a silhouette score beyond the average; you don't want to see a single cluster where a majority of the scores are below the average, and that helps rule out three. You can see that with k = 4 we have one cluster where essentially all of the scores are below the average, some of them even negative. k = 5 then seems to be the better one, because most of the clusters cross over the average silhouette score, while with k = 6 we again have a cluster that sits entirely underneath the average.

...

So from both of the above plots, the silhouette plots and the inertia plot, we can see that k = 5 is probably the one we want to go with: it was a potential elbow in the inertia plot, it had the highest average silhouette score across the clusterings, and it didn't look like it had any bad clusters, where again, a bad cluster is one where almost all of the scores are below the average. So for this reason, for this particular data set, we would choose five clusters as the value for k.

So, in summary, we learned about k-means in this notebook. It's nice because it's pretty fast and scalable, and it's easy to implement thanks to sklearn. Some of the limitations, though: the random initialization of the centroids can lead to a suboptimal solution, so if you have a bad initial assignment of centroids, you can end up with clusters that aren't the best clusters you could have ended up with. Choosing k is also kind of tricky, because we're doing it somewhat by eye; for instance, if we had a data set with a very large number of observations, the silhouette method we were just talking about can take a long time, because you're computing a ton of distances, while on the other hand, if you only look at the inertia, which we did first, you might be choosing between a couple of different values of k and not be sure which one to pick. We also didn't talk about this, but k-means clustering can be ill-behaved if you have clusters of varying sizes, so if your data set has a very small cluster and a very large cluster, k-means might not work very well; the same goes for clusters with different densities, for instance one that's very sparsely populated and one that's very dense, or clusters with odd shapes, say something potato-shaped versus the nice kind of circles we had above. And you also need to worry about scale, so make sure you scale your data before fitting k-means. OK, all right.
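As promised, here's a rough sketch of how one panel of a silhouette diagram like the ones above could be built. The plotting details (bar layout, labels, line style) are my own choices, not the notebook's code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def silhouette_diagram(X, k):
    # Fit the clustering and get a silhouette score for every observation.
    labels = KMeans(n_clusters=k, random_state=440).fit_predict(X)
    scores = silhouette_samples(X, labels)

    y_lower = 10
    for j in range(k):
        # Sort the scores within cluster j and draw them as a filled band.
        cluster_scores = np.sort(scores[labels == j])
        y_upper = y_lower + len(cluster_scores)
        plt.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_scores)
        plt.text(-0.05, (y_lower + y_upper) / 2, str(j))
        y_lower = y_upper + 10

    # Dotted vertical line at the average silhouette score for this clustering.
    plt.axvline(scores.mean(), color="red", linestyle=":")
    plt.xlabel("silhouette score")
    plt.ylabel("cluster label")
    plt.yticks([])
    plt.title("k = " + str(k))
    plt.show()

silhouette_diagram(X, 5)
```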
So that's it for this clustering algorithm, k-means clustering. I hope you enjoyed learning about this; these are fun techniques, and we do get to make fun plots like the one below. I hope you enjoyed seeing these and enjoyed learning about k-means clustering; I enjoyed telling you about it. I hope to see you in the next video. Have a great rest of your day. Bye.