Hierarchical Clustering Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we're going to learn another clustering algorithm. Assuming you've watched the k-means video first, this algorithm is known as hierarchical clustering, for reasons that will become clear soon. We're going to introduce the idea behind hierarchical clustering, learn about dendrograms and cut points, and then show you how to implement it in SciPy, which is a different Python package; if you have NumPy installed, you should have SciPy installed as well. We'll also learn how to plot a dendrogram in Python.

So hierarchical clustering is another technique for grouping n observations of m features that are stored in a matrix X. A nice feature of hierarchical clustering is that, in comparison to k-means, you don't need to have an idea of the number of clusters prior to fitting the algorithm. Instead, we can examine something known as the dendrogram and use it to make an informed choice after running the algorithm. We're going to show how the algorithm works with an illustrated example, and hopefully that makes it crystal clear what's going on.

Let's assume we have a very simple data set with five observations and two features, maybe an x1 and an x2, that have been plotted here. At the beginning of hierarchical clustering, you assume that every single point is its own cluster. Then you look at the distance between any pair of clusters, where, for simplicity of this argument, I'm going to take the distance to mean the minimum distance between any pair of points in the two clusters, which is a standard choice. So for the distance between cluster A and cluster B, you look at all the distances between the points in cluster A and the points in cluster B and take the minimum.

You start with a distance cutoff of D = 0, meaning you would only join clusters whose distance is zero, so every point stays in its own cluster. You then slowly increase this value of D, and whenever two clusters are within D units of one another, you join them into a new cluster. For instance, maybe after I increase the distance cutoff for a while, B and C are close enough together that the distance between them is smaller than the cutoff, so I cluster them together; the same goes for D and E. So we started with five clusters, where each point was its own cluster, and now we have three clusters: B and C, D and E, and A on its own. Then you increase the distance threshold again, and eventually the distance between the cluster containing B and C and the cluster containing D and E falls below the cutoff, at which point you merge them together. Now I have two clusters: B, C, D, and E are clustered together (I've retained the original cluster outlines just to show which two were joined), and A is still off on its own.
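To make that single-linkage distance concrete, here is a minimal sketch (not from the lecture) of computing the minimum pairwise distance between two small clusters; the coordinates below are made up purely for illustration, and cdist comes from SciPy.

    # A sketch of the single-linkage distance: the minimum distance between
    # any pair of points drawn from two clusters. Coordinates are invented.
    import numpy as np
    from scipy.spatial.distance import cdist

    cluster_a = np.array([[0.0, 0.0]])               # point A on its own
    cluster_bc = np.array([[2.0, 1.0], [2.2, 1.1]])  # points B and C

    # cdist returns all pairwise distances; single linkage takes the minimum.
    single_link_dist = cdist(cluster_a, cluster_bc).min()
    print(single_link_dist)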
Eventually, as I keep increasing that distance threshold, A will get close enough to one of the points in this larger cluster that we have to merge everything into one big cluster. So you start with a distance cutoff of zero, which corresponds to each point being its own cluster, and work your way up until all points are in a single cluster.

This clustering information is then stored in what is known as a dendrogram. If we go back through: everything started on its own, then B and C and D and E were joined, then B, C, D, and E were joined, and then A joined in. So how do we visualize this as a dendrogram? Each point gets its own unique line in the dendrogram; this can be a vertical line or a horizontal line. In this example the distance is on the vertical axis (if the lines were horizontal, it would be on the horizontal axis), and the height at which lines meet corresponds to the distance at which the clusters were combined. Once we reach the distance where B and C, and D and E, were joined, we merge the lines of those pairs of points, and if you were to make the cutoff here, you'd have the clustering A, BC, DE. Then, as we kept increasing the distance, these two clusters joined together to form BCDE, so if we were to make the cutoff here, we'd have the clustering A and BCDE. Eventually A joined the group into the much larger cluster, resulting in this single line at the top. That's how a dendrogram is read, and that's how it encodes all the information that we had in these four distinct plots.

So, as we saw, you can get a variety of clusterings without needing to decide on one ahead of time. You can use something like inertia or a silhouette score once again to get an idea of the best cut point, because the clusterings are determined by a cut point in the dendrogram. Say I thought the best clustering was this one here; then I would make the cut point I'm tracing out with my mouse, and at this cut point A would be its own cluster, B and C would be a cluster, and D and E would be a cluster.

We can use SciPy to perform this sort of procedure (rather than sklearn, which we used for k-means). So here's my data; I believe this is a similar setup to what we had in the k-means notebook, maybe with different numbers because it's randomly generated. SciPy keeps the hierarchical clustering functions and objects in the cluster.hierarchy module. We're going to import two things: first dendrogram, which we'll use second, and then linkage, which we'll use first. So from scipy.cluster.hierarchy we import linkage and dendrogram. The first thing we have to do is create what is known as a linkage matrix; the linkage matrix stores all of the data represented by those images, and we'll see how. The way you do this is you call linkage and pass in the data, which is stored in capital X, and we'll store the result in something called Z, which is just standard notation from SciPy. Then you specify which method it's going to use; for us, that's method='single'.
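Here is a minimal sketch of the steps just described; the toy blob data is my own assumption, since the notebook's actual X isn't shown in the transcript, but the linkage call matches what the lecture describes.

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # Invented stand-in for the notebook's randomly generated data X:
    # three loose blobs of points in two dimensions.
    rng = np.random.default_rng(216)
    X = np.concatenate([rng.normal(loc=c, scale=0.5, size=(7, 2))
                        for c in ([0, 0], [4, 4], [0, 4])])

    # Build the linkage matrix using single linkage (minimum pairwise distance).
    Z = linkage(X, method='single')
    print(Z[:5])  # first merges: cluster indices, merge distance, new cluster size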
So what does this do? Well, there are a few different methods you can choose for deciding how the points get merged, and these methods are just different ways of computing the distance between two clusters. Single uses the distance I mentioned above, where you take the minimum distance between any pair of points in the two clusters. Once I run this, it returns what's known as a linkage matrix, which I've stored in capital Z.

Here's a look at the linkage matrix. The first two columns are the labels of the clusters being merged; remember, we start off with every point as its own cluster. So here we can see the first merge is between points 11 and 13; that merge occurred at a distance of 0.242044, creating a new cluster of size two. Then we have something similar between points 0 and 1, points 4 and 6, points 2 and 4, and so forth. Eventually we get to a point where the rows stop labeling individual points and start labeling clusters. For instance, here, when cluster 8 and cluster 21 merge, we get a cluster of size three, which is how we know that one of those two was already a cluster while the other must have been a single point. So the linkage matrix gives you the two clusters that are merging, the distance at which they merged under the single-linkage distance, and the size of the new cluster that was formed.

Here's a breakdown of the methods: we chose single, but there are other ways the distance between clusters can be computed and merges chosen. Typically I've used single in the times that I've used this, but the others are also useful.

This linkage matrix can then be fed in to make a dendrogram: you call dendrogram and pass in Z, which is where we stored the linkage matrix, and here's the resulting dendrogram for these points. The points are arranged along the axis so that the merges make visual sense; it's not arranged in order from 0 up to 19 (however many points there are). Now you might think to yourself, hey, it even tells me what the best clustering is. That's great. But these colors are entirely coincidental. The fact that we have orange over here, green in the middle, and red over here for these three clusters, which seem to be the main clusters in the data set, is completely coincidental. The function just chooses a threshold, and that's an input you can pass to dendrogram: as you can see, there's an argument called color_threshold, which colors the clusters found in the dendrogram given a cut point. The default for this value is 0.7 times the maximum distance at which clusters were formed. Here the maximum distance is 3.832524, so it takes 0.7 times that and draws the cutoff somewhere around here. We can demonstrate that: we could call dendrogram with color_threshold set to the default, but actually, let's leave it as is, because the default already is that value. So this is the cut point corresponding to the default.
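For reference, here is a minimal sketch of the dendrogram call described above, continuing from the linkage matrix Z built in the earlier snippet; matplotlib is assumed to be available for the plot itself.

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram

    # With no color_threshold argument, dendrogram colors subtrees using its
    # default threshold of 0.7 * the maximum merge distance in Z.
    plt.figure()
    dendrogram(Z)
    plt.ylabel("merge distance")
    plt.show()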
So this is 0.7 times the maximum distance at which a merge happens, which is up here. But you can go through and set your own color threshold. What if we said it was 0.4 instead? Now this is the cut point, and you can see we get a much different coloring. So the coloring you get by default is just based on a value the developers chose, 0.7 times the maximum distance; it doesn't mean that's the best clustering.

Once you do have a cut point you like (we'll use the default threshold of 0.7 times the maximum distance), you can use fcluster to get the cluster labels without having to write them down by hand; it will give you labels like 1, 2, 3, and so on. We can do this with the function fcluster, which is also stored in cluster.hierarchy. So from scipy.cluster.hierarchy we import fcluster. Then you get the clusters by calling fcluster: first you pass in the linkage matrix, then the cutoff point, which we're going to make 0.7 times the max because that's the default and I thought it looked nice, and then an argument criterion='distance'. This uses the same criterion that the dendrogram plot was using; there are other criteria you could use, and I encourage you to look at the documentation if you're interested in seeing what they are. So here's the clustering, and I'll store it in a Python variable called clusters. Now we can draw the points with the corresponding clustering: cluster one up here, two over here, and three over here.

Similar to k-means, you could use something like an inertia plot, a silhouette plot, or a silhouette score to see which clustering is best. Sometimes you'll just use a subjective measure, looking at the dendrogram and saying this seems reasonable. It depends on the setting. If you're using this for a research project that you want to publish, you probably want something defensible and reproducible, so relying on subjective judgment may not be best and you might want to fall back on something like a silhouette score. In industry, you can maybe get by with subjective judgment, and with a personal project you more than likely can just use your subjective judgment, but you can use things like the silhouette score and inertia plots there as well.

OK, so that's it for hierarchical clustering. That's the second and final clustering algorithm that we will learn about in these materials. I hope you enjoyed learning about hierarchical clustering, and hopefully you have a project where you can use it. I enjoyed teaching you. I hope you have a great rest of your day, and I will see you next time. Bye.
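Here is a minimal sketch of the two steps above, continuing from the Z and X of the earlier snippets: recoloring the dendrogram with a custom color_threshold, then using fcluster to pull out flat cluster labels. The scatter plot at the end is my own illustration of drawing the points colored by cluster.

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, fcluster

    # A custom color_threshold changes which subtrees get their own color.
    plt.figure()
    dendrogram(Z, color_threshold=0.4)
    plt.axhline(0.4, linestyle="--")  # mark the cut point for reference
    plt.show()

    # Cut the tree at 0.7 * the maximum merge distance (the same value that
    # dendrogram uses as its default color_threshold) to get flat labels.
    cutoff = 0.7 * Z[:, 2].max()
    clusters = fcluster(Z, t=cutoff, criterion="distance")
    print(clusters)  # one integer label (1, 2, 3, ...) per observation

    # Plot the original points colored by their assigned cluster.
    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], c=clusters)
    plt.show()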