Charts Involving Rectangles
Video Lecture Transcript
This transcript was automatically generated by Zoom, so there may be discrepancies between the video and the text.
16:47:14 Hi, everybody! Welcome back. In this video, we continue to learn about Matplotlib.
16:47:18 In particular, we're gonna learn how to make various charts that use rectangles to convey information.
16:47:27 So let's go ahead and share those Jupyter notebooks and get started.
16:47:37 So, we're in this notebook. We're really just gonna go through a few different plot types.
16:47:43 I'm gonna tell you about the plot type in the abstract,
16:47:48 and then I'm going to show you how to make it in Matplotlib.
16:47:51 So the first one we're gonna talk about is histograms.
16:47:55 So histograms are a way to visualize the distribution of a variable, where the distribution is the different values that a variable can take on, along with the probability that it takes on those values. So in real life, unless we have some knowledge about the actual random variable, we can't
16:48:14 really plot that. But for observations, we're plotting what's known as the empirical distribution, which is just a way to plot the values that we observe. So say we go out and we record the heights of 100 people.
16:48:27 We can make a histogram of the empirical distribution of those heights.
16:48:32 So, speaking of the height example, we're gonna use that galton.csv data file we used in a previous Jupyter notebook for our histogram examples.
16:48:45 So histograms take the range of a variable, they break it up into smaller intervals, and then they count up the number of observations within each of those intervals.
16:48:56 Then they draw rectangles where the base of each rectangle is one of those smaller intervals, and the height of the rectangle is the number of observations that fall within it.
16:49:06 So here's an example.
Histogram where, let's say, I have this dataset given by this list.
16:49:13 A histogram for this with 4 bins could look like this.
16:49:16 Okay, so we have 2 observations that fall between 1 and 3, where this is being inclusive, so 1 is included in here. We have 2 that fall from 3 to 5.
16:49:31 Yep, we have 2. We have 3 that fall from 5 to 7, and we have 4 that fall from 7 to 9.
16:49:39 So I say "a histogram" because there is no single histogram.
16:49:48 The way the histogram is drawn depends upon the number of bins, the width of the bins, as well as the range of the bins. So most of the time what happens is you set a uniform bin width, or a certain number of bins that will each have the same
16:50:04 width, and then the histogram is drawn from that. Other times you can specify the actual ranges of these intervals, and then that's how the histogram is drawn.
16:50:13 There's really no one correct answer for the number of bins or the width of bins that you're gonna want to use.
16:50:21 There are various formulas that you can find here to get good rules of thumb, depending on what you'd like to see. But in general, what I've found is you kind of just have to play around with different numbers of bins and see what works best for your dataset.
16:50:35 So in Matplotlib, you can make a histogram with the hist function,
16:50:41 the documentation of which is found at this link.
16:50:45 We're gonna demonstrate with this galton.csv dataset.
16:50:47 So what I'm gonna do is make a histogram of the height column.
16:50:56 So first I make a figure, then I call hist, and then all I have to do is pass in the desired variable, which is galton.height.
16:51:05 So now I have a histogram. It looks something like this.
16:51:10 Okay, I can control how many bins are drawn with the bins
16:51:14 argument.
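As a rough sketch of the four-bin example described above: the list shown in the video does not appear in the transcript, so the `data` list below is a hypothetical one chosen to give the same bin counts (2, 2, 3, 4). The call follows the `plt.figure` then `plt.hist` pattern from the lecture. Passing `galton.height` in place of `data` should work the same way, assuming `galton` is a pandas DataFrame loaded from galton.csv.

```python
import matplotlib.pyplot as plt

# Hypothetical data (not the list from the video), chosen so that
# four equal-width bins over [1, 9] hold 2, 2, 3 and 4 observations.
data = [1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9]

plt.figure(figsize=(6, 4))

# hist returns the bin counts, the bin edges, and the drawn rectangles.
counts, edges, patches = plt.hist(data, bins=4, edgecolor="black")

plt.xlabel("value")
plt.ylabel("count")

print(counts)  # number of observations in each bin
print(edges)   # the five endpoints of the four bins
# Outside a notebook, call plt.show() to display the figure.
```

Every bin is closed on the left and open on the right except the last one, which also includes its right endpoint. That is why 9 is counted in the 7 to 9 bin.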
So, for instance, I can draw one with 5 bins, or 5 rectangles, by setting bins equal to 5.
16:51:24 So now I have one with 5 rectangles: here's a rectangle, rectangle, rectangle. Or I can have one with 50 bins, or as close to 50 bins as it can get, by setting bins equal to 50. And you can see sort of the 2
16:51:37 extremes that go on. So if you have too few bins, you're kind of missing the signal.
16:51:44 You're just getting a shape that isn't really as helpful to you as you might want it to be.
16:51:49 But if you use too many bins, you're really kind of just getting noise.
16:51:53 So you have to try different values of bins to find that sweet spot between signal and noise.
16:52:00 In addition to giving just a number of bins, you can also provide a list, tuple, or array where you specify the endpoints of the bins.
16:52:10 So for instance, here, maybe I want the bins to cover these ranges that I've highlighted: 55 to 65, 65 to 70, 70 to 72.5, 72.5 to 75, and then finally 75 to
16:52:25 90. So you can do that by just putting in an array of all the desired endpoints.
16:52:30 So 55, 65,
16:52:34 70, 72.5, 75,
16:52:38 90. And now I'll have these custom bins, where I've also included this argument,
16:52:44 edgecolor equals black. So this will color the edges of the bins black.
16:52:48 And I did this so you can see the different widths of my bins.
16:52:53 So there are other customizations you can do. color will change the color of the rectangles, so I could set them to be red if I wanted. edgecolor, as I just mentioned, will change the color of the outline. alpha
16:53:08 will change the opacity, like we've seen before. And then finally, label will give the histogram a label for legends. So I could put in label and say... well, I guess we'll see an example where I do that later,
16:53:21 so I won't bother doing it now.
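The bin choices described above can be sketched side by side. The galton.csv file is not reproduced here, so `heights` is synthetic data standing in for the height column (an assumption, not the real dataset), and the three histograms are placed on subplots for compactness rather than in separate figures as in the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for galton.height (assumed, not the real data).
rng = np.random.default_rng(0)
heights = rng.normal(loc=67, scale=3.5, size=900)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Too few bins: only a coarse shape is visible.
axes[0].hist(heights, bins=5)
axes[0].set_title("bins=5")

# Many bins: the outline becomes jagged and noisy.
axes[1].hist(heights, bins=50)
axes[1].set_title("bins=50")

# Custom bin endpoints, with black edges so the unequal widths stand out.
bin_edges = [55, 65, 70, 72.5, 75, 90]
counts, used_edges, _ = axes[2].hist(heights, bins=bin_edges, edgecolor="black")
axes[2].set_title("custom endpoints")
# Outside a notebook, call plt.show() to display the figure.
```

Observations that fall outside the first and last endpoints are left out of the custom-bin histogram, so the counts can sum to less than the number of observations.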
There's another argument that we haven't seen before, called hatch.
16:53:28 hatch takes in a string, one of these following options, and then it will draw lines or shapes on the bins in order to provide hatching.
16:53:40 So here's an example where I'm gonna set hatch equal to just the string of the backslash,
16:53:47 I think. And so now you can see
16:53:49 each rectangle has these lines drawn on it. So this can be useful when you're gonna draw 2 histograms on top of each other,
16:53:57 because it helps people distinguish between one histogram and another.
16:54:03 Remember, we're trying to be mindful of people who experience color blindness and have a harder time seeing the differences between colors.
16:54:11 So here I'm going to draw 2 histograms, one for those individuals that were identified as female, and the other for those individuals in the data that were identified as male.
16:54:24 So the females were drawn with blue solid rectangles with hatching, and then the males,
16:54:33 their distribution was drawn with an orange solid background that has a little bit of see-through,
16:54:40 so you can see the rest of the female histogram.
16:54:44 Okay, so that's one type. Histograms are a very popular chart type
16:54:47 when you're doing exploratory data analysis or trying to present the distribution of 2 variables that
16:54:53 don't overlap, say, kinda like this. So here you can see the height distribution, where those individuals identified as female tend to have lower heights than those identified as male.
16:55:10 Another type of chart that uses rectangles to convey information is the bar chart. Bar charts are used to compare the value of some variable between various groups.
16:55:22 So here's an example where I've got some variable on the vertical axis, and it's comparing that variable among these different groups,
16:55:29 A, B, and C. So B has the highest score, C the middle score, and A
16:55:34 the lowest. So in Matplotlib, you can make this with the bar
16:55:38 command. There are actually 2 ways, but the first way is the bar command.
16:55:44 That is, you use plt.bar or ax.bar. This makes vertical bar charts like the one pictured here, vertical because the height is what conveys the number.
16:55:56 Here is a link to the documentation if you'd like to learn more.
16:56:00 So bar can take in labels for the horizontal axis, like Group A, Group B, Group C, followed by the heights of the rectangles,
16:56:14 so 10, 50, and 30, and this is the chart we see here.
16:56:19 Instead of the labels Group A, B, and C, you can also put in the horizontal positions
16:56:25 of the centers of the bars. So here I've put in 1, 3, and 10, and we can see that I've got a bar centered at 1, a bar centered at 3, and a bar centered at 10. So you might be wondering why we'd want to do something like this instead of just
16:56:40 using the group labels. Sometimes this is useful if, let's say, we wanted to break Group A, Group B, and Group C into 2 subgroups, for instance male and female, or old and young, something like that.
16:56:56 So that can be useful because with labels you'd have to put in something like "Group A,
16:57:06 old", and then on top of that the bars would all be the same color, whereas if you specify the horizontal positions, you can make it so that subgroup 1 is a different color from subgroup 2, say. So there are different ways to customize bar charts in Matplotlib.
16:57:26 color is the color of the bars, alpha is
16:57:29 the opacity, edgecolor is the edge color, hatch is the hatch, and linewidth is the width of the lines drawn on the outside of the bar
16:57:35 and on the hatching.
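A minimal sketch of the two ways of calling `bar` described above, using the heights 10, 50, and 30 and the positions 1, 3, and 10 mentioned in the lecture. The two charts are drawn on subplots here for compactness, which is a layout choice of this sketch rather than something shown in the video.

```python
import matplotlib.pyplot as plt

heights = [10, 50, 30]

fig, (ax_labels, ax_positions) = plt.subplots(1, 2, figsize=(10, 4))

# First form: group labels on the horizontal axis, followed by bar heights.
ax_labels.bar(["Group A", "Group B", "Group C"], heights)

# Second form: numeric horizontal positions for the bar centers.
positioned_bars = ax_positions.bar([1, 3, 10], heights)

# Each rectangle is centered on the position that was passed in.
centers = [bar.get_x() + bar.get_width() / 2 for bar in positioned_bars]
print(centers)
# Outside a notebook, call plt.show() to display the figure.
```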
So another thing you can do is adjust the width of the bars with the width argument. And if we have, for example, bars like these that were not given labels but were instead
16:57:51 given horizontal positions, we can go ahead and provide labels with the tick_label argument.
16:57:58 So, for instance, we could do Group A, Group B, and Group C. Okay.
16:58:07 And now the labels are there, and the excess tick marks have been removed.
16:58:10 So here's that example I was trying to explain, maybe not so well.
16:58:17 So here's an instance where I have the different groups,
16:58:21 but then I wanted to break them down into subgroups, like maybe female and male, and so we can see that we've got the solid blue and the orange hatch for these different groups.
16:58:34 Now, here's an issue, though. When I do something like this, I can't use the tick_label argument to get this, because the tick labels would be drawn on either the blue or the orange.
16:58:45 But ideally we want them drawn in the middle of the blue and the orange, because this pair that I'm circling right now,
16:58:50 the small blue and the small orange, is one group, Group A, this is Group B, and this is Group C. So what we have to do is set our own tick labels.
16:59:01 And we can do that by using either plt.xticks for horizontal labels, or ax.set_xticks and ax.set_xticklabels.
16:59:12 So here's an example where I use plt.xticks.
16:59:16 So the first argument here is where you want the tick marks to be made, and I calculated that this was the middle of the 2 bars for all 3 of the bar groupings. And then you put in the labels you want, so Group
16:59:31 A, Group B, Group C. Okay, so now we have the labels drawn in the middle of the 2 bars. With ax, this is slightly different.
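The grouped-bar idea with `plt.xticks` can be sketched as follows. The bar heights and horizontal positions are made up for illustration (the values used in the video are not in the transcript). The tick positions are computed as the midpoint of each blue and orange pair.

```python
import matplotlib.pyplot as plt

group_labels = ["Group A", "Group B", "Group C"]

# Made-up values and positions, for illustration only.
female_values = [12, 40, 25]
male_values = [10, 50, 30]
female_x = [1, 4, 7]
male_x = [2, 5, 8]

plt.figure(figsize=(6, 4))

# Solid bars for one subgroup, hatched bars for the other, so the two
# remain distinguishable without relying on color alone.
plt.bar(female_x, female_values, width=1, label="Female")
plt.bar(male_x, male_values, width=1, hatch="/", edgecolor="black", label="Male")

# One tick per group, placed midway between the two bars of that group.
tick_positions = [(f + m) / 2 for f, m in zip(female_x, male_x)]
plt.xticks(tick_positions, group_labels)

plt.legend()
# Outside a notebook, call plt.show() to display the figure.
```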
16:59:42 So first, you have to call ax.set_xticks and put in the positions of the major ticks, and then you have to call ax.set_xticklabels, where you feed in the labels. But if you do it this way, you still get the same exact plot. You just have to do it
16:59:58 in 2 steps with ax, as opposed to one step with plt.
17:00:04 Now you might be saying to yourself, why would I ever use ax instead of plt? That was way easier.
17:00:10 Sometimes you're forced to use the axes version of the function because you're doing something like subplots.
17:00:18 Also, maybe you'll have data for which the group labels don't render nicely along the horizontal axis.
17:00:26 So here's an example where the group labels, because there are so many bars, overlap one another, and you can't read them.
17:00:34 Some people might recommend that you rotate the group labels
17:00:38 so they're sort of vertical. But this goes back to that legend problem, where, if you do that, then your audience is going to have to tilt their head like this to try and read them, which makes them harder to read. So the best strategy
17:00:50 with a situation like this is, instead of flipping the labels alone, you just flip the entire bar plot,
17:00:57 so it's no longer a vertical bar chart but a horizontal one.
17:01:01 So in Matplotlib, you can make a horizontal bar plot with barh.
17:01:06 It takes the same exact arguments as bar, but now, instead of vertical bars, you'll have horizontal ones.
17:01:13 Okay. So here's the horizontal version of this previous plot.
17:01:20 And then I believe this is the final one we'll learn. Yeah.
17:01:24 So the final rectangle-based chart type
17:01:29 we're going to learn is the box-and-whisker plot.
17:01:31 So box-and-whisker plots are a slightly different
17:01:35 rectangle-based plot.
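Before the box-and-whisker material, here is a small sketch of the two bar-chart points from this stretch of the lecture: the two-step `ax.set_xticks` and `ax.set_xticklabels` pattern, and flipping a crowded vertical chart with `barh`. The long group names and values are invented for illustration.

```python
import matplotlib.pyplot as plt

# Invented labels and values, long enough to crowd a horizontal axis.
labels = [f"A fairly long group name {i}" for i in range(1, 13)]
values = list(range(12, 0, -1))
positions = list(range(len(labels)))

fig, (ax_v, ax_h) = plt.subplots(1, 2, figsize=(12, 5))

# Vertical bars with the two-step Axes API: tick positions first,
# then the labels. With this many long labels they tend to overlap.
ax_v.bar(positions, values)
ax_v.set_xticks(positions)
ax_v.set_xticklabels(labels)

# Horizontal bars: the same labels now run down the vertical axis
# and can be read without rotating anything.
horizontal_bars = ax_h.barh(labels, values)
# Outside a notebook, call plt.show() to display the figure.
```

In `barh` the second argument sets the length of each bar along the horizontal axis, which plays the role that height plays in `bar`.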
So what these do is give a way to represent the five-number summary, or the interquartile range, which we'll talk about a little bit more in a second, of the distribution of a variable. And so the rectangle part of
17:01:52 it: the bottom of the rectangle, or the left
17:01:54 if you're drawing a horizontal one,
17:01:56 represents the 25th percentile.
17:02:00 The top of the rectangle, or the rightmost side, represents the 75th percentile. Inside the rectangle,
17:02:07 there's always a line drawn which represents the median of the distribution.
17:02:12 And then in this example that I've drawn, you have the whiskers, which are vertical lines that go out to horizontal lines, and these horizontal lines represent the maximum and the minimum of the observed values. This is the way
17:02:28 I have seen them drawn, and I like to draw them this way, but what the whiskers represent changes with the software you use and what your conventions are. So in Matplotlib, for instance, they're actually gonna be based on
17:02:44 a multiple of the interquartile range, 1.5 times by default.
17:02:48 So what's the interquartile range? It's the length between the 25th percentile and the 75th percentile, which is the height or width of your rectangle, depending on whether it's a vertical or
17:03:06 horizontal box-and-whisker plot. So in Matplotlib,
17:03:10 you can make a box-and-whisker plot with the boxplot command, whose documentation is found here.
17:03:18 So let's make some random data. So here's x, just a random draw from a random normal.
17:03:24 And what you do is call plt.figure, which just sets the figsize.
17:03:29 And then I'm gonna call boxplot, and then you just put the data in.
17:03:33 And once you do that, boom, you've got yourself a box-and-whisker plot.
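A minimal sketch of those steps, assuming NumPy's random generator for the normal draws (the seed and sample size are arbitrary choices of this sketch, not values from the video). It also computes the quartiles directly so they can be compared with what the box shows.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random draws from a standard normal; seed and size are arbitrary.
rng = np.random.default_rng(216)
x = rng.normal(size=200)

plt.figure(figsize=(5, 6))

# boxplot returns a dict of the drawn pieces (boxes, medians, whiskers, ...).
result = plt.boxplot(x)

# The box spans the 25th to 75th percentiles, with a line at the median.
q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the rectangle
print(q1, median, q3, iqr)
# Outside a notebook, call plt.show() to display the figure.
```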
17:03:38 Now again, remember, in that earlier example the whiskers ran from max to min, but here it's 1.5, sorry, 1.5 times the interquartile range.
17:03:47 So you go from the 25th percentile, the bottom of the rectangle, down 1.5 times the height of the rectangle, and that gets you your lower whisker.
17:03:55 Then you go to your 75th percentile, the top of the rectangle, and go up 1.5 times the height of the rectangle, and that's your other whisker. Now if you have points in the distribution, like we
17:04:09 do here, that don't fall within 1.5 times the interquartile range, they're just drawn as individual points, and sometimes people think of these as outliers.
17:04:19 Maybe that's not always the best practice, but sometimes people do.
17:04:25 So it doesn't always have to be 1.5.
17:04:29 You can control the multiplier on the interquartile range with the whis argument, short for whisker.
17:04:37 So, for instance, I could set it to be 0.5 times the interquartile range by setting whis
17:04:44 equal to 0.5, and now you can see my whiskers are a little bit shorter. So what's common is you'll have a bunch of different distributions that you want to compare,
17:04:57 maybe a distribution for Group A, for Group B, for Group C. And in Matplotlib,
17:05:02 if you want to draw those different box plots, you just pass in a list of arrays.
17:05:06 So here I make a different set of draws from a random variable, and I store it in x2.
17:05:15 And now, instead of passing x, x2 as separate arguments, I put in a list with x as the 0 entry and x2 as the 1 entry. And when I draw that, x's box plot is drawn on the left and x2's box plot is drawn on the right.
17:05:35 So before wrapping up this notebook, I want to take one second for an aside on the box-and-whisker plot.
17:05:42 So box-and-whisker plots have come under some scrutiny since their introduction.
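The `whis` argument and the list-of-arrays form described above can be sketched as follows. The data are arbitrary normal draws, and the parameters of `x2` are assumptions of this sketch. One detail worth noting: Matplotlib draws each whisker to the most extreme observation that lies within the multiplier times the interquartile range of the box edge, which is what the manual calculation below reproduces.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(440)
x = rng.normal(size=200)
x2 = rng.normal(loc=1.0, scale=2.0, size=200)  # arbitrary second sample

fig, (ax_whis, ax_pair) = plt.subplots(1, 2, figsize=(10, 5))

# Whiskers limited to 0.5 times the interquartile range.
narrow = ax_whis.boxplot(x, whis=0.5)

# A list of arrays gives one box per array, drawn left to right.
pair = ax_pair.boxplot([x, x2])
ax_pair.set_xticklabels(["x", "x2"])

# Manual whisker ends for whis=0.5: the most extreme observations
# still inside [Q1 - 0.5 * IQR, Q3 + 0.5 * IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower_whisker = x[x >= q1 - 0.5 * iqr].min()
upper_whisker = x[x <= q3 + 0.5 * iqr].max()
print(lower_whisker, upper_whisker)
# Outside a notebook, call plt.show() to display the figure.
```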
17:05:46 They've been really widely adopted as one of the most common distribution plots. An issue with them, though, is they're breaking down the entire distribution into a five-number summary, which is, you know, the min and the other numbers
17:06:02 we talked about in the example. So this can sort of distort our understanding of what the true nature of the variable's
17:06:09 distribution is. So, is it front-loaded?
17:06:13 Is it back-loaded? That kind of stuff. And I would encourage you to look:
17:06:15 there's this nice article from the Data Visualization Society
17:06:20 going over why they've stopped using box plots.
17:06:26 And so here's a nice example. Here's this distribution where the box plot makes you think that the meat of the distribution is actually here.
17:06:35 But maybe what you're actually seeing is there's a really large cluster at the low end, and then some space, and then a bunch of somewhat well-distributed points up above.
17:06:52 So I would encourage you to read through this article to give you a sense of what people think about the box plot and why they maybe don't like it.
17:07:00 So in this notebook we talked about a bunch of different rectangle-based chart types that you can make in Matplotlib.
17:07:10 All of these chart types are pretty popular and used quite a bit in analysis and in,
17:07:16 you know, just graphics that we present to other people.
17:07:20 I hope you enjoyed learning about these. In the next notebook,
17:07:23 we're gonna continue learning more about plot types in Matplotlib with something called imshow.
17:07:30 That's also where we're going to learn more about color bars and how those work.