Alright, I'm going to start recording now. Hi, everybody, welcome back. This is lecture day number 4, so today we're going to continue learning about regression. We'll hopefully get through four or so notebooks, and then that will be it. But before we get started, does anyone have questions about anything?

Okay. I'm going to go ahead and open the notebook and get my chat window open, and we'll continue on with what we were learning about regression. Yesterday we ended with a first predictive modeling project, but in terms of regression we had only learned simple linear regression. We also talked about things like data splits and the whole idea of supervised learning and predictive modeling. So today we're going to continue on with regression, and our first step is to learn about multiple linear regression.

Multiple linear regression models are just regression models that use more than one feature, more than one column of the data set, to predict the output. That's what we're going to learn about in this first notebook. The multiple linear regression model is a quick extension of the simple linear regression model. We assume that we have n observations of some output y, and along with that a set of m features stored in vectors x_1, x_2, all the way up to x_m. Then the multiple linear regression model regressing y on those x_i is given by y equals beta_0 plus beta_1 x_1 plus beta_2 x_2, and so on, up through beta_m x_m, plus the same random noise as before: normally distributed with mean 0 and constant standard deviation. You can rewrite this using linear algebra as a matrix X times a vector beta, where beta is the vector of beta_0 through beta_m, and X is a matrix whose first column is all ones and whose remaining columns are the feature vectors x_1, x_2, up to x_m. So it really is just a quick extension of the simple linear regression model.

Before we talk about how to fit this with scikit-learn, we're going to talk about how you fit it in theory. If you're not someone who's interested in the mathematics behind fitting a model, that's perfectly fine; you can tune back in when we get back to the scikit-learn side of things. The idea of fitting this model is exactly the same as fitting the simple linear regression model: we want to minimize a loss function, and in regression problems the most common loss function is mean squared error. Remember, this is the sum from i equals 1 to n of the actual observation minus the predicted observation, squared, added up and divided by n. The mean part is the dividing by n, the sum part is the adding up, the square part is the square on the difference, and the error part is the actual minus the predicted.
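To collect what was just said in one place, the model and the mean squared error can be written out as follows (this is only a restatement of the formulas described above):

```latex
% Multiple linear regression with m features and n observations
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m + \epsilon,
\qquad \epsilon \sim N(0, \sigma^2)

% Matrix form: X carries a leading column of ones, beta = (beta_0, ..., beta_m)
\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}

% Mean squared error of a fit with predictions \hat{y}_i
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
```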
If I plug in the model, the prediction is X times beta hat, where beta hat is the vector of estimated coefficients; that's how you get this expression. If you then rewrite the mean squared error with some linear algebra, you get the longer matrix expression here, which I'm not going to read out loud. You can use matrix calculus to take the derivative of that expression with respect to the coefficients, set it equal to 0, and you find that the estimates of the betas are given by this formula: X transpose times X, take the inverse of that, times X transpose times y. This estimate of the betas is known as the ordinary least squares estimate (least squares because we're minimizing a sum of squares), and the formula is also sometimes called the normal equation, so this expression for beta hat is known as the normal equation solution for estimating the betas. Let's go show how you would do this in NumPy and then in scikit-learn.
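For reference, the "longer matrix expression" and the resulting estimate look like this; this is a reconstruction of the standard derivation the lecture is describing:

```latex
% MSE in matrix form, with predictions \hat{y} = X\hat{\beta}
\mathrm{MSE}(\hat{\boldsymbol{\beta}})
  = \frac{1}{n}\left(\mathbf{y} - X\hat{\boldsymbol{\beta}}\right)^{\top}
    \left(\mathbf{y} - X\hat{\boldsymbol{\beta}}\right)

% Setting the gradient with respect to \hat{\beta} equal to zero gives the normal equation
X^{\top}X\,\hat{\boldsymbol{\beta}} = X^{\top}\mathbf{y}
\quad\Longrightarrow\quad
\hat{\boldsymbol{\beta}} = \left(X^{\top}X\right)^{-1} X^{\top}\mathbf{y}
```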
Is there a question? Okay. So we're going to use that baseball data again. Just as a reminder, here's what that data looks like: we've got teams, years, leagues, games played, wins, losses, runs, and runs allowed.

Aziz is asking why X transpose X would be invertible. So it maybe isn't invertible, but in most real-world applications of the model you're going to have an invertible matrix. It's very common for it to be invertible, but it doesn't necessarily have to be.

Okay, so let's make our train-test split. We did that, we saw it in an earlier notebook, and then we're going to fit the model. Yes, a question?

Student: Why do you use .copy() here? I haven't seen that in earlier places, not in this course or in other work. What does it do?

Instructor: If you were to put baseball directly in here, the train-test split would technically give you references to the original rows. If I didn't have the copy here, bb_train and bb_test would end up pointing back to the rows of the original data frame, so if I did something to the rows of bb_train or bb_test, I could accidentally be altering the original data frame. The way Python stores things in memory, you can imagine that when you define an object, Python puts that object in a hypothetical box in your computer, and the variable is just a name pointing to that box. So if I just split baseball itself into bb_train and bb_test, those names would just be pointing to the relevant rows of the original data frame; it's not a unique copy. In order to make sure you have a unique copy, you have to make an actual copy of the object, so you do baseball.copy(), which forces a unique new copy of the data frame to be made, and those copies are then what bb_train and bb_test point to.

Student: Sorry, I think train_test_split probably works on a copy instead of the original when it splits out bb_train and bb_test. So my confusion is: is it still necessary?

Instructor: In my experience it is necessary. It's possible scikit-learn has updated the function, but in the past, when I've done it without the copy and then tried to do something on the train set or the test set, it would give me a warning that I was basically changing the original data frame; I forget the exact wording of the warning. So in my past experience it has not been making a hard copy in the background, and that's why I do it.

Okay. So I'm regressing W on R and RA. How do I do that? First I'm going to put it into the form we talked about: I'm going to build a matrix X whose first column is a column of ones, and whose other two columns are my R and my RA. So I'm making a matrix X_train, setting it to be a matrix of all ones, three columns of ones, and then the other two columns, columns one and two, get overwritten with the R and RA values from my training set. And y_train is going to be the vector of W values from my training set. We can look at them if we'd like: here is X_train, where you can see the column of ones, and here is y_train.
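As a rough sketch of what those notebook cells are doing (the variable names, the column names W, R, and RA, and the split parameters are assumptions based on the description above, not the notebook's exact code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Split a copy so that bb_train / bb_test do not reference the original rows
# (test_size and random_state are placeholders, not the notebook's values)
bb_train, bb_test = train_test_split(baseball.copy(), test_size=0.2,
                                     random_state=216, shuffle=True)

# Design matrix: a leading column of ones, then the R and RA features
X_train = np.ones((len(bb_train), 3))
X_train[:, 1] = bb_train['R'].values
X_train[:, 2] = bb_train['RA'].values

# Target vector: the number of wins
y_train = bb_train['W'].values
```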
Zack is asking why we need the column of ones. If we go back to our formula, notice there's a beta_0 out front; in order to be able to write the model as X times beta, X has to have a column of ones out front and beta's first entry has to be beta_0.

Student: In that case, should fit_intercept be turned off?

Instructor: We're first doing it with NumPy, and then we'll talk about how to do it with scikit-learn, and that will come up there.

Okay, so how do we do this with NumPy? We just use the normal equation. Let's refer back: it's X transpose X, inverted, times X transpose times y. So we have X.transpose() dotted with X, and around that we want np.linalg.inv, which stands for inverse, and then that gets dotted with X.transpose() again, and then that is multiplied with y. And these should all be X_train and y_train, not X and y.

So now we can look at our estimates from the normal equation. We estimate an intercept of 84.08, a beta_1 hat of about 0.1 if we round, and a beta_2 hat of about negative 0.1 if we round. To make the predictions on the training set, the fitted values, we would just do beta hat at index 0, plus beta hat at index 1 times column one of X_train, plus beta hat at index 2 times column two of X_train, and that lets us calculate the MSE on the training set, which is 16.95.

Most of the time you're not going to do this; you're just going to use scikit-learn. But I think it's good to see the implementation of the normal equation by hand, with NumPy doing all the linear algebra for us. From sklearn.linear_model we import LinearRegression, so we're still using the LinearRegression object; in scikit-learn, simple linear regression and multiple linear regression use the same object. This was Zack's question earlier: because X_train has a column of ones in it, when we define our LinearRegression object we're going to set fit_intercept=False. Why do we want to do this? The default, fit_intercept=True, assumes that the columns of X only contain features, but because we have a column of ones out front, there to let us demonstrate the normal equation, we have to say fit_intercept=False; the intercept in this particular case is absorbed into the coefficients. Then we do reg.fit(X_train, y_train). Another thing you might notice: remember that in simple linear regression we had to do a reshape. Here we don't have to reshape, because X is a 2D array.
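A sketch of both fits, assuming the X_train and y_train built above; the exact notebook cells may differ:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Normal equation: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X_train.transpose().dot(X_train)).dot(X_train.transpose()).dot(y_train)

# Fitted values on the training set and the training MSE
y_fit = beta_hat[0] + beta_hat[1] * X_train[:, 1] + beta_hat[2] * X_train[:, 2]
print(mean_squared_error(y_train, y_fit))

# Same model in scikit-learn; fit_intercept=False because X_train
# already carries the column of ones
reg = LinearRegression(fit_intercept=False)
reg.fit(X_train, y_train)
print(reg.coef_)  # should match beta_hat
print(mean_squared_error(y_train, reg.predict(X_train)))
```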
X_train is a 2D array, a matrix, whereas before, with a single feature, it was a one-dimensional array; because it's already a 2D array, we don't need to reshape. Now that we've fit our linear regression, we can look at the coefficients and compare. We've got beta_0 hat equal to 84 from the normal equation, and 84 is the first coefficient in this array; then 0.097 and 0.097; and finally negative 0.101 and negative 0.101. If we wanted to make a prediction in scikit-learn, we'd use the model's predict method on X_train. We're just using the training set in this notebook; we're not going to look at the test set at all. And the MSEs also match.

Hopefully this wasn't too confusing, adding the column of ones so we could see how it relates to the normal equation setup, and likewise the fit_intercept argument. That's it for multiple linear regression using continuous features; the next couple of notebooks will expand this model to include categorical features. But before we do any of that, are there any questions about what we learned in this notebook?

Awesome. Okay, so that's multiple linear regression where all of your features are continuous. Now, it's totally possible that we want features that aren't continuous, that is, categorical features. How do we deal with that? We have to learn how to add categorical variables, and also interactions, to our models, and to do that we're going to use a new data set about beer.

Once my notebook is ready to go, we'll see what this data set looks like. I'm making my train-test split from the very beginning. You might notice something new called stratify; ignore it for now, we're going to come back to it next week. It just stratifies the split; we'll talk about it next week. In this data set, each row represents a different beer that exists in the world, or at least did at one point. It has the IBU, which stands for International Bitterness Units; the ABV, which stands for alcohol by volume; the rating from the website the data came from, where users could input their rating out of 5 stars for different beers; and the type of beer, which is either an IPA or a stout. We want to build a model: maybe we have noticed that the more alcohol there is in a beer, the more bitter it tastes.
So we want to look at building a model that predicts IBU using ABV, and here's what that data looks like. I've got IBU on the vertical axis, because it's what I'm going to try to predict, and ABV on the horizontal axis, because that's the feature. Right now I'd say this looks like it has a linear relationship; it's not the strongest linear relationship in the world, but it looks to me like there is one. So, thinking of this from a predictive modeling standpoint, our baseline model might be that IBU is independent of ABV, just the expected value of IBU plus random noise. The next model we might be interested in is the simple linear regression model we talked about yesterday, where I just regress IBU onto ABV.

Now let's change this plot so that we also include information about beer type. The two types of beers in this data set are stouts and IPAs, and we're trying to see whether beer type has an impact on IBU. One way to do this is something called a swarm plot. Here is a swarm plot of the IBU for the two different beer types, stouts on the left and IPAs on the right. In a swarm plot, each observation in the data set is represented by a point; this point here is one of the stouts, this point here is one of the IPAs, and it's just a way to visualize the distribution of the variable you're interested in. What you're looking for, when checking whether a categorical variable might have an impact on the thing you're interested in, is whether the distributions tend to overlap a lot. If they overlap quite a bit, it's possible the variable doesn't have an effect; if they're offset from one another, that indicates it maybe does have an effect and you'd like to keep it in the model. Another way to do this is to recreate the earlier scatter plot but color the points by category. Here's that plot from earlier, but now the IPAs are orange triangles and the stouts are blue circles, and it appears that the IPAs tend to live higher up on the plot than stouts of an equivalent ABV. This suggests to me that we might want to try including the beer type category in our model. So how do we do that? We're going to learn how, after I pause for questions.
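For reference, the two plots described here could be produced roughly like this, assuming seaborn and matplotlib and columns named IBU, ABV, and beer_type (the notebook's actual plotting code may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Swarm plot: the distribution of IBU for each beer type
sns.swarmplot(data=beer_train, x='beer_type', y='IBU')
plt.show()

# Scatter plot of IBU against ABV, colored and styled by beer type
sns.scatterplot(data=beer_train, x='ABV', y='IBU', hue='beer_type', style='beer_type')
plt.show()
```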
Student: Hi, just a general question. When you're building a model, is it important to check beforehand whether the relationships are linear?

Instructor: Yeah. If you're building a linear regression model, one of the key assumptions is that there is a linear relationship. So if the thing you're trying to predict does not have a linear relationship with the features you're using to predict it, then the linear regression model is probably not going to be a very good model. That's one of the things you'd want to check if you're considering using a linear regression model.

Student: So, from today's practice for example, we were asked to calculate the Pearson coefficients and get the correlation. Is there a particular value, say if it's below 0.5, where you don't bother with linear regression and do something else? Some threshold where you can toggle?

Instructor: I don't know of a general rule of thumb. In my head I usually use 0.3, or maybe 0.2: if the correlation is bigger than that in magnitude, so either below negative 0.3 or above positive 0.3, I'll consider including the feature, because it might help. If it's above roughly 0.5 in magnitude, negative or positive, I think, okay, this is a relatively strong correlation, so we should probably include it. And if it's bigger than, let's say, 0.7, that's a really strong correlation and we should definitely try to include it. That's the thought process I go through; I don't know that there's a general rule of thumb for every situation.
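For concreteness, the kind of correlation check being discussed can be done directly in pandas; the column names below are the ones from this notebook's beer example:

```python
# Pearson correlation between the feature and the target
print(beer_train['ABV'].corr(beer_train['IBU']))

# Or the full correlation matrix over the numeric columns
print(beer_train.corr(numeric_only=True))
```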
Student: Just a quick question related to that. You're going to want to use all of the features that you have, right, as long as you think they're contributing to the result? But if one of them isn't linearly related to what you're trying to predict, does that just throw everything off? Would you just not include it, or would you use some kind of log transform to try to make it linear? It seems like there's always going to be one feature that's not linear with respect to what you're trying to predict, so how do you deal with those situations?

Instructor: That's a good question. Typically, you're not going to use every single feature that you have. We collect a lot of data in the world, and only a little bit of it is relevant for what we want to do. There's a process called feature selection, and we're going to learn a couple of different approaches to it. With the tool set we've built up so far, those approaches are really limited to exploratory data analysis. With that, you might see something where it doesn't appear that there's a linear relationship, but, like you said, and as we'll learn in the next notebook, just because it doesn't appear that there's a linear relationship doesn't mean that there's no relationship. For instance, maybe a logarithmic relationship is appropriate, or maybe taking the square or the square root is appropriate. That's really hard to tell without making these types of plots and looking visually, because if you only look at correlations, something might have zero correlation but still have a relationship.

Student: So is it not best practice, then, to use as much data as you have, as long as it's relevant to the situation?

Instructor: That's the key caveat: if the data you're using is improving the predictive power of your model, then you want to use it. But throwing all the features into your model isn't always going to give you the best model. Sometimes there are features that are just not at all related to the thing you're interested in, and those aren't going to be included in the model. For example, maybe somehow in this data set you were able to get information on the color of the beer label; that's not at all going to be related to the IBU, so we wouldn't want to include it, even though it's data we might have.

Student: So is that where you just use the correlation matrix and pick out the ones that are above a certain threshold value? Or does that only tell you whether they're linearly related?

Instructor: Like I said earlier, correlation isn't the end-all-be-all. You have to do things like exploratory data analysis, look to see if there are other relationships, and try other approaches that we're going to learn in later lectures.

Student: Okay, I'll look forward to those then. Sorry, it's kind of a big, broad question.

Another participant: Matthew, can I make a comment? Jacob, I think if you restrict yourself only to linear regressions, then you will keep having all these questions. I think Matthew was also saying that there are other approaches, other models, that you can adopt where all these nonlinear relationships can be taken care of. So these questions will become relevant at a later point, when we learn more complicated and more complex types of models.
Student: So the answer is basically: if you have some features you want to use that are not linearly related to the thing you're predicting, but you want to use them because they are related in some way, you just wouldn't use a linear model. Or you would use logs or whatever to turn the relationship into a linear one. I just wouldn't want to throw a feature out in order to use a linear model.

Instructor: Right. If you're seeing through your exploratory data analysis that there is a relationship, it's just not linear, then you would want to change your model. But if you do those explorations and it doesn't appear there's any relationship, then you would not use that feature.

Student: Alright, thank you.

Instructor: And Melanie was asking to discuss again what we're checking for in these plots. I think part of it is that I didn't explain it very well, because I was remembering what I wrote earlier in the month. Basically, I'm trying to demonstrate the process of investigating whether or not it's worthwhile to include a categorical variable in a linear regression model. One way to do that is to examine the distribution of the thing we're trying to predict, which for us is IBU, and see whether it's impacted in any way by the values of the categorical variable, which for us is stout or IPA. Here, because these two distributions appear to be slightly offset, with the mean or median appearing to be here for IPAs and maybe over here for stouts, that suggests we might consider including it. Another plot type you could make, which is more popular in some circles, is a box plot instead of a swarm plot; this is why you should never just freewheel it, but let's try it and see if that works. There you go. This is a box plot showing the interquartile range, and I think I go over this in the problem session you'll work on Monday. The interquartile ranges are almost not overlapping at all, which is also a suggestion that you'd want to include the variable. Basically, you're looking for evidence that this categorical variable does seem to be impacting the output variable you want to predict, and if that's the case then you want to consider using it. That's one way. The other way is basically remaking the scatter plot and
coloring the markers according to the values of the categorical variable. There you're seeing that, for equivalent values of ABV, the IPAs tend to live above the stouts, which suggests that the type of beer does have an impact on the IBU.

Okay, so now that we're happy and we want to include beer type, we have to go through the logistical process of how you actually include categorical variables in a linear regression model. Some of you may be familiar with this from statistics coursework, and some of you may not. To include a categorical variable in a model, you first have to do some data pre-processing. Categorical variables are typically stored either as strings or as indicator numbers; they'll be numbers, but the numbers don't actually measure anything, they're just indicators. Strings are great for human readability, since I can look at a value and say, okay, this is an IPA, this is a stout, but they're really bad for regression models. For regression models with categorical variables, you need to do something called one-hot encoding.

One-hot encoding is where you take a categorical variable and represent it as a series of zeros and ones, depending on the number of categories. For us, we have two categories, so we're just going to need one 0/1 variable; in general, if you have k unique categories, you need to create k minus 1 one-hot encoded variables, also known as indicator variables. What is an indicator, or one-hot encoded, variable? We denote it with a 1 with a subscript j, where j is one of the categories. The variable equals 1 if your observation is equal to category j. For instance, 1 sub stout, like I have written down here, would be 1 if the beer is a stout and 0 otherwise. So your indicator variable is 1 if your observation is that particular option for the category, and otherwise it's 0.

How can we do this in Python? There's a function called get_dummies from pandas. You call pd.get_dummies and you input the column you're interested in. For us, because we have two beer types, we only need one indicator variable, and I chose to make that the stout indicator. So this is what... oh no, what did it do? Oh, because there is no column called stout; I'm having a brain fart. Here we go, beer_type. Okay, so get_dummies
takes in your categorical column and produces a new data frame whose columns are zeros and ones, depending on the possible options for the category. The first column is IPA, so you'll have a 0 if the beer is not an IPA and a 1 if it is an IPA. Down here, let's look at the first five rows. We can see the first beer in this training set is a stout, and that's why IPA is 0 and Stout is 1. The second one is an IPA, so IPA is 1 and Stout is 0.

Remember, we only need one indicator, and we're going to choose the stout indicator, because that's what I wrote in the notes, so we just need to add a Stout column to the training set. We'll call pd.get_dummies on the beer_type column of beer_train, and then we just need the column for Stout, so I select the Stout column. Then we can check against the first five rows: a stout, so it's a 1; an IPA, so it's a 0; and if we went through and checked, it would hold true for all of them.

I know the concepts of indicator variables and get_dummies can be confusing if you're seeing them for the first time, so does anyone have questions about get_dummies or about the indicators?

Laura is asking: are dummies only used for categorical data? Yes. Trying to do it for continuous data would be difficult, because you'd usually end up with a lot of columns; in general, continuous data can take on infinitely many possible values. So you use it for categorical data. Sometimes you'll also have what's called ordinal data, which is technically categorical but where the numbers provided have some sort of meaning; in that case you can sometimes use those numbers directly, but other times people suggest you use indicators for it as well. So dummies are, I'm going to say, always used for categorical data.

Student: Can the dummies be non-integer, like maybe 0.1, 0.2, 0.3?

Instructor: For other types of models they could be any two distinct numbers, but for linear regression here they specifically do have to be zeros and ones.

Student: Okay, thanks.
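Putting the encoding step together in one place, a sketch (assuming the categorical column is named beer_type with values IPA and Stout):

```python
import pandas as pd

# One 0/1 column per category ('IPA' and 'Stout'); recent pandas returns
# booleans by default, so .astype(int) gives literal zeros and ones
dummies = pd.get_dummies(beer_train['beer_type']).astype(int)
print(dummies.head())

# Keep only k - 1 = 1 indicator: the Stout column
beer_train['Stout'] = dummies['Stout']
```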
Student: Can I chime in a little bit on that? I'm not sure if this applies, but I was working on a recommendation engine at one point, and I had various categories for the things I was recommending, but within a specific category there was also a ranking. What I ended up doing, which I'm not sure is correct, and I suppose this is my question, is a get_dummies-style categorical encoding, but instead of zeros and ones, an item would have a 0 if it wasn't in that category and then the ranking within that category rather than a 1. Does what I'm saying make sense? Is that valid?

Instructor: You can do something like that. If you're going to keep the rankings, I believe you'd want a different type of model, and maybe you weren't doing a regression model; it's hard to tell without knowing the model. There are certain models that are built for rank regression; I don't know them off the top of my head, I'd have to do a web search. But in general, I believe it's recommended for ranking data that you would still do the one-hot encoding, because there isn't necessarily the same integer meaning behind the rankings. The difference in people's minds between something ranked number 1 and something ranked number 2, compared to the difference between 2 and 3, may not be equal; maybe there's a bigger gap between the ranking from 2 to 3 than from 1 to 2. So in that case I believe it is still recommended, in a regression model, that you use one-hot encoding for those.

Student: Then how would you encode the actual ranking? Would you have a separate feature for it?

Instructor: I'd say this might be a good question to come to office hours for, because it's diving a little too far into specifics.

Student: Okay, yeah, I can do that.

Instructor: And then, sorry if I'm saying your name incorrectly, there's a question: what if you have more than two categories? If you have more than two possible categories, then if you had 3 you would make 2 indicators, and if you had 4 you'd make 3 indicators. If you have k possible categories, you need to make k minus 1 indicators. get_dummies makes k of them, but then you need to select k minus 1 of them. And I saw somebody had their hand up, so if that person still has a question, go ahead.

Student: It wasn't me, but I can ask later.

Instructor: Okay. Alright.
So what's the model that we want to build now? Now that we have this Stout variable, which is the indicator of whether the beer is or is not a stout, we're just going to regress IBU on both ABV and Stout. We don't have to change the model; the object is still the same, and if this were a normal equation situation, that would also still apply. Since we're using scikit-learn, all we have to do is make sure that when we fit our model, we include the Stout column along with the ABV column. So here I define the model and fit it on the training set, and this code chunk, which has a lot of code in it, is just me plotting: it makes the scatter plot we saw before, but now I'm adding the model output for the two different categories. The IPAs are the orange dotted line, the stouts are the solid blue line, and one thing you might notice is that they both have the same slope. From looking at the scatter plot, it's reasonable to say it doesn't look like they should have the same slope; you might suggest that the IPAs look like they should have a steeper slope.

So how do I change that? Let's go back to the model we just fit, the one I'm highlighting with my mouse, and go through what happens when we change from a beer that is an IPA, meaning Stout equals 0, to a beer that is a stout, meaning Stout equals 1. In the case when Stout is 0, the model reduces to beta_0 plus beta_1 ABV. When the beer is a stout, the model becomes beta_0 plus beta_2 plus beta_1 ABV, plus epsilon. You can see that the only thing that changes between the two beer types is the intercept, and that's exactly what we're seeing here: the model is saying, okay, IPAs have a higher intercept than stouts.

If we also want a model where the slope changes, we have to include what's known as an interaction term. The interaction model still has the plus beta_2 Stout that we had before, but now there's also an interaction between ABV and Stout, meaning you multiply your ABV column by your Stout column and include that as well. That's what we mean by an interaction term: it's just the multiplication of two of your features. We can again see what happens when Stout is 0 versus when Stout is 1. When Stout is 0 we're left with beta_0 plus beta_1 ABV, but when Stout is 1 we have beta_0 plus beta_2, plus the quantity beta_1 plus beta_3 times ABV. So this model has the potential to have both a different intercept and a different slope for the stouts.
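Written out, the two models being compared are as follows (a restatement of the equations just described):

```latex
% Indicator only: the two beer types share a slope
\mathrm{IBU} = \beta_0 + \beta_1\,\mathrm{ABV} + \beta_2\,\mathrm{Stout} + \epsilon
\quad\Rightarrow\quad
\begin{cases}
\beta_0 + \beta_1\,\mathrm{ABV} + \epsilon & \text{IPA } (\mathrm{Stout}=0)\\
(\beta_0 + \beta_2) + \beta_1\,\mathrm{ABV} + \epsilon & \text{stout } (\mathrm{Stout}=1)
\end{cases}

% Adding the interaction term lets the slope differ as well
\mathrm{IBU} = \beta_0 + \beta_1\,\mathrm{ABV} + \beta_2\,\mathrm{Stout}
             + \beta_3\,\mathrm{ABV}\cdot\mathrm{Stout} + \epsilon
\quad\Rightarrow\quad
\begin{cases}
\beta_0 + \beta_1\,\mathrm{ABV} + \epsilon & \text{IPA}\\
(\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,\mathrm{ABV} + \epsilon & \text{stout}
\end{cases}
```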
So let's go ahead and build this model and then visualize the fit. The first thing we have to do is make the interaction term: beer_train at ABV times beer_train at Stout. Just so everyone's clear, when I say "at" it means I just want that particular column, so "at ABV" means I just want the ABV column. Now I'm going to fit the model I wrote down here, so in my features I've got ABV, Stout, and then the interaction between ABV and Stout. And here I plot the model fit, and you can see that now I have a slightly greater slope for the IPAs than I do for the stouts.
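A rough sketch of those steps, using the column names above; note this assumes the default intercept is used here, since this notebook does not add a column of ones:

```python
from sklearn.linear_model import LinearRegression

# Interaction column: ABV multiplied by the Stout indicator
beer_train['ABV_Stout'] = beer_train['ABV'] * beer_train['Stout']

# Regress IBU on ABV, the Stout indicator, and their interaction
reg_inter = LinearRegression()
reg_inter.fit(beer_train[['ABV', 'Stout', 'ABV_Stout']], beer_train['IBU'])

print(reg_inter.intercept_, reg_inter.coef_)  # beta_0, then (beta_1, beta_2, beta_3)
```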
I see we have a question.

Student: Hi, Matt! Please correct me if I'm wrong, but is the basic idea here that if you have K categories, the point of one-hot encoding is that you basically have K different models, and it fits each model on the data in that category?

Instructor: Yes, that is the basic idea: these indicator variables, like Stout, essentially allow the model to fit two models at the same time. It's fitting the model for IPAs here, and this part would be fitting the model for stouts. That's the whole idea behind indicators. And remember, the reason we're choosing K minus 1 indicators is that there's always going to be a default comparison category when all of the indicators are equal to 0: the IPAs get absorbed into that part of the model, whereas this part is the adjustment made if we have a stout. Are there any other questions about the interaction term?

Student: I have a question. In this particular example there are two types of beers, the IPA and the stout, right? Suppose there was a third kind of beer and I wanted to augment the model to account for it. Am I correct to assert that I would need, say, two more terms: beta_4 times the third beer type, and beta_5 times ABV times the third beer type?

Instructor: Yes. Say the third type was a porter: you'd have beta_4 times Porter plus beta_5 times ABV times Porter. And you can't include just a single indicator; you have to include all the indicators at once.

Student: Okay. So if I have a model that needs to encompass, let's say, N types of beers, how many extra terms would that be? Two times N minus 1 extra terms?

Instructor: Yes, if you're going to include the interactions. And you don't always include interactions; sometimes there is no interaction and the slopes appear to be the same, in which case you just include the indicator.

Student: Gotcha, thanks.

Instructor: Aziz is asking why we chose the interaction to be ABV times Stout, and whether it could be Stout divided by ABV. The reason we chose the multiplication is that the thing we're regressing IBU on is ABV, so the interaction is Stout times ABV. The one you're wondering about, Stout divided by ABV, would be like also trying to regress IBU on one over ABV, and we have to stay consistent with the thing we're actually regressing onto.

Alright, let's now imagine we're back in predictive modeling world, just to get some more practice with things like k-fold cross-validation. We have four models we can review: the baseline, simple linear regression, the one with just Stout as an indicator, and the one with both the indicator and the interaction term. We're going to compare them with cross-validation. I import KFold and mean_squared_error, then I make my KFold split object. Here is an empty array where I'm going to store my MSEs; it's typical to make an array of zeros and then fill in the values as you go. Here I'm looping through the k-fold splits, getting my training set and the holdout set, which other people call the leave-out set. Here I'm getting the baseline fit, the average value of IBU on the training part of the split; here I'm fitting the simple linear regression model and getting its prediction; here I'm getting the model with just Stout; here the model with both Stout and the interaction term; and here I'm recording the mean squared error for all four models.
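A sketch of that cross-validation loop; the number of splits, the random seed, and the variable names are placeholders, and it assumes the Stout and ABV_Stout columns created earlier:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

kfold = KFold(n_splits=5, shuffle=True, random_state=216)
mses = np.zeros((5, 4))  # one row per split, one column per model

i = 0  # split counter, kept separate from the loop variables as in the lecture
for train_index, holdout_index in kfold.split(beer_train):
    bt_tr = beer_train.iloc[train_index]
    bt_ho = beer_train.iloc[holdout_index]

    # Model 0: baseline, predict the mean IBU of the training part of the split
    base_pred = bt_tr['IBU'].mean() * np.ones(len(bt_ho))

    # Model 1: simple linear regression on ABV
    slr = LinearRegression().fit(bt_tr[['ABV']], bt_tr['IBU'])

    # Model 2: ABV plus the Stout indicator
    ind = LinearRegression().fit(bt_tr[['ABV', 'Stout']], bt_tr['IBU'])

    # Model 3: ABV, Stout, and their interaction
    inter = LinearRegression().fit(bt_tr[['ABV', 'Stout', 'ABV_Stout']], bt_tr['IBU'])

    mses[i, 0] = mean_squared_error(bt_ho['IBU'], base_pred)
    mses[i, 1] = mean_squared_error(bt_ho['IBU'], slr.predict(bt_ho[['ABV']]))
    mses[i, 2] = mean_squared_error(bt_ho['IBU'], ind.predict(bt_ho[['ABV', 'Stout']]))
    mses[i, 3] = mean_squared_error(bt_ho['IBU'], inter.predict(bt_ho[['ABV', 'Stout', 'ABV_Stout']]))
    i = i + 1

print(mses.mean(axis=0))  # average cross-validation MSE for each of the four models
```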
When I run this, I can look at the average cross-validation MSE for all four models. It looks like the one with the Stout indicator and the one with the interaction term are the two that perform best, and they're pretty close to one another. But because the interaction model's MSE is slightly lower than the one with just Stout, if I were to stop now, the interaction model is the one I would choose.

Okay, are there any other questions?

Student: Matt, I have one question that's actually about the coding structure, not really about regression. I noticed this yesterday too: in the for loop it says "for train_index, test_index in kfold.split(...)", and nowhere in there is the i, but for some reason the loop is okay with you using i down there in the MSE part.

Instructor: Yeah, so I define i right here, above the loop.

Student: Oh, right, okay. So it's not part of the loop; it's your own counter that you put in.

Instructor: Yup. I do it this way because I think, at least in this case, it makes things clearer for people who are learning for the first time. A lot of other people might do something like enumerate, so you can acquire i as part of the loop as well; you could do something like that, but when I'm teaching I like to keep it separated out.

Student: Totally makes sense, thanks.

Instructor: Are there any other questions?

Student: I have one. The mean squared error here seems to have a pretty high magnitude. Does that say anything about the model? It seems to tell us that the errors are pretty big.

Instructor: Look at the scale of IBU: the MSE, remember, is on the square of that scale. If you took the square root of this, it would be in the tens. Typically, if people want to interpret the MSE, they'll take the square root of it and look at the root mean squared error, because that's on the same scale of units as the IBU. If you take the root of this, you'll be in the tens, which seems to be in line with the range of IBUs.

Student: Makes sense, thank you.
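In code, the interpretation step mentioned here is just the square root of the averaged MSEs from the sketch above:

```python
# Root mean squared error: back on the same scale of units as IBU
rmse = np.sqrt(mses.mean(axis=0))
print(rmse)
```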
Student: I've seen that it's standard practice to normalize all your variables to be between 0 and 1. Is that always advised?

Instructor: You can do that, and for regular, straight linear regression it doesn't make a huge difference; your coefficients will just be rescaled to account for the scale. For other models it matters: we're going to learn a model on Monday where you have to do it, otherwise you're going to mess up the model, and I think that's true of a lot of machine learning models other than linear regression, where you want to do it because otherwise it really messes up the model.

Okay, so we've got two more notebooks to get through, and based on my memory of what I cover in them, I think we should be able to.

Another participant: Actually, a quick question about this normalization. If you don't normalize, then that inverse matrix, the inverse of X transpose X that you discussed a while ago for computing the coefficients, I think the elements of that matrix could be pretty far apart; different elements will have very different magnitudes, right, if you don't normalize?

Instructor: That could happen. It didn't here, I think because of the scales of ABV and the indicator; it just didn't happen. It can happen. But also, scikit-learn in the background doesn't actually use the normal equations; it uses gradient descent, so it doesn't compute an inverse.

Another participant: But you did compute it the brute-force way as well, earlier. All I'm trying to say is that if the magnitudes are very different, it might make sense to do the normalization, at least for that type of computation.

Instructor: Can you elaborate on that one more time?

Another participant: You could have, and I forget the name of the term for such matrices, matrices that are badly behaved, like you're saying.

Instructor: That's right. But we're not computing that here. Even though I gave the background of how you classically solve a linear regression, it turns out scikit-learn doesn't do the normal equations; it does gradient descent, so it doesn't compute the inverse, and that's not as much of a problem. So I don't typically see people do scaling with linear regression, but with other models we will talk about scaling, and the last notebook we're going to go over today is about scaling, because next week's models use scaling quite a bit. It can be an issue, like you said; it just wasn't an issue today.
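Since scaling came up: the 0-to-1 normalization being discussed is usually done with a scaler fit on the training set only. A minimal sketch (this notebook has not introduced it yet, and MinMaxScaler is one of several options):

```python
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training features only, then apply it to both splits
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(beer_train[['ABV', 'Stout']])
X_te_scaled = scaler.transform(beer_test[['ABV', 'Stout']])
```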
So the next thing we're going to build our regression toolkit up with is polynomial regression and nonlinear transformations.
00:52:55.000 --> 00:52:56.000 So for this, we're going to work with a synthetic data set.
00:52:56.000 --> 00:53:03.000 It's called poly.csv.
00:53:03.000 --> 00:53:04.000 I'm pretty sure it should be in the repository, but now I'm slightly nervous
00:53:04.000 --> 00:53:09.000 I forgot to upload it. So if this doesn't work for you, it's because I haven't uploaded the data.
00:53:09.000 --> 00:53:10.000 I will make sure that it's uploaded after, if it's not currently uploaded.
00:53:10.000 --> 00:53:24.000 So we have 2 inputs, x1 and x2, and then the thing we're trying to predict is y.
00:53:24.000 --> 00:53:29.000 So one thing that is a nice feature of pandas is it
00:53:29.000 --> 00:53:33.000 has this function called scatter_matrix. And I believe on Monday you'll also learn —
00:53:33.000 --> 00:53:34.000 or today, actually, you learned — a different function called pairplot in seaborn, which accomplishes the same thing.
00:53:34.000 --> 00:53:43.000 So with scatter_matrix
00:53:43.000 --> 00:53:46.000 you get this nice matrix of the different scatter plots.
00:53:46.000 --> 00:53:54.000 So each row corresponds to one variable on the vertical axis: in this row, every off-diagonal plot has x1
00:53:54.000 --> 00:54:12.000 as the vertical axis, and the diagonal plot shows the histogram of x1. And then every column has one variable as the horizontal axis: this column has x1 as the horizontal axis, this column has x2 as the horizontal axis, and then
00:54:12.000 --> 00:54:22.000 again the histogram is on the diagonal. So the one that we're most interested in is this row, because we want to see y as a function of x1 and x2.
00:54:22.000 --> 00:54:28.000 And so we're going to look at this, and we can see there does appear to be a linear relationship between y and x2,
00:54:28.000 --> 00:54:37.000 but there definitely seems to be some other type of relationship between y and x1.
00:54:37.000 --> 00:54:38.000 And so this is going to be where we might want to try and include some transformations of x1.
00:54:38.000 --> 00:54:50.000 So, for instance, it looks like it could potentially be sort of an even polynomial of x1.
00:54:50.000 --> 00:55:07.000 And so we're going to first start off with learning about polynomial regression, where you add in these sorts of terms, and then we'll branch into nonlinear transformations. Before we talk about that, though, I have a question: Zach is asking, is scatter_matrix preferred over
00:55:07.000 --> 00:55:16.000 a corner plot, just because you can see the transpose? I don't think that there's a preference; this is just the one I chose to use whenever I wrote this notebook,
00:55:16.000 --> 00:55:18.000 like 2 years ago.
00:55:18.000 --> 00:55:26.000 There might be die-hard people on, okay, you have to use scatter_matrix versus a corner plot, or vice versa —
00:55:26.000 --> 00:55:32.000 I'm not one of those people. This is just what I knew when I wrote the notebook.
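A minimal sketch of the scatter matrix call, assuming poly.csv has been read into a DataFrame with columns x1, x2, and y; the figsize is arbitrary.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("poly.csv")  # assumes columns x1, x2, y

# One scatter plot for every pair of columns, histograms on the diagonal.
# seaborn's pairplot(df) accomplishes much the same thing.
pd.plotting.scatter_matrix(df, figsize=(8, 8))
plt.show()
```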
00:55:32.000 --> 00:55:35.000 Okay, so we want to look at the relationship between y and an even power of x1,
00:55:35.000 --> 00:55:46.000 and so we're gonna go ahead and make that one of the columns in our data frame.
00:55:46.000 --> 00:55:53.000 So we're gonna take df['x1'] and then we're gonna square it.
00:55:53.000 --> 00:56:00.000 And then we're going to look at the scatter matrix, go down to the row for y,
00:56:00.000 --> 00:56:04.000 and we can see here, you know, x1 still doesn't look linear.
00:56:04.000 --> 00:56:06.000 But if we look at the relationship between y and x1 squared, now this looks closer to being a linear relationship.
00:56:06.000 --> 00:56:16.000 Okay. So now we've got the idea of fitting the following model.
00:56:16.000 --> 00:56:23.000 So we said that there does appear to be a relationship between y and x2 —
00:56:23.000 --> 00:56:29.000 a linear relationship — and there also appears to be a linear relationship between y and x1 squared.
00:56:29.000 --> 00:56:31.000 And so for this model, we're going to regress y on x1, x1 squared, and x2.
00:56:31.000 --> 00:56:41.000 We'll talk about in a second why we also want to include x1.
00:56:41.000 --> 00:56:42.000 But for now, let's just take it that we want to include x1,
00:56:42.000 --> 00:56:48.000 and then I'll explain why we want to include it
00:56:48.000 --> 00:56:50.000 later in the notebook.
00:56:50.000 --> 00:56:51.000 Okay, so I'm just importing my linear regression
00:56:51.000 --> 00:57:01.000 and then fitting the model here. And we're also gonna look at a little trick, and we'll dive deeper into this trick in a later notebook next week.
00:57:01.000 --> 00:57:12.000 One thing that you might do when you're building these linear regression models is look at something called the residual plot.
00:57:12.000 --> 00:57:16.000 And so a residual plot is where you take your errors, also known as residuals, which is the actual minus
00:57:16.000 --> 00:57:20.000 the predicted, and then you plot that against your actual values.
00:57:20.000 --> 00:57:34.000 So, if our model is pretty good, then we would hope that our residuals are close to the random errors,
00:57:34.000 --> 00:57:53.000 remember, from the theoretical model. So what that means is, if our model is closely approximating y, we would expect our residuals to look like a nicely,
00:57:53.000 --> 00:58:01.000 evenly spread blob of points around the horizontal axis. So here's where I'm going to plot that residual plot, and I forgot to make this smaller,
00:58:01.000 --> 00:58:12.000 so let's do 8 comma 4. So you can see here
00:58:12.000 --> 00:58:18.000 that the residuals — so let's make this smaller too — oh no, okay.
00:58:18.000 --> 00:58:34.000 So you can see here that the residuals are definitely not an even band. What we would expect is, if our model is capturing all the information from the inputs, we would see sort of an even band, because in our assumption the residuals are
00:58:34.000 --> 00:58:40.000 normally distributed. So if our model is getting closer to approximating those random errors —
00:58:40.000 --> 00:58:43.000 you know, we're assuming those are normally distributed — whereas here,
00:58:43.000 --> 00:58:52.000 this is very clearly showing us a pattern in the data: as we change values of y, there's sort of this weird shape going on.
00:58:52.000 --> 00:58:54.000 So again, we'll dive more deeply into why this sort of thing happens in a later notebook.
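For reference, a sketch of the steps just described — the squared column, the regression of y on x1, x1², and x2, and the residual plot. The column and variable names here are my own, not necessarily the ones used in the notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("poly.csv")      # assumes columns x1, x2, y

# New column holding the even power of x1.
df["x1_sq"] = df["x1"] ** 2

features = ["x1", "x1_sq", "x2"]
reg = LinearRegression()
reg.fit(df[features], df["y"])

# Residuals = actual - predicted, plotted against the actual values.
residuals = df["y"] - reg.predict(df[features])

plt.figure(figsize=(8, 4))
plt.scatter(df["y"], residuals)
plt.axhline(0, color="black")
plt.xlabel("y")
plt.ylabel("residual")
plt.show()
```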
00:58:54.000 --> 00:59:00.000 But this is always an indicator that you're missing
00:59:00.000 --> 00:59:16.000 some signal from the variables that you have, that you might want to try and include. And a typical tip is: when you see a shape kind of like this — it's almost like a twisty, wishbone-y type shape — that usually is an indicator that you
00:59:16.000 --> 00:59:17.000 might want to try including an interaction term between the columns.
00:59:17.000 --> 00:59:28.000 So last notebook, we learned that an interaction term is like an indicator times a continuous variable.
00:59:28.000 --> 00:59:34.000 We also call any multiplication of 2 columns an interaction term.
00:59:34.000 --> 00:59:36.000 So when you see a residual plot like this — and again, we'll dive more into residual plots next week —
00:59:36.000 --> 00:59:43.000 it's an indicator that you're missing
00:59:43.000 --> 00:59:51.000 some sort of interaction term, or a nonlinear transformation, or something. So we're gonna go ahead and try and refit the model,
00:59:51.000 --> 00:59:58.000 but now including this last interaction term between x1 and x2. So we're gonna go ahead and do that:
00:59:58.000 --> 01:00:01.000 df['x1'] times df['x2'].
01:00:01.000 --> 01:00:12.000 And so, while I'm fitting the model, I'll also say: you might be confused because I didn't do a train test split here. Again,
01:00:12.000 --> 01:00:21.000 this one is just for demonstration purposes of, you know, the model and the transformations.
01:00:21.000 --> 01:00:33.000 This is not real data; it's synthetic. So if I wanted to, I could always go out into the world and produce new data. I'm not doing predictive modeling, trying to find the best possible model, or anything like that.
01:00:33.000 --> 01:00:34.000 It's just for instructive purposes. So that's why I did not make a train test split.
01:00:34.000 --> 01:00:42.000 Okay, so we've refit this model — we fit this model.
01:00:42.000 --> 01:00:43.000 Now we're gonna remake our residual plots,
01:00:43.000 --> 01:00:50.000 but I'm gonna make those edits that you saw me make earlier, just real quick.
01:00:50.000 --> 01:00:55.000 There we go. And so now you can see it's sort of this nice band between the values of negative 2 and 2, and this is what you're looking for with your residual plots.
01:00:55.000 --> 01:01:11.000 And again, this was sort of just a preview of something we'll dive into a little more deeply next week.
01:01:11.000 --> 01:01:15.000 So this tends to indicate that you've captured a decent amount of the signal.
01:01:15.000 --> 01:01:21.000 So, like I said, maybe this seems like a mystical process of, how are you supposed to know?
01:01:21.000 --> 01:01:24.000 We'll talk about this in much more detail next week.
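Continuing the sketch above, the interaction-term refit and the second residual plot might look roughly like this; again, the column names are placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("poly.csv")
df["x1_sq"] = df["x1"] ** 2
df["x1x2"] = df["x1"] * df["x2"]   # the interaction term

features = ["x1", "x1_sq", "x2", "x1x2"]
reg_int = LinearRegression()
reg_int.fit(df[features], df["y"])

# With the interaction included, we hope the residuals form an even band around 0.
residuals = df["y"] - reg_int.predict(df[features])

plt.figure(figsize=(8, 4))
plt.scatter(df["y"], residuals)
plt.axhline(0, color="black")
plt.xlabel("y")
plt.ylabel("residual")
plt.show()
```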
01:01:24.000 --> 01:01:32.000 Okay. So I also promised I would tell you: if the relationship that seems to be linear is with x1 squared, why are we also including x1?
01:01:32.000 --> 01:01:42.000 So this is about something called respecting the hierarchy. If we look at the coefficients, this is the coefficient on x1 —
01:01:42.000 --> 01:01:50.000 this one I'm highlighting right here — and it's relatively close to 0.
01:01:50.000 --> 01:02:01.000 So you might be thinking, well, I should probably just remove it, because it's already pretty much 0 anyway, and just fit the model I've written here — maybe that's the functional form.
01:02:01.000 --> 01:02:17.000 You don't wanna do that. When you're building polynomial regression models, or regression models with interaction terms, you have to include all of the lower powers of the variable as well. So if you want a model that has
01:02:17.000 --> 01:02:22.000 x1 squared in it, you also have to include the variable x1.
01:02:22.000 --> 01:02:28.000 If you were to only include x1 squared, you'd be fitting a model of the form
01:02:28.000 --> 01:02:38.000 beta 0 plus beta 1 times x1 squared, and you're limiting the flexibility of your model to only consider parabolas of that form.
01:02:38.000 --> 01:02:41.000 By including x1, you're able to fit the whole range of parabolas over the real numbers.
01:02:41.000 --> 01:02:44.000 And so that's why you need to include all the lower powers.
01:02:44.000 --> 01:02:55.000 So, for instance, if you were including x1 cubed, you'd also need to include x1 squared and x1.
01:02:55.000 --> 01:02:56.000 If you had x1 to the fourth, you'd need to include x1 cubed, x1 squared, and x1 as well.
01:02:56.000 --> 01:03:08.000 So if you're fitting a polynomial of the nth degree, you need to include all the lower degree terms as well.
01:03:08.000 --> 01:03:21.000 Similarly for interaction terms: if you include x1 times x2, you need to include both x1 and x2 as their own predictors as well.
01:03:21.000 --> 01:03:30.000 And then, just as a final thing before I open it up for questions: the same way we made polynomial transformations —
01:03:30.000 --> 01:03:33.000 and if you did the problem session today, you saw that you can do this as well —
01:03:33.000 --> 01:03:39.000 you can just make nonlinear transformations of the columns. So you can take square roots,
01:03:39.000 --> 01:03:44.000 you can take logs, you can take sines, you can take e to the column.
01:03:44.000 --> 01:03:50.000 You can do any nonlinear transformation you'd like and then include it in whatever model you'd like.
01:03:50.000 --> 01:03:51.000 So that's another step in the process of, you know,
01:03:51.000 --> 01:03:58.000 looking at these different plots and seeing which ones maybe have linear relationships,
01:03:58.000 --> 01:04:02.000 because again, we're fitting a linear regression model.
01:04:02.000 --> 01:04:03.000 Okay. So now I'll open it up for questions before we end
01:04:03.000 --> 01:04:12.000 the polynomial regression notebook.
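One more note on the hierarchy point above — this wasn't shown in the lecture, but sklearn's PolynomialFeatures (mentioned again later in the session) will generate all the lower-degree and interaction terms for you, which is one way to respect the hierarchy automatically. A minimal sketch; get_feature_names_out assumes a recent version of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("poly.csv")   # assumes columns x1, x2, y

# degree=2 generates x1, x2, x1^2, x1*x2, x2^2 -- the full hierarchy up to degree 2.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[["x1", "x2"]])

print(poly.get_feature_names_out())   # ['x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']
```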
01:04:12.000 --> 01:04:15.000 Hi! I have a question.
01:04:15.000 --> 01:04:16.000 Yeah.
01:04:16.000 --> 01:04:17.000 It seems like this can get really complicated really quick
01:04:17.000 --> 01:04:33.000 if you have lots of features. So in a real world situation, you might have like 10 features, and then you might see that some of them are maybe quadratic looking,
01:04:33.000 --> 01:04:36.000 some of them are exponential. So you have to include all those,
01:04:36.000 --> 01:04:46.000 and then you have to include all the interaction terms. Is there any advice for doing this with real data where you might have lots of interaction terms?
01:04:46.000 --> 01:04:53.000 Or would you just scrap this whole linear thing and maybe, you know, do something else?
01:04:53.000 --> 01:04:55.000 Yeah. So when you — no, no, it's fine, that's fine.
01:04:55.000 --> 01:04:58.000 Sorry, it's kind of a vague, general question, but.
01:04:58.000 --> 01:05:05.000 So, if you have a data set that has lots of features, it would be difficult to —
01:05:05.000 --> 01:05:17.000 it can be difficult to do this sort of systematic process of plotting. Like, for instance, with 10 it's maybe somewhat manageable to make a scatter matrix like this,
01:05:17.000 --> 01:05:20.000 but like you're saying, it could very quickly get out of hand, and in other problems
01:05:20.000 --> 01:05:28.000 you'll have more than 10. So in those situations you can't always do this sort of thing where you're plotting, or going step by step like this.
01:05:28.000 --> 01:05:37.000 In those cases you could still consider linear regression models,
01:05:37.000 --> 01:05:45.000 but you might have to try more algorithmic approaches to selecting features — which, like I said earlier, we are going to see more algorithmic approaches to selecting features
01:05:45.000 --> 01:05:55.000 next week. You may also consider other model types entirely. We don't explicitly cover the regression versions of these,
01:05:55.000 --> 01:06:01.000 but we do talk about random forests and support vector machines and k-nearest-neighbors
01:06:01.000 --> 01:06:06.000 type stuff when we do classification, and they have regression counterparts.
01:06:06.000 --> 01:06:10.000 So you might just try different models. You might try more algorithmic feature
01:06:10.000 --> 01:06:21.000 selection approaches. If it's reasonably manageable, you might just take the time to look at the different plots and then see, okay — like you said in your example, this seems to be quadratic —
01:06:21.000 --> 01:06:43.000 then you might include the quadratic term and then do a cross validation, because linear regressions tend to be relatively quick to fit in comparison to other models.
01:06:43.000 --> 01:06:48.000 And so Ky is asking: is principal component analysis another approach?
01:06:48.000 --> 01:06:54.000 So principal component analysis isn't necessarily showing you which features are the most important
01:06:54.000 --> 01:07:14.000 as predictors, or which give you the best predictions, but it can be a pre-processing step for linear regression models. In particular, linear regression models don't perform well if the columns are highly correlated
01:07:14.000 --> 01:07:15.000 with one another. This is because you can essentially rewrite one column as a linear combination of the others, or at least get close to it,
01:07:15.000 --> 01:07:26.000 so the regression fit performs badly in those cases. You might perform PCA
01:07:26.000 --> 01:07:27.000 first to get a set of perpendicular predictors.
01:07:27.000 --> 01:07:28.000 We'll talk about PCA
01:07:28.000 --> 01:07:32.000 more next week.
01:07:32.000 --> 01:07:37.000 It's not really something that's done in terms of figuring out which features are most important for predicting
01:07:37.000 --> 01:07:43.000 y.
01:07:43.000 --> 01:07:46.000 You know, I always thought PCA was just used
01:07:46.000 --> 01:07:52.000 if you have a lot of features — like more features than you have rows.
01:07:52.000 --> 01:07:53.000 Yeah, so —
01:07:53.000 --> 01:08:01.000 But I was more just saying, this model seems complicated even if you have, like, 3 features,
01:08:01.000 --> 01:08:02.000 because you have to consider interactions with them and everything. It can get complicated.
01:08:02.000 --> 01:08:12.000 Yeah, so it can get complicated. Yup. So PCA can also be used to sort of compress the data as well.
01:08:12.000 --> 01:08:17.000 Yes.
01:08:17.000 --> 01:08:18.000 Great.
01:08:18.000 --> 01:08:25.000 I just had a question about the first part of this exercise, where you square one of the variables.
01:08:25.000 --> 01:08:26.000 Yeah.
01:08:26.000 --> 01:08:30.000 What were you looking for when you squared it? I didn't quite follow.
01:08:30.000 --> 01:08:34.000 So remember, we're doing linear regression, and in order for linear regression to be a good model,
01:08:34.000 --> 01:08:48.000 there has to be a linear relationship. So if we look at this first plot, the relationship between y and x1 here is clearly not linear, but it does look sort of like an even polynomial, right?
01:08:48.000 --> 01:09:06.000 Even polynomials tend to go up in both directions and sort of curve like this. And so it might be reasonable to try something like an x1 squared to see if that helps. And so that's why we tried x1 squared.
01:09:06.000 --> 01:09:13.000 Okay. Thanks.
01:09:13.000 --> 01:09:30.000 Okay, so we're gonna leave the world of regression for the rest of today and go into the world of data pre-processing and talk about scaling data. And then, depending on how long this takes us, we may also start something called pipelines; we'll just have to see where we
01:09:30.000 --> 01:09:43.000 are at the end of this notebook. Okay, once my kernel starts.
01:09:43.000 --> 01:09:48.000 Awesome. Okay, so we're going to learn about things called scalers.
01:09:48.000 --> 01:09:49.000 In particular, we'll focus mainly on StandardScaler,
01:09:49.000 --> 01:10:05.000 but the process we learn in this notebook will apply for every single scaler object in sklearn. And then I also see I missed a question from Lara:
01:10:05.000 --> 01:10:08.000 can you do a different transformation for each variable? Yup.
01:10:08.000 --> 01:10:09.000 So it could be the case that you want to do something like make a square for x1
01:10:09.000 --> 01:10:20.000 and do a log transform for x2. Well, I guess that's not necessarily different from today's problem session,
01:10:20.000 --> 01:10:24.000 but in today's problem session you did log transforms of, I guess, one of the features
01:10:24.000 --> 01:10:28.000 and the thing you're predicting. But you can —
01:10:28.000 --> 01:10:31.000 it doesn't have to be the same transformation for every single variable.
01:10:31.000 --> 01:10:39.000 They can be different transformations, depending on what you're seeing in the data.
01:10:39.000 --> 01:10:45.000 Okay, so back to scalers. So we're gonna pretend that we have some data here.
01:10:45.000 --> 01:10:50.000 So it's just gonna be a series of different, randomly generated variables.
01:10:50.000 --> 01:10:56.000 And then, if you look at the way these variables are generated, you'll see that they have very different scales.
01:10:56.000 --> 01:11:00.000 And so what do I mean when we say the scales of the data?
01:11:00.000 --> 01:11:01.000 We just mean the powers of 10, basically.
01:11:01.000 --> 01:11:08.000 So, for instance, this one, x1, the first variable:
01:11:08.000 --> 01:11:18.000 if you look at the variance, it has a very high variance, and if you look at the way it's generated, it's in the thousands. The next one —
01:11:18.000 --> 01:11:21.000 this one — if you look at the variance and the mean,
01:11:21.000 --> 01:11:25.000 is in the ones and tens.
01:11:25.000 --> 01:11:29.000 Here you have something that's in the thousands or tens of thousands,
01:11:29.000 --> 01:11:32.000 and then here you have something that's on the scale of the hundreds.
01:11:32.000 --> 01:11:33.000 And so, like we mentioned earlier, some of your models
01:11:33.000 --> 01:11:40.000 will struggle if you have vastly different scales.
01:11:40.000 --> 01:11:46.000 So up to this point, our regression has been relatively well behaved with the models we fit,
01:11:46.000 --> 01:11:51.000 but in general, machine learning models and data science models can behave poorly
01:11:51.000 --> 01:11:53.000 if one of your columns has a vastly different scale from the other columns.
01:11:53.000 --> 01:11:56.000 So imagine a column that's in the tenths versus a column that's in the millions —
01:11:56.000 --> 01:12:09.000 that sort of thing. So typically, what you'll do as what's called a pre-processing or cleaning step is scale all of your data so that the different columns are operating on the same scale.
01:12:09.000 --> 01:12:31.000 One way to do this is standardization, which is slightly different from the normalization that we talked about earlier with some of our questions. In standardization, you do the following: you take your
01:12:31.000 --> 01:12:49.000 variable x — maybe this represents a column — and then you do all the observations minus the arithmetic mean of that column, divided by the standard deviation of that column. And so, if you've taken a frequentist
01:12:49.000 --> 01:12:51.000 statistics course, or used something called a Z table,
01:12:51.000 --> 01:12:58.000 this should look familiar. This is the exact transformation applied to turn any arbitrary normal random variable into what's known as a standard normal random variable, meaning
01:12:58.000 --> 01:12:59.000 it's a normal random variable with mean 0 and standard deviation
01:12:59.000 --> 01:13:07.000 1. So this is why the process is called standardizing: when you standardize a column, that column will then have mean 0 and standard deviation 1.
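The standardization formula written out in numpy, as a sanity check of what StandardScaler will do in a moment. The four made-up columns below only stand in for the notebook's randomly generated data; the scales are chosen to mimic the description above.

```python
import numpy as np

rng = np.random.default_rng(440)

# Four made-up columns on very different scales.
X = np.column_stack([
    rng.normal(0, 1000, 500),    # thousands
    rng.normal(5, 2, 500),       # ones and tens
    rng.normal(0, 20000, 500),   # tens of thousands
    rng.normal(300, 100, 500),   # hundreds
])

# Standardize each column: subtract its mean, divide by its standard deviation.
X_by_hand = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_by_hand.mean(axis=0))  # each entry is numerically close to 0
print(X_by_hand.std(axis=0))   # each entry is close to 1
```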
01:13:07.000 --> 01:13:25.000 And so this way you could take all 4 of these columns and put them on the same scale.
01:13:25.000 --> 01:13:29.000 So how do you do this in Python? You use something called the StandardScaler object.
01:13:29.000 --> 01:13:30.000 This is stored in the preprocessing sub-package of
01:13:30.000 --> 01:13:39.000 sklearn.
01:13:39.000 --> 01:13:48.000 We're gonna see a lot of tools from preprocessing over the next couple weeks.
01:13:48.000 --> 01:13:55.000 One of them is StandardScaler. Okay, so this is known as a scaler object.
01:13:55.000 --> 01:14:02.000 There are multiple scaler objects. So if we go down here — see, where did it go?
01:14:02.000 --> 01:14:06.000 I think we still want to be in preprocessing.
01:14:06.000 --> 01:14:09.000 Here we go. So we've got MaxAbsScaler, MinMaxScaler,
01:14:09.000 --> 01:14:21.000 Normalizer, StandardScaler, RobustScaler. We're going to focus on and mostly just use StandardScaler,
01:14:21.000 --> 01:14:27.000 but you might be interested in seeing what these different scalers do by checking out the documentation.
01:14:27.000 --> 01:14:31.000 So let me show you how to use StandardScaler in Python.
01:14:31.000 --> 01:14:44.000 The first thing we need to do is import it: from sklearn.preprocessing we're going to import StandardScaler. And then I also wanna make a quick note.
01:14:44.000 --> 01:14:48.000 I noticed in some of the problem session work groups earlier today,
01:14:48.000 --> 01:14:51.000 maybe some of us are newer to Jupyter notebooks than others.
01:14:51.000 --> 01:14:58.000 So one thing that's really easy: instead of having to go up here and click Run — or some people have the Run button over here —
01:14:58.000 --> 01:15:02.000 if you just hit Shift and Enter at the same time, it runs the code.
01:15:02.000 --> 01:15:07.000 Another nice feature is you can try auto-complete.
01:15:07.000 --> 01:15:19.000 So you might have noticed when I started typing "from sklearn.pre", this showed up.
01:15:19.000 --> 01:15:22.000 This showed up because I hit the Tab button, and by hitting the Tab button
01:15:22.000 --> 01:15:27.000 it shows you, using what it knows about the package, here are the things that you might be trying to type out.
01:15:27.000 --> 01:15:30.000 So if you click on it, it will then auto-complete for you.
01:15:30.000 --> 01:15:32.000 If there's only one option, it just does the auto-complete right away.
01:15:32.000 --> 01:15:35.000 So this can be a nice feature that cuts down on your typing.
01:15:35.000 --> 01:15:41.000 Okay, so that's the Jupyter notebook aside; back to StandardScaler.
01:15:41.000 --> 01:15:45.000 So the first thing we have to do is make a StandardScaler object.
01:15:45.000 --> 01:15:54.000 I'm going to call that scaler equals, and then you just say StandardScaler().
01:15:54.000 --> 01:15:57.000 Then we have to do what's called fitting the scaler. So we do
01:15:57.000 --> 01:16:00.000 scaler.fit(X). And so remember, why are we doing X?
01:16:00.000 --> 01:16:09.000 Because that's what my data is stored in. So we can imagine that this is X_train or X test — well, sorry,
01:16:09.000 --> 01:16:18.000 not X test, just X_train. We can imagine that this is like an X_train for a model, but for simplicity of typing, I just did X. Okay.
01:16:18.000 --> 01:16:22.000 And so you might be wondering, what do you mean that you have to fit?
01:16:22.000 --> 01:16:29.000 So what's happening when you fit the scaler is, it's going through each of the columns of your array,
01:16:29.000 --> 01:16:39.000 finding the mean, and also finding the standard deviation, because it needs to know both of those things in order to perform the standardization.
01:16:39.000 --> 01:16:43.000 So that's what's happening when we call dot fit.
01:16:43.000 --> 01:16:50.000 Okay. The next thing we need to do is then scale the data, and in sklearn's syntax
01:16:50.000 --> 01:16:55.000 this is called transform. So we'll do scaler.transform,
01:16:55.000 --> 01:17:01.000 and we input X, and this will be stored in a new variable that we call X_scaled.
01:17:01.000 --> 01:17:10.000 Okay. And so now, if we look at the mean and the variance of the scaled data, we can see that all of them have means that are virtually the same as 0 —
01:17:10.000 --> 01:17:13.000 it's difficult for computers to get exactly to 0,
01:17:13.000 --> 01:17:19.000 but this is close — and then all of them have variances that are virtually the same as 1.
01:17:19.000 --> 01:17:25.000 Okay.
01:17:25.000 --> 01:17:26.000 Okay, so —
01:17:26.000 --> 01:17:30.000 Is this okay to do if you have one-hot encoded variables?
01:17:30.000 --> 01:17:41.000 Yeah. So this is typically only done for continuous variables.
01:17:41.000 --> 01:17:42.000 Okay.
01:17:42.000 --> 01:17:45.000 If you have one-hot encoded variables, you would have to separate those off from your array, or, when we learn pipelines, you'd have to make a custom
01:17:45.000 --> 01:17:49.000 pipeline object in order to do that.
01:17:49.000 --> 01:17:52.000 Sorry, I have a question.
01:17:52.000 --> 01:17:53.000 Yeah.
01:17:53.000 --> 01:17:59.000 What will happen in the situation where we have the data set already, and the columns are on the same scale already?
01:17:59.000 --> 01:18:03.000 Would we still need to apply this?
01:18:03.000 --> 01:18:18.000 Yeah. So if all of your columns are on about the same scale, you theoretically wouldn't have to do the scaling. I think sometimes it would still be good practice to do the scaling, so like —
01:18:18.000 --> 01:18:32.000 I'm trying to think of a good example off the top of my head, but I think it's just typically good practice to use the standard scaler for the models that struggle without it. But if all of your columns are on about the same scale,
01:18:32.000 --> 01:18:37.000 theoretically everything should be okay. But computers can struggle with really large numbers or with really small numbers,
01:18:37.000 --> 01:18:47.000 so sometimes it can still be useful to standard scale.
01:18:47.000 --> 01:18:48.000 Yeah.
01:18:48.000 --> 01:18:51.000 Okay.
01:18:51.000 --> 01:18:58.000 Okay. So I wanna make sure this whole fit, transform, and fit_transform thing makes sense.
01:18:58.000 --> 01:19:15.000 So fit was the fitting process — that meant, in this particular instance, calculating the mean and the standard deviation of every column. Transform was then the thing that went through and actually calculated, for each column, x
01:19:15.000 --> 01:19:22.000 minus the mean, divided by the standard deviation.
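A sketch of the fit-then-transform steps just described. Here X only stands in for the notebook's array of randomly generated columns, so the two columns below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder data on two very different scales.
rng = np.random.default_rng(440)
X = np.column_stack([rng.normal(0, 1000, 500), rng.normal(5, 2, 500)])

scaler = StandardScaler()
scaler.fit(X)                    # learns each column's mean and standard deviation
X_scaled = scaler.transform(X)   # applies (x - mean) / std, column by column

print(X_scaled.mean(axis=0))  # virtually 0
print(X_scaled.var(axis=0))   # virtually 1

# fit_transform does both steps at once; calling transform on a scaler
# that was never fit raises a NotFittedError.
X_scaled_again = StandardScaler().fit_transform(X)
```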
So you always have to call fit before you can call transform.
01:19:22.000 --> 01:19:28.000 So, for instance, if I made
01:19:28.000 --> 01:19:37.000 scaler_2 — also a StandardScaler object — and then tried to call scaler_2.transform(X),
01:19:37.000 --> 01:19:40.000 we'll get an error, and we'll see right here at the top:
01:19:40.000 --> 01:19:50.000 NotFittedError. So fitting always has to be done
01:19:50.000 --> 01:19:56.000 before transforming. Okay, so let's minimize that.
01:19:56.000 --> 01:20:03.000 And there's also this nice function that you can use for these objects called fit_transform
01:20:03.000 --> 01:20:15.000 that will do all of this at the same time. So if we tried this again, and we did scaler_2.fit_transform(X),
01:20:15.000 --> 01:20:18.000 we'll see that we get out what we want. And then later we would not have to refit it, because it's already been fit once. So fit_transform
01:20:18.000 --> 01:20:30.000 does both fit and transform at the same time.
01:20:30.000 --> 01:20:36.000 Okay. So you might be wondering, well, if it does both at the same time, why do I need anything other than fit_transform? That should always be what I use.
01:20:36.000 --> 01:20:42.000 So, whenever you're doing predictive modeling work, you need to fit your scalers only on the training data,
01:20:42.000 --> 01:20:52.000 never on the test data. Remember, the idea with predictive modeling is we don't —
01:20:52.000 --> 01:21:10.000 we don't know what the labels on our test set or holdout sets are, so we have to fit the scaler on whatever the training data is, and then use that fitted scaler for the test or holdout data. So if we were to go back and try and refit the scaler using
01:21:10.000 --> 01:21:14.000 the test or the holdout data, that would be what's known as data leakage.
01:21:14.000 --> 01:21:25.000 Okay, so let's go ahead and show you what this looks like in sort of a predictive modeling workflow.
01:21:25.000 --> 01:21:27.000 So I'm gonna import my train test split and then make a train test split of X. I'll also take this chance to point something out,
01:21:27.000 --> 01:21:39.000 because I think I confused people earlier. So when I introduced the train test split, I used an array here;
01:21:39.000 --> 01:21:45.000 this is also an array. But you don't always have to have both an X and a y for train test
01:21:45.000 --> 01:21:48.000 split. So if you're only splitting one thing, it's okay to put just that one thing in. Train test split is also built
01:21:48.000 --> 01:21:50.000 to be able to take 2 things, like an X and a y, so you could do that.
01:21:50.000 --> 01:22:08.000 So, for instance, if I had a y, I could put it here, but I don't have to. And then, when I had something like the data frame from earlier in the problem session, you can just put the data frame in here
01:22:08.000 --> 01:22:12.000 and split that into — I think it was cars_train and cars_test. Okay?
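A small sketch of that aside — train_test_split is happy with a single object, whether it's an array or a DataFrame. The array, column names, and random_state here are hypothetical stand-ins, not the notebook's values.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(440)
X = rng.normal(0, 1, size=(100, 3))   # placeholder array

# Splitting a single object is fine; you just get two pieces back.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=440)

# A DataFrame works the same way (e.g. the problem session's cars data):
cars = pd.DataFrame(X, columns=["a", "b", "c"])   # hypothetical stand-in
cars_train, cars_test = train_test_split(cars, test_size=0.2, random_state=440)

# And with labels you can pass both an X and a y:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=440)
```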
01:22:12.000 --> 01:22:17.000 So now that that aside's over, let's go through our workflow.
01:22:17.000 --> 01:22:25.000 So we would first make our StandardScaler object.
01:22:25.000 --> 01:22:27.000 We would fit it on the training set and then get the transformed, scaled data on the training set.
01:22:27.000 --> 01:22:38.000 We could have used fit_transform here, because it's just the training set.
01:22:38.000 --> 01:22:51.000 Then we would go ahead and imagine that we build some model, like a linear regression or something. And then, once we're done, in order to get the model's predictions on the test set,
01:22:51.000 --> 01:22:58.000 we would do scaler_new.transform(X_test).
01:22:58.000 --> 01:23:01.000 So we would not refit the scaler on the test set;
01:23:01.000 --> 01:23:05.000 we use the fitted scaler from the training data. Okay?
01:23:05.000 --> 01:23:10.000 And like I said earlier, there are other scaler objects besides StandardScaler.
01:23:10.000 --> 01:23:24.000 You can find them in the documentation, and I can see in the chat that some of you have already started exploring the documentation, which is great. And I believe, Pedro, we are going to look at polynomial features in a later notebook,
01:23:24.000 --> 01:23:32.000 so that's a great find. Okay, are there any questions about scaling the data?
01:23:32.000 --> 01:23:41.000 Yeah. So when we transform the test set, that would be with the mean and standard deviation of the test set —
01:23:41.000 --> 01:23:42.000 sorry, the training set. Am I right?
01:23:42.000 --> 01:23:47.000 Yup, yeah. Yup, yup.
01:23:47.000 --> 01:24:00.000 So, if I understood this correctly, we can essentially do the fit_transform on the entire data set before we split it.
01:24:00.000 --> 01:24:01.000 Is that correct?
01:24:01.000 --> 01:24:04.000 So, that's a great question — because you cannot do that.
01:24:04.000 --> 01:24:28.000 You can't do that, because the data from your test set would then be leaking into your model fitting process: it would be encoded into the mean and the standard deviation that were fit with the scaler.
01:24:28.000 --> 01:24:35.000 So the standard scaler should be considered a part of your modeling. The mean and the standard deviation you get for the scaler can only come from your training set, not from your test set.
01:24:35.000 --> 01:24:48.000 So if you did this before the train test split, some of the data from the test set would be leaking into your model, because it would be encoded in the mean and the standard deviation of the scaler.
01:24:48.000 --> 01:24:50.000 Cool. Thank you.
01:24:50.000 --> 01:24:53.000 Yeah, yeah, thanks for asking that question.
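Putting the pieces together, a sketch of the workflow just described: fit the scaler on the training data only, then use that same fitted scaler to transform both the training and test sets. The synthetic X and y and the random_state are placeholders for whatever data you actually have.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-ins for your features and target.
rng = np.random.default_rng(440)
X = rng.normal(0, 100, size=(500, 3))
y = X @ np.array([0.5, -2.0, 1.0]) + rng.normal(0, 5, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=440)

# Fit the scaler on the training data only (fit_transform = fit, then transform).
scaler_new = StandardScaler()
X_train_scaled = scaler_new.fit_transform(X_train)

# Build whatever model you like on the scaled training data.
reg = LinearRegression()
reg.fit(X_train_scaled, y_train)

# Test set: transform only -- never refit the scaler on the test data.
X_test_scaled = scaler_new.transform(X_test)
test_predictions = reg.predict(X_test_scaled)
```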
01:24:53.000 --> 01:24:58.000 I'm just trying to recap what you explained.
01:24:58.000 --> 01:25:02.000 So you're saying that you would do the fit_transform in one
01:25:02.000 --> 01:25:05.000 go on the training set, correct?
01:25:05.000 --> 01:25:08.000 Yeah, so you can do that. Yup.
01:25:08.000 --> 01:25:13.000 But on the test set you would just do fit and not transform it?
01:25:13.000 --> 01:25:18.000 You would just do transform and not fit.
01:25:18.000 --> 01:25:21.000 Okay, so yes — you would just do transform and not fit.
01:25:21.000 --> 01:25:25.000 And then you would build the model on the training —
01:25:25.000 --> 01:25:28.000 and then once you have it, yeah, okay — and then test it on the test set.
01:25:28.000 --> 01:25:38.000 Okay, right. So basically you're saying you never do fit on the test set, right?
01:25:38.000 --> 01:25:42.000 You shouldn't do fit on the test set.
01:25:42.000 --> 01:25:43.000 Okay.
01:25:43.000 --> 01:25:54.000 Yes, yup — with the caveat that once you're done and you've selected a model, you would refit the entire model on the entire data set that you have. But that's at the very end,
01:25:54.000 --> 01:26:02.000 like, I have this model, and I'm happy with it, and I'm gonna put it on whatever system my team is using.
01:26:02.000 --> 01:26:03.000 What if you're doing a validation set, like doing some cross
01:26:03.000 --> 01:26:19.000 validation? Would you just do the fit and then transform on the whole train set, and then split again into train and validation?
01:26:19.000 --> 01:26:24.000 Yeah. So the validation set, or the holdout set from cross validation —
01:26:24.000 --> 01:26:28.000 that is sort of mimicking the role of a test set.
01:26:28.000 --> 01:26:29.000 So each time through the cross validation, you would have to fit the scaler
01:26:29.000 --> 01:26:41.000 excluding that validation split. So, like —
01:26:41.000 --> 01:26:42.000 Yeah.
01:26:42.000 --> 01:26:43.000 Okay. So you would make sure to not fit on the validation set.
01:26:43.000 --> 01:26:49.000 Okay. I was thinking you would use the standard scaler before you do the cross validation. But —
01:26:49.000 --> 01:26:53.000 Yup. And then next week — because we're out of time today —
01:26:53.000 --> 01:27:02.000 next week we'll learn about pipelines, and those are a nice way to basically include the standard scaler as a part of your model,
01:27:02.000 --> 01:27:13.000 so it'll be easier to remember: don't fit your standard scaler on these holdouts.
01:27:13.000 --> 01:27:18.000 Great. Any other questions about anything?
01:27:18.000 --> 01:27:19.000 Yeah.
01:27:19.000 --> 01:27:22.000 Yes, I have a question. So eventually we would transform the test set also.
01:27:22.000 --> 01:27:31.000 So I was just wondering: eventually you're going to make your prediction, and in the real world you want to see the actual values.
01:27:31.000 --> 01:27:33.000 Now that I have transformed the test set also, does it affect how the betas are going to be seen for linear regression,
01:27:33.000 --> 01:27:45.000 for example? If you have transformed your data into numbers between 0 and 1, but in the real world what you want to see is maybe, like, 100 and so on —
01:27:45.000 --> 01:27:48.000 so I think you'd go back?
01:27:48.000 --> 01:27:56.000 Yeah. So, for the coefficients — a lot of times, right, in linear regression
01:27:56.000 --> 01:28:06.000 you want to interpret the coefficients to get a sense of: if I increase expenditures by 2, then I expect whatever increase in profits — that sort of thing.
01:28:06.000 --> 01:28:13.000 So when you do this sort of scaling process, you lose the ability to directly interpret the coefficients
01:28:13.000 --> 01:28:28.000 in the scale of the original data. You would have to do sort of a backwards process on the coefficients to re-engineer what they mean on the original scale.
01:28:28.000 --> 01:28:30.000 I see, that makes sense.
01:28:30.000 --> 01:28:32.000 Yeah.
01:28:32.000 --> 01:28:33.000 Thank you.
01:28:33.000 --> 01:28:42.000 Of course. Alright! If there are no other questions, I'm gonna stop the recording, and then I'll hang back for a few minutes to answer questions that
01:28:42.000 --> 01:28:44.000 people maybe didn't want recorded.
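Circling back to the cross-validation question above, a minimal sketch of fitting the scaler inside each fold so the validation split never leaks into the scaling; next week's pipelines make this bookkeeping automatic. The synthetic arrays and random_state are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(440)
X_train = rng.normal(0, 100, size=(400, 3))   # placeholder training features
y_train = X_train @ np.array([0.5, -2.0, 1.0]) + rng.normal(0, 5, 400)

kfold = KFold(n_splits=5, shuffle=True, random_state=440)
mses = []

for train_index, val_index in kfold.split(X_train):
    # Fit the scaler on this fold's training portion only...
    scaler = StandardScaler()
    X_tt = scaler.fit_transform(X_train[train_index])
    X_val = scaler.transform(X_train[val_index])   # ...and only transform the validation split.

    reg = LinearRegression()
    reg.fit(X_tt, y_train[train_index])
    mses.append(mean_squared_error(y_train[val_index], reg.predict(X_val)))

print(np.mean(mses))
```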
01:28:44.000 --> 01:28:50.000 Okay.