A First Predictive Model
Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this second video in our regression topic, we're going to talk about a first predictive modeling project, so we'll give some insight into the steps you might take in such a project. Let me go ahead and share the Jupyter notebook and we'll be ready to go. We're going to review some of the common steps in predictive modeling projects. We'll be working with a data set about baseball, we'll introduce the concept of a baseline model, and at the end we'll practice implementing cross validation. So I'm going to go ahead and import our things and we'll get going.

For the context of this problem, I want us to imagine that we work for a major league baseball team and we're in the off season. The season's over, somebody has just won the World Series, and now it's our job to figure out which players we can bring onto the team going into next season to help us get more wins. A reasonable question we might ask ourselves is whether we want to bring in more good defensive players, which would help us limit the number of runs that our team allows other teams to score, or whether we should go for offensive players, meaning players that help us increase the number of runs that our team scores. In other words, which of these two is going to better predict the number of wins we'll have in the following season? Let's imagine that somebody else on the team is working on converting player stats into an estimate of the runs a player will add for the team, or the runs allowed they will prevent, next season. Our job is then to look at which is more predictive of wins: the runs that we score in a season or the runs that we allow in a season.

We're going to look at this problem with a data set called baseball. This is one of our first data sets, and it's stored in the data folder of the repository. The file path in the notebook will work on Mac and Linux, but if you're running a Windows machine, you may need to change the direction of the slashes to go the other way. I used to have to do that when I had a Windows machine, but now I have a Mac, and that's why the code is written this way. (A rough sketch of the loading code is given below.)

This data set has a team ID, which gives us the team; for example, the first observation in this sample was for Pittsburgh, the Pittsburgh Pirates, from the 2012 season, and all of the data goes from 2001 to 2018. We have a league ID, so whether the team was in the National League or the American League, which are the two halves of Major League Baseball; how many games they played in a season, which will be 162 for every observation here; how many wins they had in that season; how many losses they had that season; how many runs they scored, capital R; and how many runs they allowed, RA, meaning how many runs they gave up to other teams. Our goal is to see whether runs scored or runs allowed is better at predicting wins.
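For reference, here is a minimal sketch of what loading the data might look like. The file name baseball.csv and the exact column names used later (W, R, RA) are assumptions based on the description in the video, not something shown in the transcript.

import numpy as np
import pandas as pd

# Load the baseball data from the repository's data folder.
# On Windows you may need to flip the slashes in the path.
baseball = pd.read_csv("data/baseball.csv")  # file name is an assumption

# A quick look at the columns described in the video:
# teamID, yearID, lgID, G, W, L, R, RA
baseball.head()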
OK, so remember we talked about data splits. The first step we need to do, before any sort of exploration or modeling, is a train test split. The code for this in the notebook is not completely empty; for example, the import is done for us, but you can see we're missing some stuff. So go ahead and pause this video if you'd like and try to fill in the rest; otherwise you can just watch me do it and then see where we go from there.

So we have train_test_split, and baseball.copy() is here. Why is it .copy()? Well, because baseball is a data frame, and I want to make sure I have a hard copy of it before I split it into the train and the test set. Now I need to put in my random state, I need to put in a shuffle argument, and I need to put in a test size; I want 80% for training, so the test size is 0.2. OK, so now we have our train test split.

One of the first steps before you actually do any modeling is exploratory data analysis. A lot of times this will involve computing some basic statistics on the data or making plots. In this example, we're going to look at the relationship, if there is one, between W and R and then between W and RA, by plotting scatter plots with W (wins) on the vertical axis and runs scored or runs allowed on the horizontal axis (a rough code sketch of this step is given below). On the left we have wins against runs scored, and we can see that, as you'd expect, the more runs you score, the more wins you get. This is over the course of an entire season, so it's not that one team scored 900 runs in a single game, which is impossible; over the course of the season they scored 900 runs. On the right, we can see that there appears to be a negative linear relationship, that is, a linear relationship with a negative slope, between wins and runs allowed: the more runs you give up, the fewer wins you seem to have. Importantly, both variables seem to have a linear relationship with wins: there seems to be a linear relationship between wins and runs scored, and there also seems to be a linear relationship between wins and runs allowed. This is a really important thing to check when you're choosing which model to try. So far the only model we know is simple linear regression, so it was important for there to be a linear relationship between wins and the things we're considering as possible features; if there was not, we wouldn't want to use simple linear regression. As I said, in a real modeling project you would probably do some more data exploration, making additional plots or calculating some simple, straightforward statistics like correlations, variances, and standard deviations. But for this simple, straightforward problem, we're going to end with that visualization.

That leads us to our candidate models, which is step four: you've done some exploration, and now you're ready to write down some models you might try. We're going to try two models in particular. The first is W = beta_0 + beta_1 * R + epsilon, where epsilon is some random noise, and the second is W = beta_0 + beta_1 * RA + epsilon. Now, before we go about fitting these and doing cross validation, another model we're going to want to consider is a baseline model. Baseline models are models that are relatively straightforward, easy to explain, and make sense, and their purpose is just to give us a point of comparison.
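Here is a minimal sketch of the completed split and the exploratory plots just described, under the same naming assumptions as above. The exact random state used in the video isn't shown, so the value below is a placeholder.

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Split off 20% of the rows as the test set, working on a hard copy
# of the data frame. The random_state value here is a placeholder.
bb_train, bb_test = train_test_split(baseball.copy(),
                                     shuffle=True,
                                     random_state=440,
                                     test_size=0.2)

# Exploratory plots on the training set only:
# wins against runs scored, and wins against runs allowed.
fig, ax = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

ax[0].scatter(bb_train.R, bb_train.W)
ax[0].set_xlabel("Runs scored, R")
ax[0].set_ylabel("Wins, W")

ax[1].scatter(bb_train.RA, bb_train.W)
ax[1].set_xlabel("Runs allowed, RA")

plt.show()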
Coming back to baselines: you want a point of comparison when you're building a couple of different models and seeing which one does best, because it lets you see whether there was any point in going out and fitting a model at all, compared with what you could do with a simple approach. For example, maybe we go through this whole process and find out that model one ends up with a mean square error of 100. Is that good? Is it bad? In the abstract, it's really hard to tell. It's only when we have a reasonable baseline model that we're able to put our performance into context. If our baseline model had a mean square error of 1,000 and the model we went with had one of 100, well, we've done quite well: we've decreased the baseline MSE by a factor of 10. But if our baseline model's MSE was 10 and our model had an MSE of 100, we've now done 10 times worse than the baseline. So having the baseline allows us to tell whether a result is good or bad.

Moreover, let's say we do have a model with a lower MSE. Say we're in the situation where the baseline MSE was 10 and we have a model whose MSE is 9.5. If that model is quick to train and quick to make predictions, then we're in business: we have an improvement over the baseline. But if the model takes a very long time to train, or a very long time to make predictions, it may not be worth the computational cost for that very minor improvement in the mean square error. This last point, about having to think about computation time, computational cost, or monetary cost, isn't going to be a consideration in this notebook, because we're just fitting simple linear regressions. But as you go out into the world and do more data science, particularly if you're going to work in industry or do data science based research in academia, these become considerations you have to make. It's not always just about choosing the model with the best MSE, but about choosing a model that has a good performance metric and doesn't cost too much, however you're measuring cost.

For this problem, and for any regression problem, a very standard first baseline is just to predict the average of your output. So for us, the baseline is just the average number of wins. Model zero, our baseline model, says that the wins for any particular team are equal to the expected value of wins plus some random noise.

OK, so we have three models: two real models and a baseline. We can use cross validation to see which one has the best average cross validation mean square error. Once again, there's code below; part of it is filled out and the rest is empty. I encourage you, if you're interested in doing this on your own, to pause the video and try to fill in the missing chunks of code, or you can watch me fill them in right now, or you can code along with me at the same time. If we're going to do cross validation, we need to import KFold; that's already written for us, saving us time. Now let's make our KFold object with k equal to five: five goes first, shuffle is equal to True, and the random state is going to be 44. So we now have a KFold object and we're ready. We're going to look for the model with the lowest mean square error.
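A minimal sketch of that KFold setup (the variable name kfold is my own choice):

from sklearn.model_selection import KFold

# Five-fold cross validation with shuffling and a fixed random state
# so that the splits are reproducible.
kfold = KFold(n_splits=5, shuffle=True, random_state=44)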
While we could calculate the mean square error by hand, this is a good opportunity to use sklearn's mean_squared_error function, whose documentation is linked here. We can import mean_squared_error from sklearn.metrics, and I also import the LinearRegression model object. So we're now ready to perform cross validation. Again, there's some missing code; if you want to do it on your own, pause the video and fill in the missing chunks. If not, you can watch me do it right now or come back and check later.

The first thing I do is make an array of zeros that we're going to fill in. Remember, if we had, say, a single model, our array would have one row and five columns, and as we go through the different splits of the cross validation we would record the mean square error from each split in each of the columns of the array. Here we have three models, so we have three rows, and we're doing five fold cross validation, so each column will represent the performance of one of our models on one of the data splits. The purpose of the variable i is just to keep track of which column I'm on, that is, which split of the five fold cross validation I'm on.

So I need to fill in "for train_index, test_index in kfold.split(...)". You could call the second one the holdout index, but I've just been used to calling it the test index for a long time; well, not that long, just about five years. Into the split I put the training set, which we've called bb_train. Then we get our fitting set, which I've called bb_tt to save some typing: bb_tt is bb_train.iloc[train_index], and the holdout set is bb_ho, which is bb_train.iloc[test_index]. For model zero, all we have to do is take the average of W over the fitting set, the four splits that we train on, and multiply it by a vector of ones, since we're making a prediction that should be the length of the holdout set. So the prediction is just an array where every entry is the average value of wins over our fitting set. Now for model one, which is beta_0 + beta_1 * R + epsilon, we need to define a LinearRegression object with copy_X equal to True, then fit it using bb_tt.R.values.reshape(-1, 1) and bb_tt.W.values. Then we can get a prediction on the holdout set by calling model1.predict on bb_ho.R.values.reshape(-1, 1). Model two is already filled in for us. Then we go through and get the mean square errors for each of the three models we fit above. The first one is done for us: mean_squared_error of the true values on the holdout set and the predicted values. So we just have mean_squared_error of the holdout W values and the model one predictions, and likewise for model two. At the end, when we're done with our first split, we increase the column counter by one.

One thing that needed to be changed while running this: because I'm passing in the integer positions that KFold gives us, I have to use .iloc rather than .loc, since .iloc locates rows by their position rather than by their label.
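Putting those pieces together, here is a sketch of what the full cross validation loop might look like, under the same naming assumptions as before:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Rows are the three models (baseline, R, RA); columns are the five splits.
mses = np.zeros((3, 5))

# i keeps track of which split (column) we are on.
i = 0
for train_index, test_index in kfold.split(bb_train):
    # .iloc because KFold returns integer positions, not labels.
    bb_tt = bb_train.iloc[train_index]   # the four folds we fit on
    bb_ho = bb_train.iloc[test_index]    # the holdout fold

    # Model 0, the baseline: predict the average wins for every holdout row.
    pred0 = bb_tt.W.mean() * np.ones(len(bb_ho))

    # Model 1: regress W on runs scored, R.
    model1 = LinearRegression(copy_X=True)
    model1.fit(bb_tt.R.values.reshape(-1, 1), bb_tt.W.values)
    pred1 = model1.predict(bb_ho.R.values.reshape(-1, 1))

    # Model 2: regress W on runs allowed, RA.
    model2 = LinearRegression(copy_X=True)
    model2.fit(bb_tt.RA.values.reshape(-1, 1), bb_tt.W.values)
    pred2 = model2.predict(bb_ho.RA.values.reshape(-1, 1))

    # Record each model's MSE on this split.
    mses[0, i] = mean_squared_error(bb_ho.W.values, pred0)
    mses[1, i] = mean_squared_error(bb_ho.W.values, pred1)
    mses[2, i] = mean_squared_error(bb_ho.W.values, pred2)

    # Move on to the next column for the next split.
    i = i + 1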
So now this figure will compare the performance across the three models. First we plot the performance of the baseline, then the performance of model one, then the performance of model two, and then I label things. Here's the baseline; each black dot is a single split. I'm actually going to make a quick cosmetic change to make this more visible for people who may have color blindness: I'll make the fill color of all of these points white, make the edge color black, and increase the size to 60. (A rough sketch of this figure, and of the test set check described next, is given below.) All of the points that are filled white with black outlines are the values of the mean square error for that model on a single split, and the red filled circle is the average cross validation MSE. We can see that the model with the lowest average cross validation MSE of the three we considered is model two, at a little under 80. So if we were done here, which we're going to assume we are for the purposes of this demonstration, we would choose model two as our final model and then look at the test set. In practice, this was only our first set of models; usually what you might do after this is try additional models. We tried linear regression, and we'll learn some additional models in the future that we might want to try, but we're not going to do any more modeling here because we've reached the limits of what we know so far, which is simple linear regression and baselines. So we take model two as the one we choose because it performed best, and we could even say how much improvement it gave over a baseline model. If we were to keep looking, we could then use model two as a new baseline and see how much new models improve over it.

Now we're going to look at step six: we're all the way done and we've chosen a final model, which for us is the one regressing wins on runs allowed. The last thing to do is a sanity check, which we do by checking the performance on the test set. We do this for two main reasons. First, if we made any errors in our earlier coding, the hope is that they would show up when we rewrite the code from scratch for the test set check; so if something behaves weirdly when we check on the test set, that may be an indication of a coding error. Second, it allows us to assess overfitting. We'll see what this means in a few notebooks, but essentially, if your fit is much worse on the test set than it is on the training set, that's usually an indicator that there may be some overfitting you hadn't anticipated, and you might want to go back and recheck things.

So we're just going to go through it. Here's the key thing to note: once you've selected a model and you're ready to check the test set, you fit that model on the entire training set, not just the cross validation splits, and then you check the MSE. So here's the performance on the training set, an MSE of 74.99, versus the performance on the test set, which is 72.22. These are comparable, so I would say that we didn't make a coding error, and it doesn't look like we're overfitting.
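Here is a rough sketch of both the comparison figure and the final test set check described above, continuing with the same assumed names (mses, bb_train, bb_test):

# Compare the three models: white dots with black edges for the MSE on each
# split, and a red dot for each model's average cross validation MSE.
plt.figure(figsize=(8, 6))
for j in range(3):
    plt.scatter([j] * 5, mses[j, :], c="white", edgecolors="black", s=60)
    plt.scatter(j, mses[j, :].mean(), c="red", s=60)
plt.xticks([0, 1, 2], ["Baseline", "Model 1 (R)", "Model 2 (RA)"])
plt.ylabel("MSE")
plt.show()

# Step six: refit the chosen model (wins on runs allowed) on the entire
# training set, then compare its MSE on the training and test sets.
final_model = LinearRegression(copy_X=True)
final_model.fit(bb_train.RA.values.reshape(-1, 1), bb_train.W.values)

train_mse = mean_squared_error(bb_train.W.values,
                               final_model.predict(bb_train.RA.values.reshape(-1, 1)))
test_mse = mean_squared_error(bb_test.W.values,
                              final_model.predict(bb_test.RA.values.reshape(-1, 1)))

print(train_mse, test_mse)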
I should also note that in practice it's not very common for your test set to have better performance than your training set, as it did here. OK, so you've now worked your way through an initial, very straightforward predictive modeling project, and in the future you'll probably follow similar steps if you work in predictive modeling, so it's nice that you've now been exposed to them. All right, in the next video and notebook we'll continue our journey through regression. I hope you enjoyed this notebook and I hope to see you in the next video. Have a great rest of your day. Bye.