A Supervised Learning Framework

Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi and welcome back. We're going to continue learning about supervised learning by touching on a supervised learning framework, which we'll be using throughout the rest of these notebooks on the various models and algorithms we're interested in. So let me go ahead and put up our Jupyter notebook. This is A Supervised Learning Framework, contained within the supervised learning folder of the lectures in the repository.

We're going to talk about a common theoretical framework used for supervised learning problems. In particular, we'll rigorously define the setup, we'll give a basic outline of how we approach such a problem, and then at the end we'll briefly discuss the differences between predictive and explanatory modeling and mention that we're going to focus more heavily on predictive modeling in this series of videos.

So here's the framework we're going to be talking about. We think there's something in the world that we're interested in predicting, and we call this the outcome variable, typically denoted with the lowercase letter y. We also have some data set that we think might be useful in predicting y, and we call this the feature set or the input set, often denoted with a capital X. These variable names may change, but in general we'll say that we are interested in predicting y using X.

In supervised learning, we assume that there's some underlying relationship between y and X, and it's our goal to figure out, as best we can, what that relationship is. This typically takes the form of a statistical model, y = f(X) + epsilon, where f is some function of X and epsilon is an error term. Epsilon can be thought of as random error that occurred when we were out measuring the data. Typically we assume that X has m variables, that is, a set of m measurements per observation, so f is a function from m-dimensional real space to the one-dimensional space of reals. We're going to try to approximate or estimate f using some data that we collect.

In this framework, f(X) is thought of as the systematic information that X gives about y. If you've ever heard of the very popular Nate Silver book, you can think of f(X) as the signal that X is providing about y, the information about y that we can derive from X. Epsilon, as I said, is some random noise. Notably, we will most often assume that it is independent of X, but the specifics of the noise depend on the actual problem we're working on. So in general, we'll assume that epsilon is a random variable independent of X, just some random noise term; when we get to specific models, we'll make more specific assumptions on epsilon.

I think it's useful to see this spelled out with pictures and an example, so we're going to use Python to generate an example and show how this works in practice. I'll point out that I have a bunch of Python code written below. I don't expect you to be able to recreate it perfectly; it's just there for illustrative purposes.
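To make the setup concrete, here is a minimal sketch of the framework in code. This is my own hypothetical helper, not the notebook's code; the name generate_data, the uniform distribution for X, and the seed are all illustrative choices.

import numpy as np

def generate_data(f, n, m, noise_sd, seed=216):
    """Simulate n observations from the model y = f(X) + epsilon."""
    rng = np.random.default_rng(seed)
    # X holds n observations of m measurements each (one point in R^m per row)
    X = rng.uniform(0, 1, size=(n, m))
    # epsilon is random noise drawn independently of X
    epsilon = rng.normal(0, noise_sd, size=n)
    # the observed outcome is systematic signal plus noise
    y = f(X) + epsilon
    return X, y

# example: f maps R^1 to R by returning the single measurement itself
X, y = generate_data(lambda X: X[:, 0], n=100, m=1, noise_sd=1.0)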
You should be able to follow most of this if you've been through the Python prep materials, but again, don't focus on the code; focus on the supervised learning framework we're touching on.

To illustrate, we're going to suppose that instead of being m-dimensional, X is a one-dimensional vector, and we'll assume that there is a very simple relationship between y and X, namely y = X + epsilon. So f(X) in this case is just the identity function, f(X) = X. Our model is y = X + epsilon, where epsilon is independent of X and distributed according to a normal distribution with mean zero and some fixed variance sigma squared (standard deviation sigma).

Using Python, we can plot what the systematic information is telling us. If we put X on the horizontal axis and y on the vertical axis, the systematic part is telling us that y is equal to X, which is just this line here. And remember, in practice we don't know this; it's our guess at what the truth is. We come up with some idea of it and then go on to estimate from it.

So how do we estimate it? Well, we have to go out into the world and collect a data set from which to make estimates. Sometimes this is called an observation or a sample. In this example, we observe 100 observations. In a physical way of thinking about it, this would be like going out with a clipboard and writing down the measurements of 100 people, and that's what X represents, whatever the measurement is; or maybe you scrape something that somebody has put up online. Those are our observations.

So here is me collecting 100 observations using the random module from NumPy, and here is me generating the corresponding y values using the underlying model, which I've assumed is true. Remember, the black line was our systematic part, y = X. The blue dots represent the sample: each of them is f(X) plus an added error term, and that's why they're not exactly on the line. In the real world, we wouldn't have the black line; we would only have the blue dots. And now that we have the blue dots, we can go out and estimate something.

In this particular example, we're going to estimate f using something called simple linear regression. We'll learn about this in one of the coming videos on supervised learning, so you may not know what linear regression is yet; you'll learn it soon. But the algorithm that we're going to use to estimate f(X), to get an estimate of the black line that was used to generate these observations, is simple linear regression. This code you may not have seen before; don't worry about it, we're going to talk about it soon.

So now that we've collected the data, we use that data to make an estimate, which is represented by the red line. We can see that the red line isn't exactly on top of the actual relationship that exists, but it's pretty close. And eventually we'll see that you can never recreate exactly the relationship that exists; you can only get as close as some irreducible error term.
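In that same spirit, here is a sketch of what this example might look like in code. It is not the notebook's exact code; the seed, the range of X, and the plotting details are my own assumptions, but the model y = X + epsilon and the use of simple linear regression match what's described above.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(614)

# collect a sample: 100 observations of a one-dimensional X
X = rng.uniform(0, 10, 100)

# generate y from the model we assumed is true: y = X + epsilon,
# with epsilon ~ Normal(0, 1), independent of X
y = X + rng.normal(0, 1, 100)

# estimate f with simple linear regression
# (scikit-learn expects a 2D feature array, hence the reshape)
slr = LinearRegression()
slr.fit(X.reshape(-1, 1), y)

# truth (black line), sample (blue dots), estimate (red line)
order = np.argsort(X)
plt.scatter(X, y, alpha=0.6, label="sample")
plt.plot([0, 10], [0, 10], "k", label="truth: f(X) = X")
plt.plot(X[order], slr.predict(X.reshape(-1, 1))[order], "r", label="estimate")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()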
So what do we mean by close? Well, the particular definition of close depends on the problem you're working on. Here we're talking about having a low mean squared error; again, we'll see what that means soon (and there's a short sketch of computing it after this discussion). But in general, close is measured using a particular function or metric that depends on the problem. So our notion of close will be different for regression than for classification, and it may even differ from algorithm to algorithm.

As a quick review: we assumed some model, which in this case was actually the truth, y = X + epsilon, and that's what the black line represents. We collected some data, the blue dots. We used those blue dots and linear regression to estimate the relationship and got the red line. And if we were then to go out into the real world and collect some more observations, but didn't have the y values, we could use the red line to estimate what y would be for various values of X.

So we're going to end by talking about two of the main goals of supervised learning within this framework. The first, and the one we're going to focus on the most, is making predictions. We want to make a model or algorithm that uses training data, which is what we would call the observations in this setting. For predictive modeling, we want to use that model to take in new observations for which we don't have the known output. For example, say we made a model predicting somebody's weight using just their height. We were able to go out and collect both the height and the weight for 100 people, which is essentially what we just did. Moving forward, we're only able to collect somebody's height, but now that we have this model, we can go ahead and estimate what their weight would be. So that's the idea of prediction: we make predictions using the model we fit from the observations.

The other goal is making inferences. The idea here is that we want to produce a model that helps explain the relationship, if any, that exists between y and X. In this setting, the goal is to understand how changes in X impact y. So maybe if you increase X by a little bit, this is what happens to y, and that might help you make choices in a business setting, for instance, or set up some sort of causal inference, which is beyond the scope of this course. Maybe you use this sort of inferential modeling to help explain why something is happening in nature, and then use it to implement policy changes that would combat it, or, if it's something good, help promote it. One example in this setting: the best model in an inferential sense might be the one that explains as much of the variance in y as possible while still being parsimonious, meaning you didn't put in too many variables.
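Coming back to prediction and to what "close" means, here is a short sketch of both ideas: measuring the fit with mean squared error and using the fitted line to predict y for new X values where we never observed y. It assumes the X, y, and slr objects from the previous sketch; the new X values are made up for illustration.

import numpy as np
from sklearn.metrics import mean_squared_error

# one common regression measure of "close": the mean squared error,
# i.e. the average of (y - y_hat)^2 over the sample
mse = mean_squared_error(y, slr.predict(X.reshape(-1, 1)))
print("training MSE:", mse)

# prediction: new observations where we only have X, not y
X_new = np.array([2.5, 5.0, 7.5])
print("predicted y for new X:", slr.predict(X_new.reshape(-1, 1)))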
So in the rest of the notes that follow, we'll focus mostly, if not entirely, on goal number one, but from time to time we may touch on goal number two, making inferences. So we really care about making predictions, and we may touch a little bit on making inferences along the way.

I also want to note that it may seem counterintuitive that we're thinking of these as two different things, because surely if you can explain the data, you should be able to make predictions from the data. That's usually a good rule of thumb, but it's not always the case that the best explanatory model is also the best predictive model. An example of this comes from the winning team of the Netflix Prize competition, a competition Netflix held years ago to incentivize people to improve upon its recommendation algorithm, with the promise of a million-dollar prize to the first team that improved the algorithm by, I think, 10% or something like that. When coming up with their winning solution, the BellKor's Pragmatic Chaos team left out features from their model that did help explain user behavior. They noted that not all data features were found to be useful: for example, they tried to benefit from an extensive set of attributes describing each of the movies in the data set, and while those attributes certainly carry a significant signal, meaning they help explain some of the user behavior, the team concluded that the attributes would not help at all for improving the accuracy (which, for them, is how they measured how good their predictions were) of well-tuned collaborative filtering models.

So again, it's sometimes the case that features and models that explain the data really well don't make the best predictions, and models that are really good at making predictions aren't always the best at explaining the data. Sometimes they are one and the same; other times they may be different. And I'll end by saying that we're really going to focus on predictive modeling, but both approaches play an important role in data science and research, and some data scientists focus only on goal two with very little regard for goal one, and vice versa.

OK, so that's going to be the end of this supervised learning framework video. In the next video, we'll look at some data cleaning and preparation techniques that you have to do for supervised learning. All right, I hope you enjoyed this video, and I hope to see you in the next video. Bye.