Advanced Pipelines Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we'll talk about more advanced pipelines. In a different video we talked about basic pipelines: we imported a Pipeline object and did things like standard scaling, and, if you've watched the imputation video, maybe a SimpleImputer, and then fed the result into a model. Sometimes we'll need functionality that's more customized to our data set than what we can get from the kind of simple or basic pipeline we built in that video. In this notebook we'll show how to get that more advanced functionality, and in particular we'll do it with the penguins data set that we touched on in the imputation video. In the process you'll see how to build a custom transformer object, we'll introduce sklearn's FunctionTransformer, we'll review the fit, transform, and fit_transform paradigm, and at the end we'll build a pipeline to predict penguin species.

So let's return to the penguins data. This is a slightly different version of the data set; again, it comes from the seaborn package. Refreshing ourselves: in this version we have 333 entries, and most of them have values for all of the columns. The only columns with missing values are body_mass_g and sex, so we're missing the body mass for some penguins and the sex for some penguins. The world we're pretending we live in is that we want to build a model that predicts the species of a penguin when we feed it all of these other features. So what does that look like? Let's take a second to look and see.
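The notebook presumably loads the real data with seaborn's `load_dataset("penguins")`. As a self-contained sketch that doesn't need a download, here is a tiny synthetic stand-in with the same column names (the values themselves are made up for illustration):

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the seaborn penguins data.
# In the notebook itself you would use: seaborn.load_dataset("penguins")
penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Torgersen"],
    "bill_length_mm": [39.1, 46.5, 49.0, 38.6],
    "bill_depth_mm": [18.7, 14.8, 18.5, 17.0],
    "flipper_length_mm": [181.0, 217.0, 195.0, 189.0],
    "body_mass_g": [3750.0, np.nan, 3800.0, 3450.0],
    "sex": ["Male", "Female", np.nan, "Female"],
})

# As in the video: only body_mass_g and sex have missing values.
print(penguins.isna().sum())
```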
Essentially this is seaborn's answer to the iris data set. Remember, in the iris data set we have four different measurements for three different types of irises; here we have a bit more than four measurements for three different species of penguin. The idea is that maybe we can build a model that, as you can see, uses these features to separate out the three species. In addition to the quantitative, continuous features, bill length, bill depth, and flipper length, we also have the island the penguin was living on as well as the sex of the penguin. We want a pipeline, and because we have some missing values, we're going to have to do some imputation, and we're also going to have to one-hot encode things like island and sex, because we can't feed "Male"/"Female" or "Biscoe"/"Dream"/"Torgersen" directly into the model.

So imagine we've taken some time and come up with the following desired pipeline. First, we impute missing values of body_mass_g with a SimpleImputer using the median. Second, we impute missing values of sex using the most common value, the mode; for instance, if the most common value were "Male", that's what the imputer would fill in. Third, we one-hot encode the island and sex variables. Finally, we fit a random forest model to the data.

One reason the basic pipeline approach from the other video runs into trouble here is that we want two different types of imputation. In general, if a pipeline just contains a single SimpleImputer, it will use the same strategy for all of the missing values in every column; for instance, body_mass_g and sex would be imputed in the same way.
Another issue is that two of our features need to be one-hot encoded while none of the others do. These steps are slightly more complicated than what the basic pipeline can accomplish, which is why we need to learn some additional pipeline functionality. So how can we get our custom functionality written into a pipeline? We have to make our own transformer objects.

A transformer object is anything that takes in your data set and performs some kind of transformation on it. We've seen a number of examples in previous notebooks. For instance, StandardScaler is a transformer that takes in data, fits it by computing things like the mean and the standard deviation, and then transforms the data by scaling it accordingly, i.e. standardizing it. PolynomialFeatures takes in data and returns the polynomial transformation of those features. SimpleImputer takes in data and fills in missing values with whatever strategy you give it. These transformer objects come preset from sklearn's developers, but sklearn also has some nice functionality that allows us to make our own custom ones.

We're going to start by making a custom imputer for body_mass_g; I'll show you the code and try to explain it as we go. One note: if you're unfamiliar with classes and objects, that kind of concept in computer science or programming, it may be useful to go back and review the "Classes and Objects in Python" notebook, which can be found in the Python prep material within the repository.

In order to make our own custom transformer objects, we'll need two things from sklearn: BaseEstimator and TransformerMixin. We'll see what those do in a second.
SimpleImputer is something we went over in the imputation video; it's its own transformer object, and we're just going to borrow some of its functionality. We'll call our custom imputer class BodyMassImputer. Defining a class is not unlike defining a function, with some important keywords. In Python you start with the class keyword, then, similar to how you define a function, you put the name of the class, which for us is BodyMassImputer, followed by parentheses. In order to get transformer functionality from sklearn, inside those parentheses you first put BaseEstimator, which gives us the base estimator functionality from sklearn, and then, after a comma, TransformerMixin, which gives us additional functionality that makes our job of programming a transformer object much easier.

The first thing we have to define is the initialization step. Any time we make a BodyMassImputer object, for instance like the one we make below, this __init__ method gets called; it tells the computer what to do whenever you create one of these. The self argument refers to the object itself, so if I create one with, say, imputer = BodyMassImputer(), that imputer is what self refers to. All we're going to do is endow each instance of BodyMassImputer with a SimpleImputer attribute, which is just a copy of sklearn's SimpleImputer with a median strategy.
This gives us all the functionality of sklearn's SimpleImputer with a median strategy. Now, what does every transformer need? It needs a fit and it needs a transform. To define the fit, we just write def fit. When you define a new class, every method you give the class has to have self as its first argument. What else do we need? We need an X, and we need y=None. Why y=None? An imputer doesn't actually need any output variable; it only works on the input variables, the features. That's the X. We still need y in the signature because when we're going through a pipeline we feed in both X and y, but since we don't actually use it, we set y=None as a default.

When we fit this, all we want to do is call SimpleImputer's fit method, and specifically we fit it only on the body mass. We're imagining that the X that gets fed in is a data frame with a body_mass_g column, and we reshape that column because sklearn expects a 2D array. So we've essentially copied SimpleImputer's fit method and restricted it to the body_mass_g column of the data frame. We do the same thing with the transform method: we basically copy SimpleImputer's transform method but restrict it to the body_mass_g column. The reason we make copy_X is that I like to add this in as a safety step, to make sure we don't transform the original data frame stored as X when the method gets called, but instead work on a copy, which we get with X.copy(). OK.
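Based on the description above, the class likely looks something like this sketch (the exact notebook code may differ slightly; the .ravel() call is my addition to keep the assigned column one-dimensional):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

class BodyMassImputer(BaseEstimator, TransformerMixin):
    """Imputes missing body_mass_g values with the median seen in fit."""
    def __init__(self):
        # Borrow SimpleImputer's machinery with a median strategy.
        self.SimpleImputer = SimpleImputer(strategy="median")

    def fit(self, X, y=None):
        # Fit only on the body_mass_g column; reshape because
        # SimpleImputer expects a 2D array.
        self.SimpleImputer.fit(X["body_mass_g"].values.reshape(-1, 1))
        return self

    def transform(self, X, y=None):
        # Work on a copy so the original frame is left untouched.
        copy_X = X.copy()
        copy_X["body_mass_g"] = self.SimpleImputer.transform(
            X["body_mass_g"].values.reshape(-1, 1)).ravel()
        return copy_X

demo = pd.DataFrame({"body_mass_g": [3000.0, np.nan, 5000.0]})
print(BodyMassImputer().fit_transform(demo))  # NaN -> 4000.0 (the median)
```

With this in place, calling fit_transform on the notebook's training frame fills the missing masses while leaving every other column alone.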
Now that we have a custom imputer class, and again, if this seemed weird to you and you don't understand what's going on, I really encourage you to look at the "Classes and Objects in Python" notebook in the Python prep material. With the BodyMassImputer class defined, we can make a BodyMassImputer object and use it to impute the body mass. Here are the rows where the training data has a missing value for body_mass_g; I call imputer.fit_transform, pass in my data frame, and then look at those same observations. You can see the missing body masses have been replaced with the median body mass, and all of the other columns are exactly as they were before.

Now, you might be thinking: wait a minute, that's weird, how can you call fit_transform when you never explicitly defined a fit_transform? This is the nice thing; I believe this is provided by TransformerMixin. The sklearn developers have made it so that as long as you define a fit and a transform, you're allowed to call fit_transform without defining it explicitly, and when you call it, it works the way fit_transform is supposed to: first it calls fit, then it calls transform. So when you make your own custom imputer, you don't need to define fit_transform yourself; TransformerMixin makes sure it works the way it's supposed to.

Now we're going to code a different imputer for the sex variable, which we'll call SexImputer. Some of this has been filled in already.
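The fit_transform point above can be seen with a toy transformer (AddOne is a made-up name, purely for illustration): fit_transform is never defined, yet it works because TransformerMixin supplies it.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class AddOne(BaseEstimator, TransformerMixin):
    """Toy transformer: defines only fit and transform."""
    def fit(self, X, y=None):
        # Nothing to learn here; fit must still return self.
        return self

    def transform(self, X, y=None):
        return [x + 1 for x in X]

# fit_transform was never defined, but TransformerMixin provides it:
# it calls fit first, then transform.
print(AddOne().fit_transform([1, 2, 3]))  # → [2, 3, 4]
```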
Just the BaseEstimator and TransformerMixin imports and the def line of the initialization function are given. What I want to do is go through and code it so you can see it in action, and maybe that will help how it works click in your mind. First we need to give self the SimpleImputer attribute, so we write self.SimpleImputer = SimpleImputer(...), and remember, for sex the strategy we wanted was "most_frequent", which gives us the mode. Next we define the fit method: def fit(self, X, y=None), where y defaults to None because we don't need a y in order to fit. Then we just call self.SimpleImputer.fit on X["sex"].values.reshape(-1, 1). We also need the return statement, which I almost forgot: return self. Finally, the transform method: def transform(self, X, y=None). First I make a copy of X with copy_X = X.copy(), then I replace the sex column of the copy with the transformed version, self.SimpleImputer.transform(X["sex"].values.reshape(-1, 1)), and then I return the copy of X.

OK, now I can make a SexImputer object. Here's what it looks like: this is the only row of the data frame where I'm missing an observation for sex, and then we can see what happens. It looks like "Male" was the most frequent class here.

Now, finally, we need a one-hot encoder for these data. Remember our steps for the pipeline: the first thing we needed to do was impute body mass, which we did, then impute sex, which we did.
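The steps just walked through can be sketched as follows (again a reconstruction of the notebook's code, with .ravel() added to keep the column one-dimensional):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

class SexImputer(BaseEstimator, TransformerMixin):
    """Imputes missing sex values with the most frequent category."""
    def __init__(self):
        # "most_frequent" fills missing entries with the mode.
        self.SimpleImputer = SimpleImputer(strategy="most_frequent")

    def fit(self, X, y=None):
        self.SimpleImputer.fit(X["sex"].values.reshape(-1, 1))
        return self

    def transform(self, X, y=None):
        copy_X = X.copy()
        copy_X["sex"] = self.SimpleImputer.transform(
            X["sex"].values.reshape(-1, 1)).ravel()
        return copy_X

# Demo: the lone missing sex gets the most frequent value.
demo = pd.DataFrame({"sex": ["Male", "Male", np.nan, "Female"]})
print(SexImputer().fit_transform(demo))
```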
The last thing we need to do is one-hot encode island and sex. How can we do this? There's a nifty feature in sklearn called FunctionTransformer. It takes a regular Python function, which I have already written down here, and turns it into a transformer object. That matters because a transformer object can be placed in a pipeline, while a regular function cannot. So we first say from sklearn.preprocessing import FunctionTransformer.

Here is the function where I've defined the encoding process. It takes in a data frame, which for us will be the penguins training or test set. I make a copy of the data frame, because I don't want to change the original, just the copy. I first replace the sex column with the get_dummies version, where 0 will now represent a male and 1 will represent a female. I also make island columns in the copy using get_dummies on the island column of the original data frame; the new columns are the three island names. Finally, I return the copy with only the columns I'm interested in, so, for instance, I drop the original island column but keep the dummy variables I generated.

Here, for instance, is the penguins training data frame, and we can see what happens when I apply this function to it and look at the corresponding first five rows. Bill length, bill depth, flipper length, and body mass get returned; sex is now a 0/1 column; and there are now three island columns. You'll notice that species is missing.
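A sketch of what the function and its FunctionTransformer wrapper likely look like (the name one_hot_encoder, the 0 = male / 1 = female convention, and the column ordering follow the video's description; the exact notebook code may differ):

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def one_hot_encoder(df):
    """One-hot encode sex and island, keeping only the modeling columns."""
    copy_df = df.copy()
    # 0/1 encode sex: 0 for male, 1 for female.
    copy_df["sex"] = 1 * (df["sex"] == "Female")
    # Dummy columns for the three islands (0 if an island is absent).
    island_dummies = pd.get_dummies(df["island"])
    for island in ["Biscoe", "Dream", "Torgersen"]:
        copy_df[island] = island_dummies.get(island, 0)
    # Drop the original island column; keep the generated dummies.
    return copy_df[["bill_length_mm", "bill_depth_mm", "flipper_length_mm",
                    "body_mass_g", "sex", "Biscoe", "Dream", "Torgersen"]]

# Wrapping the plain function produces a transformer that can go in a Pipeline.
one_hot_transformer = FunctionTransformer(one_hot_encoder)

demo = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5],
    "bill_depth_mm": [18.7, 14.8],
    "flipper_length_mm": [181.0, 217.0],
    "body_mass_g": [3750.0, 4500.0],
    "sex": ["Male", "Female"],
    "island": ["Torgersen", "Biscoe"],
})
# transform works without fit: a FunctionTransformer learns nothing from data.
print(one_hot_transformer.transform(demo))
```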
That's fine, because species is the y, and this function is going to be applied to the X values. Right now this is just a Python function, and Python functions can't go into a pipeline. So we use FunctionTransformer to turn the function into an sklearn transformer object. It's very simple: you call FunctionTransformer and pass in the function, so one_hot_transformer = FunctionTransformer(one_hot_encoder). Once we have one_hot_transformer, we can look at it and see that it's a FunctionTransformer object. Now we can call one_hot_transformer.transform and we get the same result as when we applied the function directly above, but one_hot_transformer is something that can be fed into a pipeline. If you don't believe me, when we make our pipeline down below, you can try putting just the function in the pipeline and see what happens versus putting in the transformer.

So now let's make our pipeline, and just as a review of how a pipeline works: we call Pipeline, and then we put in a list of tuples. Our first step is the body mass imputer; let's do a quick check to make sure that's what we called the class, BodyMassImputer, so body, not bill. Then we add the sex imputer as our second step. Our third step is the one-hot encoding, which we want as the FunctionTransformer wrapping one_hot_encoder. Before I finish, and this might be bad form, I wanted to also point out something I wrote here but forgot to say.
You might notice that with the FunctionTransformer we were able to call .transform without fitting it first. With a FunctionTransformer you don't have to call fit first because nothing is being fit; we're just putting the training data frame through the transformer object, and nothing gets learned from the data. Hopefully that wasn't confusing; take a second to think about it if you need to. I'm now going to go back to making my pipeline.

Finally, the last step is the random forest classifier, RandomForestClassifier. If you're watching this before you've looked at the random forest video, don't worry about this part; just think of it as a normal classifier, and once you watch the random forest video you'll see what we're talking about. We'll use 100 trees and set a max depth of, say, three; actually, let's do five.

Now I have a pipeline. I can fit the pipeline, I can make predictions on the training set, as we see here, and I can make predictions on the test set, and then here is the accuracy on the training set versus the accuracy on the test set. And I just caught a typo of my own, so you may need to change that; I will try to remember to fix it before I upload the notebook.

All right, so this was a video going through more advanced pipelines. What did we cover? We talked about how what we learned in basic pipelines doesn't always work for a particular data project, and we may need more advanced functionality. We learned how to make our own transformer objects in Python using object-oriented programming, that whole classes and objects thing, and we also learned about FunctionTransformer in sklearn. I hope you enjoyed learning about all of this, and I hope you enjoyed working on the penguins data set.
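Putting all the pieces together, here is a compact end-to-end sketch. To keep it self-contained it condenses the video's two imputer classes into one generic ColumnImputer (a hypothetical name), uses a simplified one-hot function, and runs on a small synthetic stand-in for the penguins training data rather than the real set:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

class ColumnImputer(BaseEstimator, TransformerMixin):
    """Condensed stand-in for the video's two imputer classes:
    fills one named column with the given SimpleImputer strategy."""
    def __init__(self, column, strategy):
        self.column = column
        self.strategy = strategy
        self.imputer = SimpleImputer(strategy=strategy)

    def fit(self, X, y=None):
        self.imputer.fit(X[self.column].values.reshape(-1, 1))
        return self

    def transform(self, X, y=None):
        copy_X = X.copy()
        copy_X[self.column] = self.imputer.transform(
            X[self.column].values.reshape(-1, 1)).ravel()
        return copy_X

def one_hot(df):
    """0/1 encode sex and dummy-encode island (simplified)."""
    copy_df = df.copy()
    copy_df["sex"] = 1 * (df["sex"] == "Female")
    for island in ["Biscoe", "Dream", "Torgersen"]:
        copy_df[island] = 1 * (df["island"] == island)
    return copy_df.drop(columns=["island"])

pipe = Pipeline([
    ("body_mass_impute", ColumnImputer("body_mass_g", "median")),
    ("sex_impute", ColumnImputer("sex", "most_frequent")),
    ("one_hot", FunctionTransformer(one_hot)),
    ("rf", RandomForestClassifier(n_estimators=100, max_depth=5,
                                  random_state=0)),
])

# Tiny synthetic stand-in for the penguins training data.
rng = np.random.default_rng(0)
n = 60
body_mass = rng.normal(4200, 800, n)
body_mass[::7] = np.nan                      # sprinkle in missing masses
sex = pd.Series(rng.choice(["Male", "Female"], size=n), dtype=object)
sex.iloc[::10] = np.nan                      # and missing sexes
X = pd.DataFrame({
    "bill_length_mm": rng.normal(44, 5, n),
    "body_mass_g": body_mass,
    "sex": sex,
    "island": rng.choice(["Biscoe", "Dream", "Torgersen"], size=n),
})
y = rng.choice(["Adelie", "Chinstrap", "Gentoo"], size=n)

pipe.fit(X, y)
print("training accuracy:", pipe.score(X, y))
```

Because each step is a transformer object, the whole chain, two imputations, one-hot encoding, and the classifier, fits and predicts with a single call, which is exactly what the plain Python function on its own could not do.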
I know that I did and I hope to see you in the next video. Have a great rest of your day. Bye.