Multilayer Neural Networks Video Lecture Transcript This transcript was automatically generated, so there may be discrepancies between the video and the text. Hi, everybody. Welcome back. In this video we're going to continue learning about neural networks, this time with the next extension of the perceptron, known as the multilayer neural network. So let me go ahead and share my Jupyter notebook and we'll get started. At the end of the Perceptron notebook and video, we talked about the limitation that perceptrons had: they are limited to linear decision boundaries. This limitation stymied neural network development, that and the lack of good computer hardware. But eventually, starting towards the 1980s I believe, there was a breakthrough that found these new architectures, as well as a breakthrough in computer hardware, that allowed for a reuptake of multilayer neural networks. This new architecture does in fact allow for nonlinear decision boundaries. So in this notebook in particular, we'll introduce the architecture, we'll demystify the so-called backpropagation algorithm, which is really just the chain rule, and then we'll demonstrate how to implement a multilayer network in sklearn. Let me go ahead and restart my kernel before we get started, so I'm starting with a fresh notebook. OK. So, the architecture for a feedforward network. Feedforward networks are another name for multilayer networks, and maybe this picture will help you see why we want to call them feedforward networks. So here is an example architecture of a multilayer network, otherwise known as a feedforward network. We still have an input layer and we still have an output layer, which should be familiar from the perceptron. But now we have these things called hidden layers, where there are extra layers of nodes in the middle. So why is it called a feedforward network? 
Well, from this image, we can see that each layer's nodes feed into each subsequent layer's nodes. So X1 has a weighted sum that goes into the top first-hidden-layer node, the middle first-hidden-layer node, and the bottom first-hidden-layer node. Each of the nodes in the input layer contributes to a weighted sum in each of the nodes in the first hidden layer. Then the first hidden layer has nodes that contribute to each of the nodes in the second hidden layer through weighted sums, as we'll see. And then the output layer has inputs from all of the nodes in the most recent hidden layer, the second one. This is why they're called feedforward networks: each layer of the network feeds directly into the next one. In particular, this diagram is the architecture of a feedforward neural network with two hidden layers, and we would say each of these hidden layers has dimension three. Why do they have dimension three? Because they each have three nodes. So the height, or sometimes the width, of a hidden layer, the number of nodes in it, is the dimension of the layer. These layers are called hidden because, in some sense, we only see what goes on in the input layer and what goes on in the output layer. We see what we put in and we see what comes out, and the nodes in the middle are in hidden layers because we don't directly see what's going on in them. If we think of feeding in the data on one side and seeing the output on the other, think of this as like what's going on under your car's hood: you see the gas go in, and then on the other side you see the movement that results from the gas going in; maybe the hidden layers are sort of the engine. Maybe that's a bad analogy, maybe it's not. OK, so this is a nice picture and a nice example. 
Let's touch on the mathematical setup behind a feedforward network. Again, we suppose that we have n observations of m features, and we let little x represent a single observation. This is an m by 1 vector, which again could include an entry equal to one, which would give us a bias term; but for this setup we're going to ignore the column of ones. Let's suppose further that our network, this time instead of two hidden layers, has K hidden layers, and in particular layer l has p sub l nodes. So in the picture above, p1 was three and p2 was three. We take h sub l to denote the vector corresponding to hidden layer l. So in the example above, all three of the first hidden layer's nodes would be held in a vector called h1, all three of the second hidden layer's nodes would be held in a vector called h2, and h1 would have entries h11, h12, and h13. OK. Then we suppose that our output layer has little o nodes; again, this could be for regression or classification. W1 is a p1 by m weight matrix because it represents the weights going from all of the features into all of the first hidden layer's nodes. We need it to have m columns because x is m by 1, and it has to have p1 rows because h1 is p1 by 1. I think that works out. Yes. Then for l equals 2 through K, W sub l has to be a p_l by p_(l-1) weight matrix, and W_(K+1) has to be an o by p_K weight matrix. All this is saying is that we're setting up a bunch of matrices so that the dimensions of the matrix multiplications work out algebraically. Then this time we're going to take capital phi, Φ, to be our activation function; in the Perceptron notebook we used lowercase sigma, and here we're switching to a capital phi. So the values of the input layer nodes, those are set by the data. 
But the value of each hidden layer node, and then of the output layer, which is an approximation of y, is determined by the following recursively defined equations. So h1 is just the activation function applied to W1 times x. The next one, h_(l+1), is an activation function, which could be a different one, but in this setup it's the same, applied to W_(l+1) times h_l, the weighted sums of the previous layer's nodes. And then the final estimate of y at the output node is a weighted sum of the last hidden layer with an activation applied to it. But just like in the perceptron, I have this slightly abusive notation: when I say Φ of W1 times x, I mean that W1 times x will give me a vector, and then I'm going to apply the activation function Φ to each of the entries of that vector. So this will have some entry in the first position, and then I'm going to apply Φ to it in that first position to get h11, if that makes sense. So that's the mathematical setup. The thing we still have to estimate is: what are all these weights going to be? In this setup we have a ton of weights; how do we figure out what they should be? That's called backpropagation, which we'll see in a second. Before, we had these hidden layers where we drew out all the nodes and then drew all the arrows; that can be tedious and very busy, especially when you get to real networks. So sometimes you'll see the architecture laid out like this, where instead of having individual nodes, you'll have a rectangle for each layer. So I have a rectangle representing the vector x, a rectangle representing each hidden layer, which may or may not be the same size, and then a circle, or, if your vector y is more than one-dimensional, another rectangle, for the output layer. So as I said, what we need to figure out is how to find these weight matrices. 
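The recursive equations above can be sketched in a few lines of numpy. This is a minimal illustration, not the notebook's actual code: the layer sizes, the random weights, and the choice of a sigmoid for Φ are all made-up assumptions for the example.

```python
import numpy as np

# A sketch of the forward recursion: h1 = Phi(W1 x), h2 = Phi(W2 h1),
# y_hat = Phi(W3 h2). Sizes are made up: m = 4 features, two hidden
# layers of dimension 3, one output node.
rng = np.random.default_rng(0)

def phi(z):
    # sigmoid activation, applied entrywise to the vector z
    return 1 / (1 + np.exp(-z))

m, p1, p2, o = 4, 3, 3, 1
W1 = rng.normal(size=(p1, m))   # p1 x m
W2 = rng.normal(size=(p2, p1))  # p2 x p1
W3 = rng.normal(size=(o, p2))   # o  x p2

x = rng.normal(size=(m, 1))     # a single observation, m x 1

h1 = phi(W1 @ x)      # first hidden layer, p1 x 1
h2 = phi(W2 @ h1)     # second hidden layer, p2 x 1
y_hat = phi(W3 @ h2)  # output, o x 1
print(y_hat.shape)    # (1, 1)
```

Notice that the only requirement on the weight matrices is that the shapes line up, exactly as the dimension bookkeeping above says.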
How do we find these weight matrices? In practice, that's called backpropagation. And all it really turns out to be is taking a bunch of derivatives. Because of the dependence structure here, the value of h_K, if that's the last hidden layer, depends on all the previous hidden layers, which depend on all the previous weights. So basically, what you have to do to take this derivative, from Calc III and also from Calc I, is the chain rule. Gradient descent is what we're going to use, and remember, for gradient descent we have to take derivatives. How do we take derivatives of compositions of functions? We use the chain rule. So we're going to use a very simple architecture of a feedforward network, where I have a single input, two hidden layers of dimension one, and a single output, and we're going to walk through the process of backpropagation to sort of demystify what's going on. If you've ever read any blog posts about neural networks, I always find that they make backpropagation out to be this very difficult thing that only the wisest of the neural network people could figure out. It's really just calculus, and I'm going to try my best to demystify it. Maybe you don't like taking derivatives; I don't like taking derivatives. But at the very least, fundamentally, we'll know what's going on. I don't know why we give it this very silly name instead of just saying gradient descent. OK, so it's called backpropagation, and the reason it's called this is because it's got a forward step and a backward step. We're going to use backpropagation to find the weights that we need here, which are labeled w1, w2, and w3. And then, using the mathematical setup from above, here is the architecture and the mathematical setup for this particular network. So the forward step of backpropagation is: you're going to first make some guess. 
If it's the first time through the gradient descent, this is a random guess for the w's, which are stored in a vector here. If it's a later time through the gradient descent, say you've done it once and now you're on your second step with a second guess for the weights, you just use whatever your current guess is. And then you use this guess to propagate forward all the values. So once you have a guess for w1, that allows you to calculate h1 according to this formula. Then once you have a value for h1, if we have a guess for w2, we can calculate a value for h2 using that guess and the value we calculated for h1. And then if we have a guess for w3, we can use all the previous values to calculate our estimate of y. So basically, at the very first step, you use your random guess for w, which is w1, w2, w3, and then you use this random guess to fill in the values from h1 all the way to your estimate for y. Now you're going to have estimates for the h's and for y hat, which we can use to perform the backward step. The backward step involves calculating a cost function. In this particular setup, we set our cost function to be the predicted value minus the actual value, squared. So sort of like a squared error; not sort of, it is a squared error. In order to update w using gradient descent, the new value of w is going to be the current value, which we can think of as the old value, of w, minus some learning rate times the gradient of the cost function at the current guess of w, because we want to lower the cost. Here the gradient of C is taken with respect to w; remember, y hat is a function of these weights. 
This learning rate is a hyperparameter, which we saw before; it was called alpha in the Perceptron notebook. For our derivation we have to assume that whatever cost you're using is differentiable with respect to all the weights; here it is, and there are some numerical workarounds for activation functions that are not differentiable everywhere. So in order to calculate the gradient of C with respect to the weights, we have to use the chain rule, and we can just slowly work our way backwards, which is why it's called the backward step. The partial derivative of C with respect to w3: well, according to the chain rule, that's the partial derivative of C with respect to y hat, times the partial derivative of y hat with respect to w3. When you work that all out, you get this, and again we'll assume that our activation function is differentiable. Similarly, we can use the chain rule to calculate the partial derivative of C with respect to w2. That's the partial of C with respect to y hat, times the partial of y hat with respect to h2; but remember, h2 is a function of w2, so we also have the partial of h2 with respect to w2. If you go through and calculate all three of these partials, you get this. And then finally, the last one we have to calculate is the partial of C with respect to w1, which, if you work it all the way out, is this; and you can calculate all of that by hand and you get this. You can estimate all of these using the values from the forward step. Remember, from the forward step we'll have a value for little x, a value for h1, a value for h2, and a value for y hat, so that's what you fill in here to get your estimates. Now you can perform the gradient descent step and update your estimate for w ... 
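The forward and backward steps just described can be written out in numpy for this tiny one-node-per-layer network. This is a sketch under stated assumptions: a sigmoid for Φ, the squared-error cost C = (y hat − y)², and made-up starting values for x, y, the weights, and the learning rate.

```python
import numpy as np

# One forward/backward pass of backpropagation for the tiny network:
# h1 = phi(w1 x), h2 = phi(w2 h1), y_hat = phi(w3 h2), C = (y_hat - y)^2.
def phi(z):
    return 1 / (1 + np.exp(-z))

def phi_prime(z):
    # derivative of the sigmoid: phi'(z) = phi(z) (1 - phi(z))
    s = phi(z)
    return s * (1 - s)

x, y = 0.5, 1.0              # one training observation (made up)
w1, w2, w3 = 0.1, -0.3, 0.7  # current guesses for the weights (made up)
eta = 0.1                    # learning rate (made up)

# forward step: propagate the current guess through the network
z1 = w1 * x;  h1 = phi(z1)
z2 = w2 * h1; h2 = phi(z2)
z3 = w3 * h2; y_hat = phi(z3)
cost = (y_hat - y) ** 2      # squared-error cost before the update

# backward step: chain rule, working from the output back toward w1
dC_dyhat = 2 * (y_hat - y)
dC_dw3 = dC_dyhat * phi_prime(z3) * h2
dC_dw2 = dC_dyhat * phi_prime(z3) * w3 * phi_prime(z2) * h1
dC_dw1 = dC_dyhat * phi_prime(z3) * w3 * phi_prime(z2) * w2 * phi_prime(z1) * x

# gradient descent update
w1, w2, w3 = w1 - eta * dC_dw1, w2 - eta * dC_dw2, w3 - eta * dC_dw3
```

Note how each deeper partial derivative just tacks more chain-rule factors onto the previous one, which is why the computation can reuse the values stored during the forward step.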
... So then you're going to, and I'm just reading what I wrote to make sure I'm covering everything, do sort of the random looping through all of the observations in your training set. Again, you can make an adjustment where you do a mini-batch, where you take a random selection of the observations and minimize the cost function on that; that can be done as well to increase the speed of the computation. When you go through your entire training set, either one observation at a time or in mini-batches, that's again called an epoch. Typically you set the number of epochs that you'd like to go through; we'll see that more in the Keras notebook after this one. And that's really all backpropagation is: you have the forward step, where you get estimates for all the h's and all the y hats, and then you have the backward step, where you perform the chain rule using the values that you got in the forward step. OK, so it's just gradient descent where we have to use the chain rule an annoying number of times. But that's all it is. And again, this will be done by a computer. Someone, maybe in an interview setting, may ask you how backpropagation works, where you may even have to go through a calculation like this. That's fine, but I believe in practice you're rarely going to be calculating gradients by hand. So we talked about one adjustment; another adjustment for gradient descent is called stochastic gradient descent, where instead of computing the gradient on the entire training set before each update, each update uses a randomly chosen observation or batch. You can learn more about this in the dedicated gradient descent notebook in the supervised learning lecture folder. So there is a notebook dedicated to gradient descent that can be found in supervised learning, and there's also a video corresponding to that notebook if you want to watch it. 
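The epoch and mini-batch bookkeeping described above can be sketched as a simple loop. This is just the loop structure, not a full training routine: the data are fake, the epoch count and batch size are made-up choices, and the actual forward/backward update is left as a comment.

```python
import numpy as np

# A sketch of looping over epochs and random mini-batches. One pass over
# the full training set (the outer loop body) is one epoch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 fake observations, 3 features
y = rng.integers(0, 2, size=100)       # fake labels

n_epochs, batch_size = 5, 20           # made-up hyperparameters
updates = 0

for epoch in range(n_epochs):
    order = rng.permutation(len(X))    # shuffle the observations each epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        batch_X, batch_y = X[batch], y[batch]
        # ... forward step and backward (chain rule) step on this batch ...
        updates += 1

print(updates)  # 5 epochs x 5 batches per epoch = 25 weight updates
```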
So let's now show you how to implement a feedforward network in sklearn, using the sklearn version of the MNIST data set. ... OK, so now I've loaded my data and I've scaled it. This is another algorithm where it's usually a good idea to scale your data, so here I'm just dividing by the maximum pixel value, 255, to scale it between zero and one. And then I get a validation set, because with neural networks you typically use validation sets instead of cross-validation, just because neural networks can take a long time to train. We're going to use MLPClassifier, which is the feedforward network classification algorithm in sklearn. If you had a regression problem, you could use MLPRegressor; MLP stands for multilayer perceptron. So we first have to import my model object: from sklearn.neural_network we will import MLPClassifier. And now I'm going to make two classifiers. The first will have an architecture of a single hidden layer with 500 nodes, and the second will have an architecture of two hidden layers, each with 200 nodes. We'll talk about why we might want to do something like this before we end the notebook. OK. So, MLPClassifier, then we want hidden_layer_sizes equals, and in the first one you pass a tuple containing the dimension of the first hidden layer. If I don't have another hidden layer, I put a comma and leave the tuple open like this. And then I want to increase the maximum number of iterations for the gradient descent, so max_iter is equal to, let's do, 5000, and hopefully that's enough. Now for the next one, I want to have two hidden layers of 200 nodes each, so I have 200 comma 200, again leaving the last place of the tuple open like this; that's just the syntax. And then again, max_iter equals 5000. Hmm, what did I do wrong? Ah, hidden_layer_sizes. 
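The workflow just described can be sketched as follows. To keep it quick to run, this sketch uses sklearn's small built-in digits data set as a stand-in for MNIST (its pixel values max out at 16 rather than 255); the layer sizes and max_iter mirror the lecture, while the validation split fraction and random_state values are arbitrary choices.

```python
# A sketch of the two-classifier comparison from the lecture, on a small
# stand-in data set so it trains in seconds rather than minutes.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixels to [0, 1]; for MNIST you would divide by 255

# a single validation set instead of cross-validation, since networks
# can take a long time to train
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=440, stratify=y
)

# one hidden layer of 500 nodes vs. two hidden layers of 200 nodes each
mlp1 = MLPClassifier(hidden_layer_sizes=(500,), max_iter=5000, random_state=440)
mlp2 = MLPClassifier(hidden_layer_sizes=(200, 200), max_iter=5000, random_state=440)

mlp1.fit(X_train, y_train)
mlp2.fit(X_train, y_train)

print("training accuracy:  ", mlp1.score(X_train, y_train), mlp2.score(X_train, y_train))
print("validation accuracy:", mlp1.score(X_val, y_val), mlp2.score(X_val, y_val))
```

On this small data set, both architectures typically hit perfect or near-perfect training accuracy with somewhat lower validation accuracy, the same overfitting pattern discussed below for MNIST.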
Sorry about that small typo. Here we go. Then we fit it just like we fit every other classifier. This might take a little bit to fit, but once it fits, we'll be all good. Again, this takes a while because we have a lot of nodes that we're working through. Maybe what I'll do, if it doesn't stop before I finish this sentence, is pause the recording and then come back when it's already fit. OK, so after a little bit, and it was only like 10 to 20 seconds of pausing, both of my classifiers are now trained, and I will look at the accuracy on the training set along with the accuracy on the validation set. So here we go. Both of the networks are able to get an accuracy of 100% on the training set, and comparable performance on the validation set. One thing about these feedforward networks is that there is a high likelihood of overfitting on the training data, which is why you want to have something like a validation set, or, if it doesn't take too long to train, you could do cross-validation to try and get a sense of how much overfitting is happening. The architecture here, so a single layer with 500 nodes versus two layers with 200: which architecture works best for your problem is something that you're going to have to tune with a validation set or cross-validation, figuring out which one gives you the best performance. There's not a one-size-fits-all architecture for all problems. But just be aware: the more hidden layers you have, and the higher the dimension of the hidden layers you do have, the longer it will take to fit, and probably the more likely you are to overfit, because you'll have a larger number of parameters controlling the model. A nice feature you might be interested in, if you're using a neural network for classification problems, is the confusion matrix. 
And so here's an example where you can look at this confusion matrix on the validation set and see where the model is getting confused. For instance, we have an example here where an actual five was accidentally classified as a four, and another example where an actual nine is accidentally classified as a seven. This seems reasonable, right? You'll see people whose sevens look like this, and if you draw a nine with a very small loop, it can look like a seven. So that's a nice feature of these sorts of classification problems: you can look at the confusion matrix, and in this particular example you could even find the observation that was misclassified and see, all right, it's reasonable that my model got that one wrong. ... So I said that there's a variety of activation functions that you might use; here are the ones that are readily available in sklearn. I'm just going to plot them first, then we'll go through them all. The first is just the identity. With the identity activation, you're just passing along the weighted sum of the inputs; this might be useful for a regression problem, but it's typically not what we use. Another one is the logistic activation; if you look at this, it should look familiar from logistic regression. So if you're doing a binary classification, this might be the output that you want on the last layer. The hyperbolic tangent gets used a lot, not so much in sklearn for the problems that we just looked at, but in what's known as a recurrent neural network, which will be covered in a later video; there, this activation function gets used quite a bit. And then the one that probably gets used the most is the rectified linear unit, the ReLU activation, which takes the maximum of zero and the input x. So for any x greater than zero, we get the value of x back; for any x less than or equal to zero, we get zero back. ... 
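The four activation options just described can be written out directly. These functions match the choices sklearn's MLPClassifier accepts for its `activation` parameter ('identity', 'logistic', 'tanh', 'relu'); the sample inputs at the end are made up for illustration.

```python
import numpy as np

def identity(x):
    # just the weighted sum, passed through unchanged
    return x

def logistic(x):
    # the sigmoid from logistic regression
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # hyperbolic tangent, common in recurrent networks
    return np.tanh(x)

def relu(x):
    # maximum of 0 and x, taken entrywise
    return np.maximum(0, x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))      # [0. 0. 3.]
print(logistic(0))  # 0.5
```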
ReLU is the default for both of the MLPs in sklearn. So we will end by talking about the universal approximation theorem. There's a theorem that says, and this is the quote from Wikipedia, that a feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of R^n, under mild assumptions on the activation function. The theorem thus states that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters. So basically what this theorem is saying is: any function we could want, and remember, functions determine the regression fits as well as the decision boundaries, any reasonable function we want, we can approximate using a feedforward network with a single hidden layer. So that's it, we've figured out machine learning, we have the best algorithm we could want. But if we continue reading from that excerpt: it does not touch upon the algorithmic learnability of those parameters. Which is essentially saying that you could theoretically fit any function you want with a single-hidden-layer neural network, but it may not be feasible. You could need a very large hidden layer, and maybe you are unable to feasibly get the number of training observations you would need to fit such a hidden layer. So the idea being: just because we can do it theoretically doesn't mean we can do it practically. And this is why many people are interested in the field of deep learning; deep learning is looking at this tradeoff between the width of a hidden layer, so the dimension of a hidden layer, and the number of hidden layers, the depth. 
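A quick way to see the universal approximation idea in action is to fit a wide single-hidden-layer network to a known continuous function on a compact interval. The target function, width, solver, and random seeds below are made-up choices for this demonstration, not anything from the lecture.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# One hidden layer, wide enough, approximating sin(x) on [0, 2*pi].
rng = np.random.default_rng(216)
X = rng.uniform(0, 2 * np.pi, size=(500, 1))
y = np.sin(X).ravel()

net = MLPRegressor(
    hidden_layer_sizes=(100,),  # a single hidden layer of 100 nodes
    activation="tanh",
    solver="lbfgs",             # full-batch solver, fine for tiny problems
    max_iter=5000,
    random_state=216,
)
net.fit(X, y)

print(net.score(X, y))  # R^2 near 1 means a near-perfect approximation
```

Of course, this says nothing about how hard those parameters were to learn, which is exactly the caveat in the excerpt above.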
So sometimes, instead of having a single hidden layer with a large dimension, you'll make a tradeoff for one or two or maybe three hidden layers with much smaller dimensions, which can give you comparable performance. OK. So let's end; we should also mention some deficiencies. We talked about overfitting. Also, if you have a neural network that is too deep, that has too many hidden layers, your gradients can encounter some problems called exploding or vanishing gradients. If you noticed here, the deeper we went, the more multiplications of gradient factors happened. So if those factors tend to be close to zero or above one, this means that with a very deep network you can end up with very large gradients or very small gradients, which can cause problems for the algorithm. Convergence can also take a long time, as we saw in this example. A lot of cost functions that you might be interested in are very bumpy, meaning that your gradient descent steps can get stuck in a suboptimal minimum. Sometimes this can be fixed with stochastic gradient descent, but it's not always a perfect solution. And then if you have a very complicated neural network, it's possible that you won't feasibly be able to train it on your laptop at home, and you'll need to work on more powerful hardware, which you may or may not have access to for free; it may cost you a lot of money. So if you were to make even a small mistake, you could waste a lot of money and time trying to train the network. This is seen as a deficiency. These networks are still very powerful, but the time and money it takes to train some of them is quite expensive. OK, so here are some references that you might be interested in if you want to learn more about deep learning 
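The vanishing and exploding gradient problem mentioned above comes directly from the chain rule: the gradient for an early layer is a product of one factor per layer. A toy numeric illustration, with made-up per-layer factors and depth:

```python
import numpy as np

# 50 chain-rule factors near 0.25 collapse toward zero (vanishing), while
# 50 factors near 4 blow up (exploding). The factor values are made up.
depth = 50
vanishing = np.prod(np.full(depth, 0.25))
exploding = np.prod(np.full(depth, 4.0))

print(vanishing)  # on the order of 1e-30: early layers effectively stop learning
print(exploding)  # on the order of 1e+30: huge, destabilizing weight updates
```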
and, you know, feedforward neural networks. In the next video, we'll show you how to implement this type of network in Keras, and then we'll learn other neural network types in later videos and notebooks. I hope you enjoyed learning about feedforward networks. I enjoyed having you watch this video. Have a great rest of your day. Bye.