Perceptrons Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we're gonna talk about neural networks, and in particular the building block of modern neural networks called the perceptron, the first neural network model. So perceptrons are the fundamental building blocks of neural networks. They were introduced by Rosenblatt in 1958. Here's the link to the original paper. The material in this notebook is slightly adapted from this textbook, which has a PDF version online that I find quite helpful. It's more theoretical than practical; we'll see some more practical books in coming notebooks. In this notebook we'll discuss the foundation of neural networks, we'll demonstrate a fundamental limitation of this foundation, and then we'll end by building a perceptron in sklearn.

So neural networks are a machine learning technique that loosely tries to mimic the way that the network of neurons inside of your brain learns. The idea is that we're going to try and create some learning algorithms that, in some loose sense, copy how humans actually learn with the firing of neurons in their brain. The building blocks of the vast, complex network of your brain are single neurons. The building block of an artificial neural network is called a perceptron. So we will start with this very simple building block to give you a nice foundation of what's going on, which we will then expand upon in later notebooks to build more advanced neural networks.

So the perceptron: we're gonna learn this in the setting of a classification problem, but as I believe we may see in later notebooks, it can also be used to solve regression problems. Let's suppose that we have n observations of m features, stored in m n-by-1 column vectors, x_1, x_2, ..., x_m, and we're gonna let capital X be a matrix that has n rows and m columns, where the columns of X are the m features. We're gonna use this to predict some target y, which in our case is going to be binary. It could be multiclass, it could be quantitative, as we said, but in this particular setup it's going to be binary, 0 or 1. I want to point out that we could have included a column of ones inside of X, but we're gonna leave it out for this formulation. When you do include a column of ones, this is known as building a perceptron or a neural network with a bias term, so the bias term refers to the column of ones. This is a computer science term, not necessarily a statistics one; in statistics we would just call it an intercept, right? But in computer science, they call it a bias term.

I'm also gonna let little sigma denote some nonlinear function from the reals to the reals. For classification we will take this to be the sign function, which is 1 when x is positive, -1 when x is negative, and 0 if x is exactly 0. In the language of neural networks, this is known as an activation function. And so perceptrons make estimates of y, called y hat, by doing the following: they apply this activation function, which again in this particular setting is gonna be the sign function, to a weighted sum of the m features. And so we have this nonlinear activation function,
sigma, applied to the weighted sum w_1 x_1 + w_2 x_2 + ... + w_m x_m, which is sigma(Xw). In a potential abuse of notation here, when I write sigma(Xw) I don't mean sigma applied to the entire vector as a single object. It's gonna be the same not just in this notebook but in future notebooks: I take this to mean that I want to apply sigma to each of the n entries of Xw. So Xw will result in an n-dimensional vector, and then sigma(Xw) means I apply sigma to each entry of Xw, whatever that entry turns out to be. And here w is taken to be a column vector of the weights.

So if little x is a single observation, I can picture this weighted sum process as a network in the following way. I'm gonna have a set of m input nodes, which are each of the observation's features, and then I draw them going into an output node here, which is going to be our prediction of y. Each of the arrows going into the output node is associated with one of our weights. And so the idea here, in the perceptron diagram or architecture, is that the arrows denote being part of a weighted sum, which is why we see a capital sigma here. And then after we take this weighted sum, we apply the nonlinear activation function, which is why we see a little sigma on the right. And then this is our approximation of y, our estimate of y. And so the goal here, what we have to do, is figure out which values for the w's to choose. These are parameters that we're going to estimate; we want to find the estimate that gives us the best approximation of our output y.

OK, so as I said, this is known as the architecture of the neural network. We'll see more complicated versions of this in later notebooks. And if the perceptron had a bias term, so if we had that column of ones, you'd see an extra node down here going into the output node, and the weight associated with it would be called b. OK. So this column of nodes right here that I'm highlighting with my mouse, x_1 through x_m, is known as the input nodes, because they're going into the perceptron. The output node has both the capital sigma and the little sigma, because of the weighted sum and the nonlinear activation function.

OK, so how do we find the weights? I said that all we have to do now is find the weights, so how do we do that? It's going to be gradient descent. You first randomly select all of your weights, or, if you have a good idea of some good weights to choose for whatever reason, you can set them yourself. Then you take a single data point from the training set, where here I'm using the superscript (i) to denote an observation, calculate the error, which is the actual minus the predicted, and then you update w with a gradient descent step: the updated version of w is w_current plus alpha times (y^(i) - y hat^(i)) times x^(i), where alpha is the learning rate of the network. The perceptron then cycles through all the training points and continues to adjust the weights until it converges to a weight vector w. Each time you go through all of the training points, going through them one by one, sampling without replacement, once you get through all of them, that's known as an epoch. And then typically, as I said, these training points are chosen at random without replacement.
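To make that update loop concrete, here is a minimal NumPy sketch of the training procedure just described. This is not the course's code, just an illustration, and it makes a few assumptions: the labels are coded as +1/-1 to match the sign activation, there is no bias column, and we simply run a fixed number of epochs rather than checking for convergence.

import numpy as np

def fit_perceptron(X, y, alpha=1.0, n_epochs=10, seed=0):
    # X: n-by-m feature matrix; y: length-n vector of +1/-1 labels (assumption)
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = rng.normal(size=m)              # randomly initialize the weights
    for epoch in range(n_epochs):
        order = rng.permutation(n)      # one epoch: every point once, without replacement
        for i in order:
            y_hat = np.sign(X[i] @ w)                  # weighted sum, then sign activation
            w = w + alpha * (y[i] - y_hat) * X[i]      # the update step described above
    return w

# Toy usage on two linearly separable clusters with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w = fit_perceptron(X, y)
print(np.sign(X @ w))  # should match y once the weights have converged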
Instead of doing it one point at a time, you can take small batches, and sometimes you might even use the entire data set. If you do it with small batches, it's called mini-batch gradient descent, where you choose the batches randomly without replacement. You can also rewrite this particular algorithm to work in parallel, and here are links to two papers that show you how to do that. And then if you want a simple example that illustrates the step-by-step process of calculating these weights, this link right here goes through a step-by-step example on page 10.

So this is the perceptron. We're gonna show you how to fit one in sklearn, but in essence we're not gonna use the perceptron on its own very often, and we'll see why at the end of this notebook. The Perceptron is stored in linear_model because, as you can see, we have a linear expression here; that's why it's a linear model, and maybe that can give you a hint as to the limitation of perceptrons. So from sklearn.linear_model we'll import Perceptron. And then we're gonna use it on this data set that we've used a lot in our classification notebooks: we've got our zeros up here, our ones down here, with the line y = x being the true decision boundary. This model object works like every other sklearn model object. So we first make the Perceptron object, perc = Perceptron(), then we fit it. ... Let's see, it's just X, right? Yes, X comma y. ... And now we're gonna plot: the light blue is where the Perceptron would predict class zero, the light orange is where the Perceptron would predict class one, and the points are the training points. So as we can see here, the Perceptron does really well with this linear boundary.

But let's look at another example. In this example I have zeros at (0, 1) and (1, 0), and I have ones at (0, 0) and (1, 1), and we're gonna fit a Perceptron to this and demonstrate the limitation of this model. This is an example where you could pause the video and try to do it on your own, or you can watch me do it. If you're gonna do it on your own, just make sure you store your new model in a variable called lowercase p in order for the rest of the code to run. OK, so p = Perceptron() ... p.fit, and this is also just X, y. ... OK. So our actual labels were 1, 0, 0, 1, and our predicted labels are all 0. And this is what the decision boundary looks like: it isn't even on the unit square here, from 0 to 1 on both axes, and it's unable to separate these points. And let's get rid of this, this is actually a typo, that part shouldn't be there; it's not the decision boundary anymore. So the Perceptron is unable to separate these four points. And the reason why is because the perceptron is a linear model, meaning it can only create linear decision boundaries, and this data right here is unable to be separated in two dimensions, at least with a linear boundary. In computer science speak, the reason this was a big deal is that a data set like this is how you would code up the logic operation known as exclusive or (XOR), which is a key operation in computer science. And because perceptrons were unable to do that, interest in neural networks dropped off for a long time, until people were able to come up with the model that we'll learn about in the next video.
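If you want to reproduce that XOR demonstration outside the notebook, here is a small scikit-learn sketch. The plotting code is omitted, and the ordering of the four points and the print statements are my own choices, not necessarily the notebook's.

import numpy as np
from sklearn.linear_model import Perceptron

# ones at (0, 0) and (1, 1); zeros at (0, 1) and (1, 0): the XOR pattern
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([1, 0, 0, 1])

p = Perceptron()
p.fit(X, y)

print(p.predict(X))   # a linear model cannot match y on all four points
print(p.score(X, y))  # accuracy is at most 0.75 with any linear boundary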
However, that was a nonlinear example. If you have a linear example like the one we had before, that is, if you suspect that your data is linearly separable, it may be worthwhile to try a perceptron. There are mathematical proofs and guarantees as to the convergence rate for a perceptron on data that is linearly separable. So it may be worthwhile if you believe your data is linearly separable.

So in this video we talked about the perceptron, which is the foundation of most of the neural networks that we will learn about in the rest of this course. We showed how to fit one theoretically and how to fit one in sklearn, and then we also talked about the limitation that perceptrons have, which caused them to lose favor for a number of years. OK, so that's it for this video. I hope you enjoyed learning about the perceptron. I enjoyed teaching you about the perceptron, and I can't wait to see you in our next video. Have a great rest of your day. Bye.