WEBVTT 00:00:11.000 --> 00:00:20.000 23 boot camp and we're gonna spend it learning a little bit about neural networks. So we're gonna try and cover notebooks number one through number 4 here. 00:00:20.000 --> 00:00:30.000 So this is just gonna be a really easy introduction. If you'd like to learn more about different types of neural networks, we also have notebooks on convolutional neural networks and recurrent neural networks 00:00:30.000 --> 00:00:40.000 that you can go through on your own. Remember, these have pre-recorded videos on the website already. These also just touch the surface. 00:00:40.000 --> 00:00:49.000 Nowadays there are a lot of other types of networks that we don't have content on; maybe we will sometime in the future, but for now we don't. 00:00:49.000 --> 00:00:55.000 So I always like to get started by just reviewing the very foundation of neural networks, called perceptrons. 00:00:55.000 --> 00:01:04.000 These were introduced by a researcher named Rosenblatt back in 1958, and here's a link to the original paper, 00:01:04.000 --> 00:01:16.000 assuming that the link still works. A lot of the material in this notebook and in the following notebooks is based on the textbook Neural Networks and Deep Learning, which I really like; this is more of a theoretical book. 00:01:16.000 --> 00:01:20.000 I also have a link to an applied book on how to do this in Python that I really like, 00:01:20.000 --> 00:01:28.000 which I will link to later. 00:01:28.000 --> 00:01:34.000 So the first sort of question is: when we say neural network, what do we mean? What is an artificial neural network? 00:01:34.000 --> 00:01:45.000 In a loose sense, they're a technique that is trying to mimic the network of neurons that make up brains. 00:01:45.000 --> 00:01:50.000 That is, in some very loose sense, how humans learn. The building block of the vast network of the brain is a single neuron, 00:01:50.000 --> 00:02:09.000 and for artificial neural networks, the building blocks are known as perceptrons. So this is the very first model that started the whole neural network idea. 00:02:09.000 --> 00:02:13.000 So let's talk about the perceptron. For all of these, we're going to be in the setting of classification because I personally find it easier to set up that way. 00:02:13.000 --> 00:02:24.000 You can use these for regression as well, but we're gonna learn about it with classification in mind. 00:02:24.000 --> 00:02:39.000 But do remember that you can use regression also. So once again, assume that we have n observations of m features stored in m n-by-1 column vectors. 00:02:39.000 --> 00:02:45.000 So X one, X two, up through X m. And if we collect these into a matrix X, we want to use this matrix X to predict some target y. 00:02:45.000 --> 00:02:54.000 In general, we could also include a column of ones like what we did in linear regression, but we're gonna leave it out for this formulation. 00:02:54.000 --> 00:03:05.000 For perceptrons, in the language of neural networks, when you do include the ones, that's known as including a bias term. 00:03:05.000 --> 00:03:10.000 And the only thing that will change is — here you'll see this is the image of a perceptron — 00:03:10.000 --> 00:03:13.000 when you include an intercept or bias term, you'll typically see another little circle down here 00:03:13.000 --> 00:03:30.000 that has a letter b in it.
So we're gonna talk about this diagram, but I just wanted to make the note that in this formulation we're not including an intercept or bias term, but in general you could include one. 00:03:30.000 --> 00:03:31.000 So we're gonna let little sigma be some sort of nonlinear function from R to R. 00:03:31.000 --> 00:03:44.000 In particular, this can change, but for this setup we're gonna take it to be the sign function, which is written as sgn. 00:03:44.000 --> 00:03:51.000 So the sign function is one if x is greater than 0 and negative one if x is less than 0. 00:03:51.000 --> 00:03:57.000 And then I think it's just 0 if x is 0, but that's probably not going to happen 00:03:57.000 --> 00:04:00.000 if your network is working. Okay. 00:04:00.000 --> 00:04:18.000 So perceptrons estimate y; we're going to call this estimate y hat. And the estimate is the activation function — we call sigma the activation function — applied to a weighted sum of the inputs. 00:04:18.000 --> 00:04:28.000 So we could write this using linear algebra as sigma of X times w, where X is that matrix and w is a vector of weights. 00:04:28.000 --> 00:04:37.000 So if we let little x denote a single observation, so a single row of the X matrix, then we can often denote this perceptron with a picture. 00:04:37.000 --> 00:04:49.000 So we have a list of circles that we're calling input nodes, and each of these nodes 00:04:49.000 --> 00:05:00.000 represents one of the features of the observation. So for instance, the one on the top will typically be taken to denote the first feature, and so forth. 00:05:00.000 --> 00:05:07.000 Then they all have arrows feeding into something called the output; we'll talk about that in a second. 00:05:07.000 --> 00:05:16.000 Each of these arrows is just drawn to represent the weighted sum. So the weight on the first arrow is the weight on the first feature, which is w one. 00:05:16.000 --> 00:05:21.000 The weight on the second arrow is the weight on the second feature, which is w two, and so forth. 00:05:21.000 --> 00:05:33.000 Now the output node is going to be our prediction of y. And here that's split in half, where we have a capital sigma, which represents the sum, 00:05:33.000 --> 00:05:41.000 and then a lowercase sigma, which represents the activation function. So your inputs go into the node, 00:05:41.000 --> 00:05:47.000 they're summed according to these weights, and then we apply an activation function. So a diagram like this is known as an architecture of a neural network. 00:05:47.000 --> 00:06:08.000 We're going to see more complicated ones in the coming notebooks. Again, if we wanted to include an intercept, the only thing that would change is we would have a little node down here with a lowercase b, because the people who developed this take intercepts to be called bias 00:06:08.000 --> 00:06:18.000 terms. So I see we have a question. Kearthons is asking: can you remind me if the bias is added only to the training set? 00:06:18.000 --> 00:06:26.000 So if you included a bias term, like I just said, this would be a feature of the model: 00:06:26.000 --> 00:06:31.000 there'd be a bias term that we would fit using the training set, but then it would also apply to any predictions we'd want to make.
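As a small sketch of the prediction being described here — with a bias term included, the way the question asks about — the bias is just one more fitted number added into the weighted sum before the activation. All of the numbers below are made up for illustration.

```python
import numpy as np

x = np.array([0.5, -1.2, 2.0])   # one observation's feature values (made up)
w = np.array([0.3, 0.8, -0.1])   # the fitted weights (made up)
b = 0.25                         # the fitted bias / intercept term (made up)

# the bias is fit on the training set, but it is part of the model,
# so it shows up in every prediction, training point or not
y_hat = np.sign(b + x @ w)
print(y_hat)
```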
00:06:31.000 --> 00:06:41.000 So if we had a different observation, the weighted sum would be b plus the w's times whatever your observation is. 00:06:41.000 --> 00:06:56.000 So it's fit using the training set, but it's always gonna be there in the model, no matter whether the observation you're trying to predict on is in the training set or not. 00:06:56.000 --> 00:07:24.000 Are there any other questions before we go into how you find these W's? 00:07:24.000 --> 00:07:29.000 So the bias is just another parameter that would need to be fit; think back to linear regression, where you have that beta 0. 00:07:29.000 --> 00:07:37.000 So there, beta 0 is a bias term, but because it was developed by statisticians and mathematicians, it's called an intercept. 00:07:37.000 --> 00:07:44.000 Perceptrons were developed by computer scientists, so it's called a bias term. 00:07:44.000 --> 00:07:51.000 I will say as an aside, one of the things that kind of makes data science annoying to learn is that it's a field that's basically taking different pieces of other fields and putting them all together, 00:07:51.000 --> 00:08:09.000 and so the language is not always consistent across the algorithms. Sometimes you have intercepts, sometimes you have a bias; sometimes things are labeled as 0 and 1, sometimes they're labeled as negative one and one. 00:08:09.000 --> 00:08:16.000 So you just have to be mindful to keep track of the differences and what is the same. 00:08:16.000 --> 00:08:33.000 So Icons is asking: are the observations being weighted or the features being weighted? So the features themselves have the weights applied to them, but then when you go to put a single observation into the perceptron, or in general the neural network, the weights get applied to that observation's feature values. 00:08:33.000 --> 00:08:43.000 So in general, the model is saying we are going to apply a weight of w one to the 00:08:43.000 --> 00:08:50.000 first feature, but then when we actually go to apply it or fit it, you use the observations to find those weights. 00:08:50.000 --> 00:08:54.000 Does that make sense? 00:08:54.000 --> 00:09:08.000 Great. Alright. So how do we find these weights? It's gonna maybe look weird, but we're essentially doing something called gradient descent, and this will come back when we do more general neural networks. 00:09:08.000 --> 00:09:13.000 So the first step is you're just going to randomly select your weights. Then you'll use a single data point from the training set — 00:09:13.000 --> 00:09:17.000 this can be adjusted, but for this setup we're just gonna use a single data point from the training set. 00:09:17.000 --> 00:09:32.000 You'll calculate the predicted value using those randomly generated weights and then calculate the error, which is going to be the actual minus the predicted. 00:09:32.000 --> 00:09:40.000 And then you're gonna update w by taking the current guess at w, which remember at the beginning was randomly chosen, 00:09:40.000 --> 00:09:49.000 plus a small nudge: alpha, which is known as the learning rate (in other algorithms I think we used eta, but here we're using alpha), 00:09:49.000 --> 00:10:02.000 times that difference, times the features. So then the perceptron will use this updated weight, find another randomly selected training point x i, 00:10:02.000 --> 00:10:07.000 do the same exact process, and update the weights again.
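Here is a minimal sketch of that update rule, assuming the sign activation and the -1/+1 label convention described above; the learning rate and data values are made up.

```python
import numpy as np

alpha = 0.1                       # learning rate
w = np.random.randn(3)            # step 1: randomly selected starting weights
x_i = np.array([0.5, -1.2, 2.0])  # a single (hypothetical) training point
y_i = 1                           # its label, using the -1 / +1 convention

y_hat = np.sign(x_i @ w)          # predicted value with the current weights
w = w + alpha * (y_i - y_hat) * x_i   # current guess plus alpha times the error times the features
```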
So it will go through and sample the training points at random without replacement. 00:10:07.000 --> 00:10:20.000 Once it's all the way through all the training points, that's considered an epoch — pronounced "epic" or "ee-pock" depending on who you talk to — and then basically all the training points get put back into the pile to be randomly selected again. 00:10:20.000 --> 00:10:46.000 This is gonna be done until the weights converge to a weight vector W. And that is typically decided by some sort of tolerance parameter: if my new weight is within a certain distance of the current weight, then I'm not going to keep 00:10:46.000 --> 00:10:54.000 updating. Other times you stop because you've reached a maximum number of times you're willing to go through the process. 00:10:54.000 --> 00:11:00.000 One adjustment you can make is, instead of doing one point at a time, you can choose small batches, and then the batches would be chosen randomly without replacement. 00:11:00.000 --> 00:11:15.000 Somebody has gone through and figured out how to do this so you can fit it in parallel. Another adjustment — and I think this will come up in a later one — is instead of choosing a fixed alpha, 00:11:15.000 --> 00:11:22.000 you can choose a randomly selected alpha each time, and we'll talk about why you'd want to do that in a later notebook. 00:11:22.000 --> 00:11:28.000 And let's see, I have a question. Petro is asking: so here it wouldn't be necessary to scale the features? 00:11:28.000 --> 00:11:44.000 So in general you do want to scale the features for neural networks. We're not gonna worry about that with the perceptron, because we're actually not going to use the perceptron, but in general when you're dealing with neural networks you do wanna scale your features. 00:11:44.000 --> 00:11:51.000 Okay. So a really nice simple example of step-by-step training of the perceptron can be found at this link, assuming it still works. 00:11:51.000 --> 00:12:02.000 It goes step by step and shows you how the predictions change — imagine this weird shifting thing I'm doing with my hands: 00:12:02.000 --> 00:12:09.000 this is the decision boundary changing each time. It goes through and shows you how it settles down by doing this update process. 00:12:09.000 --> 00:12:17.000 So I like it, and I hope it's still there; I forgot to check. Okay, so how can we use a perceptron in sklearn? 00:12:17.000 --> 00:12:23.000 Again, we're not going to spend a ton of time on this because you don't typically use a perceptron, 00:12:23.000 --> 00:12:25.000 but I think it's just useful to have in your back pocket as knowledge of the history of neural networks. 00:12:25.000 --> 00:12:41.000 So from sklearn.linear_model — the perceptron is a linear model — you'll import 00:12:41.000 --> 00:12:52.000 Perceptron. We're going to use this on our tried and true data for classification. 00:12:52.000 --> 00:12:57.000 This part should be pretty familiar if you've gone to all the other lectures. So you make your sklearn 00:12:57.000 --> 00:13:04.000 model object — we're gonna call it perc — 00:13:04.000 --> 00:13:11.000 set it equal to Perceptron(), and then we are going to fit it, 00:13:11.000 --> 00:13:15.000 X comma y. 00:13:15.000 --> 00:13:16.000 And now you can see here how it's able to fit this training data very well. 00:13:16.000 --> 00:13:27.000 Okay.
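In code, the sklearn steps just described look roughly like this; the small two-feature data set here is a made-up stand-in for the notebook's classification data.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# toy stand-in for the notebook's data: a linearly separable two-class problem
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([0, 0, 1, 1])

perc = Perceptron()        # make the sklearn model object
perc.fit(X, y)             # fit it, X comma y
print(perc.predict(X))     # predictions on the training data
print(perc.score(X, y))    # training accuracy
```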
And again, this is a very straightforward data set for most of the algorithms to fit. 00:13:27.000 --> 00:13:33.000 So as a motivating example, here's another silly example. We've got only a few points here: 00:13:33.000 --> 00:13:47.000 two ones and two zeros. Typically, if you had something like this, this isn't enough data to train an algorithm on, but I just want to show it 00:13:47.000 --> 00:13:56.000 as an example to help motivate us. So we're gonna do p equal to Perceptron(), 00:13:56.000 --> 00:14:07.000 and then p.fit(X, y). Okay, and so what I have here is the actual data and then the predicted data from the perceptron. 00:14:07.000 --> 00:14:13.000 Notice it's all zeros. And then here's what the decision boundary looks like. 00:14:13.000 --> 00:14:19.000 It's all blue, okay? So this isn't very good. And again, this isn't a proof or anything; 00:14:19.000 --> 00:14:25.000 it's just showing you some intuition. So one reason that the perceptron failed is that it's unable to do nonlinear decision boundaries. 00:14:25.000 --> 00:14:51.000 Just like other algorithms we've seen, it can't do nonlinear things. And this was a problem for perceptrons at the time, because you can use this sort of classification problem to encode a very common computer science construct called exclusive or (XOR) as a logic gate, and people want to be able to do that 00:14:51.000 --> 00:15:05.000 when they're programming. So the perceptron, I believe, was initially proposed for actual programming uses. It couldn't do exclusive or, and so the computer science community was like, well, these neural networks are never gonna take off if you can't do that. 00:15:05.000 --> 00:15:12.000 So that led to sort of a death of the neural network for a little bit, until somebody was able to come up with a new idea that allows you to fit these sorts of things. 00:15:12.000 --> 00:15:20.000 So that's the idea for the perceptron, and that's sort of where it's at. 00:15:20.000 --> 00:15:33.000 If you have something that you think can be linearly separated, like you had here, a perceptron might work quite well, but in general they don't really get used. 00:15:33.000 --> 00:15:39.000 So Yahoo is asking: how about if you employ more perceptrons, I'm assuming. 00:15:39.000 --> 00:15:43.000 So — 00:15:43.000 --> 00:15:54.000 we're gonna see in the third notebook the workaround — or not the workaround, but the development of a model building on the perceptron — 00:15:54.000 --> 00:16:07.000 that makes it so that it can classify this type of problem. And the basic idea is kind of: what if we just have more perceptrons, in some sense. 00:16:07.000 --> 00:16:15.000 Okay, so before we move on, are there any questions about perceptrons? 00:16:15.000 --> 00:16:25.000 I'm sure I made them uninteresting to you now that I've told you that they don't really work very well. 00:16:25.000 --> 00:16:29.000 Okay. 00:16:29.000 --> 00:16:43.000 So as a brief aside, I wanted to introduce the data set we're going to be using. This is a data set called the MNIST data set, which, as I've written down here, stands for Modified National Institute of Standards and Technology. 00:16:43.000 --> 00:16:52.000 If you haven't seen this data set before, it's pretty famous. I do vaguely remember as a kid seeing some special on PBS about it.
00:16:52.000 --> 00:17:03.000 So basically the data set has a bunch of pixelated images of handwritten digits — the numbers 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. 00:17:03.000 --> 00:17:06.000 Each image is broken into a grid of pixels of grayscale values which measure the intensity of handwriting within that pixel. 00:17:06.000 --> 00:17:22.000 So a value of 0 would mean no marking, and a value of 255 would mean the darkest marking. 00:17:22.000 --> 00:17:32.000 So if you had a piece of paper where you drew the number 2, those pixels would have values that are not 0, but then all the pieces of the paper that have nothing on them would have values of 0. 00:17:32.000 --> 00:17:40.000 The original data set has 60,000 training images and 10,000 test images. For these two notebooks that we're going to cover, and maybe a fifth one if we have time, 00:17:40.000 --> 00:17:53.000 we're going to use two different versions, and the reason we're using two different versions is just because we're using two different packages. 00:17:53.000 --> 00:18:01.000 So we're gonna have one notebook where we use sklearn, and for that we're going to use sklearn's version of the digits. 00:18:01.000 --> 00:18:02.000 When you're using sklearn, you can get the digits with this function 00:18:02.000 --> 00:18:11.000 stored in datasets, called load_digits. 00:18:11.000 --> 00:18:19.000 So here you'll have this many images, which is much smaller than 60,000; I think that's just done for storage purposes, 00:18:19.000 --> 00:18:22.000 so they don't have to store as many, it loads very quickly, and your models will train more quickly. 00:18:22.000 --> 00:18:36.000 And here's what a particular observation looks like: these are the grid values of the pixelated image. 00:18:36.000 --> 00:18:41.000 So we have 64 pixels, so it's an 8 by 8 grid. 00:18:41.000 --> 00:18:47.000 And this is what some of the pictures look like — very grainy images. So here's a 0, 00:18:47.000 --> 00:18:55.000 here's a 1, a 2, a 3, a 4, and a 5. So it's not very high quality, but it works well enough for demonstrating 00:18:55.000 --> 00:19:05.000 different algorithms, how they work on this data set, as well as testing different algorithms. 00:19:05.000 --> 00:19:11.000 So Laura's asked: how do you obtain the numerical information from the images? So the features are stored in X. 00:19:11.000 --> 00:19:24.000 X stores the grid values, but then the y's will store the label, like what number this is. And so for the 0 entry of the array, 00:19:24.000 --> 00:19:30.000 this is what the pixel values look like, but we can see here that the 0 entry is a 0 itself. 00:19:30.000 --> 00:19:40.000 And so here you can see as we go through — where did I write this? — I'm going through the first 6, 00:19:40.000 --> 00:19:45.000 the first 6 observations. Then I'm plotting each one; I'm reshaping it 00:19:45.000 --> 00:19:58.000 so instead of being a row vector, it's an 8 by 8 grid. And then you can see I have the y of i, and so that's how you get these labels 0, 1, 2, 3, 4, 5. 00:19:58.000 --> 00:20:05.000 Okay, so Laura is saying: initially, if you have the handwritten numbers, so let's say the sheets of paper, 00:20:05.000 --> 00:20:11.000 how do you generate the numerical information? My guess is that they scanned them in. I don't actually know.
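For reference, the load_digits steps walked through above look roughly like this sketch (the notebook's exact plotting code may differ):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
print(X.shape)   # (1797, 64): each row is a flattened 8x8 grid of pixel values
print(y[0])      # the label of the first image, which is a 0

# reshape the first few rows back into 8x8 grids and plot them with their labels
fig, axes = plt.subplots(1, 6, figsize=(12, 2))
for i, ax in enumerate(axes):
    ax.imshow(X[i].reshape(8, 8), cmap="gray_r")
    ax.set_title(f"y = {y[i]}")
    ax.axis("off")
plt.show()
```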
00:20:11.000 --> 00:20:15.000 So if you want to learn how they got the data, maybe it's at the Wikipedia entry. 00:20:15.000 --> 00:20:29.000 But my guess is they maybe scanned it in on some sort of machine in a very standard way and then digitized the image, and then, for instance, for sklearn they got a different, lower-resolution image. 00:20:29.000 --> 00:20:39.000 We'll see that the Keras version is a much higher resolution version. 00:20:39.000 --> 00:20:48.000 Yeah, and then Sanjay has commented that it's extracted using OCR technology. I actually saw a post about this earlier today — 00:20:48.000 --> 00:20:59.000 I think it was on a computer science Stack Exchange or something — where somebody asked what's the best way to store your images to preserve disk space, and they're like, oh, just print them out, 00:20:59.000 --> 00:21:04.000 delete the image on your computer, and then when you need it, scan it back in using OCR. 00:21:04.000 --> 00:21:10.000 So, very topical for today's lecture. 00:21:10.000 --> 00:21:17.000 Okay, so that's the sklearn version. That's what we're going to use in notebook number 3, because the model is an sklearn model. 00:21:17.000 --> 00:21:23.000 The other version we're going to use is the Keras version, which is the actual original version of the data. 00:21:23.000 --> 00:21:29.000 So first you need to have Keras installed, which we will talk about in notebook number 4, 00:21:29.000 --> 00:21:34.000 and then second, you have to run it. The first time you run this, this line in particular can take a little bit of time 00:21:34.000 --> 00:21:49.000 if you haven't downloaded the data before. Similarly here, these two code chunks can take a little bit because it has to, I believe, download the data set onto your computer and then load it. 00:21:49.000 --> 00:21:55.000 And so you can see these are 60,000 images, 00:21:55.000 --> 00:21:58.000 and the images are stored as 2D arrays. 00:21:58.000 --> 00:22:04.000 And then there's the test set — which is nice, they provide both the training set and the test set — 00:22:04.000 --> 00:22:11.000 which has the 10,000 images. And here you can see these are much higher resolution, 00:22:11.000 --> 00:22:22.000 so you can actually read them without needing the label. And so these might give better models than the sklearn version. 00:22:22.000 --> 00:22:23.000 So the basic point is we're using two different versions because sometimes we'll be building a model in sklearn, 00:22:23.000 --> 00:22:35.000 other times we'll be building a model in Keras, and so we're just using the data set that goes along with that model. 00:22:35.000 --> 00:22:57.000 Okay, so are there any other questions about the data set before we move on to learn more about neural networks? 00:22:57.000 --> 00:23:04.000 So someone is asking about perceptrons: is the perceptron the same as a neuron? 00:23:04.000 --> 00:23:14.000 So, do you mean like a brain neuron, or just the little nodes within the bigger neural network? 00:23:14.000 --> 00:23:17.000 Yeah, the nodes. Yeah. 00:23:17.000 --> 00:23:47.000 Yeah, so —
I think they're not 100% the same, but they are essentially the same, in that the bigger neural network is essentially just going to be made of many different perceptrons all stacked up in particular ways. 00:23:47.000 --> 00:23:56.000 Okay, so with that as a good staging question, let's go ahead and show you multi-layer neural networks. 00:23:56.000 --> 00:24:04.000 This is gonna be the most basic neural network, and then from here you'll learn different architectures and so on. 00:24:04.000 --> 00:24:09.000 So these have a couple different names: they're multi-layer neural networks, 00:24:09.000 --> 00:24:17.000 they're also known as feedforward neural networks, and then, importantly for the next notebook, they're also known as dense neural networks. 00:24:17.000 --> 00:24:25.000 So why are they called these things? I think it's maybe better to explain it with this example. 00:24:25.000 --> 00:24:34.000 The multi-layer or feedforward neural network is going to have a series of what are known as hidden layers. 00:24:34.000 --> 00:24:44.000 For the first hidden layer, every single input node feeds into every single node of that hidden layer, and then subsequent hidden layers have every node of the previous hidden layer feeding into each of their nodes. 00:24:44.000 --> 00:24:55.000 And then finally, all the nodes of the last hidden layer feed into whatever your output layer nodes are. 00:24:55.000 --> 00:25:02.000 This might sound a little confusing; I'm hopeful that it will become more clear as we keep going through the notebook. 00:25:02.000 --> 00:25:12.000 So this particular architecture is a feedforward network, or dense neural network, that has 2 hidden layers. 00:25:12.000 --> 00:25:15.000 These 3 nodes are a hidden layer, and then the second column of 3 nodes is a hidden layer. 00:25:15.000 --> 00:25:25.000 We would say that those hidden layers have dimension 3 because they have 3 nodes in their column. 00:25:25.000 --> 00:25:36.000 Why are they called hidden layers? The basic idea is they're hidden because we see the inputs and the outputs, but don't necessarily see what's going on on the inside. 00:25:36.000 --> 00:25:42.000 Now that's not a hundred percent true: remember, the arrows are representing different weights, 00:25:42.000 --> 00:25:57.000 so we can see the different weights later if we want to. So it's not entirely hidden, but the idea is we don't 100% know what's going on in the middle there; we just see the inputs and the outputs. 00:25:57.000 --> 00:26:18.000 So basically, before we dive into the more theoretical setup: the nodes represent the same exact thing as we had in the perceptron, where you have your input as each one of your features, every arrow has a weight on it, and then when the arrows lead into a node it represents a weighted sum 00:26:18.000 --> 00:26:27.000 with an activation function applied. So this first node will have a weighted sum of all the x's that then has an activation applied to it. 00:26:27.000 --> 00:26:36.000 So will the second one, with different weights than the first one, as will the third one, which will also have different weights from the first two. 00:26:36.000 --> 00:26:41.000 I guess in theory they could end up with the same or similar weights, but in general they're different.
00:26:41.000 --> 00:26:49.000 Then for the following hidden layer, the second layer, each node will be a weighted sum of the 3 nodes of the previous hidden layer. 00:26:49.000 --> 00:26:54.000 That's what the arrows are representing: arrow, arrow, arrow, and then the same with the other 2 nodes. 00:26:54.000 --> 00:27:05.000 So arrows represent weights applied to the inputs, the inputs have a weighted sum, and then there's an activation function applied to that sum. 00:27:05.000 --> 00:27:13.000 And that's the idea. So now for a more formal formula. For me personally, I need to see it written out to understand it 00:27:13.000 --> 00:27:23.000 rather than have somebody talk it at me, so hopefully this works for those people that are like me and need to see the formula written out. 00:27:23.000 --> 00:27:35.000 So we're gonna suppose that we have n observations of m features, and that little x represents a single observation, which is an m-by-1 vector, so a column vector. 00:27:35.000 --> 00:27:44.000 We're gonna suppose that in general we have little k hidden layers, and that layer l has p sub l nodes in it. 00:27:44.000 --> 00:27:55.000 h sub l is the vector corresponding to hidden layer l. So in this example, these 3 nodes would be considered h sub one, 00:27:55.000 --> 00:28:01.000 and then these 3 nodes would be considered h sub two. 00:28:01.000 --> 00:28:18.000 We would then have a series of weight matrices: W one is going to be the weight matrix from the input layer to the first hidden layer, and then W l is the weight matrix from hidden layer l minus one to hidden layer l. 00:28:18.000 --> 00:28:40.000 So basically W one would be all of these weights collected in a matrix, W two would be all of these weights collected in a matrix, and then W three would be all of these weights collected in a matrix. 00:28:40.000 --> 00:28:53.000 So then you can calculate the values of the nodes in the following way. For the first hidden layer, you're doing the weights times the inputs and then applying a function to it. 00:28:53.000 --> 00:29:01.000 And this is maybe an abuse of notation: when we write something like this, we mean that the function is not being applied to the entire vector, 00:29:01.000 --> 00:29:21.000 but rather entry-wise. This is just the easiest way to write it. So this W one times x is going to give you a vector, and then you're going to apply the capital sigma to each entry of the vector. 00:29:21.000 --> 00:29:33.000 The middle hidden layers are just the weights for those layers times the previous layer, again applied entry-wise. And then finally you take the weights for the last one 00:29:33.000 --> 00:29:42.000 multiplied by the last hidden layer, and again apply the activation function entry-wise. 00:29:42.000 --> 00:29:53.000 In general, these activation functions can be different from layer to layer. In practice, they tend to be the same between all the hidden layers, and then you'll have a different activation function for the output. 00:29:53.000 --> 00:30:01.000 This will hopefully become — I know I keep saying this a lot — more clear when we have a concrete example.
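As a sketch of those formulas, here's a hand-rolled forward pass in NumPy for a network like the one pictured — two hidden layers of 3 nodes each. The number of input features, the random weights, and the tanh activation are all just placeholder assumptions.

```python
import numpy as np

def sigma(z):
    # a generic nonlinear activation applied entry-wise (tanh as a placeholder)
    return np.tanh(z)

m = 4                       # number of features (placeholder)
x = np.random.randn(m)      # a single observation, an m-by-1 column vector

W1 = np.random.randn(3, m)  # weights from the input layer to hidden layer 1 (3 nodes)
W2 = np.random.randn(3, 3)  # weights from hidden layer 1 to hidden layer 2
W3 = np.random.randn(1, 3)  # weights from hidden layer 2 to the output

h1 = sigma(W1 @ x)          # weighted sum of the inputs, activation applied entry-wise
h2 = sigma(W2 @ h1)         # weighted sum of the previous hidden layer
y_hat = sigma(W3 @ h2)      # output layer
print(y_hat)
```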
00:30:01.000 --> 00:30:08.000 Sometimes you'll see architectures drawn like this instead of having to draw all the individual nodes and arrows. 00:30:08.000 --> 00:30:19.000 Sometimes you'll just see them represented as rectangles, where you have a big rectangle representing the features and then other rectangles that represent the hidden layers, and then a certain number of nodes representing the output. 00:30:19.000 --> 00:30:28.000 So for us, if it's binary classification, we'd have a single node; if it's multi-class, we'll have multiple nodes. 00:30:28.000 --> 00:30:34.000 Okay, so I see we have a question. 00:30:34.000 --> 00:30:41.000 Yahweh is asking, if I read it right: does the final layer basically sum up all the weights? 00:30:41.000 --> 00:30:47.000 In some sense, but remember that these weighted sums are then being fed through some sort of function. 00:30:47.000 --> 00:30:57.000 So it's not just a sum of sums of sums for however many layers you have; it's like this sort of 00:30:57.000 --> 00:31:08.000 Russian nesting doll, right, where the outermost doll is the last activation function; then you take that off and you have this weighted sum of previous activation functions, you take that off, and so forth. 00:31:08.000 --> 00:31:24.000 We're gonna see a concrete example where it will become more clear, where we work through the process of fitting. 00:31:24.000 --> 00:31:31.000 So Parham's asking: the activation function does not have to be linear, right? Nope, in general it is a nonlinear activation function. 00:31:31.000 --> 00:31:40.000 We may have linear ones, but in general they are nonlinear. 00:31:40.000 --> 00:31:54.000 Are there any other questions before we dive into a very particular example? 00:31:54.000 --> 00:32:05.000 Okay. So the one big question might be: if we look at this setup, the things that we have to find through estimation are these weights. 00:32:05.000 --> 00:32:11.000 So a natural question is how do we find these weights? This is a process that you may have heard of before, called back propagation. 00:32:11.000 --> 00:32:19.000 I personally find the name intimidating if you're starting neural networks for the first time. 00:32:19.000 --> 00:32:26.000 If you remember a little bit of calculus from college — or if you did it in high school, kudos to you — 00:32:26.000 --> 00:32:38.000 it's just gonna be something called the chain rule mixed with a process from Calc 3 called gradient descent. So let's have a very simple network where we have a single input node, 00:32:38.000 --> 00:32:49.000 then 2 hidden layers, each with one node, and then one output layer. Altogether this requires 3 weights: little w one, little w two, and little w three. 00:32:49.000 --> 00:33:02.000 And then this is what the function version of this looks like. Okay, so back propagation is called back propagation because it has what's known as a forward step 00:33:02.000 --> 00:33:11.000 and then a backward step. For the forward step, just like before, we're going to randomly guess weights. 00:33:11.000 --> 00:33:19.000 Then we're going to propagate — we're going to take the forward step — using a single observation 00:33:19.000 --> 00:33:28.000 (or a batch, if we wanted to use batches), using the randomly guessed weights, and getting all the values. 00:33:28.000 --> 00:33:37.000 So we'll have the value of x one from the observation we chose,
times the randomly guessed value for w one, which gives us h one, 00:33:37.000 --> 00:33:45.000 which in turn gives us h two, which in turn gives us an estimate for y. So now we have all those things. 00:33:45.000 --> 00:33:57.000 Now comes the backward step. Typically we have some sort of loss or cost function; let's say, hypothetically, it's (y hat minus y) squared. 00:33:57.000 --> 00:34:05.000 It might be something different — it depends on the problem we're looking at — but for illustrative purposes, let's take it as this. 00:34:05.000 --> 00:34:09.000 So in order to update the weight vector, we're going to use something called gradient descent. 00:34:09.000 --> 00:34:21.000 And there's this idea from calculus 3 that if you want to decrease a function — so basically, let's say you're somewhere and you want to go in the direction that decreases the function as quickly as possible — 00:34:21.000 --> 00:34:35.000 well, that direction is the negative of the gradient. And so gradient descent exploits this by saying: all right, I'm going to take my current position 00:34:35.000 --> 00:34:40.000 and then just move a little bit in the direction of the negative gradient. The "little bit" part is the eta, which I guess I called alpha in the previous notebook — 00:34:40.000 --> 00:34:50.000 I need to make these more consistent someday — and then that's times the gradient of C, evaluated at your current guess. 00:34:50.000 --> 00:34:56.000 So your new guess for the next step in the gradient descent is going to be your current guess 00:34:56.000 --> 00:35:06.000 minus eta times the gradient of C evaluated at the current guess. So here the gradient of the thing that we're trying to optimize is 00:35:06.000 --> 00:35:17.000 evaluated at the W — so that's again your current guess — and then we're gonna see the gradient of C in this particular example. 00:35:17.000 --> 00:35:27.000 So the gradient of C will be evaluated using the values you found in the forward step. So let's dive into a particular example. 00:35:27.000 --> 00:35:36.000 Using the chain rule, we can find the partial derivative of C with respect to w one, 00:35:36.000 --> 00:35:44.000 the partial derivative of C with respect to w two, and the partial derivative of C with respect to w three. 00:35:44.000 --> 00:35:45.000 Let's start with the easiest one and then work our way back. So we have the 00:35:45.000 --> 00:35:57.000 derivative of C with respect to w three. Well, C is a function of y hat, and y hat is a function of w three, 00:35:57.000 --> 00:36:05.000 so the chain rule tells us that's the derivative of C with respect to y hat times the derivative of y hat with respect to w three. 00:36:05.000 --> 00:36:07.000 And if you work all that out, you get this. This is assuming your activation function is differentiable; 00:36:07.000 --> 00:36:18.000 there are some computational workarounds that get used that we won't have to worry about. 00:36:18.000 --> 00:36:23.000 I guess if you wanted to go into a neural network research career, you might want to worry about it, 00:36:23.000 --> 00:36:31.000 but for our purposes we're not doing that. Okay, then for the second one you use the chain rule again. 00:36:31.000 --> 00:36:36.000 So we have the derivative of C with respect to y hat. Well, y hat's a function of h two,
00:36:36.000 --> 00:36:43.000 so then we have the derivative of y hat with respect to h two, and then h two is the thing that is a function of w two, so then we have the derivative of h two with respect to w two. 00:36:43.000 --> 00:36:52.000 And that's what we get. And then — I'm not gonna say it all out — 00:36:52.000 --> 00:37:03.000 that's what the last step is: you just do the chain rule again. So for all of these expressions, we either have a guess, which is from the old W, 00:37:03.000 --> 00:37:09.000 or we have a value from the forward step. So we have a value for h two from the forward step, 00:37:09.000 --> 00:37:15.000 we have a value of h one from the forward step, and then we have x one because it's the particular 00:37:15.000 --> 00:37:25.000 observation we chose. So then for the gradient adjustment, we would just take those calculations, plug them in, and that will give us the new weight vector. 00:37:25.000 --> 00:37:34.000 And we'll keep doing this, just like with the perceptron, until we get close enough to what we think an optimum is, or we're just done training. 00:37:34.000 --> 00:37:43.000 Again, the process of going through all your training points is called an epoch. 00:37:43.000 --> 00:37:54.000 Let's see, what are some adjustments? So we talked about how — I believe it's actually the default — instead of a fixed learning rate you'll do stochastic gradient descent, which will randomly select a learning rate. 00:37:54.000 --> 00:38:15.000 This is done to avoid getting stuck in a local minimum, the idea being that maybe every once in a while you'll get stuck somewhere, but if eta randomly gets big enough, then you can get out of a local minimum and go find a global minimum. 00:38:15.000 --> 00:38:21.000 There's batch gradient descent, which we talked about before. And I think that's it. 00:38:21.000 --> 00:38:23.000 So if you're interested in learning more about gradient descent, I have a notebook about it in supervised learning, 00:38:23.000 --> 00:38:35.000 so check out that folder; there's a notebook called gradient descent. 00:38:35.000 --> 00:38:44.000 Okay, so I see we have a question. 00:38:44.000 --> 00:38:58.000 Zach's asking: are other methods used for finding the optimum other than gradient descent? I think it is possible that there are other people out there who have found other methods, but I believe in general 00:38:58.000 --> 00:39:09.000 it's typically gradient descent, and the algorithms are just different ways to calculate the gradients. 00:39:09.000 --> 00:39:12.000 Are there any other questions? This was a lot of math, so if you're not a math person, don't worry, 00:39:12.000 --> 00:39:20.000 don't worry about it. If you are a math person, you can ask a question. 00:39:20.000 --> 00:39:28.000 Or even if you're not a math person, you can still ask a question — don't be intimidated by all the math if you don't remember calculus 3 or if you never took calculus 3. 00:39:28.000 --> 00:39:33.000 The important thing is just sort of remembering the overall idea. 00:39:33.000 --> 00:39:36.000 I don't know, it's kind of intimidating if it's your first time seeing it; 00:39:36.000 --> 00:39:41.000 it took me a while to learn neural networks, so I hope you're not intimidated.
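Pulling the forward and backward steps together for the toy network above — one input, two one-node hidden layers, one output, and cost C = (y hat − y)² — a sketch might look like the following. It assumes a tanh activation so the derivatives are easy to write out, which may not match the notebook's choice.

```python
import numpy as np

sigma = np.tanh
dsigma = lambda z: 1 - np.tanh(z) ** 2        # derivative of tanh

x1, y = 0.5, 1.0                              # a single (made-up) training point
w1, w2, w3 = np.random.randn(3)               # randomly guessed weights
eta = 0.1                                     # learning rate

# forward step: propagate x1 through the network
z1 = w1 * x1; h1 = sigma(z1)
z2 = w2 * h1; h2 = sigma(z2)
z3 = w3 * h2; y_hat = sigma(z3)

# backward step: chain rule, starting from C = (y_hat - y)**2
dC_dyhat = 2 * (y_hat - y)
dC_dw3 = dC_dyhat * dsigma(z3) * h2
dC_dw2 = dC_dyhat * dsigma(z3) * w3 * dsigma(z2) * h1
dC_dw1 = dC_dyhat * dsigma(z3) * w3 * dsigma(z2) * w2 * dsigma(z1) * x1

# gradient descent update: new guess = current guess - eta * gradient
w1, w2, w3 = w1 - eta * dC_dw1, w2 - eta * dC_dw2, w3 - eta * dC_dw3
```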
00:39:41.000 --> 00:39:46.000 So Icons is asking, just to recap: how does the forward step calculate the h's and w's? 00:39:46.000 --> 00:39:58.000 So at any step along the way — if it's the beginning, you'll have a random guess for W, or one that you've set up; maybe you have a good idea of what W is, so you make a guess that's not random. 00:39:58.000 --> 00:40:03.000 You'll have a random guess, or some kind of guess, for W, and then you'll know what the observation is, 00:40:03.000 --> 00:40:15.000 so you'll have a value of X, or a collection of values of X, from the training data. So using that guess for W and the values from your training data, you just propagate through. 00:40:15.000 --> 00:40:27.000 You'll plug in the values of X and the guess for W, and that will give you all of the values in your first hidden layer, which you can then use to go forward and get all the values in the second hidden layer, 00:40:27.000 --> 00:40:47.000 which you can then use to go forward and get the value for the estimate. 00:40:47.000 --> 00:40:59.000 So Clark's asked: how exactly is batching speeding things up? So, without batching, with regular gradient descent, you have to go through each observation in the training set one at a time. 00:40:59.000 --> 00:41:16.000 With batching, these calculations can be done either in a vectorized manner or in parallel, because the calculations 00:41:16.000 --> 00:41:23.000 for the different values of X don't depend on one another. Does that make sense? 00:41:23.000 --> 00:41:31.000 The prediction for observation one and the prediction for observation two are independent of one another if you're doing the forward step at the same time, 00:41:31.000 --> 00:41:43.000 so you can speed things up that way, instead of going through and doing one, and then later doing two, and then later doing three. 00:41:43.000 --> 00:41:50.000 Does that make sense? 00:41:50.000 --> 00:41:53.000 Great. 00:41:53.000 --> 00:42:03.000 Any other questions? 00:42:03.000 --> 00:42:15.000 Okay. So let's say you just wanna try something out in sklearn, and you don't wanna go through using Keras or PyTorch, which is a slightly longer process. 00:42:15.000 --> 00:42:27.000 Maybe you just wanna try out some ideas in sklearn — how can you build a neural network? So we're gonna use that MNIST data set we looked at in the previous notebook. 00:42:27.000 --> 00:42:40.000 To implement a multi-layer or feedforward neural network in sklearn, you use a model called the MLPClassifier. 00:42:40.000 --> 00:42:47.000 Alright, so sorry, before we keep going I want to answer this question. Yahweh is asking: why do we need the backward propagation? 00:42:47.000 --> 00:42:54.000 So, in order to fit the neural network, we need all these W's, 00:42:54.000 --> 00:43:02.000 and back propagation is the way that you implement gradient descent with a neural network. 00:43:02.000 --> 00:43:19.000 It's called backward propagation because you have this backward step: first you go forward and get estimates for all the h's and the y's, then you go backward using these estimates to give you the gradients — 00:43:19.000 --> 00:43:25.000 all these partial derivatives, which give you the gradient of C with respect to the weights.
00:43:25.000 --> 00:43:32.000 And then you use that gradient to get the updated guess for the W's. 00:43:32.000 --> 00:43:41.000 Alright. So in sklearn all this stuff is done with MLPClassifier for classification and MLPRegressor for regression. 00:43:41.000 --> 00:43:50.000 So from sklearn.neural 00:43:50.000 --> 00:44:03.000 _network, import MLPClassifier. The MLP stands for multi-layer perceptron, I believe. 00:44:03.000 --> 00:44:10.000 So we've got MLPClassifier. 00:44:10.000 --> 00:44:27.000 You can control the hidden layers with an argument called hidden_layer_sizes, and as an example we could do a single layer 00:44:27.000 --> 00:44:37.000 with 500 nodes. This would probably be way too big for this particular data set, so with that in mind, why don't we just go ahead and audible — 00:44:37.000 --> 00:44:45.000 I'm gonna audible and just do 50, just to make it smaller so it'll fit faster — and then you are going to increase the max_iter, 00:44:45.000 --> 00:44:49.000 so the maximum number of iterations. I'm just doing this to make sure it will fit; 00:44:49.000 --> 00:44:57.000 let's try 10,000. You might notice there's this weird trailing comma. 00:44:57.000 --> 00:45:08.000 If we go to the documentation, that's just how they have it set up. I don't necessarily know why they set it up like that, but I'm just following the documentation. 00:45:08.000 --> 00:45:12.000 If somebody does know why, feel free to put it in the chat so everybody else will know why. 00:45:12.000 --> 00:45:17.000 Okay, so this is our first one, and then as an example let's go ahead and do one with 2 hidden layers, and we'll cut it in half, 00:45:17.000 --> 00:45:29.000 with 25 and 25. 00:45:29.000 --> 00:45:34.000 Okay, so Zach is saying that the comma indicates it's a shape (a tuple), not an integer. 00:45:34.000 --> 00:45:39.000 Thank you, Zach. 00:45:39.000 --> 00:45:43.000 Okay, so we're fitting. So even a smaller one, you can see, takes a little bit — 00:45:43.000 --> 00:45:50.000 it does take a little bit. These models are typically slower than some of the other models we've looked at. 00:45:50.000 --> 00:46:00.000 I will say, this was still pretty fast in the scheme of things. There are times when you'll be working on a real model for a real problem where it could take hours or over a day to fit, 00:46:00.000 --> 00:46:06.000 so even if it takes a minute or two, it's not too slow. 00:46:06.000 --> 00:46:13.000 Okay. And then this is just showing off our accuracy — so training accuracies. 00:46:13.000 --> 00:46:20.000 And I guess I should change this, because I did edit my models. 00:46:20.000 --> 00:46:25.000 There we go. So this is just showing you the training set accuracy and a validation set accuracy.
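Putting the sklearn pieces just described together, a sketch might look like this; the hidden layer sizes and max_iter mirror the values chosen above, while the train/validation split and random_state are arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=216, stratify=y)

# one hidden layer with 50 nodes; note the trailing comma, since the argument is a tuple
mlp1 = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10000)

# two hidden layers with 25 nodes each
mlp2 = MLPClassifier(hidden_layer_sizes=(25, 25), max_iter=10000)

for mlp in (mlp1, mlp2):
    mlp.fit(X_train, y_train)
    print(mlp.score(X_train, y_train), mlp.score(X_val, y_val))  # training and validation accuracy
```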
00:46:25.000 --> 00:46:35.000 So the architecture that works best for you depends upon the problem you're dealing with. 00:46:35.000 --> 00:46:45.000 You'd probably want to do some sort of tuning process where you try different architectures and then compare the performance on a validation set or cross-validation. 00:46:45.000 --> 00:46:50.000 In the real world, your neural networks tend to take a long time to fit, 00:46:50.000 --> 00:46:59.000 so I think you'll typically use a validation set over cross-validation. But if you have the time to do a cross-validation, you could try different architectures and then see which one works. 00:46:59.000 --> 00:47:09.000 I think typically when you're working with real neural networks, people usually have a problem in mind — so maybe image classification — 00:47:09.000 --> 00:47:16.000 and you'll look up what people tend to do and then sort of follow that in terms of trying architectures. 00:47:16.000 --> 00:47:21.000 I don't think there's a good rule of thumb like "you always start with 10 and then go to 20"; 00:47:21.000 --> 00:47:25.000 I don't know that there's a good rule of thumb for neural network 00:47:25.000 --> 00:47:30.000 architectures in general. 00:47:30.000 --> 00:47:41.000 And then here, before I start answering some questions, I think I'm just showing off the multi-class confusion matrix, which I can then use — 00:47:41.000 --> 00:47:44.000 I think this is just maybe more in the context of this particular problem — 00:47:44.000 --> 00:47:59.000 to see: are the things that I'm getting wrong reasonable mistakes? So you could maybe see a 1 that looks close to a 2 and a 0 that looks close to a 9, 00:47:59.000 --> 00:48:07.000 and so it's not like I'm making outrageous mistakes. It seems like this is a pretty good classifier. 00:48:07.000 --> 00:48:11.000 Okay. 00:48:11.000 --> 00:48:21.000 So Yusuf has asked: when do we use a neural network? So, let's go back to the picture. 00:48:21.000 --> 00:48:27.000 Because you're doing a lot of weighted sums, you're going to have a lot of parameters, 00:48:27.000 --> 00:48:33.000 and so in general you want to have a lot of observations — a big training set, 00:48:33.000 --> 00:48:44.000 maybe thousands or tens of thousands, or if you can get it, hundreds of thousands to millions of observations — because you have a lot of weights that you're trying to fit. 00:48:44.000 --> 00:48:48.000 So it's very easy for the neural network to overfit. 00:48:48.000 --> 00:48:53.000 Neural networks get used a lot for problems like image classification — 00:48:53.000 --> 00:49:07.000 not this particular neural network, but neural networks in general. They get used a lot in NLP problems, but again, if you're working on an NLP problem with not very much data, 00:49:07.000 --> 00:49:12.000 you maybe don't want to use a neural network. But if you're working in industry 00:49:12.000 --> 00:49:17.000 for a big tech company, they'll have lots of data, so that's usually not a problem. 00:49:17.000 --> 00:49:23.000 Yeah. 00:49:23.000 --> 00:49:27.000 Okay, so up to this point we've had this sort of vague idea that you're using an activation function, 00:49:27.000 --> 00:49:35.000 so I wanted to go over some of the activation functions that get used, just so you're aware of them. 00:49:35.000 --> 00:49:50.000 The first one is just the identity function — so not a nonlinear one, but it does get used; in particular, with things like regression you might use an identity activation as your output function. 00:49:50.000 --> 00:50:06.000 Another one that gets used is the hyperbolic tangent.
I'm not 100% sure when this one gets used, but I do think I've seen it used in something called recurrent neural networks; it's a possible activation function that people will use. 00:50:06.000 --> 00:50:12.000 This one is called the rectified linear unit, or ReLU, and it gets used all the time from hidden layer to hidden layer. 00:50:12.000 --> 00:50:26.000 It takes the maximum between the input and 0. So if your input is less than 0, it will give you 0, 00:50:26.000 --> 00:50:32.000 and if it's 0 or higher, it will give you the value of the input. So this gets used a lot; 00:50:32.000 --> 00:50:40.000 it's like the default hidden-layer activation function. And these are the ones for sklearn. Okay, 00:50:40.000 --> 00:50:46.000 I missed this one — I knew I missed one. This one's called the logistic activation, or the sigmoid activation. 00:50:46.000 --> 00:50:58.000 This one gets used a lot as the output node activation function for binary classification. There's another one that gets used that we don't have for sklearn called the softmax; 00:50:58.000 --> 00:51:05.000 we'll talk about that in a later notebook. 00:51:05.000 --> 00:51:06.000 Yeah. 00:51:06.000 --> 00:51:13.000 Excuse me. So, the question is about the problem of differentiability with regard to the ReLU activation function. 00:51:13.000 --> 00:51:17.000 So they have — I don't know the particulars — but they have ways of dealing with the fact that it's not differentiable everywhere. 00:51:17.000 --> 00:51:35.000 Computationally, they have workarounds for that. I don't personally know what the workarounds are, but they do have them in the algorithms where they calculate the gradients. Yeah, so it's not a satisfactory answer, because I just don't know — 00:51:35.000 --> 00:51:41.000 Okay, thank you. 00:51:41.000 --> 00:51:50.000 — but yeah, I just never bothered to look it up.
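For reference, here are those activation functions written out as NumPy one-liners (just a sketch; sklearn applies its own implementations internally):

```python
import numpy as np

def identity(z):
    return z                      # e.g. as a regression output activation

def tanh(z):
    return np.tanh(z)             # hyperbolic tangent

def relu(z):
    return np.maximum(z, 0)       # rectified linear unit: max of the input and 0

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # logistic / sigmoid, common for binary classification outputs
```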
00:51:50.000 --> 00:51:56.000 So we'll end this notebook 00:51:56.000 --> 00:52:10.000 with this sort of nice result in theoretical deep learning. The feedforward neural network with a single hidden layer is a universal approximator, which means that it's been proven mathematically that a feedforward network with a single hidden layer containing a finite number of neurons — 00:52:10.000 --> 00:52:28.000 meaning a finite number of hidden layer nodes — can approximate continuous functions on compact subsets of R^n, under mild assumptions on the activation function. 00:52:28.000 --> 00:52:36.000 So basically what this is saying is that for reasonable problems — reasonable classification or regression problems you're trying to learn, 00:52:36.000 --> 00:52:52.000 where "reasonable" is this statement about compact subsets of R^n and mild assumptions on the activation function — you can approximate the true relationship as closely as you would like to, assuming you have enough training data 00:52:52.000 --> 00:53:05.000 and enough compute power. That's why it's called a universal approximator: it can approximate a large class of functions as well as you might like it to. 00:53:05.000 --> 00:53:19.000 However, the problem is that you probably will not have enough observations or enough compute power — it says a finite number of neurons, but finite can still be pretty darn big. 00:53:19.000 --> 00:53:28.000 And the bigger your hidden layer is, the harder it is computationally to fit it. So while this theorem exists, 00:53:28.000 --> 00:53:43.000 it's not necessarily helpful practically: you can't just keep throwing more and more nodes into your hidden layer, because there is a limit to how much your computer or GPU can handle. 00:53:43.000 --> 00:54:00.000 So this is where the idea for so-called deep learning comes from. The idea behind deep learning is: well, maybe instead of having a really large single hidden layer, I cut off some of the nodes from that and then add them on as a second layer. 00:54:00.000 --> 00:54:08.000 So you're trading the height of a single hidden layer for increased depth, and then hopefully getting similar results. 00:54:08.000 --> 00:54:28.000 In general it's been found that you can do this, and the desire to understand theorems like this — okay, can I get some sort of guarantee that if I replace a network with one hidden layer of such a height with, say, two hidden layers of such a height, what do 00:54:28.000 --> 00:54:36.000 I get with that? — that's sort of what the field of deep learning is concerned with. It's "deep" because you have more than a single hidden layer. 00:54:36.000 --> 00:54:41.000 So what are some deficiencies of feedforward neural networks? They can really easily overfit on the training data. 00:54:41.000 --> 00:54:51.000 There are some techniques you can use to control for this; there's a thing called a dropout layer that you can add. 00:54:51.000 --> 00:54:59.000 Gradients can explode or go to 0 quickly because of the chain rule. 00:54:59.000 --> 00:55:09.000 So if your neural network has too many hidden layers — remember what this chain rule is doing; let's go back to the example. 00:55:09.000 --> 00:55:24.000 The chain rule for getting the derivative with respect to w one had like 3 levels of multiplying, and so if all of these things are less than one you can quickly approach 0, and if all of these things are greater than one, you can quickly blow up. 00:55:24.000 --> 00:55:37.000 So that's something you need to be mindful of when you're adding hidden layers. Convergence can be slow and difficult, meaning you may have to train for a very long time to get a model that's any good. 00:55:37.000 --> 00:55:42.000 You can get stuck in local minima, but you can try using stochastic gradient descent. 00:55:42.000 --> 00:55:56.000 And a more practical deficiency: if you're someone who's just operating on a personal computer, like a laptop, and you don't have access to GPUs or servers, a neural network just might not be a practical choice for you. 00:55:56.000 --> 00:56:03.000 There are computational costs with these really powerful models, which is why you really only see these massive tech companies, or someone like OpenAI who has massive tech company funding, 00:56:03.000 --> 00:56:16.000 building really big models, because one, it takes a lot of time to get that data, and two, 00:56:16.000 --> 00:56:24.000 it costs a lot to train the networks. So here are some additional references you might want. 00:56:24.000 --> 00:56:30.000 There's this nice YouTube video series here; I think it's by 00:56:30.000 --> 00:56:34.000 3Blue1Brown, and it's like 4 YouTube videos
So some additional references you might want. 00:56:24.000 --> 00:56:30.000 So there's this nice YouTube video series here. 00:56:30.000 --> 00:56:34.000 It's by 3Blue1Brown, and it's like 4 YouTube videos 00:56:34.000 --> 00:56:49.000 about deep learning, sort of going over that kind of stuff, which is useful. There's a blog post that has a step-by-step example of fitting via backpropagation, which is nice. 00:56:49.000 --> 00:56:52.000 There's this book, which I've referred to earlier. Chapter 2 goes through feed forward networks. 00:56:52.000 --> 00:57:12.000 Or wait, is that what I wanted? Okay, this is a different book. And then there's the book I originally linked to, which will take a while to load because it's a PDF, but they have a chapter on feed forward networks, which is useful. 00:57:12.000 --> 00:57:23.000 Any questions about anything before we move on to Keras? 00:57:23.000 --> 00:57:28.000 So Yahoo is asking, if you need to switch to the other optimization, 00:57:28.000 --> 00:57:38.000 can it be done in the package in one step or with an argument? So I'm not quite sure what you mean with your question. 00:57:38.000 --> 00:57:40.000 So I guess we could look at the... Yeah. 00:57:40.000 --> 00:57:56.000 Sorry, I mean, can we do it, for example, with the other one, the stochastic 00:57:56.000 --> 00:58:01.000 gradient descent, you know, to optimize the weights? 00:58:01.000 --> 00:58:02.000 Okay. 00:58:02.000 --> 00:58:12.000 Okay. So typically you'll have to specify what algorithm you want to use for the optimization. 00:58:12.000 --> 00:58:19.000 So in scikit-learn, and I guess maybe this is also an answer to Zach's earlier question, 00:58:19.000 --> 00:58:22.000 it looks like they do have a version that uses a quasi-Newton method instead of stochastic gradient descent. 00:58:22.000 --> 00:58:27.000 But, for instance, you can set it to be just regular stochastic gradient descent. 00:58:27.000 --> 00:58:38.000 But these are set beforehand, and I think in general it'll be something you set before you start fitting. 00:58:38.000 --> 00:58:49.000 I guess you could in theory program a way to switch partway through if you see something, but in general I think 00:58:49.000 --> 00:58:58.000 you choose it before. So Pedro is asking, could you use cross-validation to see how many layers slash nodes is best, or is that too expensive computationally? 00:58:58.000 --> 00:59:06.000 So you very well could use cross-validation. If your neural network trains in a quick enough time, you could do that. 00:59:06.000 --> 00:59:14.000 But like you sort of alluded to in your question, if it's taking a really long time to train, like maybe a few hours, it might not be worth it to go through the process of training it 5 different times. 00:59:14.000 --> 00:59:34.000 So then you would use something like a validation set instead. And so you have the trade-off there of, okay, now you're really just maybe overfitting on this one particular set, but it is better than just using nothing. 00:59:34.000 --> 00:59:48.000 Yeah, any other questions about this before we move on to Keras? 00:59:48.000 --> 01:00:00.000 Okay. So actually let me take a drink of water.
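On the solver question from a moment ago, here is a rough sketch of how you pick the optimization algorithm in scikit-learn's MLPClassifier before fitting. The layer sizes and other settings here are just illustrative.

    from sklearn.neural_network import MLPClassifier

    # solver='sgd' is stochastic gradient descent, 'adam' is the default,
    # and 'lbfgs' is the quasi-Newton option; it is fixed before fitting.
    mlp = MLPClassifier(hidden_layer_sizes=(16, 16),
                        activation='relu',
                        solver='sgd',
                        max_iter=200,
                        random_state=216)
    # mlp.fit(X_train, y_train) would then train with that choice of algorithm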
01:00:00.000 --> 01:00:17.000 So Keras is a nice package for doing neural network building in Python, more general neural networks than what you can do with scikit-learn. scikit-learn might be nice for an initial probe, but if you're looking to seriously build a neural network, 01:00:17.000 --> 01:00:26.000 you probably want a package that specifically does that. So the 2 that are very popular are Keras/TensorFlow 01:00:26.000 --> 01:00:29.000 (Keras is sort of a sub-package within TensorFlow), and then the other one is PyTorch. 01:00:29.000 --> 01:00:40.000 We won't be learning PyTorch because this is our last day, but, you know, I do have a book reference that you can use if you'd like to learn PyTorch. 01:00:40.000 --> 01:00:48.000 I think which one you're gonna want to use depends on the problem you're working on and where you're working. 01:00:48.000 --> 01:00:54.000 So some places might prefer you use TensorFlow and Keras. Other places might prefer you use PyTorch. 01:00:54.000 --> 01:01:08.000 It really just depends on the team and the problem you're working on. So Keras is a deep learning API in Python that runs on top of TensorFlow. 01:01:08.000 --> 01:01:12.000 And what's really nice about Keras is you kind of just say, alright, I want a dense layer with this many nodes, and then it does it. 01:01:12.000 --> 01:01:30.000 Whereas in TensorFlow (I've never tried to use TensorFlow directly, but I did look into it in a book) I think you have to do much more fine-grained construction, whereas Keras is sort of built on top of TensorFlow with 01:01:30.000 --> 01:01:36.000 a user in mind who wants to build a dense neural network with this many nodes, and so then the code uses Dense with a number of nodes. 01:01:36.000 --> 01:01:44.000 So it's very likely, if you have not installed it before, that you don't have Keras installed. 01:01:44.000 --> 01:01:51.000 So one thing you can try to do is import Keras and then check your version. 01:01:51.000 --> 01:02:06.000 If this does not work, you may also want to try the following. Because Keras is a part of TensorFlow, you may want to try doing from tensorflow import keras. 01:02:06.000 --> 01:02:11.000 So if importing keras directly doesn't work, try doing from tensorflow import keras. 01:02:11.000 --> 01:02:20.000 And if that doesn't work, you're probably going to need to install it. So you could install it directly using either pip or conda. 01:02:20.000 --> 01:02:26.000 If that doesn't work, you may need to install TensorFlow using pip or conda. 01:02:26.000 --> 01:02:33.000 And then if that doesn't work, it's possible that it's because you either have an Apple M1 chip or an Apple M2 chip. 01:02:33.000 --> 01:02:41.000 So again, I think maybe with Apple M1 things are fine now, but if you have an Apple M2 chip, it's possible that they're not fine for you. 01:02:41.000 --> 01:02:50.000 So this link maybe isn't gonna help anymore, but you'll have to do a web search along the lines of, I have an Apple M2 computer, what do I do? 01:02:50.000 --> 01:03:01.000 Yeah. Okay. So once we have Keras, we're basically gonna do exactly what we did with the scikit-learn notebook, but in Keras. 01:03:01.000 --> 01:03:05.000 So we're going to use the Keras version of the data. So remember, this is 60,000 observations of a 28 by 28 pixelated image.
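Here is a minimal sketch of the import check and the Keras copy of the data described above. The variable names are illustrative.

    # If `import keras` fails on its own, Keras also ships inside TensorFlow.
    from tensorflow import keras
    print(keras.__version__)

    # The Keras copy of the MNIST digits: 60,000 training images and 10,000
    # test images, each a 28 by 28 grid of pixel values.
    (X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
    print(X_train.shape)   # (60000, 28, 28)
    print(y_train.shape)   # (60000,)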
01:03:05.000 --> 01:03:19.000 So because this is in a grid, in order to put it in as the input for our feed forward network, let's go back to that picture. 01:03:19.000 --> 01:03:33.000 The feed forward networks are expecting basically a regular vector as opposed to a 2D grid, a whatever by whatever grid of values. 01:03:33.000 --> 01:03:40.000 So the first thing we have to do is reshape the images. So we're going to reshape them using reshape. 01:03:40.000 --> 01:03:51.000 We're gonna set the row argument to negative one so that the 60,000 will carry over, and then we're setting the next argument to 28 times 28 because that will give us the number of columns we need. 01:03:51.000 --> 01:04:02.000 So instead of being arranged in a grid, each image is going to be arranged just as a row vector. 01:04:02.000 --> 01:04:05.000 Then the next thing we need to do is scale the data, which I think I did in the previous notebook, but I just forgot to say it. 01:04:05.000 --> 01:04:15.000 So scaling the data for the images is just dividing by the maximum pixel value. 01:04:15.000 --> 01:04:20.000 It's common in a lot of image problems for the maximum pixel value to be 255. 01:04:20.000 --> 01:04:27.000 And so this will scale it so that the values go from 0 to one. The minimum is already 0, 01:04:27.000 --> 01:04:34.000 so we don't have to do anything for that. But it's standard with images to scale them by doing a min-max scaling, where 0 goes to 0 and the maximum goes to one in a linear way. 01:04:34.000 --> 01:04:46.000 So this is what you'll do instead of StandardScaler for image data. 01:04:46.000 --> 01:04:59.000 Okay, so before we dive into building the networks, are there any questions just about the data? 01:04:59.000 --> 01:05:02.000 Please, can you explain the reshape part again? 01:05:02.000 --> 01:05:08.000 Yeah, so. 01:05:08.000 --> 01:05:13.000 This is what the first entry of X train looks like. 01:05:13.000 --> 01:05:20.000 So it's this grid of pixel values. And so if we go back, it's like one of these. 01:05:20.000 --> 01:05:30.000 Right? And the way it's imported from Keras, it's in this grid already, like rows and columns of pixel values. 01:05:30.000 --> 01:05:37.000 But if you remember from our setup, feed-forward networks aren't built to take that in. 01:05:37.000 --> 01:05:43.000 So you need to go back to the more traditional matrix X, where you have rows and columns. 01:05:43.000 --> 01:05:52.000 And so we are going to do that. Let me minimize that. There we go. We're going to do that with this line that calls reshape. 01:05:52.000 --> 01:05:58.000 So normally we would do something like reshape negative 1 comma 1, right? 01:05:58.000 --> 01:06:06.000 The negative one is just saying, keep however many rows we already have, and for us it's going to be 60,000 for the training set and 10,000 for the test set. 01:06:06.000 --> 01:06:19.000 Reshape will take care of that. And then the 28 times 28 takes this grid, which is 28 by 28, and just turns it into that number of columns. 01:06:19.000 --> 01:06:26.000 And so whereas before we had 60,000 by 28 by 28, now we have 60,000 by 784. 01:06:26.000 --> 01:06:38.000 And if we wanted to look at this as an observation, so X train at 0, now we can see it's like a row vector instead. 01:06:38.000 --> 01:06:39.000 Thank you. 01:06:39.000 --> 01:06:50.000 Yeah. Okay.
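Putting the reshape and scaling steps together, a minimal sketch, continuing from the arrays loaded above:

    # Flatten each 28x28 grid into a length-784 row vector; the -1 lets
    # reshape keep however many rows there already are (60,000 / 10,000).
    X_train = X_train.reshape(-1, 28 * 28)
    X_test = X_test.reshape(-1, 28 * 28)

    # Min-max scale: pixels run from 0 to 255, so dividing by 255 maps them to [0, 1].
    X_train = X_train / 255
    X_test = X_test / 255

    print(X_train.shape)   # now (60000, 784)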
So now we're going to import everything we need to build the network in Keras. 01:06:50.000 --> 01:07:01.000 So we have all these things. One thing I'll point out is, if you're running an older version of Keras, this particular code, the to_categorical import, may not work for you. 01:07:01.000 --> 01:07:12.000 So try uncommenting this instead. It used to be that to_categorical was stored somewhere else in an earlier version. 01:07:12.000 --> 01:07:18.000 So if you have an earlier version and this isn't working for you, try uncommenting this and then hopefully it works. 01:07:18.000 --> 01:07:27.000 And if it still doesn't work, try looking up the documentation for your particular version. So unlike scikit-learn, which I've found to have been pretty stable over the past few years, 01:07:27.000 --> 01:07:36.000 Keras has tended to make 2 or 3 different changes that move the location of things like to_categorical, which gets annoying if you're trying to teach. 01:07:36.000 --> 01:07:56.000 Okay. So the first thing we need to make is an empty model object. The way to make an empty model object is you call models, which is the sub-package that has all the different types of models you can make, and then you do Sequential. 01:07:56.000 --> 01:08:08.000 It's called Sequential because we're going to be adding layers in sequence. So now we have an empty model object that's all ready to go, stored in a variable that I have called model. 01:08:08.000 --> 01:08:27.000 This is the architecture of the model we are going to be building. So we have an input layer that has 28 times 28 nodes (when I say 28 by 28 here I mean multiplying, so the 700-something nodes) that feeds into the first hidden layer, which will have 16 01:08:27.000 --> 01:08:34.000 nodes with a ReLU activation. The second hidden layer will also use a ReLU activation for its 16 nodes. And then at the end 01:08:34.000 --> 01:08:45.000 we have something called the softmax activation for 10 nodes. Each of these nodes is modeling the probability that your observation is that digit. 01:08:45.000 --> 01:08:57.000 So this one will be the probability your observation is a 0, the probability your observation is a one, the probability your observation is a 2, and so forth. And that is done through the softmax activation. 01:08:57.000 --> 01:09:03.000 We don't explicitly define that in this notebook, but if you go to the neural network practice problems, I go through what the softmax is. 01:09:03.000 --> 01:09:15.000 It's basically e to the something divided by the sum of e to the somethings, 01:09:15.000 --> 01:09:22.000 so that each output is a value between 0 and one and they all add up to one. 01:09:22.000 --> 01:09:26.000 So there's a question asking whether the softmax could have been a sigmoid. So sigmoid is for when you have binary classification, 01:09:26.000 --> 01:09:37.000 which is what we had if you did the problem session earlier. Softmax is for multi-class classification. 01:09:37.000 --> 01:09:48.000 Here we have 10 possible classes, so that's why we're using softmax.
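As a small aside, here is what that softmax computation looks like in NumPy. This is a sketch, not the notebook's code.

    import numpy as np

    def softmax(z):
        # exponentiate each entry, then divide by the sum, so the outputs are
        # between 0 and 1 and add up to 1
        e = np.exp(z - np.max(z))   # subtracting the max is a standard numerical stability trick
        return e / e.sum()

    scores = np.array([1.0, 2.0, 0.5])  # made-up output-layer weighted sums
    print(softmax(scores))              # roughly [0.23, 0.63, 0.14], sums to 1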
01:09:48.000 --> 01:09:55.000 Okay, so now we're going to make a code chunk. 01:09:55.000 --> 01:10:07.000 You only want to run this code chunk once. So basically the way Keras works is, anytime you add a layer to the model, if you rerun that code it's gonna think, okay, this person wants to add another layer. 01:10:07.000 --> 01:10:19.000 And so if you go through the process of just experimenting and rerunning the same code chunk over and over and over again, you'll quickly end up with a neural network that has a huge hodgepodge of different layers that you didn't want and wouldn't work. 01:10:19.000 --> 01:10:26.000 So when you're adding layers to your model, you want to make sure you're being really careful and only adding the layers you want to add. 01:10:26.000 --> 01:10:27.000 To add a layer, you call the variable that is storing the model. So you'll do model 01:10:27.000 --> 01:10:38.000 dot add. Then I'm going to call, is layers a thing? 01:10:38.000 --> 01:10:43.000 Yes, layers dot Dense. And let's just double check. 01:10:43.000 --> 01:10:54.000 Yes, okay, good. So the first argument to Dense is the number of nodes. So remember I said feed forward, multi-layer, dense, 01:10:54.000 --> 01:11:02.000 it's all the same type. It just means that we want a layer that has every node feeding into every node of the next layer. 01:11:02.000 --> 01:11:14.000 Okay, so in Dense, we put in the number of nodes you want, which is 16. You put in the activation function you want, which is ReLU, which I think might be the default, but I think it's always good to just state it. 01:11:14.000 --> 01:11:19.000 Now for the first layer only, the first dense layer, you need to say how many nodes it's expecting to receive. 01:11:19.000 --> 01:11:27.000 So you need to put in an argument called input... 01:11:27.000 --> 01:11:35.000 Let's see. Oh, what is the name of it? I think it's just called input_shape. 01:11:35.000 --> 01:11:43.000 And then this is going to be a tuple, and then what you'll typically do is 01:11:43.000 --> 01:11:48.000 X dot shape, 01:11:48.000 --> 01:11:54.000 X train dot shape. Okay, I'm gonna cheat just because I'm having a brain 01:11:54.000 --> 01:12:07.000 slip. I don't wanna get it wrong. Okay, there we go. 01:12:07.000 --> 01:12:12.000 So it's going to be the number of nodes in the input layer followed by a comma, 01:12:12.000 --> 01:12:17.000 following the same logic from our scikit-learn discussion earlier. You have to specify this for the first layer because it tells the neural network what to expect. 01:12:17.000 --> 01:12:26.000 So it's saying, alright, this is what you should expect, this is the number of input nodes. 01:12:26.000 --> 01:12:34.000 Now for the second layer, it's the same exact thing with one difference. So, layers dot Dense. 01:12:34.000 --> 01:12:41.000 And if we had a different number of nodes, it wouldn't be the same exact thing, but we do have the same number, so we put a 16 there. 01:12:41.000 --> 01:12:48.000 And then activation equals ReLU. Now the one difference here is that we do not have to put an input shape. 01:12:48.000 --> 01:12:54.000 Once you have filled in a layer, Keras will already be able to infer, okay, the previous layer was this hidden layer with 16 nodes, and so that's what goes here. 01:12:54.000 --> 01:13:04.000 So it doesn't need the input shape anymore. Now the final layer we need to add 01:13:04.000 --> 01:13:15.000 is also a dense layer, and then it's going to be the number of output nodes, which is 10. 01:13:15.000 --> 01:13:21.000 And the activation here is not going to be ReLU. It's going to be softmax. 01:13:21.000 --> 01:13:28.000 Okay. Alright.
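Putting those steps together, here is a sketch of the model construction as described, using the tensorflow.keras imports. Newer Keras releases may prefer an explicit Input layer, so treat this as illustrative of the pattern rather than the notebook's exact code.

    from tensorflow.keras import models, layers

    # empty model object that we add layers to in sequence
    model = models.Sequential()

    # first hidden layer: 16 nodes, ReLU; only this layer needs input_shape,
    # a tuple with the number of input nodes (28*28 = 784) followed by a comma
    model.add(layers.Dense(16, activation='relu', input_shape=(28 * 28,)))

    # second hidden layer: same thing, but Keras infers the input size now
    model.add(layers.Dense(16, activation='relu'))

    # output layer: 10 nodes (one per digit) with the softmax activation
    model.add(layers.Dense(10, activation='softmax'))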
And so here's a useful thing that I don't think I show in this notebook yet. 01:13:28.000 --> 01:13:37.000 There's this model dot summary. When you call this, it gives you a nice little summary table about your model. 01:13:37.000 --> 01:13:43.000 It allows you to see, you know, what type of layer you have and the output shape of that layer, 01:13:43.000 --> 01:13:48.000 the output shape being that number you specified before, and then the number of parameters you're fitting. 01:13:48.000 --> 01:14:00.000 So you can see how this grows very quickly. We have, let's see, almost 13,000 parameters with what seems to be a pretty small model. 01:14:00.000 --> 01:14:03.000 Okay. 01:14:03.000 --> 01:14:18.000 Okay, so I do see we have a couple of questions. Yahoo is asking, is the softmax activation something like the following: all data are categorized based on probability bins? So for the softmax function, 01:14:18.000 --> 01:14:43.000 maybe instead of just telling you, we'll go look. 01:14:43.000 --> 01:14:55.000 The i-th entry of that function is going to be e to the z_i, where z_i is the i-th entry of z, divided by the sum from j equals one to K of e to the z_j. 01:14:55.000 --> 01:15:04.000 So for this neural network, it's going to be, 01:15:04.000 --> 01:15:12.000 like, here it would be e to the weighted sum of these h's, divided by the sum of those e's. 01:15:12.000 --> 01:15:21.000 So this is the softmax. Basically it's just a way of trying to get a probability 01:15:21.000 --> 01:15:26.000 from your weighted sums. 01:15:26.000 --> 01:15:29.000 Alright, we have another question, asking, is choosing the number of layers essentially guess and check, like it is for the number of nodes? 01:15:29.000 --> 01:15:39.000 Yeah, kind of. So one thing you might do is sort of use that comment on the universal approximator, where you might try, okay, this was one big layer, 01:15:39.000 --> 01:15:51.000 what if I tried 2 smaller layers, that sort of thing. And then you might try keeping an eye on the number of parameters. 01:15:51.000 --> 01:16:03.000 So maybe if you have 2 small layers, just seeing, as you do the different layer trade-offs, that gives you different numbers of parameters. 01:16:03.000 --> 01:16:17.000 I guess, from what I've seen, and it's not like I'm actively looking, I have not seen anything like a rule of thumb of always trying these sorts of layers and these sorts of numbers of nodes. 01:16:17.000 --> 01:16:28.000 I think it is typically field dependent. So image classification has good rules of thumb 01:16:28.000 --> 01:16:40.000 because that is a really active area of development for tech companies. So you just want to do a sort of dive into the literature of what exists. 01:16:40.000 --> 01:16:49.000 If it's a problem that isn't super well researched, you're probably just gonna be guessing and checking with cross-validation or a validation set.
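For reference on keeping an eye on the parameter count, the "almost 13,000 parameters" in the summary above can be checked by hand: a dense layer has (inputs times outputs) weights plus one bias per output node.

    # parameter count for the 784 -> 16 -> 16 -> 10 architecture above
    layer_1 = 784 * 16 + 16   # 12,560
    layer_2 = 16 * 16 + 16    #    272
    layer_3 = 16 * 10 + 10    #    170
    print(layer_1 + layer_2 + layer_3)   # 13,002 trainable parameters in total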
01:16:49.000 --> 01:16:57.000 Okay, so we have our model, it's all built. The next thing we need to do before we can fit it is what's called compiling the model. 01:16:57.000 --> 01:17:05.000 So this is a step where we're going to choose an optimizer, which is just the algorithm used to run gradient descent, 01:17:05.000 --> 01:17:16.000 a loss function, which is the loss function used in that gradient descent, and then a metric, which is just going to be some sort of metric that the model will keep track of for us as we go through. 01:17:16.000 --> 01:17:26.000 For us, in this example, we're going to use RMSprop as our algorithm for gradient descent, or just the fitting algorithm. 01:17:26.000 --> 01:17:37.000 You could go to the Keras documentation to see what other ones are available and what the different trade-offs are. 01:17:37.000 --> 01:17:44.000 We're going to use a loss function called categorical cross entropy. 01:17:44.000 --> 01:17:47.000 This is a cross entropy for multi-class problems. 01:17:47.000 --> 01:17:54.000 And then the thing we're going to keep track of is accuracy. 01:17:54.000 --> 01:18:00.000 Okay, so just like I've been saying how it can take a long time to fit a model, I'm just gonna make a validation set. 01:18:00.000 --> 01:18:15.000 And I wanna do a quick aside on something called to_categorical. If you have a multi-class classification problem, you first need to transform the training data and the validation data using a function called to_categorical. 01:18:15.000 --> 01:18:26.000 So this is the original training data for the y. It's got the numbers: the number 5 is the first observation, followed by the number 0, followed by the number 4, and so forth. 01:18:26.000 --> 01:18:33.000 What to_categorical does is take that and turn it into a 2D array. 01:18:33.000 --> 01:18:39.000 Each row represents one of our observations. So this first row would be the one that is the number 5. 01:18:39.000 --> 01:18:42.000 The second row is the one that is 0, and so forth. And then each of the columns represents whether or not this observation is of that class. 01:18:42.000 --> 01:18:56.000 So for instance, we can't see the one for the 5 because it's in the dot dot dot region, but thankfully for the second row 01:18:56.000 --> 01:19:03.000 our observation is a 0, so you can see a one there, and then there would be zeros everywhere else. 01:19:03.000 --> 01:19:12.000 And then for the last observation, it's an 8, so you can see a one in the position that would correspond to 8, the second to last, and then zeros everywhere else. 01:19:12.000 --> 01:19:14.000 So that's what to_categorical does. This is something we have to do for Keras; that's what Keras is expecting to receive, 01:19:14.000 --> 01:19:25.000 kind of like how we had to use reshape for scikit-learn when we were fitting something with one dimension. 01:19:25.000 --> 01:19:32.000 That's why we have to use to_categorical here for Keras, because that's what it's expecting to get. 01:19:32.000 --> 01:19:38.000 Okay, so when we call fit on the model, we specify the number of epochs and the batch size, so by default it does batch gradient descent. 01:19:38.000 --> 01:19:49.000 So I'm gonna specify the number of epochs here as a hundred, and I'm storing it in a variable because I'm gonna use it later. 01:19:49.000 --> 01:19:55.000 My batch size is 512. I don't have a particular reason; this is just coming from the book I read when I was learning it. 01:19:55.000 --> 01:19:59.000 They said, we're gonna use a batch size of 512, so that's what I did.
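Here is a rough sketch of the to_categorical transformation and the compile step described above, continuing from the model and arrays built earlier.

    from tensorflow.keras.utils import to_categorical

    # one-hot style encoding: one row per observation, one 0/1 column per digit
    print(y_train[:3])                   # [5 0 4]
    y_train_cat = to_categorical(y_train)
    print(y_train_cat.shape)             # (60000, 10)

    # choose the optimizer, the loss function, and a metric to track
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])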
01:19:59.000 --> 01:20:08.000 And now I'm gonna call model dot fit. So the first argument is X train, followed by y train, 01:20:08.000 --> 01:20:15.000 followed by the number of epochs, which I'm just gonna copy and paste, 01:20:15.000 --> 01:20:23.000 followed by the batch size. 01:20:23.000 --> 01:20:31.000 And then the last argument you can give it is the validation set. 01:20:31.000 --> 01:20:36.000 And then let's just double check if this is a list, since we're running out of time. 01:20:36.000 --> 01:20:46.000 A tuple. After that big old spiel about to_categorical, I forgot it. So to_categorical, 01:20:46.000 --> 01:20:57.000 and then we'll put in a tuple here. So you can put in a tuple of validation data, so we'll do that, and then to_categorical of 01:20:57.000 --> 01:21:06.000 y. And of course I had an error. 01:21:06.000 --> 01:21:14.000 It's just epochs, not n_epochs. There we go. Okay. 01:21:14.000 --> 01:21:18.000 So here it is fitting, and a nice thing about Keras is it gives you a little progress report so you can see where it is. 01:21:18.000 --> 01:21:31.000 So this indicates that this is the training progress on the third epoch, and then it gives you this little progress bar that will fill up as the training goes. 01:21:31.000 --> 01:21:38.000 And then once the training's done, it gives you a little summary. So this is the accuracy and the loss on the training set, 01:21:38.000 --> 01:21:43.000 and then you have the validation loss and the validation accuracy as well. So this is a nice thing. 01:21:43.000 --> 01:21:44.000 It gives you some peace of mind of, okay, you know, it's still training. 01:21:44.000 --> 01:21:56.000 So that's nice. And you could set it up so that if you were running this on a server, all this would print out to some sort of log file that you could periodically check. 01:21:56.000 --> 01:22:02.000 Okay, so we've got, so... someone is just making a comment that Adam is also common. 01:22:02.000 --> 01:22:16.000 Yep, so that was what is used by default by scikit-learn, right? And then Yahoo is asking, is to_categorical something like the one hot encoding we learned before? 01:22:16.000 --> 01:22:26.000 Kind of, yes. So it's gonna basically go through and see all the possible classes. I think you would need it to be in this sort of numeric form already, 01:22:26.000 --> 01:22:30.000 so I don't know that you could put in strings. I think it needs to be integers. 01:22:30.000 --> 01:22:32.000 And then it will recognize the integers, arrange them accordingly, and then give the zeros and ones. 01:22:32.000 --> 01:22:45.000 Unlike before, though, we want to keep all of our columns and not do K minus one. 01:22:45.000 --> 01:22:56.000 Okay, so notice here that when I fit my model, I stored the output in something called history. And so we can look at history, 01:22:56.000 --> 01:23:05.000 and we can see it's this weird-looking thing. But the key is that history has this attribute, itself called history, 01:23:05.000 --> 01:23:12.000 which is a dictionary. And this dictionary contains the training loss, 01:23:12.000 --> 01:23:16.000 the training accuracy, 01:23:16.000 --> 01:23:19.000 the validation loss, 01:23:19.000 --> 01:23:27.000 and the validation accuracy. And so then we can go through. I guess I wanna store this as a variable, 01:23:27.000 --> 01:23:35.000 history_dict, for dictionary. So then you can use this to plot the accuracy 01:23:35.000 --> 01:23:39.000 and the validation accuracy as time goes on.
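A sketch of the fit call and the history plot just described. X_val and y_val stand for whatever validation set you carved off earlier; the names are illustrative.

    import matplotlib.pyplot as plt
    from tensorflow.keras.utils import to_categorical

    n_epochs = 100
    history = model.fit(X_train, to_categorical(y_train),
                        epochs=n_epochs,
                        batch_size=512,
                        validation_data=(X_val, to_categorical(y_val)))

    # history.history is a dictionary of per-epoch metrics; older Keras
    # versions may use the keys 'acc' / 'val_acc' instead
    history_dict = history.history
    plt.plot(history_dict['accuracy'], label='training accuracy')
    plt.plot(history_dict['val_accuracy'], label='validation accuracy')
    plt.xlabel('epoch')
    plt.legend()
    plt.show()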
So here it looks like they're virtually identical. 01:23:39.000 --> 01:24:00.000 But what you'll tend to see is, if you train for too long, the training set performance will continue to get better, because if you keep training and your network is large enough, you can basically learn the training set perfectly, but typically when that starts to happen, the 01:24:00.000 --> 01:24:08.000 validation set performance starts to decrease. It doesn't really happen in this picture, but it does in general happen. 01:24:08.000 --> 01:24:10.000 And so typically what you want to do is find the place where the validation performance starts to level off, 01:24:10.000 --> 01:24:29.000 and then that's the number of epochs you'd wanna use for your model. And so after you build one network, sort of going off of some of the questions we've had of how do you choose a number of layers and a layer size, 01:24:29.000 --> 01:24:40.000 this is like tuning the model architecture. You'll try a different one. So here we've got the same exact model, but now with 32 nodes in each hidden layer. 01:24:40.000 --> 01:24:45.000 And so this will train. 01:24:45.000 --> 01:24:59.000 And when it's done, we can compare the validation performances for both models. 01:24:59.000 --> 01:25:05.000 Okay, so we've got neural network one, which was the 16 by 16, and neural network 2, which was the 32 by 32. 01:25:05.000 --> 01:25:16.000 And so we can see that neural network 2 trains faster and kinda levels off, whereas neural network one takes a little bit longer to train 01:25:16.000 --> 01:25:32.000 but does seem to maybe, over time, start to outperform. I would maybe want to go a little bit longer for both of them to see if that pattern continues. 01:25:32.000 --> 01:25:37.000 So it looks like here... 01:25:37.000 --> 01:25:44.000 That's funny. I don't know. I think when I did it earlier, when I was writing the notebook, there was a little bit of randomness here, right? 01:25:44.000 --> 01:25:45.000 Because you're choosing random weights and stuff like that, and the 32 by 32 outperformed it when I wrote it originally. 01:25:45.000 --> 01:25:55.000 So basically that's what you'd do. So imagine that we're in a world where we're done. 01:25:55.000 --> 01:26:05.000 I would look and say, okay, I found the one which performed better, and then I would go back and retrain it for the number of epochs which gave me the best validation performance. 01:26:05.000 --> 01:26:12.000 So for us, that was 30 on the second model. And so that's what I did here. 01:26:12.000 --> 01:26:19.000 And then when it's all done, I can make predictions, so I can do dot predict on 01:26:19.000 --> 01:26:22.000 X 01:26:22.000 --> 01:26:29.000 train at, let's say, 18. 01:26:29.000 --> 01:26:32.000 Oh no. 01:26:32.000 --> 01:26:38.000 Okay, so let's just not do that part. 01:26:38.000 --> 01:26:45.000 And then I could subset it for the eighteenth row. Okay, so this is the prediction 01:26:45.000 --> 01:26:53.000 for that observation. And then I could compare it. Let's change this to 18. 01:26:53.000 --> 01:26:58.000 So it looks like a one, and if we go here, the value for one isn't actually particularly high. 01:26:58.000 --> 01:27:16.000 Looks like the highest one is for, 9, 8, 7... 6, if I'm reading it correctly, so what you could do is use NumPy's argmax.
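A sketch of the prediction step being described; the row index and variable names are illustrative.

    import numpy as np
    from sklearn.metrics import accuracy_score

    # predicted class probabilities for one image, as a length-10 row
    probs = model.predict(X_train[[18]])
    print(probs.round(3))

    # np.argmax picks out the position (digit) with the highest probability
    print(np.argmax(probs))   # predicted digit for that row
    print(y_train[18])        # actual label to compare against

    # and across a whole set you can score it with scikit-learn's accuracy_score
    preds = np.argmax(model.predict(X_train), axis=1)
    print(accuracy_score(y_train, preds))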
01:27:16.000 --> 01:27:22.000 And this will say, yeah, that the most likely predicted value from here is a 6. So this is an incorrect prediction. 01:27:22.000 --> 01:27:28.000 You know why? Because I'm looking at different sets. So let's change this to y train. 01:27:28.000 --> 01:27:37.000 There we go. And then what you can do is use argmax to get all the predicted values, 01:27:37.000 --> 01:27:51.000 and you could get the validation score on your final model using just the regular accuracy score from scikit-learn. Okay, so before I pause for questions, I wanted to give 01:27:51.000 --> 01:27:57.000 a brief outline of what the rest of the neural network content is. So notebook number 5 gives an introduction to 01:27:57.000 --> 01:28:13.000 convolutional neural networks, which are used in image classification and video tasks. Notebook number 6 gives a very brief introduction to recurrent neural networks, which are used for sequential data like time series and NLP data. 01:28:13.000 --> 01:28:22.000 Notebook number 7 talks about what you'll typically do because these models take so long to train, which is that you'll train the model and then save it, 01:28:22.000 --> 01:28:31.000 and so it teaches you how to load the model that you've saved. And then the one I want to end on is this future directions notebook. 01:28:31.000 --> 01:28:38.000 So here are some good places to keep learning. Theoretically, you could get pretty far with that book I've linked to before, Neural Networks and Deep Learning. 01:28:38.000 --> 01:28:49.000 I don't think it covers some of the most recent developments, like transformers and autoencoders and stuff, but it's a good foundational book. 01:28:49.000 --> 01:28:56.000 For applied stuff, if you want to learn how to do things in Keras, there's Deep Learning with Python, which is what I used. 01:28:56.000 --> 01:29:05.000 Another good, really general purpose machine learning book is this one, which will load soon. 01:29:05.000 --> 01:29:16.000 It's called Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. They have a third edition that came out this past year that you might be interested in. 01:29:16.000 --> 01:29:29.000 If you were looking to buy a book on machine learning in Python, I would suggest buying this one. Even though they have the second edition online for free, I would say it's worth the purchase. 01:29:29.000 --> 01:29:34.000 This is like the best book I've gotten on how to do stuff in Python for machine learning. 01:29:34.000 --> 01:29:44.000 And then if you want to learn PyTorch, here's a nice one. I haven't gone through this book, but it's from the same publisher that did the Keras book that I like. 01:29:44.000 --> 01:29:53.000 So this is a book you can use for learning PyTorch, which some people have said is starting to gain a little bit more popularity than Keras, so it might be worth looking into. 01:29:53.000 --> 01:30:05.000 Okay, so that's it. I will stop the recording. Just as a quick last day thing, thanks so much for coming to all the lectures. 01:30:05.000 --> 01:30:09.000 If you're watching them asynchronously, thanks so much for watching them, and I've enjoyed being the lecturer. 01:30:09.000 --> 01:30:18.000 And yeah, so, you know, go out and learn data sciencey stuff and do data science.