Introduction to Convolutional Neural Networks I
Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we're going to continue learning about neural networks. In particular, this is the first in a two-part series introducing convolutional neural networks, so this will be a different neural network type that we're learning about in these two videos. Let me go ahead and share my Jupyter notebook and we'll get started.

In this notebook, we're going to introduce the general theoretical idea behind convolutional neural networks. Alongside that, we will briefly touch on the mathematics going on behind the scenes. The notation is a little more complicated, but I do provide references at the end for those of you interested in those details. I will also show you, I believe in the next video (part two of this convolutional neural network series), how to build one of these in Keras using that MNIST data set we talked about. And at the end we will point you to additional resources where you can take next steps in learning more about convolutional neural networks.

So convolutional neural networks are designed to address a deficiency with straightforward deep networks, or feed forward networks as we called them: some types of data are structured in a way that there is a kind of dependence. One such case is grid based data. The idea with grid based data is that the value in the grid cell to the right, to the left, below, above, or diagonal from your current cell is likely related to the value in your current cell. In a feed forward network, we're implicitly making the assumption that all of the features are independent of one another. In a convolutional neural network, we instead assume that, in grid based data, adjacent grid spots may be related to the current value, in this very specific dependence structure. So the idea behind a convolutional neural network is that we capture this dependence structure by paying attention to small regions, local areas of the grid, at a time, whereas the feed forward network looks at all of the grid at once.

And what do we mean? Let's say, for instance, we have this image of a hand drawn five. A feed forward network would break down all of the pixel values, arrange them in a row vector or a column vector depending on what we're doing, and then feed all of them into the hidden layers with a weighted sum all at once. So the weighted sum of the feed forward network includes every single pixel value, every single grid value, at once, and that's what feeds into your hidden layers. That's how feed forward networks look at all of the data at once, as opposed to local chunks.

Convolutional neural networks look at the data in a different manner: they look at local chunks. And what do we mean by that? Well, we'll give you a sense with these images, then a better sense with some math. By local chunks, we mean that convolutional neural networks draw smaller grids on top of the data grid that look at a smaller portion of the image.
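To make that contrast concrete, here's a minimal NumPy sketch of the two viewpoints, with a made-up random array standing in for the hand drawn five; the names and sizes here are just for illustration, not anything from the notebook.

```python
import numpy as np

# A made-up 28x28 grayscale image (random noise standing in for a hand drawn digit)
image = np.random.randint(0, 256, size=(28, 28))

# Feed forward view: flatten every pixel into one long vector,
# then take one weighted sum over all 784 values at once
flat = image.reshape(-1)            # shape (784,)
weights = np.random.rand(784)       # one weight per pixel
big_weighted_sum = weights @ flat   # a single number built from ALL pixels

# Convolutional view: only look at one small local chunk at a time,
# e.g. the 3x3 patch in the upper left hand corner
patch = image[0:3, 0:3]             # shape (3, 3)
print(big_weighted_sum, patch)
```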
For instance, some of these smaller portions might include the upper left hand corner of the grid, somewhere in the middle of the grid, or somewhere down below in the grid. Essentially what's going on is you make this square grid (in theory it could be rectangular, but in practice it's a square), and the network slides it around from left to right, kind of like a typewriter if you're typing in English. Once it gets to the end of a row, it goes down one unit and does the same sliding from left to right, and so on until it gets all the way to the bottom. And so our hidden layer weighted sums are going to be over the entries covered by this moving grid, which is known as a filter. Maybe this is a bit abstract because we're just looking at pictures, so it should become more concrete once we see these weighted sums.

The weighted sums come from the three parts of a convolutional neural network, and the stuff we were just talking about is really just the first part, the convolutional layer. This convolutional layer is the implementation of the visual representation we just talked about, where I slide this yellow square around my grid and focus just on the areas within that square. I don't want to write out the notation, because it's a real nightmare with four or five different indices. Instead, we'll work through a silly example that isn't a real data set.

Let's say, hypothetically, we were given a grid that is 10 by 10 pixels, represented by this 2D array. Maybe each value in this grid represents the pixel intensity for some image. If you were to plot it, it would just look like random noise, because that's what it was generated with. But we can say, hypothetically, that when we look at our images here, they'll be represented in a 2D array like this, where each entry corresponds to the intensity of the pixel at that grid point.

What you do in a convolutional layer is take a square grid, 3 by 3 in our example, but F by F in general, and slide it around just like I said. You start in the upper left hand corner, then shift it over one, then shift it over one again, and so on. Essentially, that shifting and sliding means we take a bunch of smaller weighted sums instead of one large weighted sum that includes every value. Visually, what we're saying is that this F by F grid contains the weights we apply to the values it covers. So as an example, here's one of these sliding grids, and it has this corresponding set of weights. The single weighted sum from the feed forward network, where each grid point had its own weight, is now replaced by a series of separate weighted sums represented by these filters. The filters, these square grids I keep mentioning, hold the weights for each of the sliding processes. For instance, one sample weight I'm showing here is a 9, and that would correspond to the 224 in the grid. And so the weighted sum corresponding to this position of the sliding grid would be 9 times 24, plus 9 times 186, plus 5 times 220, plus 2 times 224, plus 10 times 8, plus 8 times 74, and so on through the remaining entries. You essentially do what's known as a dot product between the filter and the values it covers, if you consider them as vectors: each weight in the filter is multiplied by the corresponding value in the grid, and then everything is added together, which in this example gives us 6057.
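Here's that single-window weighted sum as a minimal NumPy sketch. Note the arrays are partly made up: the lecture only reads out six of the nine products, so the last three entries in each array below are hypothetical, chosen so the total matches the 6057 from the example.

```python
import numpy as np

# 3x3 filter: these are the weights (last row is hypothetical)
filt = np.array([[9,  9,  5],
                 [2, 10,  8],
                 [7,  2,  1]])

# The 3x3 chunk of the data grid under the filter (last row is hypothetical)
patch = np.array([[ 24, 186, 220],
                  [224,   8,  74],
                  [255,  81,   0]])

# Multiply corresponding entries and add them up: a dot product
weighted_sum = np.sum(filt * patch)
print(weighted_sum)  # 6057
```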
So in general, here we had a 10 by 10 grid with a 3 by 3 filter sliding over it. If you sit down and work it out, that's 8 by 8 possible grid locations: there are 8 by 8 possible positions for this green square, because starting from the upper left we can slide it over eight times and slide it down eight times. In general, if you have an L by B grid and you're sliding an F by F filter over it, there are (L - F + 1) times (B - F + 1) potential positions, meaning that your hidden layer will have (L - F + 1) times (B - F + 1) values, or nodes. So in this example, we would have 8 times 8 nodes in the hidden layer; that's 64.

This is just showing you that, using the example we have above, this would be the value in the hidden layer that results from this weighted sum. We would then apply a nonlinear activation function to it; I just haven't pictured that here. Once we have the 6057, it would get hit by a ReLU function, which is the typical activation for convolutional neural networks. But this is what you get from the weighted sum process: each set of grid points multiplied by the weights in the filter goes into the corresponding spot in the hidden layer, which here is an 8 by 8 hidden layer.

Now, to make this example simpler, I showed a hidden layer with just a single 8 by 8 grid in it. But in practice, what tends to happen is that hidden layers consist of multiple grids. So, for instance, if we wanted a hidden layer that was 16 deep, we would do this weighted sum process 16 times, and we would end up with 16 grids of size 8 by 8 in hidden layer one. I'll have a picture down below that maybe makes this clearer. You would essentially have 16 moving filter processes, so 16 times the number of weights stored in the filters. In general, if we had an L by B grid and a depth of D, we would end up with D times (L - F + 1) times (B - F + 1) hidden layer nodes.

So here's a visualization of what I'm talking about. Here's your original data grid, which is L by B, and we're sliding an F by F filter over it. We do the convolutional weighted sums to get these hidden layers, where each hidden layer grid is (L - F + 1) by (B - F + 1), and the layer has a depth of D. The squares here are like the nodes we talked about in the feed forward network, but now they're arranged to correspond to the process that generates them.
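Putting the whole sliding process together, here's a rough NumPy sketch of one convolutional layer under the setup above: an L by B grid, D filters of size F by F, a stride of one, no padding, and a ReLU activation at the end. The function name and the random inputs are just for illustration.

```python
import numpy as np

def conv_layer(grid, filters):
    """Slide each FxF filter over an LxB grid, taking a weighted sum
    (a dot product) at every position; stride 1, no padding."""
    L, B = grid.shape
    D, F, _ = filters.shape
    out = np.zeros((D, L - F + 1, B - F + 1))
    for d in range(D):                 # one sliding process per filter
        for i in range(L - F + 1):     # slide down
            for j in range(B - F + 1): # slide left to right
                window = grid[i:i + F, j:j + F]
                out[d, i, j] = np.sum(filters[d] * window)
    return np.maximum(out, 0)          # ReLU activation

grid = np.random.randint(0, 256, size=(10, 10))  # the hypothetical 10x10 image
filters = np.random.randn(16, 3, 3)              # 16 filters, each 3x3
hidden = conv_layer(grid, filters)
print(hidden.shape)                              # (16, 8, 8): 16 grids, each 8x8
```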
You might notice that, because of this sliding grid process, the grid points in the middle are represented more in these layers than the grid points on the outside. This happens because they show up in more of the weighted sums, so they'll have more influence on the final model. If you want to make sure that you don't pay more attention to the inside of the grid than you do to the outside, what you can do is add in padding of zeros. So, for instance, you could add rows and columns of zeros around your grid before doing the filtering process. That makes sure that the values on the periphery of your grid show up in the weighted sums just as much as the values on the interior of your grid. This is called padding, and it's something we'll see below that Keras does automatically for you.

There's also the stride value, which is how far over you move each step. So far we've been describing what would be known as a stride value of one, where I take this green square after doing a weighted sum, move it over one, do the next weighted sum, move it over one, and so forth. You can also set the stride value to be whatever you'd like. A stride value of one is pretty typical, but you could choose two or three. That's just going to give you, I don't know if "sparser" is the right word, and I don't like using that word because it has a very specific meaning, but you're just going to get less of a representation of all the values in the weighted sums.
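As a quick sketch of how padding and stride change the bookkeeping (made-up arrays again; the position-count formula below is the standard one, which reduces to L - F + 1 when P = 0 and S = 1):

```python
import numpy as np

grid = np.random.randint(0, 256, size=(10, 10))  # hypothetical 10x10 grid
F = 3                                            # 3x3 filter

# Padding: surround the grid with P rows/columns of zeros so the values
# on the periphery show up in as many weighted sums as the interior ones
P = 1
padded = np.pad(grid, P)
print(padded.shape)  # (12, 12)

# With padding P and stride S, the number of filter positions along a
# side of length L is (L + 2P - F) // S + 1
for S in (1, 2, 3):
    out_side = (10 + 2 * P - F) // S + 1
    print(f"stride {S}: {out_side} x {out_side} positions")
# stride 1: 10 x 10, stride 2: 5 x 5, stride 3: 4 x 4
```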
OK, so that's part one; that's what we were talking about up above with these images. The convolutional layer is the layer that works by sliding these filters around the grid and doing a bunch of smaller weighted sums, as opposed to just one massive weighted sum.

The next layer is known as a pooling layer. It's pretty common that after hidden layer one you might set up another convolutional layer that does basically the same process, but now over all of these grids, as opposed to just the single input data grid. But once you have maybe one or two convolutional layers, what becomes common is to add something called a pooling layer. As you can imagine, going back to the convolutional layer, doing so many weighted sums that many times gives you a lot of parameters in your model, which can be incredibly difficult to fit with small sample sizes. What a pooling layer does is down sample, to try to leave us with fewer parameters than the model would have without the pooling layer.

So how does it do that? It works by shrinking your data, sampling for the most important features. I know that sounds vague; maybe it will become clearer as we show you. A pooling layer works essentially by sliding a grid around in much the same way as the convolutional layer, but this time, instead of taking weighted sums, it takes a function of the values in the grid. For instance, suppose we're sliding a 2 by 2 grid, which is what we show below, with a stride value of two. You'd start in the upper left hand corner and look at all the values in this window, and in this example you would take the maximum value and record it in your pooling hidden layer. So, for instance, here in the upper left hand window, the 4 was the largest value, and that's why the 4 gets put there.

Now, we have a stride of two, so our square moves over two units, and now we're focusing on the 8, the 3, the 9, and the 0. Here the maximum value is 9, so we record a 9 in the example pooling hidden layer. The pooling hidden layer has the same exact depth as whatever hidden layer you're working with, but it will have a different width and length, for lack of a better term.

Here's maybe a clearer visual description of what's going on; instead of watching my mouse trace it out, you can look at the tracings I've done by making these images. For instance, the example pooling hidden layer applied to this pink square would look at 0, 6, 5, and 4; the maximum value is 6, so a 6 gets recorded in the pooling hidden layer. In this blue one, we have a 9, a 2, a 0, and a 3; the maximum value there is 9, so the 9 is what gets recorded.

In practice, maximums get used quite often, but you could use anything. You could take a norm of the entries, like the Euclidean norm, where you add up the squares and take the square root, or maybe the L1 norm; you could do a lot of things. But it's been found that in practice, taking the maximum value works best. Also, you can use different pooling grid sizes and strides, but a 2 by 2 grid with a stride of two is the most common.

We've already talked about how the pooling hidden layers have the same depth as their input layer, in contrast to convolutional layers, which may have a larger depth than their input layer. What do we mean here? Earlier, when we did the convolutional layer, the original data grid turned into a deeper hidden layer, meaning it has multiple grids, where the number of grids is equal to D. Whereas with the pooling layer, the number of grids stays the same from the hidden layer to the pooling layer: if your hidden layer had two grids, your pooling layer has two grids. The only thing that changes are the dimensions of the individual grids.
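And here's a matching NumPy sketch of 2 by 2 max pooling with a stride of two. The values in the example grid echo the windows traced in the images above, except the last window, whose values are made up.

```python
import numpy as np

def max_pool(grid, size=2, stride=2):
    """Slide a size x size window with the given stride,
    recording the maximum value in each window."""
    L, B = grid.shape
    out = np.zeros((L // stride, B // stride))
    for i in range(0, L - size + 1, stride):
        for j in range(0, B - size + 1, stride):
            out[i // stride, j // stride] = grid[i:i + size, j:j + size].max()
    return out

hidden_grid = np.array([[0, 6, 8, 3],
                        [5, 4, 9, 0],
                        [9, 2, 1, 7],
                        [0, 3, 2, 5]])
print(max_pool(hidden_grid))
# [[6. 9.]
#  [9. 7.]]
```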
The final part of the convolutional neural network is the fully connected layer. In some sense, we can think of the convolutional and pooling layers as a massive preprocessing step for a deep neural network, a feed forward neural network, or a multilayer perceptron from sklearn. The convolutional steps and the pooling steps really process the data for us first, and then we feed the result into a feed forward neural network, typically with a single hidden layer, which then feeds into the output layer to make the estimates and the predictions.

So the three steps of a convolutional neural network are: the convolutional layers, which go through and take lots of smaller weighted sums, which is why we end up with more grids than we started with; the pooling layers, which down sample the output of the convolutional layers to try to limit the number of parameters we need; and then finally, once you have maybe a couple of those alternating, you put on the fully connected layer, which is just the same thing we talked about with the feed forward network. Essentially, the convolutional and pooling layers were just a large preprocessing step for this grid based data.

OK, so that's the theory. In the next video, you'll see how to build a convolutional neural network in Keras with the MNIST data set. I hope you enjoyed learning about the theory. I know it can be a little confusing, so if you're stuck on something or you don't get it, I suggest looking over the notes and reading through the words a little bit. If you still don't get it after that, what helped me when I was first learning this was sometimes to just take a break and walk away from it, let it process in your head, and then come back and try to figure it out later. And if after all that it still doesn't click, you can always send me a question on Slack and I will do my best to answer, or you can ask whoever you're most used to asking for help on the Air Institute Slack. OK, so I hope you enjoyed the first in this two part series on convolutional neural networks. I enjoyed having you, and I will see you in the next part, Convolutional Neural Networks Part Two. Bye.