MNIST Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video, we are in the neural networks section, but we're going to be talking just about a data set which we'll use in the next few notebooks. So let me go ahead and share my Jupyter notebook, where we can learn more about the MNIST data set. So this is a data set that we will formally introduce here, and we're going to use it quite a bit in the coming notebooks.
When we learn more about neural networks, we'll show you how to load in a version of the data with sklearn, which is slightly different than another version of the data that we will learn about in Keras. So MNIST, which stands for the Modified National Institute of Standards and Technology data set, is a collection of pixelated images of hand-drawn digits.
So these are the counting numbers 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. Each observation in this data set is a hand-drawn image that's been pixelated. Each image is broken into a grid of pixels where each pixel has a grayscale value from 0 to 255. 0 represents no marking at all, whereas 255 represents the darkest marking of this hand-drawn image.
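To make the pixel grid concrete, here is a minimal sketch in Python using NumPy. The tiny 5 by 5 image is made up for illustration and is not taken from MNIST. It only shows how a grayscale grid with values from 0 to 255 can be stored as an array and flattened into one feature per pixel. The rescaling step at the end is a common convention, not something this lecture prescribes.

```python
import numpy as np

# A made-up 5x5 grayscale "image" of a vertical stroke (not real MNIST data).
# 0 means no marking; 255 is the darkest marking.
toy_image = np.array(
    [
        [0,   0, 255,   0, 0],
        [0, 128, 255,   0, 0],
        [0,   0, 255,   0, 0],
        [0,   0, 255,   0, 0],
        [0, 128, 255, 128, 0],
    ],
    dtype=np.uint8,
)

# Flatten the grid row by row so that each pixel becomes one feature.
features = toy_image.reshape(-1)

# Rescale to the range [0, 1] (an assumed preprocessing step for illustration).
scaled = features / 255.0

print(toy_image.shape)  # (5, 5)
print(features.shape)   # (25,)
print(scaled.min(), scaled.max())
```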
So for more background information on this data set, you can click on this link, which takes you to the Wikipedia page for the MNIST database. The original data set had 60,000 training images, so that's 60,000 handwritten digits, and 10,000 test images. And both of these sets are split up so that each digit has roughly equal representation. Over the course of these neural network notebooks, we are going to use both versions of the data set.
So we'll use the sklearn one where we set up the theoretical multilayer network, and we'll use the Keras one when we show you how to build more flexible neural networks using the Keras package. So let's start with the sklearn version. This is stored in the sklearn datasets module, which can be found here. It has a function called load_digits, and we'll use this to load X and y.
And actually, let me restart my kernel real quick. I forgot to do that ahead of time. OK, and then we can run through it. All right. ... So now we have X and y. We can see that X contains not 60,000 images. The length of X here tells you how many images it contains, which is 1,797. So for instance, this is what one image looks like, with values that, in this version, range from 0 to 16 rather than 0 to 255.
And then this corresponds to what is a zero. So how many pixels does an sklearn image have? 64. So this version of the data set is a lower-resolution version, and it has fewer observations. So there are 1,797 images in this data set, each an eight-by-eight pixelated grid. And so this is a smaller version of the data set, which allows it to load more quickly than the Keras version and take up less memory on your computer.
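The loading steps just described can be sketched roughly as follows. This assumes scikit-learn is installed and uses the return_X_y=True option of load_digits to get X and y directly; the notebook in the video may load them a slightly different way. The sketch also prints the observed minimum and maximum pixel values, which is worth checking against the 0 to 255 range of the original data, since the scikit-learn copy stores intensities on a coarser scale.

```python
from sklearn.datasets import load_digits

# return_X_y=True returns the flattened images and their labels directly.
X, y = load_digits(return_X_y=True)

print(len(X))            # number of images
print(X.shape)           # (number of images, pixels per image)
print(X.min(), X.max())  # observed range of pixel values

# Each row of X is a flattened 8x8 image; reshape it to recover the grid.
first_image = X[0].reshape(8, 8)
print(first_image)
print(y[0])              # label of the first image
```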
And then it also has a lower resolution than the actual version of the data set. Again, this is for memory storage purposes, and it helps your algorithms run more quickly because it has fewer features, right? So here are some examples of images from this data set. So there's a zero, here's a one, here's a two, a three, a four, and a five.
And so the problem with this lower-resolution one, in terms of trying to maybe fit the data better, is that it can be hard to tell the digits apart. So for instance, here, the two and the three, because of the way they're pixelated, do kind of look similar. And the five also kind of looks similar to the three above. And so with so few pixels, this low resolution does make it slightly harder to tell what these images are.
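A plot like the one described here can be reproduced with a short sketch along these lines. It assumes matplotlib and scikit-learn are installed; the two-by-three layout, the figure size, and the gray_r colormap are choices made for this illustration, not necessarily what the notebook in the video uses.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# digits.images holds the 8x8 grids; digits.target holds the labels.
fig, axes = plt.subplots(2, 3, figsize=(6, 4))
for ax, image, label in zip(axes.ravel(), digits.images[:6], digits.target[:6]):
    # gray_r draws 0 as white and larger values as darker, like ink on paper.
    ax.imshow(image, cmap="gray_r")
    ax.set_title(f"Label: {label}")
    ax.axis("off")

fig.tight_layout()
```

In a Jupyter notebook the figure is displayed automatically; in a plain script you would add a call to plt.show() at the end.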
But it is nice for demonstrating algorithms, because you don't have to wait very long for things to train, for things to load, or for things to make predictions. That makes it nice for demonstrating algorithms in the sklearn documentation, or for testing new algorithms that maybe you've developed for an image data set when you just want a nice quick baseline.
The other version of the data set that we'll work with is the Keras version. So this can be found in the Keras package, whose documentation can be found at this link. This is a Python package built for making neural network models on top of the TensorFlow architecture, if that's the right word. So it's more user friendly than TensorFlow, but it uses TensorFlow in the back end.
You may not have this package installed on your computer yet, but don't worry. You don't have to have it installed to understand what we're going to cover in this notebook. I'm just showing you how to load the data set. Once you have Keras installed, you can come back and run this yourself. So the MNIST data set in Keras can be found in the datasets module.
So we're going to import it here: from keras.datasets import mnist. If this is your first time, this will take a while to load because it has to download the data. So if it's your first time downloading it, it's a little bit slower than if you're running it for the second time. Now we can load the data by calling mnist.load_data.
And we can take a look. The nice thing about this one is that it splits the data according to the original train test split. So we can see the original training set was 60,000 images that are 28 by 28 pixel grids. And then the test set has 10,000 images, again with 28 by 28 pixelated grids. So let's take a look at some of these. So these images are of a much higher resolution.
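Here is a hedged sketch of the Keras loading step. The helper names load_keras_mnist and flatten_images are hypothetical, introduced only for this illustration. The Keras import is placed inside the function so that the rest of the sketch still runs on a machine without Keras, and a placeholder array of zeros stands in for the real images so nothing has to be downloaded.

```python
import numpy as np


def load_keras_mnist():
    """Hypothetical helper: load MNIST with its original train/test split.

    The import lives inside the function so the rest of this sketch still
    runs on a machine where Keras is not installed yet.
    """
    from keras.datasets import mnist  # downloads the data on first use

    return mnist.load_data()


def flatten_images(images):
    """Hypothetical helper: (n, height, width) -> (n, height * width)."""
    return images.reshape(len(images), -1)


# Placeholder with the same per-image shape as the Keras data (28x28),
# used only so that this sketch runs without the download.
placeholder = np.zeros((3, 28, 28), dtype=np.uint8)
print(flatten_images(placeholder).shape)  # (3, 784)

# With Keras installed, the real data would be loaded like this:
# (X_train, y_train), (X_test, y_test) = load_keras_mnist()
# X_train.shape should then be (60000, 28, 28) and X_test.shape (10000, 28, 28).
```

Flattening a 28 by 28 image gives 784 features per observation, compared with 64 for the eight-by-eight sklearn images.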
So we can see the numbers and the way they were drawn much more clearly. It's much closer to what you'd see if you were to look at them on the actual page they were written on: still clearly pixelated, but much less pixelated, or it appears to be much less pixelated, than the sklearn version. And so for instance, if we're going to train algorithms on this, like a neural network, it may be better to use these higher-resolution ones because they have more features.
And, you know, they more clearly distinguish between the different types of numbers. So what are we going to use these for? So in the next notebook, on multilayer networks, if you're watching this in sequence, we will build some models in sklearn. We're going to use the sklearn version of the data for that just because it's easier. Then we'll learn how to build some models in Keras. When we build the models in Keras, we'll use the Keras version of the data set because it's easier,
again, for ease of use, and that's what the data sets were built for, these specific purposes. OK. So I hope you enjoyed watching this video. I enjoyed telling you about the MNIST data set. I always think it's kind of fun to plot the images and see what they look like and what digit they're supposed to be. So have a great rest of your day. Bye.