Awesome. Okay, so I'm gonna go ahead and start recording.

Alright, so welcome. Today's the second day of lecture for the May 2023 boot camp. Today we're gonna start the data content, beginning with data collection. So let me go ahead and share my Jupyter notebook and get my chat window open.

This is what you should see if you were able to successfully clone the repository and open it with a Jupyter notebook. If you're unable to get to something like this, I ask that you hold your questions until after lecture is done, and then I'll stick around for a few minutes to make sure that you're able to get set up.

Okay. So every day when we do lectures, or if you're watching the lectures asynchronously, either the pre-recorded or the live lectures, you're gonna go here, click on Lectures, and then navigate your way to the content we'll be covering. Today we're gonna skip the introduction, which we kind of did yesterday, and go straight to data collection.

So remember, the goal of the boot camp is to give you the skills to complete an end-to-end data science project that you can talk about in interviews, or write about on your resume when applying to data science positions. These projects are also helpful if, let's say, you want to stay in research but do data science research; they'll help you build skills for that type of work too.

The first key, though, for doing any sort of project, whether it's in a job position or as part of a research project, is collecting data. You can't do a data science project without data, so that's what today is going to cover: giving you the skills to find data sets that already exist, or the skills to create your own data sets using the Internet. The first notebook we're going to work on is this "data source websites" one.

So what's going to happen for all lectures, now that we've gone through the intro, is I'm gonna open a lecture version of the notebook, and this lecture version I'll fill out during the live lecture and upload later tonight or tomorrow morning. That way, if you want to come back and see what Matt's notes were during the lecture, you can always check this copy. Once my kernel starts, we'll be able to code and stuff, but I don't think there's much code in this one.

Okay, so what are we gonna talk about in this notebook? We're gonna talk about data source websites. What are data source websites? These are just websites that exist on the Internet that you can use as sources of data. They have data sets that are all ready to go. There are two main types that we're going to talk about.
The first is known as a data repository, and the second is a data competition site. There are additional types of data websites, but these are the main two that we'll focus on today.

So what's a data repository? This is any website where data sets are deposited. There's a couple of reasons why these might exist: maybe for academic research, or to house some data that a website or company is using. There are lots of different reasons to have a data repository. Some very specific examples: maybe a site is housing data associated with published academic research; maybe a news organization did a data-focused piece, and now they're holding that data so other people can look at it or check their facts; and then, finally, sometimes there will be websites that host benchmark data sets used to compare different algorithms that are being developed. We'll see some examples of that throughout the boot camp.

One of the main kinds is known as an academic repository. As an example, here's a link to the UC Irvine Machine Learning Repository. These have various different data sets that you can check out, like the newest ones as well as the most popular. This iris data set you'll see when we do classification, I believe. Here you can see it has the creator, the person that donated the data set, and various pieces of information about the data set, like what the data columns look like, as well as papers that cite the data. On this particular website, if you go to the data folder, there will be links to the data itself that you can click to download.

So that's one example, and here are some additional examples that you might be interested in checking out. Other examples are GitHub repositories, that is, repositories that just exist to store data. These are used a lot by websites and news organizations. For example, here are links to the FiveThirtyEight, the New York Times, and the pudding.cool repositories. If we click on this, it will take us to the GitHub repository for FiveThirtyEight, and they have a data repository within that. This contains all the data for their website, their articles. And I believe the one we'll look at in this notebook is called candy-power-ranking. This was a piece they did; I believe, within FiveThirtyEight, they rated the various different types of Halloween candy one year and then made a fun video about it. So if we click on the .csv file, this is the data set that we would be using.
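Here's a minimal sketch of what loading that file with pandas can look like; the raw-file URL below is a best guess at the current path in the fivethirtyeight/data repository, so verify it before relying on it:

```python
import pandas as pd

# Read the CSV straight from the raw GitHub URL (needs an internet connection).
# This path is an assumption -- check the fivethirtyeight/data repo for the current one.
url = ("https://raw.githubusercontent.com/fivethirtyeight/data/"
       "master/candy-power-ranking/candy-data.csv")
candy = pd.read_csv(url)

# Or, after saving the raw file by hand, read the local copy instead.
# candy = pd.read_csv("candy-data.csv")

print(candy.head())
```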
So this is what it looks like here, but if we want to copy it, we'll probably want the raw version of the data, and then we could use our web browser's save feature to save it as a CSV file and upload it. I believe that I've already done that. So if I import pandas and then run this... I thought that might happen. I'm not gonna spend the time to download it; you can download it yourself, and if you download it you'll be able to do this. I'll download it after the lecture, because we don't need to waste time watching me download a file. But once we do that, you'll be able to load it.

The other way you can do it with GitHub, which is nice, is that because we have a link to the raw CSV file on the GitHub website, pandas allows us to just input the link and read it directly from the Internet. Now, you'll need an Internet connection to do this, but you can see, now that I've run this, it's connected and I can look at the data in my Jupyter notebook. I can do other things to the data, like take a random sample of size 4, and now we can see the random sample. So this is something you can do with the raw GitHub file, or any other online address that links to a data file: you can use pandas and just provide the link, assuming you're connected to the Internet.

So that's sort of it for data repositories. Just some quick guidelines on how you can use a data repository. If you're using data that you didn't create yourself, it's important to make sure that you provide a citation of where you got the data, and make sure you follow whatever data-use guidelines are associated with it. Sometimes data repositories will say, if you use this, please cite this particular paper, or they'll have rules that you can use the data, but not in any sort of commercial product; so you couldn't use the data to train an algorithm that you then monetize. Just read the guidelines they have available on their website and follow those. And even if they don't have guidelines on citation, don't pretend that you generated the data yourself. Make sure you cite where you got the data, even if their guidelines don't say, hey, make sure you cite us.

The other type of data website that you might be interested in is known as a data competition website. These are websites that exist to host data competitions. These websites will do a large number of things, including just publicly storing data, but they'll also host the competition, and they'll specify rules as outlined by whoever's providing the competition.
They'll accept entries to the competition from people like you that might want to enter, and then they'll provide criteria and help determine the winner or winners of the competition. So while for a lot of these websites the main purpose is to provide a competition that you can join, they also often serve as a source of data for personal projects.

For instance, one of the most popular is Kaggle.com. On Kaggle.com you can see they have the competitions here that you could go through, but they also just have regular data sets that aren't necessarily involved with a competition. So you could click on a data set, let's say maybe a vehicle data set, and then we can scroll down and see here's what the data set looks like, and that there are different files. If we click on this, a different data set loads, and then if we click this download button, it would download all of this data for us to have.

Again, these sorts of websites have rules about how to use the data, licenses, and that sort of thing. So be mindful of what the website is asking you to do if you're going to use the data set, and follow that. For instance, if you got your data set from a competition, a lot of competitions will say you cannot publish the data or your work until the competition closes, so be mindful of that: what are the rules, what are the regulations for the data competition, if that's where you're getting your data from?

Another important part of this is that these Kaggle competitions often come with monetary prizes, so if you do well enough, you could win money, or just swag. Some of them are just offering the reward of knowledge, which is maybe all you're looking for. So just keep that in mind.

Okay, so for the sake of time I will skip the example of extracting data, because I trust you guys to be able to click the download button and move your files around. But you can try and practice on your own by going through this and seeing if you can get the iris.csv file into this folder and loaded by following these instructions.

Okay, so before we move on to the next notebook, are there any questions about data competition websites or data repositories, anything like that?

I have a question.

Yeah.

So you were talking about getting data from sites like a GitHub repo. Even if the authors don't say anything about citing, you should still cite it, right? Still cite where you got the data? How should I cite it?

Yeah, so you should just say, the data was retrieved from this website.
And then if it's in a paper or something, follow whatever the standards are for citations, just like you would any other source.

Gotcha, gotcha. Thank you.

Yup, yup! Any other questions? Okay.

Alright! So the next thing we're gonna talk about is: maybe you have data that's stored in some sort of database. Basically, this is just going to be, how can I get that data from the database into pandas? How can I load database data into things like pandas, into Python, into Jupyter notebooks, so you can manipulate it?

In the previous examples from the websites, those were CSV files, and those are just singular files. But sometimes data is too complicated for a series of one-off files. In situations where data is that kind of complicated, people will often store it in a database. So in this notebook we'll introduce the idea of a database, we'll introduce the language that people use to communicate with databases, or relational databases, and then we'll show you how to access the data using a package called SQLAlchemy.

So what is a relational database? A lot of times, businesses or other entities will have a series of data tables that are interrelated with one another. For instance, let's imagine we have a hypothetical business, and it's a sales business, so they sell items. Maybe they have a table that keeps track of all the purchases that get made, and maybe they have you set up a profile, like Amazon, where you have to have an Amazon profile to make a purchase, so they also have a table of their customers. So they might have this purchases table with a purchase ID column that keeps track of the unique identity of each purchase, as well as a customer ID column that keeps track of the customer that made the purchase, and then all the other related data they might want, like the name of the product, how many were bought, the price, all that sort of thing. At the same time, this could be linked to a customer table, where the customer ID is shared between both the purchases and the customer tables, and the customer table would contain information on the individual customers. If you're thinking of profiles you've created, it might have things like your address, your credit card information if it's Amazon, things like your age, other information about your customers that you might want to have in order to better sell to them, or something like that.
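To make that concrete, here's a toy sketch of two such linked tables; all of the column names and values below are made up for illustration:

```python
import pandas as pd

# A purchases table: each row is one purchase, tied to a customer by customer_id.
purchases = pd.DataFrame({
    "purchase_id": [1, 2, 3],
    "customer_id": [101, 102, 101],
    "product":     ["litter", "scratching post", "cat food"],
    "price":       [12.99, 24.50, 8.75],
})

# A customers table: one row per customer profile.
customers = pd.DataFrame({
    "customer_id": [101, 102],
    "age":         [34, 27],
})

# The shared customer_id column is what lets you relate the two tables.
print(purchases.merge(customers, on="customer_id"))
```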
And so the customer ID column of the purchases table can be linked to the customer ID column of the customer table, and this allows you to query the data in a way that you could say: okay, give me the purchases of all of my male customers from ages 20 to 40. Then maybe you can use that to make some sort of marketing decisions, like: oh, okay, they tend to buy this, so I will advertise these things; or they tend to not buy this, so maybe I'll incentivize them to buy it with coupons, that sort of thing. So sometimes data is stored in a relational database like this, and you need to be able to take it out using Python.

That is where we're going to have sort of a middleman called the Structured Query Language, or SQL. People tend to just write SQL code if they're accessing data, but this is not going to be a boot camp on how to write SQL code. We're going to give you the very basic tools of how you can use some very basic SQL to get data out of databases and into Python, which is where we are going to be working.

The way that you get data out of a database is by writing what's known as a query, using SQL. In the query you specify: I would like to get this data from this table, and then, typically, some sort of conditional statement. So here is the most basic SQL query syntax. In SQL it's standard to write SQL keywords with capital letters, and whatever the table names are with lowercase letters, assuming that's how your tables are named. For instance, if you want to get all of the columns from a particular table, for the rows that satisfy some sort of conditional statement, you would write capital SELECT, and then a star; the star indicates that you'd like to get all of the columns from that table. Going back to our example, the star would indicate that you want, maybe, all of the columns of the purchases table. Then FROM table_name, where table_name is where you specify the table name; if we wanted the purchases table, we would say purchases. And then WHERE: if we'd like to specify that maybe we only want purchases that are more than $10, or purchases of a particular item, we would provide conditional statements there.

So how do we write SQL code in Python? One way is to use the SQLAlchemy package. This is not something that, I don't think, is installed by default with the Anaconda Navigator distribution, if you're using that, so you may have to install the SQLAlchemy Python package in order to run this code. One way that we can check if it's installed is to just try to import it.
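A minimal sketch of that check; the version printed will almost certainly differ from the one in lecture:

```python
# If this import fails, SQLAlchemy isn't installed in your environment;
# install it with conda or pip (e.g. pip install sqlalchemy) and try again.
import sqlalchemy

# Print the installed version so you can compare it against the lecture's.
print(sqlalchemy.__version__)
```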
So if you try to run this code chunk that I just ran and you receive an error, it's because you do not have SQLAlchemy installed. We have instructions; I believe it's under the First Steps button on the website. You can click that button, and in there will be a file on how to install Python packages using either conda, pip, or the Anaconda Navigator. So you can try and install it. I would suggest maybe waiting until after the lecture to try, because I want to make sure you're paying attention, but if you'd like to code along, you can try and install it now.

This will also show us what version of SQLAlchemy we have. I currently have 1.4.29 installed. It's possible that your version will be a little bit later or a little bit earlier than mine. It's probably okay, but if you're ever going through something and your code behaves differently than mine, let's say I write something and it doesn't work for you, it could possibly be just because our versions are different. You can typically do a web search that says, how to do blank, package version blank, and somebody out there has an answer, or they might just suggest that you update the package version, if it's something that's not available in an older version.

Okay, so let's try and learn about submitting SQL queries. But maybe before we do that, now is a good time to pause for questions, and then I can also take a drink of water.

Okay. So there is a particular series of steps that you have to take when getting data out of a database with SQLAlchemy. We're going to go through those steps with a sort of synthetic database that I created called cat_store. We're going to imagine that this is a database for a cat store that has two tables: a customers table and a purchases table.

The first thing you have to do in order to get data out of a database using SQLAlchemy is what's known as creating an engine. You have to create an engine, and then that engine will allow you to connect to the database. To create the engine, typically what you'll do, when working with Python packages, is import the specific classes or functions that you want to use. For us, that's the create_engine function, so we'll first import that from SQLAlchemy, and then we create the engine by running create_engine. You then input a string; that string first takes in the type of SQL dialect that was used to create the database.
For us, that's going to be sqlite. Other ones that you might see are regular SQL, or MySQL, or, I believe, MariaDB; there are other examples. It just depends on your database, and typically those things are specified; you would know what you're using before you try to access it. That's usually specified by the person providing the database. Next, you're going to have a colon, and then you need to provide the path where the database is stored. Because cat_store.db, as of this morning, is stored within the repository, you just have to put in the file name, so for us that's cat_store.db. And, I guess, not the file name but the database name. So now I have a connection to, or, I have an engine that will allow...

Yeah, yeah?

Sorry, I have a question. Can you explain the reason why you have the three slashes?

So I believe that you can put something else here to specify the path. I don't quite remember what goes here. I think it's just to help you specify the path, but because the file is stored in this particular folder, it's just slash and then the name of the database. I've never used this to access a database that isn't right in the folder with me, so I would have to read the SQLAlchemy documentation to figure out why three, precisely.

Okay.

No, sorry, if I can add something. If the database path is somewhere in another folder or another directory, then you need to specify that; maybe that's why you have the three. So in between the slashes you would add the specific directories, step by step, so that it takes you to that database.

Yup, yup!

Okay. So now that we have an engine, we can connect to our database, and this is done by running engine.connect(). Connecting to the database is what will then allow us to submit queries to the database and get back the information. I'm going to import pandas; this is just going to allow me to display the data nicely as a data frame, so it's easier to read. So how do I submit a query? You first write the variable where I've stored the connection, which is conn, and then you're going to use the method .execute, and within execute you'll put a string that has the SQL query. For us, that's gonna be SELECT, and I just want all of the columns, and then from the purchases table, so FROM purchases, and then that's gonna be it; I don't have any additional conditions.

Okay. And so I ran this, and now that I've run this, let's go ahead and I'll add an extra code chunk.
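Putting those steps together, here's a minimal sketch, assuming cat_store.db sits in the same folder as the notebook:

```python
from sqlalchemy import create_engine

# The connection string names the dialect (sqlite) and the database's path.
engine = create_engine("sqlite:///cat_store.db")

# Connecting is what lets us actually submit queries.
conn = engine.connect()

# Submit a query; on SQLAlchemy 1.4 a raw SQL string works here.
# (On SQLAlchemy 2.x you'd wrap it: conn.execute(sqlalchemy.text("...")).)
results = conn.execute("SELECT * FROM purchases")
```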
You'll notice that nothing came out, and you might be wondering, well, how do I get the query results? So, one way to do this: I stored my results in a variable called results, and we can see that this is a SQLAlchemy cursor LegacyCursorResult object. So this is an object; how do we get the results out of there? We call .fetchall(). This will return a list of tuples of all of the results from our query. And I'll point out: once we run this once, if we try and run it again, it will be empty. Why is that? fetchall will return everything and sort of just spit it out as it goes, and once it spits it out, it's out of the results object.

Rohan is asking, what is a LegacyCursorResult? That's just the name of the SQLAlchemy class that contains the results from the query. That's what they've called the class, and it has different methods and attributes that we can use to get all of our stuff.

So I'm gonna rerun this, but edit it a little bit. I'm gonna store it in a data frame, put in my results, so results.fetchall(), and then name the columns. If you want to know the names of the columns from the table, you can do results.keys(); .keys() will return all the column names from the table that you queried. Okay, so this is what it looks like as a data frame, which maybe makes it a little bit easier to read. We've got purchase ID, customer ID, number of items, pre-tax price, and then purchase type. These are the columns of the purchases table.

I'm going to comment this out for now, and then rerun this down here to show off something else. So we did fetchall, but maybe you don't want to return all of the results at once; maybe you want them one at a time. So we can write results.fetchone(), and notice I've rerun the query, so the data has been repopulated. fetchone will give you the tuple corresponding to the first returned row. If we scroll up and down, we can see this is the first row, or the zeroth row in Python, of the table. And if I were to run fetchone again, what's gonna get returned is the second row. Okay, so now that's row 2, or I guess row 1 in Python. Another feature you have is, instead of getting one row at a time, you can do results.fetchmany() and provide a positive integer input, which tells it to return the next N rows. So if I did 4, it would return rows 3 through 6, because I've already returned rows 1 and 2.
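Here's a sketch of those fetch patterns side by side; the column names come from the lecture's purchases table:

```python
import pandas as pd

# fetchall() drains the cursor, so rerun the query before each full fetch.
results = conn.execute("SELECT * FROM purchases")
df = pd.DataFrame(results.fetchall(), columns=results.keys())
print(df.head())

# Or walk through the rows sequentially instead.
results = conn.execute("SELECT * FROM purchases")
row_0 = results.fetchone()        # first row
row_1 = results.fetchone()        # second row
next_rows = results.fetchmany(4)  # the four rows after that
```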
So an important note is that when you use fetchone or fetchmany, the results are returned sequentially. We're going to skip the practices, just for the sake of time in the live lecture. But with that being said, maybe now is a good time to pause and answer some questions.

Let's see, Laura's asking: I have an error saying "not found" with this command. That probably suggests that you did not pull the update. Earlier today I had to add the cat_store.db database to the repository, and once I added it, the code would run fine. But if you did not pull this update into your version of the repository, you did not have the database file. And typically what that means is, when you run this, if you do not have a database of this name, it will just create an empty database. So maybe you ran this, created the empty database, and now you're trying to do this sort of thing, but the empty database does not have a purchases table. What you'll need to do is first go through and delete the cat_store database, and then pull the updates so that you get the right one. So, Laura, like I said, if you were trying to run this step by step, you will hit this, because if the database doesn't exist, running this creates the file.

Okay, so let's see. Yeah?

So I have a question. The cat_store.db: you said in this command sqlite, so that was created using SQLite, and then you're sending queries to fetch from it. If it were actually created with a different SQL language, would you have to change your queries? Because the queries can differ a little bit if you have different database implementations, like SQLite versus MySQL, or whatever. So would your queries need to change based on whether it were created using MySQL, or something like that?

Yeah, so your queries would have to be in the language that the database was created with.

Okay.

But for the most part, most queries are the same. It's really only the slightly more advanced stuff where the different languages, or engines, I forget what they're called, like MySQL and MariaDB, differ. They only differ on the more advanced things; the basic things...

I see.

Yeah, those typically will be the same.

I see. And usually it will be provided by the database, like which of these is used, or...?
Yeah, so typically, if you're working in an industry setting, they would tell you what they used to create the database.

Okay.

And if you're downloading it, they should also tell you somewhere what language was used to create the database.

Got it, thanks.

So Brooks is saying that they're getting an error. I'm not quite sure why you're getting that error; I would suggest trying a web search. If it's an error that's not because the table does not exist, it's possible that you ran the code earlier, before I uploaded it, and so the new table didn't get updated. I'm not entirely sure why you're getting "not an executable object" here.

So Rohan is asking: what if we want to select only some of the columns? Just like Brooklyn said, you'll have to specify the column names, and we can see an example of that. We could do results equals conn.execute, SELECT, and maybe we want purchase_id, comma, pre_tax_price; so if you have more than one column, you separate them with a comma, and then FROM purchases. Let's just copy this, and we can show what that looks like. Okay. Alright!

Okay, so we can also use SQL to calculate some basic statistics. You can do things like getting the number of results that are returned for a particular query; you do this with COUNT, either of star or of a column. Here you can see that this tells you the database has 20 rows for this table. You can get the maximum of a specified column by doing MAX and then the column name; you can do the same thing to get the minimum, just changing MAX to MIN; and you can get things like the average, the arithmetic mean, with AVG. These are useful; you can also do the same things with pandas, but sometimes it's more useful to do it using SQL, because maybe it's faster in SQL than it is in pandas.
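A sketch of the column selection and the aggregate queries just described; the column name pre_tax_price is an assumption based on how it's spoken in lecture:

```python
# Select only some columns, separated by commas.
results = conn.execute("SELECT purchase_id, pre_tax_price FROM purchases")
print(results.fetchall())

# Basic statistics computed inside the database itself.
print(conn.execute("SELECT COUNT(*) FROM purchases").fetchall())
print(conn.execute("SELECT MAX(pre_tax_price) FROM purchases").fetchall())
print(conn.execute("SELECT MIN(pre_tax_price) FROM purchases").fetchall())
print(conn.execute("SELECT AVG(pre_tax_price) FROM purchases").fetchall())
```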
You can also just use pandas and read the results directly into a data frame, without having to do this .fetchall() process. There are a couple of different ways. You can specify that you want a particular query: if we wanted a pandas data frame out of the query SELECT * FROM customers, let's make it customers this time, then after the query you input the connection, and here's the customers table. Also, since we just wanted the entire table here, we can use read_sql_table directly; I think we just have to put in the name, let's see, maybe as a string. So let's try that: "customers", and then the connection. There we go. You just put in the string name of the table, followed by the connection to the database.

And let's go ahead and also show an example where we use a conditional, because I haven't done that yet. Here is a query with a conditional. We can do pd.read_sql_query, and for this we'll do SELECT, still getting all of the columns from customers, and then WHERE, which indicates that you want some sort of condition to be met. Maybe we say where the age is greater than 27, and then we put in the connection. And so now you have all of the rows of customers where the customer has an age that's more than 27.
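A sketch of those pandas shortcuts, using the table and column names from the lecture's cat_store database:

```python
import pandas as pd

# Run a query and get a DataFrame back directly.
customers = pd.read_sql("SELECT * FROM customers", conn)

# If you want a whole table, you can pass just its name as a string.
customers = pd.read_sql_table("customers", conn)

# read_sql_query with a WHERE condition.
older = pd.read_sql_query("SELECT * FROM customers WHERE age > 27", conn)
```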
Okay, so now, maybe before I go through the process of saying we're done with the database, I'll go ahead and pause for more questions, about either submitting queries with execute, or using read_sql_query or read_sql_table.

So, which queries do data scientists mostly use, like on a daily basis?

SELECT would be one of them, of course, but in general it just depends on where you work and what types of things you're doing. There are people who don't use SQL at all, because they're at a different stage of the team; maybe they're not in charge of bringing the data over. So I think it just depends: it depends on the data you're working with, and it depends on the job that you're in.

I mean, SELECT is gonna be used in any query, I'm pretty sure.

Yeah, I think SELECT would be used in any query. We've really just given you the basics here. There are also lots of other things that make SQL a little bit more complicated, like understanding how to join tables, that sort of thing. This is just the very basics, and if you are working on a project for this boot camp that uses SQL, you're probably going to have to do more work learning a little bit more SQL on your own time.

Thanks.

Yup. And then Laura has asked: is this all possible in Python as well, but easier using SQL? So, if you're just working with a data set that's a single file, you can just use pandas. If you're working with a SQL database but you're ultimately just using one or two tables, you could also quickly use pandas. But there are things SQL is faster at: SQL is faster at querying tables, and it's also probably easier to query the tables with SQL than it is with pandas alone. So there are times when you want to use pandas, and there are times when you want to use SQL; it just depends on the project.

Yeah. I have a question. What if I wanted to use SQLAlchemy to, let's say I had data and I actually want to create a SQL database. Can I do that with SQLAlchemy?

Yeah, you can do that. I think it's in the practice problems for the databases notebook, and maybe I'll show that now. In addition to problem sessions and stuff, I also just have a bunch of problems that you can practice with or learn more from. If you go to the data collection folder, in this database file there is code on how to create a table and then how to insert stuff into the table. So you can do that with SQLAlchemy as well; the query you use just changes, so it's like CREATE TABLE, INSERT INTO.

Yeah. Thanks.

Okay, so let's imagine that we're now done with the database, which we are, in this lecture. In order to be done with the database and not have things running around in the background, you first close the connection; this closes your connection to the database. And once you're all the way done, like you don't think you're going to be connecting to the database at all later today, you also dispose of the engine. This is engine.dispose(), and it's just SQLAlchemy syntax that you have to do to disconnect from the database entirely. Once you've done that, you're no longer connected to the database, and you don't have to worry that you're going to change the database, or that you forget to close your Jupyter notebook properly and still have an open connection running in the background. This is just what you have to do when you're done.

So we've now introduced the concept of a relational database, how you can access the data stored within one, and how to submit queries and get the results into pandas data frames. If you'd like to learn more, as I mentioned, there are the practice problems, where you can learn about creating tables and doing some very basic joins. Beyond that, you'll probably have to start learning some SQL on your own, and that sort of thing.
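A sketch of that cleanup, continuing with the conn and engine from earlier:

```python
# Close the connection once you're done submitting queries...
conn.close()

# ...and dispose of the engine when you're done with the database entirely.
engine.dispose()
```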
So we have a question: how can I install SQLAlchemy? Do I need to do that in Jupyter? To install SQLAlchemy, there are installation instructions; where were they? There are installation instructions here. That being said, if you've never installed any Python packages before, go to the Institute's data science website, click on the First Steps button, and within there is a file that explains the different ways you can go about installing Python packages. You're gonna want to learn how to do that, because there are going to be a number of times where there are packages that we're going to use, or that you'll want to use on your project, that you don't already have installed.

Sorry, a question. After we connect to the database and do the query, and finally, before closing or disconnecting from the database, we are not saving the data on our computer, right? We didn't do that, right?

Well, so, if you are making updates, like, let's say you're in charge of maintaining the database: if you were to submit a query that updated the database, I believe it saves immediately. Or maybe, I'd have to double check, because I don't do database management, but at some point it would be saved, either after you make the execution or after you close the database; I'm not entirely sure which. As for the pandas stuff, some of it can be stored in a variable. Actually, in this notebook I don't think any of our data frames are stored in a variable, but we could store one, and then we would have that data frame stored. And if you are working with the data, presumably you wouldn't close the connection until after you've already stored the data in a variable or something like that, or you're just entirely done working with it.

Yeah, okay.

Okay. The next thing we're gonna talk about is web scraping. But first I should go back to the lectures and not the practice problems; that's why things looked weird. So we're gonna talk about web scraping with Beautiful Soup, and I'm gonna open up the lecture copy.

So sometimes you might want to do a problem or work on a project where the data doesn't already exist in a nice clean file or a nice database that you can access. But maybe it does exist on a website, and it would be pretty easy to get that data from the website.
Maybe it's nicely formatted in a table or something, but it's too difficult for you to do something like copy and paste, or maybe the data exists across multiple web pages, so doing this by hand would take too long. So in this notebook we're gonna learn how you can write Python scripts to scrape a website using Beautiful Soup.

Beautiful Soup is a package that allows you to parse HTML code. Again, this is another example, like SQLAlchemy, where you need to have it installed in order for it to work. You can check whether you have Beautiful Soup installed by trying to import bs4. If this runs correctly and you don't have any errors, you have it installed; if it doesn't run, you will need to install it. So again, if you've never installed a Python package before, you'll need to check the First Steps document, which I think somebody has linked to in the chat, go through it, figure out how to install a package, and then install bs4, following either one of the directions here.

The version that I currently have installed is 4.12.2, and as I saw in the chat, part of the reason there was a difference between my code working and other people's code not working was a difference in versions. So again, as I go through this, if I'm doing something that works for me, but you try to copy it and it doesn't work, and you have a different version than I do: well, the most likely culprit is often a typo on somebody's part, but the second most likely culprit is that you have a different version than the code you're trying to follow. So just be mindful of the Python package version you're working with; if it's different from mine, it's possible that the code I'm writing is not going to work in your notebook.

So, in order to be able to parse HTML code and get the data out of it, we have to understand a little bit about HTML code, so we know what we're scraping. This is going to be a sample of some HTML code. Right now it's written as a string; one thing about Python is, if you put three quotation marks in a row, it allows you to write a string across many different lines without having to do any concatenation. If we want to look at the web page that this little HTML code produces, we can click on this link and see it. It's got this title at the top that's in bold, this sentence about sisters, where each of the sisters' names is blue and underlined, and if you clicked on one, it would take you to a link, which is just this link right here, and then there's an ellipsis at the bottom. So this is what the HTML page looks like.
We're going to go through how to parse this below. So the first thing we want to do, if we want to parse this string, is import the BeautifulSoup class from bs4. This is just another example of how, when you're writing Python code, you typically only want to import the functions or classes that you're using, and not the entire package, if you can avoid it. Now, we imported the whole thing above just to check that you have it installed, but in practice you usually want to import just the stuff you use. So we'll do: from bs4 import BeautifulSoup, with a capital B and a capital S.

Okay, so we're then going to make a BeautifulSoup object. How do you do that? We're going to store it in a variable called soup, with a lowercase s. I type the class, BeautifulSoup, and in the parentheses I provide the string that contains the HTML code, which I stored above in html_doc. Then you put in the language the code is written in: for us this is HTML, so you write "html.parser" as a string. This second argument tells Beautiful Soup: alright, this is HTML code that I want you to parse. There are other languages that are used to write websites, like XML and stuff like that; HTML is the most common, but if you have a website written in some other language, you would need to change this input, although HTML is really the most common anymore.

So I'm gonna use a method called soup.prettify. What this does is take the code that was provided to it, and you can see it's got all these \n's in it. It takes the code and formats it like someone who's coding properly would format it, and if we print it, we'll see what this HTML code would look like if you were writing it in a code editor, with the indents.
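Putting that together, here's a sketch; the html_doc string is reconstructed from the classic three-sisters example in the Beautiful Soup documentation, which this lecture appears to be using:

```python
from bs4 import BeautifulSoup

# The sample page, written across multiple lines with triple quotes.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

# Tell Beautiful Soup to parse the string as HTML,
# then print it with proper indentation.
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.prettify())
```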
So this is telling us that this is an HTML document. We've got something called a head, something called a title, a body, something called a p that has a class. Any time in HTML code that you see a word with these angle brackets on either side, that is an HTML element. Elements are defined in certain ways, so they will look a certain way and can hold certain things on websites. The head element of an HTML document is what determines the metadata about the document. For instance, this head having the title here tells the web browser, for any tab or window, to put the title up here; so the title in the head is what's being shown at the top of the tab. The head will often also contain other metadata, like tags, style, that sort of thing. The body is what you actually see: all that code in the body here is what you're seeing here.

So the body has a p, the title; the b tells it to be bold. This first highlighted p is this part, "The Dormouse's story". The p element stands for paragraph, and a b element makes the text bold. You don't need to remember all this stuff. The main thing to take away is that HTML documents are made up of elements, the elements hold different pieces of information, and elements also have metadata, like the class, which we can use to our advantage to scrape data.

So how does Beautiful Soup go through this stuff? If you were to write it out in a nice little diagram, HTML code follows what's known as a tree structure, and that's "tree" from the field of graph theory. What do we mean? There's a node up top, and everything branches down from that, and you can follow along and get to the element you'd like to reach. That's how Beautiful Soup works. We've got html up top; that's the document. Within the document we have head and body elements, so the html node would be considered the parent of the head and the body, and, vice versa, these would be the children of the html. Each of these children has its own children: the head has a title child, and the body has three p children. These p children then have children of their own: a b, a bold, is the child of the first p; the second p has three a children, where a stands for anchor, which is what you use to hold links and stuff; and the last p doesn't have any kids. So we think of the different levels of the document as generations, and this nice structure is what allows us to parse through the data relatively quickly.

So we're gonna now show you how you can use Beautiful Soup to traverse these documents. But maybe before we dive into the actual code part, are there any questions on just the basic structural stuff about HTML code, or on making the BeautifulSoup object?

Sure! I have a question about HTML file structure. It makes sense that a body could have multiple p children, right? Can each HTML file have multiple heads and multiple bodies?

I don't know if you'd get some sort of error or not when you tried to load the website in a browser. I would think that you typically don't have multiple heads and bodies within a single HTML document, but I don't know for sure; I'd have to check.

Gotcha. Thank you.

Yeah. Any other questions?

Okay. So the way that we can parse this is: first, remember, we stored our BeautifulSoup object in a variable called soup.
00:52:23.000 --> 00:52:30.000 So after soup, you put the element that you're interested in. 00:52:30.000 --> 00:52:38.000 So maybe we want to get the title: you can do soup.title. 00:52:38.000 --> 00:52:47.000 And so now you can see we have that title. We could also go the long way of working our way through the code, which is, I think, what I meant to put here first. 00:52:47.000 --> 00:52:56.000 So we would do soup.head, so that would ensure that we are only searching within the head, and then we could do .title 00:52:56.000 --> 00:53:02.000 there. You might also be wondering, how do we get the text? 00:53:02.000 --> 00:53:09.000 Let me also just put this, so we've already done both of these. So you might be wondering, what if I just want the text that's stored within the title? 00:53:09.000 --> 00:53:17.000 So you can get the text stored within any HTML element by doing .text. 00:53:17.000 --> 00:53:18.000 Okay. And so now we have a Python string 00:53:18.000 --> 00:53:22.000 that is the text that was within the element. You might wanna be able to say, what's the parent or the child? 00:53:22.000 --> 00:53:34.000 So if I wanted to know the parent of a particular element, you would just do .parent. 00:53:34.000 --> 00:53:39.000 You could try .child, but here it wouldn't work, because title doesn't have a child. 00:53:39.000 --> 00:53:41.000 So when you do this soup-dot-element's-name, it's always going to give you the first instance of that element. 00:53:41.000 --> 00:53:50.000 So this soup.a will give you the first a. 00:53:50.000 --> 00:54:03.000 And if we went back up to our code, we would go through, go through, go through, and we see the first a that shows up is this one with class sister, id 00:54:03.000 --> 00:54:09.000 link1, and it contains Elsie, and then we can see that that is what was pulled up. 00:54:09.000 --> 00:54:17.000 So you can access this metadata with brackets, sort of like a Python dictionary. 00:54:17.000 --> 00:54:19.000 So if I wanted the class, I could do soup.a, square brackets, 00:54:19.000 --> 00:54:27.000 the string class, and we can see that the class is sister. 00:54:27.000 --> 00:54:31.000 So you might have noticed that there's more than one a. 00:54:31.000 --> 00:54:32.000 So how do I get all of them? So there's this function called 00:54:32.000 --> 00:54:39.000 .find_all, with an underscore between find and all. 00:54:39.000 --> 00:54:49.000 If you input the element you are interested in, it will return all of the a's, or all of the instances of that element that it can find, as a list. 00:54:49.000 --> 00:54:52.000 And then we could just loop through the list like so. I'm uncommenting the for loop 00:54:52.000 --> 00:55:02.000 so I don't have to type it out. So this will loop through that list and print out the class and the text of every a. 00:55:02.000 --> 00:55:05.000 You could also use a list comprehension, and so forth. 00:55:05.000 --> 00:55:09.000 So I see that Jacob has posted a question: in a large HTML file, 00:55:09.000 --> 00:55:13.000 is there an easier way to find the section you need with the data, other than just parsing through the huge file and figuring out what the parents and the children are? 00:55:13.000 --> 00:55:20.000 So, Jacob, we will see some examples of how to do that 00:55:20.000 --> 00:55:25.000 in just a little bit. So if you can hold on to your seats for a few more minutes, we will get to an example
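Here is a sketch of the navigation commands just shown, assuming the soup object built from html_doc above.

```python
print(soup.title)         # first <title> element in the document
print(soup.head.title)    # the long way: search only within <head>
print(soup.title.text)    # the text inside the element, as a Python string
print(soup.title.parent)  # the element containing <title>, i.e. <head>

print(soup.a)             # first <a> element in the document
print(soup.a["class"])    # attributes work like a dictionary (class comes back as a list)

# find_all returns every matching element as a list.
for a in soup.find_all("a"):
    print(a["class"], a.text)
```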
00:55:25.000 --> 00:55:31.000 where we have a real website, and we show how to do that. 00:55:31.000 --> 00:55:32.000 Okay. So I wrote this part as exercises, and you can try and do them with me. 00:55:32.000 --> 00:55:42.000 But to save time, I'm just gonna code it up live and talk it out as we go through. 00:55:42.000 --> 00:55:50.000 I guess if you really don't want to see the answers and you want to practice later, you can walk away from your computer for 2 minutes and come back. 00:55:50.000 --> 00:55:52.000 So if we wanted to find the first p, how would we do that? 00:55:52.000 --> 00:55:56.000 Well, we do soup, and then remember, just .p 00:55:56.000 --> 00:56:05.000 will return the first one. Okay? And then we can see things like the class and the string by doing soup.p. 00:56:05.000 --> 00:56:08.000 So to get the class, we access it like a dictionary. 00:56:08.000 --> 00:56:13.000 So square brackets, class. And then we could do soup.p: 00:56:13.000 --> 00:56:19.000 if we want to get the text within the p, we would do .text. 00:56:19.000 --> 00:56:23.000 For all the a's in the document, we want to find their hyperlink references. 00:56:23.000 --> 00:56:28.000 So remember, soup.find_all of a. You might notice, 00:56:28.000 --> 00:56:29.000 in addition to classes, they have this thing called href. 00:56:29.000 --> 00:56:38.000 So that is, when I click on Elsie, it takes me to this link. 00:56:38.000 --> 00:56:41.000 That is what's contained in the href. 00:56:41.000 --> 00:56:50.000 So we could use: for a in soup.find_all of a, we can print the href of that 00:56:50.000 --> 00:57:01.000 a, so just like a dictionary, but instead of class, this time we have href. 00:57:01.000 --> 00:57:19.000 Okay, so are there any questions before we move on to showing you an example with a real web page? 00:57:19.000 --> 00:57:22.000 Awesome. 00:57:22.000 --> 00:57:28.000 So we're gonna go through an example where we scrape the sports section of FiveThirtyEight. 00:57:28.000 --> 00:57:33.000 And so what we're gonna pretend that we're doing is, let's say we've been hired by somebody. 00:57:33.000 --> 00:57:34.000 And this used to be a very hypothetical thing, 00:57:34.000 --> 00:57:52.000 but now it's maybe a little bit more of a real thing with things like ChatGPT. We're gonna imagine that somebody hired us to scrape websites like FiveThirtyEight to get their reporting, because maybe we're then gonna use that writing to train some sort of AI bot 00:57:52.000 --> 00:57:58.000 to generate new articles in the sports world. 00:57:58.000 --> 00:58:07.000 And so the task the people have given us is, we wanna provide the titles of these articles, the author, 00:58:07.000 --> 00:58:11.000 and then, I don't remember, maybe the href of where the article was. 00:58:11.000 --> 00:58:12.000 I don't remember what I said, but that's what our goal is. 00:58:12.000 --> 00:58:14.000 So as part of that goal, we need to download the HTML code of this website. 00:58:14.000 --> 00:58:30.000 But we don't want to download it by hand. We want to use Python to do it, so we can run it as a script, and while the script's running we can, you know, go do whatever we want to 00:58:30.000 --> 00:58:33.000 do. So the way to do this is with the requests package. 00:58:33.000 --> 00:58:43.000 So I'm gonna copy the URL.
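Before the requests example, a compact sketch of the exercise answers just talked through, using the same soup object as before.

```python
print(soup.p)           # first <p> element
print(soup.p["class"])  # its class, accessed like a dictionary
print(soup.p.text)      # the text stored within it

# Every <a> element's hyperlink reference lives under the "href" key.
for a in soup.find_all("a"):
    print(a["href"])
```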
And so the requests package, which you can import just by doing import requests. 00:58:43.000 --> 00:58:48.000 This comes bundled with distributions like Anaconda, so you shouldn't have to worry about installing it. 00:58:48.000 --> 00:58:53.000 So with requests, and it looks like I've already copied the URL here for myself, 00:58:53.000 --> 00:58:57.000 you can send a request to the website's server to provide you with the HTML 00:58:57.000 --> 00:59:05.000 code for that given page. So we would do r, which is what I'm gonna store it in, equals requests 00:59:05.000 --> 00:59:23.000 .get, and then you input the URL, which for us is this URL right here. And then what gets returned is, I guess I wanted to show, let's do it again without the r, because I wasn't supposed to do that yet. What you'll see now is you 00:59:23.000 --> 00:59:31.000 get back a response from the server, and this response, if we don't store it in anything, is just going to tell us what the status code of the response is. 00:59:31.000 --> 00:59:39.000 So for us the status code was 200, and a 200 response means that everything went A-OK 00:59:39.000 --> 00:59:43.000 and you got the HTML code you're looking for. 00:59:43.000 --> 00:59:46.000 If you see things like 404, or anything in the 500s, 00:59:46.000 --> 00:59:57.000 that means that something went wrong. So, for instance, a 404 response means that you sent a request and the server 00:59:57.000 --> 01:00:04.000 couldn't find the page you specified. 500 responses typically mean that there's something wrong on the website's side. 01:00:04.000 --> 01:00:10.000 So you can find all of the possible response codes for a request at this link, and go through them on your own time. 01:00:10.000 --> 01:00:11.000 For instance, maybe you get a response, and you want to check out what it means. 01:00:11.000 --> 01:00:19.000 You can find it here. So typically 400s and 500s mean that something went wrong 01:00:19.000 --> 01:00:26.000 and you're not getting your data; 200 means that you got your data like you wanted to. 01:00:26.000 --> 01:00:30.000 So now we have the response stored in a variable called r. 01:00:30.000 --> 01:00:46.000 And so we could even check what the status of our response was with r.status_code, and we can see that it was 200, which means we got the data we wanted, and then the HTML code is stored within r.content. 01:00:46.000 --> 01:00:50.000 And so you can see here, this is the HTML code. 01:00:50.000 --> 01:00:54.000 It's much messier than the little simple file we had up above. 01:00:54.000 --> 01:00:58.000 So we can now parse this with BeautifulSoup. 01:00:58.000 --> 01:01:04.000 So it's stored in r.content, and then let's also provide the input 01:01:04.000 --> 01:01:07.000 html.parser. 01:01:07.000 --> 01:01:13.000 And now this is just a sanity check: FiveThirtyEight Sports. 01:01:13.000 --> 01:01:14.000 So I see we have a question: what about CAPTCHAs? 01:01:14.000 --> 01:01:20.000 Would we be blocked as a bot for trying to scrape websites? 01:01:20.000 --> 01:01:21.000 Yep, that can happen. So a lot of websites will now prevent something like this from happening 01:01:21.000 --> 01:01:32.000 if they notice that you're trying to use this sort of request. There are workarounds, but it's dependent upon the website. 01:01:32.000 --> 01:01:37.000 So some of them, if they have a CAPTCHA, you know, those are typically hard to get around, right?
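A rough sketch of the request-and-parse step just walked through. The URL here is assumed to be the sports-section address copied in the lecture, and the page layout may have changed since then.

```python
import requests
from bs4 import BeautifulSoup

url = "https://fivethirtyeight.com/sports/"  # assumed URL from the lecture
r = requests.get(url)

print(r.status_code)  # 200 means everything went fine

# The raw HTML lives in r.content; hand it to BeautifulSoup to parse.
soup = BeautifulSoup(r.content, "html.parser")
print(soup.title)     # sanity check that we got the page we expected
```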
01:01:37.000 --> 01:01:46.000 That's the whole point of them. But there are things like the Selenium package that allow you to interact with JavaScript-type stuff on a web page to try and get data. 01:01:46.000 --> 01:01:47.000 But for a lot of websites you can usually just do something like this, and it will be okay. 01:01:47.000 --> 01:01:57.000 Where the problem comes in is if you're trying to send a lot of requests to the same website. 01:01:57.000 --> 01:02:02.000 Like, let's say I tried to send thousands of requests to FiveThirtyEight 01:02:02.000 --> 01:02:06.000 in a short amount of time. I would probably get blocked by FiveThirtyEight. 01:02:06.000 --> 01:02:13.000 Later in the notebook we talk about what you should do to prevent that sort of thing when you're scraping. 01:02:13.000 --> 01:02:19.000 But that can happen. 01:02:19.000 --> 01:02:22.000 Okay. So I believe Jacob asked earlier, how can I go through without having to parse the code by hand? 01:02:22.000 --> 01:02:39.000 This is an example where, if you were to try and just read the code as it is, it would take you a very long time. So what we're gonna use is something called the web developer tools. 01:02:39.000 --> 01:02:42.000 So I'm using Firefox. If you're using a different web browser, you will still have web developer tools, 01:02:42.000 --> 01:02:50.000 but you'll have to double-check how to get to them. I believe 01:02:50.000 --> 01:02:53.000 here I show how to do it for Firefox, 01:02:53.000 --> 01:03:01.000 Google Chrome, and Safari, which I believe are the 3 most popular. If you're using a different web browser, you'll have to figure it out on your own with a web search. 01:03:01.000 --> 01:03:08.000 So for Firefox, you go to Browser Tools and then click on Web Developer Tools, 01:03:08.000 --> 01:03:19.000 although I need to go back to the FiveThirtyEight page and then do it there. 01:03:19.000 --> 01:03:24.000 Okay? And so what's nice about this is, this is the console, 01:03:24.000 --> 01:03:39.000 but in the inspector you can see the HTML code, and you'll notice how things are changing as I hover over them. What we're gonna use is this little tool here; it's in Safari, Chrome, and Firefox, so I'm 01:03:39.000 --> 01:03:43.000 assuming it's in most browsers' web developer tools. 01:03:43.000 --> 01:03:59.000 If you click on this little icon that has a box with the cursor arrow in it, it will allow you to go to the different elements of the website, and then if you go back and look at the code at the bottom, it's highlighting the HTML code that 01:03:59.000 --> 01:04:03.000 is coding up that part of the website. So for us, what we're gonna focus on in this example is getting the titles of the articles. 01:04:03.000 --> 01:04:15.000 And then I believe, either in a problem session or as a practice problem, you can go through and try to get the rest of the information I mentioned that we might want. So for us, 01:04:15.000 --> 01:04:25.000 the way we can use this is by clicking on the elements, 01:04:25.000 --> 01:04:29.000 then coming back down to the code and looking here. And so we can see that this title is stored within an h2, 01:04:29.000 --> 01:04:38.000 and that this h2 has the class article-title entry-title. 01:04:38.000 --> 01:04:46.000 And so we can go back, and we will then demonstrate this. So we can do soup 01:04:46.000 --> 01:04:53.000 .find_all, and now we want the h2s.
01:04:53.000 --> 01:04:56.000 But this is gonna give us more h2s than we want. 01:04:56.000 --> 01:05:04.000 So, we haven't seen this before, but you can further specify that you want the h2s 01:05:04.000 --> 01:05:07.000 of a particular class. You provide a dictionary, where in that dictionary you specify: 01:05:07.000 --> 01:05:10.000 I want all the h2s of a particular class. And so what's that class going to be? 01:05:10.000 --> 01:05:21.000 It's gonna be this one. So copy, paste, string, paste. 01:05:21.000 --> 01:05:22.000 And so now you'll see that we get a bunch of 01:05:22.000 --> 01:05:26.000 h2s, and what we could do is loop through that. 01:05:26.000 --> 01:05:35.000 So for a in here, print a.text. 01:05:35.000 --> 01:05:38.000 Okay. And so another thing we can do, this is just a string function: 01:05:38.000 --> 01:05:43.000 we can get rid of all that annoying whitespace with .strip. 01:05:43.000 --> 01:05:50.000 And then this just gives us all the titles. So we've got the titles, and we could go back and double-check. 01:05:50.000 --> 01:06:10.000 So we got the Andrew McCutchen one, Fernando Tatís, with Anthony Richardson being after that, the bucks-busting one, and then we've got the Mikal Bridges one, MLB 01:06:10.000 --> 01:06:13.000 prospects. Mikal Bridges, MLB 01:06:13.000 --> 01:06:18.000 prospects. And then let's go ahead and check the last 3: Rays, WNBA, and big bad Bruins. 01:06:18.000 --> 01:06:33.000 So we got all the titles we wanted very quickly, just like this. And if we wanted to do it quickly without a for loop, we could use a list 01:06:33.000 --> 01:06:48.000 comprehension. So a.text.strip() for a in this soup.find_all. 01:06:48.000 --> 01:06:53.000 So now we have it in a list, which we could put in a DataFrame or something. 01:06:53.000 --> 01:07:00.000 So to get the authors, you would do the same exact thing. 01:07:00.000 --> 01:07:09.000 Let's go ahead and use our tool. So we'll click on this. 01:07:09.000 --> 01:07:13.000 Okay, we can see that the author is stored in a p 01:07:13.000 --> 01:07:20.000 that has the class single-metadata-card, space, vcard. So let's try that. 01:07:20.000 --> 01:07:39.000 So we would do soup.find_all, p, class, and then just get rid of this part that got copied over, and then once again, we can do for p in: 01:07:39.000 --> 01:07:52.000 we want p.text, okay? And so then we could do extra cleaning on this if we would like, and get rid of the 'By's if we wanted to. 01:07:52.000 --> 01:08:00.000 Okay, so are there any questions before we move on to the last section of this notebook? 01:08:00.000 --> 01:08:07.000 I had a question. This might just be preference, but I noticed that you have some single quotes at times, and then double quotes other times. 01:08:07.000 --> 01:08:15.000 Is that just? Yeah, I guess, any insight on that? 01:08:15.000 --> 01:08:21.000 Yeah, so this is just preference. I just kinda like the way it reads: this is the HTML element, 01:08:21.000 --> 01:08:28.000 this is the HTML element's class, 01:08:28.000 --> 01:08:36.000 this is the text. So I guess it's just a preference that I've internalized; you could just do all single quotes or all double quotes. 01:08:36.000 --> 01:08:39.000 It wouldn't matter. Like here, I guess I did double quotes. 01:08:39.000 --> 01:08:43.000 I think I just use whatever my fingers do at the time. 01:08:43.000 --> 01:08:44.000 Thank you.
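For reference, a sketch of the title and author scraping just demonstrated. The class strings are the ones read off the inspector during the lecture; they are specific to the site's layout at the time and may have changed.

```python
# Article titles: all h2 elements whose class attribute exactly matches
# the string copied from the inspector.
titles = [
    h2.text.strip()
    for h2 in soup.find_all("h2", {"class": "article-title entry-title"})
]
print(titles)

# Authors: the p elements holding the byline, with the leading "By " removed.
authors = [
    p.text.replace("By ", "").strip()
    for p in soup.find_all("p", {"class": "single-metadata-card vcard"})
]
print(authors)
```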
01:08:44.000 --> 01:08:46.000 Yeah. 01:08:46.000 --> 01:08:55.000 Sorry, just as a quick one, can you show us how to get rid of these 'By's in the list? 01:08:55.000 --> 01:08:56.000 Thank you. 01:08:56.000 --> 01:09:04.000 Yes, so these are all Python strings, and Python strings have this built-in method called replace. 01:09:04.000 --> 01:09:10.000 And so I would just replace 'By' space with nothing, 01:09:10.000 --> 01:09:23.000 the empty string, and then, to be extra safe, I would probably throw in a .strip at the end to get rid of any whitespace on either side. 01:09:23.000 --> 01:09:31.000 Any other questions? 01:09:31.000 --> 01:09:32.000 Yup, yup, so you could have also done that. This would work as well because of that dot 01:09:32.000 --> 01:09:35.000 Thank you. So it is 'By' space. Yes, 01:09:35.000 --> 01:09:40.000 strip at the end; so strip will get rid of the whitespace on the outside of the string. 01:09:40.000 --> 01:09:48.000 Yes, right? Thank you. 01:09:48.000 --> 01:09:49.000 Yeah. 01:09:49.000 --> 01:09:50.000 Okay. Can you go back to the HTML on the other page? 01:09:50.000 --> 01:09:53.000 So down there, it says, by, a class equals author 01:09:53.000 --> 01:10:01.000 url fn. So could you also, right, have it search for the class 01:10:01.000 --> 01:10:08.000 that's just author url fn, and it would just get rid of the 'By' as well, or something? 01:10:08.000 --> 01:10:15.000 Yeah, so we could try that. So let's do for a in soup.find_all, and then I'll use this as an example 01:10:15.000 --> 01:10:25.000 to show it doesn't just have to be class. So you notice that this also has a thing called rel that is equal to author. 01:10:25.000 --> 01:10:31.000 So we could do rel, colon, author. 01:10:31.000 --> 01:10:35.000 And then I should probably finish the rest of my for loop: 01:10:35.000 --> 01:10:47.000 print a.text. So here you get that. Now, here's the reason why, and I cheated because I recorded this video like 2 weeks ago for the pre-recorded version. 01:10:47.000 --> 01:10:52.000 If you'll notice, at the bottom here we have Neil Paine and Terrence Doyle. 01:10:52.000 --> 01:10:54.000 And so if we go to the page, these 2 both wrote the article together. 01:10:54.000 --> 01:11:04.000 So when you do it with just the a, the authors get separated out. 01:11:04.000 --> 01:11:07.000 So that's why I went with using the p, 01:11:07.000 --> 01:11:20.000 because the authors are contained together there. So that's sort of the reason, and one of the things we'll talk about in just a second is, you know, web code is kind of messy. 01:11:20.000 --> 01:11:21.000 FiveThirtyEight is actually a really nice, clean website with its code, 01:11:21.000 --> 01:11:37.000 but sometimes it's kind of messy. So you just have to play around and double-check, is this doing what I think it's doing, before implementing it to run overnight to scrape data or something. 01:11:37.000 --> 01:11:47.000 So a related question is, if you open up the developer view on Chrome and right-click, there are some options to copy identifiers for an element. 01:11:47.000 --> 01:12:05.000 So, for example, you can copy the full XPath. Is there a nice method for soup or another package to convert this full path, which is a unique identifier for the element, into the text, such that you can just get the relevant HTML? 01:12:05.000 --> 01:12:08.000 So! 01:12:08.000 --> 01:12:13.000 Is this like? So I guess there's 2 questions.
01:12:13.000 --> 01:12:22.000 So one would be, are you saying that you want to put this as part of the URL for the request, so it would take you to that specific element? 01:12:22.000 --> 01:12:23.000 Okay. Okay. 01:12:23.000 --> 01:12:26.000 No, I'm saying that after you have the code with soup, 01:12:26.000 --> 01:12:27.000 Yeah. 01:12:27.000 --> 01:12:29.000 and it's this big mess, you can use developer tools to highlight, on the page, 01:12:29.000 --> 01:12:36.000 something you think is interesting. You've identified it visually, and then Chrome will also provide you a full path to that element in the code. 01:12:36.000 --> 01:12:45.000 And so then, ideally, you could just pass that path to BeautifulSoup, and then it would pull up the content 01:12:45.000 --> 01:12:47.000 at that location. 01:12:47.000 --> 01:12:54.000 So I think you should be able to, but you would have to, like, 01:12:54.000 --> 01:13:00.000 I don't know that you can provide it with the slashes. So you could do something like soup 01:13:00.000 --> 01:13:13.000 .html.body.div. I don't know what the, you know, the 2 is, but then, like, you could try 01:13:13.000 --> 01:13:18.000 and then, you know, use this and write a function to clean it up. 01:13:18.000 --> 01:13:22.000 Well! 01:13:22.000 --> 01:13:23.000 Okay, I mean. 01:13:23.000 --> 01:13:29.000 But that would do it. Yeah, yeah, I think it might be possible, but I would have to look into it. 01:13:29.000 --> 01:13:30.000 Yup. Yeah. 01:13:30.000 --> 01:13:36.000 Okay, thank you. I think that makes sense. 01:13:36.000 --> 01:13:42.000 Any other questions? 01:13:42.000 --> 01:13:47.000 Okay, so we'll wrap up this notebook by going over some common problems that you'll run into, 01:13:47.000 --> 01:13:50.000 and we've touched on a couple of them along the way. 01:13:50.000 --> 01:13:55.000 So the first is just that websites can be written by anybody. 01:13:55.000 --> 01:14:05.000 You just need to have something to write them on, and then own a domain and also have a server that you can access. 01:14:05.000 --> 01:14:09.000 So that means there's not somebody guarding the Internet, making sure everybody's code looks really nice. 01:14:09.000 --> 01:14:15.000 So sometimes there's really messy code. There's also code 01:14:15.000 --> 01:14:21.000 that's just not well labeled. So this was really easy, because FiveThirtyEight's HTML code is really well labeled and searchable. 01:14:21.000 --> 01:14:27.000 There will often be websites that maybe you want the data from, 01:14:27.000 --> 01:14:30.000 but they don't have classes or IDs or anything. 01:14:30.000 --> 01:14:35.000 So you might just have to try and play around with, like, for loops to make it work. 01:14:35.000 --> 01:14:40.000 So that's a problem, and a lot of times web scraping just requires a lot of trial and error. 01:14:40.000 --> 01:14:42.000 And honestly, sometimes it's just not doable with how people maintain their websites. 01:14:42.000 --> 01:14:49.000 So that's more of an issue on your end of things. 01:14:49.000 --> 01:15:00.000 The other thing that we sort of touched on earlier is, you can be banned for sending too many requests. So if you send a lot of requests in a very short amount of time, 01:15:00.000 --> 01:15:04.000 a lot of websites' servers are set up to then prevent you from sending a request, or receiving a response to your request,
01:15:04.000 --> 01:15:14.000 for some amount of time. So the best way to prevent this from happening, and also just to be considerate: 01:15:14.000 --> 01:15:23.000 every time you send a request, it's kinda like somebody clicking on the website, so if you send a whole bunch of requests in a short amount of time, it can overload 01:15:23.000 --> 01:15:26.000 the website's servers, and you're not the only one who's trying to access the website. 01:15:26.000 --> 01:15:36.000 So it's just kind of good ethical practice to not, you know, overload a website's servers with requests just to get their data. 01:15:36.000 --> 01:15:45.000 So one way to prevent this from happening is to use the time module, which has this function called sleep, where you can input a certain amount of sleeping time. 01:15:45.000 --> 01:15:51.000 And then, let's say you wanted to go through all of these articles and pull the text out of them. 01:15:51.000 --> 01:15:58.000 You could put in a sleep timer of, like, a few seconds in between pulls. That will, you know, be a little bit nicer on the servers. 01:15:58.000 --> 01:16:03.000 It also does, you know, decrease your chances of being flagged. 01:16:03.000 --> 01:16:07.000 It's also just about, you know, being a good Internet citizen. 01:16:07.000 --> 01:16:14.000 Another thing that can happen, that was also talked about earlier, is, even if, let's say, you do this, and you're like, all right, I'm gonna wait a long time between requests, 01:16:14.000 --> 01:16:20.000 just sending a request with the requests module can automatically flag you as a bot. 01:16:20.000 --> 01:16:40.000 So if you're flagged as a bot this way, there are some workarounds that exist, but there's nothing that's set in stone and always going to work, so you'll just have 01:16:40.000 --> 01:16:43.000 to do a web search for, like, I got this particular block for my request, 01:16:43.000 --> 01:16:44.000 how can I get around it? Sometimes it's just not possible, 01:16:44.000 --> 01:16:49.000 but there may be other ways to get the data, like using an API, which we'll talk about in the next notebook. 01:16:49.000 --> 01:16:54.000 The last thing is, there's sometimes user-interactive content. 01:16:54.000 --> 01:17:00.000 So sometimes data might not be available until a user interacts with something, 01:17:00.000 --> 01:17:13.000 and then the data is sent from the server based on that interaction. So there are ways to work around that; one such way is the Selenium package, which you can walk through on your own. This sort of mimics having a browser open, 01:17:13.000 --> 01:17:17.000 sending inputs from your code, and then receiving the data from the website. 01:17:17.000 --> 01:17:23.000 So I encourage you to check it out if you're interested in getting data that requires user interactions. 01:17:23.000 --> 01:17:31.000 So that's it for this notebook. Because we only have 9 minutes left, I'm not gonna take questions just yet, 01:17:31.000 --> 01:17:34.000 so I have time to go through the last notebook. It's a short notebook, so don't worry about the limited time.
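Before moving on, a sketch of the time.sleep idea just described. The article_urls list is hypothetical; in practice it would be filled with the hrefs scraped above.

```python
import time

import requests

article_urls = []  # hypothetical: fill with hrefs gathered from the scrape above

for url in article_urls:
    r = requests.get(url)
    # ... parse r.content with BeautifulSoup here ...
    time.sleep(5)  # pause a few seconds between requests to go easy on the server
```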
01:17:34.000 --> 01:17:39.000 So another way to get data from websites or applications is using what's known as an API. 01:17:39.000 --> 01:17:52.000 So API stands for application programming interface, and it's sort of a go-between for 2 applications. 01:17:52.000 --> 01:17:59.000 And so for our purposes, we can think of ourselves as a single application that wants to get data from a website or another app, 01:17:59.000 --> 01:18:05.000 and then whatever we wanna get data from is the other application. 01:18:05.000 --> 01:18:10.000 So one way I like to think of this is, an API is kind of like a waiter at a restaurant. 01:18:10.000 --> 01:18:15.000 So you, the customer, come in. You look at the menu, which for us is like looking at the website, 01:18:15.000 --> 01:18:16.000 seeing what data we want. We tell the API, using Python, hey, 01:18:16.000 --> 01:18:22.000 I want this data. Then the API takes the request and interprets it in a way that 01:18:22.000 --> 01:18:31.000 the servers of the app can understand, and then gives it to the web app. 01:18:31.000 --> 01:18:34.000 Then, after the web app gets the request, it figures out if it can even do what we're asking it to do. 01:18:34.000 --> 01:18:40.000 So sometimes you need to have authorization to get certain data, 01:18:40.000 --> 01:18:43.000 that sort of thing. It prepares its reply. 01:18:43.000 --> 01:18:45.000 So maybe the reply is the data we requested, 01:18:45.000 --> 01:18:52.000 or maybe it's a response saying, I'm sorry, I can't provide that for you. It gives that response to the API, 01:18:52.000 --> 01:19:01.000 which then takes the response, interprets it in a way that we can understand in terms of Python, and then provides it to us. 01:19:01.000 --> 01:19:07.000 So in a lot of cases we might be able to write some sort of BeautifulSoup code to get the data, 01:19:07.000 --> 01:19:16.000 but sometimes that's not feasible, and there are APIs that we can use pretty quickly with what are known as Python wrappers. 01:19:16.000 --> 01:19:18.000 So if an API exists, and it's easily accessible through Python, it's probably better to use the Python package 01:19:18.000 --> 01:19:28.000 that's been written to access that API than to just try writing the BeautifulSoup code. 01:19:28.000 --> 01:19:34.000 So, for one example, scraping Reddit or Twitter using BeautifulSoup is very difficult, 01:19:34.000 --> 01:19:43.000 so it's probably better to use the Python wrapper for the API, even though Twitter is sort of a special case. 01:19:43.000 --> 01:19:44.000 Now, I probably would avoid scraping that data just based on what's going on right now, 01:19:44.000 --> 01:19:51.000 and I don't know that it's free anymore. 01:19:51.000 --> 01:19:57.000 So APIs are a thing that don't need Python, 01:19:57.000 --> 01:19:58.000 but there are Python wrappers for APIs. 01:19:58.000 --> 01:20:07.000 These are Python packages that have been written specifically to take Python commands as input to the API. 01:20:07.000 --> 01:20:16.000 So there are people out there that want to access APIs using Python, and they write these sorts of packages and then provide them open source. 01:20:16.000 --> 01:20:25.000 Some very popular examples are: for Spotify there's one called Spotipy; for Reddit 01:20:25.000 --> 01:20:30.000 there's the PRAW package, which is the Python Reddit API Wrapper.
01:20:30.000 --> 01:20:31.000 The New York Times has one called pynytimes, which allows you to get data from the New York Times, 01:20:31.000 --> 01:20:45.000 and so forth. So we'll end this notebook by just giving an example of using one of these Python wrappers. 01:20:45.000 --> 01:20:50.000 So we're gonna show you how to use the pynytimes wrapper. 01:20:50.000 --> 01:20:56.000 And in order to use this, you need to have it installed, and it's not standard to have it installed. 01:20:56.000 --> 01:20:58.000 So, you know, from what we've talked about earlier today, you can install it following those instructions. 01:20:58.000 --> 01:21:16.000 If you run this and it runs just fine, you have it installed. For most, I don't know about all, but most APIs, you need to have a developer key. So you can get the New York Times 01:21:16.000 --> 01:21:29.000 API developer key by following these 2 links and going through the instructions. Once you have the API key, there's a Python file, which I may have, I'll double-check that 01:21:29.000 --> 01:21:32.000 I've uploaded it; if I haven't, I'll upload it after the lecture. 01:21:32.000 --> 01:21:38.000 So there's a Python file called my_api_info.py. 01:21:38.000 --> 01:21:42.000 You can edit it and change this string, your key, 01:21:42.000 --> 01:21:48.000 here, to provide your API key, which you would get by following the instructions on those websites. 01:21:48.000 --> 01:21:51.000 Once you have that API key, this is going to allow you to access data using the New York Times API. 01:21:51.000 --> 01:22:13.000 So, importantly, API keys and authentication numbers are sort of your identity to the API, so it's important that you keep your key secret and don't go posting it on public repositories, because once you do that, somebody else can get it and do whatever they want with your 01:22:13.000 --> 01:22:17.000 key. So think of this sort of like a Social Security card, 01:22:17.000 --> 01:22:24.000 but for accessing an API, something like that. So typically what you can do, and what I'll do, is I'll make a Python file 01:22:24.000 --> 01:22:25.000 that's only on my own computer, or on a server that only I have access to, 01:22:25.000 --> 01:22:38.000 and then I'll import the function from that Python file, get_nytimes_key, and then run it without ever looking at 01:22:38.000 --> 01:22:48.000 the result. Another thing you can do is, you can, I believe, change the settings of a server or your computer to store your API key there, and it would just have it loaded. 01:22:48.000 --> 01:22:51.000 I don't know how to do that, but I know it's something people do. 01:22:51.000 --> 01:22:55.000 So I have an API key, so I'll be showing you how to use it 01:22:55.000 --> 01:23:03.000 once you have it. And then, if you're interested in working with this, you can check out these instructions here to figure out how to do 01:23:03.000 --> 01:23:08.000 so. Okay, so how do we do this? So the first thing we need to do is import 01:23:08.000 --> 01:23:14.000 the class that allows us to connect to the API. So from pynytimes 01:23:14.000 --> 01:23:19.000 we'll import NYTAPI. 01:23:19.000 --> 01:23:22.000 So, if you're trying to run this later, you would edit it, 01:23:22.000 --> 01:23:25.000 because you do not have the file matt_
01:23:25.000 --> 01:23:34.000 api_info, but I do. You would have to edit this to my_api_info, and then it'll import the function. 01:23:34.000 --> 01:23:38.000 Apparently I lied. I guess I don't have this file. 01:23:38.000 --> 01:23:41.000 So that will make the rest of this difficult to finish, 01:23:41.000 --> 01:23:44.000 but that's okay, because we only have a few minutes anyway. 01:23:44.000 --> 01:23:57.000 So probably what's best is, if you're really interested in seeing me access the New York Times API, you can go through and watch the video that I've made for this lecture. 01:23:57.000 --> 01:24:02.000 I guess I forgot to check that I had this file in there, which is not good. 01:24:02.000 --> 01:24:06.000 But basically, you'll just go through. You would then provide, hypothetically, the key. 01:24:06.000 --> 01:24:07.000 Hi, Matt! 01:24:07.000 --> 01:24:08.000 Yeah. 01:24:08.000 --> 01:24:13.000 I think it's, you're using matt_api_info instead of my. 01:24:13.000 --> 01:24:18.000 So this is supposed to be a file that has my API key. 01:24:18.000 --> 01:24:22.000 Yeah. 01:24:22.000 --> 01:24:32.000 So this is a different file from the 'my' one; this one has my key, which you guys don't have access to. So that's why it says matt instead of my. 01:24:32.000 --> 01:24:33.000 I see. 01:24:33.000 --> 01:24:38.000 Yeah, yeah. But once you have that, this is how you do it. 01:24:38.000 --> 01:24:48.000 So the function get_nytimes_key would provide the string of your key, and then the next argument you would provide is parse_dates equals 01:24:48.000 --> 01:25:07.000 True. This just allows you to use datetimes. Then you can again check out the completed version; this will show you how to get results with specific keyword queries, like, for instance, basketball, and then the dates, like, I want articles from, for example, March 01:25:07.000 --> 01:25:11.000 1st, 2023 to April 19th, 2023. 01:25:11.000 --> 01:25:18.000 Okay. So again, because I don't have the file in this current version of the repository, 01:25:18.000 --> 01:25:22.000 I would have to add it. Again, check it out in the pre-recorded lecture 01:25:22.000 --> 01:25:28.000 if you want to see the code in action. This was just an example. 01:25:28.000 --> 01:25:29.000 It's not super important. The important thing is that you get the gist: 01:25:29.000 --> 01:25:34.000 there are API wrappers for Python, 01:25:34.000 --> 01:25:41.000 they have documentation links that you can read through to figure out how to use them, 01:25:41.000 --> 01:25:42.000 here's the example for pynytimes, and these are sometimes easier to use than writing BeautifulSoup code. 01:25:42.000 --> 01:25:58.000 Okay. And with that, I will close it off for today. I will stick around for a little bit to answer any questions, but otherwise I will see you tomorrow, when we start using data science tools like models, 01:25:58.000 --> 01:26:07.000 that sort of thing.
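For reference, here is a sketch of what the completed version of that notebook runs, assuming you have put your key in my_api_info.py inside a get_nytimes_key() function as described above (both names follow the lecture's setup).

```python
from datetime import datetime

from pynytimes import NYTAPI
from my_api_info import get_nytimes_key  # assumed helper file holding your key

# parse_dates=True lets the wrapper hand back Python datetime objects.
nyt = NYTAPI(get_nytimes_key(), parse_dates=True)

# Search for articles matching a keyword within a date range.
articles = nyt.article_search(
    query="basketball",
    dates={
        "begin": datetime(2023, 3, 1),
        "end": datetime(2023, 4, 19),
    },
)
print(len(articles))
```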