Awesome. Okay, so I'm gonna go ahead and start recording.

Alright, so welcome. Today's the second day of lecture for the May 2023 boot camp. Today we're gonna start the data content, beginning with data collection. So let me go ahead and share my Jupyter notebook and get my chat window open.

This is what you should see if you were able to successfully clone the repository and open it with a Jupyter notebook. If you're unable to get to something like this, I ask that you hold your questions until after lecture is done, and then I'll stick around for a few minutes to make sure that you're able to get set up.

Okay. So every day when we do lectures, or if you're watching the lectures asynchronously, either the pre-recorded or the live lectures, you're gonna go here, click on Lectures, and then navigate your way to the content we'll be covering. Today we're gonna skip the introduction, which we kind of did yesterday, and go straight to data collection.

So remember, the goal of the boot camp is to give you the skills to complete an end-to-end data science project that you can talk about in interviews, or write about on your resume when applying to data science positions. These projects are also helpful if, let's say, you want to stay in research but do data science research; they'll help you build skills for that type of work too.

The first key, though, for doing any sort of project, whether it's in a job position or as part of a research project, is collecting data. You can't do a data science project without data, so that's what today is going to cover: giving you the skills to find data sets that already exist, or the skills to create your own data sets using the Internet. The first notebook we're going to work on is this "data source websites" one.

So what's going to happen for all lectures, now that we've gone through the intro, is I'm gonna open a lecture version of the notebook, and this lecture version I'll fill out during the live lecture and upload later tonight or tomorrow morning. That way, if you want to come back and see what Matt's notes were during the lecture, you can always check this copy. Once my kernel starts, we'll be able to code and stuff, but I don't think there's much code in this one.

Okay, so what are we gonna talk about in this notebook? We're gonna talk about data source websites. What are data source websites? These are just websites that exist on the Internet that you can use as sources of data. They have data sets that are all ready to go. There are two main types that we're going to talk about.
The first is known as a data repository, and the second is a data competition site. There are additional types of data websites, but these are the main two that we'll focus on today.

So what's a data repository? This is any website where data sets are deposited. There's a couple of reasons why these might exist: maybe for academic research, or to house some data that a website or company is using. There are lots of different reasons to have a data repository. Some very specific examples: maybe a site is housing data associated with published academic research; maybe a news organization did a data-focused piece, and now they're holding that data so other people can look at it or check their facts; and then, finally, sometimes there will be websites that host benchmark data sets used to compare different algorithms that are being developed. We'll see some examples of that throughout the boot camp.

One of the main kinds is known as an academic repository. As an example, here's a link to the UC Irvine Machine Learning Repository. These have various different data sets that you can check out, like the newest ones as well as the most popular. This iris data set you'll see when we do classification, I believe. Here you can see it has the creator, the person that donated the data set, and various pieces of information about the data set, like what the data columns look like, as well as papers that cite the data. On this particular website, if you go to the data folder, there will be links to the data itself that you can click to download.

So that's one example, and here are some additional examples that you might be interested in checking out. Other examples are GitHub repositories, that is, repositories that just exist to store data. These are used a lot by websites and news organizations. For example, here are links to the FiveThirtyEight, the New York Times, and the pudding.cool repositories. If we click on this, it will take us to the GitHub repository for FiveThirtyEight, and they have a data repository within that. This contains all the data for their website, their articles. And I believe the one we'll look at in this notebook is called candy-power-ranking. This was a piece they did; I believe, within FiveThirtyEight, they rated the various different types of Halloween candy one year and then made a fun video about it. So if we click on the .csv file, this is the data set that we would be using.
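Here's a minimal sketch of what loading that file with pandas can look like; the raw-file URL below is a best guess at the current path in the fivethirtyeight/data repository, so verify it before relying on it:

```python
import pandas as pd

# Read the CSV straight from the raw GitHub URL (needs an internet connection).
# This path is an assumption -- check the fivethirtyeight/data repo for the current one.
url = ("https://raw.githubusercontent.com/fivethirtyeight/data/"
       "master/candy-power-ranking/candy-data.csv")
candy = pd.read_csv(url)

# Or, after saving the raw file by hand, read the local copy instead.
# candy = pd.read_csv("candy-data.csv")

print(candy.head())
```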
So this is what it looks like here, but if we want to copy it, we'll probably want the raw version of the data, and then we could use our web browser's save feature to save it as a CSV file and upload it. I believe that I've already done that. So if I import pandas and then run this... I thought that might happen. I'm not gonna spend the time to download it; you can download it yourself, and if you download it you'll be able to do this. I'll download it after the lecture, because we don't need to waste time watching me download a file. But once we do that, you'll be able to load it.

The other way you can do it with GitHub, which is nice, is that because we have a link to the raw CSV file on the GitHub website, pandas allows us to just input the link and read it directly from the Internet. Now, you'll need an Internet connection to do this, but you can see, now that I've run this, it's connected and I can look at the data in my Jupyter notebook. I can do other things to the data, like take a random sample of size 4, and now we can see the random sample. So this is something you can do with the raw GitHub file, or any other online address that links to a data file: you can use pandas and just provide the link, assuming you're connected to the Internet.

So that's sort of it for data repositories. Just some quick guidelines on how you can use a data repository. If you're using data that you didn't create yourself, it's important to make sure that you provide a citation of where you got the data, and make sure you follow whatever data-use guidelines are associated with it. Sometimes data repositories will say, if you use this, please cite this particular paper, or they'll have rules that you can use the data, but not in any sort of commercial product; so you couldn't use the data to train an algorithm that you then monetize. Just read the guidelines they have available on their website and follow those. And even if they don't have guidelines on citation, don't pretend that you generated the data yourself. Make sure you cite where you got the data, even if their guidelines don't say, hey, make sure you cite us.

The other type of data website that you might be interested in is known as a data competition website. These are websites that exist to host data competitions. These websites will do a large number of things, including just publicly storing data, but they'll also host the competition, and they'll specify rules as outlined by whoever's providing the competition.
They'll accept entries to the competition from people like you that might want to enter, and then they'll provide criteria and help determine the winner or winners of the competition. So while for a lot of these websites the main purpose is to provide a competition that you can join, they also often serve as a source of data for personal projects.

For instance, one of the most popular is Kaggle.com. On Kaggle.com you can see they have the competitions here that you could go through, but they also just have regular data sets that aren't necessarily involved with a competition. So you could click on a data set, let's say maybe a vehicle data set, and then we can scroll down and see here's what the data set looks like, and that there are different files. If we click on this, a different data set loads, and then if we click this download button, it would download all of this data for us to have.

Again, these sorts of websites have rules about how to use the data, licenses, and that sort of thing. So be mindful of what the website is asking you to do if you're going to use the data set, and follow that. For instance, if you got your data set from a competition, a lot of competitions will say you cannot publish the data or your work until the competition closes, so be mindful of that: what are the rules, what are the regulations for the data competition, if that's where you're getting your data from?

Another important part of this is that these Kaggle competitions often come with monetary prizes, so if you do well enough, you could win money, or just swag. Some of them are just offering the reward of knowledge, which is maybe all you're looking for. So just keep that in mind.

Okay, so for the sake of time I will skip the example of extracting data, because I trust you guys to be able to click the download button and move your files around. But you can try and practice on your own by going through this and seeing if you can get the iris.csv file into this folder and loaded by following these instructions.

Okay, so before we move on to the next notebook, are there any questions about data competition websites or data repositories, anything like that?

I have a question.

Yeah.

So you were talking about getting data from sites like a GitHub repo. Even if the authors don't say anything about citing, you should still cite it, right? Still cite where you got the data? How should I cite it?

Yeah, so you should just say, the data was retrieved from this website.
And then if it's in a paper or something, follow whatever the standards are for citations, just like you would any other source.

Gotcha, gotcha. Thank you.

Yup, yup! Any other questions? Okay.

Alright! So the next thing we're gonna talk about is: maybe you have data that's stored in some sort of database. Basically, this is just going to be, how can I get that data from the database into pandas? How can I load database data into things like pandas, into Python, into Jupyter notebooks, so you can manipulate it?

In the previous examples from the websites, those were CSV files, and those are just singular files. But sometimes data is too complicated for a series of one-off files. In situations where data is that kind of complicated, people will often store it in a database. So in this notebook we'll introduce the idea of a database, we'll introduce the language that people use to communicate with databases, or relational databases, and then we'll show you how to access the data using a package called SQLAlchemy.

So what is a relational database? A lot of times, businesses or other entities will have a series of data tables that are interrelated with one another. For instance, let's imagine we have a hypothetical business, and it's a sales business, so they sell items. Maybe they have a table that keeps track of all the purchases that get made, and maybe they have you set up a profile, like Amazon, where you have to have an Amazon profile to make a purchase, so they also have a table of their customers. So they might have this purchases table with a purchase ID column that keeps track of the unique identity of each purchase, as well as a customer ID column that keeps track of the customer that made the purchase, and then all the other related data they might want, like the name of the product, how many were bought, the price, all that sort of thing. At the same time, this could be linked to a customer table, where the customer ID is shared between both the purchases and the customer tables, and the customer table would contain information on the individual customers. If you're thinking of profiles you've created, it might have things like your address, your credit card information if it's Amazon, things like your age, other information about your customers that you might want to have in order to better sell to them, or something like that.
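To make that concrete, here's a toy sketch of two such linked tables; all of the column names and values below are made up for illustration:

```python
import pandas as pd

# A purchases table: each row is one purchase, tied to a customer by customer_id.
purchases = pd.DataFrame({
    "purchase_id": [1, 2, 3],
    "customer_id": [101, 102, 101],
    "product":     ["litter", "scratching post", "cat food"],
    "price":       [12.99, 24.50, 8.75],
})

# A customers table: one row per customer profile.
customers = pd.DataFrame({
    "customer_id": [101, 102],
    "age":         [34, 27],
})

# The shared customer_id column is what lets you relate the two tables.
print(purchases.merge(customers, on="customer_id"))
```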
And so the customer ID column of the purchases table can be linked to the customer ID column of the customer table, and this allows you to query the data in a way that you could say: okay, give me the purchases of all of my male customers from ages 20 to 40. Then maybe you can use that to make some sort of marketing decisions, like: oh, okay, they tend to buy this, so I will advertise these things; or they tend to not buy this, so maybe I'll incentivize them to buy it with coupons, that sort of thing. So sometimes data is stored in a relational database like this, and you need to be able to take it out using Python.

That is where we're going to have sort of a middleman called the Structured Query Language, or SQL. People tend to just write SQL code if they're accessing data, but this is not going to be a boot camp on how to write SQL code. We're going to give you the very basic tools of how you can use some very basic SQL to get data out of databases and into Python, which is where we are going to be working.

The way that you get data out of a database is by writing what's known as a query, using SQL. In the query you specify: I would like to get this data from this table, and then, typically, some sort of conditional statement. So here is the most basic SQL query syntax. In SQL it's standard to write SQL keywords with capital letters, and whatever the table names are with lowercase letters, assuming that's how your tables are named. For instance, if you want to get all of the columns from a particular table, for the rows that satisfy some sort of conditional statement, you would write capital SELECT, and then a star; the star indicates that you'd like to get all of the columns from that table. Going back to our example, the star would indicate that you want, maybe, all of the columns of the purchases table. Then FROM table_name, where table_name is where you specify the table name; if we wanted the purchases table, we would say purchases. And then WHERE: if we'd like to specify that maybe we only want purchases that are more than $10, or purchases of a particular item, we would provide conditional statements there.

So how do we write SQL code in Python? One way is to use the SQLAlchemy package. This is not something that, I don't think, is installed by default with the Anaconda Navigator distribution, if you're using that, so you may have to install the SQLAlchemy Python package in order to run this code. One way that we can check if it's installed is to just try to import it.
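A minimal sketch of that check; the version printed will almost certainly differ from the one in lecture:

```python
# If this import fails, SQLAlchemy isn't installed in your environment;
# install it with conda or pip (e.g. pip install sqlalchemy) and try again.
import sqlalchemy

# Print the installed version so you can compare it against the lecture's.
print(sqlalchemy.__version__)
```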
So if you try to run this code chunk that I just ran and you receive an error, it's because you do not have SQLAlchemy installed. We have instructions; I believe it's under the First Steps button on the website. You can click that button, and in there will be a file on how to install Python packages using either conda, pip, or the Anaconda Navigator. So you can try and install it. I would suggest maybe waiting until after the lecture to try, because I want to make sure you're paying attention, but if you'd like to code along, you can try and install it now.

This will also show us what version of SQLAlchemy we have. I currently have 1.4.29 installed. It's possible that your version will be a little bit later or a little bit earlier than mine. It's probably okay, but if you're ever going through something and your code behaves differently than mine, let's say I write something and it doesn't work for you, it could possibly be just because our versions are different. You can typically do a web search that says, how to do blank, package version blank, and somebody out there has an answer, or they might just suggest that you update the package version, if it's something that's not available in an older version.

Okay, so let's try and learn about submitting SQL queries. But maybe before we do that, now is a good time to pause for questions, and then I can also take a drink of water.

Okay. So there is a particular series of steps that you have to take when getting data out of a database with SQLAlchemy. We're going to go through those steps with a sort of synthetic database that I created called cat_store. We're going to imagine that this is a database for a cat store that has two tables: a customers table and a purchases table.

The first thing you have to do in order to get data out of a database using SQLAlchemy is what's known as creating an engine. You have to create an engine, and then that engine will allow you to connect to the database. To create the engine, typically what you'll do, when working with Python packages, is import the specific classes or functions that you want to use. For us, that's the create_engine function, so we'll first import that from SQLAlchemy, and then we create the engine by running create_engine. You then input a string; that string first takes in the type of SQL dialect that was used to create the database.
For us, that's going to be sqlite. Other ones that you might see are regular SQL, or MySQL, or, I believe, MariaDB; there are other examples. It just depends on your database, and typically those things are specified; you would know what you're using before you try to access it. That's usually specified by the person providing the database. Next, you're going to have a colon, and then you need to provide the path where the database is stored. Because cat_store.db, as of this morning, is stored within the repository, you just have to put in the file name, so for us that's cat_store.db. And, I guess, not the file name but the database name. So now I have a connection to, or, I have an engine that will allow...

Yeah, yeah?

Sorry, I have a question. Can you explain the reason why you have the three slashes?

So I believe that you can put something else here to specify the path. I don't quite remember what goes here. I think it's just to help you specify the path, but because the file is stored in this particular folder, it's just slash and then the name of the database. I've never used this to access a database that isn't right in the folder with me, so I would have to read the SQLAlchemy documentation to figure out why three, precisely.

Okay.

No, sorry, if I can add something. If the database path is somewhere in another folder or another directory, then you need to specify that; maybe that's why you have the three. So in between the slashes you would add the specific directories, step by step, so that it takes you to that database.

Yup, yup!

Okay. So now that we have an engine, we can connect to our database, and this is done by running engine.connect(). Connecting to the database is what will then allow us to submit queries to the database and get back the information. I'm going to import pandas; this is just going to allow me to display the data nicely as a data frame, so it's easier to read. So how do I submit a query? You first write the variable where I've stored the connection, which is conn, and then you're going to use the method .execute, and within execute you'll put a string that has the SQL query. For us, that's gonna be SELECT, and I just want all of the columns, and then from the purchases table, so FROM purchases, and then that's gonna be it; I don't have any additional conditions.

Okay. And so I ran this, and now that I've run this, let's go ahead and I'll add an extra code chunk.
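Putting those steps together, here's a minimal sketch, assuming cat_store.db sits in the same folder as the notebook:

```python
from sqlalchemy import create_engine

# The connection string names the dialect (sqlite) and the database's path.
engine = create_engine("sqlite:///cat_store.db")

# Connecting is what lets us actually submit queries.
conn = engine.connect()

# Submit a query; on SQLAlchemy 1.4 a raw SQL string works here.
# (On SQLAlchemy 2.x you'd wrap it: conn.execute(sqlalchemy.text("...")).)
results = conn.execute("SELECT * FROM purchases")
```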
You'll notice that nothing came out, and you might be wondering, well, how do I get the query results? So, one way to do this: I stored my results in a variable called results, and we can see that this is a SQLAlchemy cursor LegacyCursorResult object. So this is an object; how do we get the results out of there? We call .fetchall(). This will return a list of tuples of all of the results from our query. And I'll point out: once we run this once, if we try and run it again, it will be empty. Why is that? fetchall will return everything and sort of just spit it out as it goes, and once it spits it out, it's out of the results object.

Rohan is asking, what is a LegacyCursorResult? That's just the name of the SQLAlchemy class that contains the results from the query. That's what they've called the class, and it has different methods and attributes that we can use to get all of our stuff.

So I'm gonna rerun this, but edit it a little bit. I'm gonna store it in a data frame, put in my results, so results.fetchall(), and then name the columns. If you want to know the names of the columns from the table, you can do results.keys(); .keys() will return all the column names from the table that you queried. Okay, so this is what it looks like as a data frame, which maybe makes it a little bit easier to read. We've got purchase ID, customer ID, number of items, pre-tax price, and then purchase type. These are the columns of the purchases table.

I'm going to comment this out for now, and then rerun this down here to show off something else. So we did fetchall, but maybe you don't want to return all of the results at once; maybe you want them one at a time. So we can write results.fetchone(), and notice I've rerun the query, so the data has been repopulated. fetchone will give you the tuple corresponding to the first returned row. If we scroll up and down, we can see this is the first row, or the zeroth row in Python, of the table. And if I were to run fetchone again, what's gonna get returned is the second row. Okay, so now that's row 2, or I guess row 1 in Python. Another feature you have is, instead of getting one row at a time, you can do results.fetchmany() and provide a positive integer input, which tells it to return the next N rows. So if I did 4, it would return rows 3 through 6, because I've already returned rows 1 and 2.
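Here's a sketch of those fetch patterns side by side; the column names come from the lecture's purchases table:

```python
import pandas as pd

# fetchall() drains the cursor, so rerun the query before each full fetch.
results = conn.execute("SELECT * FROM purchases")
df = pd.DataFrame(results.fetchall(), columns=results.keys())
print(df.head())

# Or walk through the rows sequentially instead.
results = conn.execute("SELECT * FROM purchases")
row_0 = results.fetchone()        # first row
row_1 = results.fetchone()        # second row
next_rows = results.fetchmany(4)  # the four rows after that
```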
So an important note is that when you use fetchone or fetchmany, the results are returned sequentially. We're going to skip the practices, just for the sake of time in the live lecture. But with that being said, maybe now is a good time to pause and answer some questions.

Let's see, Laura's asking: I have an error saying "not found" with this command. That probably suggests that you did not pull the update. Earlier today I had to add the cat_store.db database to the repository, and once I added it, the code would run fine. But if you did not pull this update into your version of the repository, you did not have the database file. And typically what that means is, when you run this, if you do not have a database of this name, it will just create an empty database. So maybe you ran this, created the empty database, and now you're trying to do this sort of thing, but the empty database does not have a purchases table. What you'll need to do is first go through and delete the cat_store database, and then pull the updates so that you get the right one. So, Laura, like I said, if you were trying to run this step by step, you will hit this, because if the database doesn't exist, running this creates the file.

Okay, so let's see. Yeah?

So I have a question. The cat_store.db: you said in this command sqlite, so that was created using SQLite, and then you're sending queries to fetch from it. If it were actually created with a different SQL language, would you have to change your queries? Because the queries can differ a little bit if you have different database implementations, like SQLite versus MySQL, or whatever. So would your queries need to change based on whether it were created using MySQL, or something like that?

Yeah, so your queries would have to be in the language that the database was created with.

Okay.

But for the most part, most queries are the same. It's really only the slightly more advanced stuff where the different languages, or engines, I forget what they're called, like MySQL and MariaDB, differ. They only differ on the more advanced things; the basic things...

I see.

Yeah, those typically will be the same.

I see. And usually it will be provided by the database, like which of these is used, or...?
Yeah, so typically, if you're working in an industry setting, they would tell you what they used to create the database.

Okay.

And if you're downloading it, they should also tell you somewhere what language was used to create the database.

Got it, thanks.

So Brooks is saying that they're getting an error. I'm not quite sure why you're getting that error; I would suggest trying a web search. If it's an error that's not because the table does not exist, it's possible that you ran the code earlier, before I uploaded it, and so the new table didn't get updated. I'm not entirely sure why you're getting "not an executable object" here.

So Rohan is asking: what if we want to select only some of the columns? Just like Brooklyn said, you'll have to specify the column names, and we can see an example of that. We could do results equals conn.execute, SELECT, and maybe we want purchase_id, comma, pre_tax_price; so if you have more than one column, you separate them with a comma, and then FROM purchases. Let's just copy this, and we can show what that looks like. Okay. Alright!

Okay, so we can also use SQL to calculate some basic statistics. You can do things like getting the number of results that are returned for a particular query; you do this with COUNT, either of star or of a column. Here you can see that this tells you the database has 20 rows for this table. You can get the maximum of a specified column by doing MAX and then the column name; you can do the same thing to get the minimum, just changing MAX to MIN; and you can get things like the average, the arithmetic mean, with AVG. These are useful; you can also do the same things with pandas, but sometimes it's more useful to do it using SQL, because maybe it's faster in SQL than it is in pandas.
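A sketch of the column selection and the aggregate queries just described; the column name pre_tax_price is an assumption based on how it's spoken in lecture:

```python
# Select only some columns, separated by commas.
results = conn.execute("SELECT purchase_id, pre_tax_price FROM purchases")
print(results.fetchall())

# Basic statistics computed inside the database itself.
print(conn.execute("SELECT COUNT(*) FROM purchases").fetchall())
print(conn.execute("SELECT MAX(pre_tax_price) FROM purchases").fetchall())
print(conn.execute("SELECT MIN(pre_tax_price) FROM purchases").fetchall())
print(conn.execute("SELECT AVG(pre_tax_price) FROM purchases").fetchall())
```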
You can also just use pandas and read the results directly into a data frame, without having to do this .fetchall() process. There are a couple of different ways. You can specify that you want a particular query: if we wanted a pandas data frame out of the query SELECT * FROM customers, let's make it customers this time, then after the query you input the connection, and here's the customers table. Also, since we just wanted the entire table here, we can use read_sql_table directly; I think we just have to put in the name, let's see, maybe as a string. So let's try that: "customers", and then the connection. There we go. You just put in the string name of the table, followed by the connection to the database.

And let's go ahead and also show an example where we use a conditional, because I haven't done that yet. Here is a query with a conditional. We can do pd.read_sql_query, and for this we'll do SELECT, still getting all of the columns from customers, and then WHERE, which indicates that you want some sort of condition to be met. Maybe we say where the age is greater than 27, and then we put in the connection. And so now you have all of the rows of customers where the customer has an age that's more than 27.
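A sketch of those pandas shortcuts, using the table and column names from the lecture's cat_store database:

```python
import pandas as pd

# Run a query and get a DataFrame back directly.
customers = pd.read_sql("SELECT * FROM customers", conn)

# If you want a whole table, you can pass just its name as a string.
customers = pd.read_sql_table("customers", conn)

# read_sql_query with a WHERE condition.
older = pd.read_sql_query("SELECT * FROM customers WHERE age > 27", conn)
```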
Okay, so now, maybe before I go through the process of saying we're done with the database, I'll go ahead and pause for more questions, about either submitting queries with execute, or using read_sql_query or read_sql_table.

So, which queries do data scientists mostly use, like on a daily basis?

SELECT would be one of them, of course, but in general it just depends on where you work and what types of things you're doing. There are people who don't use SQL at all, because they're at a different stage of the team; maybe they're not in charge of bringing the data over. So I think it just depends: it depends on the data you're working with, and it depends on the job that you're in.

I mean, SELECT is gonna be used in any query, I'm pretty sure.

Yeah, I think SELECT would be used in any query. We've really just given you the basics here. There are also lots of other things that make SQL a little bit more complicated, like understanding how to join tables, that sort of thing. This is just the very basics, and if you are working on a project for this boot camp that uses SQL, you're probably going to have to do more work learning a little bit more SQL on your own time.

Thanks.

Yup. And then Laura has asked: is this all possible in Python as well, but easier using SQL? So, if you're just working with a data set that's a single file, you can just use pandas. If you're working with a SQL database but you're ultimately just using one or two tables, you could also quickly use pandas. But there are things SQL is faster at: SQL is faster at querying tables, and it's also probably easier to query the tables with SQL than it is with pandas alone. So there are times when you want to use pandas, and there are times when you want to use SQL; it just depends on the project.

Yeah. I have a question. What if I wanted to use SQLAlchemy to, let's say I had data and I actually want to create a SQL database. Can I do that with SQLAlchemy?

Yeah, you can do that. I think it's in the practice problems for the databases notebook, and maybe I'll show that now. In addition to problem sessions and stuff, I also just have a bunch of problems that you can practice with or learn more from. If you go to the data collection folder, in this database file there is code on how to create a table and then how to insert stuff into the table. So you can do that with SQLAlchemy as well; the query you use just changes, so it's like CREATE TABLE, INSERT INTO.

Yeah. Thanks.

Okay, so let's imagine that we're now done with the database, which we are, in this lecture. In order to be done with the database and not have things running around in the background, you first close the connection; this closes your connection to the database. And once you're all the way done, like you don't think you're going to be connecting to the database at all later today, you also dispose of the engine. This is engine.dispose(), and it's just SQLAlchemy syntax that you have to do to disconnect from the database entirely. Once you've done that, you're no longer connected to the database, and you don't have to worry that you're going to change the database, or that you forget to close your Jupyter notebook properly and still have an open connection running in the background. This is just what you have to do when you're done.

So we've now introduced the concept of a relational database, how you can access the data stored within one, and how to submit queries and get the results into pandas data frames. If you'd like to learn more, as I mentioned, there are the practice problems, where you can learn about creating tables and doing some very basic joins. Beyond that, you'll probably have to start learning some SQL on your own, and that sort of thing.
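A sketch of that cleanup, continuing with the conn and engine from earlier:

```python
# Close the connection once you're done submitting queries...
conn.close()

# ...and dispose of the engine when you're done with the database entirely.
engine.dispose()
```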
So we have a question: how can I install SQLAlchemy? Do I need to do that in Jupyter? To install SQLAlchemy, there are installation instructions; where were they? There are installation instructions here. That being said, if you've never installed any Python packages before, go to the Institute's data science website, click on the First Steps button, and within there is a file that explains the different ways you can go about installing Python packages. You're gonna want to learn how to do that, because there are going to be a number of times where there are packages that we're going to use, or that you'll want to use on your project, that you don't already have installed.

Sorry, a question. After we connect to the database and do the query, and finally, before closing or disconnecting from the database, we are not saving the data on our computer, right? We didn't do that, right?

Well, so, if you are making updates, like, let's say you're in charge of maintaining the database: if you were to submit a query that updated the database, I believe it saves immediately. Or maybe, I'd have to double check, because I don't do database management, but at some point it would be saved, either after you make the execution or after you close the database; I'm not entirely sure which. As for the pandas stuff, some of it can be stored in a variable. Actually, in this notebook I don't think any of our data frames are stored in a variable, but we could store one, and then we would have that data frame stored. And if you are working with the data, presumably you wouldn't close the connection until after you've already stored the data in a variable or something like that, or you're just entirely done working with it.

Yeah, okay.

Okay. The next thing we're gonna talk about is web scraping. But first I should go back to the lectures and not the practice problems; that's why things looked weird. So we're gonna talk about web scraping with Beautiful Soup, and I'm gonna open up the lecture copy.

So sometimes you might want to do a problem or work on a project where the data doesn't already exist in a nice clean file or a nice database that you can access. But maybe it does exist on a website, and it would be pretty easy to get that data from the website.
Maybe it's nicely formatted in a table or something, but it's too difficult for you to do something like copy and paste, or maybe the data exists across multiple web pages, so doing this by hand would take too long. So in this notebook we're gonna learn how you can write Python scripts to scrape a website using Beautiful Soup.

Beautiful Soup is a package that allows you to parse HTML code. Again, this is another example, like SQLAlchemy, where you need to have it installed in order for it to work. You can check whether you have Beautiful Soup installed by trying to import bs4. If this runs correctly and you don't have any errors, you have it installed; if it doesn't run, you will need to install it. So again, if you've never installed a Python package before, you'll need to check the First Steps document, which I think somebody has linked to in the chat, go through it, figure out how to install a package, and then install bs4, following either one of the directions here.

The version that I currently have installed is 4.12.2, and as I saw in the chat, part of the reason there was a difference between my code working and other people's code not working was a difference in versions. So again, as I go through this, if I'm doing something that works for me, but you try to copy it and it doesn't work, and you have a different version than I do: well, the most likely culprit is often a typo on somebody's part, but the second most likely culprit is that you have a different version than the code you're trying to follow. So just be mindful of the Python package version you're working with; if it's different from mine, it's possible that the code I'm writing is not going to work in your notebook.

So, in order to be able to parse HTML code and get the data out of it, we have to understand a little bit about HTML code, so we know what we're scraping. This is going to be a sample of some HTML code. Right now it's written as a string; one thing about Python is, if you put three quotation marks in a row, it allows you to write a string across many different lines without having to do any concatenation. If we want to look at the web page that this little HTML code produces, we can click on this link and see it. It's got this title at the top that's in bold, this sentence about sisters, where each of the sisters' names is blue and underlined, and if you clicked on one, it would take you to a link, which is just this link right here, and then there's an ellipsis at the bottom. So this is what the HTML page looks like.
We're going to go through how to parse this below. So the first thing we want to do, if we want to parse this string, is import the BeautifulSoup class from bs4. This is just another example of how, when you're writing Python code, you typically only want to import the functions or classes that you're using, and not the entire package, if you can avoid it. Now, we imported the whole thing above just to check that you have it installed, but in practice you usually want to import just the stuff you use. So we'll do: from bs4 import BeautifulSoup, with a capital B and a capital S.

Okay, so we're then going to make a BeautifulSoup object. How do you do that? We're going to store it in a variable called soup, with a lowercase s. I type the class, BeautifulSoup, and in the parentheses I provide the string that contains the HTML code, which I stored above in html_doc. Then you put in the language the code is written in: for us this is HTML, so you write "html.parser" as a string. This second argument tells Beautiful Soup: alright, this is HTML code that I want you to parse. There are other languages that are used to write websites, like XML and stuff like that; HTML is the most common, but if you have a website written in some other language, you would need to change this input, although HTML is really the most common anymore.

So I'm gonna use a method called soup.prettify. What this does is take the code that was provided to it, and you can see it's got all these \n's in it. It takes the code and formats it like someone who's coding properly would format it, and if we print it, we'll see what this HTML code would look like if you were writing it in a code editor, with the indents.
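Putting that together, here's a sketch; the html_doc string is reconstructed from the classic three-sisters example in the Beautiful Soup documentation, which this lecture appears to be using:

```python
from bs4 import BeautifulSoup

# The sample page, written across multiple lines with triple quotes.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

# Tell Beautiful Soup to parse the string as HTML,
# then print it with proper indentation.
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.prettify())
```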
So this is telling us that this is an HTML document. We've got something called a head, something called a title, a body, something called a p that has a class. Any time in HTML code that you see a word with these angle brackets on either side, that is an HTML element. Elements are defined in certain ways, so they will look a certain way and can hold certain things on websites. The head element of an HTML document is what determines the metadata about the document. For instance, this head having the title here tells the web browser, for any tab or window, to put the title up here; so the title in the head is what's being shown at the top of the tab. The head will often also contain other metadata, like tags, style, that sort of thing. The body is what you actually see: all that code in the body here is what you're seeing here.

So the body has a p, the title; the b tells it to be bold. This first highlighted p is this part, "The Dormouse's story". The p element stands for paragraph, and a b element makes the text bold. You don't need to remember all this stuff. The main thing to take away is that HTML documents are made up of elements, the elements hold different pieces of information, and elements also have metadata, like the class, which we can use to our advantage to scrape data.

So how does Beautiful Soup go through this stuff? If you were to write it out in a nice little diagram, HTML code follows what's known as a tree structure, and that's "tree" from the field of graph theory. What do we mean? There's a node up top, and everything branches down from that, and you can follow along and get to the element you'd like to reach. That's how Beautiful Soup works. We've got html up top; that's the document. Within the document we have head and body elements, so the html node would be considered the parent of the head and the body, and, vice versa, these would be the children of the html. Each of these children has its own children: the head has a title child, and the body has three p children. These p children then have children of their own: a b, a bold, is the child of the first p; the second p has three a children, where a stands for anchor, which is what you use to hold links and stuff; and the last p doesn't have any kids. So we think of the different levels of the document as generations, and this nice structure is what allows us to parse through the data relatively quickly.

So we're gonna now show you how you can use Beautiful Soup to traverse these documents. But maybe before we dive into the actual code part, are there any questions on just the basic structural stuff about HTML code, or on making the BeautifulSoup object?

Sure! I have a question about HTML file structure. It makes sense that a body could have multiple p children, right? Can each HTML file have multiple heads and multiple bodies?

I don't know if you'd get some sort of error or not when you tried to load the website in a browser. I would think that you typically don't have multiple heads and bodies within a single HTML document, but I don't know for sure; I'd have to check.

Gotcha. Thank you.

Yeah. Any other questions?

Okay. So the way that we can parse this is: first, remember, we stored our BeautifulSoup object in a variable called soup.
00:52:23.000 --> 00:52:30.000 So after soup, you put the element that you're interested in. 00:52:30.000 --> 00:52:38.000 So maybe we want to get the title: you can do soup.title. 00:52:38.000 --> 00:52:47.000 And so now you can see we have that title. We could also go the long way of working our way through the code, which is, I think, what I meant to put here first. 00:52:47.000 --> 00:52:56.000 So we would do soup.head, so that would ensure that we are only searching within the head, and then we could do .title 00:52:56.000 --> 00:53:02.000 there. You might also be wondering, how do we get the text? 00:53:02.000 --> 00:53:09.000 Let me also just put this, so we've already done both of these. So you might be wondering, what if I just want the text that's stored within the title? 00:53:09.000 --> 00:53:17.000 So you can get the text stored within any HTML element by doing .text. 00:53:17.000 --> 00:53:18.000 Okay. And so now we have a Python string 00:53:18.000 --> 00:53:22.000 that is the text that was within the element. You might wanna be able to say, what's the parent or the child? 00:53:22.000 --> 00:53:34.000 So if I wanted to know the parent of a particular element, you would just do .parent. 00:53:34.000 --> 00:53:39.000 You could try .child, but here it wouldn't work, because title doesn't have a child. 00:53:39.000 --> 00:53:41.000 So when you do this soup-dot-element's-name, it's always going to give you the first instance of that element. 00:53:41.000 --> 00:53:50.000 So this soup.a will give you the first a. 00:53:50.000 --> 00:54:03.000 And if we went back up to our code, we would go through, go through, go through, and we see the first a that shows up is this one with class sister, id 00:54:03.000 --> 00:54:09.000 link1, and it contains Elsie, and then we can see that that is what was pulled up. 00:54:09.000 --> 00:54:17.000 So you can access this metadata with brackets, sort of like a Python dictionary. 00:54:17.000 --> 00:54:19.000 So if I wanted the class, I could do soup.a, square brackets, 00:54:19.000 --> 00:54:27.000 the string class, and we can see that the class is sister. 00:54:27.000 --> 00:54:31.000 So you might have noticed that there's more than one a. 00:54:31.000 --> 00:54:32.000 So how do I get all of them? So there's this function called 00:54:32.000 --> 00:54:39.000 .find_all, with an underscore between find and all. 00:54:39.000 --> 00:54:49.000 If you input the element you are interested in, it will return all of the a's, or all of the instances of that element that it can find, as a list. 00:54:49.000 --> 00:54:52.000 And then we could just loop through the list like so. I'm uncommenting the for loop 00:54:52.000 --> 00:55:02.000 so I don't have to type it out. So this will loop through that list and print out the class and the text of every a. 00:55:02.000 --> 00:55:05.000 You could also use a list comprehension, and so forth. 00:55:05.000 --> 00:55:09.000 So I see that Jacob has posted a question: in a large HTML file, 00:55:09.000 --> 00:55:13.000 is there an easier way to find the section you need with the data, other than just parsing through the huge file and figuring out what the parents and the children are? 00:55:13.000 --> 00:55:20.000 So, Jacob, we will see some examples of how to do that 00:55:20.000 --> 00:55:25.000 in just a little bit. So if you can hold on to your seats for a few more minutes, we will get to an example
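Here is a sketch of the navigation commands just shown, assuming the soup object built from html_doc above.

```python
print(soup.title)         # first <title> element in the document
print(soup.head.title)    # the long way: search only within <head>
print(soup.title.text)    # the text inside the element, as a Python string
print(soup.title.parent)  # the element containing <title>, i.e. <head>

print(soup.a)             # first <a> element in the document
print(soup.a["class"])    # attributes work like a dictionary (class comes back as a list)

# find_all returns every matching element as a list.
for a in soup.find_all("a"):
    print(a["class"], a.text)
```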
00:55:25.000 --> 00:55:31.000 where we have a real website, and we show how to do that. 00:55:31.000 --> 00:55:32.000 Okay. So I wrote this part as exercises, and you can try and do them with me. 00:55:32.000 --> 00:55:42.000 But to save time, I'm just gonna code it up live and talk it out as we go through. 00:55:42.000 --> 00:55:50.000 I guess if you really don't want to see the answers and you want to practice later, you can walk away from your computer for 2 minutes and come back. 00:55:50.000 --> 00:55:52.000 So if we wanted to find the first p, how would we do that? 00:55:52.000 --> 00:55:56.000 Well, we do soup, and then remember, just .p 00:55:56.000 --> 00:56:05.000 will return the first one. Okay? And then we can see things like the class and the string by doing soup.p. 00:56:05.000 --> 00:56:08.000 So to get the class, we access it like a dictionary. 00:56:08.000 --> 00:56:13.000 So square brackets, class. And then we could do soup.p: 00:56:13.000 --> 00:56:19.000 if we want to get the text within the p, we would do .text. 00:56:19.000 --> 00:56:23.000 For all the a's in the document, we want to find their hyperlink references. 00:56:23.000 --> 00:56:28.000 So remember, soup.find_all of a. You might notice, 00:56:28.000 --> 00:56:29.000 in addition to classes, they have this thing called href. 00:56:29.000 --> 00:56:38.000 So that is, when I click on Elsie, it takes me to this link. 00:56:38.000 --> 00:56:41.000 That is what's contained in the href. 00:56:41.000 --> 00:56:50.000 So we could use: for a in soup.find_all of a, we can print the href of that 00:56:50.000 --> 00:57:01.000 a, so just like a dictionary, but instead of class, this time we have href. 00:57:01.000 --> 00:57:19.000 Okay, so are there any questions before we move on to showing you an example with a real web page? 00:57:19.000 --> 00:57:22.000 Awesome. 00:57:22.000 --> 00:57:28.000 So we're gonna go through an example where we scrape the sports section of FiveThirtyEight. 00:57:28.000 --> 00:57:33.000 And so what we're gonna pretend that we're doing is, let's say we've been hired by somebody. 00:57:33.000 --> 00:57:34.000 And this used to be a very hypothetical thing, 00:57:34.000 --> 00:57:52.000 but now it's maybe a little bit more of a real thing with things like ChatGPT. We're gonna imagine that somebody hired us to scrape websites like FiveThirtyEight to get their reporting, because maybe we're then gonna use that writing to train some sort of AI bot 00:57:52.000 --> 00:57:58.000 to generate new articles in the sports world. 00:57:58.000 --> 00:58:07.000 And so the task the people have given us is, we wanna provide the titles of these articles, the author, 00:58:07.000 --> 00:58:11.000 and then, I don't remember, maybe the href of where the article was. 00:58:11.000 --> 00:58:12.000 I don't remember what I said, but that's what our goal is. 00:58:12.000 --> 00:58:14.000 So as part of that goal, we need to download the HTML code of this website. 00:58:14.000 --> 00:58:30.000 But we don't want to download it by hand. We want to use Python to do it, so we can run it as a script, and while the script's running we can, you know, go do whatever we want to 00:58:30.000 --> 00:58:33.000 do. So the way to do this is with the requests package. 00:58:33.000 --> 00:58:43.000 So I'm gonna copy the URL.
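Before the requests example, a compact sketch of the exercise answers just talked through, using the same soup object as before.

```python
print(soup.p)           # first <p> element
print(soup.p["class"])  # its class, accessed like a dictionary
print(soup.p.text)      # the text stored within it

# Every <a> element's hyperlink reference lives under the "href" key.
for a in soup.find_all("a"):
    print(a["href"])
```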
And so the requests package, which you can import just by doing import requests. 00:58:43.000 --> 00:58:48.000 This comes bundled with distributions like Anaconda, so you shouldn't have to worry about installing it. 00:58:48.000 --> 00:58:53.000 So with requests, and it looks like I've already copied the URL here for myself, 00:58:53.000 --> 00:58:57.000 you can send a request to the website's server to provide you with the HTML 00:58:57.000 --> 00:59:05.000 code for that given page. So we would do r, which is what I'm gonna store it in, equals requests 00:59:05.000 --> 00:59:23.000 .get, and then you input the URL, which for us is this URL right here. And then what gets returned is, I guess I wanted to show, let's do it again without the r, because I wasn't supposed to do that yet. What you'll see now is you 00:59:23.000 --> 00:59:31.000 get back a response from the server, and this response, if we don't store it in anything, is just going to tell us what the status code of the response is. 00:59:31.000 --> 00:59:39.000 So for us the status code was 200, and a 200 response means that everything went A-OK 00:59:39.000 --> 00:59:43.000 and you got the HTML code you're looking for. 00:59:43.000 --> 00:59:46.000 If you see things like 404, or anything in the 500s, 00:59:46.000 --> 00:59:57.000 that means that something went wrong. So, for instance, a 404 response means that you sent a request and the server 00:59:57.000 --> 01:00:04.000 couldn't find the page you specified. 500 responses typically mean that there's something wrong on the website's side. 01:00:04.000 --> 01:00:10.000 So you can find all of the possible response codes for a request at this link, and go through them on your own time. 01:00:10.000 --> 01:00:11.000 For instance, maybe you get a response, and you want to check out what it means. 01:00:11.000 --> 01:00:19.000 You can find it here. So typically 400s and 500s mean that something went wrong 01:00:19.000 --> 01:00:26.000 and you're not getting your data; 200 means that you got your data like you wanted to. 01:00:26.000 --> 01:00:30.000 So now we have the response stored in a variable called r. 01:00:30.000 --> 01:00:46.000 And so we could even check what the status of our response was with r.status_code, and we can see that it was 200, which means we got the data we wanted, and then the HTML code is stored within r.content. 01:00:46.000 --> 01:00:50.000 And so you can see here, this is the HTML code. 01:00:50.000 --> 01:00:54.000 It's much messier than the little simple file we had up above. 01:00:54.000 --> 01:00:58.000 So we can now parse this with BeautifulSoup. 01:00:58.000 --> 01:01:04.000 So it's stored in r.content, and then let's also provide the input 01:01:04.000 --> 01:01:07.000 html.parser. 01:01:07.000 --> 01:01:13.000 And now this is just a sanity check: FiveThirtyEight Sports. 01:01:13.000 --> 01:01:14.000 So I see we have a question: what about CAPTCHAs? 01:01:14.000 --> 01:01:20.000 Would we be blocked as a bot for trying to scrape websites? 01:01:20.000 --> 01:01:21.000 Yep, that can happen. So a lot of websites will now prevent something like this from happening 01:01:21.000 --> 01:01:32.000 if they notice that you're trying to use this sort of request. There are workarounds, but it's dependent upon the website. 01:01:32.000 --> 01:01:37.000 So some of them, if they have a CAPTCHA, you know, those are typically hard to get around, right?
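A rough sketch of the request-and-parse step just walked through. The URL here is assumed to be the sports-section address copied in the lecture, and the page layout may have changed since then.

```python
import requests
from bs4 import BeautifulSoup

url = "https://fivethirtyeight.com/sports/"  # assumed URL from the lecture
r = requests.get(url)

print(r.status_code)  # 200 means everything went fine

# The raw HTML lives in r.content; hand it to BeautifulSoup to parse.
soup = BeautifulSoup(r.content, "html.parser")
print(soup.title)     # sanity check that we got the page we expected
```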
01:01:37.000 --> 01:01:46.000 That's the whole point of them. But there are things like the Selenium package that allow you to interact with JavaScript-type stuff on a web page to try and get data. 01:01:46.000 --> 01:01:47.000 But for a lot of websites you can usually just do something like this, and it will be okay. 01:01:47.000 --> 01:01:57.000 Where the problem comes in is if you're trying to send a lot of requests to the same website. 01:01:57.000 --> 01:02:02.000 Like, let's say I tried to send thousands of requests to FiveThirtyEight 01:02:02.000 --> 01:02:06.000 in a short amount of time. I would probably get blocked by FiveThirtyEight. 01:02:06.000 --> 01:02:13.000 Later in the notebook we talk about what you should do to prevent that sort of thing when you're scraping. 01:02:13.000 --> 01:02:19.000 But that can happen. 01:02:19.000 --> 01:02:22.000 Okay. So I believe Jacob asked earlier, how can I go through without having to parse the code by hand? 01:02:22.000 --> 01:02:39.000 This is an example where, if you were to try and just read the code as it is, it would take you a very long time. So what we're gonna use is something called the web developer tools. 01:02:39.000 --> 01:02:42.000 So I'm using Firefox. If you're using a different web browser, you will still have web developer tools, 01:02:42.000 --> 01:02:50.000 but you'll have to double-check how to get to them. I believe 01:02:50.000 --> 01:02:53.000 here I show how to do it for Firefox, 01:02:53.000 --> 01:03:01.000 Google Chrome, and Safari, which I believe are the 3 most popular. If you're using a different web browser, you'll have to figure it out on your own with a web search. 01:03:01.000 --> 01:03:08.000 So for Firefox, you go to Browser Tools and then click on Web Developer Tools, 01:03:08.000 --> 01:03:19.000 although I need to go back to the FiveThirtyEight page and then do it there. 01:03:19.000 --> 01:03:24.000 Okay? And so what's nice about this is, this is the console, 01:03:24.000 --> 01:03:39.000 but in the inspector you can see the HTML code, and you'll notice how things are changing as I hover over them. What we're gonna use is this little tool here; it's in Safari, Chrome, and Firefox, so I'm 01:03:39.000 --> 01:03:43.000 assuming it's in most browsers' web developer tools. 01:03:43.000 --> 01:03:59.000 If you click on this little icon that has a box with the cursor arrow in it, it will allow you to go to the different elements of the website, and then if you go back and look at the code at the bottom, it's highlighting the HTML code that 01:03:59.000 --> 01:04:03.000 is coding up that part of the website. So for us, what we're gonna focus on in this example is getting the titles of the articles. 01:04:03.000 --> 01:04:15.000 And then I believe, either in a problem session or as a practice problem, you can go through and try to get the rest of the information I mentioned that we might want. So for us, 01:04:15.000 --> 01:04:25.000 the way we can use this is by clicking on the elements, 01:04:25.000 --> 01:04:29.000 then coming back down to the code and looking here. And so we can see that this title is stored within an h2, 01:04:29.000 --> 01:04:38.000 and that this h2 has the class article-title entry-title. 01:04:38.000 --> 01:04:46.000 And so we can go back, and we will then demonstrate this. So we can do soup 01:04:46.000 --> 01:04:53.000 .find_all, and now we want the h2s.
01:04:53.000 --> 01:04:56.000 But this is gonna give us more h2s than we want. 01:04:56.000 --> 01:05:04.000 So, we haven't seen this before, but you can further specify that you want the h2s 01:05:04.000 --> 01:05:07.000 of a particular class. You provide a dictionary, where in that dictionary you specify: 01:05:07.000 --> 01:05:10.000 I want all the h2s of a particular class. And so what's that class going to be? 01:05:10.000 --> 01:05:21.000 It's gonna be this one. So copy, paste, string, paste. 01:05:21.000 --> 01:05:22.000 And so now you'll see that we get a bunch of 01:05:22.000 --> 01:05:26.000 h2s, and what we could do is loop through that. 01:05:26.000 --> 01:05:35.000 So for a in here, print a.text. 01:05:35.000 --> 01:05:38.000 Okay. And so another thing we can do, this is just a string function: 01:05:38.000 --> 01:05:43.000 we can get rid of all that annoying whitespace with .strip. 01:05:43.000 --> 01:05:50.000 And then this just gives us all the titles. So we've got the titles, and we could go back and double-check. 01:05:50.000 --> 01:06:10.000 So we got the Andrew McCutchen one, Fernando Tatís, with Anthony Richardson being after that, the bucks-busting one, and then we've got the Mikal Bridges one, MLB 01:06:10.000 --> 01:06:13.000 prospects. Mikal Bridges, MLB 01:06:13.000 --> 01:06:18.000 prospects. And then let's go ahead and check the last 3: Rays, WNBA, and big bad Bruins. 01:06:18.000 --> 01:06:33.000 So we got all the titles we wanted very quickly, just like this. And if we wanted to do it quickly without a for loop, we could use a list 01:06:33.000 --> 01:06:48.000 comprehension. So a.text.strip() for a in this soup.find_all. 01:06:48.000 --> 01:06:53.000 So now we have it in a list, which we could put in a DataFrame or something. 01:06:53.000 --> 01:07:00.000 So to get the authors, you would do the same exact thing. 01:07:00.000 --> 01:07:09.000 Let's go ahead and use our tool. So we'll click on this. 01:07:09.000 --> 01:07:13.000 Okay, we can see that the author is stored in a p 01:07:13.000 --> 01:07:20.000 that has the class single-metadata-card, space, vcard. So let's try that. 01:07:20.000 --> 01:07:39.000 So we would do soup.find_all, p, class, and then just get rid of this part that got copied over, and then once again, we can do for p in: 01:07:39.000 --> 01:07:52.000 we want p.text, okay? And so then we could do extra cleaning on this if we would like, and get rid of the 'By's if we wanted to. 01:07:52.000 --> 01:08:00.000 Okay, so are there any questions before we move on to the last section of this notebook? 01:08:00.000 --> 01:08:07.000 I had a question. This might just be preference, but I noticed that you have some single quotes at times, and then double quotes other times. 01:08:07.000 --> 01:08:15.000 Is that just? Yeah, I guess, any insight on that? 01:08:15.000 --> 01:08:21.000 Yeah, so this is just preference. I just kinda like the way it reads: this is the HTML element, 01:08:21.000 --> 01:08:28.000 this is the HTML element's class, 01:08:28.000 --> 01:08:36.000 this is the text. So I guess it's just a preference that I've internalized; you could just do all single quotes or all double quotes. 01:08:36.000 --> 01:08:39.000 It wouldn't matter. Like here, I guess I did double quotes. 01:08:39.000 --> 01:08:43.000 I think I just use whatever my fingers do at the time. 01:08:43.000 --> 01:08:44.000 Thank you.
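For reference, a sketch of the title and author scraping just demonstrated. The class strings are the ones read off the inspector during the lecture; they are specific to the site's layout at the time and may have changed.

```python
# Article titles: all h2 elements whose class attribute exactly matches
# the string copied from the inspector.
titles = [
    h2.text.strip()
    for h2 in soup.find_all("h2", {"class": "article-title entry-title"})
]
print(titles)

# Authors: the p elements holding the byline, with the leading "By " removed.
authors = [
    p.text.replace("By ", "").strip()
    for p in soup.find_all("p", {"class": "single-metadata-card vcard"})
]
print(authors)
```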
01:08:44.000 --> 01:08:46.000 Yeah. 01:08:46.000 --> 01:08:55.000 Sorry, just as a quick one, can you show us how to get rid of these 'By's in the list? 01:08:55.000 --> 01:08:56.000 Thank you. 01:08:56.000 --> 01:09:04.000 Yes, so these are all Python strings, and Python strings have this built-in method called replace. 01:09:04.000 --> 01:09:10.000 And so I would just replace 'By' space with nothing, 01:09:10.000 --> 01:09:23.000 the empty string, and then, to be extra safe, I would probably throw in a .strip at the end to get rid of any whitespace on either side. 01:09:23.000 --> 01:09:31.000 Any other questions? 01:09:31.000 --> 01:09:32.000 Yup, yup, so you could have also done that. This would work as well because of that dot 01:09:32.000 --> 01:09:35.000 Thank you. So it is 'By' space. Yes, 01:09:35.000 --> 01:09:40.000 strip at the end; so strip will get rid of the whitespace on the outside of the string. 01:09:40.000 --> 01:09:48.000 Yes, right? Thank you. 01:09:48.000 --> 01:09:49.000 Yeah. 01:09:49.000 --> 01:09:50.000 Okay. Can you go back to the HTML on the other page? 01:09:50.000 --> 01:09:53.000 So down there, it says, by, a class equals author 01:09:53.000 --> 01:10:01.000 url fn. So could you also, right, have it search for the class 01:10:01.000 --> 01:10:08.000 that's just author url fn, and it would just get rid of the 'By' as well, or something? 01:10:08.000 --> 01:10:15.000 Yeah, so we could try that. So let's do for a in soup.find_all, and then I'll use this as an example 01:10:15.000 --> 01:10:25.000 to show it doesn't just have to be class. So you notice that this also has a thing called rel that is equal to author. 01:10:25.000 --> 01:10:31.000 So we could do rel, colon, author. 01:10:31.000 --> 01:10:35.000 And then I should probably finish the rest of my for loop: 01:10:35.000 --> 01:10:47.000 print a.text. So here you get that. Now, here's the reason why, and I cheated because I recorded this video like 2 weeks ago for the pre-recorded version. 01:10:47.000 --> 01:10:52.000 If you'll notice, at the bottom here we have Neil Paine and Terrence Doyle. 01:10:52.000 --> 01:10:54.000 And so if we go to the page, these 2 both wrote the article together. 01:10:54.000 --> 01:11:04.000 So when you do it with just the a, the authors get separated out. 01:11:04.000 --> 01:11:07.000 So that's why I went with using the p, 01:11:07.000 --> 01:11:20.000 because the authors are contained together there. So that's sort of the reason, and one of the things we'll talk about in just a second is, you know, web code is kind of messy. 01:11:20.000 --> 01:11:21.000 FiveThirtyEight is actually a really nice, clean website with its code, 01:11:21.000 --> 01:11:37.000 but sometimes it's kind of messy. So you just have to play around and double-check, is this doing what I think it's doing, before implementing it to run overnight to scrape data or something. 01:11:37.000 --> 01:11:47.000 So a related question is, if you open up the developer view on Chrome and right-click, there are some options to copy identifiers for an element. 01:11:47.000 --> 01:12:05.000 So, for example, you can copy the full XPath. Is there a nice method for soup or another package to convert this full path, which is a unique identifier for the element, into the text, such that you can just get the relevant HTML? 01:12:05.000 --> 01:12:08.000 So! 01:12:08.000 --> 01:12:13.000 Is this like? So I guess there's 2 questions.
01:12:13.000 --> 01:12:22.000 So one would be, are you saying that you want to put this as part of the URL for the request, so it would take you to that specific element? 01:12:22.000 --> 01:12:23.000 Okay. Okay. 01:12:23.000 --> 01:12:26.000 No, I'm saying that after you have the code with soup, 01:12:26.000 --> 01:12:27.000 Yeah. 01:12:27.000 --> 01:12:29.000 and it's this big mess, you can use developer tools to highlight, on the page, 01:12:29.000 --> 01:12:36.000 something you think is interesting. You've identified it visually, and then Chrome will also provide you a full path to that element in the code. 01:12:36.000 --> 01:12:45.000 And so then, ideally, you could just pass that path to BeautifulSoup, and then it would pull up the content 01:12:45.000 --> 01:12:47.000 at that location. 01:12:47.000 --> 01:12:54.000 So I think you should be able to, but you would have to, like, 01:12:54.000 --> 01:13:00.000 I don't know that you can provide it with the slashes. So you could do something like soup 01:13:00.000 --> 01:13:13.000 .html.body.div. I don't know what the, you know, the 2 is, but then, like, you could try 01:13:13.000 --> 01:13:18.000 and then, you know, use this and write a function to clean it up. 01:13:18.000 --> 01:13:22.000 Well! 01:13:22.000 --> 01:13:23.000 Okay, I mean. 01:13:23.000 --> 01:13:29.000 But that would do it. Yeah, yeah, I think it might be possible, but I would have to look into it. 01:13:29.000 --> 01:13:30.000 Yup. Yeah. 01:13:30.000 --> 01:13:36.000 Okay, thank you. I think that makes sense. 01:13:36.000 --> 01:13:42.000 Any other questions? 01:13:42.000 --> 01:13:47.000 Okay, so we'll wrap up this notebook by going over some common problems that you'll run into, 01:13:47.000 --> 01:13:50.000 and we've touched on a couple of them along the way. 01:13:50.000 --> 01:13:55.000 So the first is just that websites can be written by anybody. 01:13:55.000 --> 01:14:05.000 You just need to have something to write them on, and then own a domain and also have a server that you can access. 01:14:05.000 --> 01:14:09.000 So that means there's not somebody guarding the Internet, making sure everybody's code looks really nice. 01:14:09.000 --> 01:14:15.000 So sometimes there's really messy code. There's also code 01:14:15.000 --> 01:14:21.000 that's just not well labeled. So this was really easy, because FiveThirtyEight's HTML code is really well labeled and searchable. 01:14:21.000 --> 01:14:27.000 There will often be websites that maybe you want the data from, 01:14:27.000 --> 01:14:30.000 but they don't have classes or IDs or anything. 01:14:30.000 --> 01:14:35.000 So you might just have to try and play around with, like, for loops to make it work. 01:14:35.000 --> 01:14:40.000 So that's a problem, and a lot of times web scraping just requires a lot of trial and error. 01:14:40.000 --> 01:14:42.000 And honestly, sometimes it's just not doable with how people maintain their websites. 01:14:42.000 --> 01:14:49.000 So that's more of an issue on your end of things. 01:14:49.000 --> 01:15:00.000 The other thing that we sort of touched on earlier is, you can be banned for sending too many requests. So if you send a lot of requests in a very short amount of time, 01:15:00.000 --> 01:15:04.000 a lot of websites' servers are set up to then prevent you from sending a request, or receiving a response to your request,
01:15:04.000 --> 01:15:14.000 for some amount of time. So the best way to prevent this from happening, and also just to be considerate: 01:15:14.000 --> 01:15:23.000 every time you send a request, it's kinda like somebody clicking on the website, so if you send a whole bunch of requests in a short amount of time, it can overload 01:15:23.000 --> 01:15:26.000 the website's servers, and you're not the only one who's trying to access the website. 01:15:26.000 --> 01:15:36.000 So it's just kind of good ethical practice to not, you know, overload a website's servers with requests just to get their data. 01:15:36.000 --> 01:15:45.000 So one way to prevent this from happening is to use the time module, which has this function called sleep, where you can input a certain amount of sleeping time. 01:15:45.000 --> 01:15:51.000 And then, let's say you wanted to go through all of these articles and pull the text out of them. 01:15:51.000 --> 01:15:58.000 You could put in a sleep timer of, like, a few seconds in between pulls. That will, you know, be a little bit nicer on the servers. 01:15:58.000 --> 01:16:03.000 It also does, you know, decrease your chances of being flagged. 01:16:03.000 --> 01:16:07.000 It's also just about, you know, being a good Internet citizen. 01:16:07.000 --> 01:16:14.000 Another thing that can happen, that was also talked about earlier, is, even if, let's say, you do this, and you're like, all right, I'm gonna wait a long time between requests, 01:16:14.000 --> 01:16:20.000 just sending a request with the requests module can automatically flag you as a bot. 01:16:20.000 --> 01:16:40.000 So if you're flagged as a bot this way, there are some workarounds that exist, but there's nothing that's set in stone and always going to work, so you'll just have 01:16:40.000 --> 01:16:43.000 to do a web search for, like, I got this particular block for my request, 01:16:43.000 --> 01:16:44.000 how can I get around it? Sometimes it's just not possible, 01:16:44.000 --> 01:16:49.000 but there may be other ways to get the data, like using an API, which we'll talk about in the next notebook. 01:16:49.000 --> 01:16:54.000 The last thing is, there's sometimes user-interactive content. 01:16:54.000 --> 01:17:00.000 So sometimes data might not be available until a user interacts with something, 01:17:00.000 --> 01:17:13.000 and then the data is sent from the server based on that interaction. So there are ways to work around that; one such way is the Selenium package, which you can walk through on your own. This sort of mimics having a browser open, 01:17:13.000 --> 01:17:17.000 sending inputs from your code, and then receiving the data from the website. 01:17:17.000 --> 01:17:23.000 So I encourage you to check it out if you're interested in getting data that requires user interactions. 01:17:23.000 --> 01:17:31.000 So that's it for this notebook. Because we only have 9 minutes left, I'm not gonna take questions just yet, 01:17:31.000 --> 01:17:34.000 so I have time to go through the last notebook. It's a short notebook, so don't worry about the limited time.
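Before moving on, a sketch of the time.sleep idea just described. The article_urls list is hypothetical; in practice it would be filled with the hrefs scraped above.

```python
import time

import requests

article_urls = []  # hypothetical: fill with hrefs gathered from the scrape above

for url in article_urls:
    r = requests.get(url)
    # ... parse r.content with BeautifulSoup here ...
    time.sleep(5)  # pause a few seconds between requests to go easy on the server
```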
01:17:34.000 --> 01:17:39.000 So another way to get data from websites or applications is using what's known as an API. 01:17:39.000 --> 01:17:52.000 So API stands for application programming interface, and it's sort of a go-between for 2 applications. 01:17:52.000 --> 01:17:59.000 And so for our purposes, we can think of ourselves as a single application that wants to get data from a website or another app, 01:17:59.000 --> 01:18:05.000 and then whatever we wanna get data from is the other application. 01:18:05.000 --> 01:18:10.000 So one way I like to think of this is, an API is kind of like a waiter at a restaurant. 01:18:10.000 --> 01:18:15.000 So you, the customer, come in. You look at the menu, which for us is like looking at the website, 01:18:15.000 --> 01:18:16.000 seeing what data we want. We tell the API, using Python, hey, 01:18:16.000 --> 01:18:22.000 I want this data. Then the API takes the request and interprets it in a way that 01:18:22.000 --> 01:18:31.000 the servers of the app can understand, and then gives it to the web app. 01:18:31.000 --> 01:18:34.000 Then, after the web app gets the request, it figures out if it can even do what we're asking it to do. 01:18:34.000 --> 01:18:40.000 So sometimes you need to have authorization to get certain data, 01:18:40.000 --> 01:18:43.000 that sort of thing. It prepares its reply. 01:18:43.000 --> 01:18:45.000 So maybe the reply is the data we requested, 01:18:45.000 --> 01:18:52.000 or maybe it's a response saying, I'm sorry, I can't provide that for you. It gives that response to the API, 01:18:52.000 --> 01:19:01.000 which then takes the response, interprets it in a way that we can understand in terms of Python, and then provides it to us. 01:19:01.000 --> 01:19:07.000 So in a lot of cases we might be able to write some sort of BeautifulSoup code to get the data, 01:19:07.000 --> 01:19:16.000 but sometimes that's not feasible, and there are APIs that we can use pretty quickly with what are known as Python wrappers. 01:19:16.000 --> 01:19:18.000 So if an API exists, and it's easily accessible through Python, it's probably better to use the Python package 01:19:18.000 --> 01:19:28.000 that's been written to access that API than to just try writing the BeautifulSoup code. 01:19:28.000 --> 01:19:34.000 So, for one example, scraping Reddit or Twitter using BeautifulSoup is very difficult, 01:19:34.000 --> 01:19:43.000 so it's probably better to use the Python wrapper for the API, even though Twitter is sort of a special case. 01:19:43.000 --> 01:19:44.000 Now, I probably would avoid scraping that data just based on what's going on right now, 01:19:44.000 --> 01:19:51.000 and I don't know that it's free anymore. 01:19:51.000 --> 01:19:57.000 So APIs are a thing that don't need Python, 01:19:57.000 --> 01:19:58.000 but there are Python wrappers for APIs. 01:19:58.000 --> 01:20:07.000 These are Python packages that have been written specifically to take Python commands as input to the API. 01:20:07.000 --> 01:20:16.000 So there are people out there that want to access APIs using Python, and they write these sorts of packages and then provide them open source. 01:20:16.000 --> 01:20:25.000 Some very popular examples are: for Spotify there's one called Spotipy; for Reddit 01:20:25.000 --> 01:20:30.000 there's the PRAW package, which is the Python Reddit API Wrapper.
01:20:30.000 --> 01:20:31.000 The New York Times has one called pynytimes, which allows you to get data from the New York Times, 01:20:31.000 --> 01:20:45.000 and so forth. So we'll end this notebook by just giving an example of using one of these Python wrappers. 01:20:45.000 --> 01:20:50.000 So we're gonna show you how to use the pynytimes wrapper. 01:20:50.000 --> 01:20:56.000 And in order to use this, you need to have it installed, and it's not standard to have it installed. 01:20:56.000 --> 01:20:58.000 So, you know, from what we've talked about earlier today, you can install it following those instructions. 01:20:58.000 --> 01:21:16.000 If you run this and it runs just fine, you have it installed. For most, I don't know about all, but most APIs, you need to have a developer key. So you can get the New York Times 01:21:16.000 --> 01:21:29.000 API developer key by following these 2 links and going through the instructions. Once you have the API key, there's a Python file, which I may have, I'll double-check that 01:21:29.000 --> 01:21:32.000 I've uploaded it; if I haven't, I'll upload it after the lecture. 01:21:32.000 --> 01:21:38.000 So there's a Python file called my_api_info.py. 01:21:38.000 --> 01:21:42.000 You can edit it and change this string, your key, 01:21:42.000 --> 01:21:48.000 here, to provide your API key, which you would get by following the instructions on those websites. 01:21:48.000 --> 01:21:51.000 Once you have that API key, this is going to allow you to access data using the New York Times API. 01:21:51.000 --> 01:22:13.000 So, importantly, API keys and authentication numbers are sort of your identity to the API, so it's important that you keep your key secret and don't go posting it on public repositories, because once you do that, somebody else can get it and do whatever they want with your 01:22:13.000 --> 01:22:17.000 key. So think of this sort of like a Social Security card, 01:22:17.000 --> 01:22:24.000 but for accessing an API, something like that. So typically what you can do, and what I'll do, is I'll make a Python file 01:22:24.000 --> 01:22:25.000 that's only on my own computer, or on a server that only I have access to, 01:22:25.000 --> 01:22:38.000 and then I'll import the function from that Python file, get_nytimes_key, and then run it without ever looking at 01:22:38.000 --> 01:22:48.000 the result. Another thing you can do is, you can, I believe, change the settings of a server or your computer to store your API key there, and it would just have it loaded. 01:22:48.000 --> 01:22:51.000 I don't know how to do that, but I know it's something people do. 01:22:51.000 --> 01:22:55.000 So I have an API key, so I'll be showing you how to use it 01:22:55.000 --> 01:23:03.000 once you have it. And then, if you're interested in working with this, you can check out these instructions here to figure out how to do 01:23:03.000 --> 01:23:08.000 so. Okay, so how do we do this? So the first thing we need to do is import 01:23:08.000 --> 01:23:14.000 the class that allows us to connect to the API. So from pynytimes 01:23:14.000 --> 01:23:19.000 we'll import NYTAPI. 01:23:19.000 --> 01:23:22.000 So, if you're trying to run this later, you would edit it, 01:23:22.000 --> 01:23:25.000 because you do not have the file matt_
01:23:25.000 --> 01:23:34.000 api_info, but I do. You would have to edit this to my_api_info, and then it'll import the function. 01:23:34.000 --> 01:23:38.000 Apparently I lied. I guess I don't have this file. 01:23:38.000 --> 01:23:41.000 So that will make the rest of this difficult to finish, 01:23:41.000 --> 01:23:44.000 but that's okay, because we only have a few minutes anyway. 01:23:44.000 --> 01:23:57.000 So probably what's best is, if you're really interested in seeing me access the New York Times API, you can go through and watch the video that I've made for this lecture. 01:23:57.000 --> 01:24:02.000 I guess I forgot to check that I had this file in there, which is not good. 01:24:02.000 --> 01:24:06.000 But basically, you'll just go through. You would then provide, hypothetically, the key. 01:24:06.000 --> 01:24:07.000 Hi, Matt! 01:24:07.000 --> 01:24:08.000 Yeah. 01:24:08.000 --> 01:24:13.000 I think it's, you're using matt_api_info instead of my. 01:24:13.000 --> 01:24:18.000 So this is supposed to be a file that has my API key. 01:24:18.000 --> 01:24:22.000 Yeah. 01:24:22.000 --> 01:24:32.000 So this is a different file from the 'my' one; this one has my key, which you guys don't have access to. So that's why it says matt instead of my. 01:24:32.000 --> 01:24:33.000 I see. 01:24:33.000 --> 01:24:38.000 Yeah, yeah. But once you have that, this is how you do it. 01:24:38.000 --> 01:24:48.000 So the function get_nytimes_key would provide the string of your key, and then the next argument you would provide is parse_dates equals 01:24:48.000 --> 01:25:07.000 True. This just allows you to use datetimes. Then you can again check out the completed version; this will show you how to get results with specific keyword queries, like, for instance, basketball, and then the dates, like, I want articles from, for example, March 01:25:07.000 --> 01:25:11.000 1st, 2023 to April 19th, 2023. 01:25:11.000 --> 01:25:18.000 Okay. So again, because I don't have the file in this current version of the repository, 01:25:18.000 --> 01:25:22.000 I would have to add it. Again, check it out in the pre-recorded lecture 01:25:22.000 --> 01:25:28.000 if you want to see the code in action. This was just an example. 01:25:28.000 --> 01:25:29.000 It's not super important. The important thing is that you get the gist: 01:25:29.000 --> 01:25:34.000 there are API wrappers for Python, 01:25:34.000 --> 01:25:41.000 they have documentation links that you can read through to figure out how to use them, 01:25:41.000 --> 01:25:42.000 here's the example for pynytimes, and these are sometimes easier to use than writing BeautifulSoup code. 01:25:42.000 --> 01:25:58.000 Okay. And with that, I will close it off for today. I will stick around for a little bit to answer any questions, but otherwise I will see you tomorrow, when we start using data science tools like models, 01:25:58.000 --> 01:26:07.000 that sort of thing.
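For reference, here is a sketch of what the completed version of that notebook runs, assuming you have put your key in my_api_info.py inside a get_nytimes_key() function as described above (both names follow the lecture's setup).

```python
from datetime import datetime

from pynytimes import NYTAPI
from my_api_info import get_nytimes_key  # assumed helper file holding your key

# parse_dates=True lets the wrapper hand back Python datetime objects.
nyt = NYTAPI(get_nytimes_key(), parse_dates=True)

# Search for articles matching a keyword within a date range.
articles = nyt.article_search(
    query="basketball",
    dates={
        "begin": datetime(2023, 3, 1),
        "end": datetime(2023, 4, 19),
    },
)
print(len(articles))
```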