Web Scraping with BeautifulSoup Video Lecture Transcript This transcript was automatically generated, so there may be discrepancies between the video and the text. 11:13:15 Hi, everybody! Welcome back. In this video we continue our data collection content by talking about how to do web scraping with Beautiful Soup. 11:13:24 Let me go ahead and share my Jupyter notebook, and we can get started. 11:13:27 So, in this video, we're going to give you a brief outline of how to scrape data with Beautiful Soup. 11:13:33 So this is going to be HTML data that we're going to be scraping. 11:13:37 We'll introduce the package. We'll show you how to parse some simple HTML code 11:13:43 while also going over the structure of HTML code, and then at the end we'll actually scrape data from a web page and go over some issues that you may encounter while scraping data with Beautiful Soup. 11:13:55 So the first thing we're gonna do is import Beautiful Soup to make sure that you have it installed. 11:14:00 To do this, you're going to try and run import 11:14:03 bs4. If you're able to import it and run this, you'll see whether or not you have it installed, and if you do have it installed, you'll see what version you have. 11:14:16 So this notebook that you're seeing right now is being run with version 4.12.2. 11:14:22 If your version is a little bit later or a little bit earlier than this version, everything should run fine. 11:14:28 But be mindful that if your version is different from mine, you may encounter issues trying to run the exact same code. 11:14:34 It just depends on what version you're using. If you do not have Beautiful Soup installed, meaning that you are unable to run this import block, 11:14:44 go ahead and check out the install instructions that I've provided here. 11:14:48 You may also need to consult the Python
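The version check described above can be written as a small snippet. (The version printed here is just whatever you have installed; the video happens to use 4.12.2.)

```python
# Try to import bs4; if the import fails, Beautiful Soup is not installed.
try:
    import bs4
    print("bs4 version:", bs4.__version__)  # e.g. 4.12.2 in the video
except ImportError:
    print("bs4 is not installed; try `pip install beautifulsoup4`")
```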
11:14:51 package installation instructions on the Institute website. 11:14:55 So let's start by talking about HTML. If we're going to scrape websites, we have to understand a little bit about what 11:15:02 a website is made up of, and most websites consist of HTML. 11:15:07 There are also things like JavaScript, but for our purposes, with Beautiful Soup, we're only going to be able to scrape the HTML. 11:15:14 So here is an HTML code chunk, and we can actually see what this looks like by going to sample_html.html, 11:15:23 which is a file within the repository. 11:15:29 So we have here the title, The Dormouse's Story, 11:15:33 this block of text here that tells the story, and we have some links here. 11:15:37 For instance, if we click on them, they will take us somewhere, 11:15:40 example.com, and then an ellipsis at the end. 11:15:44 So that is the HTML page that we're going to go over 11:15:46 how to scrape. So how do we scrape an HTML page with Beautiful Soup? 11:15:51 The first thing we need to do is import the BeautifulSoup object. 11:15:54 So we would do from bs4 import BeautifulSoup. 11:16:02 Now we want to make a BeautifulSoup object using this HTML code. 11:16:07 So the first thing we do is we will store it in a variable. 11:16:11 It's standard to store this in a variable called soup. 11:16:13 Then you type BeautifulSoup, and you input the text of the HTML code, which, as you can see, is stored within html_doc. 11:16:25 The next thing you do is you pass an argument that tells Beautiful Soup what language it's trying to parse. For us, 11:16:33 that's HTML, so you put in the string 11:16:35 "html.parser", and then you hit Enter. 11:16:39 So now you have a BeautifulSoup object which has taken in this HTML code as a string and is now ready to be parsed.
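Putting those steps together, here is a minimal sketch using a shortened version of the Dormouse sample document described in the video (the same example used in the Beautiful Soup documentation):

```python
from bs4 import BeautifulSoup

# A shortened version of the sample HTML document from the video.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>
"""

# "html.parser" tells BeautifulSoup which parser to apply to the string.
soup = BeautifulSoup(html_doc, "html.parser")
print(type(soup))
```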
11:16:48 So, for instance, we can print this using a method called prettify, and what prettify does is print the code the way it would look in a code editor if the person writing the code were following the conventions of HTML 11:17:04 writing. So we'll do print(soup.prettify()). 11:17:10 There we go. So this is what it would look like 11:17:13 if this code were written in a text editor or code editor, an IDE sort of environment, if you're following the standards of HTML indentation and such. 11:17:25 So this is one way to look at it. Another way to look at it 11:17:29 is that all HTML code has sort of a tree structure, where the meaning of tree here comes from the field of graph theory. 11:17:37 So this is a graph, which is just a network if the word graph is unfamiliar to you, and it's called a tree because this particular graph is what's known as a tree. 11:17:48 So at the top you have the HTML document, and then inside of the HTML 11:17:53 document there is, more often than not, both a head, which gives the meta information, 11:17:58 so, for instance, here the head would contain the information that's telling the page to display The Dormouse's Story in the tab up above, and also a body, which contains the content. 11:18:11 So within the body, what we're seeing here on the actual web page is the body of the HTML, and then within that body there are three things called p's, which is where you typically put your text, and I believe p is short for paragraph. You have a b within 11:18:26 the first p, which makes one of the text items bold, and that's this, The Dormouse's Story, which is bold. 11:18:33 What else do you have? You'll have three a's within the second p.
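The prettify call mentioned above looks like this as a sketch (using a deliberately squashed one-line document so the re-indentation is visible):

```python
from bs4 import BeautifulSoup

# A one-line document with no indentation at all.
html_doc = "<html><head><title>Hi</title></head><body><p><b>Hello</b></p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# prettify re-renders the parsed tree with one tag per line, indented the
# way a conventional HTML editor would lay it out.
pretty = soup.prettify()
print(pretty)
```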
11:18:39 So there are three a's here. The a stands for anchor, and they allow you to attach links to text, which is why, when we went here, we could click on Elsie, Lacie, or Tillie. And then, finally, there's the last p without anything in it. 11:18:51 So within the jargon of HTML 11:18:55 code, each layer of the network is called a generation. So the body generation is like the child of the HTML, because it descends down from the HTML, and then that body itself has three children, which are the three p's. 11:19:13 And then, if we wanted to keep going, it has a b grandchild and then three 11:19:17 a grandchildren, so you can think of things in generations like that, 11:19:21 if you'd like to. So why don't we go ahead and show you how we can use Beautiful Soup to parse this code and get out the information stored within. We can traverse the tree, and by traverse I mean go through it, just by doing soup and then inputting the 11:19:40 first element of HTML code that you want to get. 11:19:45 So for us, let's say we wanted to get the title. 11:19:48 So the title is stored within the head, and then within the head is the title. 11:19:53 So we would do soup.head, and we could hit 11:19:57 Enter and show you what that gives us. Alright. And so then you could do soup.head.title. 11:20:04 Yeah, maybe I'll put that here, actually: soup.head.title. 11:20:09 And now you'll see the only difference between this output and the soup.head output is that now we don't have the head on the outside, because we're just getting the title. And if we wanted to get the text, well, we'll talk about that in a second. An alternative 11:20:27 to soup.head.title is soup.title. 11:20:31 You'll get the same thing. So why does this work? 11:20:35 Whenever you do soup followed by some HTML element, it will always return the first instance of that 11:20:40 HTML element within the tree.
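The parent/child structure just described can be checked directly; here is a sketch against a stripped-down version of the sample document:

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'><b>The Dormouse's story</b></p>"
    "<p class='story'><a>Elsie</a><a>Lacie</a><a>Tillie</a></p>"
    "<p class='story'>...</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# The body's element children are the three p tags (its "next generation").
child_names = [c.name for c in soup.body.children if c.name is not None]
print(child_names)  # ['p', 'p', 'p']

# soup.head.title and soup.title reach the same (first) title node.
print(soup.head.title is soup.title)  # True
```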
So the first time that title appears is this title, the first and only time, and so that's why it goes here. For instance, we could also do soup.p, 11:20:53 and then the thing that comes out is this, you know, Dormouse's Story at the top of the HTML document, because that is the first time a p happens. 11:21:03 If you wanted to get the second one, you might try soup.p.next, and that doesn't work. 11:21:12 So if you want to get the second one, you'll see what we have to do to get that later. 11:21:16 Okay, so let's go back. soup.title. 11:21:21 If you wanted just the text without the title tags, how do you do that? 11:21:25 You do soup.title, and then you want just the text, 11:21:29 so you'll do .text. So now you get the text that is contained within the HTML element of the title. 11:21:35 You can also figure out things like what is the parent of the title, so, using that jargon, soup.title.parent. 11:21:45 Okay, you can find the first a, like I showed before. 11:21:49 You see here that this a, which is an HTML element known as an anchor, has things within it like class, href, and id. 11:22:00 So sometimes HTML elements will have attributes, and those attributes can be accessed sort of like a Python dictionary. 11:22:09 So you could do soup.a, and then how do we access things in a dictionary? 11:22:13 We provide square brackets and then the key. So if we wanted the class, we would do soup.a["class"]. 11:22:21 And now we get back the class, which is sister. And in addition to doing soup.a, you could find all of the a's that exist within the document at once with find_all, so soup.find_all("a") will then provide you a list that has all of the a's within the 11:22:42 document. So we could then loop through that list with a for loop that I'm now uncommenting, and for every a within this soup.find_all list, I just want to print out that
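Here is a sketch of the attribute access and find_all loop just described. One detail worth noting: bs4 treats class as a multi-valued attribute, so soup.a["class"] comes back as a list.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# soup.a is the first <a>; attributes are accessed like dictionary keys.
print(soup.a["class"])  # ['sister'] -- class is multi-valued, hence a list
print(soup.a["id"])     # link1

# find_all returns every matching tag in document order.
for a in soup.find_all("a"):
    print(a["href"], a.text)
```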
11:22:54 a. So here we go. We've got the class of the a and then the text. Okay, so why don't you go ahead and take a second to try to figure out the exercises here. 11:23:05 You can pause the video and do it on your own, or you can try and code along with me, or you could just watch me code if that is preferred. 11:23:15 So whatever you'd like to do, we're gonna work on the exercise right now. 11:23:19 Okay, so how do I find the first p? I would do soup.p. 11:23:25 And then how would I find that class? I could do soup.p 11:23:29 at "class". And then how do I find the string that's stored within that p? 11:23:35 I could do soup.p.text. Okay? The next one's asking me for all of the a's 11:23:42 within the document; we want to find their href. So we would do for a in soup.find_all("a"), 11:23:52 and we could do print(a["href"]). Okay? 11:23:59 So that is the example, the really silly example, of an HTML document. 11:24:05 Very easy to work with. In the real world, you'll be working with more complicated HTML documents from actual websites. 11:24:13 So in this next section of the notebook, we're going to talk about how you can scrape data from real web pages. 11:24:19 So the first thing you need to do in order to scrape data from a real web page is you actually need to get the HTML content from that web page. 11:24:28 So how can you do that? You can do that by using Python to send a request to the website's server and then get the code back. 11:24:36 So just like when you go to a website with your browser. Let's say I'm going to click on this, because that's what we're going to do; we're gonna 11:24:46 scrape the data contained here. So when I click on this, the request is sent to the server where this website is hosted, the sports page at fivethirtyeight.com, 11:24:57 and then that server returns the data to us.
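The exercise answers spoken above can be collected into one sketch, again run against a cut-down copy of the sample document:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Their names were
<a href="http://example.com/elsie" class="sister">Elsie</a> and
<a href="http://example.com/lacie" class="sister">Lacie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.p)           # the first <p> in the document
print(soup.p["class"])  # its class attribute
print(soup.p.text)      # the string stored within it

# Collect every href in the document with find_all.
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)
```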
So we can do that with Python as well. 11:25:03 In order to do that, we need the requests module. Note that requests is not part of the Python standard library, but it ships with distributions like Anaconda, and otherwise you can pip install it. So we're gonna do import requests. 11:25:13 And once we have requests, we can send a request to our particular address by doing 11:25:21 requests.get and then putting in the URL, and so you'll see 11:25:28 I got back this Response 200. What does that mean? This is telling you the response of the server to your request. 11:25:35 So sometimes servers will not allow you to get data in this way, but we saw a response of 200, and 200 means that the request was processed as desired, meaning that we sent a request saying we wanted this page's information, and it sent it back. 11:25:51 We didn't store it anywhere, so we don't currently have it, 11:25:54 but we'll remedy that in the next code chunk. 11:25:56 What will happen, though, is you'll occasionally receive something like a 400- or 500-level error, which means that you sent your request and something went wrong. 11:26:06 So, for instance, an error of 404, which I'm sure most of us have seen on the Internet before, indicates that the web page that you requested is not there or not functioning at the time. Other errors in the 400s, like 403, tend to indicate that you don't have access to the web page 11:26:21 for various reasons, while errors in the 500s indicate a problem on the server's side. So let's go ahead and do this again. 11:26:24 So we're going to request this data, and now we're actually going to store it in a variable called r. 11:26:30 We can check out the status of our request by doing r.status_code. 11:26:37 So again we got the 200, so everything's working the way we want. 11:26:40 Now we can see the HTML content of this page is stored in r.content, and this is, as you can see, the HTML. 11:26:50 And if we were to go through, we could parse it out. 11:26:53 But I'm not gonna do that. It's not complicated, but it's very involved, and it would be difficult to read it like this.
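Since hitting a live site from a notebook isn't always reproducible, here is a self-contained sketch of the same request/response cycle. It serves a tiny page from a local http.server thread just so there is something to request; with a real site you would pass its URL to requests.get instead.

```python
import http.server
import threading

import requests
from bs4 import BeautifulSoup

PAGE = b"<html><head><title>demo</title></head><body><p>hi</p></body></html>"

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Reply 200 (success) with a small HTML body.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # silence per-request console logging

# Port 0 asks the OS for any free port; serve in a background thread.
server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

r = requests.get(f"http://127.0.0.1:{server.server_port}/")
print(r.status_code)   # 200 means the request was processed as desired
print(r.content[:40])  # the raw HTML bytes live in r.content

# The content can then be handed straight to BeautifulSoup.
soup = BeautifulSoup(r.content, "html.parser")
print(soup.title.text)
server.shutdown()
```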
11:27:01 So one thing we can do now that we have this is we can parse the data, 11:27:04 parse it with "html.parser" into a BeautifulSoup object, and now we have the HTML that we can traverse. So we're going to go ahead and pretend that we're working on an assignment where we want to get the title, the author, and the 11:27:22 associated URL for all of the articles listed on this web page. 11:27:27 Okay, so "Busting for Major Miss," "How Mikal Bridges Went From Supporting Player to Star," and so forth. 11:27:34 So, in order to do that, we need to know where those things are stored on the web page, where in the HTML code these things are stored. The way we can do that is to use browser tools or web developer tools. I'm using Mozilla Firefox, so you'll 11:27:50 see what that looks like, but below I've provided instructions on how to get to web developer tools for Firefox, Google Chrome, and Safari. 11:28:02 Okay, so these are images from an older version of the web page, where you can see what the web developer tools look like in each browser, 11:28:09 each of the three most popular browsers. If you're using a browser that I haven't covered, you'll have to maybe pause the video and do a quick web search to see how to access the web developer tools for your browser. 11:28:21 But they should all more or less look the same. So let's go ahead and see this. 11:28:26 So for Mozilla Firefox, I go to Tools, Browser Tools, Web Developer Tools. 11:28:33 And now, as you can see, as I hover over the HTML code at the bottom, different things are highlighted. 11:28:43 Okay, so what I'm gonna do now, though, is we're gonna work on the part of getting the title. 11:28:50 We want the title of all the articles here. So most, if not all, web developer tools have this option to click on something that looks like a box with a cursor.
11:29:04 So when you click on that, you can go through the web page and actually highlight individual elements, and then you'll see back in the bottom, in the code, that it shows you where within the HTML code this entire element occurs. 11:29:16 Okay. So what we're gonna do is we're going to click on this, because it's a title, 11:29:23 and then we're gonna go look at the code. We can see that the title is contained within an a, in particular 11:29:31 this a, and then that a is itself contained within an h2. 11:29:38 Okay. The most important part of this is that the h2 has a class of article-title as well as the class of entry-title. 11:29:48 So we're going to go ahead and use that information with a feature of find and find_all that we have not yet seen. So we're gonna go down; 11:29:59 we're gonna show you how to get these article titles. 11:30:02 So we're first gonna demonstrate with just soup.find. 11:30:06 So we want to find an h2, and find works the same way as find_all; 11:30:11 it just returns the first instance, as opposed to returning all instances. 11:30:16 So soup.find, and after that you can put a comma and then a dictionary, and this dictionary allows you to search using attributes as sort of search filters. 11:30:28 So we're gonna put in that we want the class of article-title. 11:30:33 So we want to return the first h2 11:30:38 that has the article-title class, and we can go through here, and we can see this is exactly what we wanted. 11:30:44 So this gave us what we wanted. We can go back and see that's what we wanted. 11:30:49 Okay. And then, once we have that, we could do .text, 11:30:54 and then that would give us the title. So that's doing it one time. If we wanted to get all of them, we would do 11:31:03 soup.find_all("h2", 11:31:09 {"class": "article-title"}). 11:31:15 So that's all of them, and then we could use 11:31:19 a list comprehension:
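The attribute-filter dictionary can be sketched against hypothetical markup that mirrors the h2/article-title structure seen in the developer tools (the headlines and URLs here are made up for illustration):

```python
from bs4 import BeautifulSoup

html_doc = """
<h2 class="article-title entry-title"><a href="/story-1">First headline</a></h2>
<h2 class="article-title entry-title"><a href="/story-2">Second headline</a></h2>
<h2 class="site-header">Not an article</h2>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find returns the first tag matching both the name and the attribute filter;
# the filter matches even though the h2 carries a second class as well.
first = soup.find("h2", {"class": "article-title"})
print(first.text.strip())  # First headline

# find_all returns all of them; a list comprehension pulls out the text.
titles = [h.text.strip() for h in soup.find_all("h2", {"class": "article-title"})]
print(titles)  # ['First headline', 'Second headline']
```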
11:31:25 h.text for h in this. So now we have a list of all the titles, and you'll see there are these whitespace things like \n and \t, 11:31:36 so you could get rid of those if you wanted by doing .strip(), which will remove the whitespace. 11:31:43 So now you have the titles. You can go through and double-check that we got all the titles of the articles, and we did. 11:31:52 Okay. So now, as an exercise, try and do the same process of getting all of the authors of these articles. 11:32:02 You can pause the video and try, or you can try and code along with me. 11:32:05 Okay. So once again, we have to use our web developer tools, so I'm just gonna go over and hover on a particular author, and we'll click on that element. 11:32:16 So within here we have this a, and this a has the rel 11:32:24 attribute of author. So what I'm gonna try and do is soup.find_all("a"), and I want it to have a rel of author. 11:32:39 And then we can once again 11:32:43 do the list comprehension trick, a.text for a in, and we can go through and check that we did 11:32:52 indeed retrieve all the authors, if we would like to. 11:32:56 So let's see: we have Ben, Jared, Alex, Ben, Jared, Alex, Neil, Howard, Neil Paine, and Terrence Doyle. 11:33:08 So let's see. So maybe this is an issue. We might need to adjust slightly, because these two should be appearing together, not separately. 11:33:17 So let's go back to our code. 11:33:22 So let's look at the example with Neil Paine and Terrence Doyle. 11:33:29 So here you can see we've got two a's within this, 11:33:34 so I think maybe what we want, instead of searching for a's, is this p with the class single-metadata 11:33:39 card. So let's try that instead. 11:33:48 This is a class, 11:33:51 and I'm pasting, and then, just to be consistent, I'll try and switch it up with this p.
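The co-author issue can be reproduced with hypothetical byline markup; the class names and author URLs below are invented for illustration, not taken from the real page:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="single-metadata card">By <a rel="author" href="/a1">Neil Paine</a>
and <a rel="author" href="/a2">Terrence Doyle</a></p>
<p class="single-metadata card">By <a rel="author" href="/a3">Solo Author</a></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Filtering on the anchors splits co-authors into separate list entries...
print([a.text for a in soup.find_all("a", {"rel": "author"})])

# ...while grabbing the enclosing byline <p> keeps co-authors together.
# (" ".join(text.split()) collapses the stray newlines in the byline text.)
bylines = [" ".join(p.text.split())
           for p in soup.find_all("p", {"class": "single-metadata"})]
print(bylines)  # ['By Neil Paine and Terrence Doyle', 'By Solo Author']
```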
11:33:58 There we go. So that looks a little bit better, and we could do some cleaning if we wanted to get rid of the "By"s as well. 11:34:06 Okay. So let me go ahead and alter this part of the code to make it consistent. 11:34:19 There we go. So what are some common problems that you might encounter while web scraping? 11:34:25 You might have messy or inconsistent code. 11:34:29 The HTML code for FiveThirtyEight was very nice and very easy to parse; a lot of websites exist out there in the world that aren't as nice. Anybody can make a website; 11:34:38 anybody with money for a server can make a website. 11:34:42 And if that's the case, that means that there's, you know, nobody 11:34:45 safeguarding or ensuring that your website is formatted nicely 11:34:49 so people can read your code. So, with that being said, you may sometimes have to come up with creative solutions to get the data you want from a messy website. 11:34:59 Another issue you might run into is that you can usually only send a certain number of requests within a certain period of time before you get flagged as a bot, a scraper. So one handy trick for this is to use the time module, which will allow you to sleep for a certain amount 11:35:17 of time. So, like, let's say we wanted to scroll through all of these articles, imagining that we're scrolling through and trying to scrape all of the articles from FiveThirtyEight. 11:35:30 If we wanted to write a script that would do that, between each call to the web page I would put in a sleep for maybe 2 to 5 seconds. 11:35:37 You want to make it long enough so that the server isn't being overloaded by the requests 11:35:43 and you're not flagging it for weird behavior. 11:35:46 So there's the practical reason of making sure you actually get the data you want; another reason is you don't want to overuse their server.
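That pacing idea can be sketched as a small helper; the URLs and the fetch function below are placeholders, not real endpoints:

```python
import time

def polite_fetch(urls, fetch, delay=2.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls
    so the server isn't hammered with back-to-back requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause BETWEEN requests, not before the first
        results.append(fetch(url))
    return results

# Placeholder usage: a fake fetch function stands in for requests.get here.
pages = polite_fetch(["page-1", "page-2"], fetch=str.upper, delay=0.01)
print(pages)  # ['PAGE-1', 'PAGE-2']
```

In a real scraping script you would pass something like `lambda url: requests.get(url).content` as the fetch function and keep the delay in the 2-to-5-second range the lecture suggests.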
11:35:56 You're not the only one in the world that wants to see that website, usually, 11:36:00 and you want to make sure that you're being a good citizen of the Internet for everybody else. 11:36:05 Just use the server as much as a normal person would, and don't abuse it. 11:36:09 Another thing is, even if you are being really careful with your code and you're sleeping between each request, 11:36:16 sometimes websites have been developed that can easily detect if you're using a requests.get request and restrict your ability to take their HTML code. 11:36:27 There are some ways that you can counter this, but because it's sort of an arms race kind of thing, where every time someone comes up with a Python solution, the web developers will come up with a different solution 11:36:39 that blocks it, you have to sort of just do a web search to see what's currently being done to bypass those restrictions 11:36:45 if you are trying to get that data. Another issue you might run into is that some things might not be able to be pulled until a user interacts with the page. 11:36:55 So maybe a user pushes a button which loads new data, 11:36:58 and what you want is the data loaded by the button push. 11:37:02 If that's the case, you can't get that with just Beautiful Soup, because the HTML code gets returned as it is for someone going to the website for the first time. 11:37:11 If you're looking to retrieve data that's loaded by a JavaScript command in the code, you might need to look into a package called Selenium and learn how to use that. Selenium will allow you to sort of write scripts that mimic working 11:37:26 in a web page, like clicking buttons, and wait for requests that way. 11:37:30 So that's worth looking into if that's an issue you run into. 11:37:35 So, in summary, in this notebook we talked about how you can parse HTML code with the Beautiful Soup package. We saw a really easy phony example as well as a real-world website example.
11:37:46 If you want to learn more about Beautiful Soup, I encourage you to check out their documentation page 11:37:50 here. I hope you enjoyed learning about web scraping. 11:37:55 I enjoyed teaching you about it, and I hope to see you in the next video.