Data Collection: Data Source Websites
Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.
16:29:29 Welcome back. In this video, we're going to start our data collection section with data source websites. 16:29:35 Let me go ahead and move over to the Jupyter notebook. 16:29:38 And we'll get started. So in this notebook and in this video, we are going to be talking about some sources of data online. 16:29:47 So particularly websites where you can find data. We'll talk about things like data repositories as well as data competition sites. 16:29:55 And we'll see a couple of examples of both. So what is a data repository? We're going to call something a data repository 16:30:05 if it is a website where you just store data. That's the only purpose of the website. 16:30:09 So you can have these for many reasons. So maybe you have done some academic research which has a data set that you'd like to make open for the purposes of replication studies, to verify that your findings are true. 16:30:22 Those are academic repositories. Or maybe you are part of a news organization that does a lot of data-based journalism, 16:30:32 and you're providing the data sets that were used in your articles, again 16:30:35 so people can verify that what you're saying in the article is true, as well as just letting them be open so other people can find things that are interesting about the data. You also have repositories that hold benchmark data sets. 16:30:49 So these are data sets that are used to test various different algorithms, 16:30:56 as a way of comparing one algorithm to another algorithm using this sort of benchmark set. 16:31:02 So here are some examples of repositories of different types. In particular, for academic repositories, 16:31:08 these are ones that are associated with academic research.
16:31:13 Or academic benchmarks. So one that I like to use as an example a lot is the UC Irvine Machine Learning Repository. 16:31:21 So this repository just has a lot of data sets that get used in papers, as well as benchmark data sets. 16:31:28 So this iris data set is one that we'll see a lot if you go through the classification content. 16:31:33 So this is a data set that gets used as a benchmark and a teaching tool for a lot of different classification algorithms. 16:31:38 So as you can see here, 16:31:42 there's some information about the data as well as the actual data itself, which you can get by going to the data folder. 16:31:51 There's also the GitHub repository approach. 16:31:55 So these are used a lot by news organizations or websites. 16:31:58 So, for example, here is the GitHub repository for the FiveThirtyEight website. 16:32:04 And you could go to the data part of the repository to see all the different data sets. 16:32:10 And we'll leave this here because this is going to be the source of our first example. 16:32:16 So there's this fun post they had for Halloween years ago, where they ranked the different candies based on a poll, I believe probably taken within the FiveThirtyEight office. You can read about it and watch the video here. 16:32:33 We're going to get the data set that they used for that poll. 16:32:36 So if you go to this link right here, it'll take you to the data set that they used. So each row is a candy bar 16:32:46 or a type of candy.
And then it has all these different aspects of the candy, and then the number of times that candy won in whatever polling method that FiveThirtyEight laid out in the article. So one way you could get this data set is to download it. 16:33:01 So you could go to Raw here and then save the file with your web browser as whatever file type you'd like, 16:33:10 probably a CSV. 16:33:15 And here's the example where, if you did that, you could read it in as candy-data.csv. 16:33:22 And now you'll have it. Let's say you didn't want to go through the hassle of downloading the file on its own. 16:33:27 You can take this web address, copy it, and then paste it into 16:33:33 pd.read_csv, and then it works just the same, as long as you have access to their repository, which is public, 16:33:42 so I do, and an Internet connection, which I also do. So now again, this works; you can just put in the address. 16:33:51 Okay. So now we've gotten this data set from a GitHub repository. 16:33:54 We've used, well, not an academic repository, 16:33:58 but we've used a data repository site. So here are some guidelines for when you're going to be using data repositories. In general, if you're using data that you did not collect or create yourself, it's really important that you cite it and follow whatever rules or 16:34:16 guidelines the individual or group that collected that data would like. 16:34:22 So a lot of times these places will just say, you can use this data, 16:34:26 but please cite the paper where we published it originally. Or, you can't use this data, 16:34:31 and here are the reasons why. Or, you can use this data under very specific circumstances. 16:34:37 So you just need to go through, read whatever rules and guidelines the data collector has set out, and follow those.
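The two ways of loading the candy data described above (reading a saved copy, or handing the web address straight to pd.read_csv) can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the raw-file address in CANDY_URL is an assumption about where the file lives, the helper name load_csv is hypothetical, and the sample rows are made up so the snippet runs without an Internet connection.

```python
import io

import pandas as pd

# Assumed address of the raw CSV in the FiveThirtyEight GitHub repository.
# It is not given in the transcript, so check it against the repository itself.
CANDY_URL = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/"
    "master/candy-power-ranking/candy-data.csv"
)


def load_csv(source):
    """Hypothetical helper: read a CSV from a local path, a URL, or a file-like object.

    pd.read_csv accepts all three, which is why pasting the web address
    works just the same as pointing at a downloaded file.
    """
    return pd.read_csv(source)


# Made-up stand-in rows (illustrative column names and values, not the real data).
sample = io.StringIO(
    "candy,is_chocolate,wins\n"
    "Candy A,1,12\n"
    "Candy B,0,7\n"
)
candy = load_csv(sample)
print(candy.head())

# With a downloaded copy in the working directory:
#     candy = load_csv("candy-data.csv")
# Straight from the web (needs network access and a public repository):
#     candy = load_csv(CANDY_URL)
```

Reading from the address saves a manual download, while keeping a local copy means the notebook still runs if the file is later moved or the connection drops.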
16:34:46 The next type of website you might look into for data sets are data competition sites. 16:34:51 So these are websites that host competitions for data science and data analysis. 16:34:58 So a lot of times these websites will host competitions publicly, store data, and specify rules, either ones that they've created or ones that some outside entity that's providing the competition has created. 16:35:14 They accept competition entries, and then they'll come up with some guidelines for determining the winner. 16:35:20 So these are places where you could go and test your ability if you'd like to, and try to win prizes. 16:35:27 So sometimes there are monetary prizes involved. Other times 16:35:29 it's just to be at the top of some leaderboard. 16:35:34 So probably the most popular is kaggle.com. In order to use this, you'll need a kaggle.com account. 16:35:39 And this is what it looks like when you log in. So you know, Matthew Osborne, that's me. 16:35:45 When you log in, it looks something like this, and if you want to just go straight to the data, there are data sets. 16:35:51 Here are examples of different competitions that are currently going on 16:35:57 as I'm recording this video. And then they have other sorts of things that I'll leave it to you to check out. 16:36:04 Some other popular examples are provided here, but I'll leave it to you to look into those on your own time. 16:36:10 So, as an example of getting data from the kaggle.com competition site, we're going to get the iris data set that I looked at earlier. 16:36:18 So if we go back to data sets, maybe it'll pop up at the top. 16:36:27 Not immediately. So we look up this data set, and we can go to it. Also, just like the UC 16:36:35 Irvine repository, Kaggle has the Iris Species data set, and you can get this, again, after you log in to your Kaggle profile.
16:36:43 You can get this by going to the link and clicking Download. 16:36:46 Then once it downloads, you can just move it over to the repository, and once you've moved it over, you can just read it in like so. So now we've read in this data set, and you can check 16:36:58 that it looks like the one that's on the website. 16:37:02 So once again, data competition sites are just the same as repositories. 16:37:08 Make sure you're following the guidelines of the site 16:37:11 if you're going to use the data for some sort of purpose. 16:37:14 Some of them have very specific rules about how you can use it. 16:37:17 For instance, you can only use it for the competition. You can't use it for anything else, like making some sort of commercialized product or anything like that. 16:37:27 So just follow the rules that are outlined on the website. 16:37:30 If you don't, there are sometimes legal consequences for that 16:37:33 if they were to find out about it. So just be mindful of the rules, and as a best guideline, follow the rules of the website when it comes to this sort of thing. 16:37:41 Okay, so that's it. We've covered different types of data source websites, 16:37:46 so websites that are sources of data, including repositories and data competition sites. 16:37:51 We've gone through a couple of examples of getting data from such websites and loading it up into a Jupyter notebook so we can use it. 16:37:59 And now I think we're ready to move on to the next type of data 16:38:01 collection, which we'll touch on in the following lecture video. I hope you enjoyed this video. 16:38:09 I enjoyed having you watch it, and I hope to see you next time.
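For the Kaggle example earlier in this video (download the file, move it into the repository, then read it in and check it), a rough sketch might look like the following. The file name Iris.csv and the helper read_downloaded_csv are assumptions for illustration, not taken from the notebook, and the demo writes a tiny made-up file to a temporary folder so the snippet runs without a Kaggle account.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_downloaded_csv(path):
    """Hypothetical helper: read a CSV that was downloaded by hand.

    Raises a clear error if the file has not been moved into place yet.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; download it and move it into the repository first"
        )
    return pd.read_csv(path)


# Demo with a made-up two-row stand-in file (illustrative values only).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "Iris.csv"  # assumed file name
    demo_path.write_text(
        "sepal_length,species\n"
        "5.1,setosa\n"
        "7.0,versicolor\n"
    )
    iris = read_downloaded_csv(demo_path)

# Quick checks that the loaded table looks like the one on the website.
print(iris.head())
print(iris.shape)
```

Comparing the first few rows and the shape against the preview on the data set's page is a cheap way to confirm the download was not truncated or mis-parsed.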