Data Collection Summary Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi everybody. 
Welcome back. In this video we're going to wrap up; this is our last video on data collection. 
We're going to have a summary and conclusion of everything we've covered in data collection and give you some future thoughts before moving into data modeling. 
So let me go ahead and share the Jupyter notebook. 
So in this section, the data collection section, we've reviewed a number of data collection techniques: finding data in existing repositories online or on competition websites, downloading data from a database that you have access to, scraping web data using Beautiful Soup, and finally using Python wrappers for APIs to get data from websites and other applications. 
So I want to point out there are a number of other techniques that we did not cover because they're a little bit beyond the scope of this short series. You could design and administer your own surveys to a group of people. So think about maybe you work for a podcasting company and you want to get an idea of what ads might be of interest to your users. You could design a survey for that and administer it to people who listen to your podcasts. You could design your own apps or websites to collect what is known as trace data, so data on the people who are using your website. People do this for their PhD theses: they will design an app or website and then track the data. I've seen some examples of that. You could run experiments. So for those of you in the physical sciences or social sciences, maybe you've run an experiment before in your PhD or master's research, collected the data from that, and then analyzed it for part of your thesis or for a paper. And there are a lot of other techniques that frankly we just didn't cover because we didn't have the means to cover them and they're not really within the scope of what we're interested in. Some of you, if you go to work in industry, will also do things like this. 
So I just wanted to point out that these are things that we didn't cover, and that there are more ways to collect data than just collecting it from online sources. But what we did cover does equip you with a wide array of tools with which to gather data on your own for your next data-based project. 
So before we move on to actual algorithms and data cleaning techniques, I did want to end this section by discussing some things you may want to consider when working with data on a project. Mainly, I want you to think about what questions you can ask yourself to make sure your project is as good as it can be before actually embarking on the cleaning and modeling portion. 
So we can think of a data science or data-based project as having two kinds of starting points. One, you have an idea for a project. So maybe you're walking around your house or apartment getting ready to cook a meal, or maybe cleaning something, and you have this idea pop into your head and you think, this would be a really fun project idea, and maybe if it goes well I can make a business out of it. Or maybe on a smaller scale, I can make a fun blog post out of it or something like that. So that's one approach. The other approach is maybe you've been saddled with, or you found or stumbled upon, a really interesting data set that you want to work with. Sometimes these things work in concert: maybe you first have an idea for a project, which leads you to collect an interesting data set. Maybe you have an interesting data set, which leads you to an idea for a project. Or maybe it's some kind of weird combination of the two, where you have this fun data set and you had an idea for a project and they just happen to coexist with one another in a nice way. 
So before you proceed full force into modeling and diving into the data, there are some questions you might want to think about ahead of time that could be useful when it comes time to actually start the modeling. First, what is your research question or desired end goal? It can be useful, prior to starting any modeling, to sit down and try your best to write it out clearly. 
And I do want to point out it's okay if what you end up with at the end doesn't actually reflect what you started with in the beginning. But sometimes it can be useful to start with a clear goal in mind and then see if you can proceed towards that goal the best that you can. 
Once you have that clear goal, you might want to ask yourself: does the data set I have give me everything I need to pursue that goal? Do I have the correct data set? Do I have any data? And if I don't have the data, or the correct data, for the problem I'm interested in, is that data easily obtainable? So let's say you have a fun idea. If your idea involves collecting a data set which isn't feasible or ethical to collect, you might want to come up with a different idea for a project, or see if you can come up with a proxy. So again, if you think the data is available, how easy is it to collect? And if it's not available, or if it's incredibly difficult to collect, could you find an easier-to-collect proxy data set that you could use and that might be just as effective as your ideal data set? 
Picking up on this idea of a proxy: can the data set you have directly answer the research question? If it cannot be used to directly answer it, what prevents it? What about this data set prevents you from being able to use it directly? And then, building off of that, what are the limitations of the data set you have? So one example might be that your data set's not representative of the population you're interested in, so maybe you have a skewed sample. For instance, maybe you're interested in the general population, but you have a data set that's highly skewed in terms of the gender of the individuals represented, or the race of the individuals, or maybe the age, and therefore it will be difficult to see how well your results generalize beyond your sample. 
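As a rough illustration of that skewed sample check, here is a minimal Python sketch that compares the share of each group in a sample against reference population shares. Everything in it is an assumption made for illustration: the function names `proportion_gaps` and `flag_skewed_groups`, the age brackets, the made-up sample, the made-up reference shares, and the 0.10 tolerance. None of it comes from the lecture notebook, and in practice you would pull reference shares from a trusted source such as census tables.

```python
from collections import Counter


def proportion_gaps(sample_values, population_shares):
    """Return (sample share - population share) for each reference group."""
    if not sample_values:
        raise ValueError("sample_values must not be empty")
    counts = Counter(sample_values)
    total = len(sample_values)
    gaps = {}
    for group, pop_share in population_shares.items():
        sample_share = counts.get(group, 0) / total
        gaps[group] = sample_share - pop_share
    return gaps


def flag_skewed_groups(gaps, tolerance=0.10):
    """List groups whose sample share is off by more than the tolerance."""
    return sorted(group for group, gap in gaps.items() if abs(gap) > tolerance)


# Made-up sample of 10 respondents, heavily weighted toward the youngest bracket.
sample_ages = ["18-34"] * 7 + ["35-64"] * 2 + ["65+"]

# Made-up reference shares for the population of interest (hypothetical numbers).
reference_shares = {"18-34": 0.30, "35-64": 0.55, "65+": 0.15}

gaps = proportion_gaps(sample_ages, reference_shares)
skewed = flag_skewed_groups(gaps)

for group in reference_shares:
    print(f"{group}: sample share minus population share = {gaps[group]:+.2f}")
print("Groups outside tolerance:", skewed)
```

A check like this only tells you that the sample is lopsided on the variables you thought to compare. It does not tell you how much that skew affects your results, so treat it as a prompt to think about limitations rather than as a fix.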
Maybe your data set doesn't contain any information about the thing you're actually interested in, but it does contain something that is likely to be highly correlated with your thing, so a proxy variable. So one example I've seen before in my research: in a lot of studies, it's difficult to get actual data on people who get a flu vaccine. But they can actually ask people, what's your intention to get a vaccination? This is not a perfect one-to-one, right? Because there will be times where people will say, well, I do plan on getting a flu shot, but then they just never end up getting around to it. Or there's social desirability bias, where people will say that they're going to get a flu shot, but only because they think that's what you want to hear, that's the correct, moral, right thing to do, and they have no intention of actually getting it on their own. So this isn't a perfect one-to-one relationship with those who actually do get a flu shot, but it might be a useful.