Data Science Boot Camp

Fall 2024

Sep 5, 2024 - Dec 13, 2024

This program is included with Fall 2024 Career Launch Cohort Enrollment and Erdős Institute Alumni Club Membership at no additional cost.

To access the program content, you must first create an account and member profile and be logged in.


Registration Deadlines

Sep 6, 2024

All Erdős Fall 2024 Career Launch Cohort or Alumni Club members who are not participating in the UX Research or Deep Learning Boot Camps

Category

Launch, Core Program, Boot Camp, Projects, Certificates

Overview

The Erdős Institute's signature Data Science Boot Camp has been running since May 2018 thanks to the generous support of our sponsors, members, and partners. Due to its popularity, we now offer the boot camp online three times per year in two formats: a one-month intensive boot camp each May and a semester-long version each Spring and Fall.

Slack

#slack-channel

Organizers, Instructors, and Advisors

Steven Gubkin, PhD

Lead Instructor

Office Hours:

MTWRF 12pm - 1pm ET, and by appt.

Email:

Preferred Contact:

Slack

Please feel free to message me on Slack with any questions!

Alec Clott, PhD

Head of Data Science Projects

Office Hours:

By appt. only

Email:

Preferred Contact:

Slack

Participants are welcome to reach out to me via Slack or email. I normally work standard EST hours (9am-5pm), but I can always find time to meet via Zoom after work too. Let me know how I can help!

Objectives

The goal of our Data Science Boot Camp is to provide you with the skills and mentorship necessary to produce a portfolio-worthy data science/machine learning project, while also providing valuable career development support and connecting you with potential employers.

Project Examples

TEAM

Aware NLP Project III

Mohammad Nooranidoost, Baian Liu, Craig Franze, Mustafa Anıl Tokmak, Himanshu Raj, Peter Williams

github URL

This project investigates and evaluates retrieval methods for use in RAG (Retrieval-Augmented Generation) systems. In particular, it examines retrieval quality for information downloaded from employee subreddits. We compared advanced retrieval methods that use clustering, multi-vector indexing, and multi-querying against a naive retrieval baseline.

First Steps/Prerequisites

Computer Setup Day/First Steps
There are some computer setup steps you need to complete before the first lecture. We will meet on 09/05/2024 on Zoom to make sure that we have all done the following:
  1. Cloned the GitHub repo locally
  2. Installed the conda environment.
  3. Run a Jupyter Notebook using that conda environment.
Detailed instructions (created by teaching assistant Ness Mayker Chen) can be found at this link.
 
We will test your ability to do these things by having you submit a "secret code". You will obtain this code by successfully running the notebook
 
computer_setup_day/find_secret_code.ipynb
 
When you have obtained the code, put it in the textbox at https://www.erdosinstitute.org/ds-boot-camp-prep
 
If you can do these things independently, please show up to help your colleagues!
If you cannot do these things independently, please show up to get help from your colleagues!
 
Prerequisites
 
In addition to these computer setup steps there are also some content prerequisites:
  1. Base level familiarity with Python
  2. Differential calculus. Ideally you also know some multivariate differential calculus and linear algebra.
  3. Basic statistics and probability

Program Content


Textbook/Notes

Alec's Lost Introduction

Live Lectures

Due to a technical glitch, the audio for the video Alec made to introduce himself and relevant information about projects didn't work in the orientation lecture. You can watch it now!

Slides
Transcript
Code

Gradient Boosting

11 Week: Ensemble Learning II (prerecorded)

A second boosting algorithm that is loosely associated with gradients.

Slides
Code

Week 12: Neural Networks

12 Week: Neural Networks (prerecorded)

Feed forward networks in theory and in sklearn.

Slides
Code

keras

12 Week: Neural Networks (prerecorded)

We introduce the keras package for neural network construction in Python.

Slides
Code

Introduction to Recurrent Neural Networks I

12 Week: Neural Networks (prerecorded)

A brief review of the theory of basic recurrent neural networks.

Slides
Code

tSNE

Bonus content (prerecorded)

Another dimension reduction technique, used primarily for data visualization.

Slides
Code

Hierarchical Clustering

Bonus content (prerecorded)

Our second clustering algorithm.

Slides
Code

Regression Version of Classification Algorithms

Bonus content (prerecorded)

All of your favorite classification algorithms back for regression purposes.

Slides
Code

How To Form Projects

Presentation Tips and Tricks (prerecorded)

This video should show you how to navigate the team formation process on the Erdos website.

Slides
Transcript

How to clone the GitHub Repo

Technical Support

This video will walk you through cloning the GitHub repo. It also addresses how to troubleshoot some common pitfalls.

Transcript
Code

Data Source Websites

Data Collection (prerecorded)

We cover a plethora of data source websites you can use.

Slides
Code

Data in Databases

Data Collection (prerecorded)

Your data is stuck in a database. Can you get it out? Learn how in this video.

Slides
Code

Data Splits

02 Week: Regression I (prerecorded)

We introduce data splits including:
1. Train/Test splits
2. Validation sets
3. k-fold cross validation.

Slides
Code
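
As a rough illustration of the splits listed in this entry, the sketch below builds a train/test split and k-fold cross-validation folds using only index lists. The function names, the 20-observation data set, and the split sizes are hypothetical choices for illustration and are not taken from the course notebooks.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, holdout) index lists, one pair per fold."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        holdout = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, holdout


# Hypothetical data set with 20 observations.
train_idx, test_idx = train_test_split_indices(n=20, test_fraction=0.2)

# Cross-validation only touches the training indices; the test set stays untouched.
cv_folds = list(k_fold_indices(train_idx, k=4))
```

A single validation set corresponds to using just one of these (train, holdout) pairs instead of cycling through all k of them.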

Multiple Linear Regression

02 Week: Regression I (prerecorded)

We introduce multiple linear regression.

Slides
Code

Scaling Data

03 Week: Regression II (prerecorded)

Having features with very different scales can impact the performance of some models. In this video we show how to use sklearn's StandardScaler object to standardize our features.

Slides
Code
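
To make the idea of standardization concrete, here is a minimal pure-Python sketch of the transformation: subtract the column mean and divide by the column standard deviation. It is not the sklearn implementation; the `standardize` helper and the `heights_cm` data are made up for illustration. By default, StandardScaler also divides by the population standard deviation.

```python
import statistics


def standardize(column):
    """Return the column rescaled to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # assumes the column is not constant (std > 0)
    return [(x - mean) / std for x in column]


# Hypothetical feature measured on a large scale.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = standardize(heights_cm)
```

In a modeling workflow, the mean and standard deviation would normally be computed on the training data only and then reused to transform any holdout data.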

Bias-Variance Trade-Off

04 Week: Regression III (prerecorded)

The expected generalization error of a learning algorithm can be decomposed into a bias term and a variance term, plus an irreducible noise term.

Slides
Code
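
For reference, one standard way to write the decomposition for squared-error loss is sketched below. It assumes a data model $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, and an estimator $\hat{f}$ fit on a random training set; this notation is an assumption made here and may differ from the notation used in the video.

```latex
\mathbb{E}\left[ \left( y - \hat{f}(x) \right)^2 \right]
  = \underbrace{\left( \mathbb{E}[\hat{f}(x)] - f(x) \right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[ \left( \hat{f}(x) - \mathbb{E}[\hat{f}(x)] \right)^2 \right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectation is taken over both the training set and the noise at the fixed point $x$.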

PCA and Basketball

Bonus content (prerecorded)

A cool PCA example where the components have an interesting interpretation.

Slides
Code

Adjustments for Time Series Data

07 Week: Time Series I (prerecorded)

Since time series are ordered, we need a slight adjustment to our cross-validation strategy.

Slides
Code
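
One common form of this adjustment is to make sure every holdout set comes strictly after its training set in time. The sketch below illustrates that idea with an expanding training window; the function name and the sizes used are hypothetical, and the video may present a different variant.

```python
def forward_chaining_splits(n_obs, n_splits, horizon):
    """Yield (train, holdout) index lists where each holdout follows its training set in time."""
    for i in range(n_splits, 0, -1):
        end_train = n_obs - i * horizon
        if end_train <= 0:
            raise ValueError("not enough observations for this many splits")
        yield list(range(end_train)), list(range(end_train, end_train + horizon))


# Hypothetical series with 10 ordered observations, 3 splits, 2-step holdouts.
splits = list(forward_chaining_splits(n_obs=10, n_splits=3, horizon=2))
```

Unlike shuffled k-fold splits, no training index here is ever later than a holdout index, so the model is never fit on the future.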

Rolling Averages

07 Week: Time Series I (prerecorded)

Our first time series forecast works by taking local averages of previous observations.

Slides
Code
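
A minimal sketch of a rolling average forecast follows: the prediction for the next time step is the mean of the last few observations. The `window` size and the `sales` values are invented for illustration.

```python
def rolling_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical series of six ordered observations.
sales = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]
forecast = rolling_average_forecast(sales, window=3)  # mean of 13, 15, 14
```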

Autoregressive (AR(p)) Models

07 Week: Time Series II (prerecorded)

An autoregressive model expresses the value of a time series as a linear combination of lags plus an error term.

Slides
Transcript
Code
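
To illustrate the definition, the sketch below computes a one-step forecast from an AR(p) model with known coefficients. Since the error term has mean zero, the forecast is just the intercept plus the linear combination of lags. The coefficients and history here are hypothetical; in practice they would be estimated from data.

```python
def ar_one_step_forecast(history, coefficients, intercept=0.0):
    """Forecast the next value of an AR(p) model, where p = len(coefficients).

    coefficients[0] multiplies the most recent observation (lag 1),
    coefficients[1] multiplies the one before it (lag 2), and so on.
    """
    p = len(coefficients)
    if len(history) < p:
        raise ValueError("need at least p past observations")
    lags = history[::-1][:p]  # most recent observation first
    return intercept + sum(phi * y for phi, y in zip(coefficients, lags))


# Hypothetical AR(2) model: y_t = 1.0 + 0.5 * y_{t-1} + 0.25 * y_{t-2} + error
history = [1.0, 2.0, 3.0]
forecast = ar_one_step_forecast(history, coefficients=[0.5, 0.25], intercept=1.0)
```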

SARIMA

07 Week: Time Series II (prerecorded)

SARIMA stands for Seasonal Autoregressive Integrated Moving Average. We will learn how these models are defined and see an example of how to use such a model in practice.

Slides
Code

The Confusion Matrix

08 Week: Classification I (prerecorded)

And you thought things couldn't get any more confusing.

Slides
Code

Bayes' Based Classifiers I

09 Week: Classification II (prerecorded)

The first in a two part series about Bayes' rule based classification algorithms.

Slides
Code

Support Vector Machines I

09 Week: Classification II (prerecorded)

A class of support vector machine for linear problems.

Slides
Code

What is Ensemble Learning?

10 Week: Ensemble Learning I (prerecorded)

Putting lots of models together to make a better one (hopefully).

Slides
Code

Boosting

11 Week: Ensemble Learning II (prerecorded)

We describe the general idea behind all boosting algorithms.

Slides
Code

Math Hour 1

Math Hour

We discuss Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) Estimation of model parameters. We address estimating the parameters of binomial and normal distributions from a sample.

Slides
Transcript
Code
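
As a small worked example of the estimators discussed here, the sketch below computes the closed-form maximum likelihood estimates for a binomial success probability and for the mean and variance of a normal sample, plus a MAP estimate of the binomial probability. The Beta(2, 2) prior and the sample values are assumptions chosen for illustration, not taken from the session.

```python
def binomial_mle(successes, trials):
    """MLE of the success probability: the observed success fraction."""
    return successes / trials


def binomial_map(successes, trials, alpha=2.0, beta=2.0):
    """MAP estimate under a Beta(alpha, beta) prior (mode of the Beta posterior).

    Assumes alpha > 1 and beta > 1 so that the posterior mode is interior.
    """
    return (successes + alpha - 1) / (trials + alpha + beta - 2)


def normal_mle(sample):
    """MLE of the mean and variance; note the variance divides by n, not n - 1."""
    n = len(sample)
    mu = sum(sample) / n
    variance = sum((x - mu) ** 2 for x in sample) / n
    return mu, variance


p_hat = binomial_mle(successes=7, trials=10)
p_map = binomial_map(successes=7, trials=10)
mu_hat, var_hat = normal_mle([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

The MAP estimate is pulled from 0.7 toward the prior's center of 0.5, which shows how the prior acts like extra pseudo-observations.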

XGBoost

11 Week: Ensemble Learning II (prerecorded)

Extra Gradient Boost!

Slides
Code

Perceptrons

12 Week: Neural Networks (prerecorded)

The neurons of neural networks.

Slides
Code

Introduction to Convolutional Neural Networks

12 Week: Neural Networks (prerecorded)

We introduce the basic theory behind convolutional neural networks, NNs designed for grid-based data.

Slides
Code

Loading Pre-Trained Models

12 Week: Neural Networks (prerecorded)

How you can load a model you have saved after training it for a long time.

Slides
Code

What is Clustering?

Bonus content (prerecorded)

We take a moment to define clustering problems.

Slides
Code

Imputation

Bonus content (prerecorded)

When you are missing data, try imputing!

Slides
Code

Gradient Descent

Bonus content (prerecorded)

Taking advantage of the gradient to minimize cost functions.

Slides
Code
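
Here is a minimal sketch of gradient descent on a one-dimensional cost function. The cost function (w - 3)^2, the learning rate, and the step count are arbitrary choices made for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient to move toward a minimum."""
    w = start
    for _ in range(n_steps):
        w -= learning_rate * gradient(w)
    return w


# Hypothetical cost function: cost(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
minimum = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
```

With this learning rate, the distance to the minimizer w = 3 shrinks by a constant factor at every step; a learning rate that is too large would instead make the iterates diverge.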

General Presentation Tips

Presentation Tips and Tricks (prerecorded)

Some general tips for making a good presentation from Matt.

Slides
Code

Organizing your work

Technical Support

We present two options for organizing the work you do in this course:

1. Copying the notebooks you are working on to a new folder.
2. Creating a local branch of the repo where you do your work.

Transcript
Code

Web Scraping with BeautifulSoup

Data Collection (prerecorded)

We give a brief introduction to web scraping with BeautifulSoup.

Slides
Code

Summary and Conclusion

Data Collection (prerecorded)

We sure have learned a lot of ways to collect data with Python. Let's summarize and make some final conclusions on this topic.

Slides
Code

Simple Linear Regression

02 Week: Regression I (prerecorded)

We introduce the simple linear regression model.

Slides
Code

Categorical Variables and Interactions

03 Week: Regression II (prerecorded)

Adjusting our regression setup to accommodate categorical variables.

Slides
Code

Basic Pipelines

03 Week: Regression II (prerecorded)

Pipelines are a nice way to put all modeling steps into one neat package. Here we introduce the most basic pipeline creation method.

Slides
Code

Regularization

04 Week: Regression III (prerecorded)

Regularization adds a term to our loss function which penalizes large parameters. This can help prevent overfitting.

Slides
Code
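
To show how the penalty term changes a fit, the sketch below uses the simplest possible setting: ridge regression with one feature and no intercept, where minimizing sum((y - w*x)^2) + lam * w^2 has a closed-form solution. This simplified setting, the function name, and the data are assumptions made for illustration.

```python
def ridge_slope(x, y, lam):
    """Slope minimizing sum((y_i - w * x_i)^2) + lam * w^2 (one feature, no intercept)."""
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    return sum_xy / (sum_xx + lam)


# Hypothetical data lying exactly on the line y = 2x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

unpenalized = ridge_slope(x, y, lam=0.0)   # ordinary least squares slope
penalized = ridge_slope(x, y, lam=14.0)    # penalty shrinks the slope toward 0
```

Increasing `lam` shrinks the fitted parameter toward zero, trading some bias for lower variance.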

Feature Selection Approaches

04 Week: Regression III (prerecorded)

We review a few different approaches to choosing what features to include in a linear regression model.

Slides
Code

Time and Dates in Python

07 Week: Time Series I (prerecorded)

A brief aside on how to handle time and dates in Python.

Slides
Code

Exponential Smoothing

07 Week: Time Series I (prerecorded)

Exponential smoothing forecast methods are rolling averages with weights that decrease exponentially going backwards through time. We introduce four such methods.

Slides
Code
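
The sketch below illustrates only the simplest member of this family, simple exponential smoothing, where the forecast is a running "level" updated as a weighted average of the newest observation and the previous level. Initializing the level at the first observation and using alpha = 0.5 are assumptions made for illustration.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast from simple exponential smoothing.

    The level is initialized at the first observation (one common convention).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for observation in series[1:]:
        level = alpha * observation + (1 - alpha) * level
    return level


# Hypothetical series: level goes 10 -> 11 -> 12.5 with alpha = 0.5.
forecast = simple_exponential_smoothing([10.0, 12.0, 14.0], alpha=0.5)
```

Unrolling the update shows that an observation k steps in the past receives weight alpha * (1 - alpha)^k, which is the exponential decay the description refers to.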

Moving Average (MA(q)) Models

07 Week: Time Series II (prerecorded)

A Moving Average model expresses the value of a time series as a linear combination of lagged independent normally distributed errors.

Slides
Transcript
Code
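
As an illustration of the definition, the sketch below builds an MA(q) series from a given list of errors, using y_t = mean + e_t + theta_1 * e_{t-1} + ... + theta_q * e_{t-q}. The coefficient value, the random seed, and the treatment of errors before the start of the sample as zero are assumptions made for illustration.

```python
import random


def moving_average_process(errors, thetas, mean=0.0):
    """Build an MA(q) series from errors, where q = len(thetas).

    Errors before the start of the sample are treated as zero.
    """
    series = []
    for t in range(len(errors)):
        value = mean + errors[t]
        for j, theta in enumerate(thetas, start=1):
            if t - j >= 0:
                value += theta * errors[t - j]
        series.append(value)
    return series


# Simulate a hypothetical MA(1) process with independent standard normal errors.
rng = random.Random(0)
errors = [rng.gauss(0, 1) for _ in range(200)]
simulated = moving_average_process(errors, thetas=[0.6])
```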

Adjustments for Classification

08 Week: Classification I (prerecorded)

You gotta stratify, pal.

Slides
Code

Logistic Regression

08 Week: Classification I (prerecorded)

Just when we thought we were done with regression, it pulls us back in.

Slides
Code

Bayes' Based Classifiers II

09 Week: Classification II (prerecorded)

The second of a two part series on Bayes' rule based classification.

Slides
Code

Support Vector Machines II

09 Week: Classification II (prerecorded)

A type of support vector machine that can deal with nonlinear problems.

Slides
Code

Random Forests

10 Week: Ensemble Learning I (prerecorded)

What do you get when you put a whole bunch of random decision trees together?

Slides
Code

Adaptive Boosting

11 Week: Ensemble Learning II (prerecorded)

Our first boosting algorithm.

Slides
Code

Live Lecture 1

Live Lectures

An orientation to the Fall 2024 Data Science Bootcamp.

Transcript
Code

Voter Models

11 Week: Ensemble Learning II (prerecorded)

All the models get together and take a vote.

Slides
Code
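
A minimal sketch of the voting idea follows, using hard (majority) voting over class labels. The three lists of predictions are made up; in practice each would come from a separately fitted classifier.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine class predictions from several models by majority vote, sample by sample."""
    votes_per_sample = zip(*predictions_per_model)
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]


# Hypothetical predictions from three classifiers on four samples.
model_predictions = [
    [1, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
]
ensemble = majority_vote(model_predictions)
```

Using an odd number of models avoids ties in binary problems. Voting on averaged predicted probabilities instead of labels is another common variant.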

The MNIST Data Set

12 Week: Neural Networks (prerecorded)

We do a more formal introduction of the MNIST data set.

Slides
Code

Introduction to Convolutional Neural Networks II

12 Week: Neural Networks (prerecorded)

The basics of fitting a convolutional neural network in keras.

Slides
Code

Future Directions

12 Week: Neural Networks (prerecorded)

Directions you may pursue if you want to learn more about neural networks, theory and implementation.

Slides
Code

k Means Clustering

Bonus content (prerecorded)

Our first clustering algorithm.

Slides
Code

More Advanced Pipelines

Bonus content (prerecorded)

We demonstrate some more advanced pipeline techniques in sklearn.

Slides
Code

GridSearchCV

Bonus content (prerecorded)

We introduce an sklearn object that makes hyperparameter tuning a tad easier.

Slides
Code

Plotting Tips

Presentation Tips and Tricks (prerecorded)

Some tips on making good presentation plots from Matt.

Slides
Code

A Broad Overview

Data Collection (prerecorded)

In this video we give an eagle's eye view of what we will cover in our data science content.

Slides
Code

Python and APIs

Data Collection (prerecorded)

How can we use Python to collect data from APIs?

Slides

A Supervised Learning Framework

02 Week: Regression I (prerecorded)

We introduce a statistical framework for supervised learning problems.

Slides
Code

A First Predictive Modeling Project

02 Week: Regression I (prerecorded)

We review a typical workflow for a predictive modeling project with some baseball data.

Slides
Code

Polynomial Regression and Nonlinear Transformations

03 Week: Regression II (prerecorded)

We can make polynomial and nonlinear transformations of our features to make additional regression model types.

Slides
Code
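
To make the transformation concrete, the sketch below expands a single feature into polynomial features. A linear regression fit on these columns is a polynomial regression in the original feature. The function name and inputs are hypothetical.

```python
def polynomial_features(x, degree):
    """Expand each value v into the row [v, v**2, ..., v**degree]."""
    return [[value ** power for power in range(1, degree + 1)] for value in x]


# Hypothetical single feature with three observations.
features = polynomial_features([1.0, 2.0, 3.0], degree=3)
```

The model stays linear in its parameters even though it is nonlinear in the original feature, which is why the usual linear regression fitting machinery still applies.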

Linear Regression Diagnostic Plots

03 Week: Regression II (prerecorded)

Residual plots can help us better understand our linear regression model.

Slides
Code

PCA

04 Week: Regression III (prerecorded)

We introduce the theory behind principal components analysis and demonstrate how it is implemented in Python.

Slides
Code

What are Time Series and Forecasting

07 Week: Time Series I (prerecorded)

We introduce the notions of time series data and forecasting.

Slides
Code

Baseline Forecasts

07 Week: Time Series I (prerecorded)

Some good baseline models for various flavors of time series data.

Slides
Code

Stationarity and Autocorrelation

07 Week: Time Series II (prerecorded)

We learn what a stationary time series is and see how we can assess whether a time series is clearly non-stationary.

Slides
Code
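
One informal check for non-stationarity is to look at the sample autocorrelation at several lags, since a series with a trend tends to show autocorrelations that decay slowly. The sketch below computes the sample autocorrelation at a single lag; the function name and the short trending series are assumptions for illustration.

```python
def sample_autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((y - mean) ** 2 for y in series)
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator


# Hypothetical short series with a clear upward trend.
trend = [1.0, 2.0, 3.0, 4.0, 5.0]
r0 = sample_autocorrelation(trend, lag=0)
r1 = sample_autocorrelation(trend, lag=1)
```

Plotting these values across many lags gives an autocorrelation plot for the series.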

Next Steps

07 Week: Time Series II (prerecorded)

Where to go in order to keep learning about time series and forecasting.

Slides
Code

k Nearest Neighbors Classifier

08 Week: Classification I (prerecorded)

It's a beautiful day in the neighborhood, a beautiful day for k neighbors.

Slides
Code

Diagnostic Curves

08 Week: Classification I (prerecorded)

Here we examine a few different curves with which we can examine classifier performance.

Slides
Code

Multi-class Classification Metrics

09 Week: Classification II (prerecorded)

What are some ways we can evaluate classification models for problems with more than two possible classes?

Slides
Code

Decision Trees

10 Week: Ensemble Learning I (prerecorded)

Let's branch out and learn about decision trees.

Slides
Code

Bagging and Pasting

10 Week: Ensemble Learning I (prerecorded)

A more general version of the random forest algorithm.

Slides
Code

Project/Homework Instructions

Project/Team Formation
Project Submission
Projects README

How To Form Projects

Presentation Tips and Tricks (prerecorded)

This video should show you how to navigate the team formation process on the Erdos website.

Slides
Transcript

Schedule

DS Bootcamp computer setup day
Lecture 1: Introduction, Computer Setup, Q/A
Math Hour 1
Office Hour 1
Problem Session 1
Lecture 2: Regression I
Math Hour 2
Office Hour 2
Problem Session 2
Lecture 3: Regression II
Math Hour 3
Office Hour 3
Problem Session 3
Lecture 4: Regression III
Math Hour 4
Office Hour 4
Problem Session 4
Lecture 5: Inference I
Math Hour 5
Office Hour 5
Problem Session 5
Lecture 6: Inference II
Math Hour 6
Office Hour 6
Problem Session 6
Lecture 7: Time Series
Math Hour 7
Office Hour 7
Problem Session 7
Lecture 8: Classification I
Math Hour 8
Office Hour 8
Problem Session 8
Lecture 9: Classification II
Math Hour 9
Office Hour 9
Problem Session 9
Lecture 10: Ensemble Learning I
Math Hour 10
Office Hour 10
Problem Session 10
Lecture 11: Ensemble Learning II
Math Hour 11
Office Hour 11
Problem Session 11
Lecture 12: Introduction to Neural Networks
Math Hour 12
Office Hour 12
Commencement and Project Showcase

Please check your registration email for the program schedule and Zoom links.

Project/Homework Deadlines

Sep 20, 2024

Watch video about Project Formation

This should help answer any Q's you may have going into project formation

Sep 20, 2024

Watch 3 Previous Top Projects

Consult the project database, and watch at least 3 previous top projects from Erdos Alumni.

Sep 27, 2024

Project Pitch Hour

Opportunity to meet with other Erdos Fellows and form teams and propose topics.

Oct 4, 2024

Data gathering and defining stakeholders + KPIs

Find the dataset you will be working with. Describe the dataset and the problem you are looking to solve (1 page max). List the stakeholders of the project and company key performance indicators (KPIs) (bullet points).

Oct 4, 2024

Finalized Teams with Preliminary Project Ideas

Teams need to be finalized by this point. If you proposed or created a project, you must have others in your group. If you did not propose or create a project, you must join an open group.

Oct 18, 2024

Exploratory data analysis + visualizations [Checkpoint]

Distributions of variables, looking for outliers, etc. Descriptive statistics.

Oct 18, 2024

Data cleaning + preprocessing

Look for missing values and duplicates. Basic data manipulation & preliminary feature engineering.

Nov 1, 2024

Written proposal of modeling approach [Checkpoint]

Describe your planned modeling approach, based on the exploratory data analysis from the last two weeks (< 1 page, bullet points).

Nov 8, 2024

Machine learning models or equivalent [Checkpoint]

Results with visualizations and/or metrics. List of successes and pitfalls.

Dec 3, 2024

Final Projects Due

Final Projects must be submitted by this deadline in order to receive a certificate of completion.
