Data Science Boot Camp

Spring 2023

May 9, 2023 - Jun 8, 2023

To access the program content, you must first create an account and member profile and be logged in.

Registration Deadlines

Mar 16, 2023 - Academics from Member Institutions/Departments
Mar 16, 2023 - Academics from Non-Member Institutions paying the $500 membership fee
Jan 16, 2023 - Academics from Non-Member Institutions applying for Corporate Sponsored Fellowships

Category

Launch

Overview

The Erdős Institute's signature Data Science Boot Camp has been running since May 2018 thanks to the generous support of our sponsors, members, and partners. Due to its popularity, we now offer the boot camp online twice per year in two formats: a month-long intensive boot camp each May and a semester-long version each Fall.

Organizers, Instructors, and Advisors

Objectives

The goal of our Data Science Boot Camp is to provide you with the skills and mentorship necessary to produce a portfolio-worthy data science/machine learning project, while also providing valuable career development support and connecting you with potential employers.

Those who successfully complete a team project will receive a digital certificate of completion with a sharable URL.

Project Examples

TEAM

Aware NLP Project III

Mohammad Nooranidoost, Baian Liu, Craig Franze, Mustafa Anıl Tokmak, Himanshu Raj, Peter Williams

github URL

This project investigates and evaluates retrieval methods for use in RAG (Retrieval-Augmented Generation) systems. In particular, it examines retrieval quality for information downloaded from employee subreddits. We compared advanced retrieval methods (clustering, multi-vector indexing, and multi-querying) against a baseline of naive retrieval.

First Steps/Prerequisites

Participants should have a base-level familiarity with Python. Participants should also be familiar with some basic math concepts. Finally, you will also need to have your laptop or desktop computer set up for the course. If you are new to Python, need a quick math refresher, or if you need help setting up your computer, then please follow the link below.

Program Content

Textbook/Notes

Live Lecture 12 - Neural Networks Intro

May 2023 Live Lectures

The twelfth live lecture of our May 2023 Data Science Boot Camp. We provide an introduction to neural networks.

Slides
Code

Live Lecture 9 - Classification Day 2

May 2023 Live Lectures

The ninth live lecture of our May 2023 Data Science Boot Camp. We cover day 2 of our classification content with a brief move into dimension reduction.

Slides
Code

Live Lecture 6 - Time Series Day 1

May 2023 Live Lectures

The sixth live lecture of our May 2023 Data Science Boot Camp. We started our two-day time series unit.

Slides
Code

Live Lecture 3 - Intro to Supervised Learning and Regression 1

May 2023 Live Lectures

The third live lecture of our May 2023 Data Science Boot Camp. We introduced fundamental concepts of supervised learning and began our regression unit.

Slides
Code

Facebook Prophet Model

Time Series

We introduce the Facebook Prophet model.

Slides
Code

Web Scraping with BeautifulSoup

Data Collection

We give a brief introduction to web scraping with BeautifulSoup.

Slides
Code

GridSearchCV

Supervised Learning

We introduce an sklearn object that makes hyperparameter tuning a tad easier.

Slides
Code

Introduction to Recurrent Neural Networks I

Neural Networks

A brief review of the theory of basic recurrent neural networks.

Slides
Code

keras

Neural Networks

We introduce the keras package for building neural networks in Python.

Slides
Code

Perceptrons

Neural Networks

The neurons of neural networks.

Slides
Code

What is Clustering?

Clustering

We take a moment to define clustering problems.

Slides
Code

PCA II

Dimension Reduction

We review how to explain all that variance.

Slides
Code

More Advanced Pipelines

Cleaning

We demonstrate some more advanced pipeline techniques in sklearn.

Slides
Code

Scaling Data

Cleaning

Sometimes your data has features on different scales. Here we show you how to change that.

Slides
Code

XGBoost

Ensemble Learning

Extreme Gradient Boosting!

Slides
Code

Boosting

Ensemble Learning

We describe the general idea behind all boosting algorithms.

Slides
Code

What is Ensemble Learning?

Ensemble Learning

Putting lots of models together to make a better one (hopefully).

Slides
Code

Stationarity and Autocorrelation

Time Series

We learn what a stationary time series is and see how we can assess whether a time series is clearly non-stationary.

Slides
Code

Baseline Forecasts

Time Series

Some good baseline models for various flavors of time series data.

Slides
Code

What are Time Series and Forecasting

Time Series

We introduce the notions of time series data and forecasting.

Slides
Code

Support Vector Machines II

Classification

A type of support vector machine that can deal with nonlinear problems.

Slides
Code

Bayes' Based Classifiers II

Classification

The second in a two-part series on Bayes' rule based classification.

Slides
Code

Logistic Regression

Classification

Just when we thought we were done with regression, it pulls us back in.

Slides
Code

Adjustments for Classification

Classification

You gotta stratify pal.

Slides
Code

Linear Regression Diagnostic Plots

Regression

Residual plots can help us better understand our linear regression model.

Slides
Code

Polynomial Regression and Nonlinear Transformations

Regression

We can make polynomial and nonlinear transformations of our features to make additional regression model types.

Slides
Code

A First Predictive Modeling Project

Regression

Batter up!
We review a typical workflow for a predictive modeling project with some baseball data.

Slides
Code

Bias-Variance Trade-Off

Supervised Learning

Sometimes it is a good thing to be a little biased, statistically that is.

Slides
Code

Train Test Splits

Supervised Learning

The first video in a three-part series discussing various data splits that are made for predictive modeling.

Slides
Code

Summary and Conclusion

Data Collection

We sure have learned a lot of ways to collect data with Python. Let's summarize and draw some final conclusions on this topic.

Slides
Code

Plotting Tips

Presentation Tips and Tricks

Some tips on making good presentation plots from Matt.

Slides
Code

Live Lecture 11 - Ensemble Day 2

May 2023 Live Lectures

The eleventh live lecture of our May 2023 Data Science Boot Camp. We cover boosting and voter models.

Slides
Code

Live Lecture 8 - Classification Day 1

May 2023 Live Lectures

The eighth live lecture of our May 2023 Data Science Boot Camp. We dive into classification.

Slides
Code

Live Lecture 5 - Regression Day 3

May 2023 Live Lectures

The fifth live lecture of our May 2023 Data Science Boot Camp. We wrapped up our live lecture coverage of regression.

Slides
Code

Live Lecture 2 - Data Collections

May 2023 Live Lectures

The second live lecture of our May 2023 Data Science Boot Camp. We discussed methods of finding data for your data science project.

Slides
Code

Tree-Based Forecasts

Time Series

Using decision trees and random forests to make forecasts.

Slides
Code

Data Source Websites

Data Collection

We cover a plethora of data source websites you can use.

Slides
Code

Future Directions

Neural Networks

Directions you may pursue if you want to learn more about neural networks, theory and implementation.

Slides
Code

Introduction to Convolutional Neural Networks II

Neural Networks

The basics of fitting a convolutional neural network in keras.

Slides
Code

Multilayer Neural Networks

Neural Networks

Feedforward networks in theory and in sklearn.

Slides
Code

Hierarchical Clustering

Clustering

Our second clustering algorithm.

Slides
Code

tSNE

Dimension Reduction

Another dimension reduction technique, used primarily for data visualization.

Slides
Code

PCA I

Dimension Reduction

We introduce the theory behind principal components analysis and demonstrate how it is implemented in Python.

Slides
Code

Imputation

Cleaning

When you are missing data, try imputing!

Slides
Code

Introduction

Cleaning

An introduction to our Cleaning section of notebooks.

Slides
Code

Gradient Boosting

Ensemble Learning

A second boosting algorithm that is loosely associated with gradients.

Slides
Code

Bagging and Pasting

Ensemble Learning

A more general version of the random forest algorithm.

Slides
Code

Next Steps

Time Series

Where to go in order to keep learning about time series and forecasting.

Slides
Code

Averaging and Smoothing II

Time Series

The second in a two-part series on averaging based models.

Slides
Code

Time and Dates in Python

Time Series

A brief aside on how to handle times and dates in Python.

Slides
Code

Gradient Descent

Supervised Learning

Taking advantage of the gradient to minimize cost functions.

Slides
Code

Support Vector Machines I

Classification

A class of support vector machine for linear problems.

Slides
Code

Bayes' Based Classifiers I

Classification

The first in a two-part series about Bayes' rule based classification algorithms.

Slides
Code

The Confusion Matrix

Classification

And you thought things couldn't get any more confusing.

Slides
Code

Regression Version of Classification Algorithms

Regression

All of your favorite classification algorithms back for regression purposes.

Slides
Code

Interpreting Linear Regression

Regression

How can we interpret the results of a fitted linear regression model? Let's find out.

Slides
Code

Categorical Variables and Interactions

Regression

How can we adjust our regression setup to accommodate categorical variables? Also, what the heck is an interaction?

Slides
Code

Simple Linear Regression

Regression

Our first model! Come learn about simple linear regression.

Slides
Code

k-Fold Cross-Validation

Supervised Learning

In part three of this three-part series we discuss the final data splitting technique, cross-validation.

Slides
Code

A Supervised Learning Framework

Supervised Learning

We introduce a statistical framework for supervised learning problems.

Slides
Code

Data in Databases

Data Collection

Your data is stuck in a database, can you get it out? Learn how in this video.

Slides
Code

Welcome!

Introduction

In this video we welcome you to our data science content.

Slides
Code

Live Lecture 10 - Classification Day 3 & Ensemble Day 1

May 2023 Live Lectures

The tenth live lecture of our May 2023 Data Science Boot Camp. We wrap up classification and move over to ensemble learning. We also cover the rest of PCA that we were unable to finish in lecture 9.

Slides
Code

Live Lecture 7 - Time Series Day 2

May 2023 Live Lectures

The seventh live lecture of our May 2023 Data Science Boot Camp. We wrap up our time series coverage.

Slides
Code

Live Lecture 4 - Regression Day 2

May 2023 Live Lectures

The fourth live lecture of our May 2023 Data Science Boot Camp. We continued to learn about linear regression.

Slides
Code

Live Lecture 1 - Introduction

May 2023 Live Lectures

The first live lecture of our May 2023 Data Science Boot Camp provided an introduction to the boot camp and covered participant setup.

Code

Python and APIs

Data Collection

How can we use Python to collect data from APIs?

Slides
Code

General Presentation Tips

Presentation Tips and Tricks

Some general tips for making a good presentation from Matt.

Slides
Code

Loading Pre-Trained Models

Neural Networks

How you can load a model you have saved after training it for a long time.

Slides
Code

Introduction to Convolutional Neural Networks

Neural Networks

We introduce the basic theory behind convolutional neural networks, NNs designed for grid-based data.

Slides
Code

The MNIST Data Set

Neural Networks

We do a more formal introduction of the MNIST data set.

Slides
Code

k Means Clustering

Clustering

Our first clustering algorithm.

Slides
Code

PCA III

Dimension Reduction

We end our three-part series on PCA by demonstrating how you can interpret the results of a PCA fit.

Slides
Code

Introduction

Unsupervised Learning

We introduce the field of unsupervised learning.

Slides
Code

Basic Pipelines

Cleaning

Pipelines are a nice way to put all modeling steps into one neat package. Here we introduce the most basic pipeline creation methods.

Slides
Code

Voter Models

Ensemble Learning

All the models get together and take a vote.

Slides
Code

Adaptive Boosting

Ensemble Learning

Our first boosting algorithm.

Slides
Code

Random Forests

Ensemble Learning

What do you get when you put a whole bunch of random decision trees together?

Slides
Code

ARIMA

Time Series

The final time series model we cover in lecture.

Slides
Code

Averaging and Smoothing I

Time Series

Our first time series forecast works by taking local averages of previous observations.

Slides
Code

Adjustments for Time Series Data

Time Series

We have to preserve the space time continuum, great Scott!

Slides
Code

Decision Trees

Classification

Let's branch out and learn about decision trees.

Slides
Code

Multi-class Classification Metrics

Classification

What are some ways we can evaluate classification models for problems with more than two possible classes?

Slides
Code

Diagnostic Curves

Classification

Here we examine a few different curves with which we can examine classifier performance.

Slides
Code

k Nearest Neighbors Classifier

Classification

It's a beautiful day in the neighborhood, a beautiful day for k neighbors.

Slides
Code

Feature Selection Approaches

Regression

We review a few different approaches to choosing what features to include in a linear regression model.

Slides
Code

Regularization

Regression

A constrained optimization approach to regression.

Slides
Code

Multiple Linear Regression

Regression

Regression, but this time with more than one feature. Whoa.

Slides
Code

Data Splits and Overfitting

Supervised Learning

In this video we comment on how data splits can be used to assess, but not truly "combat," overfitting.

Slides
Code

Validation Set

Supervised Learning

In part two of this three-part data split series we discuss validation sets.

Slides
Code

Introduction

Supervised Learning

In this video we introduce the concepts we will be covering in our supervised learning materials.

Slides
Code

A Broad Overview

Introduction

In this video we give a bird's-eye view of what we will cover in our data science content.

Slides
Code

Project/Homework Instructions

Project/Team Formation
Project Submission
Projects README

Schedule

May 3, 2023 at 8:00 PM - Matt Osborne Office Hour
May 9, 2023 at 2:00 PM - Data Science Boot Camp AM Problem Session 1
May 10, 2023 at 2:00 PM - Data Science Boot Camp AM Problem Session 2
May 11, 2023 at 2:00 PM - Data Science Boot Camp AM Problem Session 3
May 12, 2023 at 3:00 PM - Matt Office Hour
May 15, 2023 at 2:00 PM - Data Science Boot Camp AM Problem Session 4
May 16, 2023 at 2:00 PM - Data Science Boot Camp AM Problem Session 5
May 17, 2023 at 2:00 PM - Data Science Boot Camp AM Problem Session 6
May 18, 2023 at 2:00 PM - Data Science Boot Camp AM Problem Session 7
May 19, 2023 at 3:00 PM - Matt Office Hour
May 22, 2023 at 8:00 PM - Data Science Boot Camp PM Problem Session 8
May 23, 2023 at 8:00 PM - Data Science Boot Camp PM Problem Session 9
May 24, 2023 at 8:00 PM - Data Science Boot Camp PM Problem Session 10
May 25, 2023 at 8:00 PM - Data Science Boot Camp PM Problem Session 11
May 26, 2023 at 7:00 PM - Matt Office Hour
May 31, 2023 at 6:00 PM - Matt Office Hour
June 7, 2023 at 4:00 PM - Erdős Final Project Showcase and Commencement
May 5, 2023 at 3:00 PM - Matt Osborne Office Hour
May 9, 2023 at 8:00 PM - Data Science Boot Camp PM Problem Session 1
May 10, 2023 at 8:00 PM - Data Science Boot Camp PM Problem Session 2
May 11, 2023 at 8:00 PM - Data Science Boot Camp PM Problem Session 3
May 12, 2023 at 7:00 PM - Matt Office Hour
May 15, 2023 at 8:00 PM - Data Science Boot Camp PM Problem Session 4
May 16, 2023 at 8:00 PM - Data Science Boot Camp PM Problem Session 5
May 17, 2023 at 8:00 PM - Data Science Boot Camp PM Problem Session 6
May 18, 2023 at 8:00 PM - Data Science Boot Camp PM Problem Session 7
May 19, 2023 at 7:00 PM - Matt Office Hour
May 22, 2023 at 9:30 PM - Data Science Boot Camp Lecture 9
May 23, 2023 at 9:30 PM - Data Science Boot Camp Lecture 10
May 24, 2023 at 9:30 PM - Data Science Boot Camp Lecture 11
May 25, 2023 at 9:30 PM - Data Science Boot Camp Lecture 12
May 29, 2023 at 8:00 PM - Matt Office Hour
June 1, 2023 at 2:00 PM - Matt Office Hour
May 8, 2023 at 9:30 PM - Data Science Boot Camp Lecture 1
May 9, 2023 at 9:30 PM - Data Science Boot Camp Lecture 2
May 10, 2023 at 9:30 PM - Data Science Boot Camp Lecture 3
May 11, 2023 at 9:30 PM - Data Science Boot Camp Lecture 4
May 12, 2023 at 8:30 PM - Project Pitch Day (Live on Zoom)
May 15, 2023 at 9:30 PM - Data Science Boot Camp Lecture 5
May 16, 2023 at 9:30 PM - Data Science Boot Camp Lecture 6
May 17, 2023 at 9:30 PM - Data Science Boot Camp Lecture 7
May 18, 2023 at 9:30 PM - Data Science Boot Camp Lecture 8
May 22, 2023 at 2:00 PM - Data Science Boot Camp AM Problem Session 8
May 23, 2023 at 2:00 PM - Data Science Boot Camp AM Problem Session 9
May 24, 2023 at 2:00 PM - Data Science Boot Camp AM Problem Session 10
May 25, 2023 at 2:00 PM - Data Science Boot Camp AM Problem Session 11
May 26, 2023 at 3:00 PM - Matt Office Hour
May 31, 2023 at 2:00 PM - Matt Office Hour
June 1, 2023 at 8:00 PM - Matt Office Hour

Please check your registration email for the program schedule and Zoom links.

Project/Homework Deadlines

May 12, 2023

8:30 PM

Project Pitch Day (Live on Zoom)

An opportunity to meet other Erdős Fellows, form teams, and propose topics.

May 13, 2023

3:59 AM

Submit Team Proposal to Project Formation Page

If you want to propose a project, or have an idea for a project, submit it by this date.

May 15, 2023

3:59 AM

Finalized Teams with Preliminary Project Idea

Teams need to be finalized by this point. If you proposed or created a project, you must have others in your group. If you did not propose or create a project, you must join an open group.

May 20, 2023

3:59 AM

Data gathering and defining stakeholders + KPIs

Find the dataset you will be working with. Describe the dataset and the problem you are looking to solve (1 page max). List the stakeholders of the project and company key performance indicators (KPIs) (bullet points).

May 20, 2023

3:59 AM

Data cleaning + preprocessing

Look for missing values and duplicates. Basic data manipulation & preliminary feature engineering.

May 27, 2023

3:59 AM

Exploratory data analysis + visualizations [Checkpoint]

Distributions of variables, looking for outliers, etc. Descriptive statistics.

May 27, 2023

3:59 AM

Written proposal of modeling approach [Checkpoint]

Test linearity assumptions. Dimensionality reduction (if necessary). Describe your planned modeling approach, based on the exploratory data analysis from the last two weeks (< 1 page, bullet points).

Jun 2, 2023

3:59 AM

Machine learning models or equivalent [Checkpoint]

Results with visualizations and/or metrics. List of successes and pitfalls.

Jun 3, 2023

4:00 PM

Final project due

Please read the submission instructions on the link below.
