PARTICIPANT DEMOGRAPHICS

4000+ participants with PhDs from 300+ universities

Candidate
Profiles

4639

Seeking Internships

1144

Seeking
Part-Time

687 Seeking
Full-Time

1811

Seeking
Senior/Managerial

384 Seeking
DS, ML, AI

2163

Seeking Quant
Research/Finance

1410

Seeking Software Engineering

787 Seeking Quantum Computing

564 Seeking
UX Research

417 Seeking
Prof/Sci Writing

580

Hover over charts for details

University

Area of Study

Year of PhD

Reported Ethnicity

Reported Gender

First Generation

PARTICIPANT PROJECTS

Examples of projects from prior cohorts

FALL 2024

TEAM

Data Science Boot Camp

Predicting Problematic Internet Use

Daniel Visscher, Emilie Wiesner, Aaron Weinberg

Internet use has been identified by researchers as having the potential to rise to the level of addiction, with associated increased rates of anxiety and depression. Identifying cases of problematic internet usage currently requires evaluation by an expert, however, which is a significant impediment to screening children and adolescents across society. One potential solution is to rely on data that is more easily and uniformly collected: the kind collected by a family physician, a simple survey, or by a smartwatch. The research question this project sets out to answer is: “Can we predict the level of problematic internet usage exhibited by children and adolescents, based on their physical activity and survey responses?”

FALL 2024

TEAM

Data Science Boot Camp

AP Outcomes to university metrics

Shannon McElhenney, Raymond Tana, Shrabana Hazra, Prabhat Devkota, Jung-Tsung Li

This project was designed to investigate the potential relationship between AP exam performance and the presence of nearby universities. It was initially hypothesized that local (especially R1/R2 or public) universities would contribute to better pass rates for AP exams in their vicinities as a result of their various outreach, dual-enrollment, tutoring, and similar programs for high schoolers. We produce a predictive model that uses a few features related to university presence, personal income, and population to predict AP exam performance.

MAY-SUMMER 2024

TEAM

Deep Learning Boot Camp

Wunderpus Octopus (New Atlantis)

Ingrida Semenec, Kshitiz Parihar, Nadir Hajouji, Saswat Mishra, Deniz Olgu Devecioglu

Modeling the relationship between biogeochemical layers and chlorophyll density
The distribution and density of chlorophyll in the ocean are critical indicators of marine primary productivity, which influences the global carbon cycle, marine food webs, and climate regulation. Biogeochemical and physical ocean properties, including nutrient availability, light penetration, water temperature, salinity, and ocean currents influence chlorophyll density. Understanding and accurately modeling these relationships is essential for predicting the impacts of environmental changes on marine ecosystems and for managing oceanic resources effectively. We plan to combine multiple Copernicus Marine Datasets to model the chlorophyll density based on the biochemical and physical properties of the ocean.

MAY-SUMMER 2024

TEAM

Deep Learning Boot Camp

RivusVox Editor

Zachary Bezemek,Francesca Balestrieri

RivusVox Editor: the world's first near-live zero-shot adaptive speech editing system

MAY-SUMMER 2024

TEAM

Deep Learning Boot Camp

A Vocal-Cue Interpreter for Minimally-Verbal Individuals

Julian Rosen, Alessandro Malusà, Rahul Krishna, Atharva Patil, Monalisa Dutta, Sarasi Jayasekara

The ReCANVo dataset consists of ~7k audio recordings of vocalizations from 8 minimally-verbal individuals (mostly people with developmental disabilities). The recordings were made in a real-world setting, and were categorized on the spot by the speaker's caregiver based on context, non-verbal cues, and familiarity with the speaker. There are several pre-defined categories such as selftalk, frustrated, delighted, request, etc., and caregivers could also specify custom categories. Our goal was to train a model, per individual, that accurately predicts labels and improves upon previous work.

We train several different combinations of models of the form “Feature Extractor + Classifier”. For extracting features from audio data, we use two deep models (HuBERT and AST) each with pre-trained weights, as well as mel spectrograms. As classifiers, we use a 4 layer CNN-based neural network (for mel spectrograms), NNs with fully-connected layers (for features coming from deep models), and more.

MAY-SUMMER 2024

TEAM

Deep Learning Boot Camp

“Good composers borrow, Great ones steal!”

Emelie Curl, Tong Shan, Glenn Young, Larsen Linov, Reginald Bain

Throughout history, composers and musicians have borrowed musical elements like chord progressions, rhythms, lyrics, and melodies from each other. Our motivation for this project is born of a fascination with this phenomenon, which of course extends to less legal examples like unconsciously or intentionally copying the work of another. Even famed and highly regarded composers like Bach, Vivaldi, Mozart, and Haydn are not innocent of borrowing from their contemporaries or even recycling their own works. Similarly, in 2015, in a high-profile court case, defendants and artists Robin Thicke and Pharrell Williams were ordered to pay millions of dollars in damages for copyright infringement to Marvin Gaye's estate, considering they borrowed from Gaye’s "Got to Give it Up" when writing their hit "Blurred Lines." Our project aimed to use deep learning to assess the similarity between musical clips to potentially establish a more robust and empirical way to detect music plagiarism.

MAY-SUMMER 2024

TEAM

Deep Learning Boot Camp

Taxi Demand Forecasting

Ngoc Nguyen, Li Meng, Sriram Raghunath, Nazanin Komeilizadeh, Noah Gillespie, Edward Ramirez

Knowing where to go to find customers is the most important question for taxi drivers and ride hailing networks. If demand for taxis can be reliably predicted in real-time, taxi companies can dispatch drivers in a timely manner and drivers can optimize their route decision to maximize their earnings in a given day. Consequently, customers will likely receive more reliable service with shorter wait time. This project aims to use rich trip-level data from the NYC Taxi and Limousine Commission to construct time-series taxi rides data for 63 taxi zones in Manhattan and forecast demand for rides. We will explore deep learning models for time series, including Multilayer Perceptrons, LSTM, Temporal Graph-based Neural Networks, and compare them with a baseline statistical model ARIMAX.

MAY-SUMMER 2024

TEAM

Deep Learning Boot Camp

arXiv Chatbot

Xiaoyu Wang,Ketan Sand,Guoqing Zhang,Tajudeen Mamadou Yacoubou,Tantrik Mukerji

arXiv is the largest open database available containing nearly 2.4 million research papers, spanning 8 major domains covering everything there is to understand from the tiniest of atoms to the entire cosmos. A large language model (LLM) having access to such a dataset will make it unprecedented in generating updated, relevant, and, more importantly, precise information with citable sources.

This is exactly what we have done in this project. We have refined the capabilities of Google’s Gemini 1.5 pro LLM by building a customized Retrieval-Augmented Generation (RAG) pipeline that has access to the entire arXiv database. We then deployed the entire package into an app that mimics a chatbot to make the experience user-friendly.

MAY-SUMMER 2024

TEAM

Data Science Boot Camp

Continuous Glucose Monitoring

Daniel Visscher,Margaret Swerdloff,Noah Gillespie,S. C. Park,oladimeji olaluwoye

The idea of the project is to predict high glucose spikes from continuous glucose data, smartwatch data, food logs, and glycemic index. The dataset consists of the following:
1) Tri-axial accelerometer data (movement in subject)
2) Blood volume pulse
3) Intestinal glucose concentration
4) Electrodermal activity
5) Heart rate
6) IBI (interbeat interval)
7) Skin temperature
8) Food log
Data is public in: https://physionet.org/content/big-ideas-glycemic-wearable/1.1.2/#files-panel

MAY-SUMMER 2024

TEAM

Data Science Boot Camp

Chirp Checker

Andrew Merwin, Caleb Fong, B Mede, Yang Yang, Robert Cass, Calvin Yost-Wolff

The nocturnal soundscapes of late summer and autumn are replete with the familiar chirps, trills, and buzzes of singing insects. But these cryptic performers often remain anonymous and underappreciated.

The goal of this project was to build machine learning models to identify the presence of insects in sound files and to coarsely categorize the sounds as crickets, katydids, or cicadas.

Both Support Vector Classifiers and Convolutional Neural Networks were able to identify insects songs to the broad categories of cricket, katydid, and cicada with 90% accuracy or higher.

In the future, similar, more sophisticated models could be applied to filtering large volumes of passively recorded audio from ecological studies of insects and could power apps that identify insect songs to the species level.

SPRING 2024

TEAM

Data Science Boot Camp

Aware NLP Project III

Mohammad Nooranidoost, Baian Liu, Craig Franze, Mustafa Anıl Tokmak, Himanshu Raj, Peter Williams

This project involves the investigation and evaluation of different methodologies for retrieval for use in RAG (Retrieval-Augmented Generation) systems. In particular, this project investigates retrieval quality for information downloaded from employee subreddits. We investigated the impacts of using clustering, multi-vector indexing, and multi-querying in advanced retrieval methodologies against baseline naive retrieval.

FALL 2023

TEAM

Data Science Boot Camp

Groundwater Forecasting

Riti Bahl, Meredith Sargent, Marcos Ortiz, Chelsea Gary, Anireju Dudun

Groundwater is a critical source of water human survival. A significant percentage of both drinking and crop irrigation water is drawn from groundwater sources through wells. In the US, overuse of groundwater could have major implications for the future and forecasting groundwater can be useful in understanding its impact. Building on historical data for four wells, together with surface water and weather data, in Spokane, WA, we construct and evaluate machine learning models that forecast groundwater levels in the area.

FALL 2023

TEAM

Data Science Boot Camp

Funk

aydin ozbek, Dane Miyata, Kristina Knowles, Mario Gomez, Kashish Mehta

Most existing music recommendation systems rely on listeners to provide seed tracks, and then utilize a variety of different approaches to recommend additional tracks in either a playlist-like listening session or as sequential track recommendations based on user feedback.

We built a playlist recommendation engine that takes a different approach, allowing listeners to generate a novel playlist based on a semantic string, such as the title of desired playlist, specific mood (happy, relaxed), atmosphere (tropical vibe), or function (party music, focus). Using a publicly available dataset of existing playlists (https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge), we combine a semantic similarity vector model with a matrix factorization model to allow users to quickly and easily generate playlists to fit any occasion.

FALL 2023

TEAM

Data Science Boot Camp

The Silent Emergency - Predicting Preterm Birth

Katherine Grillaert, Divya Joshi, Alexander Sutherland, Kristina Zvolanek, Noah Rahman

Preterm birth is a primary cause of infant mortality and morbidity in the United States, affecting approximately 1 in 10 births. The rates are notably higher among Black women (14.6%), compared to White (9.4%) and Hispanic women (10.1%). Despite its prevalence, predicting preterm birth remains challenging due to its multifaceted etiology rooted in environmental, biological, genetic, and behavioral interactions. Our project harnesses machine learning techniques to predict preterm birth using electronic health records. This data intersects with social determinants of health, reflecting some of the interactions contributing to preterm birth. Recognizing that under-representation in healthcare research perpetuates racial and ethnic health disparities, we take care to use diverse data to ensure equitable model performance across underrepresented populations.

FALL 2023

TEAM

Data Science Boot Camp

DDTs: Dementia Detection Tool

Himanshu Khanchandani, Clark Butler, Cisil Karaguzel, Selman Ipek, Shreya Shukla

Alzheimer’s disease (AD) is one of the most common types of dementia and frequently affects the elderly. Electroencephalography (EEG) is a non-invasive technique to measure the brain activity using external electrodes and may help provide improved diagnosis of AD. In this project we use power spectrum of EEG to build a robust machine learning classifier which predicts whether a patient has Alzheimer's or is healthy. We vastly improve upon existing models in the literature by using modified features compared to the ones used in literature.

SPRING 2023

TEAM

Data Science Boot Camp

Correcting Racial Bias in Measurement of Blood Oxygen Saturation

Rohan Myers, Saad Khalid, woojeong kim, Brooks Miner, Jaychandran Padayasi

Fingertip pulse oximeters are the current standard for estimating blood oxygen saturation without a blood draw, both at home and in healthcare settings. However, pulse oximeters overestimate oxygen saturation, often resulting in ‘hidden hypoxemia’: a patient has hypoxemia (dangerously low oxygen saturation), but the oximeter returns a healthy oxygen value. Unfortunately, oximeter overestimation of oxygen saturation is exacerbated for patients with darker skin tones due to light-based oximeter technology. This results in Black patients experiencing hidden hypoxemia at twice the rate of white patients. By combining pulse oximeter readings (SpO2) with additional patient data, we develop improved methods for estimating arterial blood oxygen saturation (SaO2) and identifying Hidden Hypoxemia. The predictions of our models are more accurate than pulse-oximeter readings alone, and remove the systematic racial inequity inherent in the current medical practice of using oximeter readings alone.

SPRING 2021

TEAM

Data Science Boot Camp

MaizeFinder

Tuguldur Sukhbold, Michael Darcy, Pol Arranz-Gibert, AJ Adejare

Our problem is to accurately predict maize field centers in Africa using very low resolution satellite images. The dataset contains many disparate entries including two satellite imagery, one with higher spatial res (Planet) and one with higher temporal resolution and wavelength coverage (Sentinel-2), partial metadata about the crop fields including estimated yield, size, and subjective quality of the measurement. We are employing CNN based image segmentation models to compute displacement vector from the image center.

SPRING 2021

TEAM

Data Science Boot Camp

NLPs

Frank Hidalgo, Joseph Szabo, Christopher Zhang, Sean Perez, Kun Jin

Acronym/Abbreviation (short form) disambiguation is one of main challenges when using NLP methods to uderstand medical records. While this topic has long been studied, it is still a work in progress. Current strategies often involve having manually curated datasets of abbreviations and train classifiers. The main problem of that approach is that curated datasets are sparse and don't include all the short forms. In Dec 2020, a paper came out where they created a large dataset of short forms as one of their steps in their pipeline to pre-train models. The goal of our project would be to build upon their short form disambiguation piece and create a tool to disambiguate a medical short form using its context. Example of the usage of our tool: original_sentence = "The patient states that she has had dizziness, nausea, some heartburn, and some change in her vision. She is gravida 6, para 4, AB 2. She has no history of adverse reaction to anesthesia." AB could stand for "abortion", "ankle-brachial", "blood group in ABO system", "A, B lines in Kerley lines". disambiguated_sentence = "The patient states that she has had dizziness, nausea, some heartburn, and some change in her vision. She is gravida 6, para 4, abortion 2. She has no history of adverse reaction to anesthesia."

FALL 2022

TEAM

Data Science Boot Camp

Lime

Yuchen Luo, Ritika Khurana, Aditya Chander, Taylor Mahler

We built a podcast recommendation engine that suggests episodes to a listener based on either a previous episode that they've heard or an episode description that they can input with freeform text entry.

SPRING 2022

TEAM

Data Science Boot Camp

Supermassive Black Hole

Anna Brosowsky, Sayantan Khan, Nancy Wang, Ethan Zell, Yili Zhang

We built a movie finder app that allows a user to enter some details they remember about a movie (along with some optional filter info on the genre and release year) and then predicts what movie the user is thinking of. To solve this NLP problem, our tool uses an embed-and-rerank model. We have precomputed vectorizations of movie plot information for the approximately 34,000 movies in our dataset.

Our model’s first step is to vectorize the user’s query and do a fast comparison to find the 100 closest plot vectors. Then it reranks these top 100 closest plots, performing a more thorough comparison using a neural network that semantically compares the plot fragments with the original query. Finally, we output the 10 movies which show up at the top of this new ranking.

SPRING 2022

TEAM

Data Science Boot Camp

Erdio

Matthew Frick, Paul Jreidini, Matthew Heffernan

Timely identification of safety-critical events, such as gunshots, is of great importance to public safety stakeholders. However, existing systems only deliver limited value by not classifying additional urban sounds. We perform classification of environmental sounds to detect safety-critical events, in particular gunshots, and provide information on first-response via siren detection. We also engineer general features for off-line classification tasks and demonstrate how this system can provide value to additional stakeholders in the film and television industry.

SPRING 2022

TEAM

Data Science Boot Camp

SKYLAB

Chenyi Gu, Briana Stanfield, Dylan Bates, Kanishk Jain

The NHL Stanley Cup is the oldest existing trophy to be awarded to a professional sports franchise in North America, and often considered “the hardest trophy to win in professional sport.” Using just regular season data, we want to know, can we predict who is going to win the Stanley Cup?
We collected data from each team, as well as data from every player in over 20,000 games going back to 2005. Using this data, we made an ensemble model using logistic regression, AdaBoost, random forests, and a neural network, which were able to predict playoff data with up to 70% accuracy - above the theoretical threshold reported in the literature of 62%.

Join Us

THE ERDŐS INSTITUTE

Helping PhDs get and create jobs they love at every stage of their career.

PARTICIPANT DEMOGRAPHICS

4000+ participants with PhDs from 300+ universities

Candidate Profiles

4639

Seeking Internships

1144

Seeking Part-Time

687

Seeking Full-Time

1811

Seeking Senior/Managerial

384

Seeking DS, ML, AI

2163

Seeking Quant Research/Finance

1410

Seeking Software Engineering

787

Seeking Quantum Computing

564

Seeking UX Research

417

Seeking Prof/Sci Writing

580

Hover over charts for details

PARTICIPANT PROJECTS

Examples of projects from prior cohorts

FALL 2024

TEAM

Data Science Boot Camp

Predicting Problematic Internet Use

Daniel Visscher, Emilie Wiesner, Aaron Weinberg

FALL 2024

TEAM

Data Science Boot Camp

AP Outcomes to university metrics

Shannon McElhenney, Raymond Tana, Shrabana Hazra, Prabhat Devkota, Jung-Tsung Li

MAY-SUMMER 2024

TEAM

Deep Learning Boot Camp

Wunderpus Octopus (New Atlantis)

Ingrida Semenec, Kshitiz Parihar, Nadir Hajouji, Saswat Mishra, Deniz Olgu Devecioglu

MAY-SUMMER 2024

TEAM

Deep Learning Boot Camp

RivusVox Editor

Zachary Bezemek,Francesca Balestrieri

MAY-SUMMER 2024

TEAM

Deep Learning Boot Camp

A Vocal-Cue Interpreter for Minimally-Verbal Individuals

Julian Rosen, Alessandro Malusà, Rahul Krishna, Atharva Patil, Monalisa Dutta, Sarasi Jayasekara

MAY-SUMMER 2024

TEAM

Deep Learning Boot Camp

“Good composers borrow, Great ones steal!”

Emelie Curl, Tong Shan, Glenn Young, Larsen Linov, Reginald Bain

MAY-SUMMER 2024

TEAM

Deep Learning Boot Camp

Taxi Demand Forecasting

Ngoc Nguyen, Li Meng, Sriram Raghunath, Nazanin Komeilizadeh, Noah Gillespie, Edward Ramirez

MAY-SUMMER 2024

TEAM

Deep Learning Boot Camp

arXiv Chatbot

Xiaoyu Wang,Ketan Sand,Guoqing Zhang,Tajudeen Mamadou Yacoubou,Tantrik Mukerji

MAY-SUMMER 2024

TEAM

Data Science Boot Camp

Continuous Glucose Monitoring

Daniel Visscher,Margaret Swerdloff,Noah Gillespie,S. C. Park,oladimeji olaluwoye

MAY-SUMMER 2024

TEAM

Data Science Boot Camp

Chirp Checker

Andrew Merwin, Caleb Fong, B Mede, Yang Yang, Robert Cass, Calvin Yost-Wolff

SPRING 2024

Candidate
Profiles

Seeking
Part-Time

Seeking
Full-Time

Seeking
Senior/Managerial

Seeking
DS, ML, AI

Seeking Quant
Research/Finance

Seeking
UX Research

Seeking
Prof/Sci Writing