PARTICIPANT DEMOGRAPHICS
4000+ participants with PhDs from 300+ universities
Candidate Profiles: 4026
Seeking Internships: 1088
Seeking Part-Time: 649
Seeking Full-Time: 1691
Seeking Senior/Managerial: 368
Seeking DS, ML, AI: 2037
Seeking Quant Research/Finance: 1322
Seeking Software Engineering: 746
Seeking Quantum Computing: 530
Seeking UX Research: 395
Seeking Prof/Sci Writing: 551
Geographic Preferences
The map below shows participants who expressed specific geographic preferences.
An additional 391 are open to working anywhere in the US.
200+ are open to work in Canada and abroad.
PARTICIPANT PROJECTS
Examples of projects from prior cohorts
MAY-SUMMER 2024
TEAM
Deep Learning Boot Camp
Wunderpus Octopus (New Atlantis)
Ingrida Semenec, Kshitiz Parihar, Nadir Hajouji, Saswat Mishra, Deniz Olgu Devecioglu
Modeling the relationship between biogeochemical layers and chlorophyll density
The distribution and density of chlorophyll in the ocean are critical indicators of marine primary productivity, which influences the global carbon cycle, marine food webs, and climate regulation. Biogeochemical and physical ocean properties, including nutrient availability, light penetration, water temperature, salinity, and ocean currents, influence chlorophyll density. Understanding and accurately modeling these relationships is essential for predicting the impacts of environmental changes on marine ecosystems and for managing oceanic resources effectively. We plan to combine multiple Copernicus Marine Datasets to model chlorophyll density based on the biogeochemical and physical properties of the ocean.
MAY-SUMMER 2024
TEAM
Deep Learning Boot Camp
RivusVox Editor
Zachary Bezemek, Francesca Balestrieri
RivusVox Editor: the world's first near-live zero-shot adaptive speech editing system
MAY-SUMMER 2024
TEAM
Deep Learning Boot Camp
A Vocal-Cue Interpreter for Minimally-Verbal Individuals
Julian Rosen, Alessandro Malusà, Rahul Krishna, Atharva Patil, Monalisa Dutta, Sarasi Jayasekara
The ReCANVo dataset consists of ~7k audio recordings of vocalizations from 8 minimally-verbal individuals (mostly people with developmental disabilities). The recordings were made in a real-world setting, and were categorized on the spot by the speaker's caregiver based on context, non-verbal cues, and familiarity with the speaker. There are several pre-defined categories such as selftalk, frustrated, delighted, request, etc., and caregivers could also specify custom categories. Our goal was to train a model, per individual, that accurately predicts labels and improves upon previous work.
We train several different combinations of models of the form “Feature Extractor + Classifier”. For extracting features from audio data, we use two deep models (HuBERT and AST), each with pre-trained weights, as well as mel spectrograms. As classifiers, we use a 4-layer CNN-based neural network (for mel spectrograms), NNs with fully-connected layers (for features coming from deep models), and more.
MAY-SUMMER 2024
TEAM
Deep Learning Boot Camp
“Good composers borrow, Great ones steal!”
Emelie Curl, Tong Shan, Glenn Young, Larsen Linov, Reginald Bain
Throughout history, composers and musicians have borrowed musical elements like chord progressions, rhythms, lyrics, and melodies from each other. Our motivation for this project is born of a fascination with this phenomenon, which of course extends to less legal examples like unconsciously or intentionally copying the work of another. Even famed and highly regarded composers like Bach, Vivaldi, Mozart, and Haydn are not innocent of borrowing from their contemporaries or even recycling their own works. Similarly, in 2015, in a high-profile court case, artists Robin Thicke and Pharrell Williams were ordered to pay millions of dollars in damages to Marvin Gaye's estate for copyright infringement, because they had borrowed from Gaye's "Got to Give It Up" when writing their hit "Blurred Lines." Our project aimed to use deep learning to assess the similarity between musical clips to potentially establish a more robust and empirical way to detect music plagiarism.
MAY-SUMMER 2024
TEAM
Deep Learning Boot Camp
Taxi Demand Forecasting
Ngoc Nguyen, Li Meng, Sriram Raghunath, Nazanin Komeilizadeh, Noah Gillespie, Edward Ramirez
Where to find customers is the most important question for taxi drivers and ride-hailing networks. If demand for taxis can be reliably predicted in real time, taxi companies can dispatch drivers in a timely manner and drivers can optimize their route decisions to maximize their earnings in a given day. Consequently, customers will likely receive more reliable service with shorter wait times. This project aims to use rich trip-level data from the NYC Taxi and Limousine Commission to construct time series of taxi rides for 63 taxi zones in Manhattan and forecast demand for rides. We will explore deep learning models for time series, including Multilayer Perceptrons, LSTMs, and Temporal Graph-based Neural Networks, and compare them with a baseline statistical model, ARIMAX.
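As a rough illustration of the first step described above, the sketch below aggregates trip-level records into hourly pickup counts per zone and then produces a seasonal-naive forecast as a simple reference point. The record layout (pickup timestamp, zone ID), the function names, and the toy data are assumptions made for illustration, not the team's pipeline. The seasonal-naive rule is a stand-in and is not the ARIMAX baseline named above.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hourly_counts(trips, start, hours):
    """Aggregate (pickup_time, zone_id) records into per-zone hourly counts.

    Returns {zone_id: [count_hour_0, count_hour_1, ...]} covering `hours`
    consecutive hours beginning at `start`. Hours with no trips count as 0.
    """
    series = defaultdict(lambda: [0] * hours)
    for pickup_time, zone_id in trips:
        index = int((pickup_time - start).total_seconds() // 3600)
        if 0 <= index < hours:  # ignore trips outside the requested window
            series[zone_id][index] += 1
    return dict(series)


def seasonal_naive(history, season=24):
    """Forecast the next value as the value observed one season earlier."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    return history[-season]


# Hypothetical toy data: the zone IDs and timestamps are made up.
start = datetime(2024, 1, 1, 0, 0)
trips = [
    (start + timedelta(minutes=5), 161),
    (start + timedelta(minutes=50), 161),
    (start + timedelta(hours=1, minutes=10), 161),
    (start + timedelta(hours=1, minutes=20), 237),
]
counts = hourly_counts(trips, start, hours=24)
next_hour_forecast = seasonal_naive(counts[161], season=24)
```

Any learned model would need to beat a reference point of this kind on held-out hours before its extra complexity is justified.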
MAY-SUMMER 2024
TEAM
Deep Learning Boot Camp
arXiv Chatbot
Xiaoyu Wang, Ketan Sand, Guoqing Zhang, Tajudeen Mamadou Yacoubou, Tantrik Mukerji
arXiv is the largest open database available, containing nearly 2.4 million research papers across 8 major domains that cover everything from the tiniest of atoms to the entire cosmos. A large language model (LLM) with access to such a dataset can generate up-to-date, relevant, and, more importantly, precise information with citable sources.
This is exactly what we have done in this project. We have refined the capabilities of Google’s Gemini 1.5 Pro LLM by building a customized Retrieval-Augmented Generation (RAG) pipeline that has access to the entire arXiv database. We then deployed the entire package into an app that mimics a chatbot to make the experience user-friendly.
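To make the retrieve-then-generate idea concrete, here is a minimal sketch of the retrieval and prompt-assembly steps of a RAG pipeline. It assumes a toy bag-of-words cosine similarity in place of a real embedding model, an invented three-document corpus in place of arXiv, and hypothetical helper names (`retrieve`, `build_prompt`). The call to the LLM itself is deliberately left out, and none of this reflects the team's actual implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents whose abstracts are most similar to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d["abstract"]))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


def build_prompt(query, passages):
    """Assemble a prompt that asks the model to answer from cited passages."""
    context = "\n".join(f"[{p['id']}] {p['abstract']}" for p in passages)
    return (
        "Answer using only the sources below and cite them by ID.\n"
        f"{context}\n\nQuestion: {query}"
    )


# Hypothetical mini-corpus; the IDs and abstracts are invented placeholders.
corpus = [
    {"id": "doc-1", "abstract": "Fast radio bursts detected with a radio telescope"},
    {"id": "doc-2", "abstract": "Graph neural networks for molecule property prediction"},
    {"id": "doc-3", "abstract": "Repeating fast radio bursts and their host galaxies"},
]
question = "What are fast radio bursts?"
top = retrieve(question, corpus, k=2)
prompt = build_prompt(question, top)
```

The resulting `prompt` string would then be sent to the language model, so the answer can cite the retrieved document IDs rather than rely on the model's memory alone.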
MAY-SUMMER 2024
TEAM
Data Science Boot Camp
Continuous Glucose Monitoring
Daniel Visscher, Margaret Swerdloff, Noah Gillespie, S. C. Park, Oladimeji Olaluwoye
The idea of the project is to predict high glucose spikes from continuous glucose data, smartwatch data, food logs, and glycemic index. The dataset consists of the following:
1) Tri-axial accelerometer data (subject movement)
2) Blood volume pulse
3) Interstitial glucose concentration
4) Electrodermal activity
5) Heart rate
6) IBI (interbeat interval)
7) Skin temperature
8) Food log
The data is publicly available at: https://physionet.org/content/big-ideas-glycemic-wearable/1.1.2/#files-panel
SPRING 2024
TEAM
Data Science Boot Camp
Aware NLP Project III
Mohammad Nooranidoost, Baian Liu, Craig Franze, Mustafa Anıl Tokmak, Himanshu Raj, Peter Williams
This project investigates and evaluates different retrieval methodologies for use in Retrieval-Augmented Generation (RAG) systems. In particular, it examines retrieval quality for information downloaded from employee subreddits. We compared advanced retrieval methodologies that use clustering, multi-vector indexing, and multi-querying against a baseline of naive retrieval.
FALL 2023
TEAM
Data Science Boot Camp
Groundwater Forecasting
Riti Bahl, Meredith Sargent, Marcos Ortiz, Chelsea Gary, Anireju Dudun
Groundwater is a critical source of water for human survival. A significant percentage of both drinking and crop irrigation water is drawn from groundwater sources through wells. In the US, overuse of groundwater could have major implications for the future, and forecasting groundwater levels can be useful in understanding its impact. Building on historical data for four wells, together with surface water and weather data, in Spokane, WA, we construct and evaluate machine learning models that forecast groundwater levels in the area.
FALL 2023
TEAM
Data Science Boot Camp
Funk
Aydin Ozbek, Dane Miyata, Kristina Knowles, Mario Gomez, Kashish Mehta
Most existing music recommendation systems rely on listeners to provide seed tracks, and then utilize a variety of different approaches to recommend additional tracks in either a playlist-like listening session or as sequential track recommendations based on user feedback.
We built a playlist recommendation engine that takes a different approach, allowing listeners to generate a novel playlist based on a semantic string, such as the title of the desired playlist, a specific mood (happy, relaxed), an atmosphere (tropical vibe), or a function (party music, focus). Using a publicly available dataset of existing playlists (https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge), we combine a semantic similarity vector model with a matrix factorization model to allow users to quickly and easily generate playlists to fit any occasion.
FALL 2023
TEAM
Data Science Boot Camp
The Silent Emergency - Predicting Preterm Birth
Katherine Grillaert, Divya Joshi, Alexander Sutherland, Kristina Zvolanek, Noah Rahman
Preterm birth is a primary cause of infant mortality and morbidity in the United States, affecting approximately 1 in 10 births. The rates are notably higher among Black women (14.6%), compared to White (9.4%) and Hispanic women (10.1%). Despite its prevalence, predicting preterm birth remains challenging due to its multifaceted etiology rooted in environmental, biological, genetic, and behavioral interactions. Our project harnesses machine learning techniques to predict preterm birth using electronic health records. This data intersects with social determinants of health, reflecting some of the interactions contributing to preterm birth. Recognizing that under-representation in healthcare research perpetuates racial and ethnic health disparities, we take care to use diverse data to ensure equitable model performance across underrepresented populations.
FALL 2023
TEAM
Data Science Boot Camp
DDTs: Dementia Detection Tool
Himanshu Khanchandani, Clark Butler, Cisil Karaguzel, Selman Ipek, Shreya Shukla
Alzheimer’s disease (AD) is one of the most common types of dementia and frequently affects the elderly. Electroencephalography (EEG) is a non-invasive technique that measures brain activity using external electrodes and may help provide improved diagnosis of AD. In this project we use the power spectrum of EEG to build a robust machine learning classifier that predicts whether a patient has Alzheimer's or is healthy. We vastly improve upon existing models in the literature by using features modified from those used in prior work.
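As an illustration of what power-spectrum features can look like, the sketch below computes the power spectrum of a signal with a plain discrete Fourier transform and summarizes it as relative power in conventional EEG frequency bands. The band edges, the synthetic 10 Hz test signal, and the function names are assumptions for illustration. The description above does not specify the team's features, so this should not be read as their feature set.

```python
import cmath
import math


def power_spectrum(signal, sample_rate):
    """Return (frequencies, power) for the non-negative DFT bins of a signal."""
    n = len(signal)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        freqs.append(k * sample_rate / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


def relative_band_power(freqs, power, bands):
    """Fraction of total power falling in each [low, high) frequency band."""
    total = sum(power)
    return {
        name: sum(p for f, p in zip(freqs, power) if low <= f < high) / total
        for name, (low, high) in bands.items()
    }


# Conventional EEG band edges in Hz; exact edges vary between studies.
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

# Synthetic one-second "recording": a 10 Hz sinusoid sampled at 128 Hz.
sample_rate = 128
signal = [math.sin(2 * math.pi * 10 * t / sample_rate) for t in range(sample_rate)]
freqs, power = power_spectrum(signal, sample_rate)
features = relative_band_power(freqs, power, BANDS)
```

For the synthetic signal, nearly all power lands in the alpha band. With real recordings, one such feature vector per channel could be passed to a classifier, and a fast Fourier transform routine would replace the quadratic-time loop used here for clarity.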
SPRING 2023
TEAM
Data Science Boot Camp
Correcting Racial Bias in Measurement of Blood Oxygen Saturation
Rohan Myers, Saad Khalid, Woojeong Kim, Brooks Miner, Jaychandran Padayasi
Fingertip pulse oximeters are the current standard for estimating blood oxygen saturation without a blood draw, both at home and in healthcare settings. However, pulse oximeters overestimate oxygen saturation, often resulting in ‘hidden hypoxemia’: a patient has hypoxemia (dangerously low oxygen saturation), but the oximeter returns a healthy oxygen value. Unfortunately, oximeter overestimation of oxygen saturation is exacerbated for patients with darker skin tones due to light-based oximeter technology. This results in Black patients experiencing hidden hypoxemia at twice the rate of white patients. By combining pulse oximeter readings (SpO2) with additional patient data, we develop improved methods for estimating arterial blood oxygen saturation (SaO2) and identifying hidden hypoxemia. The predictions of our models are more accurate than pulse-oximeter readings alone, and remove the systematic racial inequity inherent in the current medical practice of using oximeter readings alone.
SPRING 2021
TEAM
Data Science Boot Camp
MaizeFinder
Tuguldur Sukhbold, Michael Darcy, Pol Arranz-Gibert, AJ Adejare
Our goal is to accurately predict maize field centers in Africa using very low resolution satellite images. The dataset contains many disparate entries, including two sets of satellite imagery, one with higher spatial resolution (Planet) and one with higher temporal resolution and wavelength coverage (Sentinel-2), as well as partial metadata about the crop fields, including estimated yield, size, and subjective quality of the measurement. We are employing CNN-based image segmentation models to compute the displacement vector from the image center.
SPRING 2021
TEAM
Data Science Boot Camp
NLPs
Frank Hidalgo, Joseph Szabo, Christopher Zhang, Sean Perez, Kun Jin
Acronym/abbreviation (short form) disambiguation is one of the main challenges when using NLP methods to understand medical records. While this topic has long been studied, it is still a work in progress. Current strategies often involve manually curating datasets of abbreviations and training classifiers. The main problem with that approach is that curated datasets are sparse and don't include all the short forms. In Dec 2020, a paper was published whose authors created a large dataset of short forms as one step in their pipeline to pre-train models. The goal of our project is to build upon their short form disambiguation piece and create a tool to disambiguate a medical short form using its context.
Example of the usage of our tool:
original_sentence = "The patient states that she has had dizziness, nausea, some heartburn, and some change in her vision. She is gravida 6, para 4, AB 2. She has no history of adverse reaction to anesthesia."
AB could stand for "abortion", "ankle-brachial", "blood group in ABO system", or "A, B lines in Kerley lines".
disambiguated_sentence = "The patient states that she has had dizziness, nausea, some heartburn, and some change in her vision. She is gravida 6, para 4, abortion 2. She has no history of adverse reaction to anesthesia."
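The idea of using context to pick an expansion can be sketched with a deliberately simple heuristic: choose the candidate whose cue words overlap most with the words of the sentence. The cue-word lists and the `disambiguate` function below are hypothetical and invented for this illustration. The project described above builds on a pre-trained model pipeline, which this toy overlap rule does not attempt to reproduce.

```python
import re

# Hypothetical cue words per expansion, invented for this illustration.
CANDIDATES = {
    "AB": {
        "abortion": {"gravida", "para", "pregnancy"},
        "ankle-brachial": {"index", "artery", "pressure"},
        "blood group in ABO system": {"blood", "type", "transfusion"},
    }
}


def disambiguate(sentence, short_form, candidates=CANDIDATES):
    """Replace a short form with the expansion whose cue words best match the context."""
    context = set(re.findall(r"[a-z]+", sentence.lower()))
    options = candidates[short_form]
    # Score each expansion by how many of its cue words occur in the sentence.
    best = max(options, key=lambda expansion: len(options[expansion] & context))
    return re.sub(rf"\b{re.escape(short_form)}\b", best, sentence)


original_sentence = (
    "The patient states that she has had dizziness, nausea, some heartburn, "
    "and some change in her vision. She is gravida 6, para 4, AB 2. "
    "She has no history of adverse reaction to anesthesia."
)
disambiguated_sentence = disambiguate(original_sentence, "AB")
```

Here the words "gravida" and "para" in the sentence select "abortion". A learned model replaces the hand-written cue lists with representations of context learned from data, which is what lets it cover short forms that no curated list includes.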
FALL 2022
TEAM
Data Science Boot Camp
Lime
Yuchen Luo, Ritika Khurana, Aditya Chander, Taylor Mahler
We built a podcast recommendation engine that suggests episodes to a listener based on either a previous episode that they've heard or an episode description that they can input with freeform text entry.
SPRING 2022
TEAM
Data Science Boot Camp
Erdio
Matthew Frick, Paul Jreidini, Matthew Heffernan
Timely identification of safety-critical events, such as gunshots, is of great importance to public safety stakeholders. However, existing systems deliver only limited value because they do not classify additional urban sounds. We classify environmental sounds to detect safety-critical events, in particular gunshots, and provide information on first response via siren detection. We also engineer general features for off-line classification tasks and demonstrate how this system can provide value to additional stakeholders in the film and television industry.
SPRING 2022
TEAM
Data Science Boot Camp
SKYLAB
Chenyi Gu, Briana Stanfield, Dylan Bates, Kanishk Jain
The NHL Stanley Cup is the oldest existing trophy to be awarded to a professional sports franchise in North America, and often considered “the hardest trophy to win in professional sport.” Using just regular season data, we want to know, can we predict who is going to win the Stanley Cup?
We collected data from each team, as well as data from every player in over 20,000 games going back to 2005. Using this data, we built an ensemble model from logistic regression, AdaBoost, random forests, and a neural network, which predicted playoff data with up to 70% accuracy, above the theoretical threshold of 62% reported in the literature.
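One common way to combine several classifiers into an ensemble is soft voting, where each model's predicted probability is averaged before a decision is made. The sketch below shows that combination step only. The description above does not say how the four models were combined, so soft voting is an assumption here, and the probabilities and outcomes are invented toy numbers rather than project results.

```python
def soft_vote(probabilities, weights=None):
    """Average per-model win probabilities for one game into a single probability."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, probabilities)) / total


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual outcomes."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical per-model probabilities that the home team wins each of four games.
model_outputs = {
    "logistic_regression": [0.60, 0.40, 0.70, 0.45],
    "adaboost": [0.55, 0.35, 0.65, 0.60],
    "random_forest": [0.70, 0.45, 0.40, 0.55],
    "neural_network": [0.65, 0.30, 0.60, 0.35],
}

# Regroup the outputs so each tuple holds all four models' opinions on one game.
games = list(zip(*model_outputs.values()))
ensemble = [soft_vote(game) for game in games]
predictions = [p >= 0.5 for p in ensemble]

# Invented outcomes, used only to show how accuracy would be scored.
toy_outcomes = [True, False, True, True]
toy_accuracy = accuracy(predictions, toy_outcomes)
```

Passing `weights` lets stronger models count for more, for example by weighting each model by its validation accuracy.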