Project Database
View Team Project Submissions for various cohorts and programs below:
65 results were found.
MAY-SUMMER 2024
TEAM 22
Wunderpus Octopus (New Atlantis)
Deep Learning Boot Camp
Ingrida Semenec, Kshitiz Parihar, Nadir Hajouji, Saswat Mishra, Deniz Olgu Devecioglu
Modeling the relationship between biogeochemical layers and chlorophyll density
The distribution and density of chlorophyll in the ocean are critical indicators of marine primary productivity, which influences the global carbon cycle, marine food webs, and climate regulation. Biogeochemical and physical ocean properties, including nutrient availability, light penetration, water temperature, salinity, and ocean currents, influence chlorophyll density. Understanding and accurately modeling these relationships is essential for predicting the impacts of environmental changes on marine ecosystems and for managing oceanic resources effectively. We plan to combine multiple Copernicus Marine datasets to model chlorophyll density based on the biogeochemical and physical properties of the ocean.
MAY-SUMMER 2024
TEAM 5
A Vocal-Cue Interpreter for Minimally-Verbal Individuals
Deep Learning Boot Camp
Julian Rosen, Alessandro Malusà, Rahul Krishna, Atharva Patil, Monalisa Dutta, Sarasi Jayasekara
The ReCANVo dataset consists of ~7k audio recordings of vocalizations from 8 minimally-verbal individuals (mostly people with developmental disabilities). The recordings were made in a real-world setting, and were categorized on the spot by the speaker's caregiver based on context, non-verbal cues, and familiarity with the speaker. There are several pre-defined categories such as selftalk, frustrated, delighted, request, etc., and caregivers could also specify custom categories. Our goal was to train a model, per individual, that accurately predicts labels and improves upon previous work.
We train several combinations of models of the form “Feature Extractor + Classifier”. To extract features from the audio data, we use two deep models (HuBERT and AST), each with pre-trained weights, as well as mel spectrograms. As classifiers, we use a 4-layer CNN-based neural network (for mel spectrograms), NNs with fully-connected layers (for features coming from the deep models), and more.
MAY-SUMMER 2024
TEAM 20
“Good composers borrow, Great ones steal!”
Deep Learning Boot Camp
Emelie Curl, Tong Shan, Glenn Young, Larsen Linov, Reginald Bain
Throughout history, composers and musicians have borrowed musical elements like chord progressions, rhythms, lyrics, and melodies from each other. Our motivation for this project is born of a fascination with this phenomenon, which of course extends to less legal examples like unconsciously or intentionally copying the work of another. Even famed and highly regarded composers like Bach, Vivaldi, Mozart, and Haydn are not innocent of borrowing from their contemporaries or even recycling their own works. Similarly, in 2015, in a high-profile court case, defendants and artists Robin Thicke and Pharrell Williams were ordered to pay millions of dollars in damages for copyright infringement to Marvin Gaye's estate, considering they borrowed from Gaye’s "Got to Give it Up" when writing their hit "Blurred Lines." Our project aimed to use deep learning to assess the similarity between musical clips to potentially establish a more robust and empirical way to detect music plagiarism.
MAY-SUMMER 2024
TEAM 3
Taxi Demand Forecasting
Deep Learning Boot Camp
Ngoc Nguyen, Li Meng, Sriram Raghunath, Nazanin Komeilizadeh, Noah Gillespie, Edward Ramirez
Where to find customers is the most important question for taxi drivers and ride-hailing networks. If demand for taxis can be reliably predicted in real time, taxi companies can dispatch drivers in a timely manner and drivers can optimize their routing decisions to maximize their earnings in a given day. Consequently, customers will likely receive more reliable service with shorter wait times. This project aims to use rich trip-level data from the NYC Taxi and Limousine Commission to construct time series of taxi rides for 63 taxi zones in Manhattan and to forecast demand for rides. We will explore deep learning models for time series, including Multilayer Perceptrons, LSTM, and Temporal Graph-based Neural Networks, and compare them with a baseline statistical model, ARIMAX.
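The step of turning trip-level records into per-zone time series can be sketched as below. This is a minimal illustration only: the record layout (an ISO pickup timestamp plus a zone ID), the hourly bucket size, and the sample values are assumptions, not the team's actual pipeline or real TLC data.

```python
from collections import Counter
from datetime import datetime


def hourly_zone_counts(trips):
    """Count pickups per (zone_id, hour) from (pickup_time_iso, zone_id) records."""
    counts = Counter()
    for pickup_time, zone_id in trips:
        # Truncate the timestamp to the start of its hour (assumed bucket size).
        hour = datetime.fromisoformat(pickup_time).replace(
            minute=0, second=0, microsecond=0
        )
        counts[(zone_id, hour)] += 1
    return counts


# Hypothetical sample records for illustration; not real trip data.
sample_trips = [
    ("2024-01-01T08:05:00", 161),
    ("2024-01-01T08:40:00", 161),
    ("2024-01-01T09:10:00", 161),
    ("2024-01-01T08:15:00", 237),
]
counts = hourly_zone_counts(sample_trips)
```

Each (zone, hour) count would then form one point of the demand series that a forecasting model is trained on.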
MAY-SUMMER 2024
TEAM 14
arXiv Chatbot
Deep Learning Boot Camp
Xiaoyu Wang, Ketan Sand, Guoqing Zhang, Tajudeen Mamadou Yacoubou, Tantrik Mukerji
arXiv is the largest open database available, containing nearly 2.4 million research papers spanning 8 major domains and covering everything from the tiniest of atoms to the entire cosmos. A large language model (LLM) with access to such a dataset could generate up-to-date, relevant, and, more importantly, precise information with citable sources.
This is exactly what we have done in this project. We have refined the capabilities of Google’s Gemini 1.5 Pro LLM by building a customized Retrieval-Augmented Generation (RAG) pipeline that has access to the entire arXiv database. We then deployed the entire package as a chatbot-style app to make the experience user-friendly.
MAY-SUMMER 2024
TEAM 7
Geo-locator
Data Science Boot Camp
Aashraya Jha, Dante Bonolis, Zachary Bezemek, Leonhard Hochfilzer, Francesca Balestrieri
In the popular online game GeoGuessr, the player is shown a random image from Google Street View and is tasked with guessing its location on the globe as accurately as possible. In this project, we seek to solve a simplified version of this problem using a strategy often used by professional GeoGuessr players: using man-made features (for example, traffic lights) to accurately guess a city.
We use the publicly available GSV-Cities Dataset, which consists of around 500k street-view images taken in 23 different cities. We then build our model from a CNN trained on the images and on features extracted from the images. The backbone of this CNN is a pre-trained model named MobileNetV2.
MAY-SUMMER 2024
TEAM 12
Jimmy's and Joes vs X's and O's: Predicting results in college sports analyzing talent accumulation and on-field success
Data Science Boot Camp
Reginald Bain, Tung Nguyen, Reid Harris
Recent legislation has changed the landscape of college sports, a multi-billion dollar enterprise with deep roots in American sports culture. With the recent legalization of sports betting in many states and the SCOTUS O’Bannon ruling that allows athletes to be paid through so-called “Name-Image-Likeness (NIL)” deals, evaluating talent and projecting results in college sports is an increasingly interesting problem. By considering both talent accumulation and recent on-field results, our models aim to predict relevant results for sports betting/team construction. In this iteration of the project, our targets are regular season win percentage (using a season level model that we’ll call Model 1) and individual game results (with a game by game model we’ll call Model 2) in the regular season. Our datasets come from a variety of sources including On3, ESPN, 24/7 Sports, The College Football Database, and SportsReference.com.
MAY-SUMMER 2024
TEAM 19
MoonBoard Grade Classification
Data Science Boot Camp
Gautam Prakriya, Adrian Batista Planas, Larsen Linov, Prabhjot Singh
A MoonBoard is a standardized rock climbing wall - fixed holds on a wall of fixed dimensions. This climbing wall comes with an app that generates routes/problems for climbers to move up. Rock climbing routes are assigned subjective grades to represent difficulty, but given that the MoonBoard is widely used there is often a consensus around grades assigned to MoonBoard problems making them somewhat objective. The goal of the project will be to build a model that identifies the grade of a given route.
MAY-SUMMER 2024
TEAM 21
BirdCLEF
Data Science Boot Camp
Junichi Koganemaru, Robert Jeffs, Ashwin Tarikere Ashok Kumar Nag, Salil Singh
This project addresses the BirdCLEF 2024 research code competition hosted on Kaggle by the Cornell Lab of Ornithology. Participants are provided with a dataset containing labeled audio clips of bird calls recorded at various locations in the world. The competition seeks reliable machine learning models for automatic detection and classification of bird species from soundscapes recorded in the Western Ghats of India, a Global Biodiversity Hotspot. The broader goal of this endeavor is to leverage Passive Acoustic Monitoring (PAM) and machine learning techniques to enable conservationists to study bird biodiversity at much greater scales than is possible with observer-based surveys.
MAY-SUMMER 2024
TEAM 24
Company Discourse: How are people talking about my company online?
Data Science Boot Camp
Hannah Lloyd, Vinicius Ambrosi, Gilyoung Cheong, Dohoon Kim
In the age of digital communication, a wealth of information exists in the discourse surrounding companies and their products on social media platforms and online forums. This project utilizes natural language processing (NLP) and machine learning (ML) techniques to construct predictive models capable of assessing and rating comments provided by consumers. By employing these advanced analytical methods, we aim to enhance the correctness and effectiveness of sentiment analysis in understanding and forecasting consumer behavior. This approach is computationally efficient, while maintaining contextual integrity in the data and leveraging complex analytical techniques to gauge audience sentiment through online discourse.
MAY-SUMMER 2024
TEAM 25
AI-powered solutions for the restaurant industry
Data Science Boot Camp
Evaristo Villaseco, Davood Dar
In the US alone, restaurants waste 25 billion pounds of food every year before it reaches the consumer's plate, and independent restaurants are a large driver of this. This matters for an industry that operates on very low profit margins of 3% to 6% on average. In this project we have partnered with Burnt (https://burnt.squarespace.com), whose mission is to help restaurants automate their back-of-house operational flow: recipe management, inventory forecasting, and analysis and optimization of costs. We will use time series data from restaurants to forecast menu item sales based on factors such as day of the week, weather, and holidays, which will help to optimize ordering decisions for maximum efficiency.
MAY-SUMMER 2024
TEAM 26
Continuous Glucose Monitoring
Data Science Boot Camp
Daniel Visscher, Margaret Swerdloff, Noah Gillespie, S. C. Park, Oladimeji Olaluwoye
The idea of the project is to predict high glucose spikes from continuous glucose data, smartwatch data, food logs, and glycemic index. The dataset consists of the following:
1) Tri-axial accelerometer data (movement in subject)
2) Blood volume pulse
3) Interstitial glucose concentration
4) Electrodermal activity
5) Heart rate
6) IBI (interbeat interval)
7) Skin temperature
8) Food log
The data are publicly available at: https://physionet.org/content/big-ideas-glycemic-wearable/1.1.2/#files-panel
MAY-SUMMER 2024
TEAM 28
Chirp Checker
Data Science Boot Camp
Andrew Merwin, Caleb Fong, B Mede, Yang Yang, Robert Cass, Calvin Yost-Wolff
The nocturnal soundscapes of late summer and autumn are replete with the familiar chirps, trills, and buzzes of singing insects. But these cryptic performers often remain anonymous and underappreciated.
The goal of this project was to build machine learning models to identify the presence of insects in sound files and to coarsely categorize the sounds as crickets, katydids, or cicadas.
Both Support Vector Classifiers and Convolutional Neural Networks were able to assign insect songs to the broad categories of cricket, katydid, and cicada with 90% accuracy or higher.
In the future, similar, more sophisticated models could be applied to filtering large volumes of passively recorded audio from ecological studies of insects and could power apps that identify insect songs to the species level.
MAY-SUMMER 2024
TEAM 30
Headlines and Market Trends: A Sentiment Analysis Approach to Stock Prediction
Data Science Boot Camp
Jem Guhit, Sarasi Jayasekara, Nawaz Sultani, Timothy Alland, Ogonnaya Romanus, Kenneth Anderson
Financial markets are often affected by sentiment conveyed in news headlines. As major news events can drive significant fluctuations in stock prices, understanding these sentiment trends can provide important insights into market movements. This project aims to answer the question whether the sentiments extracted from financial news headlines can predict stock movements.
We use 5 years' worth of data extracted from Yahoo Finance and Stock News API, obtain sentiment scores using FinVader, and use several models (Logistic Regression, Gradient Boosted Trees, XGBoost, and LSTM) to predict whether the next day's stock price will rise or fall. We use a simulated stock portfolio to evaluate the effectiveness of the models.
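A simulated-portfolio evaluation of rise/fall predictions can be sketched as follows. The trading rule here (hold the stock on days predicted to rise, hold cash otherwise, no transaction costs) and the sample numbers are assumptions made for illustration; the team's actual simulation rules are not described above.

```python
def simulate_portfolio(predictions, daily_returns, start_value=1.0):
    """Compound returns only on days where the model predicted a rise (1)."""
    value = start_value
    for predicted_up, daily_return in zip(predictions, daily_returns):
        if predicted_up == 1:
            # Invested for the day: portfolio moves with the stock.
            value *= 1.0 + daily_return
        # Otherwise stay in cash: value unchanged (assumed zero interest).
    return value


# Hypothetical predictions and next-day returns, for illustration only.
preds = [1, 0, 1]
rets = [0.10, -0.50, -0.10]
final_value = simulate_portfolio(preds, rets)
buy_and_hold = simulate_portfolio([1, 1, 1], rets)
```

Comparing `final_value` against `buy_and_hold` gives a simple baseline check on whether following the predictions would have helped.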
MAY-SUMMER 2024
TEAM 40
Flavor Finder
Data Science Boot Camp
Zhihan Li, Xue Xiao, Daniel Colon Amill, Andres Martinez, William Porteous, Michael Shteyn
Flavor Finder is a chat client that generates query-specific menu-item recommendations using Retrieval-Augmented Generation (RAG) to comb thousands of Google reviews. This process results in a natural language dish recommendation which is responsive to a user's unique dietary needs and preferences.
SPRING 2024
TEAM 1
Counting Crossings - Team 2
Data Science Boot Camp
Jared Able
Roads and bridges are essential to civilian, commercial, and government transport across the world. One facet of roads that is often ill-captured by GPS navigation systems is that of road overpasses, and this can have severe consequences for drivers of large vehicles. We aim to rectify this lack by predicting the presence of road overpasses in a given satellite image. To do so, we apply a convolutional neural network to a dataset of satellite images that we labeled and assembled.
SPRING 2024
TEAM 8
Analysing Road Safety
Data Science Boot Camp
Pedro Lemos, Jesse Frohlich, Jacob Van Hook, Assaf Bar-Natan
The purpose of this project is to predict collision rates in New York City using public road data and other geological features, and to identify key road infrastructures that can improve safety as economically as possible.
SPRING 2024
TEAM 17
Activity Detection using Biosignals from Wearable Devices
Data Science Boot Camp
Tong Shan, Dushyanth Sirivolu, Fulya Tastan, Philip Barron, Larsen Linov, Ming Li
Our goal is to study the biosignal patterns of everyday activities such as walking, running, and lifting chairs, and to create machine learning models that recognize daily human activities from biosignals recorded by wearable devices. These signals include electrocardiography (ECG), electrodermal activity (EDA), photoplethysmography (PPG), electromyography (EMG), wrist temperature (TEMP), and chest and wrist actigraphy (ACC). The algorithms can be used to detect users' daily activities and to monitor users' health conditions.
SPRING 2024
TEAM 18
D&D Combat Length Predictions
Data Science Boot Camp
Alec Traaseth, Deewang Bhamidipati, David Rubinstein, Emre Akaturk, Jeremy Schwend
Dungeons and Dragons (D&D) is one of the most popular tabletop role-playing games, but there are few tools or datasets for answering questions about party optimization, resource management, combat difficulty, etc. Enter FIREBALL, a dataset that logs individual actions and character information from 25,000 unique combat sessions, providing a great deal of information to leverage for answering a variety of questions.
This group will be answering the question "How many rounds should a combat take?" This tool could help dungeon masters who are crafting encounters and want to ensure that their big perfect monster survives long enough to be a threat, or to prevent a large combat from becoming a boring slog.
SPRING 2024
TEAM 34
Aware NLP Project III
Data Science Boot Camp
Mohammad Nooranidoost, Baian Liu, Craig Franze, Mustafa Anıl Tokmak, Himanshu Raj, Peter Williams
This project involves the investigation and evaluation of different methodologies for retrieval for use in RAG (Retrieval-Augmented Generation) systems. In particular, this project investigates retrieval quality for information downloaded from employee subreddits. We investigated the impacts of using clustering, multi-vector indexing, and multi-querying in advanced retrieval methodologies against baseline naive retrieval.
SPRING 2024
TEAM 38
Recipe Recommender
Data Science Boot Camp
Nadir Hajouji, Felix Almendra Hernandez, Nathan Schley, Ali Arslanhan, Katherine Martin
We designed a recipe recommendation engine that suggests recipes based on a user query and a user's review history.
Our modeling focused mainly on trying to predict recipes that a user was likely to review. We tried some intuitive approaches, which didn't work as well as we expected, but using singular value decomposition we obtained models that did a surprisingly good job of predicting which reviews were left out of the training set.
We also created a user interface that allows a user to enter a freeform query, and that returns a list of recipes that not only match the query but also take the user's review history into account. We did this by combining our model (which quantified how well a recipe matches up with the user history) with a pretrained sentence transformer (which quantified how well a recipe matches the query).
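One simple way to combine a history-match score with a query-match score is a weighted sum, sketched below. The weighted-sum rule, the `weight` parameter, and the sample scores are assumptions for illustration; the description above does not say how the two scores were actually combined.

```python
def blend_scores(history_scores, query_scores, weight=0.5):
    """Rank recipes by a weighted sum of history-match and query-match scores."""
    # Only recipes scored by both components can be ranked.
    shared = history_scores.keys() & query_scores.keys()
    combined = {
        recipe: weight * history_scores[recipe]
        + (1.0 - weight) * query_scores[recipe]
        for recipe in shared
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical scores in [0, 1], for illustration only.
history = {"chili": 0.9, "salad": 0.2, "curry": 0.6}
query = {"chili": 0.3, "salad": 0.8, "curry": 0.7}
ranking = blend_scores(history, query)
history_only = blend_scores(history, query, weight=1.0)
```

Setting `weight` closer to 1 favors the user's review history, while values closer to 0 favor the freeform query.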
SPRING 2024
TEAM 40
Harmful Brain Activity Classification
Data Science Boot Camp
Jianing Yang, Souparna Purohit, Kshitiz Parihar, Evgeniya Lagoda
This project aims to use EEG recordings of critically ill patients provided by Harvard Medical School (https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification) to classify seizures and other harmful brain activities.
SPRING 2024
TEAM 41
Nuclear Localization Signal (NLS) Prediction - NLSeer
Data Science Boot Camp
Scott Auerbach, Ukamaka Nnyaba, Ming Zhang, Yingyi Guo, Hemaa Selvakumar, Cisil Karaguzel
The purpose of the project is to build a prediction tool that estimates the likelihood that a protein's sequence contains a nuclear localization signal, based on the significance of each amino acid. Nuclear localization signals (NLS) are segments of a protein sequence that direct it towards the nucleus; they have been implicated in human diseases and play an important role in many biological pathways. We employed datasets of whole protein sequences with and without nuclear localization signals and trained both classifiers and neural networks to predict whether or not a protein contains an NLS. Using a random forest classifier, we developed a web app through Flask that can predict whether or not a given protein is likely to have an NLS and, if so, also estimate the likelihood of each amino acid contributing to an NLS.
SPRING 2024
TEAM 1
Pawsitive Retrieval 1
Deep Learning Boot Camp
Marcos Ortiz, Kristina Knowles, Diptanil Roy, Karthik Prabhu Palimar, Sayantan Roy
This project aims to build a model to efficiently identify and rank relevant content from a large dataset of human-generated Reddit posts (5.5 million posts from 34 different subreddits), given an arbitrary user query. The key objectives were to retrieve highly relevant results for queries while keeping retrieval times under 1 second. The long-term application is to use this capability as part of a Retrieval-Augmented Generation (RAG) pipeline for Aware clients.
We focus on systematically varying parameters of our embedding model, as well as applying different filters (before retrieval) and rerankings (after retrieval) that leverage the relationships inherent in the structure of the data.
Using these strategies, we successfully improve the placement of relevant results retrieved according to several modified recommender system metrics. These metrics were implemented using a set of over 1000 human labeled query-result pairs establishing a set of known relevant results for 25 queries.
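Scoring retrieval against a set of human-labeled relevant results can be sketched with two standard metrics, precision at k and reciprocal rank. These are the textbook, unmodified versions and are an assumption here; the team's modified metrics are not specified above, and the post IDs are hypothetical.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are labeled relevant."""
    top_k = retrieved[:k]
    return sum(1 for item in top_k if item in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranked results and labeled relevant set for one query.
retrieved = ["p7", "p2", "p9", "p4"]
relevant = {"p2", "p4"}
p_at_2 = precision_at_k(retrieved, relevant, 2)
rr = reciprocal_rank(retrieved, relevant)
```

Averaging such per-query scores over all labeled queries gives a single number for comparing filters and rerankers.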
SPRING 2024
TEAM 6
Medical Image Classification
Deep Learning Boot Camp
Tristan Freiberg, Bailey Forster, Henri Antikainen
Melanoma can affect anyone and early detection is a crucial factor affecting survival rates. Machine learning models could assist trained healthcare professionals in screening for skin cancer. The Human Against Machine 10000 (HAM10000) dataset contains images of 7,470 distinct skin lesions, each belonging to one of seven mutually-exclusive classes of skin lesion, including melanoma, as well as other cancerous types and benign types such as nevi (moles). Our goal was to train a convolutional neural network (CNN) to accurately classify images of skin lesions using the HAM10000 dataset. The project includes a Streamlit app to test users’ classification ability against our fine-tuned models.
SPRING 2024
TEAM 3
Audiobots: Transformers in Disguise
Deep Learning Boot Camp
Dylan Bates, Soheil Anbouhi, Aycan Gamache, Johann Thiel, Paul VanKoughnett, Muhammed Cifci
Historically, songs have been categorized into genres not just for commercial purposes but also to enhance the listening experience and foster cultural exchange through music. Our primary goal was to compare the performance of traditional machine learning models with more advanced deep learning models like transformers, thereby evaluating the effectiveness of these newer neural network architectures for music genre classification.
We used two datasets, GTZAN and Free Music Archive, training a variety of models on the smaller dataset, and choosing the best performing models to train on the larger. Although it is easy enough to overfit on the training data, creating a model that generalizes well is difficult, especially in the case of imbalanced data.
We further tested our best models in a practical scenario by addressing a contemporary debate: whether Beyoncé’s new album Cowboy Carter is country.
FALL 2023
TEAM 45
Biomedical Categorization
Data Science Boot Camp
Shayne Plourde, Gary Hu, Michelle Lobb, Donna Chen
Diabetes is a major issue in the world, impacting 8.5% of adults and killing 1.5 million people in 2019 according to the World Health Organization. Diabetes is a chronic disease that affects how the body regulates blood glucose levels. Over time, having raised blood glucose levels may lead to serious damage to the nerves and blood vessels, leading to further complications.
The goal of this project is to better understand the relationship between lifestyle factors and diabetes and subsequently predict whether an individual has diabetes or not, based on a survey questionnaire.
FALL 2023
TEAM 37
BrewSavvy
Data Science Boot Camp
Timothy Alland, Brandon Butler, Phuc Nguyen, Aidan Lorenz
We built a beer recommender app that recommends beers to a user based on a list of beers that the user likes. The underlying model uses matrix factorization trained on a data set of ~1.5 million reviews with ~65,000 different beers and ~33,000 users.
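Matrix factorization on (user, item, rating) triples can be sketched in plain Python with stochastic gradient descent, as below. The hyperparameters, the SGD training loop, and the toy ratings are assumptions for illustration; the project's actual model and training procedure are not described above.

```python
import random


def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.05, reg=0.01, epochs=500, seed=0):
    """Learn user factors P and item factors Q so that P[u] . Q[i] ~ rating."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Predicted rating: dot product of user and item factor vectors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def mse(P, Q, ratings):
    """Mean squared error of predictions over the given ratings."""
    return sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings)


# Hypothetical toy ratings: (user index, beer index, rating).
toy_ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
P0, Q0 = factorize(toy_ratings, 3, 3, epochs=0)   # untrained factors
P, Q = factorize(toy_ratings, 3, 3)               # trained factors
```

Once trained, `predict(P, Q, u, i)` for an unrated pair gives the score used to rank candidate beers for that user.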
FALL 2023
TEAM 18
Will my flight be late?
Data Science Boot Camp
Simon Guichandut, Ketan Sand, Tim Hallatt
Flight delays are not only bothersome but also widespread, causing over 200,000 hours of combined delay annually in just 20 of the busiest airports in the United States. This results in a staggering $32.9 billion annual economic loss for the US. The ability to understand the contributing factors and predict delays is crucial for better preparation and minimizing the impact. To address this issue, we utilized 12 years of data from the Bureau of Transportation Statistics in the US (https://www.transtats.bts.gov/HomeDrillChart.asp). The dataset was refined to focus on flights between the top 20 busiest airports, operated by the top 8 airline carriers in the US. We employed a random forest model for training, predicting both the likelihood of delay and quantifying the delay duration. A user-friendly website (https://willmyflightbelate.streamlit.app/) was developed to enhance the overall experience.
FALL 2023
TEAM 33
Mu 'n I: Direction Detection
Data Science Boot Camp
Christopher Stith, Katja Vassilev, Benjamin Riley, Lukas Scheiwiller, Chinmaya Kausik
The goal of this project is to determine the direction of incoming neutrinos detected by the IceCube neutrino observatory, using data posted on Kaggle. The IceCube detector indirectly observes high-energy neutrinos from incoming cosmic radiation. IceCube wants to use data science to produce a direction estimate to feed into their software, which calculates the precise direction. We used several linear regression models, including some implemented in TensorFlow, before training a convolutional and fully connected NN in PyTorch. These networks were trained using features provided by IceCube and additional features used in the regression.
FALL 2023
TEAM 9
Meow-by-Meow
Data Science Boot Camp
Jinjing Yi, Zach Hafen-Saavedra, William Craig, Tantrik Mukerji, Brady Ali Medina
Cat vocalizations (“meows”) are typically directed at humans, rather than other cats. Cat meows therefore present an opportunity for computational audio analysis to improve relationships between cats and their owners. In our analysis we developed an interface for users to upload audio recordings and have them interpreted as “comfortable”, “uncomfortable”, or “hungry”. Our classification leverages machine learning models trained on preprocessed and augmented data from the CatMeows dataset.
FALL 2023
TEAM 27
Somm
Data Science Boot Camp
Ngoc Nguyen, Veronica Miatto, Pavel Kovalev, Palak Arora, Vishal Kumar
We built a wine recommendation engine using wine reviews. Choosing a wine can be overwhelming because of the sheer variety and lack of consistent categorization. This tool can help buyers navigate the many wine options and help sellers procure wines that fit consumer demand. Existing solutions (Vivino, Delectable) are restrictive, as they only allow users to search by wine name, grape type, or price. Our wine recommender fills the gap by letting users enter free-form queries to search for wines that fit their tastes.
FALL 2023
TEAM 20
AI-generated Image Detection
Data Science Boot Camp
Amanda Pan, Hasan Saad, Alina Al Beaini, Cemile Kurkoglu
AI-generated images have become increasingly realistic, prompting a variety of malicious uses. We plan to develop a model for detecting AI-generated images, ideally improving upon some of the current difficulties: generalization to different methods of image generation, robustness to image resizing and compression, and interpretability of results.
FALL 2023
TEAM 8
Climate Risk in Marginalized Communities
Data Science Boot Camp
Zoe Kearney, Bailey Forster, Viraj Meruliya, Braeden Reinoso, Reeya Kumbhojkar
The Environmental Protection Agency (EPA) monitors concentrations of air toxins in the US, including particulate matter smaller than 2.5 micrometers (PM2.5). These fine particles are able to enter the lungs and bloodstream, posing a significant health risk. There is also evidence that people of color in the US are at increased risk of adverse health effects due to climate change and pollution (see review article: Berberian et al. 2022). Our goal is to create a model that uses 2021 ACS 5-Year Estimates Data Profiles and EPA data to identify tracts likely to be at risk of high PM2.5 levels. This is intended as a screening tool to inform further research on the climate-related health risks and pollution sources that are affecting marginalized communities.
FALL 2023
TEAM 6
Groundwater Forecasting
Data Science Boot Camp
Riti Bahl, Meredith Sargent, Marcos Ortiz, Chelsea Gary, Anireju Dudun
Groundwater is a critical source of water for human survival. A significant percentage of both drinking and crop irrigation water is drawn from groundwater sources through wells. In the US, overuse of groundwater could have major implications for the future, and forecasting groundwater levels can be useful in understanding its impact. Building on historical data for four wells in Spokane, WA, together with surface water and weather data, we construct and evaluate machine learning models that forecast groundwater levels in the area.
FALL 2023
TEAM 7
Funk
Data Science Boot Camp
Aydin Ozbek, Dane Miyata, Kristina Knowles, Mario Gomez, Kashish Mehta
Most existing music recommendation systems rely on listeners to provide seed tracks, and then utilize a variety of different approaches to recommend additional tracks in either a playlist-like listening session or as sequential track recommendations based on user feedback.
We built a playlist recommendation engine that takes a different approach, allowing listeners to generate a novel playlist based on a semantic string, such as the title of the desired playlist, a specific mood (happy, relaxed), an atmosphere (tropical vibe), or a function (party music, focus). Using a publicly available dataset of existing playlists (https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge), we combine a semantic similarity vector model with a matrix factorization model to allow users to quickly and easily generate playlists to fit any occasion.
FALL 2023
TEAM 25
The Silent Emergency - Predicting Preterm Birth
Data Science Boot Camp
Katherine Grillaert, Divya Joshi, Alexander Sutherland, Kristina Zvolanek, Noah Rahman
Preterm birth is a primary cause of infant mortality and morbidity in the United States, affecting approximately 1 in 10 births. The rates are notably higher among Black women (14.6%), compared to White (9.4%) and Hispanic women (10.1%). Despite its prevalence, predicting preterm birth remains challenging due to its multifaceted etiology rooted in environmental, biological, genetic, and behavioral interactions. Our project harnesses machine learning techniques to predict preterm birth using electronic health records. This data intersects with social determinants of health, reflecting some of the interactions contributing to preterm birth. Recognizing that under-representation in healthcare research perpetuates racial and ethnic health disparities, we take care to use diverse data to ensure equitable model performance across underrepresented populations.
FALL 2023
TEAM 31
DDTs: Dementia Detection Tool
Data Science Boot Camp
Himanshu Khanchandani, Clark Butler, Cisil Karaguzel, Selman Ipek, Shreya Shukla
Alzheimer’s disease (AD) is one of the most common types of dementia and frequently affects the elderly. Electroencephalography (EEG) is a non-invasive technique that measures brain activity using external electrodes and may help improve the diagnosis of AD. In this project we use the power spectrum of EEG recordings to build a robust machine learning classifier that predicts whether a patient has Alzheimer's or is healthy. We vastly improve upon existing models by using features modified from those used in the literature.
FALL 2021
TEAM 13
Narwhal
Data Science Boot Camp
Alana Huszar, Sanal Shivaprasad, Yueqiao Wu, Yili Zhang
When a prior authorization (PA) form is submitted to insurance, it’s important to know if it will get approved. In this project, we try to use data analysis techniques to build a model that predicts, based on the information contained in the PA forms, whether a PA form will be approved or not.
FALL 2021
TEAM 7
Hedgehog
Data Science Boot Camp
Andrew McMillan, Shashank G. Markande, Jithin Madhusudanan Sreekala, Josh Tawabutr
With the increased popularity of zero-commission investing and trading apps such as Robinhood and the rise of retail traders, the influence of social media on financial markets has grown. Twitter was one of the most used social media platforms for this purpose. First, we used a machine learning classifier to predict the popularity of a tweet. Following that, a scenario was created where the stocks of S&P 500 companies that were mentioned in the popular tweets were bought and held over a certain period of time. The returns from each stock purchase were then used to train a machine learning model for obtaining recommendations to buy. Traders could potentially use the stock recommendations from our model to plan out investment strategies and build stock portfolios.
FALL 2021
TEAM 10
Koala
Data Science Boot Camp
David Wen, Preston Pozderac, Wendson Barbosa
Root Insurance Bidding Strategy Challenge: We want to propose a bidding strategy for online ad placements based on customer demographics to increase sales of our car insurance policies while minimizing cost and obtaining at least 400 policies sold per 10,000 customers.
SPRING 2023
TEAM 22
Gamer Gametime Habits
Data Science Boot Camp
Maria Tiongco, Aycan Katitas, Joshua Schroeder
Play Next in Steam: A Game Recommendation System! We built a video game recommendation system based on a matrix factorization algorithm and trained the model on a subset of real Steam Users and their playtimes in the games in their libraries. The model predicts a Steam User's potential playtime in a game they have not played before. With this prediction, the recommender can recommend games to a Steam User based on what games the model predicts the user will put the most hours into. This also provides helpful information to the Steam User by seeing which games can potentially give them the most playtime per game price.
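The matrix factorization idea described above can be illustrated with a minimal sketch. This is not the team's implementation: the function names, the toy data, the hyperparameters, and the choice of plain stochastic gradient descent are all assumptions made for illustration. Each user and each game gets a short factor vector, and the predicted playtime is their dot product.

```python
import random

def factorize(playtimes, n_users, n_games, rank=2, lr=0.01, reg=0.02,
              epochs=500, seed=0):
    """Fit user and game factors to (user, game, hours) triples with SGD.

    All hyperparameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_games)]
    for _ in range(epochs):
        for u, g, hours in playtimes:
            err = hours - predict(P, Q, u, g)
            for k in range(rank):
                pu, qg = P[u][k], Q[g][k]
                # Gradient step on squared error with L2 regularization.
                P[u][k] += lr * (err * qg - reg * pu)
                Q[g][k] += lr * (err * pu - reg * qg)
    return P, Q

def predict(P, Q, user, game):
    """Predicted playtime is the dot product of user and game factors."""
    return sum(pu * qg for pu, qg in zip(P[user], Q[game]))

def recommend(P, Q, user, owned, top_n=3):
    """Rank games the user does not own by predicted playtime."""
    candidates = [g for g in range(len(Q)) if g not in owned]
    return sorted(candidates, key=lambda g: predict(P, Q, user, g),
                  reverse=True)[:top_n]

def mse(P, Q, playtimes):
    """Mean squared error over the observed (user, game, hours) triples."""
    return sum((h - predict(P, Q, u, g)) ** 2 for u, g, h in playtimes) / len(playtimes)
```

Dividing each predicted playtime by the game's price before sorting would give the playtime-per-price ranking mentioned above.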
SPRING 2023
TEAM 28
Correcting Racial Bias in Measurement of Blood Oxygen Saturation
Data Science Boot Camp
Rohan Myers, Saad Khalid, Woojeong Kim, Brooks Miner, Jaychandran Padayasi
Fingertip pulse oximeters are the current standard for estimating blood oxygen saturation without a blood draw, both at home and in healthcare settings. However, pulse oximeters overestimate oxygen saturation, often resulting in ‘hidden hypoxemia’: a patient has hypoxemia (dangerously low oxygen saturation), but the oximeter returns a healthy oxygen value. Unfortunately, oximeter overestimation of oxygen saturation is exacerbated for patients with darker skin tones due to light-based oximeter technology. As a result, Black patients experience hidden hypoxemia at twice the rate of white patients. By combining pulse oximeter readings (SpO2) with additional patient data, we develop improved methods for estimating arterial blood oxygen saturation (SaO2) and identifying hidden hypoxemia. The predictions of our models are more accurate than pulse oximeter readings alone, and remove the systematic racial inequity inherent in the current medical practice of relying on oximeter readings alone.
SPRING 2023
TEAM 32
Species Density
Data Science Boot Camp
Kristen Scheckelhoff, Rohan Sarkar, Erika Ordog
We examined the effects of various human and environmental factors on mammal species densities. Some species are known to adapt and thrive in the face of expanding urbanization, but this increased human activity also brings competition for resources. Identifying key factors that influence species density in urban areas can lead to policy recommendations for more sustainable urban growth. We explored a variety of models and ultimately obtained the best results using random forest regression. This model also gave us insight into which of the human and environmental features in our dataset have the greatest influence on species density.
SPRING 2023
TEAM 37
Protein Function Prediction
Data Science Boot Camp
Eamon Byrne, Dustin Nguyen, Ness Mayker, Salma Abdelbaky
Predict the biological function of proteins based upon their amino acid sequence and other publicly available data.
This project would double as a team submission to the Fifth Critical Assessment of Functional Annotation (CAFA 5) competition - similar to the Critical Assessment of Structure Prediction (CASP 14) competition in which AlphaFold2 gained prominence (in 2021).
Here is the Kaggle competition (submissions are open for the next 3 months): https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/
(1st prize is $15,000... lol)
SPRING 2023
TEAM 39
Airbnb
Data Science Boot Camp
Zheyu Ni, Muhammad Reza Averly, Ricky Oropeza, Shubhrika Ahuja, Praveen Shahani
How to make money in LA with Airbnb? Being a host has never been easy: hosts need to provide a quality experience for guests while considering the profit margin. We aim to alleviate their burden by streamlining the decision-making process. We explore and analyze tens of thousands of listings in LA for insights. Then, we combine structural modeling and machine learning to customize pricing for new listings based on property location, features, ownership, and substitution between nearby listings to maximize the host’s profit. The structural model helps capture supply and demand dynamics, and machine learning helps capture the consumer consideration set (hotspot market). We also use machine learning for price prediction to make better use of rich features. Furthermore, we recommend areas with the best rate of return for potential hosts and suggest amenities that could make a listing more popular on Airbnb.
SPRING 2021
TEAM 7
MaizeFinder
Data Science Boot Camp
Tuguldur Sukhbold, Michael Darcy, Pol Arranz-Gibert, AJ Adejare
Our problem is to accurately predict maize field centers in Africa using very low resolution satellite images. The dataset contains many disparate entries, including two sources of satellite imagery, one with higher spatial resolution (Planet) and one with higher temporal resolution and wavelength coverage (Sentinel-2), as well as partial metadata about the crop fields, including estimated yield, size, and subjective quality of the measurement. We employ CNN-based image segmentation models to compute the displacement vector of the field center from the image center.
SPRING 2021
TEAM 10
NLPs
Data Science Boot Camp
Frank Hidalgo, Joseph Szabo, Christopher Zhang, Sean Perez, Kun Jin
Acronym/abbreviation (short form) disambiguation is one of the main challenges when using NLP methods to understand medical records. While this topic has long been studied, it is still a work in progress. Current strategies often involve manually curating datasets of abbreviations and training classifiers on them. The main problem with that approach is that curated datasets are sparse and don't include all short forms. In December 2020, a paper was published whose authors created a large dataset of short forms as one step in their pipeline for pre-training models. The goal of our project is to build upon their short form disambiguation piece and create a tool that disambiguates a medical short form using its context.
Example of the usage of our tool:
original_sentence = "The patient states that she has had dizziness, nausea, some heartburn, and some change in her vision. She is gravida 6, para 4, AB 2. She has no history of adverse reaction to anesthesia."
AB could stand for "abortion", "ankle-brachial", "blood group in ABO system", or "A, B lines in Kerley lines".
disambiguated_sentence = "The patient states that she has had dizziness, nausea, some heartburn, and some change in her vision. She is gravida 6, para 4, abortion 2. She has no history of adverse reaction to anesthesia."
SPRING 2021
TEAM 38
Amethyst
Data Science Boot Camp
Jimin Kim, Francisco Martinez, Noah Schoem, Ifeoma Ugwuanyi
We aim to extract from Qarik's PDFs of World Bank loans the following:
* Loan amounts
* Borrower country
* Loan purpose and targeted category/industry
and cross-reference these with region, income level, and other public data to identify historical trends in the World Bank's lending program.
SPRING 2021
TEAM 35
Ruby
Data Science Boot Camp
Rongqing Ye, Nakyung Lee, Rachel Domagalski, Hannah Pieper
ClassifyMyMeds: Predicting Prior Authorization Approval and Volume for CoverMyMeds
When a patient tries to get a prescription from a pharmacy, a claim is created against the patient's insurance (payer). Such a pharmacy claim might be rejected for various reasons and might require prior authorization (PA). A PA is a form that providers submit to the insurer on behalf of a patient, making a case for the prescribed therapy. In this project, we surveyed many classifiers for predicting how likely a given PA is to be approved, and forecast the future volume of PAs with time series analysis techniques. Additionally, we identify the formulary for each payer and predict the number of times certain drugs can be refilled.
FALL 2021
TEAM 3
Camel
Data Science Boot Camp
Elizabeth Campolongo, Ranthony Edmonds, Chaya Norton
Special teams play can significantly impact the outcome of a game in the National Football League (NFL). The rising use of advanced metrics and data analytics in American football can help NFL analysts and coaches better understand which features influence special teams play, an area where such analysis has been relatively limited to date. This project applied Topological Data Analysis (TDA) to develop a metric for quantifying special teams plays. In addition to the features provided by the NFL as part of the 2022 NFL Big Data Bowl Challenge, we engineered features such as the trajectory of the football and “kicker core distance,” a metric we designed to measure the pressure applied to a kicker, to understand their impact on play results.
FALL 2022
TEAM 90
RooKeys
Data Science Boot Camp
Anudeep Arora, Sam Landoulsi, Lalit Yadav
A delayed flight causes major financial losses to airline companies, airports, and travelers. For instance, a traveler incurs expenses for accommodation and food due to flight delays and/or missed connections, in addition to time lost and time away from home. This can decrease air travel demand from an airline's existing and potential customers, because travelers bank on efficient service quality and performance. The impact of delays can translate into a productivity slowdown, indirectly affecting the economy and stunting GDP. Keeping this in mind, our aim is to build a classifier model to answer the following question:
● Given a 4-hour horizon, will a flight be delayed by more than 15 minutes?
Our target audience is airline companies: according to the Federal Aviation Administration, flight delays cost a total of $22 billion per year.
FALL 2022
TEAM 91
Runarljod
Data Science Boot Camp
Balin Fleming, Rouzbeh Modarresi Yazdi, Akash Banerjee, Ethan Farber, Lauren Keyes, Abdullateef Shodunke
American Sign Language (ASL) is the first language of more than 250,000 people in the US and Canada. Despite the large population of people who use the language, automatic translation of the language is not yet widespread. This is in part due to challenges of obtaining high quality data for the images to be properly translated.
The goal of our project is to achieve high quality translation of ASL using publicly available data and convolutional neural networks to accurately classify images. In the future we hope to be able to recognize video capture of ASL.
We train with a large dataset of about 26,500 images. Our project has far-reaching uses in making global communication easier for multiple stakeholders.
FALL 2022
TEAM 78
Mahogany
Data Science Boot Camp
Olivia McAuley, Dylan Bates
Wildfires damage the environment, lives, and property, and cost the US billions of dollars each year.
The goal of our project is to predict where wildfires will spread, providing important information to stakeholders, and ultimately reducing these costs.
Stakeholders could use this information to optimally allocate resources and direct first-responders where to begin fire suppression and evacuation efforts.
FALL 2022
TEAM 75
Lime
Data Science Boot Camp
Yuchen Luo, Ritika Khurana, Aditya Chander, Taylor Mahler
We built a podcast recommendation engine that suggests episodes to a listener based on either a previous episode that they've heard or an episode description that they can input with freeform text entry.
FALL 2022
TEAM 71
Juniper
Data Science Boot Camp
Christopher Chia, Moeka Ono
Maps of forests allow us to know the locations of a variety of different tree cover types. However, forests change over time, and updating maps involves an expensive data collection process.
We answer two questions: Can we instead predict tree cover types just from geographical features?
And can we identify the most essential feature to prioritize when collecting data?
We answer these questions with machine learning algorithms and topological data analysis.
SPRING 2022
TEAM 11
Da Vinci
Data Science Boot Camp
Adam Kawash, Moeka Ono, Soumen Deb, Allison Londerée
The DaVinci Team of the Erdős Institute has used advances in computer vision technology to train a machine learning model that classifies bird species. We then applied this model in a prototype app, ChickID. In doing so our project addresses two primary goals:
1) Generate an algorithm that could take images of birds to identify the species.
2) Ensure our model could function even using amateur-level images with a high degree of accuracy, to ensure accessibility of identification.
Our product can be applied in both private and public settings to allow for fast and accurate identification.
SPRING 2022
TEAM 33
Supermassive Black Hole
Data Science Boot Camp
Anna Brosowsky, Sayantan Khan, Nancy Wang, Ethan Zell, Yili Zhang
We built a movie finder app that allows a user to enter some details they remember about a movie (along with some optional filter info on the genre and release year) and then predicts what movie the user is thinking of. To solve this NLP problem, our tool uses an embed-and-rerank model. We have precomputed vectorizations of movie plot information for the approximately 34,000 movies in our dataset.
Our model’s first step is to vectorize the user’s query and do a fast comparison to find the 100 closest plot vectors. Then it reranks these top 100 closest plots, performing a more thorough comparison using a neural network that semantically compares the plot fragments with the original query. Finally, we output the 10 movies which show up at the top of this new ranking.
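The two-stage flow described above (fast vector comparison, then a more thorough rerank of the shortlist) can be sketched as follows. This is only an illustration under stated assumptions: the bag-of-words `embed` function and the bigram-overlap `rerank_score` are simple stand-ins for the learned embeddings and the neural reranker the project describes, and all function names are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: bag-of-words counts (assumption, not a learned vector)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rerank_score(query, plot):
    """Stand-in for the neural reranker: share of query bigrams found in the plot."""
    q, p = query.lower().split(), plot.lower().split()
    q_bigrams, p_bigrams = set(zip(q, q[1:])), set(zip(p, p[1:]))
    return len(q_bigrams & p_bigrams) / len(q_bigrams) if q_bigrams else 0.0

def find_movies(query, plots, plot_vectors, n_candidates=100, n_results=10):
    """plots maps title -> plot text; plot_vectors maps title -> precomputed vector."""
    query_vector = embed(query)
    # Stage 1: cheap comparison against every precomputed plot vector.
    candidates = sorted(plots,
                        key=lambda t: cosine(query_vector, plot_vectors[t]),
                        reverse=True)[:n_candidates]
    # Stage 2: costlier comparison, applied only to the shortlist.
    return sorted(candidates,
                  key=lambda t: rerank_score(query, plots[t]),
                  reverse=True)[:n_results]
```

The design point is that the expensive comparison runs on at most `n_candidates` plots rather than on the whole collection, while the plot vectors are computed once ahead of time.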
SPRING 2022
TEAM 39
Starry Night
Data Science Boot Camp
Bryan Reynolds, Kai Wei, Xiaoyu Liu, Estefany Nunez, Xiaozhou Feng
Our project classifies the artist of a painting and applies image style transfer techniques using convolutional neural networks (CNNs). A dataset containing the works of Vincent van Gogh, Claude Monet, Leonardo da Vinci, Rembrandt, Pablo Picasso, and Salvador Dali was created and cleaned. Five CNN models were trained on the data, resulting in classification accuracy scores ranging from 83-88%. Next, ensemble learning techniques were used to apply a voter algorithm using all five CNN models. The best accuracy score was achieved using a majority voter, which increased the model’s accuracy to ~90%. The style transfer model was created using a software package based on CNN techniques and fine-tuned on one famous painting from each artist.
Two interactive web apps were developed, one for the artist classification model and another for the neural style transfer model:
https://huggingface.co/spaces/czkaiweb/StarryNight
https://huggingface.co/spaces/breynolds1247/StarryNight_StyleTransfer
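As an illustration of the majority voter mentioned above, here is a minimal sketch of hard voting over the label predictions of several classifiers. The function name and the tie-breaking rule are assumptions; the team's actual ensemble code is not shown in the source.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label predictions by hard majority vote.

    predictions: one list of predicted labels per model, all the same length.
    Ties go to the label that appears first among the models
    (an assumed tie-breaking rule).
    """
    ensemble = []
    for votes in zip(*predictions):
        # most_common lists equal counts in first-seen order (Python 3.7+).
        ensemble.append(Counter(votes).most_common(1)[0][0])
    return ensemble
```

For example, if three models predict `["van Gogh", "Monet"]`, `["van Gogh", "Picasso"]`, and `["Dali", "Monet"]` for two paintings, the ensemble returns `["van Gogh", "Monet"]`.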
SPRING 2022
TEAM 40
Erdio
Data Science Boot Camp
Matthew Frick, Paul Jreidini, Matthew Heffernan
Timely identification of safety-critical events, such as gunshots, is of great importance to public safety stakeholders. However, existing systems only deliver limited value by not classifying additional urban sounds. We perform classification of environmental sounds to detect safety-critical events, in particular gunshots, and provide information on first-response via siren detection. We also engineer general features for off-line classification tasks and demonstrate how this system can provide value to additional stakeholders in the film and television industry.
SPRING 2022
TEAM 41
SKYLAB
Data Science Boot Camp
Chenyi Gu, Briana Stanfield, Dylan Bates, Kanishk Jain
The NHL Stanley Cup is the oldest existing trophy to be awarded to a professional sports franchise in North America, and often considered “the hardest trophy to win in professional sport.” Using just regular season data, we want to know, can we predict who is going to win the Stanley Cup?
We collected data from each team, as well as data from every player in over 20,000 games going back to 2005. Using this data, we built an ensemble model combining logistic regression, AdaBoost, random forests, and a neural network, which was able to predict playoff outcomes with up to 70% accuracy, above the theoretical threshold of 62% reported in the literature.