Final Project Guidelines

  1. Predicting Fine Particulate Matter Pollution

    We introduce the data setting and provide a high-level overview of the most important data science methods for estimating the health risks of air pollution and climate change related exposures. Given the strong evidence that the presence of fine particulate matter in the atmosphere increases mortality and hospitalization rates, it is important that we can track and predict these pollution levels.

    One of the main challenges in this work is that the fine particulate matter detectors are sparsely located across the US, meaning that many areas do not have quality data. In this project, you will build a model to predict fine particulate matter pollution in the untracked regions of the US, given data from the tracked areas. This data includes meteorological readings, land use data, and satellite data.
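
    As a rough illustration of the prediction task described above, the following sketch fits a ridge regression on monitored sites and applies it to unmonitored ones. Everything here is an assumption made for illustration: the three feature columns (temperature, urban land fraction, a satellite aerosol reading), the synthetic data, and the helper names `fit_ridge` and `predict` are hypothetical and not part of the project materials. A real solution would likely need a richer model and proper spatial validation.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression on standardized features."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mu, sd = X.mean(axis=0), X.std(axis=0)
    Z = (X - mu) / sd                          # standardize each feature
    A = Z.T @ Z + lam * np.eye(Z.shape[1])     # regularized normal equations
    w = np.linalg.solve(A, Z.T @ (y - y.mean()))
    return {"mu": mu, "sd": sd, "w": w, "b": y.mean()}

def predict(model, X):
    """Apply a fitted model to new locations."""
    Z = (np.asarray(X, dtype=float) - model["mu"]) / model["sd"]
    return Z @ model["w"] + model["b"]

# Synthetic stand-in data (hypothetical): columns are temperature,
# urban land fraction, and a satellite aerosol reading.
rng = np.random.default_rng(0)
X_tracked = rng.uniform(0.0, 1.0, size=(200, 3))
y_tracked = 5.0 + 2.0 * X_tracked[:, 0] + 8.0 * X_tracked[:, 1] \
    + 20.0 * X_tracked[:, 2] + rng.normal(0.0, 0.5, size=200)
X_untracked = rng.uniform(0.0, 1.0, size=(50, 3))

model = fit_ridge(X_tracked, y_tracked, lam=1.0)
pm_pred = predict(model, X_untracked)          # estimates for unmonitored sites
```

    Standardizing the features before fitting keeps the penalty `lam` comparable across inputs measured in different units.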

    Francesca Dominici is the Clarence James Gamble Professor of Biostatistics, Population and Data Science at Harvard T.H. Chan School of Public Health and Director of the Data Science Initiative at Harvard University.

    Dr. Dominici is an international leader at the interface of statistics, data science, causal inference, and public health. She has shown stellar leadership in education and mentoring, and in the promotion of diversity and gender equality.

    Dr. Dominici’s team has provided the scientific community and policy makers with robust evidence on the adverse health effects of air pollution and climate change. Her studies have directly and routinely impacted air quality policy, leading to more stringent ambient air quality standards in the US.

  2. Room Labelling and Object Arrangement

    Floor plans play an invaluable role in the architectural world - they tell us the shape and size of a built environment, and also convey the architect's vision for the style and feel of the final product. Much of this work is currently done with 3D renderings, but in the past, a 2D floor plan was the standard. Many of these floor plans exist as essentially unstructured data, such as images.

    In this project, you will be working with a data set of approximately 1000 floor plans. While it is easy for a human to look at one of these images and understand the purpose of each room, this is much harder for a computer. Your goal is to build a system that identifies the different types of rooms (living room, bedroom, etc.) in each floor plan and extracts relevant metadata from their properties. Such metadata could include the size and location of recognized furniture symbols, and/or hard boundaries drawn between rooms. A wonderful stretch goal would be for the system to automatically populate each room with furniture appropriate to its use.

    Jose Luis García del Castillo y López is an architect, computational designer, and educator. He advocates for a future where programming and code are tools as natural to artists as paper and pencil. In his work, he explores creative opportunities at the intersection of design, technology, fabrication, data and art. He is currently a Lecturer in Architectural Technology at the Harvard Graduate School of Design.

  3. Data Science for Case Law

    This project will challenge you to read case law and generate automatic summaries of the cases. Headnotes are brief case summary statements often generated by commercial third parties. They are under copyright protection, and so are not always available or are subject to many usage restrictions. In this project, you will reconstruct these headnotes for a variety of cases, using historical court cases as a training set. We will use the Caselaw Access Project, a recently released dataset of all United States case law from the Harvard Law School Library.

    Jack Cushman is a Lecturer at the Harvard Law School. Jack teaches Computer Programming for Lawyers at Harvard Law School and is a senior developer at the Harvard Library Innovation Lab. He previously practiced as an appellate lawyer at the firm of Stern, Shapiro, Weissberg and Garin.

    Kelly Fitzpatrick is a Research Associate at the Harvard Law School Library. Kelly runs outreach and communications for the Caselaw Access Project at the Harvard Library Innovation Lab. She holds a master's in library science and worked previously on the Harvard Open Access Project.

  4. Predicting Disease Activity

    This project involves applying machine learning techniques and novel Internet-based data sources for real-time monitoring and short-term forecasting of population-level disease activity. In this project, you will learn how to design and deploy predictive models to track and forecast epidemic outbreaks in real time. For this, you will gain access to epidemiological data of ongoing and historical disease outbreaks in the US and other countries. Diseases of interest include Influenza (USA), Dengue fever (Puerto Rico and Peru), Ebola (DRC), and Coronavirus 2019-nCoV (China and the world).

    Given that most epidemiological monitoring systems in the world provide lagged information about disease activity (case counts of infected people are known to decision-makers weeks after the fact due to lab testing and data collection delays), you will learn how to design and implement time-series based prediction models to estimate the likely disease activity in current time and future weeks.

    You will learn how to utilize data sources that were not originally designed to be indicators of disease activity to complement time-series predictive models. These data sources include: disease-related Google search activity, disease-related Twitter microblogs, air travel, and weather patterns.
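
    To make the idea of a time-series prediction model under reporting delays concrete, here is a minimal sketch of an autoregressive model fitted by least squares. It is only one possible baseline, not the method used in the project; the function names `fit_ar` and `forecast`, the synthetic weekly counts, and the two-week reporting lag are all assumptions made for illustration.

```python
import numpy as np

def fit_ar(series, p):
    """Fit y[t] = c + a1*y[t-1] + ... + ap*y[t-p] by least squares."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    # Column k holds the lag-k values aligned with targets y[p:], plus an intercept.
    lags = [y[p - k: n - k] for k in range(1, p + 1)]
    X = np.column_stack([np.ones(n - p)] + lags)
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef                                   # [c, a1, ..., ap]

def forecast(series, coef, steps):
    """Roll the fitted model forward, feeding predictions back in as lags."""
    p = len(coef) - 1
    history = [float(v) for v in series]
    out = []
    for _ in range(steps):
        recent = history[-1: -p - 1: -1]          # most recent value first
        nxt = coef[0] + float(np.dot(coef[1:], recent))
        out.append(nxt)
        history.append(nxt)
    return out

# Synthetic weekly case counts (hypothetical), with a seasonal shape.
rng = np.random.default_rng(0)
weeks = np.arange(104)
cases = 100 + 40 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 5, size=104)

lag_weeks = 2                                     # assumed reporting delay
reported = cases[:-lag_weeks]                     # what decision-makers can see today
coef = fit_ar(reported, p=4)
# Predicting lag_weeks steps ahead of the last report estimates the current week.
nowcast = forecast(reported, coef, steps=lag_weeks)[-1]
```

    Signals such as search activity or weather could enter the same regression as additional columns of the design matrix, which is one simple way to combine them with the lagged case counts.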

    Dr. Mauricio Santillana is a physicist and applied mathematician with expertise in mathematical modeling and scientific computing. He has worked in multiple research areas frequently analyzing big data sets to understand and predict the behavior of complex systems. His research modeling population growth patterns has informed policy makers in Mexico and Texas. His research in numerical analysis and computational fluid dynamics has been used to improve models of coastal floods due to hurricanes, and to improve the performance of global atmospheric chemistry models.

    Mauricio received a B.S. in physics with highest honors from the Universidad Nacional Autonoma de Mexico in Mexico City, and a master’s and PhD in computational and applied mathematics from the University of Texas at Austin. Mauricio first joined Harvard as a postdoctoral fellow at the Harvard Center for the Environment and has been a lecturer in applied mathematics at the Harvard SEAS, receiving two awards for excellence in teaching.

  5. Predicting Project Success

    Data Science has become a core function in many firms today, driving innovation and new data-intensive business and operating models. This project will demonstrate how data and data science offer unprecedented opportunities to organizations in running their business, focusing on project planning capabilities. In this project, you will be exposed to concepts related to business operations and gain an in-depth understanding of the role of data and predictions in these operations.

    In particular, can project success or failure (defined by budgetary or temporal constraints) be predicted ahead of time, given a variety of project features? Many businesses operate without a reliable predictor of project success, and therefore waste time and money due to mismanagement. This project addresses a real-world business need with data, and will help managers better plan their business operations.

    Dr. Yael Grushka-Cockayne is Visiting Associate Professor of Business Administration, Harvard Business School. Associate Professor Grushka-Cockayne's research and teaching activities focus on data science, analytics, forecasting, decision analysis, project management, and behavioral decision-making. Yael is an award-winning teacher and in 2014 was named one of "21 Thought-Leader Professors" in Data Science. At HBS Yael teaches the required Technology and Operations Management course and an elective course on Business Analytics. She has been teaching in the Harvard Business Analytics Program, powered by 2U, since 2018. Previously, Yael taught courses on Decision Analysis, Project Management, and Data Science in Business. Yael's "Fundamentals of Project Planning and Management" Coursera MOOC had over 200,000 enrolled, across 200 countries worldwide.

    Before starting her academic career, she worked in San Francisco as a marketing director of an ERP company. As an expert in the area of project management, she has served as a consultant to international firms in the aerospace and pharmaceutical industries. Yael is an Associate Editor at Management Science, Operations Research, and Decision Analysis. Education: B.Sc., Ben-Gurion University; MSc, London School of Economics; Ph.D., MRes, London Business School.

  6. Measuring the Shape and Brightness of Galaxies with Neural Networks

    For decades, astronomers have scanned the sky with increasingly powerful telescopes and cameras, collecting millions of digital images of billions of stars, galaxies, and other objects. These "sky surveys" collect so much data that no human could ever look at it all directly, so we rely on automated software to detect objects, isolate them from their neighbors, and determine their properties. Traditionally this has been done by model fitting.

    For example, to characterize a galaxy, we use a parametric generative model for the galaxy that outputs an image (values in a grid of pixels) as a function of the location, size, shape, orientation, and brightness of the galaxy. A common choice is the Sérsic profile. One can define an objective function (or loss function) given the model and a noise model, and run an optimizer or Markov chain Monte Carlo to estimate the parameters and their uncertainty. Much of modern astronomy is built on this approach, even though the Sérsic model is not a terribly good fit for many galaxies.
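
    The model-fitting approach described above can be sketched as follows: a function renders a Sérsic profile, I(r) = I_e exp(-b_n [(r / r_e)^(1/n) - 1]), on a pixel grid, and a chi-square objective compares it with noisy data under a Gaussian noise model. This is an illustrative sketch under stated assumptions: the parameter names, the approximation b_n ≈ 2n - 1/3, the absence of PSF convolution and pixel integration, and the helper names `sersic_image` and `chi2` are choices made here, not part of the course software.

```python
import numpy as np

def sersic_image(size, x0, y0, r_e, n, i_e, q=1.0, theta=0.0):
    """Render an elliptical Sersic profile on a size x size pixel grid.

    x0, y0: center; r_e: effective radius; n: Sersic index;
    i_e: intensity at r_e; q: axis ratio; theta: orientation in radians.
    """
    y, x = np.mgrid[0:size, 0:size].astype(float)
    dx, dy = x - x0, y - y0
    # Rotate pixel offsets into the galaxy's principal-axis frame.
    xp = dx * np.cos(theta) + dy * np.sin(theta)
    yp = -dx * np.sin(theta) + dy * np.cos(theta)
    r = np.sqrt(xp ** 2 + (yp / q) ** 2)       # elliptical radius
    b_n = 2.0 * n - 1.0 / 3.0                  # common approximation (assumed here)
    return i_e * np.exp(-b_n * ((r / r_e) ** (1.0 / n) - 1.0))

def chi2(params, data, sigma):
    """Objective function: chi-square of the model against the data,
    assuming independent Gaussian pixel noise with standard deviation sigma."""
    model = sersic_image(data.shape[0], *params)
    return float(np.sum(((data - model) / sigma) ** 2))

# Mock observation with known parameters (x0, y0, r_e, n, i_e).
rng = np.random.default_rng(0)
true_params = (16.0, 16.0, 4.0, 1.0, 2.0)
sigma = 0.05
data = sersic_image(33, *true_params) + rng.normal(0.0, sigma, size=(33, 33))

loss_true = chi2(true_params, data, sigma)
loss_wrong = chi2((16.0, 16.0, 6.0, 1.0, 2.0), data, sigma)  # wrong r_e
```

    An optimizer or a Markov chain Monte Carlo sampler would then search over `params` to minimize `chi2` or, equivalently, to explore the likelihood exp(-chi2 / 2).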

    A more modern approach would be to use neural networks. This would allow us to specify our concept of what galaxies look like by using many training examples (real or simulated) rather than a functional form. The neural network can be trained with those input examples, and desired output parameters, and then applied to new data.

    For this module we will use the galsim software to generate mock data with known parameters and noise. We will then train a neural net on the mock data, and quantify its performance. By showing that modern data-driven approaches can succeed on this problem, we open the door to future work on real galaxies, including edge cases such as merging systems of galaxies, or galaxies that overlap along the line of sight -- situations that traditional methods sometimes handle poorly.

    Dr. Douglas Finkbeiner has a joint appointment in the Department of Astronomy and Department of Physics, working on topics ranging from dark matter to interstellar dust. He is currently excited about the Dark Energy Spectroscopic Instrument (DESI), a 5000-fiber spectrograph on the 4m telescope at Kitt Peak, AZ.