Extreme-scale data science at the convergence of big data and massively parallel computing is enabling simulation, modelling and real-time analysis of complex natural and social phenomena at unprecedented scales. The aim of the project is to gain practical experience with this interplay by applying parallel computation principles to solve a compute- and data-intensive problem.
Your final project is to solve a data-intensive or a compute-intensive problem with parallel processing on the AWS cloud or on Harvard’s supercomputer, Odyssey (or both!). You will identify a compute or data science problem, analyse its compute scaling requirements, collect the data, design and implement parallel software, and demonstrate scaled performance of an end-to-end application.
Your project should consider the following aspects:
- Demonstrate the need for big compute and/or big data processing, and what can be achieved thanks to large-scale parallel processing.
- Solve a problem for a non-trivial computation graph and with hierarchical parallelism.
- Be implemented on a distributed-memory architecture with either many-core or multi-core compute nodes, and evaluated on at least 8 compute nodes (note: each compute node on Odyssey is either a multi-core node with 32 or 64 cores or a many-core GPU node with hundreds of cores).
- Use a hybrid parallel programming model, for example: MPI + OpenMP, MPI + OpenACC (or OpenCL), Spark or MapReduce + OpenACC (or OpenCL), or MPI + Spark or MapReduce.
- Be evaluated on large data sets or problem sizes to demonstrate both weak and strong scaling using appropriate metrics (throughput, efficiency, iso-efficiency...).
Note that your project is not required to include all of these aspects. However, projects that do include more of the listed aspects or a higher level of difficulty will be weighted accordingly.
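As an illustration of the scaling metrics mentioned in the list above, here is a minimal Python sketch that computes speed-up and strong and weak scaling efficiency from wall-clock timings. The function names and the timing values are hypothetical placeholders, not measured data; substitute your own measurements.

```python
def speedup(t_base, t_parallel):
    """Speed-up of a parallel run relative to the baseline run."""
    return t_base / t_parallel


def strong_scaling_efficiency(t1, tp, p):
    """Fixed total problem size: the ideal time on p workers is t1 / p."""
    return t1 / (p * tp)


def weak_scaling_efficiency(t1, tp):
    """Fixed problem size per worker: the ideal time stays equal to t1."""
    return t1 / tp


# Hypothetical wall-clock times in seconds, keyed by number of nodes.
strong_times = {1: 800.0, 2: 420.0, 4: 230.0, 8: 140.0}
weak_times = {1: 100.0, 2: 104.0, 4: 110.0, 8: 125.0}

for p, tp in strong_times.items():
    s = speedup(strong_times[1], tp)
    e = strong_scaling_efficiency(strong_times[1], tp, p)
    print(f"strong  nodes={p}  speed-up={s:.2f}  efficiency={e:.2f}")

for p, tp in weak_times.items():
    e = weak_scaling_efficiency(weak_times[1], tp)
    print(f"weak    nodes={p}  efficiency={e:.2f}")
```

For strong scaling the total problem size is held constant as nodes are added; for weak scaling the problem size per node is held constant, so the two experiments need separate timing runs.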
Students are required to form teams and to partition the work among the team members. The final project must be done in teams with 3-4 students each (exceptions by permission of the instructor). You can use the course forum to find prospective team members. You may also find and discuss project ideas on the forum. In general, we do not anticipate that the grades for each group member will be different. However, we reserve the right to assign different grades to each group member if it becomes apparent that one of them put in a vastly different amount of effort than the others.
There are five milestones for your final project. It is critical to note that no extensions will be given for any of these milestones for any reason. Projects submitted after the final due date will not be graded.
- Team formation and tentative topic (3/30)
- In-class presentation of your project proposal (4/14 and 4/16)
- In-class presentation of your progress with the design of the project (4/21 and 4/23)
- The final project deliverables submission (5/10)
- Project presentation to teaching staff (5/11)
Your group needs to present a project proposal (and submit the PDF of the presentation) with the following sections:
- What is the problem you are trying to solve with this application?
- What is the need for big compute and/or big data processing and what can be achieved thanks to large-scale parallel processing?
- Describe your model and/or data in detail: where does it come from, what does it mean, etc.
- Which tools and infrastructure are you planning to use to build the application?
You will have 5, and ONLY 5, minutes to briefly summarize your proposal. You have to prepare 2-3 slides for your proposal. We will enforce the 5-minute time limit.
- The presentation of the project proposal is worth 10% of the project.
- This presentation is a chance for you to get feedback. We may suggest modifications if necessary. Our main concern is the amount of effort a given project will require; either too much or too little is unacceptable.
Project Progress (Design)
Your group needs to present your project progress (and submit the PDF of the presentation), covering the main aspects of the design of the parallel application, with the following sections:
- Define the type of your application, the levels of parallelism exploited, the types of parallelism within the application, and the parallel execution model that will be used to build the parallel application
- Specify the programming model and infrastructure that you will use and, if code exists, provide an analysis/profiling of the existing sequential/parallel code (you should check the limits of your account for scalability testing).
- Describe the main overheads (communication, synchronization, load balancing and sequential sections) in the parallelization and techniques that will be applied to mitigate them
- Describe the numerical complexity of the algorithm
- Estimate the theoretical speed-up and scalability expected
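One common way to estimate theoretical speed-up, sketched here in Python, is to apply Amdahl's law (fixed problem size) and Gustafson's law (problem size growing with the number of processors). This is only an illustration: the serial fraction of 0.05 is an assumed value, and in practice you would estimate it by profiling your own code.

```python
def amdahl_speedup(serial_fraction, p):
    """Upper bound on speed-up for a fixed problem size (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


def gustafson_speedup(serial_fraction, p):
    """Scaled speed-up when the problem grows with p (Gustafson's law)."""
    return serial_fraction + (1.0 - serial_fraction) * p


# Assumed serial fraction; replace it with a value obtained from profiling.
serial_fraction = 0.05

for p in (8, 64, 256):
    a = amdahl_speedup(serial_fraction, p)
    g = gustafson_speedup(serial_fraction, p)
    print(f"p={p:4d}  Amdahl={a:7.2f}  Gustafson={g:7.2f}")
```

Amdahl's bound saturates at 1 / serial_fraction no matter how many processors are added, which is why the sequential sections listed among the overheads above matter so much for strong scaling.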
You will have 5, and ONLY 5, minutes to briefly summarize your design. You have to prepare 2-3 slides for your progress. We will enforce the 5-minute time limit.
- The presentation of the project design is worth 20% of the project. While it will likely take less than 20% of the time you spend on the project, you should take it very seriously.
- We will grade your designs harshly. The design is essentially the most important part of the project. Having a good project design is needed to ensure an efficient implementation and can significantly cut your total coding and integration time.
- This presentation is not a proposal but a design document: you should be concrete and specific rather than abstract and general, and include real performance estimates.
- This presentation is also a chance for you to get feedback from the teaching staff, and to come up with ways around roadblocks you encounter. It is also a chance for the teaching staff to ensure that your project is on track, and that your project is still in the appropriate-amount-of-work range.
The final project deliverables are:
- Web site as final report
- Software with evaluation data sets and test cases (in a GitHub repository)
- Presentation to the teaching staff
Project Web Site
An important piece of your final project is a public web site that describes all the great work you did for your project. The web site serves as the final project report, and needs to describe your complete project. You can use GitHub Pages, or the README file on the GitHub repository, so you can easily refer to the software at the GitHub repository. You should assume the reader has no prior knowledge of your project and has not read your proposal. It should address the following aspects:
- Description of problem and the need for HPC and/or Big Data
- Description of solution and comparison with existing work on the problem
- Description of your model and/or data in detail: where did it come from, how did you acquire it, what does it mean, etc.
- Technical description of the parallel application, programming models, platform and infrastructure
- Links to repository with source code, evaluation data sets and test cases
- Technical description of the software design, code baseline, dependencies, how to use the code, and system and environment needed to reproduce your tests
- Performance evaluation (speed-up, throughput, weak and strong scaling) and discussion about overheads and optimizations done
- Description of advanced features like models/platforms not explained in class, advanced functions of modules, techniques to mitigate overheads, challenging parallelization or implementation aspects...
- Final discussion about goals achieved, improvements suggested, lessons learnt, future work, interesting insights…
Your web page should include screenshots of your software that demonstrate how it functions. You should include a link to your source code.
Your final project can be implemented using any API or programming language you would like. Make your own repository on GitHub with a link to your project web page. The software, with evaluation data sets and test cases, should be available in the repository. Include a README that describes the code and application files, and how your program should be run. We will be grading these projects on a variety of platforms, so you must include detailed instructions on how to compile and run your code. If we cannot run your application from the instructions included with your submission, we will not be able to grade this portion of your project. Your performance results should be reproducible, so you should provide all the information about the system and the environment needed to reproduce your tests.
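As one possible way to record part of the system and environment information mentioned above, the following Python sketch collects a few basic facts using only the standard library. The function name `collect_environment` and the choice of fields are illustrative assumptions, not a required format.

```python
import json
import os
import platform
import sys


def collect_environment():
    """Gather basic system facts that help others reproduce a test run."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "logical_cpus": os.cpu_count(),
        "python_version": platform.python_version(),
        "command_line": list(sys.argv),
    }


if __name__ == "__main__":
    # Print the record as JSON so it can be saved next to the timing results.
    print(json.dumps(collect_environment(), indent=2))
```

A record like this covers only what the standard library can see. Details such as compiler versions, MPI or Spark versions, loaded modules, node counts and instance types would still need to be documented by hand in the README.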
You will have 10, and ONLY 10, minutes to briefly present your project followed by 5 minutes of discussion time. You may prepare 4-5 slides for your summary, but we will enforce the 10-minute time limit. Focus the majority of your presentation on your main contributions rather than on technical details. What do you feel is the coolest part of your project? What insights did you gain? What is the single most important thing you would like to show the class? Upload the presentation to the GitHub repo and on Canvas.
The final project grades are dependent on the following criteria:
- Attempted difficulty: Some projects are harder than others. For example, a project based on one of the homework assignments is probably easier than a completely new application.
- Did you meet your major goals? The most important grading criterion is functionality: a working program will always garner the majority of available points; no credit will be given for non-working programs. A modest solution that works will be graded much more favorably than an ambitious "solution" that core dumps!
Projects will be graded on the depth of work undertaken and on communication (web site, presentation):
- 10%: Project proposal (in-class presentation)
- 20%: Project design (in-class presentation)
- 10%: Problem description and comparison with existing work
- 20%: Parallel application, programming model, platform and infrastructure
- 10%: Software source code, design and documentation
- 20%: Performance evaluation and discussion
- 10%: Final discussion and deliverables quality
Extra points may be earned for the use of advanced features like:
- Programming models not explained in class (CUDA for GPU programming...)
- Infrastructures not used during the course (Google Compute, XSEDE supercomputers...)
- Advanced functions of models explained in class (MPI I/O, complex collective communications in MPI, OpenMP NUMA extensions...)
- Modules of platforms not explained in class (ML or Graph in Spark...)
- Advanced techniques to mitigate overheads (overlap of communication and computation, optimized libraries...)
- Customized system configurations (kernel/driver tuning...) to mitigate overheads
- Challenging parallelization or implementation aspects