Coursework
Computational science has become a third partner, alongside theory and experimentation, in advancing scientific knowledge and practice, and an essential tool for product and process development and manufacturing in industry. Big data science adds a ‘fourth pillar’ to scientific advancement, providing the methods and algorithms to extract knowledge and insights from data.
The course is a journey into the foundations of Parallel Computing at the intersection of large-scale computational science and big data analytics. Many science communities are combining high performance computing and high-end data analysis platforms and methods in workflows that orchestrate large-scale simulations or incorporate them into the stages of large-scale analysis pipelines for data generated by simulations, experiments, or observations.
This is an applications course highlighting the use of modern computing platforms in solving computational and data science problems, enabling simulation, modeling and real-time analysis of complex natural and social phenomena at unprecedented scales. The class emphasizes making effective use of the diverse landscape of programming models, platforms, open-source tools, computing architectures and cloud services for high performance computing and high-end data analytics.
In the last decade we have seen many technology developments and provisioning paradigm shifts that have changed the way we design, develop, share and execute applications:
- Exponential growth of data, doubling in size every two years
- Emergence of new processing platforms for managing, and for batch and stream processing of, these massive volumes of data
- Advances in computer architecture, with increasing multi-core parallelism and faster networks
- On-demand access to large-scale infrastructures, paying only for actual use, thanks to infrastructure-as-a-service (IaaS) cloud providers
- Free and open-source software that often surpasses commercial or bespoke in-house software, allows complex systems to be built on a small budget, and makes it possible to understand and improve its internals
- A major shift towards community-driven, open-source software development, and the adoption of software containers as the unit of application deployment, fostering code and component reuse and simplifying operations
As scientists and engineers, we have to dig deeper than buzzwords. Behind these rapid changes in technology there are enduring principles that remain true no matter which version of a particular platform you are using. The course focuses on teaching those principles: where programming models and platforms fit, how to make good use of them, and how to avoid their pitfalls.
The course does not attempt to give detailed instructions on how to install or use specific software packages or APIs, since there is already plenty of documentation covering those topics. Instead, we discuss the principles and trade-offs that are fundamental to using modern computing platforms for computational and data science.
Summary
Computational Science and Data Science applications can be compute-intensive, also known as big compute, or data-intensive, also known as big data. Whereas compute-intensive applications bring the data to the compute, data-intensive applications do the opposite by bringing the compute to the data.
Compute-intensive applications consist of a large number of independent (HTC: High Throughput Computing) or parallel (HPC: High Performance Computing) tasks that are performed simultaneously, each addressing a particular part of the problem. Compute power is the main bottleneck, and the developer has to use one of the existing parallel programming models (mainly OpenMP for shared memory, or MPI for distributed memory) to decompose the application into tasks and to define their communication and synchronization.
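As a flavour of these models, here is a minimal sketch of the distributed-memory style using the mpi4py Python bindings for MPI; the example problem (approximating pi by numerical integration), the interval count and the script name are illustrative assumptions, not course material:

```python
# Minimal distributed-memory sketch with MPI via mpi4py.
# Hypothetical usage: mpiexec -n 4 python pi_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # id of this process
size = comm.Get_size()   # total number of processes

# Approximate pi = integral of 4/(1+x^2) over [0,1] with the
# midpoint rule; each rank takes an interleaved slice of intervals.
n = 1_000_000
h = 1.0 / n
local_sum = sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2)
                for i in range(rank, n, size))

# Combine the partial sums on rank 0 (communication and
# synchronization are explicit in the distributed-memory model).
pi = comm.reduce(local_sum * h, op=MPI.SUM, root=0)
if rank == 0:
    print(f"pi is approximately {pi}")
```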
Data-intensive applications, on the other hand, usually require the same computation to be applied to large volumes of data. The communication and storage of the data become the main bottleneck, and the developer has to use one of the existing data processing models (mainly MapReduce/Hadoop and Spark) to partition the data into multiple segments that are processed in parallel by the same task, with the intermediate results then combined in multiple stages.
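By contrast, a minimal sketch of the data-parallel style in Spark, using the PySpark API: the classic word count, where the map and reduce steps are shipped to the partitions that hold the data and per-partition results are merged by key. The input path is a hypothetical placeholder:

```python
# Minimal data-parallel sketch with Spark (PySpark).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("input.txt")  # hypothetical input file
counts = (lines
          .flatMap(lambda line: line.split())  # map: line -> words
          .map(lambda word: (word, 1))         # map: word -> (word, 1)
          .reduceByKey(lambda a, b: a + b))    # reduce: sum counts per word

for word, n in counts.take(10):  # pull a small sample to the driver
    print(word, n)
spark.stop()
```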
The course gives a practical overview of parallel programming models for compute- and data-intensive applications, and of the existing platforms, open-source tools and cloud services that support their execution. The design of efficient parallel programs is approached as an algorithm-architecture problem, meaning that a solution depends on the target architecture. After the course, you will be in a great position to decide which kind of programming model and platform is appropriate for which purpose, and to understand how tools should be combined to form the foundation of the right application architecture to meet the scalability and performance requirements of the most demanding computational problems.
Outline
Introduction. Large-Scale Computational and Data Science
Why we need performance and parallel processing in Computational and Data Science applications, and an overview of the course.
- Computational science
- Data science
- The need for parallel processing
- Description of the course
A. Parallel Processing Fundamentals
The different application execution profiles and forms of parallel computing, modern computing architectures and provisioning models for scalable large-scale processing, and aspects to consider in the design, development and distribution of parallel codes.
A1. Parallel Processing Architectures
- Shared-memory parallel architectures
- GPU systems
- Distributed-memory parallel architectures
- Benchmarking
- Local resource managers
- Grid computing
A2. Large-scale Processing on the Cloud
- Virtualization
- Types of cloud services
- Cloud services for parallel processing
- Replicability of numerical experiments
- Need for private infrastructures
- The anatomy of the cloud
A3. Application Parallelism
- Types of applications
- Levels of parallelism
- Types of parallelism
- Parallel execution models
A4. Designing Parallel Programs
- Performance analysis
- Parallelization overheads
- Numerical complexity
- Efficiency and scalability
A5. Parallel Programming Paradigms
- Performance optimization
- Accelerated computing
- Shared-memory programming
- Distributed-memory programming
- Data-centric programming
B. Parallel Computing
Programming models and techniques for parallel processing of compute-intensive applications.
B1. Accelerated Computing
- Heterogeneous computing
- GPU computing
- GPU programming
- OpenACC fundamentals
B2. Performance Optimization
- Performance analysis
- Optimization process
- Optimization techniques
- Memory locality model
- Loop optimization
- Compiler optimizations
B3. Shared-memory Parallel Processing
- Shared-memory basics
- OpenMP fundamentals
- Data dependencies
- Automatic parallelization
- Parallelization process
B4. Distributed-memory Parallel Processing
- Distributed-memory basics
- MPI fundamentals
- Hybrid model
- A summary of big compute models
C. Parallel Data Processing
Programming models and techniques for parallel processing of data-intensive applications.
C1. Batch Data Processing
- Why is big data processing different?
- The MapReduce programming model
- The Hadoop processing framework
C2. Dataflow Processing
- MapReduce limitations
- The Spark execution engine
- The Spark programming model
- The Spark ecosystem
C3. Stream Data Processing
- Big streaming data
- Stream processing with Spark
- Stream processing at the edge
C4. Serverless Data Processing
- Serverless computing
- Benefits and limitations
- Serverless streaming architectures
Wrap-up. Advanced Topics
Summary of the course and future directions.
- Wrap-up of the course
- Exascale computing
- Quantum computing
- Edge computing and analytics
- Future directions in extreme scale computing