Coursework
Computational science has become a third partner, alongside theory and experimentation, in advancing scientific knowledge and practice, and an essential tool for product and process development and manufacturing in industry. Big data science adds a ‘fourth pillar’ to scientific advancement, providing the methods and algorithms to extract knowledge and insights from data.
The course is a journey into the foundations of Parallel Computing at the intersection of large-scale computational science and big data analytics. Many science communities are combining high performance computing and high-end data analysis platforms and methods in workflows that orchestrate large-scale simulations or incorporate them into the stages of large-scale analysis pipelines for data generated by simulations, experiments, or observations.
This is an applications course highlighting the use of modern computing platforms in solving computational and data science problems, enabling simulation, modeling and real-time analysis of complex natural and social phenomena at unprecedented scales. The class emphasizes making effective use of the diverse landscape of programming models, platforms, open-source tools, computing architectures and cloud services for high performance computing and high-end data analytics.
In the last decade we have seen many technology developments and provisioning paradigm shifts that have changed the way we design, develop, share and execute applications:
- Exponential growth of data, doubling in size every two years
- Emergence of new processing platforms for managing, and for batch and stream processing of, these massive volumes of data
- Advances in computer architecture, with increasing multi-core parallelism and faster networks
- On-demand access to large-scale infrastructures, paying only for actual use, thanks to infrastructure-as-a-service (IaaS) cloud providers
- Free and open-source software that often surpasses commercial or bespoke in-house software, allows complex systems to be built on a small budget, and makes it possible to understand and improve its internals
- A major shift towards community-driven, open-source software development, and the adoption of software containers as the unit of application deployment, fostering code and component reuse and simplifying operations
As scientists and engineers, we have to dig deeper than buzzwords. Behind these rapid changes in technology there are enduring principles that remain true no matter which version of a particular platform you are using. The course focuses on teaching those principles: where programming models and platforms fit, how to make good use of them, and how to avoid their pitfalls.
The course does not attempt to give detailed instructions on how to install or use specific software packages or APIs, since there is already plenty of documentation covering those topics. Instead, we discuss the principles and trade-offs that are fundamental to using modern computing platforms for computational and data science.
Summary
Computational Science and Data Science applications can be compute-intensive, also known as big compute, or data-intensive, also known as big data. Whereas compute-intensive applications bring the data to the compute, data-intensive applications do the opposite by bringing the compute to the data.
Compute-intensive applications consist of a large number of independent (HTC: High Throughput Computing) or parallel (HPC: High Performance Computing) tasks that are performed simultaneously, each addressing a particular part of the problem. Compute power is the main bottleneck, and the developer has to use one of the existing parallel programming models (mainly OpenMP for shared memory, or MPI for distributed memory) to decompose the application into tasks and to define their communication and synchronization.
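As a flavour of these models, here is a minimal sketch of the distributed-memory style using the mpi4py Python bindings for MPI; the example problem (approximating pi by numerical integration), the interval count and the script name are illustrative assumptions, not course material:

```python
# Minimal distributed-memory sketch with MPI via mpi4py.
# Hypothetical usage: mpiexec -n 4 python pi_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # id of this process
size = comm.Get_size()   # total number of processes

# Approximate pi = integral of 4/(1+x^2) over [0,1] with the
# midpoint rule; each rank takes an interleaved slice of intervals.
n = 1_000_000
h = 1.0 / n
local_sum = sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2)
                for i in range(rank, n, size))

# Combine the partial sums on rank 0 (communication and
# synchronization are explicit in the distributed-memory model).
pi = comm.reduce(local_sum * h, op=MPI.SUM, root=0)
if rank == 0:
    print(f"pi is approximately {pi}")
```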
Data-intensive applications, on the other hand, usually require the same computation to be applied to large volumes of data. The communication and storage of the data become the main bottleneck, and the developer has to use one of the existing data processing models (mainly MapReduce/Hadoop and Spark) to partition the data into multiple segments that are processed in parallel by the same task, with the intermediate results then combined in multiple stages.
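By contrast, a minimal sketch of the data-parallel style in Spark, using the PySpark API: the classic word count, where the map and reduce steps are shipped to the partitions that hold the data and per-partition results are merged by key. The input path is a hypothetical placeholder:

```python
# Minimal data-parallel sketch with Spark (PySpark).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("input.txt")  # hypothetical input file
counts = (lines
          .flatMap(lambda line: line.split())  # map: line -> words
          .map(lambda word: (word, 1))         # map: word -> (word, 1)
          .reduceByKey(lambda a, b: a + b))    # reduce: sum counts per word

for word, n in counts.take(10):  # pull a small sample to the driver
    print(word, n)
spark.stop()
```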
The course gives a practical overview of parallel programming models for compute- and data-intensive applications, and of the existing platforms, open-source tools and cloud services that support their execution. The design of efficient parallel programs is approached as an algorithm-architecture problem, meaning that a solution depends on the target architecture. After the course, you will be in a great position to decide which kind of programming model and platform is appropriate for which purpose, and to understand how tools should be combined to form the foundation of the right application architecture to meet the scalability and performance requirements of the most demanding computational problems.
Outline
Introduction. Large-Scale Computational and Data Science
Why we need performance and parallel processing in Computational and Data Science applications, and an overview of the course.
- Computational science
- Data science
- The need for parallel processing
- Description of the course
A. Parallel Processing Fundamentals
The different application execution profiles and forms of parallel computing, modern computing architectures and provisioning models for scalable large-scale processing, and aspects to consider in the design, development and distribution of parallel codes.
A1. Parallel Processing Architectures
- Shared-memory parallel architectures
- GPU systems
- Distributed-memory parallel architectures
- Benchmarking
- Local resource managers
- Grid computing
A2. Large-scale Processing on the Cloud
- Virtualization
- Types of cloud services
- Cloud services for parallel processing
- Replicability of numerical experiments
- Need for private infrastructures
- The anatomy of the cloud
A3. Application Parallelism
- Types of applications
- Levels of parallelism
- Types of parallelism
- Parallel execution models
A4. Designing Parallel Programs
- Performance analysis
- Parallelization overheads
- Numerical complexity
- Efficiency and scalability
A5. Parallel Programming Paradigms
- Performance optimization
- Accelerated computing
- Shared-memory programming
- Distributed-memory programming
- Data-centric programming
B. Parallel Computing
Programming models and techniques for parallel processing of compute-intensive applications.
B1. Accelerated Computing
- Heterogeneous computing
- GPU computing
- GPU programming
- OpenACC fundamentals
B2. Performance Optimization
- Performance analysis
- Optimization process
- Optimization techniques
- Memory locality model
- Loop optimization
- Compiler optimizations
B3. Shared-memory Parallel Processing
- Shared-memory basics
- OpenMP fundamentals
- Data dependencies
- Automatic parallelization
- Parallelization process
B4. Distributed-memory Parallel Processing
- Distributed-memory basics
- MPI fundamentals
- Hybrid model
- A summary of big compute models
C. Parallel Data Processing
Programming models and techniques for parallel processing of data-intensive applications.
C1. Batch Data Processing
- Why is big data processing different?
- The MapReduce programming model
- The Hadoop processing framework
C2. Dataflow Processing
- MapReduce limitations
- The Spark execution engine
- The Spark programming model
- The Spark ecosystem
C3. Stream Data Processing
- Big streaming data
- Stream processing with Spark
- Stream processing at the edge
C4. Serverless Data Processing
- Serverless computing
- Benefits and limitations
- Serverless streaming architectures
Wrap-up. Advanced Topics
Summary of the course and future directions.
- Wrap-up of the course
- Exascale computing
- Quantum computing
- Edge computing and analytics
- Future directions in extreme scale computing