Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Dirk Pflüger
Universität Stuttgart

Two of the main challenges for future exascale HPC systems are scalability and fault tolerance. To ensure scalability at such extreme scales, new numerical algorithms are required that reduce the need for global communication and synchronization, and thus the amount of data that has to be communicated. To cope with the predicted failure rates, these algorithms must also tolerate both hard and silent faults: classical approaches such as checkpoint-restart are not feasible for large-scale, high-resolution simulations because of the size of their simulation snapshots. Algorithm-based fault tolerance is required instead.

For the solution of higher-dimensional PDEs such as those that arise in plasma physics, hierarchical numerical methods come to the rescue. We present sparse-grid algorithms for the solution of high-dimensional PDEs, which we study for gyrokinetic simulations with GENE. They exploit a hierarchical extrapolation scheme, the sparse grid combination technique. Our approach introduces an additional coarse-grained level of parallelism for scalability and a new fault-tolerant numerical scheme.
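
To make the idea of the combination technique concrete, here is a minimal Python sketch. It is not GENE code and not the implementation presented in the talk. It only computes the weights with which coarse, anisotropic component grids would be combined, using the standard inclusion-exclusion formula for a downward-closed set of level vectors. The function names, the two-dimensional setting, the level n = 4, and the choice of the failed grid (2, 3) are all assumptions made for illustration. Each component grid could be computed independently, which is where the extra coarse-grained level of parallelism comes from. If one component solution is lost, one possible recovery is to drop it from the index set and recompute the weights instead of restarting from a checkpoint.

```python
from itertools import product


def combination_coefficients(index_set):
    """Weights c_l = sum over z in {0,1}^d of (-1)^|z| * [l + z in index_set].

    Assumes index_set is downward closed (hypothetical helper, for illustration).
    Grids whose weight is zero are omitted from the result.
    """
    index_set = set(index_set)
    d = len(next(iter(index_set)))
    coeffs = {}
    for l in index_set:
        c = 0
        for z in product((0, 1), repeat=d):
            neighbour = tuple(li + zi for li, zi in zip(l, z))
            if neighbour in index_set:
                c += (-1) ** sum(z)
        if c != 0:
            coeffs[l] = c
    return coeffs


def classical_index_set(n, d=2):
    """All level vectors l with l_i >= 1 and |l|_1 <= n + d - 1."""
    return {l for l in product(range(1, n + 1), repeat=d) if sum(l) <= n + d - 1}


# Classical combination in 2D: +1 on the diagonal |l|_1 = n + 1,
# -1 on the diagonal |l|_1 = n.
n = 4
levels = classical_index_set(n)
coeffs = combination_coefficients(levels)

# Assumed fault: the solution on component grid (2, 3) is lost.
# (2, 3) is a maximal element, so removing it keeps the set downward closed.
failed = (2, 3)
recovered = combination_coefficients(levels - {failed})

for l in sorted(recovered):
    print(l, recovered[l])
```

In this sketch the recovered combination assigns a nonzero weight to the coarser grid (1, 2), which had weight zero before the fault. A fault-tolerant scheme of this kind would therefore need such coarser solutions to be available or cheap to compute.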

