Scalability and Fault Tolerance for Exascale Simulations of Hot Fusion Plasmas

Dirk Pflüger
Universität Stuttgart

Higher-dimensional PDEs pose a challenge even for future exascale HPC systems: discretizations based on tensor products, even in their adaptive versions, suffer from the curse of dimensionality. As a result, even a single snapshot of a simulation run can exceed the memory of such a system. Beyond the mere computational feasibility of these simulations, two main challenges of future exascale HPC systems are scalability and fault tolerance. To ensure scalability at such extreme scales, new numerical algorithms are required that overcome the communication bottleneck by reducing the need for global communication and synchronization, and thus the amount of data that has to be communicated. Likewise, classical approaches to coping with hard and silent faults become infeasible: checkpoint-restart is not an option due to the size of a single simulation snapshot. Algorithm-based fault tolerance is required instead, which uses data reconstruction or similar approaches.
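
To put the memory argument in numbers, the following minimal sketch estimates the storage needed for one snapshot on a full tensor-product grid. The function name and all figures used here (five dimensions, 1024 points per dimension, 16 bytes per value for one double-precision complex number) are illustrative assumptions, not values taken from the talk.

```python
def full_grid_bytes(points_per_dim, dim, bytes_per_value=16):
    """Memory for one snapshot on a full tensor-product grid.

    The number of grid points grows as points_per_dim ** dim, which is
    the curse of dimensionality referred to in the text.
    bytes_per_value=16 assumes one complex double per grid point.
    """
    return points_per_dim ** dim * bytes_per_value


# Hypothetical example: 5 dimensions with 1024 points each.
snapshot = full_grid_bytes(1024, 5)
snapshot_pib = snapshot / 2 ** 50  # convert bytes to pebibytes
```

Under these assumptions a single snapshot takes 2^54 bytes, or 16 PiB, and doubling the resolution in every dimension multiplies that by 2^5 = 32.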

For the solution of higher-dimensional PDEs such as those arising in plasma physics, hierarchical numerical methods come to the rescue. We present sparse grid algorithms for the solution of higher-dimensional PDEs, which we study for gyrokinetic simulations with GENE. We exploit a hierarchical extrapolation scheme, the sparse grid combination technique. Our approach uses a basis change to communicate data in a hierarchical representation. This reduces the amount of data for global communication and provides a new fault-tolerant numerical scheme.
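
The two ingredients named above can be illustrated with a small sketch. It assumes the classical form of the combination technique, in which component grids with level vectors l (all entries at least 1) are combined with binomial coefficients, and a one-dimensional piecewise-linear hierarchical basis with zero boundary values. The function names are hypothetical, and none of this is taken from GENE or from the authors' implementation.

```python
from itertools import product
from math import comb


def combination_scheme(dim, level):
    """Level vectors and coefficients of the classical combination technique.

    Assumed form: u_c = sum_{q=0}^{dim-1} (-1)^q * C(dim-1, q)
                        * sum_{|l|_1 = level + dim - 1 - q} u_l
    with all l_i >= 1. Returns {level_vector: coefficient}.
    """
    scheme = {}
    for q in range(dim):
        coeff = (-1) ** q * comb(dim - 1, q)
        target = level + dim - 1 - q
        # Brute-force enumeration; adequate for small dim and level.
        for lv in product(range(1, level + 1), repeat=dim):
            if sum(lv) == target:
                scheme[lv] = coeff
    return scheme


def hierarchize_1d(nodal):
    """Basis change from nodal values to hierarchical surpluses in 1D.

    `nodal` holds the values at the 2**n - 1 interior points of a
    level-n grid; zero boundary values are assumed. The surplus at a
    point is its value minus the mean of its two hierarchical
    neighbours, which lie on coarser levels.
    """
    n = (len(nodal) + 1).bit_length() - 1
    if 2 ** n - 1 != len(nodal):
        raise ValueError("expected 2**n - 1 interior values")
    padded = [0.0] + list(nodal) + [0.0]
    surplus = [0.0] * len(padded)
    for lev in range(1, n + 1):
        h = 2 ** (n - lev)  # index distance to the hierarchical neighbours
        for i in range(h, 2 ** n, 2 * h):
            surplus[i] = padded[i] - 0.5 * (padded[i - h] + padded[i + h])
    return surplus[1:-1]
```

For example, combination_scheme(2, 3) assigns coefficient +1 to the grids (1, 3), (2, 2), (3, 1) and -1 to (1, 2), (2, 1). The coefficients sum to 1, so constant functions are reproduced exactly. hierarchize_1d([0.5, 1.0, 0.5]) returns [0.0, 1.0, 0.0]: a function that is already representable on the coarsest level has zero surpluses on the finer one, which gives some intuition for why a hierarchical representation can reduce the data that needs to be exchanged.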

Back to Workshop II: HPC and Data Science for Scientific Discovery