Massive, highly hybrid systems are hierarchical and heterogeneous by nature. Matching such an architecture requires a task-based scheduler for multi-level parallelism, in essence a hierarchy of schedulers with different scopes. This includes exploring the support a runtime system must provide to execute adaptive applications with scale-free tasking and to balance the load in the presence of non-deterministic performance variations. WP6 will develop a sustainable set of tools, methods, and software solutions that are critical to the linear algebra algorithms and software developed in WP2–4. These tools, methods, and software are intended to remain robust to rapid developments in the hardware landscape and the software stack, and to fit well in the HPC software ecosystem by responding to the needs of bandwidth-bound applications.
The work in this work package is structured into three tasks:
- Scheduling and Runtime Systems
Develop a task-graph-based, multi-level scheduler for multi-level parallelism. Investigate user-guided schedulers that strike an application-dependent balance between locality, concurrency, and scheduling overhead. Develop a runtime system based on parallelizing critical tasks. Address the thread-to-core mapping problem.
- Auto-Tuning
Off-line: tune critical numerical kernels across hybrid systems. Run-time: use feedback during and/or between executions on similar problems to guide tuning in later stages of the algorithm.
- Algorithm-Based Fault Tolerance
Explore new numerical linear algebra (NLA) methods for resilience and develop algorithms with these capabilities.
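To make the idea of algorithm-based fault tolerance concrete, the following minimal sketch shows the classic checksum scheme for matrix multiplication. It is not one of the new methods this task will explore. It only illustrates the general principle of encoding redundancy into the data so that the algorithm itself can detect and repair a corrupted result. All function names are hypothetical, and the sketch assumes exact integer arithmetic and at most one corrupted entry in the data part of the product.

```python
# Illustrative sketch of classic checksum-based ABFT for C = A * B.
# Assumptions: small dense matrices stored as lists of lists, integer
# entries (so checksums compare exactly), and at most one faulty entry.

def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def locate_fault(c_full, tol=1e-9):
    """Return (row, col) of a single corrupted data entry, or None.

    c_full is the product of the checksum-extended operands: its last
    column should equal the row sums of the data part, and its last row
    should equal the column sums of the data part.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None

def correct_fault(c_full, pos):
    """Rebuild the entry at pos from its row checksum, in place."""
    i, j = pos
    m = len(c_full[0]) - 1
    others = sum(c_full[i][k] for k in range(m) if k != j)
    c_full[i][j] = c_full[i][m] - others

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c_full = matmul(with_column_checksum(a), with_row_checksum(b))
    c_full[1][0] += 10                # simulate a soft error in one entry
    pos = locate_fault(c_full)        # inconsistent row and column pinpoint it
    if pos is not None:
        correct_fault(c_full, pos)
    print(pos, [row[:-1] for row in c_full[:-1]])
```

The checksums add one extra row and one extra column to the product. A production scheme would also need floating-point tolerances tied to rounding error and a strategy for faults that hit the checksums themselves, both of which this sketch leaves out.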