NExt-generation System software and ToolS for Emerging ARCHitectures (NESTSEARCH)

Principal Investigators: Carlos García Sánchez and Juan Carlos Sáez Alcaide

Start and end dates: 01/09/2025-31/08/2028

Code: PID2024-158311NB-I00

Summary: The landscape of computer architecture in the last few years shows a clear shift towards unprecedented levels of resource replication, heterogeneity, modularity and disaggregation. This poses new challenges at all layers of the software stack, particularly for the system software (from compilers to the OS or runtime systems/middleware). To optimally exploit the new levels of potential performance with minimal user intervention, it is widely accepted that the system software will need to adapt its design and behavior to:

  1. Apply policies to improve resource management leveraging hardware capabilities of modern architectures (e.g., cache partitioning or tiered memory) while fulfilling strict QoS levels;
  2. Provide means to maximize resource usage in HPC and cloud platforms by facilitating programmability of malleable and elastic applications and their integration with resource managers;
  3. Transparently offer efficient mechanisms that allow higher degrees of performance portability across architectures, with support from compilers, runtime systems and libraries; and
  4. Apply strategies to harness the complexities associated with new general-purpose and domain-specific accelerators (DSAs), as well as vector units.

Accomplishing this requires a holistic, multi-level approach to elasticity and resource management, in both system software and applications, which must be cooperative and autonomic. Cooperative, since applications and parallel frameworks will expose knobs and metrics, allowing proper interaction with an external elasticity-enabled resource manager. Autonomic, since those application requirements will be learned and transferred to new application-architecture tuples with minimal human intervention, to deal with the growing decision space of emerging computing systems.

The overall goal of this project proposal is to explore, design and implement new techniques and tools to improve resource management, data placement, elasticity support and transparent performance portability across applications in multi-application environments. The project will focus on highly parallel multicore architectures featuring heterogeneous computing resources, and will leverage simulation techniques when necessary. This overall goal is broken down into four specific objectives, addressed by the following inter-related work packages:

  1. To design efficient shared-resource management and data placement strategies for improved QoS.
  2. To explore elasticity methods to improve resource usage and QoS in HPC and cloud data centers.
  3. To deliver performance portability and ease application mapping in heterogeneous systems.
  4. To simplify the efficient exploitation of new general-purpose accelerators, vector units and DSAs.