cHIMERA: Heterogeneity and specialization In the post-Moore ERA

Summary

Energy efficiency is the main limit on the scalability of computing systems in the Post-Moore era. The semiconductor industry is now at an inflection point, and device technology alone offers no clear solution to this problem in the short or medium term. To sustain higher performance and keep up with the computational demands of emerging applications, there is broad consensus across industry and academia that:

  1. Specialization and heterogeneity are key to improving energy efficiency in the Post-Moore era. The growing popularity of GPUs and FPGAs is just one example of this trend.
  2. This shift towards specialization and heterogeneity comes at the expense of flexibility and programmability. A reexamination of the Hw-Sw interface, including the different layers of the system Sw (from the compiler to the OS and the runtime system), is a promising research direction to mitigate these problems.
  3. Traditional benchmarks are no longer valid to evaluate many of the proposed architectural and system level optimizations, because the compile-execute loop does not fit well with the increasing architectural complexity. Analyzing driver applications in a holistic way is becoming a more reasonable approach to get realistic insights.

Objectives and tasks

The overall objective of the project has been broken down into different sub-objectives and tasks:

1. Circuit/Architecture Level

1.1. Emerging memory organizations and non-volatile memories

We propose to explore alternative memory organizations and evaluate their performance and energy efficiency in domain-specific applications. New experimental platforms (simulators or emulators) are required to model and predict the behavior of these complex new memory systems. Specifically, we aim to assess the performance of near-memory computing solutions based on Hybrid Memory Cube (HMC) memories, taking advantage of the high bandwidth they provide in scenarios where data reuse and locality are low.
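The reason high bandwidth matters when data reuse is low can be illustrated with the well-known roofline bound, where attainable throughput is the minimum of peak compute and bandwidth times arithmetic intensity. The sketch below is only illustrative: the function name and every number in it are assumptions chosen for the example, not measurements or HMC specifications.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline bound: a kernel cannot exceed peak compute, nor the
    rate at which memory can feed it (bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical kernel with little data reuse: 0.25 flop per byte moved.
LOW_REUSE = 0.25

# Hypothetical platforms (illustrative numbers only).
host = attainable_gflops(peak_gflops=500.0, bandwidth_gbs=50.0,
                         flops_per_byte=LOW_REUSE)
near_memory = attainable_gflops(peak_gflops=100.0, bandwidth_gbs=300.0,
                                flops_per_byte=LOW_REUSE)

if __name__ == "__main__":
    print(f"host bound: {host} GFLOP/s, near-memory bound: {near_memory} GFLOP/s")
```

Under these assumed figures the kernel is bandwidth-bound on both platforms, so the platform with the weaker compute peak but higher bandwidth has the higher bound.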

1.2. Specialized arithmetic units and Approximate Computing

The main objective of the study on approximate computing is to obtain a set of customizable approximate functional units (FUs) able to comply with a target accuracy requirement within an application. This is the key to obtaining the largest energy reductions and performance improvements. Furthermore, these units are the basis for constructing an HLS approach able to tell apart kernels that demand high accuracy from those where accuracy can be sacrificed for the sake of energy savings.
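As a minimal software sketch of what a customizable approximate FU could look like, the code below models an adder in the style of a lower-part-OR adder: the k low-order bits are approximated with a bitwise OR and the upper bits are added exactly. The parameter k is the customization knob, and the helper picks the largest k that still meets a target mean relative error on a sample of operands. The choice of this adder and all names are assumptions for illustration, not the units the project will develop.

```python
def approx_add(a: int, b: int, k: int) -> int:
    """Add two non-negative ints, approximating the k low bits with an OR."""
    if k == 0:
        return a + b  # fully exact adder
    mask = (1 << k) - 1
    low = (a | b) & mask  # cheap approximation of the low part
    # Carry into the exact part is guessed from the top bit of each low part.
    carry = ((a >> (k - 1)) & (b >> (k - 1))) & 1
    high = ((a >> k) + (b >> k) + carry) << k  # exact upper addition
    return high | low


def mean_relative_error(k: int, samples) -> float:
    """Average |approx - exact| / exact over (a, b) pairs with a + b > 0."""
    total = 0.0
    for a, b in samples:
        exact = a + b
        total += abs(approx_add(a, b, k) - exact) / exact
    return total / len(samples)


def largest_k_within(target: float, samples, max_k: int) -> int:
    """Largest number of approximated bits that still meets the target."""
    best = 0
    for k in range(max_k + 1):
        if mean_relative_error(k, samples) <= target:
            best = k
    return best
```

For this particular design the absolute error is bounded by 2^(k-1), so the accuracy target translates directly into how many bits can be approximated.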

2. System Sw Level

2.1. Extended Runtime Task Scheduling and Task Co-Scheduling

At the runtime level, we will address individual, task-parallel applications, and propose the extension of current runtimes and programming models to support additional degrees of freedom (DoFs). These include mapping tasks to the most suitable cores (mapping), dynamically controlling the number of intra-task threads (threading), partitioning or combining tasks (partitioning/merging), and even controlling arithmetic and data representation precision per task (precision casting), among others. Together, these DoFs will make scheduling decisions more complex, but also more flexible and adaptable to the ever-growing architectural heterogeneity. As a result, we will develop a dependency-aware runtime task scheduler integrating the extended scheduling decisions. We will keep implementation overheads as low as possible by leveraging modern C++17 features.
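To make the idea of a dependency-aware scheduler with a mapping DoF concrete, here is a small sketch written in Python for brevity (the project itself targets C++17). It walks the task graph in topological order and greedily places each task on the core that finishes it earliest. The greedy policy, the data layout, and all names are assumptions for illustration only; the other DoFs (threading, partitioning/merging, precision casting) are not modeled.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def schedule(tasks, deps, cores):
    """Greedy earliest-finish-time mapping of a task graph.

    tasks: {task: {core_type: estimated cost}}
    deps:  {task: set of predecessor tasks}
    cores: {core_id: core_type}
    Returns {task: (core_id, start, end)}.
    """
    core_free = {c: 0.0 for c in cores}  # time at which each core is idle
    placement = {}
    graph = {t: deps.get(t, ()) for t in tasks}
    for t in TopologicalSorter(graph).static_order():
        # A task is ready once all of its predecessors have finished.
        ready = max((placement[p][2] for p in deps.get(t, ())), default=0.0)
        best = None
        for core, ctype in cores.items():
            if ctype not in tasks[t]:
                continue  # task has no implementation for this core type
            start = max(ready, core_free[core])
            end = start + tasks[t][ctype]
            if best is None or end < best[2]:
                best = (core, start, end)
        if best is None:
            raise ValueError(f"no compatible core for task {t!r}")
        placement[t] = best
        core_free[best[0]] = best[2]
    return placement
```

A usage example would pass per-core-type cost estimates for each task, for instance `{"A": {"big": 2, "little": 4}}`, together with a core map such as `{"big0": "big", "little0": "little"}`.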

2.2. Operating Systems: Process Scheduling and Resource Management

The goal of this objective is twofold. Firstly, we will devise strategies to effectively manage shared resources in multicore and manycore architectures by leveraging Hw support for resource allocation. Secondly, we will explore scheduling solutions with the ability to deliver the benefits of heterogeneous multicores to a wide range of legacy applications.

To achieve the first goal, we plan to develop a simulation tool that will leverage theoretical and machine-learning-based models to estimate the performance degradation that applications suffer due to contention when running simultaneously with others. This simulator will enable us to quickly identify the most promising shared-resource allocation schemes by comparing them against approximate optimal solutions. We will then implement these schemes in the Linux kernel using an enhanced version of the PMCTrack monitoring tool, which will be extended in two ways: (1) support for the various Hw monitoring and allocation facilities that make up the Intel Resource Director Technology, and (2) a research framework for fast prototyping of scheduling and shared-resource management strategies. Although we will primarily focus on evaluating these shared-resource management policies at the OS level, we also plan to evaluate the algorithms at the VMM (virtual machine monitor) level, so as to cope with the ever-increasing usage of virtual environments.
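A simulator of this kind needs a reference point against which candidate allocation schemes can be compared. The sketch below shows one way such a reference could be computed for small workloads: it exhaustively enumerates every split of a partitionable resource (assumed here to be last-level-cache ways) and keeps the split with the lowest unfairness, defined as the ratio of the largest to the smallest slowdown. The slowdown tables, the unfairness metric, and all names are hypothetical inputs chosen for the example.

```python
from itertools import product


def best_partition(slowdown, total_ways):
    """Exhaustive search for the fairest split of `total_ways` cache ways.

    slowdown[app][w - 1] is the (modelled) slowdown of `app` when it is
    given w ways; each table must have at least `total_ways` entries.
    Returns (unfairness, {app: ways}), where unfairness = max / min slowdown.
    """
    apps = list(slowdown)
    best = None
    for alloc in product(range(1, total_ways + 1), repeat=len(apps)):
        if sum(alloc) != total_ways:
            continue  # every way must be assigned to exactly one app
        s = [slowdown[app][w - 1] for app, w in zip(apps, alloc)]
        unfairness = max(s) / min(s)
        if best is None or unfairness < best[0]:
            best = (unfairness, dict(zip(apps, alloc)))
    return best
```

The search is exponential in the number of applications, so it only serves as a baseline for small cases; a heuristic scheme would be judged by how close it gets to this result.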

As for the second goal, we aim to overcome one of the major shortcomings of most existing scheduling schemes for heterogeneous multicores: they were designed to target specific kinds of applications, which makes it difficult for any of them to be adopted in general-purpose settings. To fill this gap, we will design a scheduling infrastructure able to embrace a wider spectrum of applications, making it a suitable candidate for adoption in general-purpose OSs or VMMs. Notably, our approach will rely heavily on performance monitoring counters (PMCs) to characterize application performance and energy efficiency at runtime, and will also employ other system metrics (e.g. CPU utilization) to decide which core type is most suitable to run a specific application. PMCTrack will be a key building block to provide our system with convenient and portable access to PMCs.
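The decision step described above can be sketched as a simple ranking. In the code below each thread carries an estimated big-versus-small speedup (which a real system would derive from PMC readings) and its CPU utilization; the threads that would gain the most are given the big cores. The scoring formula and all names are assumptions made for illustration, not the policy the project will implement.

```python
def assign_core_types(threads, n_big):
    """Give the n_big big cores to the threads expected to benefit most.

    threads: {name: {"speedup": estimated big-vs-small speedup (>= 1.0),
                     "util": CPU utilization in [0, 1]}}
    Returns {name: "big" | "small"}.
    """
    def benefit(name):
        t = threads[name]
        # A mostly idle thread gains little from a big core even if its
        # speedup is high, so the gain is weighted by utilization.
        return (t["speedup"] - 1.0) * t["util"]

    ranked = sorted(threads, key=benefit, reverse=True)
    return {name: ("big" if i < n_big else "small")
            for i, name in enumerate(ranked)}
```

Weighting by utilization is one way of combining the PMC-derived estimate with a system metric, which is the combination the paragraph describes.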

2.3. Machine Learning Integration for Improved Resource Management

The introduction of extended scheduling policies entails a dramatic increase in the complexity of runtime scheduling and co-scheduling. To alleviate this, we propose the use of Sw and Hw techniques to accelerate the scheduling phase and improve its quality. From the Sw perspective, ML can reduce the effort and increase the precision of the scheduling process. Specifically, we will explore reinforcement learning techniques, among others, to demonstrate the validity of this approach when embedded in an actual runtime. In this context, we will propose multi-agent cooperative approaches, in which each individual agent is in charge of optimizing a given dimension (e.g. energy consumption, performance, precision), combined with Q-learning techniques, as an actual implementation integrated within the middleware. The necessary monitoring and/or characterization of tasks will leverage OS services and Hw counters via existing infrastructures (mainly PMCTrack or PAPI). From the Hw perspective, the emergence of new architectures specifically focused on machine learning acceleration will bring new opportunities to accelerate the co-scheduling process. In this way, our approach will yield a Sw/Hw co-design in which the algorithmic improvements to the co-scheduling techniques are themselves accelerated using ad hoc, independent, special-purpose Hw. This Hw includes present and future ML architectures, but the Sw will be designed to be modular enough to support virtually any Hw with acceleration capabilities in the context of machine learning.
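As a minimal sketch of how Q-learning could drive a mapping decision, the code below trains a single tabular agent whose state is the kind of the task to be scheduled and whose action is the core type to run it on. The reward table stands in for feedback that a real runtime would obtain from monitoring (for example via PMCTrack or PAPI); its values, the single-agent simplification, and all names are assumptions for illustration.

```python
import random


def train(reward, steps=5000, alpha=0.1, gamma=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy.

    reward[state][action] is the (hypothetical) reward observed when a
    task of kind `state` runs on core type `action`.
    """
    rng = random.Random(seed)
    states = list(reward)
    q = {s: {a: 0.0 for a in reward[s]} for s in states}
    s = rng.choice(states)
    for _ in range(steps):
        actions = list(q[s])
        if rng.random() < epsilon:
            a = rng.choice(actions)           # explore
        else:
            a = max(actions, key=q[s].get)    # exploit current estimate
        r = reward[s][a]
        nxt = rng.choice(states)              # next task kind arrives at random
        # Standard Q-learning update toward r + gamma * max_a' Q(s', a').
        q[s][a] += alpha * (r + gamma * max(q[nxt].values()) - q[s][a])
        s = nxt
    return q


def greedy_policy(q):
    """Best known action for each state."""
    return {s: max(q[s], key=q[s].get) for s in q}
```

In a multi-agent setting, one such agent per optimization dimension (energy, performance, precision) would each learn its own table, and their preferences would then have to be combined; that coordination step is not shown here.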

3. Application Level

3.1. Exploring performance of DSAs with out-of-domain applications

The goal is to explore and leverage the potential performance gains of architectures conceived for a specific domain (domain-specific architectures, DSAs) when running applications that do not belong to that domain. In particular, we aim to study new machine learning DSAs for general-purpose computing. Given the particularities of these architectures (mainly their efficiency in reduced-precision computing), a proper analysis and selection of applications (or parts of applications) that tolerate reduced or mixed precision is crucial to exploit the potential performance and energy gains of new DSAs while maintaining correct results. Three fields to which the group members have previously contributed are promising: dense linear algebra, where mixed precision with iterative refinement can benefit many primitive operations; video processing, where some building blocks (e.g. DCT, DWT, motion estimation) can be performed using reduced precision; and hyperspectral image processing, which usually does not have high precision requirements. In addition, in combination with the task schedulers that we plan to develop within objective 2, we will explore strategies to exploit the heterogeneity in present (Volta GPUs) and future (Intel Xeon, IMGTEC and ARM SoCs) architectures featuring discrete ML co-processors. In the same spirit of thinking (and mapping) outside the box, we plan to evaluate FPAAs and Programmable SoCs (PSoCs) using applications far from signal conditioning.
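Mixed precision with iterative refinement can be sketched in a few lines: the linear system is solved in low precision, the residual is computed in high precision, and the low-precision solver is reused to compute corrections. The pure-Python sketch below emulates float32 by rounding through `struct`, and for brevity it re-runs Gaussian elimination on every step rather than reusing a stored factorization, as a practical implementation would. All function names are assumptions for illustration.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low(A, b):
    """Gaussian elimination with partial pivoting, rounded to float32."""
    n = len(b)
    M = [[f32(v) for v in row] + [f32(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = f32(M[r][c] / M[c][c])
            for j in range(c, n + 1):
                M[r][j] = f32(M[r][j] - f32(f * M[c][j]))
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = M[r][n]
        for j in range(r + 1, n):
            s = f32(s - f32(M[r][j] * x[j]))
        x[r] = f32(s / M[r][r])
    return x


def refine(A, b, iters=5):
    """Low-precision solves plus float64 residuals and updates."""
    x = solve_low(A, b)
    for _ in range(iters):
        # Residual r = b - A x, computed in float64.
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        d = solve_low(A, r)  # correction from the cheap solver
        x = [xi + di for xi, di in zip(x, d)]
    return x
```

For well-conditioned systems the refined solution approaches float64 accuracy even though the expensive elimination work runs entirely in reduced precision, which is the part a reduced-precision DSA could accelerate.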

3.2. Developing and mapping of applications onto heterogeneous platforms

In this sub-objective we will focus on the development and/or mapping of diverse applications, representative of different domains, which will allow us to evaluate the architectural and system-level modifications proposed in previous objectives. The mere exercise of mapping these large applications onto heterogeneous platforms will provide rich feedback to refine the goals pursued by previous tasks. Among other applications, we will focus on Future Video Encoding, some building blocks for an Advanced Driver Assistance System (ADAS), sequence alignment, and an IoT predictive model for Solar Irradiance Forecasting (in collaboration with CIEMAT).