

Integration of a Manycore Accelerator in a High-Performance Processor

Keywords: HPC, MPPA, cache coherency

Current challenges in computer architecture are motivated by high-performance computing (HPC) systems, which now target exascale applications, and by embedded high-performance computing (eHPC) computers, which are required by cyber-physical applications such as autonomous driving systems. In both cases, general-purpose CPUs must be complemented by compute accelerators, as only accelerators can offer the energy efficiency and the time predictability respectively required by these applications. In classic HPC systems, the traditional compute accelerators are GPGPUs (General-Purpose Graphics Processing Units), first applied to numerical computing but now largely used for machine learning workloads. In eHPC systems, the traditional compute accelerators are FPGAs; however, the growing need for deep neural network inference motivates different accelerators, if possible software-programmable (as opposed to the hardware programmability of FPGAs) and using standard environments (as opposed to proprietary environments such as NVidia CUDA).
While modern application processors rely on multicore CPUs, scaling to hundreds or thousands of cores requires a different class of computer architectures, known as manycore. In a manycore architecture, a 'clustering' of cores into multiprocessor 'compute units' becomes architecturally visible.
For instance, cache coherency can be restricted to the compute unit, or all cores in the compute unit may share the same local memory, DMA engine, or node in the processor global interconnect. The most widely used manycore architecture today is the GPGPU, as introduced by the NVidia Fermi generation. In this and subsequent GPGPU generations, compute units are called 'streaming multiprocessors' and are composed of 'streaming cores' that share a local memory, a coherence domain, a memory access coalescing mechanism, and support for fast synchronization between cores. Besides GPGPUs, manycore architectures based on full-featured cores and compute units with local memory have been successful in HPC.
The objective of the research project is the adaptation and integration of Kalray MPPA compute units, called MPPA clusters, into a high-end ARM-based system-on-chip designed by Bull-Atos for use in HPC systems, datacenters, and autonomous driving systems in the 2020–2022 time frame. This research is supported by two ambitious European projects: the Mont-Blanc 2020 FET-HPC project, led by Bull-Atos, and the European Processor Initiative (EPI) SGA-1 project, led by Bull-Atos, CEA, BSC (Barcelona Supercomputing Center), and Infineon.
These projects aim to produce competitive European computing technology, and have been engineered as parts of a series that will include EPI SGA-2 (2020) and EPI SGA-3 (2022). Within these projects, Kalray will provide CPU acceleration resources of the kind classically available from GPGPUs.
While Kalray acceleration is targeting autonomous driving systems and AI inference for edge computing, the BSC accelerator focuses on traditional HPC applications, based on compute units composed of RISC-V cores extended with vector units.
The research will contribute to the selection of the memory model for the accelerator, either full coherence or I/O coherence. The level of support of atomic operations in system memory and in accelerator local memory will also be refined. Then the virtual-to-physical mapping and its implementation will be determined. Alternate physical address mapping schemes that reduce row buffer conflicts in DDR and HBM external memories will also be evaluated. Investigations will consider both the 'best-effort' scenarios typical of high-performance computers and datacenters, and the 'time-predictable' scenarios motivated by autonomous driving systems.
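To make the row-buffer-conflict question concrete, here is a minimal Python sketch of one classic alternate mapping: XOR-based (permutation) bank interleaving, which folds low-order row bits into the bank index so that power-of-two-strided accesses, which would otherwise pile up in one bank, spread across all banks. The bit widths and helper names are illustrative assumptions for this sketch, not the scheme selected by the thesis.

```python
# Illustrative DRAM address decomposition: row | bank | column.
# The bit widths below are assumptions for this sketch only.
COL_BITS, BANK_BITS = 10, 3
BANK_MASK = (1 << BANK_BITS) - 1

def naive_map(addr):
    """Plain row/bank/column split of a physical byte address."""
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & BANK_MASK
    row = addr >> (COL_BITS + BANK_BITS)
    return row, bank, col

def xor_map(addr):
    """Permutation-based interleaving: XOR low row bits into the bank index."""
    row, bank, col = naive_map(addr)
    return row, bank ^ (row & BANK_MASK), col

# A power-of-two stride that steps through successive rows of a single bank:
stride = 1 << (COL_BITS + BANK_BITS)
addrs = [i * stride for i in range(8)]

naive_banks = [naive_map(a)[1] for a in addrs]
xor_banks = [xor_map(a)[1] for a in addrs]
print(naive_banks)  # [0, 0, 0, 0, 0, 0, 0, 0] -> every access conflicts in bank 0
print(xor_banks)    # [0, 1, 2, 3, 4, 5, 6, 7] -> accesses spread over all banks
```

Under the naive mapping, each access opens a new row in the same bank (a row buffer conflict on every access); under the XOR mapping the same address stream hits eight different banks, each keeping its row buffer open.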


Thesis director: Frédéric ROUSSEAU
Thesis supervisor: Frédéric PETROT
Thesis started on: Sept. 2019
Doctoral school: EEATS

Submitted on January 12, 2022

Updated on February 8, 2022