PhD Thesis


"Adapting a HPC runtime system to FPGAs"

Author: G. Christodoulis
Advisor: F. Desprez
Thesis reviewers: S. Niar, C. Perez
Thesis examiners: O. Muller, R. Namyst, D. Novo, F. Brodequis
Doctoral thesis, Université Grenoble Alpes
Defense: December 5, 2019


Along with traditional CPU cores, processing units of different architectures are now employed by the High Performance Computing (HPC) community to obtain improved efficiency and performance. A Field Programmable Gate Array (FPGA) is a hardware fabric composed of interconnected re-programmable logic and memory blocks. This type of processing unit is considered a promising candidate to amplify the efficiency and computational power of heterogeneous HPC platforms, since it offers massive parallelism and a reduced number of abstraction layers between the application and the actual hardware.
However, exploiting FPGAs requires in-depth knowledge of low-level hardware design and expertise in vendor-provided tools, which is not aligned with the skills of HPC application programmers. In the scope of this thesis, we have designed a framework that allows straightforward development of scientific applications over heterogeneous platforms enhanced with FPGAs. Using this framework requires only limited knowledge of the underlying architecture, and an FPGA can be used in the same way as any other type of processing unit. At the core of the proposed framework lies the StarPU heterogeneous runtime system, which we extended to support this new type of accelerator: it hides from the programmer the complex operations deriving from the underlying architecture, while allowing fine control over performance through different scheduling strategies. For the communication part, we created CONOR, a communication library based on RIFFA that enforces consistency in scenarios with concurrent accesses to the FPGA.
The approach proposed by our framework is evaluated along two axes: programmability, and the performance overhead imposed by the additional components attached to the FPGA. Both are assessed using a basic blocked version of matrix multiplication, showing the ease of development and the negligible overhead that FPGA management imposes on the rest of the framework. In addition to these experiments, we created an efficient hardware design of GEMM that will allow the execution of more complex and interesting applications such as the Cholesky decomposition.
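To illustrate the blocked decomposition that underlies the evaluation, here is a minimal NumPy sketch of tiled matrix multiplication. It is a generic illustration only, not the thesis's StarPU task code or FPGA kernel: each (i, j) output tile accumulates products of input tiles, which is the unit of work a heterogeneous runtime could dispatch to a CPU, GPU, or FPGA worker.

```python
import numpy as np

def blocked_matmul(A, B, block=64):
    """Multiply A (n x m) by B (m x p) tile by tile.

    Each inner iteration computes one tile-product and accumulates it
    into the corresponding output tile; a task-based runtime would
    schedule these tile-products as independent tasks.
    """
    n, m = A.shape
    m2, p = B.shape
    assert m == m2, "inner dimensions must match"
    C = np.zeros((n, p), dtype=A.dtype)
    for i in range(0, n, block):
        for j in range(0, p, block):
            for k in range(0, m, block):
                # NumPy slicing clips automatically at matrix edges
                C[i:i+block, j:j+block] += (
                    A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
                )
    return C
```

The block size would, in practice, be chosen to match the accelerator's local memory; the result is identical to a plain matrix product.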