Louka YERLY | TIMA - Université Grenoble Alpes

PhD in Computer Architecture (HW/SW interaction between operating system and microarchitecture)

SLS

Keywords: Computer Architecture, Microarchitecture, Operating System, HW/SW

Abstract: Modern general-purpose processors leverage speculation at the hardware level to accelerate program execution. Typical examples include hardware branch prediction, cache replacement policies, data and instruction prefetching, etc. Those mechanisms are typically designed with user code in mind, as many embedded and desktop-level applications spend most of their time executing their own code rather than system calls. However, we can envision two scenarios where applications end up spending a significant amount of time in kernel code:
• 'Datacenter' applications such as web servers, databases, etc. make significant use of I/O primitives.
• Following the end of Dennard scaling, systems-on-chip now feature accelerators dedicated to specific applications (e.g., GPU for graphics processing, NPU/TPU for machine learning applications, etc.). In the future, one can imagine that most of the code will be run on accelerators, while the general-purpose processor will mostly be used to run the system.
Given this, it appears that efficiently executing system code may become comparatively more important than it was for embedded and desktop-level applications.
This thesis will focus on two questions:
• Is system code fundamentally different from user code in terms of, e.g., instruction mix, spatial and temporal data locality, code footprint, etc.?
• Are microarchitectural speculation techniques that were designed with user code in mind well suited to system code?
A first requirement is to be able to analyze (e.g., trace) both user-space and kernel-space code within the same application. This can be achieved through binary translation and instrumentation tools that execute the whole system (e.g., QEMU) and not only the application (e.g., Pin). It may also be achievable through hardware extensions (e.g., Intel Processor Trace) or by adding tracing directly at the RTL level in an open-source processor (e.g., BOOM, XiangShan). Regardless, the goal is to obtain representative execution traces and to replay them offline to compute metrics of interest and find differences between user code and kernel code that could be leveraged at the hardware level.
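As an illustration of the offline replay step, the sketch below splits a trace by privilege mode and computes two of the metrics mentioned above, instruction mix and code footprint. The record format `(pc, mode, op_class)`, the function name `summarize`, and the 64-byte line size are assumptions made for this example only; they do not correspond to the output format of QEMU, Pin, or Intel Processor Trace.

```python
from collections import Counter, defaultdict

LINE_BYTES = 64  # assumed instruction cache line size


def summarize(trace):
    """Summarize a trace per privilege mode.

    `trace` is an iterable of hypothetical records (pc, mode, op_class),
    where mode is "user" or "kernel" and op_class is a label such as
    "alu", "load", "store", or "branch".
    """
    mix = defaultdict(Counter)   # mode -> count of each op_class
    lines = defaultdict(set)     # mode -> distinct code lines touched
    for pc, mode, op_class in trace:
        mix[mode][op_class] += 1
        lines[mode].add(pc // LINE_BYTES)
    return {
        mode: {
            "instructions": sum(mix[mode].values()),
            "mix": dict(mix[mode]),
            "footprint_bytes": len(lines[mode]) * LINE_BYTES,
        }
        for mode in mix
    }
```

Other metrics, such as data reuse distance, could be accumulated in the same pass if the records also carried memory addresses.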
Therefore, as a first step, the doctoral student will be tasked with understanding the different options available to obtain those traces and weighing their pros and cons. They will then set up an infrastructure to obtain traces, knowing that target applications are large (e.g., DeathStarBench), potentially multithreaded, and potentially client-server, and are thus non-trivial to set up for tracing.
As a second step, and based on the analysis performed in the first step, the doctoral student will propose hardware schemes that have the potential to improve system code performance. Those schemes may have to work for both user code and system code, despite the two potentially having different characteristics.
For instance, it is plausible that the data used by a system call will exhibit temporal locality within the system call execution, but will not be reused once the system call has terminated. That is, the next instance of that system call will use different data. Therefore, it may be desirable to tailor the cache replacement policy based on who (user or kernel) brought the data into the cache.
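One possible reading of this idea, given purely as a sketch and not as the scheme the thesis will propose, is an LRU cache set whose insertion position depends on the origin of the fill: lines brought in by the kernel are inserted at the LRU end, so they are evicted first unless they are reused, while lines brought in by user code are inserted at the MRU end as usual. The class name `OriginAwareSet` and its interface are hypothetical.

```python
class OriginAwareSet:
    """One cache set; index 0 is the LRU end, the last index is the MRU end."""

    def __init__(self, ways):
        self.ways = ways
        self.tags = []

    def access(self, tag, origin):
        """Access `tag` on behalf of `origin` ("user" or "kernel").

        Returns True on a hit and False on a miss.
        """
        if tag in self.tags:
            # Hit: promote to MRU, so data reused within a system call stays.
            self.tags.remove(tag)
            self.tags.append(tag)
            return True
        if len(self.tags) == self.ways:
            self.tags.pop(0)          # evict the line at the LRU end
        if origin == "kernel":
            self.tags.insert(0, tag)  # kernel fill: insert at the LRU end
        else:
            self.tags.append(tag)     # user fill: insert at the MRU end
        return False
```

With a 2-way set holding user lines `u1` and `u2`, two consecutive kernel fills `k1` and `k2` evict `u1` and then `k1`, leaving `u2` resident, whereas plain LRU would have evicted both user lines.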

Information

Thesis director: Frédéric PETROT (TIMA - SLS)
Thesis supervisor: Arthur PERAIS (TIMA - SLS)
Thesis started on: December 2024
Doctoral school: MSTII