Efficient hardware implementations of neural networks with intra-layer heterogeneous quantization
Keywords: Number representation, Quantization, Neural networks, Hardware acceleration
Context: Typical digital neural networks consist of large numbers of matrix-vector or matrix-matrix multiplication operations, so the design of dedicated hardware acceleration circuits for digital neural networks mostly focuses on accelerating these matrix operations. The matrices involved are often quite large, each with millions or tens of millions of parameters. The size of the parameter memories and the effective bandwidth to the computation cores are therefore design parameters of paramount importance. It has already been proposed to compress the parameters in memory to reduce the required memory size and bandwidth; this generally exploits the inherent sparsity (zero values) within these parameter matrices. The generalization of this idea is quantization: evaluating how many bits of information are actually needed to represent the different parameters involved in a given neural network model. For inference only, quantization down to 8-bit integers is now relatively common. Extreme quantizations down to 2 bits, ternary numbers (encoded on 2 bits), and even binary values have been applied with success, even though the impact on accuracy is noticeable.
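As an illustration of the basic mechanism, here is a minimal sketch of uniform symmetric quantization in NumPy. It is not part of the thesis subject: the `quantize_uniform` helper, the symmetric scheme, and the chosen bit-widths are assumptions made for the example; real flows calibrate scales per tensor or per channel and may use asymmetric or non-uniform schemes.

```python
import numpy as np

def quantize_uniform(w, bits):
    # Illustrative symmetric uniform quantization of a weight tensor.
    qmax = 2 ** (bits - 1) - 1                    # e.g. 127 for 8 bits, 1 for 2 bits
    scale = np.abs(w).max() / qmax                # map the largest magnitude to qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                               # approximate w as q * scale

w = np.random.randn(4, 4).astype(np.float32)
q8, s8 = quantize_uniform(w, 8)                   # near-lossless in practice
q2, s2 = quantize_uniform(w, 2)                   # codes in {-1, 0, 1}: effectively ternary
print(np.abs(w - q8 * s8).max(), np.abs(w - q2 * s2).max())  # error grows as bits shrink
```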
To ease the design of accelerator circuits for quantized neural networks, it is usually assumed that all values within a complete network, or at least within a complete layer, share the same quantization. The benefit is relatively low-complexity hardware with little control flow. However, the impact on accuracy becomes significant when quantization is pushed very low, below 8 bits.
Previous works have revealed that, within a given layer of a neural network, only a small fraction of the parameters actually need a relatively high quantization (4 or 8 bits), while most of the values can be quantized very low (1 or 2 bits). To push quantization very low without too much impact on accuracy, it is therefore necessary to handle heterogeneous quantization within the layers of a given neural network.

Objectives: The challenge is to devise appropriate transformations of a given neural network model such that intra-layer heterogeneous quantization can be processed efficiently by dedicated hardware acceleration circuits. It is not guaranteed that the most hardware-friendly access patterns to the weight memories are also the ones that preserve accuracy, especially in the context of highly quantized data. In that case, the candidate will evaluate whether adding more neurons or using different quantizations can compensate for the accuracy loss due to compression.
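One way to picture the intra-layer heterogeneous quantization described above is to separate a layer's weight matrix into a low-bit bulk and a small set of high-precision outliers. This is a sketch under assumptions: the `split_mixed_precision` helper, the 2% outlier fraction, and the magnitude-based selection are illustrative choices, not the transformations to be devised in the thesis.

```python
import numpy as np

def split_mixed_precision(w, outlier_fraction=0.02):
    # Keep the k largest-magnitude weights at high precision (e.g. 4 or 8 bits)
    # and leave the rest for very low quantization (e.g. 1 or 2 bits).
    k = max(1, int(outlier_fraction * w.size))
    threshold = np.sort(np.abs(w).ravel())[-k]    # k-th largest magnitude
    outlier_mask = np.abs(w) >= threshold

    bulk = w.copy()
    bulk[outlier_mask] = 0.0                      # bulk to be quantized very low
    outliers = w[outlier_mask]                    # few values kept at high precision
    coords = np.argwhere(outlier_mask)            # their positions must also be stored
    return bulk, outliers, coords
```

Storing the outlier coordinates is what makes the memory access pattern irregular, which is precisely where hardware-friendly layouts come into play.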
The following properties and degrees of freedom will be taken into account:
• the distribution of the different quantizations within the matrices of parameters,
• the structure of the execution units,
• the order and distribution of values in memory (see the packing sketch after this list),
• the impact on overall accuracy.
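As a toy illustration of the memory-layout dimension, here is a sketch of packing 2-bit weight codes four per byte. The `pack_2bit` helper and the chosen packing order are assumptions made for the example, not a proposed design; a real accelerator would pick the packing order to match the structure of its execution units.

```python
import numpy as np

def pack_2bit(q):
    # Pack signed 2-bit codes (values in {-2, ..., 1}) four per byte.
    u = (q.astype(np.int64) & 0b11).ravel()       # two's-complement 2-bit codes
    u = np.pad(u, (0, (-len(u)) % 4))             # pad to a multiple of 4 values
    u = u.reshape(-1, 4)
    return (u[:, 0] | (u[:, 1] << 2) | (u[:, 2] << 4) | (u[:, 3] << 6)).astype(np.uint8)

q = np.array([1, 0, -1, 1, 0, 0, 1, -1], dtype=np.int8)
print(pack_2bit(q))                               # 8 weights stored in 2 bytes
```

Compared with byte-per-weight storage this divides memory footprint and bandwidth by four, but mixing bit-widths within the same matrix breaks such a regular layout, which is the crux of the problem.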
Proof-of-concept implementations will be realized on an FPGA target.
Information
Thesis director: Frédéric PÉTROT (TIMA - MADMAX)
Thesis co-supervisor: Olivier MULLER (TIMA - MADMAX)
Thesis started on: 13/10/2025
Doctoral school: MSTII