

Learning in very low precision

Computer Arithmetic, Deep Learning, Digital Circuits and Systems

The growing need to deploy deep learning applications on embedded devices calls for architectures that are more energy-efficient than the mainstream solutions based on general-purpose processors and GPUs. The use of non-standard arithmetic formats is one promising direction for reducing application-level energy consumption. For inference, very good results have been obtained using ternary networks, i.e. networks whose weights can take only the values -1, 0, and 1. Larger number formats are still used inside neural network accelerators to hold, for instance, partial accumulation results.
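To illustrate the idea, here is a minimal sketch of weight ternarization by thresholding (the threshold value and function name are illustrative, not taken from any specific ternary-network scheme):

```python
import numpy as np

def ternarize(weights, threshold=0.05):
    """Map each real-valued weight to -1, 0, or +1.

    Weights whose magnitude falls below the (hypothetical) threshold
    become 0; the others keep only their sign.
    """
    w = np.asarray(weights, dtype=np.float64)
    return np.where(np.abs(w) < threshold, 0, np.sign(w)).astype(np.int8)

# Example: a small weight vector collapses to three values
print(ternarize([0.3, -0.01, -0.7, 0.04]))  # -> [ 1  0 -1  0]
```

In practice the threshold is often derived from the weight statistics rather than fixed, and the ternary values are scaled by a per-layer constant; this sketch only shows the basic quantization step.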
For learning, there is a consensus that a large dynamic range is needed, and that single-precision (32-bit) floating point offers more accuracy and range than necessary. There is also a consensus that the standard half-precision 16-bit format (5 exponent bits, 10 significand bits), which was designed for graphics, lacks dynamic range for machine learning. Intel and ARM push the bfloat16 format with 8 exponent bits, while IBM pushes its DLFloat format with 7 exponent bits. Radically new formats have also been proposed, such as the posit format by Gustafson or a logarithmic number system proposed by Jeff Johnson at Facebook. In all these cases, the intermediate computations are performed in larger formats, just as in the ternary case.
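The trade-off between the formats above comes down to how the 16 bits are split between exponent and significand. A small sketch, assuming an IEEE-like encoding where the top exponent code is reserved for infinities/NaNs, shows the normal exponent range each exponent width buys:

```python
def exponent_range(exp_bits):
    """Return (e_min, e_max) for normal numbers in an IEEE-like
    binary format with the given exponent-field width.

    bias = 2^(exp_bits-1) - 1; the all-ones exponent code is
    reserved, so the largest finite exponent equals the bias.
    """
    bias = 2 ** (exp_bits - 1) - 1
    e_max = bias        # largest exponent of a finite normal number
    e_min = 1 - bias    # smallest normal exponent
    return e_min, e_max

# half precision (5 exponent bits) vs DLFloat (7) vs bfloat16 (8)
for name, bits in [("fp16", 5), ("DLFloat", 7), ("bfloat16", 8)]:
    e_min, e_max = exponent_range(bits)
    print(f"{name}: 2^{e_min} .. ~2^{e_max + 1}")
```

Note that DLFloat and posits deviate from this IEEE-like encoding in their handling of special values, so the sketch is only indicative; the point is that bfloat16's range (roughly 2^-126 to 2^128) matches single precision, while fp16 tops out near 2^16 = 65536, which is the "lack of dynamic range" mentioned above.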
The objective of this PhD is to explore such arithmetic opportunities with a focus on the learning phase, aiming to use the smallest possible formats. The candidate will build a framework in which new number formats and micro-architectures can be evaluated in terms of application-level metrics (resource consumption, performance, and accuracy). Deployment is initially envisioned on FPGAs, and the candidate will investigate FPGA-specific formats and architectures. The design of autoencoders will serve as a first case study.


Thesis director: Frédéric PETROT
Thesis supervisor: Florent DE DINECHIN (INSA Lyon)
Thesis started on: Feb. 2020
Doctoral school: MSTII

Submitted on January 12, 2022

Updated on February 8, 2022