Methods and tools for resilience

Ongoing work includes methods and tools for both dependability evaluation and dependability improvement, with respect to natural disturbances (e.g., particle impacts) or malicious attacks. Part of the work aims at better characterizing the errors generated by disturbances (for example, the actual errors obtained after a laser shot, depending on the technology used to manufacture the target circuit, or the defects due to ageing). Another part of the work aims at improving the dependability of a circuit during its lifetime. In addition, significant effort targets more efficient testing at the system level. This page summarizes the main actions; more details can be found using the links below.


Protection against faults

Transient faults, due for example to environmental disturbances or manufacturing variations, and permanent faults, due for example to residual defects and ageing, are an increasingly frequent source of failures in digital systems. Various techniques have been studied in the past to detect, correct and/or tolerate the effects of such faults. New methods are still required to tackle new types of faults related to the most recent technologies (including FDSOI, FinFET and 3D integration) and also to take into account intentional (malicious) faults due to attacks. Old assumptions, for example about error multiplicity ("single-bit errors"), can no longer be trusted. More functional or property-based detection approaches are therefore required. The approaches under study are based, for example, on control-flow checking, either for single-processor monitoring (including improved characteristics for the IDSM method previously proposed in the team) or in the context of many-core systems.
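
The general principle of control-flow checking can be illustrated with a minimal sketch; the control-flow graph, block names and checking code below are purely illustrative assumptions and do not describe the IDSM method itself.

# Minimal sketch of control-flow checking (illustrative only, not the
# IDSM method). Each basic block is known at compile time; a monitor
# checks that every observed transition follows a legal edge of the
# statically known control-flow graph.

# Hypothetical control-flow graph: block id -> set of legal successor ids
CFG = {
    "B0": {"B1", "B2"},
    "B1": {"B3"},
    "B2": {"B3"},
    "B3": set(),          # exit block
}

def check_trace(trace):
    """Return True if the observed block trace only follows legal CFG edges."""
    for current, nxt in zip(trace, trace[1:]):
        if nxt not in CFG.get(current, set()):
            print(f"Control-flow error: illegal transition {current} -> {nxt}")
            return False
    return True

# A fault diverting the branch target from B3 back to B0 would be detected:
assert check_trace(["B0", "B1", "B3"])          # legal execution
assert not check_trace(["B0", "B1", "B0"])      # illegal back edge

In a hardware monitor the same check is typically done on compacted signatures rather than explicit block identifiers, but the underlying comparison against the legal graph is the same.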

Self-healing digital blocks

Due to the increasing probability of faults or errors, many systems include as a requirement the capability to react in real time. As shown a few years ago in several national and international roadmaps, the capability not only to detect an error but also to dynamically adapt the system behavior to provide continuous service (even degraded) has become a strong objective. Such systems are often called "self-healing" systems. Various approaches have already been proposed towards this objective, but none has emerged as a significant answer to the needs, especially when the recovery has to be managed at the level of a specialized digital block rather than at the level of a many-core system. Preliminary studies have started in this area; part of the work will be targeted towards increasing the self-healing property of a digital system without resorting to, e.g., task re-allocation between processors. This work is partly in collaboration with researchers in Sousse (Tunisia). Another aspect will cover the self-healing of Network-on-Chip (NoC) structures. It will use a smart adaptive reallocation of the communications to minimize ageing factors and thus their impact on the lifetime, expressed in terms of timing errors but also of logic and functional failures. A combination of monitors and adaptive mechanisms to uniformly distribute the system activity over the NoC blocks will be at the center of the research, to avoid excessive HCI ageing or to recover from NBTI.
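
As a rough illustration of activity balancing on a NoC, the sketch below assumes a 2D mesh and a hypothetical per-router activity counter; it simply selects, among the minimal next hops, the router with the lowest accumulated activity. It is only meant to show the kind of monitor/adaptation loop involved, not the method under study.

# Ageing-aware routing heuristic for a mesh NoC (illustrative sketch):
# among the minimal next hops, always forward a packet through the router
# with the lowest accumulated activity, so that stress (and hence HCI/NBTI
# ageing) is spread more uniformly over the NoC blocks.

stress = {}  # (x, y) router coordinates -> accumulated activity counter

def next_hop(current, destination):
    """Pick the least-stressed router among the X/Y minimal next hops."""
    cx, cy = current
    dx, dy = destination
    candidates = []
    if dx != cx:
        candidates.append((cx + (1 if dx > cx else -1), cy))
    if dy != cy:
        candidates.append((cx, cy + (1 if dy > cy else -1)))
    if not candidates:               # already at destination
        return current
    hop = min(candidates, key=lambda r: stress.get(r, 0))
    stress[hop] = stress.get(hop, 0) + 1   # monitor: record router activity
    return hop

# Example: route one packet across a 4x4 mesh from (0,0) to (3,2)
pos, dst = (0, 0), (3, 2)
while pos != dst:
    pos = next_hop(pos, dst)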

Ageing prediction

Reliability issues due to ageing phenomena (NBTI, HCI, TDDB) are becoming increasingly difficult to assess in advanced technologies at a single design level only, due to diverse device conditions (temperature/voltage) and parameters. Recent investigations show an important dependence of these parameters on the workload, a strong correlation between them, and a significant dispersion of characteristics following a Poisson law. The current physical-level models will be enriched with the latest findings on the NBTI, HCI and TDDB phenomena. Design-in Reliability (DiR) methodologies seek to provide a quantitative assessment of CMOS digital reliability, the innovative aspect being the combination of evaluations at multiple levels. Future research directions will target the dependency of SoC ageing evaluation on the workload by using co-simulation methods that are more accurate than current Statistical Static Timing Analysis. The objective here is to define an evaluation flow, developed in association with a CAD vendor, that can easily be adopted by industrial teams for reliability (lifetime) prediction.
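
As an illustration of the workload dependence mentioned above, the sketch below uses a textbook-style power-law NBTI model with purely hypothetical coefficients; it is not one of the physical models refined in this work, only an indication of how stress duty cycle, temperature and time enter a first-order degradation estimate.

import math

# Illustrative first-order NBTI model (power-law form, hypothetical
# coefficients). The threshold-voltage shift grows with stress time, with
# temperature (Arrhenius term) and with the fraction of time the PMOS
# device is under stress (workload dependence).

K_BOLTZMANN = 8.617e-5        # eV/K

def nbti_delta_vth(t_years, temp_kelvin, stress_duty_cycle,
                   a0=5e-3, ea=0.1, n=0.2):
    """Threshold-voltage shift (volts) after t_years of operation."""
    t_seconds = t_years * 365 * 24 * 3600
    arrhenius = math.exp(-ea / (K_BOLTZMANN * temp_kelvin))
    return a0 * arrhenius * (stress_duty_cycle ** n) * (t_seconds ** n)

# Two workloads on the same gate: the heavily stressed one ages faster.
print(nbti_delta_vth(5, 358.0, 0.9))   # ~90% stress duty cycle
print(nbti_delta_vth(5, 358.0, 0.1))   # ~10% stress duty cycle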

Dependability analysis

In recent years, significant work has targeted efficient fault injection techniques to validate, early in the design flow, the level of robustness achieved by a given design, for a given application, with respect to a given error model. Although hardware emulation techniques considerably decrease experiment time compared with simulation, the evaluations remain very long. Recently, work has been done to reduce the evaluation time in the case of a processor-based system running a given application software. This work is based on an accurate modeling of the processor micro-architecture and on a single simulation trace. Further work is currently focused on the case of complex digital blocks without processor and software. Avoiding lengthy fault injections requires a more analytical evaluation, based on the circuit structure and on its usage, to identify the lifetime of data in the various internal registers for a given application.
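
The kind of analysis involved can be sketched as follows; the trace format and numbers are assumptions, used only to show how register data lifetimes extracted from a single simulation trace translate into a vulnerable fraction of the execution, instead of injecting faults exhaustively.

# Sketch of an analytical alternative to exhaustive fault injection
# (illustrative; event names and trace format are assumptions). From one
# simulation trace of register accesses, compute for each register the
# cycles during which its content is "live" (written and still to be read).
# A fault hitting the register outside its live intervals cannot corrupt
# the application result, so only the live fraction matters.

def live_cycles(events):
    """events: chronologically sorted (cycle, 'W'|'R') list for one register.
    A value is exposed from its write until the last read before the next write."""
    exposed = 0
    i, n = 0, len(events)
    while i < n:
        cycle, kind = events[i]
        if kind == 'W':
            j, last_read = i + 1, None
            while j < n and events[j][1] != 'W':
                if events[j][1] == 'R':
                    last_read = events[j][0]
                j += 1
            if last_read is not None:
                exposed += last_read - cycle
            i = j
        else:
            i += 1
    return exposed

# Register written at cycle 10, read at 15 and 40, overwritten at 50:
trace = [(10, 'W'), (15, 'R'), (40, 'R'), (50, 'W')]
vulnerability = live_cycles(trace) / 100.0   # fraction of a 100-cycle run
print(vulnerability)   # 0.3 -> a flip only matters 30% of the time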

3D circuits

With recent advances in manufacturing processes, 3D integration has become possible, bringing many advantages such as shorter wires, reduced delays, improved IC form factor, and heterogeneous integration of reusable IPs. To bring 3D design into the mainstream, many challenges remain besides the problems related to the manufacturing process. Reliability and yield are still an issue, as are design partitioning, placement and routing, and timing and power estimation in superposed dies. Moreover, EDA tools have to support the 3D flow. Power consumption estimation is one of the major challenges in 3D ICs. Compared with a 2D power estimation flow, there are many more challenges and options to consider in 3D: taking into account dynamic and static power per tier, and their correlation across tiers, at reasonable engineering cost; taking into account heterogeneous components, sometimes poorly characterized from the power estimation perspective; and modeling heat dissipation from the 3D stack. The goal here is to start investigating strategies and heuristics able to lead us to a proper partitioning and floorplan of 3D systems. The work will then be oriented towards power and heat dissipation estimation, per tier and correlated across tiers. These aspects will be investigated at the highest possible level of abstraction; we typically plan to target the RT level, which is probably a good trade-off to iterate quickly with an appropriate level of accuracy. This work will be performed with Atrenta Inc., 3D EDA market leader.
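
A toy sketch of such a partitioning heuristic is given below, assuming made-up block names, areas and nets; it greedily balances tier areas while limiting cut nets (each of which would cost a TSV). It only illustrates the kind of trade-off involved, not the strategies to be developed.

# Toy greedy heuristic for partitioning blocks over two tiers of a 3D stack
# (illustrative data). Goal: keep tier areas balanced while limiting
# inter-tier nets, since each cut net costs a TSV.

blocks = {"cpu": 4.0, "dsp": 3.0, "sram": 2.5, "acc": 2.0, "io": 1.0}
nets = [("cpu", "sram"), ("cpu", "dsp"), ("dsp", "acc"), ("io", "cpu")]

def partition(blocks, nets, slack=0.6):
    cap = slack * sum(blocks.values())       # max area allowed per tier
    tiers = {0: set(), 1: set()}
    area = {0: 0.0, 1: 0.0}
    # place the biggest blocks first, each on the tier that adds fewer cut
    # nets, breaking ties by the less loaded tier
    for name in sorted(blocks, key=blocks.get, reverse=True):
        def cut_nets(t):
            other = tiers[1 - t]
            return sum(1 for a, b in nets
                       if (a == name and b in other) or (b == name and a in other))
        feasible = [t for t in (0, 1) if area[t] + blocks[name] <= cap]
        if not feasible:                      # relax balance if nothing fits
            feasible = [min((0, 1), key=lambda t: area[t])]
        best = min(feasible, key=lambda t: (cut_nets(t), area[t]))
        tiers[best].add(name)
        area[best] += blocks[name]
    return tiers

print(partition(blocks, nets))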

System-level test

In order to guarantee high reliability, manufacturing and on-line testing are crucial. With the growing complexity of Systems on Chip, system-level test and debug is becoming more and more challenging. As with Boundary Scan in the 1990s, new system-level test standards are being developed and are gaining popularity in industry. Two of them, IEEE 1500 and, more recently, IEEE 1687, allow system-level testing but have some limitations in terms of scalability and flexibility when targeting complex heterogeneous many-core/3D systems. An IEEE study group for "System JTAG" (SJTAG) is also reaching the PAR (Project Authorization Request) stage. We intend to extend/adapt these standards to allow not only the manufacturing test of such large systems but also their on-line testing when embedded in their functional environment. Another goal is to take advantage of all these new standards to monitor the system for both system debug and on-line testing, using hardware and software techniques and adapting validation techniques to manufacturing and on-line testing.
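
The access-time side of the scalability issue can be illustrated with a toy model of a reconfigurable scan network in the spirit of IEEE 1687; the instrument names, register lengths and helper below are assumptions for illustration, not taken from the standard.

# Toy model of a reconfigurable scan network (IEEE 1687 spirit, illustrative
# data). Each instrument sits behind a Segment Insertion Bit (SIB); opening a
# SIB inserts the instrument's register into the active scan path, so access
# time depends on which instruments are currently selected.

instruments = {          # instrument name -> test data register length (bits)
    "temp_sensor": 12,
    "bist_ctrl":   32,
    "noc_monitor": 24,
}

def scan_path_length(open_sibs):
    """Bits shifted per access: one bit per SIB plus each opened register."""
    return len(instruments) + sum(length for name, length in instruments.items()
                                  if name in open_sibs)

# Accessing only the NoC monitor keeps the path short; opening everything
# (e.g., for manufacturing test) lengthens every shift operation.
print(scan_path_length({"noc_monitor"}))       # 3 + 24 = 27 bits
print(scan_path_length(set(instruments)))      # 3 + 68 = 71 bits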

For more details ...

Hardware/Software dependability analysis from RT-Level descriptions

RT-Level design for reliability/safety/availability and/or security