20th IEEE International On-Line Testing Symposium
Hotel Cap Roig, Platja d'Aro, Catalunya, Spain, July 7-9, 2014

Keynote Talk

Keynote speaker: Prof. Murali Annavaram, University of Southern California

Title: GPU Reliability: Why it matters and what we can do about it?

Abstract: In the past 10 years, graphics processing units (GPUs) have moved from gamers' darlings to the backbone of supercomputing infrastructure. While computational inaccuracies can barely be tolerated even in multimedia domains, in scientific and general-purpose domains computational integrity becomes sacrosanct. Unlike large out-of-order processors, GPUs use the last available transistor for non-speculative computation, where errors cannot be easily masked. Unfortunately, technology progression is not on our side in this battle. Smaller dimensions increase the soft error vulnerability of SRAMs, while logic circuits face fast wearout. Hence, new solutions must be explored for improving computing fabric in general, and GPU fabric in particular.

This talk first looks at accurately modeling the vulnerability of caches to soft errors. We will discuss the challenges in accurately quantifying the FIT rate (failures in time, i.e., failures per billion hours of operation) of caches that are protected by complex error correction schemes. We will answer questions such as: is parity sufficient to meet a given FIT goal, or do I really need to use a SECDED code?

In the second half of this talk we will look at mechanisms for verifying the computational integrity of hundreds of execution units in GPUs. We will look at both error detection and error correction schemes that take advantage of resource replication and resource underutilization in GPUs to provide strong computational integrity guarantees. Along the way, I hope to fire our collective imagination for new research directions to improve reliability.

Bio: Murali Annavaram has been a faculty member in the Ming Hsieh Department of Electrical Engineering at the University of Southern California since 2007. He currently holds the Robert G. and Mary G. Lane Early Career Chair. His research focuses on energy efficiency and reliability of computing platforms. His group also works on energy-efficient sensor management for body area sensor networks for continuous and real-time health monitoring. Murali received an NSF CAREER award in 2010 and an IBM Faculty Partnership award in 2009.
Prior to his appointment at USC, he was a senior research scientist at the Intel Microprocessor Research Labs from 2001 to 2007, working on energy-efficient server design and 3D stacking architectures. In 2007 he was a visiting researcher at the Nokia Research Center, Palo Alto, working on virtual-trip-line-based traffic sensing. His work on Energy Per Instruction Throttling at Intel is the foundation for Turbo Boost, which improves performance at a fixed power budget. His work on Virtual-Trip-Lines at Nokia formed the foundation for the Nokia Traffic Works product, which provides real-time traffic sensing using mobile phones. He received the Ph.D. degree in Computer Engineering from the University of Michigan, Ann Arbor, in 2001. He is a Senior Member of IEEE and ACM.
More info at http://www.usc.edu/dept/ee/scip/.