FROM RESEARCH TO PRODUCT:
RAS FEATURES IN EPYC AND RADEON INSTINCT

VILAS SRIDHARAN
DEMAND FOR BETTER EXPERIENCES

VOICE, GESTURE, FACE RECOGNITION
SUPER HIGH RESOLUTION DISPLAYS
VR, AR

HUGE DEMAND FOR MORE COMPUTE

BIG DATA ANALYTICS
HIGH-PERFORMANCE COMPUTING
MACHINE LEARNING
Cloud Service Providers
- IaaS/PaaS
- Media
- Social
- SaaS

Enterprise IT
- Virtualization
- SDS/HCI
- Hadoop
- NoSQL

High Performance Compute
- Design & Simulation
- Research & Academia
- Machine Learning
- Supercomputing
DESIGNED FOR THE CLOUD

AMD RADEON INSTINCT™ MI50

World’s First 7nm GPU
Machine Learning Operations for Training and Inference
Flexible Architecture for Different Workloads
End-to-End ECC Protection
DATA CENTER TRENDS

- High reliability to help enable data center growth
- Advanced availability to help improve customer experience
- Robust serviceability to help reduce data center costs
- Justify RAS features with data
MEMORY TRENDS
DRAM BEHAVIORS

20 – 50% multi-bit

50% permanent

ENDNOTES: 2, 3, 4, 5
BUS SPEED

Graph of DDR3 relative rate versus bus clock, with series for address parity and ECC, annotated "Projected increase".
EFFECTIVE REMEDIATION

- Single Bit Correct
- Single Device Correct

42x reduction

Graph of relative event rate by month, with series for address parity and ECC, annotated "Projected decrease".
PRODUCT FEATURES

DDR4 SUBSYSTEM

- DRAM ECC with x4 DRAM device correction
- DRAM address/command parity, write CRC—with replay
- Patrol and demand scrubbing
- Data poisoning and Machine Check recovery
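
The detect-and-replay idea behind the parity and write CRC bullet can be sketched in a few lines. This is a simplified model under stated assumptions, not the DDR4 protocol: `even_parity`, `send_with_replay`, `flaky_link`, and the `link` callable are hypothetical names, the parity bit itself is assumed to arrive intact, and a real memory controller does all of this in hardware.

```python
def even_parity(value: int) -> int:
    """Return 1 if `value` has an odd number of 1 bits, else 0."""
    return bin(value).count("1") & 1


def send_with_replay(address: int, link, max_replays: int = 3):
    """Send `address` over `link`; replay while the parity check fails.

    `link` is a hypothetical callable that models the bus and may flip bits.
    Returns (received_address, replays_used).
    """
    parity = even_parity(address)  # parity bit computed by the sender
    for replays in range(max_replays + 1):
        received = link(address)
        if even_parity(received) == parity:  # receiver-side check
            return received, replays
    raise RuntimeError("parity error persisted after replay")


def flaky_link():
    """Build a link that flips one bit on the first transfer only."""
    calls = {"n": 0}

    def link(address: int) -> int:
        calls["n"] += 1
        return address ^ 0b1000 if calls["n"] == 1 else address

    return link


received, replays = send_with_replay(0x1A2B, flaky_link())
print(hex(received), replays)
```

A single parity bit only catches an odd number of flipped bits, which is one reason the feature list pairs it with CRC and ECC rather than relying on it alone.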
SERVICE COSTS

Conventional DRAM

Stacked DRAM

Increased replacement costs

ENDNOTES: 7
MEMORY BANDWIDTH

<table>
<thead>
<tr>
<th>Type</th>
<th>Bandwidth Per Socket (GB/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>HBM2</td>
<td>1000</td>
</tr>
<tr>
<td>DDR4 (high)</td>
<td>200</td>
</tr>
<tr>
<td>DDR4 (low)</td>
<td>100</td>
</tr>
<tr>
<td>DDR3 (high)</td>
<td>50</td>
</tr>
<tr>
<td>DDR3 (low)</td>
<td>25</td>
</tr>
<tr>
<td>DDR2 (high)</td>
<td>10</td>
</tr>
<tr>
<td>DDR2 (low)</td>
<td>5</td>
</tr>
</tbody>
</table>

Leverage for RAS?
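
The top rows of the table can be roughly reproduced from the per-socket assumptions in endnote [6] (8 DDRx channels, or 4 HBM2 stacks). The sketch below is illustrative only: the function name is hypothetical, and the interface widths and data rates (64-bit DDR4 channels at 3200 MT/s, 1024-bit HBM2 stacks at 2000 MT/s per pin) are assumed values, not figures taken from the slides.

```python
def peak_bandwidth_gb_s(interfaces: int, bus_width_bits: int, mt_per_s: int) -> float:
    """Peak bandwidth in GB/s for `interfaces` channels or stacks.

    bytes per transfer = bus_width_bits / 8; MT/s * bytes = MB/s; / 1000 = GB/s.
    """
    bytes_per_transfer = interfaces * bus_width_bits // 8
    return bytes_per_transfer * mt_per_s / 1000


# Assumed configurations (not from the slides).
ddr4 = peak_bandwidth_gb_s(interfaces=8, bus_width_bits=64, mt_per_s=3200)
hbm2 = peak_bandwidth_gb_s(interfaces=4, bus_width_bits=1024, mt_per_s=2000)

print(ddr4)  # 204.8
print(hbm2)  # 1024.0
```

Under these assumptions the ratio is about 5x, in line with the HBM2 and DDR4 (high) rows.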
REDUNDANT MEMORY

Can provide
- Improved reliability

Not the right tradeoff for many markets
- Lower capacity
- Reduced bandwidth
- Unpredictable performance
PRODUCT FEATURES

HBM2 SUBSYSTEM

- Single bit correction ECC
- Multi-bit detection CRC
- Stores data XOR address

SOURCE: AMD
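
One way to see why storing data XOR address helps: if a fault makes the array return the contents of the wrong location, undoing the XOR with the requested address yields a value that no longer matches its check code. The sketch below is a toy model under loud assumptions. How the HBM2 subsystem actually folds the address into the stored word is not stated in the slides; `write_word`, `read_word`, and `decoded_addr` are hypothetical names, and `zlib.crc32` stands in for the hardware check code.

```python
import zlib


def _check(data: int) -> int:
    """Check code over a 64-bit data word (CRC-32 used as a stand-in)."""
    return zlib.crc32(data.to_bytes(8, "little"))


def write_word(mem: dict, addr: int, data: int) -> None:
    """Store data XOR address, plus a check code over the original data."""
    mem[addr] = (data ^ addr, _check(data))


def read_word(mem: dict, addr: int, decoded_addr=None) -> int:
    """Read `addr`; `decoded_addr` models a fault that selects another row."""
    actual = addr if decoded_addr is None else decoded_addr
    stored, check = mem[actual]
    data = stored ^ addr  # undo the XOR with the address that was requested
    if _check(data) != check:
        raise ValueError("address or data error detected")
    return data


mem = {}
write_word(mem, 0x10, 0xDEADBEEF)
write_word(mem, 0x20, 0xDEADBEEF)  # same data at a second address

print(hex(read_word(mem, 0x10)))  # fault-free read: 0xdeadbeef
try:
    read_word(mem, 0x10, decoded_addr=0x20)  # array returns the wrong row
except ValueError as err:
    print(err)
```

Both locations hold the same data, so a check code over the data alone would accept the misdirected read; mixing in the address is what exposes it.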
PROCESSOR TRENDS
TRANSIENT UPSETS

- Transient upset rate: substantial reduction
- Percent multi-bit upsets: continued increase

SOURCE: AMD   ENDNOTES: 9, 10
REDUCED VOLTAGE

Graph of probability versus normalized supply voltage (VDD), with series for read (1000MHz) and write (1000MHz), annotated "Exponential increase".
AVF ANALYSIS

- Single Bit AVF

- Optimized protection

- Logical Interleave

- Physical Interleave
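
The effect of physical interleaving on a multi-bit upset can be illustrated with a small model. This is a sketch under assumptions, not the analysis behind the slide: the row layout, the 8-bit word size, and the names `ecc_word_of` and `worst_word_damage` are all hypothetical.

```python
from collections import Counter


def ecc_word_of(cell: int, word_bits: int, interleave: int):
    """Map a physical cell index in a row to the ECC word that owns it.

    With interleave == 1, each word occupies `word_bits` adjacent cells.
    With interleave == N, N words share a group of N * word_bits cells and
    neighbouring cells rotate through those N words.
    """
    group = cell // (word_bits * interleave)
    return (group, cell % interleave)


def worst_word_damage(upset_cells, word_bits: int, interleave: int) -> int:
    """Largest number of flipped bits that lands in any single ECC word."""
    counts = Counter(ecc_word_of(c, word_bits, interleave) for c in upset_cells)
    return max(counts.values())


upset = [8, 9, 10]  # one strike flipping three physically adjacent cells

print(worst_word_damage(upset, word_bits=8, interleave=1))  # 3
print(worst_word_damage(upset, word_bits=8, interleave=4))  # 1
```

Without interleaving, all three flips fall in one word; with 4-way interleaving, each affected word sees a single flipped bit, which a single-bit-correcting code can repair.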
PRODUCT FEATURES

CACHE HIERARCHY

- Fast private L2 cache
- Fast shared L3 cache
- Double bit correct, triple bit detect ECC on L2, L3, and queues
- Interleaving in L2 and L3
- Separate L2/L3 voltage rail (Vddm)
GPU TRENDS
COMPUTE THROUGHPUT

Graph showing the increase in compute throughput from 2011 to 2017, with a significant jump around 2015.

Leverage for RAS?

~100x

Source: AMD
REDUNDANT MULTITHREADING

Sphere of Replication (SoR)

Input Replication

Output Comparison

Global memory

Thread A

= ?

Thread A'

Comm Buff

Global memory

ENDNOTES: 15
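
The replicate-and-compare flow in the diagram can be sketched sequentially. This is a minimal model under assumptions, not a GPU implementation: `run_redundant`, `OutputMismatch`, and `inject_fault` are hypothetical names, the two "threads" run one after the other, and a plain dict stands in for global memory.

```python
class OutputMismatch(Exception):
    """Raised when the redundant copies disagree at the output comparison."""


def run_redundant(kernel, inputs, global_memory, key, inject_fault=None):
    """Run `kernel` twice on replicated inputs; commit only if outputs match."""
    out_a = kernel(list(inputs))  # Thread A works on its own input copy
    out_b = kernel(list(inputs))  # Thread A' works on a second copy
    if inject_fault is not None:
        out_b = inject_fault(out_b)  # model a transient error in one copy
    if out_a != out_b:  # output comparison at the edge of the sphere
        raise OutputMismatch(f"{out_a!r} != {out_b!r}")
    global_memory[key] = out_a  # only compared results leave the sphere
    return out_a


memory = {}
print(run_redundant(sum, [1, 2, 3], memory, "ok"))  # 6

try:
    run_redundant(sum, [1, 2, 3], memory, "bad", inject_fault=lambda v: v ^ 1)
except OutputMismatch as err:
    print("detected:", err)
```

The comparison detects the error but cannot tell which copy is correct, and every result is computed twice before it is written.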
REDUNDANT MULTITHREADING

Unpredictable performance

Graph of per-benchmark results for Bins, BO, Bits, BkSch, DCT, DWT, FWT, FW, MM, NB, PS, QRS, R, SC, SF, and URNG.
REDUNDANT MULTITHREADING

Fundamental bottlenecks
ECC ANALYSIS

Graph of area overhead (%) over time for Design 1 through Design 4, annotated "Optimized design".

SOURCE: AMD
PRODUCT FEATURES

GRAPHICS ENGINE

- ECC on all important arrays
- Modest die area overhead
- Low performance overhead
- Better correction than RMT

SOURCE: AMD
ENTERPRISE-CLASS RAS FEATURES

Understand market requirements
Adapt to technology trends
Optimize design to meet customer needs

Graph of Top500 core count over time, on an axis from 0 to 8E+07.

ENDNOTES: 1
RELIABLE COMPUTATION FOR THE MODERN DATACENTER
ACKNOWLEDGEMENTS

- Xun Jian, Rakesh Kumar, University of Illinois Urbana-Champaign
- Nathan Debardeleben, Elisabeth Moore, Qiang Guan, Sean Blanchard, Ultrascale Systems Research Center, Los Alamos National Laboratory
- Jon Stearley, Kurt B. Ferreira, Scott Levy, Scalable Architectures, Sandia National Laboratories
- Devesh Tiwari, Christian Engelmann, Saurabh Gupta, Oak Ridge National Laboratory
- John Shalf, Computational Research Division, Lawrence Berkeley National Laboratory
- David R Kaeli, Northeastern University
- Kevin Skadron, University of Virginia
- Larry Kaplan and many others at Cray
- David Rohr, Gvozden Neskovic, Prof. Dr. Volker Lindenstruth, Frankfurt Institute for Advanced Studies (FIAS) / GSI Helmholtzzentrum für Schwerionenforschung
- Many others at the U.S. national labs
DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof, Radeon and Ryzen are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. EA and the EA logo are trademarks of Electronic Arts, Inc. Microsoft is a registered trademark of Microsoft Corporation in the US and other jurisdictions.
ENDNOTES


[6] What is the difference between SDRAM, DDR1, DDR2, DDR3 and DDR4? https://www.transcend-info.com/Support/FAQ-296/. Uses DDRx and HBM2 data rates defined by the specification when released. Assumes 8-channel DDRx per socket or 4 stacks of HBM2 per socket.

