
Doctoral Thesis of Fernando Fernandes dos Santos

Event Details


Student: Fernando Fernandes dos Santos
Advisor: Prof. Dr. Paolo Rech

Title: Understanding and improving the GPU’s reliability by combining accelerated neutron beam experiments and fault simulation

Research Area: Reliability and Fault Tolerance

Date: 29 November 2021
Time: 12:00

This defense will exceptionally be held fully remotely. Those interested in attending can access the virtual room through the link: https://meet.google.com/hnq-wjva-vyb

Examination Committee:

– Prof. Dr. Timothy Tsai (NVIDIA)
– Prof. Dr. Dimitris Gizopoulos (NKUA – National and Kapodistrian University of Athens)
– Prof. Dr. Evgenia Smirni (W&M – College of William & Mary)

Committee Chair: Prof. Dr. Paolo Rech

Abstract: Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High Performance Computing (HPC) and safety-critical applications, such as autonomous vehicles. This market shift led to a burst in GPUs’ computing capabilities and efficiency, significant improvements in programming frameworks and performance evaluation tools, and a sudden concern about their hardware reliability. To evaluate GPU reliability, researchers expose a device to a neutron beam and perform fault injection to simulate fault propagation. While beam experiments provide a very realistic error rate for the device, they lack visibility into fault propagation. Conversely, fault injection allows complete visibility of fault propagation, but the fault simulation and the error model are often limited to user-accessible resources and may lead to unrealistic results. Consequently, a methodology to accurately estimate the error rate of a device is necessary to answer two fundamental open questions in GPU reliability evaluation: (1) whether fault simulation provides representative results and can be used to predict the Failure In Time (FIT) rates of codes running on GPUs, and (2) whether single and double bit-flips are accurate error models for simulating faults on a GPU. This thesis presents a novel FIT estimation approach to predict the error rate of NVIDIA GPUs. The proposed FIT estimation is achieved by comparing and combining high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, an extensive architectural-level fault simulation (using SASSIFI and NVBitFI), and detailed application-level profiling, requiring more than 1,000 GPU hours. Results show that, in most cases, the estimated Silent Data Corruption (SDC) rate is sufficiently close (differences lower than 5) to the experimentally measured SDC rates.
The knowledge from the FIT estimation is then used to present a new error model based on the relative error, as opposed to single/double bit flips. The relative error is based on a new method that extracts the relative error differences from fault injection at the Register-Transfer Level (RTL). Using the experimental, architectural, and algorithmic analysis, this work also presents two novel hardening solutions for HPC and safety-critical applications: (1) Reduced Precision Duplication With Comparison (RP-DWC). RP-DWC’s primary goal is to lower the overhead of Duplication With Comparison (DWC) by executing the redundant copy in reduced precision. RP-DWC achieves excellent coverage (up to 86%) with minimal overheads (as low as 0.1% time and 24% energy consumption overhead). (2) Dedicated software solutions for hardening Convolutional Neural Networks (CNNs). Algorithm-Based Fault Tolerance (ABFT) applied to matrix multiplication (the core of CNNs) can correct more than 60% of the critical SDCs in a CNN, while redesigning the CNN’s max pool layers leads to the detection of up to 98% of SDCs. Additionally, this work is the first to evaluate CNNs’ error rate and hardening efficiency in neutron beam experiments.
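To make the idea behind RP-DWC concrete, the following is a minimal CPU-side sketch in Python, not the thesis implementation. It assumes a dot product as the protected computation, emulates the reduced-precision redundant copy by rounding to single precision, and uses a hypothetical relative tolerance `rel_tol`. All function names are illustrative.

```python
import struct

def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_full(a, b):
    """Primary copy: dot product in full (double) precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_reduced(a, b):
    """Redundant copy: same dot product, rounded to single precision at each step."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_float32(to_float32(x) * to_float32(y))
        acc = to_float32(acc + prod)
    return acc

def rp_dwc_check(full, reduced, rel_tol=1e-4):
    """Return True when both copies agree within rel_tol (hypothetical threshold).

    The tolerance must absorb the expected precision gap between the copies,
    so errors smaller than that gap cannot be detected.
    """
    scale = max(abs(full), abs(reduced), 1e-30)
    return abs(full - reduced) / scale <= rel_tol

def flip_bit(x, bit):
    """Flip one bit of a binary64 value, used here to mimic a transient fault."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (out,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return out

a = [0.1 * i for i in range(1, 101)]
b = [1.0 / i for i in range(1, 101)]
full = dot_full(a, b)
reduced = dot_reduced(a, b)

fault_free_ok = rp_dwc_check(full, reduced)                  # copies agree
large_fault_ok = rp_dwc_check(flip_bit(full, 62), reduced)   # exponent bit flipped
small_fault_ok = rp_dwc_check(flip_bit(full, 0), reduced)    # lowest mantissa bit flipped
```

In this sketch, a flip in a high exponent bit is flagged while a flip in the lowest mantissa bit passes unnoticed. This is the trade-off of comparing against a cheaper, less precise copy: coverage is partial, in exchange for lower overhead than full-precision duplication.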
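The ABFT idea for matrix multiplication can likewise be illustrated with the classic checksum scheme. The sketch below is a generic textbook-style version in plain Python, not the CNN-specific solution evaluated in the thesis. It assumes at most one corrupted data element in the product, and the function names and tolerance `tol` are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def abft_matmul(a, b):
    """Multiply checksum-augmented inputs.

    A gets an extra row of column sums and B an extra column of row sums,
    so the (m+1) x (n+1) result carries row and column checksums of A x B.
    """
    a_c = [list(r) for r in a] + [[sum(col) for col in zip(*a)]]
    b_r = [list(r) + [sum(r)] for r in b]
    return matmul(a_c, b_r)

def abft_check_and_correct(c_full, tol=1e-9):
    """Locate and fix a single corrupted data element in place.

    Returns the (row, col) that was corrected, or None when all checksums match.
    Raises ValueError for patterns this simple scheme cannot correct.
    """
    m, n = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(c_full[i][:n]) - c_full[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(c_full[i][j] for i in range(m)) - c_full[m][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is not a single corrupted data element")
    i, j = bad_rows[0], bad_cols[0]
    # The row checksum mismatch equals the error magnitude; subtract it.
    c_full[i][j] -= sum(c_full[i][:n]) - c_full[i][n]
    return (i, j)

a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8, 9], [10, 11, 12]]
expected = matmul(a, b)

c_full = abft_matmul(a, b)
clean_result = abft_check_and_correct(c_full)    # no fault injected yet
c_full[1][2] += 1000                             # mimic a corrupted output element
fixed_at = abft_check_and_correct(c_full)
recovered = [row[:3] for row in c_full[:3]]
```

The intersection of the mismatching row and column checksums pinpoints the faulty element, and the size of the mismatch gives the correction. Errors that touch several elements or the checksums themselves are only detected here, not corrected.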

Keywords: GPUs. Reliability. High Performance Computing. Safety-critical systems.