Thesis Proposal of Fernando Fernandes dos Santos

Event Details

Date: 05/03/2021 09:00 – 11:00
Categories: Defenses, Thesis Proposal

Student: Fernando Fernandes dos Santos
Advisor: Prof. Dr. Paolo Rech

Title: Understanding and improving the GPU’s reliability by combining accelerated neutron beam experiments and fault simulation

Research Area: Reliability and Fault Tolerance

Date: 05/03/2021
Time: 09:00

This defense will exceptionally be held fully remotely. Those interested in attending can access the virtual room through the link: https://meet.google.com/evp-nhaf-vyy

Examining Committee:
Prof. Dr. Ricardo dos Santos Ferreira (UFV)
Prof. Dr. Marco Antonio Zanata Alves (UFPR)
Prof. Dr. Taisy Silva Weber (UFRGS)

Committee Chair: Prof. Dr. Paolo Rech

Abstract: Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High Performance Computing (HPC) and safety-critical applications such as autonomous vehicles. This market shift led to a burst in GPUs’ computing capabilities and efficiency, significant improvements in programming frameworks and performance evaluation tools, and a sudden concern about their hardware reliability. This work introduces new ways to implement fault tolerance on GPUs using Algorithm-Based Fault Tolerance (ABFT) and Duplication With Comparison (DWC) techniques. A novel hardening technique for HPC and safety-critical applications is presented: Reduced Precision Duplication With Comparison (RP-DWC). RP-DWC’s primary goal is to lower the overhead of DWC by executing the redundant copy in reduced precision. RP-DWC is particularly suitable for modern mixed-precision architectures, such as NVIDIA GPUs. Additionally, a Failure In Time (FIT) estimation approach that predicts the error rate is proposed to evaluate the reliability of NVIDIA GPUs. The FIT estimation is necessary to answer one of the fundamental open questions in GPU reliability evaluation: whether fault simulation provides representative results and can be used to predict the FIT rates of codes running on GPUs. The proposed FIT estimation is achieved by comparing and combining high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, an extensive architectural-level fault simulation requiring more than 350 GPU hours (using SASSIFI and NVBitFI), and detailed application-level profiling. Results show that, in most cases, the fault simulation FIT prediction for Silent Data Corruptions (SDCs) is sufficiently close (differences lower than 5) to the experimentally measured FIT rates.
The background obtained with the FIT estimation is then used to present a new error model based on the relative error, as opposed to single/double bit flips. The relative error is based on a new method that extracts the relative values from low-level fault injection at the HDL level.

Keywords: GPUs. Reliability. High Performance Computing. Safety-critical systems.