Fault Tolerance in HPC Applications

We collaborate with Los Alamos National Laboratory, NVIDIA, and Northeastern University to evaluate, understand, and work towards solving the reliability problem in supercomputers. We not only quantify transient error effects but also qualify them, that is, we assess how severely each error affects the application output. We also consider how imprecise computing can raise the error tolerance threshold and thereby reduce the number of errors that count as critical. Once the critical errors are identified, we can propose techniques to mitigate them, increasing application reliability without unnecessary overhead.
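The link between the tolerance threshold and the number of critical errors can be illustrated with a minimal sketch. It assumes a relative-error criterion and uses made-up output values; the function names and the numbers are hypothetical and are not taken from our experiments.

```python
def relative_error(expected, observed):
    """Relative error of one output element (absolute error if expected is 0)."""
    if expected == 0.0:
        return abs(observed)
    return abs(observed - expected) / abs(expected)


def count_critical(golden, observed, tolerance):
    """Count output elements whose relative error exceeds the tolerance."""
    return sum(
        1 for g, o in zip(golden, observed) if relative_error(g, o) > tolerance
    )


# Hypothetical fault-free ("golden") output and a corrupted run of the same code.
golden = [1.0, 2.0, 4.0, 8.0]
observed = [1.0, 2.0002, 4.2, 16.0]

# A larger tolerance (a more "imprecise" application) leaves fewer critical errors.
for tolerance in (0.0, 0.001, 0.1):
    print(f"tolerance={tolerance}: {count_critical(golden, observed, tolerance)} critical")
```

With a tolerance of zero every mismatch counts as critical; as the accepted imprecision grows, small deviations are absorbed and only the large ones remain to be mitigated.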

The U.S. Department of Energy (DOE) identifies reliability as one of the ten greatest challenges for exascale [1]. Errors that may undermine the reliability of an HPC system can come from a variety of sources, including environmental perturbations, firmware errors, and manufacturing-process, temperature, and voltage variations [2, 3, 4]. Transient errors induced by ionizing particles are particularly critical since they dominate error rates in commercial devices [5]. HPC components typically provide high computational capacity, low cost, reduced per-task energy consumption, and flexible development platforms. Unfortunately, HPC components are also extremely likely to experience transient errors, as they are built with cutting-edge technology, operate at very high frequencies, and include large amounts of resources. In this scenario, a poor understanding of the resilience of HPC components may lead to lower scientific productivity, lower operational efficiency, and even significant monetary loss [6]. As a reference, DOE’s Titan has a radiation-induced Mean Time Between Failures (MTBF) on the order of dozens of hours [7, 8].
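To see why a large machine can have a short MTBF even when each device is individually reliable, consider the following back-of-the-envelope sketch. It assumes independent devices with a constant failure rate expressed in FIT (failures per 10^9 device-hours); the per-device FIT and the device count below are hypothetical placeholders, not measured values for Titan or any other system.

```python
def system_mtbf_hours(fit_per_device, n_devices):
    """MTBF of a system of identical, independent devices.

    FIT is the number of failures per 1e9 device-hours, so the system failure
    rate is fit_per_device * n_devices failures per 1e9 hours.
    """
    if fit_per_device <= 0 or n_devices <= 0:
        raise ValueError("fit_per_device and n_devices must be positive")
    return 1e9 / (fit_per_device * n_devices)


# Hypothetical numbers, chosen only to show the scaling.
single_device = system_mtbf_hours(fit_per_device=1000, n_devices=1)
whole_machine = system_mtbf_hours(fit_per_device=1000, n_devices=10_000)

print(f"one device:     {single_device:,.0f} hours")
print(f"10,000 devices: {whole_machine:,.0f} hours")
```

Under these assumptions the MTBF shrinks in proportion to the number of devices, which is why per-component error rates that look negligible become a system-level concern at scale.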

Many top supercomputers use some form of accelerator, such as Intel Xeon Phi or NVIDIA GPUs [9]. We study accelerators to evaluate, understand, and develop mitigation strategies for reliability issues in current and future supercomputers [10]. To assess radiation-induced problems, we run experiments under an accelerated neutron flux suitable for mimicking the effects of terrestrial neutrons on electronic devices [11]. So far, we have covered thousands of years of natural exposure per component studied. We then analyze the severity of each error and how it propagates to the output of the affected execution, in order to understand the reliability issue [12]. Finally, we inject faults to analyze the application’s vulnerabilities in detail and to provide pragmatic information for developing mitigation mechanisms that maximize resilience while minimizing the overhead imposed.
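The fault-injection step can be pictured with a minimal software sketch: flip one bit in an input value, rerun the computation, and compare the result with the fault-free ("golden") output. This is only an illustration under simplifying assumptions (a single-bit flip in a 64-bit float, a toy dot product as the application); the helper names are hypothetical and this is not the injector used in the cited work.

```python
import struct


def flip_bit(value, bit):
    """Return `value` (a float64) with one bit of its IEEE 754 encoding inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 63]")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def dot(a, b):
    """Toy application: dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
golden = dot(a, b)

# Inject one fault per run into a[1] and observe how it reaches the output.
for bit in (0, 52, 62, 63):  # low mantissa bit, two exponent bits, sign bit
    faulty_a = list(a)
    faulty_a[1] = flip_bit(faulty_a[1], bit)
    deviation = dot(faulty_a, b) - golden
    print(f"bit {bit:2d}: a[1]={faulty_a[1]!r}, output deviation={deviation!r}")
```

Even in this toy case the effect depends strongly on which bit is hit: a flip in the lowest mantissa bit barely changes the output, while a flip in the exponent or sign changes it substantially. Distinguishing these cases is what allows mitigation to target only the errors that matter.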

[1] Advanced Scientific Computing Research. 2016. Scientific Discovery through Advanced Computing - The Challenges of Exascale. https://science.energy.gov/ascr/research/scidac/exascale-challenges/. (2016). [Online; accessed 5-March-2016].

[2] J. C. Laprie. 1995. Dependable Computing and Fault Tolerance: Concepts and Terminology. In Fault-Tolerant Computing, 1995, Highlights from Twenty-Five Years., Twenty-Fifth International Symposium on. DOI: http://dx.doi.org/10.1109/FTCSH.1995.532603

[3] R. R. Lutz. 1993. Analyzing software requirements errors in safety-critical, embedded systems. In Requirements Engineering, 1993., Proceedings of IEEE International Symposium on. 126–133. DOI: http://dx.doi.org/10.1109/ISRE.1993.324825

[4] M. Nicolaidis. 1999. Time redundancy based soft-error tolerance to rescue nanometer technologies. In VLSI Test Symposium, 1999. Proceedings. 17th IEEE. 86–94. DOI: http://dx.doi.org/10.1109/VTEST.1999.766651

[5] R. C. Baumann. 2005. Radiation-induced soft errors in advanced semiconductor technologies. Device and Materials Reliability, IEEE Transactions on 5, 3 (Sept 2005), 305–316. DOI: http://dx.doi.org/10.1109/TDMR.2005.853449

[6] M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson et al. 2014. Addressing failures in exascale computing. International Journal of High Performance Computing Applications (2014), 1–45.

[7] L. A. Bautista Gomez, Franck Cappello, L. Carro, N. DeBardeleben, B. Fang, S. Gurumurthi, S. Keckler, K. Pattabiraman, R. Rech, and M. Sonza Reorda. 2014. GPGPUs: How to Combine High Computational Power with High Reliability. In 2014 Design Automation and Test in Europe Conference and Exhibition. Dresden, Germany.

[8] Devesh Tiwari, Saurabh Gupta, Jim Rogers, Don Maxwell, Paolo Rech, Sudharshan Vazhkudai, Daniel Oliveira, Dave Londo, Nathan Debardeleben, Philippe Navaux, Luigi Carro, and Arthur Buddy Bland. 2015. Understanding GPU Errors on Large-scale HPC Systems and the Implications for System Design and Operation. In Proceedings of 21st IEEE Symp. on High Performance Computer Architecture (HPCA). ACM.

[9] J.J. Dongarra, H.W. Meuer, and E. Strohmaier. 2016. TOP500 Supercomputer Sites: June 2016. (2016). http://www.top500.org

[10] D. A. G. de Oliveira, L. L. Pilla, T. Santini, and P. Rech. 2016. Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units. IEEE Transactions on Computers 65, 3 (March 2016), 791–804. DOI: 10.1109/TC.2015.2444855

[11] M. Violante, L. Sterpone, A. Manuzzato, S. Gerardin, P. Rech, M. Bagatin, A. Paccagnella, C. Andreani, G. Gorini, A. Pietropaolo, G. Cardarilli, S. Pontarelli, and C. Frost. 2007. A New Hardware/Software Platform and a New 1/E Neutron Source for Soft Error Studies: Testing FPGAs at the ISIS Facility. Nuclear Science, IEEE Transactions on 54, 4 (2007), 1184–1189. DOI: http://dx.doi.org/10.1109/TNS.2007.902349

[12] Daniel Oliveira, Laercio Pilla, Mauricio Hanzich, Vinicius Fratin, Fernando Fernandes, Caio Lunardi, José María Cela, Philippe Navaux, Luigi Carro, and Paolo Rech. 2017. Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators. In High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on. IEEE.

Figure: UFRGS, Los Alamos National Laboratory, and Northeastern University team at LANSCE, Los Alamos, testing HPC and embedded parallel processors.