Neural Network Reliability for Automotive Applications

We are evaluating and mitigating the effects of transient errors in neural networks for automotive applications. Embedded GPUs are particularly attractive for analyzing the huge volume of images and signals that self-driving systems must process. Modern self-driving cars rely on neural networks to detect and classify objects. Neural networks were found in the past to be intrinsically resilient. However, new deep learning frameworks rely on convolution to extract information from images, and the reliability of convolution still needs to be qualified. Additionally, because GPUs were initially designed for graphics rendering, a task where high reliability is not a concern [1], their architecture has some inherent reliability weaknesses [2, 3, 4, 5]. A single corrupted thread could feed thousands of subsequent parallel threads with erroneous data, leading to multiple errors in the output. Indeed, the vast majority of radiation-induced errors in GPUs affect multiple output elements [6]. This tendency of GPUs to spread a single corruption to several output elements may undermine the reliability of object detection systems. We are the first research group to have experimentally evaluated embedded GPUs under a neutron source.

Using radiation tests, we were able to propose hardening techniques for the object detection algorithms commonly used in ADAS. We irradiated two well-known Convolutional Neural Networks (CNNs), Darknet and Faster R-CNN [7, 8], executed on three different GPU architectures (Kepler, Maxwell, Pascal). We identified which errors can affect detection and should be mitigated [9]. Based on our results, we proposed an Algorithm-Based Fault Tolerance (ABFT) technique for CNNs that reduced the number of critical errors by about 87%. We also found that some operations, such as Maxpooling (a filter that reduces the spatial size of the network output), improve CNN reliability. We have designed a more robust Maxpool operation, which can detect up to 90% of critical silent data corruptions (SDCs).
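The text above does not describe how the Algorithm Based Fault Tolerance technique is built, so the following is only a minimal sketch of the general idea, not the technique used in this work. It assumes the convolution has been lowered to a matrix multiplication, as many CNN frameworks do, and applies the classic row and column checksum scheme to that product. The function names (`matmul`, `abft_matmul`, `check`) and the tolerance value are hypothetical choices made for illustration.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def abft_matmul(a, b):
    """Multiply a (m x k) by b (k x n) and return the full-checksum product.

    a gets an extra row holding its column sums and b an extra column holding
    its row sums, so the (m+1) x (n+1) result carries its own checksums.
    """
    a_c = a + [[sum(col) for col in zip(*a)]]   # column-checksum version of a
    b_r = [row + [sum(row)] for row in b]       # row-checksum version of b
    return matmul(a_c, b_r)


def check(c_full, tol=1e-6):
    """Return (bad_rows, bad_cols): indices whose checksum does not match."""
    m, n = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(c_full[i][:n]) - c_full[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(c_full[i][j] for i in range(m)) - c_full[m][j]) > tol]
    return bad_rows, bad_cols


# Hypothetical 2x2 operands standing in for a lowered convolution.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

c = abft_matmul(a, b)
clean = check(c)      # no mismatching rows or columns on a fault-free run

c[0][1] += 100.0      # inject a single corrupted output element
faulty = check(c)     # the mismatching row and column locate the element
```

A single corrupted element shows up as one mismatching row and one mismatching column, which is enough to locate it. Errors that spread to several output elements, the case highlighted above for GPUs, produce several mismatches and are still detected, although locating and correcting them needs more than this sketch provides.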

[1] M. Breuer, S. Gupta, and T. M. Mak, “Defect and error tolerance in the presence of massive numbers of defects,” IEEE Design & Test of Computers, vol. 21, pp. 216–227, May 2004.

[2] N. DeBardeleben, S. Blanchard, L. Monroe, P. Romero, D. Grunau, C. Idler, and C. Wright, “GPU Behavior on a Large HPC Cluster,” 6th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids in conjunction with the 19th International European Conference on Parallel and Distributed Computing (Euro-Par 2013), Aachen, Germany, August 26-30, 2013.

[3] H.-J. Wunderlich, C. Braun, and S. Halder, “Efficacy and efficiency of algorithm-based fault-tolerance on GPUs,” in On-Line Testing Symposium (IOLTS), 2013 IEEE 19th International, pp. 240–243, July 2013.

[4] L. A. B. Gomez, F. Cappello, L. Carro, N. DeBardeleben, B. Fang, S. Gurumurthi, K. Pattabiraman, R. Rech, and M. S. Reorda, “GPGPUs: How to Combine High Computational Power with High Reliability,” in 2014 Design Automation and Test in Europe Conference and Exhibition, (Dresden, Germany), 2014.

[5] D. A. G. de Oliveira, L. L. Pilla, T. Santini, and P. Rech, “Evaluation and mitigation of radiation-induced soft errors in graphics processing units,” IEEE Transactions on Computers, vol. 65, pp. 791–804, March 2016.

[6] D. Oliveira, L. Pilla, M. Hanzich, V. Fratin, F. Fernandes, C. Lunardi, J. Cela, P. Navaux, L. Carro, and P. Rech, “Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators,” in Proceedings of 21st IEEE Symp. on High Performance Computer Architecture (HPCA), ACM, 2017.

[7] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” CoRR, vol. abs/1506.02640, 2015.

[8] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS), 2015.

[9] F. Fernandes, L. Weigel, P. Navaux, L. Carro, and P. Rech, “Evaluation and mitigation of neural network-based object detection in three GPU architectures,” SELSE, 2017.

Figure: Three GPU architectures (Kepler, Maxwell, Pascal) irradiated at ChipIR, UK, while executing CNNs for automotive applications.