Dependable Real-Time Computing on Field-Programmable Gate Arrays

Protecting real-time systems using novel hardening techniques for FPGAs.

Field-Programmable Gate Arrays (FPGAs) are reconfigurable integrated circuits that have found great commercial success over the past decades. A designer describes the desired hardware in a hardware description language (HDL) and, using vendor-specific tools, synthesizes it into a configuration bitstream that is uploaded to the device. Once configured, the device behaves as the circuit described in the HDL. This general-purpose fabric has been used in numerous application domains, such as telecommunications, medicine, aerospace and defense. Many of these applications impose dependability constraints that can only be met with appropriate fault tolerance mechanisms. In FPGAs, the memory that stores the configuration bitstream is a particularly critical component. Faults affecting its contents, caused for example by ionizing particles, can modify the behavior of the configured circuit and lead to functional failures [1]. Since this memory is typically written only at system power-up, such changes persist until a repair procedure is carried out. Repair usually consists of overwriting the erroneous contents with the correct ones, a process called scrubbing [2]. Because these memories are large, reaching tens of megabytes in state-of-the-art devices, a full device scrub can take a long time to complete. Long repair latencies are problematic for applications subject to real-time constraints [3], which are common in critical systems. Enabling the use of FPGAs in such systems requires repair procedures that quickly restore system functionality while still providing high fault coverage and low penalties in area, performance and energy.

In our group, we address this problem through localized repair procedures that use the partial reconfiguration capabilities of current FPGAs and can therefore be executed in a much shorter time than a full device scrub. For this to be possible, the error location must be determined quickly and precisely. We have proposed the use of fine-grained redundancy to perform this detailed diagnosis. By mapping each error detection pattern to its most likely error locations, one can significantly reduce the mean time to repair by starting the scrubbing process closer to the error [4]. We have also investigated the problem of maximizing the probability of successful repair within a given timeframe [5], and evaluated alternative redundancy granularities to minimize costs in area and latency [6].
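The effect of starting the scrub closer to the error can be illustrated with a minimal Python sketch. It is a simplified model and not the implementation from [4]: it assumes the configuration memory is a linear sequence of frames that the scrubber rewrites one at a time with wrap-around. The names `scrub_order`, `frames_until_repair`, `localized_start` and the table `START_FRAME`, as well as the frame counts and bit patterns, are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterator


def scrub_order(num_frames: int, start: int) -> Iterator[int]:
    """Yield every frame index exactly once, beginning at `start` and wrapping around."""
    for offset in range(num_frames):
        yield (start + offset) % num_frames


def frames_until_repair(num_frames: int, start: int, faulty: int) -> int:
    """Count how many frames are rewritten until the faulty frame has been overwritten."""
    for count, frame in enumerate(scrub_order(num_frames, start), start=1):
        if frame == faulty:
            return count
    raise ValueError("faulty frame index is outside the configuration memory")


# Hypothetical diagnosis table (an assumption for this sketch): each error
# detection pattern reported by the fine-grained redundancy is mapped to the
# frame at which scrubbing should start.
START_FRAME: Dict[int, int] = {0b001: 0, 0b010: 400, 0b100: 800}


def localized_start(pattern: int, default: int = 0) -> int:
    """Return the scrub starting frame for a pattern, or `default` if it is unknown."""
    return START_FRAME.get(pattern, default)
```

In this toy model with 1000 frames and a fault in frame 420, a scrub that starts at frame 0 rewrites 421 frames before the fault is repaired, whereas a scrub that starts at the frame returned by `localized_start(0b010)` rewrites only 21. Repair time is counted in frames rewritten, so any real latency figure would depend on the configuration port bandwidth and frame size of the target device.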

[1] A. Lesea, S. Drimer, J. J. Fabula, C. Carmichael and P. Alfke, “The Rosetta experiment: Atmospheric soft error rate testing in differing technology FPGAs,” in IEEE Trans. on Device and Materials Reliability, vol. 5, no. 3, Sept. 2005, pp. 317–328.

[2] C. Carmichael, M. Caffrey and A. Salazar, “Correcting Single-Event Upsets Through Virtex Partial Configuration,” in Xilinx Application Notes, Xilinx, June 2000.

[3] H. Kim and K. G. Shin, “Evaluation of Fault Tolerance Latency from Real-Time Application’s Perspectives,” in IEEE Trans. on Computers, vol. 49, no. 1, Jan. 2000, pp. 55–64.

[4] G. L. Nazar, L. P. Santos and L. Carro, “Fine-Grained Fast Field-Programmable Gate Array Scrubbing,” in IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 5, May 2015, pp. 893–904.

[5] G. L. Nazar, “Improving FPGA repair under real-time constraints,” in Microelectronics Reliability, vol. 55, no. 7, June 2015, pp. 1109–1119.

[6] L. P. Santos, G. L. Nazar and L. Carro, “Low Cost Dynamic Scrubbing for Real-Time Systems,” in International Symposium on Applied Reconfigurable Computing (ARC), 2016, pp. 144–156.