# Reliability Estimation Methods: Trade-offs Between Complexity and Accuracy

## Samuel N. Pagliarini<sup>1</sup>, Denis T. Franco<sup>2</sup>, Lirida A. de B. Naviner<sup>1</sup> and Jean-François Naviner<sup>1</sup>

{samuel.pagliarini, lirida.naviner, jean-francois.naviner}@telecom-paristech.fr denisfranco@furg.br

<sup>1</sup>Institut TELECOM, Télécom-ParisTech, LTCI-CNRS COMELEC Department 46, Rue Barrault - 75013 - Paris, France

<sup>2</sup>Universidade Federal do Rio Grande, Centro de Ciências Computacionais, C3. Avenida Itália, Km 8, Campus Carreiros 96201-900 - Rio Grande, RS - Brasil

#### **Abstract**

As integrated circuits scale down into nanometer dimensions, a great reduction in the reliability of combinational blocks is expected. Thus, reliability estimation methods are of critical concern. This paper surveys reliability estimation methods that are suited to reliability analysis under multiple faults. Three methods are presented and discussed: PTM, SPR and SPRMP. Both the complexity and the accuracy of each method are addressed and compared.

#### 1. Introduction

The number of defects as well as the number of soft errors (i.e., transient errors) in electronic circuits are expected to increase, becoming major concerns in current and future technologies [1]. Thus, there is a current trend in the design of these circuits in which reliability-related criteria are more and more commonly required in the design flows.

While defects are a consequence of issues in the fabrication process, transient errors have many different sources. The transient faults behind them can be caused by different physical phenomena, such as high-energy particle hits originating from cosmic rays, capacitive coupling, electromagnetic interference or power transients [2]. Transient errors induced by energetic particles striking the devices are of great concern, especially for dependable systems and circuits [1, 3]. In the past, transient errors used to be a concern only in the design of memories. This encouraged the development of the now widely used error correcting codes and other mitigation techniques such as nodal interleaving. On the other hand, at 90 nm and beyond, the reduced dimensions of the devices and the reduced operating voltage levels make radiation-induced faults in logic more common. These faults may sometimes appear as multiple faults. In fact, error rates approaching those of memories have been reported in the literature [4].

When a transient fault occurs, one or multiple electrical pulses may be created. These pulses, typically current spikes, are able to change the logic value at a circuit node from one to zero or vice versa. Yet, not all transients become observable errors. Some transients do not reach the outputs of the combinational logic since there are natural masking properties against transient errors. These are:

- (1) electrical masking, which accounts for the attenuation of the current pulses as they propagate through each logic gate;
- (2) temporal masking, or latch-window masking, which defines a time interval in which the transients that reach the memory elements of the circuit will or will not be registered. If a transient misses the timing window it will not be acknowledged by the circuit/system;
- (3) logical masking, which accounts for the lack of sensitized paths (from the erroneous node) to a primary output or memory element.

Among the masking properties that render immunity to combinational logic circuits, logical masking is the hardest to model and characterize [5]. Logical masking is also technology-independent, which is quite relevant. Thus, considering this scenario, in this work we focus on reliability estimation methods that concern logical masking only.
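Logical masking can be illustrated by brute force on a toy netlist. The sketch below uses a hypothetical circuit, y = (a AND b) OR c, which is not taken from this paper. It flips the internal node n = a AND b and lists the input vectors for which the flip never reaches the output. The function and variable names are ours.

```python
from itertools import product

def circuit(a, b, c, flip_n=False):
    """Hypothetical circuit y = (a AND b) OR c, with an optional fault on node n."""
    n = a & b
    if flip_n:
        n ^= 1          # transient fault: invert the internal node
    return n | c

# Input vectors for which the faulty and fault-free outputs agree (fault is masked).
masked = [v for v in product((0, 1), repeat=3)
          if circuit(*v) == circuit(*v, flip_n=True)]

print(masked)
```

In this toy case the flip is masked whenever c = 1, because the OR gate output is then forced to 1 regardless of n.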

#### 2. Reliability estimation methods

Three different methods will be described in the subsections that follow: Probabilistic Transfer Matrix (PTM), Signal Probability Reliability (SPR) and Signal Probability Reliability Multi-Pass (SPRMP). All three methods are suited to reliability analysis under a multiple fault scenario. Simpler and traditional methods like fault simulation are usually better suited for single faults only. Thus, specialized methods like the ones described in this paper are of great relevance.

#### 2.1. PTM

The first method to be presented, referred to as PTM [6], [7], is the basis for the other two. It is a simple method that models, through the use of matrices, the logic gates and the topology of a circuit. The main idea of the method is to define the correlation between the output patterns and the input patterns of a circuit in a $2^m \times 2^n$ matrix, where m is the number of inputs and n is the number of outputs of the circuit. In order to do so, each logic gate is also represented as a PTM matrix. This representation is achieved by the use of two auxiliary elements: the Ideal Transfer Matrix (ITM) and the parameter q, as illustrated in fig. 1.



Fig. 1 – PTM representation of an OR logic gate.

The ITM matrix represents the truth table of the logic gate. This matrix is then modified to create the PTM: each value '1' is replaced by q and each value '0' is replaced by (1 - q). The q parameter represents the logic gate's ability to deliver a correct response at the output(s), i.e., it represents the reliability of a logic gate. Thus, (1 - q) represents the error probability. Given that each logic gate in a circuit is represented by a PTM matrix, it is then necessary to calculate the PTM matrix of the circuit by taking into account the topology. This calculation is done by leveling the target circuit, as illustrated in fig. 2.
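The ITM-to-PTM substitution is simple enough to sketch in a few lines. The example below is a minimal illustration, not the authors' implementation. The helper name `ptm_from_itm` is ours, and the row ordering (inputs 00, 01, 10, 11) and column ordering (output 0, output 1) are assumed so as to agree with the $PTM_{OR}$ matrix shown in fig. 4.

```python
def ptm_from_itm(itm, q):
    """Replace each 1 of the ITM by q and each 0 by (1 - q)."""
    return [[q if bit == 1 else 1.0 - q for bit in row] for row in itm]

# ITM of a 2-input OR gate (assumed ordering):
# rows are inputs 00, 01, 10, 11; columns are output 0, output 1.
ITM_OR = [
    [1, 0],
    [0, 1],
    [0, 1],
    [0, 1],
]

PTM_OR = ptm_from_itm(ITM_OR, q=0.95)

for row in PTM_OR:
    print([round(p, 4) for p in row])
```

Each row of the resulting matrix sums to 1, since for a given input pattern the single output is either correct (probability q) or wrong (probability 1 - q).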



Fig. 2 – PTM representation at circuit level [8].

The PTM is calculated for each circuit level. The PTM of the first level ($PTM_{L1}$) is calculated by performing the Kronecker product of $PTM_{AND}$ and $PTM_{NOT}$ ($PTM_{L1} = PTM_{AND} \otimes PTM_{NOT}$). The PTM matrix of the second level is already known and is given by $PTM_{NOR}$. Finally, $PTM_{L1}$ is multiplied by $PTM_{L2}$ to obtain the PTM of the whole circuit.

Although the PTM method is able to estimate the reliability of a circuit accurately, it is not feasible even for medium-sized circuits. The complexity is exponential in both m and n.
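The level-by-level computation can be sketched as follows. The wiring is an assumption on our part, since only the gate types per level are given in the text: an AND gate (inputs a, b) and a NOT gate (input c) form level 1, and both outputs feed a NOR gate in level 2. All gates share the same q, which is also assumed. The last lines compute the probability of a correct output averaged over uniformly distributed inputs. This is one plausible figure of merit and is not prescribed by the text. All names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def ptm_from_itm(itm, q):
    """Replace each 1 of the ITM by q and each 0 by (1 - q)."""
    return [[q if bit == 1 else 1.0 - q for bit in row] for row in itm]

# Rows: input patterns in binary order; columns: output 0, output 1.
ITM_AND = [[1, 0], [1, 0], [1, 0], [0, 1]]
ITM_NOT = [[0, 1], [1, 0]]
ITM_NOR = [[0, 1], [1, 0], [1, 0], [1, 0]]

def circuit_matrix(q):
    """PTM of the assumed two-level circuit; q = 1.0 yields its ITM."""
    and_g, not_g, nor_g = (ptm_from_itm(m, q) for m in (ITM_AND, ITM_NOT, ITM_NOR))
    level1 = kron(and_g, not_g)    # 8 x 4: parallel gates in the same level
    return matmul(level1, nor_g)   # 8 x 2: levels connected in series

Q = 0.95
PTM_CIRCUIT = circuit_matrix(Q)
ITM_CIRCUIT = circuit_matrix(1.0)

# Assumed figure of merit: probability of a correct output, uniform inputs.
RELIABILITY = sum(
    sum(p * ideal for p, ideal in zip(p_row, i_row))
    for p_row, i_row in zip(PTM_CIRCUIT, ITM_CIRCUIT)
) / len(PTM_CIRCUIT)

print(len(PTM_CIRCUIT), len(PTM_CIRCUIT[0]))
print(round(RELIABILITY, 4))
```

Even for this three-input circuit the level-1 matrix already has 8 x 4 entries. The number of rows doubles with every additional primary input, which is the exponential growth mentioned above.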

#### 2.2. SPR

When applying the PTM method, the size of the intermediate matrices increases at a fast pace. The SPR method [9], [10] tries to avoid this issue by representing each signal in the circuit by a 2x2 matrix. This matrix is illustrated in fig. 3.

$$P_{2\times 2}(signal) = \begin{bmatrix} P(signal = correct\ 0) & P(signal = incorrect\ 1) \\ P(signal = incorrect\ 0) & P(signal = correct\ 1) \end{bmatrix}$$

Fig. 3 – SPR matrix representation.

In the SPR matrix representation it is assumed that a signal may carry four distinct values. These values are: a correct '0', an incorrect '0', a correct '1' and an incorrect '1'. The SPR matrix then contains the probability of a signal being one of these mentioned values.

Let us consider an OR gate, as represented in fig. 4. Let us also assume that its inputs are already represented as SPR matrices  $A_4$  and  $B_4$ . In order to calculate the SPR matrix of the output s it is necessary to take into account the inputs, the logic function and the reliability of the gate. The logic function and the reliability are already given by the PTM matrix representation. And, since a gate might have multiple inputs, the joint probability of the inputs must be considered. This is achieved by calculating the Kronecker product of the inputs, as illustrated in fig. 4. The resulting matrix is multiplied by the PTM of the gate.

$$A_4 = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix} \qquad B_4 = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix} \qquad q_{OR} = 0.95$$

$$\underbrace{\begin{bmatrix} 0.25 & 0 & 0 & 0 \\ 0 & 0.25 & 0 & 0 \\ 0 & 0 & 0.25 & 0 \\ 0 & 0 & 0 & 0.25 \end{bmatrix}}_{I = A_4 \otimes B_4} \times \underbrace{\begin{bmatrix} 0.95 & 0.05 \\ 0.05 & 0.95 \\ 0.05 & 0.95 \\ 0.05 & 0.95 \end{bmatrix}}_{PTM_{OR}} = \underbrace{\begin{bmatrix} 0.2375 & 0.0125 \\ 0.0125 & 0.2375 \\ 0.0125 & 0.2375 \\ 0.0125 & 0.2375 \end{bmatrix}}_{P(S)} \xrightarrow{\text{merge}} \begin{bmatrix} 0.2375 & 0.0125 \\ 0.0375 & 0.7125 \end{bmatrix}$$

Fig. 4 – Example of signal probability propagation in an OR gate [10].

One last step is then performed, in which the values from the P(S) matrix are merged into a single SPR matrix. Values are merged according to the ITM matrix of the gate. This step introduces a certain loss of accuracy because reconvergent fan-outs are evaluated incorrectly. This issue is explained in detail in the next subsection and it is the motivation for the SPRMP method. Nevertheless, this same step allows for improved performance when compared with the PTM method: the complexity of the SPR algorithm is linear in the number of gates.
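The OR-gate propagation of fig. 4 can be reproduced with a short sketch. The merge step is written here as a multiplication by the transposed ITM. This is our way of expressing "merged according to the ITM" and not necessarily how the original tool implements it. Helper names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

Q_OR = 0.95
ITM_OR = [[1, 0], [0, 1], [0, 1], [0, 1]]
PTM_OR = [[Q_OR if bit else 1.0 - Q_OR for bit in row] for row in ITM_OR]

# Fault-free inputs, each equally likely to be 0 or 1 (values from fig. 4).
A4 = [[0.5, 0.0], [0.0, 0.5]]
B4 = [[0.5, 0.0], [0.0, 0.5]]

I = kron(A4, B4)                      # joint probability of the inputs (4 x 4)
P_S = matmul(I, PTM_OR)               # rows: ideal inputs; columns: actual output
S4 = matmul(transpose(ITM_OR), P_S)   # merge rows that share the same ideal output

for row in S4:
    print([round(p, 4) for p in row])

# Signal reliability: probability of a correct 0 plus probability of a correct 1.
R_S = S4[0][0] + S4[1][1]
print(round(R_S, 4))
```

The resulting 2x2 matrix should agree with the rightmost matrix of fig. 4, and the signal reliability equals q because both inputs are fault-free. Every gate is visited once and only small matrices are handled, which is where the linear complexity comes from.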

#### **2.3. SPRMP**

The inaccuracy in the SPR method comes from the erroneous evaluation of reconvergent fan-outs. The simple example circuit in fig. 5 illustrates this issue. The input SPR matrix is simplified to ease the analysis (only two elements are considered, $a_0$ and $a_3$). Although a reconvergent fan-out is not part of the topology itself, obtaining the final reliability of the circuit requires multiplying the reliabilities associated with each of the circuit outputs. This operation has the same effect as an actual reconvergent fan-out in the circuit topology.

$$A_{4} = \begin{bmatrix} a_{0} & 0 \\ 0 & a_{3} \end{bmatrix} \qquad a \xrightarrow{\ \text{circuit}\ } y(0) \qquad Y(0)_{4} = \begin{bmatrix} a_{3}q & a_{3}\bar{q} \\ a_{0}\bar{q} & a_{0}q \end{bmatrix}$$

$$R(circuit) = R_{y(0)}R_{y(1)} = a_3^2q^2 + a_3a_0q^2 + a_0a_3q^2 + a_0^2q^2$$

Fig. 5 – Computing the reliability of a simple circuit with a reconvergent fan-out [8].

If the SPR method is applied to the given circuit, it produces an erroneous result. The equation for the circuit reliability contains inconsistent terms. Terms like $a_3^2q^2$ are inconsistent since they depend twice on the same probability $(a_3)$. Along the same lines, the two cross terms $a_3a_0q^2$ and $a_0a_3q^2$ should not be accounted for at all, since they depend on different states of the same signal, which cannot occur simultaneously.

The SPRMP method solves this issue by splitting the analysis into multiple passes (hence the name). In each pass a single state is considered while the others are assumed to be zero. Fig. 6 illustrates this concept. Although only pass 1 and pass 2 are shown in fig. 6, all four possible states should be evaluated. For each fan-out node we have 4 partial reliabilities. The reliability of the circuit as a whole is given by the sum of all partial reliabilities. For the example illustrated in fig. 6 the reliability is given by:

$$R(circuit) = R(circuit, a_0) + R(circuit, a_3) = a_0q^2 + a_3q^2 \tag{1}$$

As expected, the reliability given in (1) no longer depends on inconsistent terms. SPRMP is exact if all fan-outs are evaluated. The penalty comes in terms of execution time, which is no longer linear: the complexity of the algorithm is exponential in the number of fan-outs.
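The difference between a single SPR pass and the multi-pass evaluation can be checked numerically. The sketch below assumes that the circuit of fig. 5 is a signal a fanning out to two identical inverters of reliability q, as the $Y(0)_4$ matrix suggests. It also assumes that each pass conditions on one state of a (probability 1) and weights the partial result by that state's probability. This is our reading of the multi-pass idea, chosen because it reproduces (1). All names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def propagate_not(sig, q):
    """SPR propagation of a 2x2 signal matrix through an inverter of reliability q."""
    ptm_not = [[1.0 - q, q], [q, 1.0 - q]]   # rows: input 0, 1; columns: output 0, 1
    itm_not_t = [[0, 1], [1, 0]]             # transposed ITM: merge by ideal output
    return matmul(itm_not_t, matmul(sig, ptm_not))

def reliability(sig):
    """Probability of a correct 0 plus probability of a correct 1."""
    return sig[0][0] + sig[1][1]

def spr(a4, q):
    """Single pass: the two output reliabilities are multiplied as if independent."""
    r_y = reliability(propagate_not(a4, q))
    return r_y * r_y

def sprmp(a4, q):
    """One pass per state of the fan-out signal, weighted by the state probability."""
    total = 0.0
    for row in range(2):
        for col in range(2):
            one_state = [[0.0, 0.0], [0.0, 0.0]]
            one_state[row][col] = 1.0        # condition on a single state
            r_y = reliability(propagate_not(one_state, q))
            total += a4[row][col] * r_y * r_y
    return total

# Hypothetical fan-out signal that is itself wrong 10% of the time.
A4 = [[0.45, 0.05], [0.05, 0.45]]
Q = 0.9

print(round(spr(A4, Q), 4))
print(round(sprmp(A4, Q), 4))
```

With these values the single pass yields 0.6724, whereas the multi-pass evaluation yields 0.73, which matches a direct enumeration (0.9 x 0.81 when a is correct plus 0.1 x 0.01 when it is not). When the fan-out signal is fault-free, i.e., $a_0 + a_3 = 1$, both expressions evaluate to $q^2$, so the discrepancy only shows up numerically once the incorrect states carry some probability.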

SPRMP allows for straightforward trade-offs between execution time and accuracy. Through the concept of dominant fan-outs, i.e., the observation that some fan-outs matter more than others, it is possible to reduce the execution time drastically. Depending on the topology of the target circuit, even when only a small number of fan-outs is considered, a large reduction in execution time is possible with an error smaller than 2% [9].
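To get a feeling for this trade-off, the sketch below counts passes under a simplifying assumption of ours: every fan-out taken into account contributes four states, and the passes enumerate all joint state combinations. The function name is hypothetical and the actual pass count of a given implementation may differ.

```python
def sprmp_passes(fanouts_considered, states_per_fanout=4):
    """Number of passes if all joint state combinations are enumerated (assumed)."""
    return states_per_fanout ** fanouts_considered

# Restricting the analysis to a few dominant fan-outs keeps the count manageable.
for k in (0, 2, 5, 10):
    print(k, sprmp_passes(k))
```

With zero fan-outs considered the method degenerates to a single SPR pass, while ten fan-outs would already require more than a million passes under this assumption.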

#### 3. Conclusion

This paper discussed and compared three different methods to estimate the reliability of combinational logic circuits. Each method has its own characteristics, and it is up to the designer to choose which method is the best fit. PTM is at one extreme: it always produces accurate results, at the price of exponential complexity. SPR is at the other extreme: it produces a relatively inaccurate result in linear time. SPRMP lies somewhere in between: accuracy and complexity can be traded by considering only a portion of the total number of fan-outs.

Fig. 6 – Example of the multi-pass algorithm [7].

#### **ACKNOWLEDGMENT**

This work was partially funded by the CATRENE project RELY and by the STIC AmSud program.

#### REFERENCES

- [1] M. Nicolaidis, "Design for soft error mitigation," Device and Materials Reliability, IEEE Transactions on, vol. 5, no. 3, 2005, pp. 405-418.
- [2] G. Saggese, N. Wang, Z. Kalbarczyk, S. Patel, and R. Iyer, "An experimental study of soft errors in microprocessors," Micro, IEEE, vol. 25, no. 6, 2005, pp. 30-39.
- [3] P. Dodd and L. Massengill, "Basic mechanisms and modeling of single-event upset in digital microelectronics," Nuclear Science, IEEE Transactions on, vol. 50, no. 3, 2003, pp. 583-602.
- [4] B. Narasimham, B. Bhuva, R. Schrimpf, L. Massengill, M. Gadlage, O. Amusan, W. Holman, A. Witulski, W. Robinson, J. Black, J. Benedetto, and P. Eaton, "Characterization of digital single event transient pulse-widths in 130-nm and 90-nm CMOS technologies," Nuclear Science, IEEE Transactions on, vol. 54, no. 6, 2007, pp. 2506-2511.
- [5] N. George and J. Lach, "Characterization of logical masking and error propagation in combinational circuits and effects on system vulnerability," in Dependable Systems & Networks (DSN), 2011 IEEE/IFIP 41st International Conference on, 2011, pp. 323-334.
- [6] K. N. Patel, I. L. Markov, and J. P. Hayes, "Evaluating circuit reliability under probabilistic gate-level fault models," in International Workshop on Logic Synthesis (IWLS), 2003, pp. 59-64.
- [7] S. Krishnaswamy, G. F. Viamontes, I. L. Markov, and J. P. Hayes, "Accurate reliability evaluation and enhancement via probabilistic transfer matrices," in Design, Automation and Test in Europe, Proceedings, vol. 1, 2005, pp. 282-287.
- [8] D. T. Franco, "Signal reliability for combinational circuits under multiple simultaneous faults," Ph.D. thesis, Télécom ParisTech, Paris, 2008.
- [9] D. T. Franco, M. C. Vasconcelos, L. Naviner, and J.-F. Naviner, "Signal probability for reliability evaluation of logic circuits," Microelectronics Reliability, vol. 48, no. 8-9, 2008, pp. 1586-1591.
- [10] D. T. Franco, M. C. Vasconcelos, L. Naviner, and J.-F. Naviner, "Reliability of logic circuits under multiple simultaneous faults," 51st Midwest Symposium on Circuits and Systems, (MWSCAS), 2008.