Efficient methodology for design and verification of Memory ECC error management logic in safety critical SoCs

Siddharth Garg, Joachim Fader, Ashish Kumar Gupta (Freescale Semiconductor)

Abstract:

Stringent safety requirements push SoC architects to focus more on schemes that make microcontrollers fail safe under all conditions. As chip complexity and size grow, the additional hardware required to meet the safety requirements also grows, so considerable work goes into meeting those requirements with minimal area overhead. Failures may arise from random hardware failures, systematic hardware failures or systematic software failures. Safety can be implemented through software, hardware or a combination of both. In hardware, the available features include lockstep cores and redundant hardware elements (replication to avoid common cause failures), ECC on memories, monitors (clock, voltage, temperature etc., to make sure that the chip does not leave the safe conditions for which it was designed), built-in self-test (to check for permanent failures), and efficient fault handling to make sure that the system is able to recover from any failure. Most failures in electronic systems are caused by transient failures of memories, which can also cause reliability issues, so it is very important to address this area in order to minimize system failures. This paper presents an efficient methodology to implement and verify ECC error management in systems with a large number of memories, with minimum hardware overhead and without compromising the safety requirements.

What is ECC? :

ECC (Error Correcting Code) ensures the data integrity of a system or a network. ECC mechanisms not only detect multi-bit errors in data, but can also correct errors that affect a smaller number of bits. An SoC contains a large number of memories, each of which can generate either single-bit or multi-bit errors. Single-bit errors are usually correctable and hence not critical, whereas multi-bit errors are critical and, from a safety perspective, must not be missed.
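As an illustration of the correct-one, detect-two behavior described above, the following minimal sketch implements an extended Hamming (8,4) code in Python. The choice of code, the bit layout and the function names `encode` and `decode` are assumptions made for this example; the article does not state which ECC code the memories use.

```python
# Illustrative extended Hamming (8,4) SECDED code (assumed, not from the article).
# Bit 0 of the codeword is the overall parity bit; bits 1..7 are the usual
# Hamming positions, with parity bits at positions 1, 2 and 4.

def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1             # overall parity over positions 1..7
    return sum(bit << i for i, bit in enumerate(code))

def decode(word: int):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    overall = sum(code) & 1                 # 0 when the overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: treat as a single-bit error at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with consistent overall parity: double-bit error.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble

word = encode(0b1011)
print(decode(word))           # ('ok', 11)
print(decode(word ^ 0b100))   # one flipped bit  -> ('corrected', 11)
print(decode(word ^ 0b110))   # two flipped bits -> ('uncorrectable', None)
```

In this sketch the "corrected" status plays the role of the single-bit (correctable) error and "uncorrectable" the role of the multi-bit error discussed in the rest of the article.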

Conventional design approach:

Memories in an SoC usually work at different frequencies, either to reduce power consumption or to meet protocol specifications. Errors are reported centrally in the SoC so that appropriate action can be taken based on which memory has the ECC error. This central module is asynchronous to many of the memories, and hence FIFOs are required if all error addresses and error types need to be latched properly. The depth of each FIFO depends on the relation between the frequency of the memory and that of the error managing block. As the number of memories and the number of asynchronous domains in the SoC increase, the overall FIFO size (the sum of the FIFO sizes required for the individual asynchronous memories) becomes very large and carries a large area overhead.

Proposed design approach:

In the new scheme, both single-bit and multi-bit errors are handled according to the scenario in which they occur and their criticality, without compromising the safety requirements.

In this scheme, three signals are latched:

Address
Single bit error (correctable)
Multi bit error (uncorrectable)

We have classified the error conditions into 8 categories and defined the handling of each for a single memory. The proposed logic is replicated for each memory instance. Let Ax and Ay be two different addresses of the memory at which accesses are made. We denote a correctable error from Ax as Correctable Ax, an uncorrectable error from Ax as Uncorrectable Ax, a correctable error from Ay as Correctable Ay, and an uncorrectable error from Ay as Uncorrectable Ay. The 8 categories are:

Correctable Ax followed by Correctable Ax before first error is taken care of.
Correctable Ax followed by Correctable Ay before first error is taken care of.
Uncorrectable Ax followed by Uncorrectable Ax before first error is taken care of.
Uncorrectable Ax followed by Uncorrectable Ay before first error is taken care of.
Correctable Ax followed by Uncorrectable Ax before first error is taken care of.
Correctable Ax followed by Uncorrectable Ay before first error is taken care of.
Uncorrectable Ax followed by Correctable Ax before first error is taken care of.
Uncorrectable Ax followed by Correctable Ay before first error is taken care of.

The expected behaviors for the above conditions, listed in the same order, are given below:

Correctable Ax is reported; the second error is ignored because the same type of error from the same address has already been recorded.
Correctable Ax is reported; the second correctable error, from a different address, is ignored because it is less critical, having been corrected by the ECC algorithm.
Uncorrectable Ax is reported; the second error is ignored because the same type of error from the same address has already been recorded.
Uncorrectable Ax is reported along with Overflow, because it is important to indicate to the system that the error management unit has not stored the address of the uncorrectable error from the second location.
Correctable Ax is reported along with Uncorrectable Ax, because both kinds of error come from the same location, whose address has been stored, and the correctable and uncorrectable error indications use different paths.
Correctable Ax is reported along with Overflow, because it is important to indicate to the system that the error management unit has not stored the address of the uncorrectable error from the second location.
Uncorrectable Ax is reported along with Correctable Ax, because both kinds of error come from the same location, whose address has been stored, and the correctable and uncorrectable error indications use different paths.
Uncorrectable Ax is reported; the second correctable error, from a different address, is ignored because it is less critical, having been corrected by the ECC algorithm.

The complete summary is provided in the table below:

		Expected Behavior
First error	Following error	Single bit flag Set	Multi bit flag set	Address latched	Overflow
Correctable error from Ax	Correctable error from Ax	Yes	No	Ax	No (Same error from same address ignored)
Correctable error from Ax	Correctable error from Ay	Yes	No	Ax	No (Second error correctable from different address ignored)
Uncorrectable error from Ax	Uncorrectable error from Ax	No	Yes	Ax	No (Same error from same address ignored)
Uncorrectable error from Ax	Uncorrectable error from Ay	No	Yes	Ax	Yes (Second error uncorrectable from different address generates overflow)
Correctable error from Ax	Uncorrectable error from Ax	Yes	Yes	Ax	No (Both errors from same address latched)
Correctable error from Ax	Uncorrectable error from Ay	Yes	No	Ax	Yes (Second error uncorrectable from different address generates overflow)
Uncorrectable error from Ax	Correctable error from Ax	Yes	Yes	Ax	No (Both errors from same address latched)
Uncorrectable error from Ax	Correctable error from Ay	No	Yes	Ax	No (Second error correctable from different address ignored)
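
The table can be read as a small state machine per memory instance. The following is a minimal behavioral sketch in Python, not the authors' RTL: the class name `EccErrorLatch`, its `report` method and the addresses `AX` and `AY` are hypothetical names chosen for illustration, and the sketch only models errors that arrive before the first one has been serviced.

```python
from dataclasses import dataclass
from typing import Optional

AX, AY = 0x100, 0x200  # hypothetical example addresses

@dataclass
class EccErrorLatch:
    """Behavioral model of the per-memory latch summarized in the table."""
    single_bit: bool = False        # single-bit (correctable) flag
    multi_bit: bool = False         # multi-bit (uncorrectable) flag
    address: Optional[int] = None   # latched error address
    overflow: bool = False

    def report(self, address: int, uncorrectable: bool) -> None:
        if self.address is None:
            self.address = address          # first error: latch its address
        if address == self.address:
            # Same address: set the flag for this error type. A repeat of an
            # already flagged type changes nothing.
            if uncorrectable:
                self.multi_bit = True
            else:
                self.single_bit = True
        elif uncorrectable:
            # Uncorrectable error from a second address cannot be stored.
            self.overflow = True
        # A correctable error from a second address is ignored.

latch = EccErrorLatch()
latch.report(AX, uncorrectable=False)   # Correctable Ax
latch.report(AY, uncorrectable=True)    # followed by Uncorrectable Ay
print(latch)  # single-bit flag set, address Ax kept, overflow raised
```

Only one address register and three flag bits are kept per memory in this model, which reflects the point of the scheme: the overflow flag replaces the FIFO depth that would otherwise be needed to store a second uncorrectable address.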

How to verify this new implementation? :

The challenge here is to verify that all corner cases are covered and that no scenario is missed.

The usual practice is to try to cover the scenarios with directed test cases, but directed test cases are very likely to miss corner case bugs. They can only assure that all memories are connected properly to the central memory error manager.

The behavior of each IP can differ, in the sense that the traffic generated may be different for different IPs (e.g., the number of clocks between two reads relative to the time taken by the error management unit, which runs on a completely asynchronous clock). Hence some conditions will never be exercised by directed test cases executed by the system core. There can also be hundreds of memories that need to be verified. This calls for a scheme that generates all scenarios in a generic test case, so that the complete error management logic is verified with all possible corner cases and scenarios simulated. Such a test case covers all safety requirements related to error handling for the SoC. As the logic block for error handling is common, verifying it for a single memory instance ensures that the error handling logic will work for all memories.

In the generic test case, the errors were generated by forcing the memory read and write sequences at the memory boundary itself, so that we have full control over when an error is generated. This control is accurate to a single clock cycle, i.e., we can choose the clock cycle in which to perform the read access that generates the error. With this level of control over error generation, we can generate all possible scenarios for single as well as back-to-back errors. The error injection mechanism is provided in the SoC itself. We inject an error by writing through the normal procedure and then generate consecutive reads by forcing memory reads. Complete coverage of scenarios is achieved by sweeping the number of idle cycles between two memory reads.
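
The sweep described above can be sketched as a stimulus generator. This is a minimal Python illustration under stated assumptions, not the authors' testbench: the names `Stimulus` and `sweep`, the addresses `AX` and `AY`, and the upper bound of three idle cycles are all hypothetical, since the article does not give the sweep range.

```python
from itertools import product
from typing import Iterator, NamedTuple

AX, AY = 0x100, 0x200  # hypothetical example addresses

class Stimulus(NamedTuple):
    """One back-to-back error scenario; the first error always comes from AX."""
    first_uncorrectable: bool
    second_uncorrectable: bool
    second_address: int
    idle_cycles: int   # idle cycles between the two forced memory reads

def sweep(max_idle_cycles: int) -> Iterator[Stimulus]:
    """Yield every error-type and address combination at every idle-cycle gap."""
    for first, second, address, idle in product(
        (False, True), (False, True), (AX, AY), range(max_idle_cycles + 1)
    ):
        yield Stimulus(first, second, address, idle)

stimuli = list(sweep(max_idle_cycles=3))   # assumed upper bound for the sweep
print(len(stimuli))  # 8 categories x 4 gaps = 32
```

Each generated entry would drive one pair of forced reads in simulation. In practice the upper bound on idle cycles would presumably be chosen to exceed the time the asynchronous error management unit needs to capture the first error.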

Sample waveforms from the simulation are shown below. In the first figure, two multi-bit uncorrectable errors are generated on consecutive cycles from different addresses of the memory, resulting in an overflow.

In the figure below, one single-bit correctable error and one multi-bit uncorrectable error are generated one cycle apart from different addresses of the memory. The correctable error is ignored because it is less critical than the uncorrectable error, which is one of the safety requirements of the system.
