Solutions for Soft Errors in System on Chip Designs

By Richard Phelan

Should designers worry about soft errors in embedded designs? With the prevalence of higher memory densities and finer process geometries, many design teams are now evaluating techniques to alleviate soft errors. Richard Phelan, product manager at ARM, discusses the soft error problem and the solutions ARM considered when specifying the latest ARM cores.

Soft errors represent a considerable cost and reputation challenge for today's chip manufacturers. In safety-critical applications, for example, unpredictable reliability can represent considerable risk, not only in terms of the potential human cost but also in terms of corporate liability, exposing manufacturers to potential litigation. In commercial consumer applications, there is again significant potential economic impact to consider. For high-volume, low-margin products, high levels of product failure may necessitate the costly management of warranty support or expensive field maintenance. Once again, the effect on brand reputation may be considerable. With ever-shrinking geometries and higher-density circuits, soft errors and reliability in complex System on Chip (SoC) design are set to become increasingly challenging for the industry as a whole.

The Soft Error Problem
Increasingly, manufacturers are viewing soft errors as a serious and significant issue. Certainly, the unpredictable nature of system malfunctions generated by soft errors necessitates the evaluation of potential soft error rates (SER) for any SoC system. Current research suggests that the average rate of failure for complex chips may be in excess of four errors per year.

In addressing the soft error issue, chip manufacturers are beginning to tackle some of the challenges. Those producing processors for desktop PC systems are working with some of the industry's most advanced technology processes. Compared to embedded systems, desktop processors now utilize large, high-density memories, which significantly increases their vulnerability to soft error failure. Embedded systems, such as those used in portable and wireless products, are generally more tolerant since they contain less memory and use processors designed to operate at lower clock speeds than PC systems. However, they are more likely to be used in safety-critical systems and consumer products where reliability is important. In addition, embedded processor manufacturers are increasingly turning to the latest process technologies to achieve low power and reduced cost, leading them to confront the soft error challenge too.

Potential routes to resolving the soft error conundrum include improvements to the manufacturing process, the use of hardware design techniques, and the application of system-level software precautions. The use of silicon-on-insulator substrates in the semiconductor manufacturing process reduces the charge-collection depth below the drain diode, thus lowering susceptibility to particle strikes. However, this gain dissipates with smaller geometries and, furthermore, the process is more expensive than traditional bulk CMOS. Other manufacturing refinements include the use of trench capacitors, higher cell capacitance, and single-transistor architectures to reduce SER. New memory technologies, such as magnetic RAM, promise complete soft-error immunity for memory implementations, but these processes are still some way from manufacturing production.

Managing Soft Errors in Memory

For tightly coupled memories (TCMs), one approach is to raise a software exception that executes error-handling code to refresh the TCM. However, the error-handling code itself may be subject to soft errors. In systems where the expected SER is relatively low, it may be acceptable to handle error correction by initiating a complete system reset.

Hamming codes provide a route to not only detecting and correcting single-bit errors but also detecting double-bit errors. By adding a 4-bit Hamming code to each byte, single-bit errors can be corrected largely independently of normal core operation, at the cost of a single-cycle stall. This approach offers the benefit of minimal impact on the bounded latency of fast interrupts, or the timing requirements of critical routines. For multi-bit errors, the correction mechanism becomes more complicated, incurring additional cost in terms of logic and performance.
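To make the single-bit scheme concrete, the sketch below shows how 4 check bits can protect an 8-bit value and how the resulting syndrome locates and corrects one flipped bit. It is a minimal software illustration, assuming a Hamming(12,8) layout with check bits at positions 1, 2, 4 and 8; the function names are hypothetical, and in a real core this logic sits in hardware alongside the RAM rather than in software.

/*
 * Illustrative sketch of a per-byte Hamming scheme: 4 check bits per
 * 8-bit data value (Hamming(12,8)). Hypothetical software model only.
 */
#include <stdint.h>
#include <stdio.h>

/* Encode one byte into a 12-bit codeword; check bits sit at positions 1,2,4,8. */
static uint16_t hamming12_encode(uint8_t d)
{
    const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};
    uint16_t cw = 0;
    for (int i = 0; i < 8; i++)                 /* scatter data bits */
        if (d & (1u << i))
            cw |= 1u << (data_pos[i] - 1);

    /* Check bit p covers every codeword position whose index has bit p set. */
    for (int p = 0; p < 4; p++) {
        unsigned parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if (pos & (1 << p))
                parity ^= (cw >> (pos - 1)) & 1u;
        if (parity)
            cw |= 1u << ((1 << p) - 1);
    }
    return cw;
}

/* Decode a codeword, correcting any single-bit error via the syndrome. */
static uint8_t hamming12_decode(uint16_t cw)
{
    unsigned syndrome = 0;
    for (int p = 0; p < 4; p++) {
        unsigned parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if (pos & (1 << p))
                parity ^= (cw >> (pos - 1)) & 1u;
        if (parity)
            syndrome |= 1u << p;
    }
    if (syndrome)                               /* non-zero syndrome = position of the flipped bit */
        cw ^= 1u << (syndrome - 1);

    const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};
    uint8_t d = 0;
    for (int i = 0; i < 8; i++)                 /* gather data bits back out */
        if (cw & (1u << (data_pos[i] - 1)))
            d |= 1u << i;
    return d;
}

int main(void)
{
    uint16_t cw = hamming12_encode(0xA7);
    cw ^= 1u << 6;                              /* simulate a single-bit upset */
    printf("recovered 0x%02X\n", (unsigned)hamming12_decode(cw));   /* prints 0xA7 */
    return 0;
}

In hardware terms, the syndrome computation and the corrective bit flip are what occupy the single stall cycle mentioned above: the corrected data is returned on the next cycle without software intervention.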
Double or triple redundancy - copying a memory area once or twice - is a further option that supports error detection. Here, the output of each memory copy is simply compared to detect errors. The approach is simple, delivers good performance, and provides a route to managing multiple simultaneous errors. However, an overhead is incurred through duplicating or tripling the memory and implementing the required control logic.

Further Memory Error Detection Options

Data cache RAMs can also be protected using a simple parity scheme. During write operations, parity data is written to the RAM, and detection of a parity error causes a data abort. Again, the Data Fault Address Register is updated with the failing address and the Fault Status Register records a parity error. Valid and dirty RAMs, however, require a different approach. These RAMs have per-bit write masks, which makes recalculating the parity difficult if it is stored on a per-byte basis. Instead, duplication enables errors in one of the RAMs to be detected by comparing the two copies.

Tightly Coupled Memories are large in order to contain the time-critical code necessary for embedded control applications, making them more susceptible to error than other level-1 memories. The optional addition of parity checking or error correction on TCMs can assist in improving fault tolerance. However, because error correction within a single clock cycle is difficult to achieve, utilizing a pause-and-repair scheme to extend the time available for undertaking repairs is advantageous. The introduction of stall cycles when an error is detected provides a window of opportunity for memory correction. If the memory error cannot be addressed, then an error input will indicate a prefetch abort or data abort. If, however, an error is returned from the TCM interface, the relevant Fault Address and Status Registers are updated to record this. For the Data TCM, an error will stall the core for several clock cycles before the data abort handler is initiated.

Addressing Logic Errors

Detecting and protecting the logic areas of a microprocessor from the effects of soft errors is difficult; available solutions often incur significant penalties in area and performance and are still not always 100 percent effective in resolving soft errors. Even when a solution does deliver the anticipated error detection, error correction remains hugely complex.

There are a variety of approaches that offer a route to resolving logic errors. Implementing a dual-processor configuration provides a route to detecting soft errors in the core logic: a functional error in one processor results in the two processors producing different outputs. However, this approach requires at least twice as much logic as a single-processor solution, and the additional logic on the critical path creates a performance penalty of between 10 and 20 percent. Other solutions also have a significant impact on performance, as they all require additional logic. These include implementing time redundancy at the end of each stage of the processor pipeline, the REESE approach (Redundant Execution Using Spare Elements), reverse instruction generation and comparison, and two-rail coding.

Code checking schemes for verifying logic circuits offer a relatively minor performance overhead and do not require major design changes, but designing the logic for generating the check code does present a significant challenge. The schemes include parity, weight-based codes and modulo weight-based codes, all of which operate by generating a code based on the input to the logic circuit and comparing it with the code recomputed from the output, delivering detection rates of between 95 and 99 percent.
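As a concrete illustration of the code-checking idea, the sketch below applies a simple modulo-3 residue check to a 32-bit adder: a check code is predicted from the inputs and compared with the code recomputed from the output. Because no power of two is divisible by three, any single flipped output bit changes the residue and is detected. This is a hypothetical software model of one member of the family, not the scheme used in any particular core; the function names and the fault injection in main are illustrative only.

/*
 * Illustrative modulo-3 residue check on a 32-bit adder.
 * Hypothetical software model of a code-checking scheme.
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Check code: the value reduced modulo 3. */
static uint32_t residue3(uint64_t v)
{
    return (uint32_t)(v % 3u);
}

/*
 * Compare the code predicted from the adder inputs with the code
 * recomputed from its output. Because 2^32 = 1 (mod 3), the carry-out
 * of a wrapped 32-bit sum folds back into the check with weight 1.
 */
static bool residue_check(uint32_t a, uint32_t b, uint32_t sum, uint32_t carry_out)
{
    uint32_t predicted = (residue3(a) + residue3(b)) % 3u;   /* from inputs  */
    uint32_t observed  = (residue3(sum) + carry_out) % 3u;   /* from output  */
    return predicted == observed;
}

int main(void)
{
    uint32_t a = 0xDEADBEEFu, b = 0x12345678u;
    uint64_t wide  = (uint64_t)a + b;            /* the "adder"            */
    uint32_t sum   = (uint32_t)wide;
    uint32_t carry = (uint32_t)(wide >> 32);

    printf("fault-free: %s\n", residue_check(a, b, sum, carry) ? "pass" : "FAIL");

    sum ^= 1u << 9;   /* simulate a particle strike flipping one output bit */
    printf("with fault: %s\n", residue_check(a, b, sum, carry) ? "pass" : "FAIL");
    return 0;
}

The hard part in practice, as noted above, is designing the prediction logic for arbitrary blocks; arithmetic units admit compact residue predictors, while random control logic generally does not.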
Looking to the Future

The most effective approach to tackling soft-error detection in processors is to implement steps to manage soft error issues at the manufacturing, design and software stages. Most importantly, addressing error detection at the design stage gives the system designer the opportunity to weigh the implications of dedicating more resources to error detection and correction against the corresponding impact on performance. ARM has implemented memory error detection mechanisms in the latest ARM cores and continues to research fault detection and correction for future products.