Solutions for Soft Errors in System on Chip Designs

By Richard Phelan

Should designers worry about soft errors in embedded designs? With the prevalence of higher memory densities and finer process geometries, many design teams are now evaluating techniques to alleviate soft errors. Richard Phelan, product manager at ARM, discusses the soft error problem and the solutions ARM considered when specifying the latest ARM cores.

Soft errors represent a considerable cost and reputation challenge for today's chip manufacturers. In safety-critical applications, for example, unpredictable reliability can represent considerable risk, not only in terms of the potential human cost but also in terms of corporate liability, exposing manufacturers to potential litigation. In commercial consumer applications, there is again significant potential economic impact to consider. For high-volume, low-margin products, high levels of product failure may necessitate the costly management of warranty support or expensive field maintenance. Once again, the effect on brand reputation may be considerable. With ever-shrinking geometries and higher-density circuits, soft errors and reliability in complex System on Chip (SoC) design are set to become increasingly challenging for the industry as a whole.

The Soft Error Problem
Increasingly, manufacturers are viewing soft errors as a serious and significant issue. Certainly, the unpredictable nature of system malfunctions generated by soft errors necessitates the evaluation of potential soft error rates (SER) for any SoC system. Current research suggests that the average rate of failure for complex chips may be in excess of four errors per year.

In addressing the soft error issue, chip manufacturers are beginning to tackle some of the challenges. Those producing processors for desktop PC systems are working with some of the industry's most advanced technology processes. Compared to embedded systems, desktop processors now utilize large, high-density memories, which significantly increases their vulnerability to soft error failure. Embedded systems, such as those used in portable and wireless products, are generally more tolerant since they contain less memory and use processors designed to operate at lower clock speeds than PC systems. However, they are more likely to be used in safety-critical systems and consumer products where reliability is important. In addition, embedded processor manufacturers are increasingly turning to the latest process technologies to achieve low power and reduced cost, leading them to confront the soft error challenge too.

Potential routes to resolving the soft error conundrum include improvements to the manufacturing process, the use of hardware design techniques, and the application of system-level software precautions. The use of silicon-on-insulator substrates in the semiconductor manufacturing process reduces the charge-collection depth below the drain diode, thus lowering susceptibility to particle strikes. However, this gain dissipates with smaller geometries and, furthermore, the process is more expensive than traditional bulk CMOS. Other manufacturing refinements include the use of trench capacitors, higher cell capacitance, and single-transistor architectures to reduce SER. New memory technologies, such as magnetic RAM, promise complete soft-error immunity for memory implementations, but these processes are still some way from manufacturing production.

Managing Soft Errors in Memory

For tightly coupled memories (TCMs), one approach is to raise a software exception that executes error-handling code to refresh the TCM. However, the error-handling code itself may be subject to soft errors. In systems where the expected SER is relatively low, it may be acceptable to handle error correction by initiating a complete system reset.

Hamming codes provide a route to not only detecting and correcting single-bit errors but also detecting double-bit errors. By adding a 4-bit Hamming code to each byte, single-bit errors can be corrected largely independently of normal core operation, at the cost of a single-cycle stall. This approach offers the benefit of minimal impact on the bounded latency of fast interrupts, or the timing requirements of critical routines. For multi-bit errors, the correction mechanism becomes more complicated, incurring additional cost in terms of logic and performance.
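To make the single-bit scheme concrete, the sketch below shows how 4 check bits can protect an 8-bit value and how the resulting syndrome locates and corrects one flipped bit. It is a minimal software illustration, assuming a Hamming(12,8) layout with check bits at positions 1, 2, 4 and 8; the function names are hypothetical, and in a real core this logic sits in hardware alongside the RAM rather than in software.

/*
 * Illustrative sketch of a per-byte Hamming scheme: 4 check bits per
 * 8-bit data value (Hamming(12,8)). Hypothetical software model only.
 */
#include <stdint.h>
#include <stdio.h>

/* Encode one byte into a 12-bit codeword; check bits sit at positions 1,2,4,8. */
static uint16_t hamming12_encode(uint8_t d)
{
    const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};
    uint16_t cw = 0;
    for (int i = 0; i < 8; i++)                 /* scatter data bits */
        if (d & (1u << i))
            cw |= 1u << (data_pos[i] - 1);

    /* Check bit p covers every codeword position whose index has bit p set. */
    for (int p = 0; p < 4; p++) {
        unsigned parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if (pos & (1 << p))
                parity ^= (cw >> (pos - 1)) & 1u;
        if (parity)
            cw |= 1u << ((1 << p) - 1);
    }
    return cw;
}

/* Decode a codeword, correcting any single-bit error via the syndrome. */
static uint8_t hamming12_decode(uint16_t cw)
{
    unsigned syndrome = 0;
    for (int p = 0; p < 4; p++) {
        unsigned parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if (pos & (1 << p))
                parity ^= (cw >> (pos - 1)) & 1u;
        if (parity)
            syndrome |= 1u << p;
    }
    if (syndrome)                               /* non-zero syndrome = position of the flipped bit */
        cw ^= 1u << (syndrome - 1);

    const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};
    uint8_t d = 0;
    for (int i = 0; i < 8; i++)                 /* gather data bits back out */
        if (cw & (1u << (data_pos[i] - 1)))
            d |= 1u << i;
    return d;
}

int main(void)
{
    uint16_t cw = hamming12_encode(0xA7);
    cw ^= 1u << 6;                              /* simulate a single-bit upset */
    printf("recovered 0x%02X\n", (unsigned)hamming12_decode(cw));   /* prints 0xA7 */
    return 0;
}

In hardware terms, the syndrome computation and the corrective bit flip are what occupy the single stall cycle mentioned above: the corrected data is returned on the next cycle without software intervention.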
Double or triple redundancy - copying a memory area once or twice - is a further option that supports error detection. Here, the output of each memory copy is simply compared to detect errors. The approach is simple, delivers good performance, and provides a route to managing multiple simultaneous errors. However, an overhead is incurred through duplicating or tripling the memory and implementing the required control logic.

Further Memory Error Detection Options

Data cache RAMs can also be protected using a simple parity scheme. During write operations, parity data is written to the RAM, and detection of a parity error causes a data abort. Again, the Data Fault Address Register is updated with the failing address and the Fault Status Register records a parity error. Valid and dirty RAMs, however, require a different approach. These RAMs have per-bit write masks, which makes recalculating the parity difficult if it is stored on a per-byte basis. Instead, duplication enables errors in one of the RAMs to be detected by comparing the two copies.

Tightly Coupled Memories are large in order to contain the time-critical code necessary for embedded control applications, making them more susceptible to error than other level-1 memories. The optional addition of parity checking or error correction on TCMs can assist in improving fault tolerance. However, because error correction within a single clock cycle is difficult to achieve, utilizing a pause-and-repair scheme to extend the time available for undertaking repairs is advantageous. The introduction of stall cycles when an error is detected provides a window of opportunity for memory correction. If the memory error cannot be addressed, then an error input will indicate a prefetch abort or data abort. If, however, an error is returned from the TCM interface, the relevant Fault Address and Status Registers are updated to record this. For the Data TCM, an error will stall the core for several clock cycles before the data abort handler is initiated.

Addressing Logic Errors

Detecting and protecting the logic areas of a microprocessor from the effects of soft errors is difficult; available solutions often incur significant penalties in area and performance and are still not always 100 percent effective in resolving soft errors. Even when a solution does deliver the anticipated error detection, error correction remains hugely complex.

There are a variety of approaches that offer a route to resolving logic errors. Implementing a dual-processor configuration provides a route to detecting soft errors in the core logic: a functional error in one processor results in the two processors producing different outputs. However, this approach requires at least twice as much logic as a single-processor solution, and the additional logic on the critical path creates a performance penalty of between 10 and 20 percent. Other solutions also have a significant impact on performance, as they all require additional logic. These include implementing time redundancy at the end of each stage of the processor pipeline, the REESE approach (Redundant Execution Using Spare Elements), reverse instruction generation and comparison, and two-rail coding.

Code checking schemes for verifying logic circuits offer a relatively minor performance overhead and do not require major design changes, but designing the logic for generating the check code does present a significant challenge. The schemes include parity, weight-based codes and modulo weight-based codes, all of which operate by generating a code based on the input to the logic circuit and comparing it with the code recomputed from the output, delivering detection rates of between 95 and 99 percent.
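As a concrete illustration of the code-checking idea, the sketch below applies a simple modulo-3 residue check to a 32-bit adder: a check code is predicted from the inputs and compared with the code recomputed from the output. Because no power of two is divisible by three, any single flipped output bit changes the residue and is detected. This is a hypothetical software model of one member of the family, not the scheme used in any particular core; the function names and the fault injection in main are illustrative only.

/*
 * Illustrative modulo-3 residue check on a 32-bit adder.
 * Hypothetical software model of a code-checking scheme.
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Check code: the value reduced modulo 3. */
static uint32_t residue3(uint64_t v)
{
    return (uint32_t)(v % 3u);
}

/*
 * Compare the code predicted from the adder inputs with the code
 * recomputed from its output. Because 2^32 = 1 (mod 3), the carry-out
 * of a wrapped 32-bit sum folds back into the check with weight 1.
 */
static bool residue_check(uint32_t a, uint32_t b, uint32_t sum, uint32_t carry_out)
{
    uint32_t predicted = (residue3(a) + residue3(b)) % 3u;   /* from inputs  */
    uint32_t observed  = (residue3(sum) + carry_out) % 3u;   /* from output  */
    return predicted == observed;
}

int main(void)
{
    uint32_t a = 0xDEADBEEFu, b = 0x12345678u;
    uint64_t wide  = (uint64_t)a + b;            /* the "adder"            */
    uint32_t sum   = (uint32_t)wide;
    uint32_t carry = (uint32_t)(wide >> 32);

    printf("fault-free: %s\n", residue_check(a, b, sum, carry) ? "pass" : "FAIL");

    sum ^= 1u << 9;   /* simulate a particle strike flipping one output bit */
    printf("with fault: %s\n", residue_check(a, b, sum, carry) ? "pass" : "FAIL");
    return 0;
}

The hard part in practice, as noted above, is designing the prediction logic for arbitrary blocks; arithmetic units admit compact residue predictors, while random control logic generally does not.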
Looking to the Future

The most effective approach to tackling soft-error detection in processors is to implement steps to manage soft error issues at the manufacturing, design and software stages. Most importantly, addressing error detection at the design stage gives the system designer the opportunity to weigh the implications of dedicating more resources to error detection and correction against the corresponding impact on performance. ARM has implemented memory error detection mechanisms in the latest ARM cores and continues to research fault detection and correction for future products.