A Platform-Based Technology for Fault-Robust SoC Design

Riccardo Mariani, YOGITECH SpA
San Martino Ulmiano (PI) - ITALY

Abstract :

When designing a System-on-Chip (SoC) for safety-critical or high-reliability applications, the system architect faces a large design space: many kinds of faults can affect the SoC, these faults can generate different failures, and a wide set of techniques, each with its own efficiency and cost, can be used to detect, confine or stop the resulting hazards. This paper proposes a systematic platform-based technology in which a library of reusable IPs (HW and SW) is used together with a set of tools and methodologies to find the optimum solution in this design space, following the IEC 61508 guidelines.

Introduction

If we define “robustness” as the ability to continue the mission reliably despite the existence of systematic or random faults [1], we can say that modern electronic systems are becoming less and less robust. This is due to the complexity of the new technologies, e.g. soft-error susceptibility, coupling effects, leakage contribution, and increased sensitivity to internal and external disturbances. Things are made worse by the fact that deep-sub-micron technologies and multi-CPU electronic systems are now used in high-volume applications where safety is a key factor: in the automotive industry, electronic systems are involved in airbags, active brakes, engine control and future x-by-wire cars. In such a scenario, there is a growing need for tools, methodologies and HW/SW architectures with which engineers can manage robustness, safety and related costs at all levels of abstraction.

International norms exist to define requirements for safety, such as IEC 61508 for the functional safety of electrical/electronic/programmable electronic safety-related systems [2]. Even if these norms generally refer to complete systems and not to Systems-on-Chip, they also contain precise guidelines and requirements for the system subcomponents, including CPUs, memory systems, bus infrastructure and so on. An extension of this norm to ASICs is likely to appear in the coming months. One of the basic concepts of IEC 61508 is the “safety integrity level” (SIL), i.e. the discrete level (one out of a possible four) for specifying the safety integrity requirements of the safety functions to be allocated to the safety-related systems, where SIL 4 is the highest level of safety integrity and SIL 1 the lowest. Typical automotive systems require SIL2 or SIL3.
An important role is played by the “Beta Factor”, i.e. the probability of common-cause failures, which could become a limiting factor especially when multiple, functionally equal channels are implemented in the same silicon.

Three classical examples of approaches for safety-critical microcontrollers are the Delphi SMA [3] and the two Bosch architectures, mutual and asymmetric [4]. Analyzing them shows that, on one side, a dual-redundant solution can guarantee the desired diagnostic coverage, but it requires a significant HW, SW and power overhead and it lacks diversity: most of the diagnostics are done with a second CPU equal to the first one, so the solution is prone to common-mode errors. On the other side, currently available asymmetric solutions suffer from low diagnostic coverage, and a high SW overhead is needed to compensate for that. Other solutions, such as including fault detection and correction circuitry in the CPU itself, are expensive in terms of CPU redesign, and they violate a basic guideline of IEC 61508, which requires the diagnostic logic to be clearly separated from the safety function itself. These intrusive techniques are therefore more suitable for increasing CPU dependability than for enabling the design of a safety-relevant electronic system. It is also worth noting that most state-of-the-art techniques rely mainly on SW to guarantee the needed diagnostic coverage for the other components of the microcontroller (such as the memory subsystem, peripherals and bus infrastructure). The SW development and qualification process therefore significantly impacts the overall costs, especially since the majority of these solutions are poorly scalable and reusable.

Platform-based approaches [5] are known to be a proven solution to this problem, since in general they allow hardware and software standardization, they are adaptable to the user’s needs, they can generate clean code and scripts in a well-defined flow, they are easily linkable with the operating system, they allow easy verification of sub-blocks, and they are upgradeable and scalable. For safety-critical or high-reliability applications, the standard platform-based approach must be enriched with all the measures (HW and SW) that can guarantee, with the same quality, reusability and flexibility, the needed level of safety integrity for the given application: this is the mission of the technology presented in this paper.

A platform-based technology for fault robustness

This paper proposes a platform-based technology, “faultRobust” (schematically represented in Figure 1), composed of [6]:

a design and validation methodology in adherence with IEC 61508, including FMEA and Fault Injection;

a flexible and configurable library of HW IPs complementing (wrapping) the SoC sub-systems: each of these faultRobust IPs (fRIPs) can be used standalone or combined with other fRIPs for a complete solution;

a tool suite based on existing compilers to handle the HW-SW integration and configuration flow.

Figure 1 : the proposed platform-based technology

The design and validation methodology is one of the key points. A tool suite extracts information from the Safety Requirements Specification (SRS, a document required by IEC 61508 to define the safety goals) and from the design database (RTL, gate-level and back-end netlists), using scripts based on commercially available EDA tools. These data are entered into a very detailed Failure Mode and Effects Analysis (FMEA) worksheet. Fault models and failure modes are then considered in adherence to the IEC guidelines. Finally, the safe failure fraction (SFF) and the diagnostic coverage are automatically computed using statistical formulas embedded in the FMEA worksheet.
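As an illustration of the last step, the sketch below computes the two metrics from per-block failure rates using the standard IEC 61508 definitions: diagnostic coverage is the detected share of dangerous failures, and SFF is the share of failures that are either safe or dangerous but detected. The `FmeaRow` layout, the block names and the failure rates are assumptions made for this example; they are not taken from the actual worksheet.

```python
from dataclasses import dataclass


@dataclass
class FmeaRow:
    """One hypothetical worksheet row; rates in FIT (failures per 1e9 hours)."""
    block: str
    safe: float                  # lambda_S
    dangerous_detected: float    # lambda_DD
    dangerous_undetected: float  # lambda_DU


def diagnostic_coverage(rows):
    """DC = lambda_DD / (lambda_DD + lambda_DU), summed over all rows."""
    dd = sum(r.dangerous_detected for r in rows)
    du = sum(r.dangerous_undetected for r in rows)
    return dd / (dd + du)


def safe_failure_fraction(rows):
    """SFF = (lambda_S + lambda_DD) / (lambda_S + lambda_DD + lambda_DU)."""
    s = sum(r.safe for r in rows)
    dd = sum(r.dangerous_detected for r in rows)
    du = sum(r.dangerous_undetected for r in rows)
    return (s + dd) / (s + dd + du)


# Illustrative numbers only.
ROWS = [
    FmeaRow("cpu", safe=40.0, dangerous_detected=59.0, dangerous_undetected=1.0),
    FmeaRow("sram", safe=50.0, dangerous_detected=49.5, dangerous_undetected=0.5),
]
```

With these illustrative rows, `safe_failure_fraction(ROWS)` evaluates to 198.5 / 200 and `diagnostic_coverage(ROWS)` to 108.5 / 110.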

IEC 61508 highly recommends that fault modeling and fault injection be used intensively during the design, verification and validation flow. To comply with this, another tool suite performs design-level fault injection, working both at the lowest (transistor, gate) and at the highest (block) level. It is a combination of a Perl/C-based tool and Specman by Cadence. It is automatically linked to the FMEA and it uses an algorithm based on the operational profile, which speeds up the fault injection campaign. This fault injector is not dedicated to specific fault models: it can handle transient faults, permanent faults, combinations of the two, and customized fault models (modelled using the IEEE 1647 “e” language).
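The idea of a campaign that is independent of the fault model can be sketched in a few lines. The example below is not the Perl/C and Specman tool described above: it is a toy model that injects faults into a parity-protected word and classifies each outcome. The fault model names, the `campaign` function and the uniform random selection (instead of an operational profile) are all assumptions of this sketch.

```python
import random

FAULT_MODELS = ("flip", "stuck0", "stuck1", "double_flip")


def parity(word):
    """Even-parity bit of an integer word."""
    return bin(word).count("1") & 1


def inject(word, fault, bit, width=8):
    """Apply one fault model to bit `bit` of `word`."""
    mask = 1 << bit
    if fault == "flip":          # transient single-bit upset
        return word ^ mask
    if fault == "stuck0":        # permanent stuck-at-0
        return word & ~mask
    if fault == "stuck1":        # permanent stuck-at-1
        return word | mask
    if fault == "double_flip":   # combined fault on two adjacent bits
        return word ^ mask ^ (1 << ((bit + 1) % width))
    raise ValueError(fault)


def campaign(n_trials, width=8, seed=0):
    """Inject random faults and classify the outcome of each trial."""
    rng = random.Random(seed)
    counts = {"no_effect": 0, "detected": 0, "undetected": 0}
    for _ in range(n_trials):
        word = rng.randrange(1 << width)
        stored_parity = parity(word)
        faulty = inject(word, rng.choice(FAULT_MODELS), rng.randrange(width), width)
        if faulty == word:
            counts["no_effect"] += 1      # fault masked by the stored value
        elif parity(faulty) != stored_parity:
            counts["detected"] += 1
        else:
            counts["undetected"] += 1     # escapes the parity checker
    return counts
```

In this toy model the detected and undetected counts play the role of the dangerous-detected and dangerous-undetected entries that a real campaign would feed back into the FMEA. Double flips always escape a single parity bit, which is the kind of coverage gap a campaign is meant to expose.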

The protection of memory sub-systems

Embedded memories (static and dynamic RAMs and non-volatile memories) are still the most critical blocks concerning reliability, dependability and safety. Existing solutions have many limits, such as no adherence to specific safety norms; access time overhead due to the coders/decoders in the data path; memory area overhead due to the codes, in particular for Error Correction Codes (ECCs); protection degradation due to multiple errors; and low configurability. For SW faults, memory protection unit (MPU) techniques are used, but they are CPU-centric and they do not offer complete protection at system level, especially for multi-master systems-on-chip.

The proposed platform-based technology includes a configurable and reusable IP for the protection of memory sub-systems (fRMEM), based on previous work [7]. Besides the use of ECCs optimized for area, power and multiple-error detection probability, it adds proprietary techniques such as:

measures to fulfil the requirements of IEC 61508, including a self-checking architecture for the supervisor itself;

the “fast-track” option, enabling the highest operating frequency while maintaining the same level of ECC protection and without adding wait cycles;

the “scrubbing” option, which keeps the protection level and decreases the Failure In Time (FIT) while maintaining the same Hamming distance of the code and relieving the CPU of managing these operations;

the “shared memory” option, which reduces the memory code overhead due to ECC protection by storing codes in a separate memory (allowing data memories to be partitioned into pages with selectable protection levels), by sharing the code memory between different data memories, and by allowing the reuse of code memory for data;

the “distributed MPU” option, which provides distributed memory protection in multi-master architectures by checking whether a memory access by a given master fulfils rules such as read/write permissions, user/privileged mode and so on.
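To make the ECC and scrubbing mechanisms concrete, the following sketch uses the textbook Hamming(7,4) single-error-correcting code. This is an assumption chosen for brevity: the codes actually used in fRMEM are proprietary and are not described here. The `scrub` function walks a memory image and rewrites every word that needed correction, so that single errors do not accumulate into multiple errors that the code can no longer correct.

```python
DATA_POS = (3, 5, 6, 7)  # 1-based codeword positions that hold data bits


def syndrome(cw):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (cw >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(nibble):
    """Encode a 4-bit value into a 7-bit Hamming(7,4) codeword."""
    cw = 0
    for i, pos in enumerate(DATA_POS):
        if (nibble >> i) & 1:
            cw |= 1 << (pos - 1)
    # Set the parity bits (positions 1, 2, 4) that cancel the syndrome.
    s = syndrome(cw)
    for p in (1, 2, 4):
        if s & p:
            cw |= 1 << (p - 1)
    return cw


def decode(cw):
    """Return (nibble, corrected); corrects at most one flipped bit."""
    s = syndrome(cw)
    corrected = s != 0
    if corrected:
        cw ^= 1 << (s - 1)  # the syndrome is the position of the bad bit
    nibble = 0
    for i, pos in enumerate(DATA_POS):
        nibble |= ((cw >> (pos - 1)) & 1) << i
    return nibble, corrected


def scrub(memory):
    """Rewrite every correctable word in place; return how many were fixed."""
    fixed = 0
    for addr, cw in enumerate(memory):
        nibble, corrected = decode(cw)
        if corrected:
            memory[addr] = encode(nibble)
            fixed += 1
    return fixed
```

In hardware the scrubbing walk would be performed by the supervisor in the background rather than by CPU software, which is the point of the option described above.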

As shown in Figure 2, fRMEM is composed of two blocks wrapping the memory controller and interfacing with the memory. The Fault Protection Memory Manager (F-MEM) block includes all the options related to coding/decoding and other options such as fast-track and scrubbing. The Memory Controller Extension (MCE) block extends the memory controller and manages the way the bus interacts with the fRMEM, being responsible for functions such as the MPU. For tightly coupled memories or data-path memories, most of the functions are performed by F-MEM. For non-volatile memories, codes can be stored on a per-word or per-page basis.
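The kind of check that a distributed MPU performs on each bus access can be sketched as a rule lookup keyed by the requesting master. The `Rule` fields, the master names and the address map below are hypothetical, and the default-deny policy is a design choice of this sketch rather than a documented property of the MCE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical access rule for one master over one address window."""
    master: str
    start: int            # inclusive
    end: int              # exclusive
    read: bool
    write: bool
    privileged_only: bool


def access_allowed(rules, master, addr, is_write, privileged):
    """Return True if the first rule matching (master, addr) permits the access."""
    for r in rules:
        if r.master == master and r.start <= addr < r.end:
            if r.privileged_only and not privileged:
                return False
            return r.write if is_write else r.read
    return False  # no matching rule: deny by default


# Illustrative rule set for a two-master system.
RULES = [
    Rule("cpu", 0x0000, 0x1000, read=True, write=True, privileged_only=False),
    Rule("dma", 0x0800, 0x1000, read=True, write=False, privileged_only=False),
    Rule("cpu", 0x1000, 0x2000, read=True, write=True, privileged_only=True),
]
```

Because the rules are keyed by bus master rather than by CPU mode alone, a misbehaving DMA engine is caught by the same mechanism as an errant CPU task, which a CPU-centric MPU cannot do.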

The main features of fRMEM are: a memory code overhead from 3% to 30%, as a function of the coding scheme and the use of the shared memory option; negligible access time overhead with the fast-track option; a Failure In Time (FIT) decrease factor from 10^3 to 10^9, as a function of the coding scheme and the use of the scrubbing option; a gate count between 1K and 6.5K gates, depending on the configuration; ≥99% test coverage; ≥99% safe failure fraction (SIL3).
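For a sense of how code overhead varies with word width, the sketch below computes the number of check bits of a textbook extended Hamming (SEC-DED) code. This is only a reference point under that assumption: the overhead figures quoted for fRMEM depend on its own coding schemes and on the shared memory option, which are not modelled here.

```python
def secded_check_bits(data_bits):
    """Check bits of an extended Hamming (SEC-DED) code over `data_bits` bits."""
    r = 1
    # Smallest r with 2**r >= data_bits + r + 1 gives single-error correction.
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit adds double-error detection


def overhead(data_bits):
    """Check bits as a fraction of the data bits."""
    return secded_check_bits(data_bits) / data_bits
```

For example, a 32-bit word needs 7 check bits (about 21.9% overhead) and a 64-bit word needs 8 (12.5%), so wider protected words lower the relative cost.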

Figure 2 : the IP for protection of memories

The protection of CPU sub-systems

Sparse logic criticality is increasing as well, especially when deep-sub-micron or nanotechnologies are used. The fRCPU is an IP for the protection of a CPU sub-system. It is not a replication of the CPU: it is architecturally and functionally diverse, and because of that it strongly reduces the Beta Factor, as required by IEC 61508. It is HW-centric, i.e. most of the diagnostics are in HW: alarms are generated independently of the application and there is no performance impact on the CPU. It covers only what is really relevant to reach SIL3, allowing minimum HW overhead (and power consumption), minimum connection requirements and minimum complexity. In short, the fRCPU is composed (see Figure 3) of a CPU Sniffer Unit, which collects, compacts and codes signals from the CPU boundary, and a Shadow Processing Unit, which executes the same flow as the CPU and includes a register bank to store shadow values of the main CPU registers and a management unit to generate data/addresses. The fRCPU compares its results with the ones read from the CPU using a set of independent checkers supervising the different CPU ports. The fRCPU also includes a Coverage Monitor Unit (CMU) that provides run-time information on the current SFF with respect to the “conditions of use”, e.g. with respect to the FMEA assumptions. It acts as a “checker of the checker”. It also flags unexpected SW scenarios (such as infinite loops) and it allows dynamic pre-integration coverage analysis on application profiles.
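The shadow-and-compare principle can be illustrated with a toy model. Everything in the sketch below is an assumption made for illustration: the four-operation mini ISA, the trace format and the `ShadowChecker` class do not describe the real fRCPU, which works on signals sniffed at the CPU boundary and is architecturally diverse from the CPU it supervises.

```python
MASK = 0xFFFFFFFF  # 32-bit registers in this hypothetical mini ISA


def execute(op, a, b):
    """Reference semantics for the three ALU operations of the toy ISA."""
    if op == "add":
        return (a + b) & MASK
    if op == "sub":
        return (a - b) & MASK
    if op == "and":
        return a & b
    raise ValueError(op)


class ShadowChecker:
    """Keeps shadow registers and checks each result observed from the CPU."""

    def __init__(self, n_regs=4):
        self.regs = [0] * n_regs
        self.alarms = []

    def observe(self, op, rd, x, y, cpu_result):
        """Check one retired instruction; for 'li', x is the immediate value."""
        if op == "li":
            expected = x & MASK
        else:
            expected = execute(op, self.regs[x], self.regs[y])
        ok = cpu_result == expected
        if not ok:
            self.alarms.append((op, rd, expected, cpu_result))
        self.regs[rd] = expected  # shadow state follows the reference model
        return ok
```

A mismatch is recorded as an alarm without stalling or modifying the observed instruction stream, mirroring the statement that alarms are generated independently of the application.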

Figure 3 : the IP for protection of CPU

The protection of other sub-systems

The proposed platform-based technology is completed by a System Control Unit (fRSCU), which collects and synchronizes all the alarms coming from the fRCPU and from the other “remote” fault supervisors, such as fRMEM. Based on this information, it decides whether the system (CPU and peripherals) is in a wrong state and, based on the architectural safety requirements, it performs actions such as flagging the operating system, forcing some fail-safe hardware configuration and so on. The fRSCU is highly configurable by the user. The other “remote” supervisors are: bus supervisors (fRBUS), consisting of a set of blocks (decoders, arbiters, checkers) monitoring the sources and sinks of the bus interconnect; and peripheral supervisors (fRPERI), implementing a “hardware verification component”, i.e. a block where a subset of the protocol checks and assertions used to verify a given interface have been translated into hardware constructs. fRCPU, fRMEM, fRBUS and fRPERI communicate with the fRSCU through a dedicated robust on-chip interconnect (fRNET), which guarantees that information is transferred without errors between the different diagnostic units. It also ensures that safety-related information travels in a channel separate from mission data.
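A configurable alarm policy of the kind attributed to the fRSCU can be sketched as a mapping from alarm source to reaction, with the most severe reaction winning. The `Action` levels, the default policy and the rule that unknown sources force the fail-safe state are assumptions of this example, not the documented behaviour of the unit.

```python
from enum import IntEnum


class Action(IntEnum):
    """Hypothetical reactions, ordered by severity."""
    NONE = 0
    NOTIFY_OS = 1
    FAIL_SAFE = 2


# Illustrative user configuration: which supervisor triggers which reaction.
DEFAULT_POLICY = {
    "fRCPU": Action.FAIL_SAFE,
    "fRBUS": Action.FAIL_SAFE,
    "fRMEM": Action.NOTIFY_OS,
    "fRPERI": Action.NOTIFY_OS,
}


def decide(alarms, policy=DEFAULT_POLICY):
    """Return the most severe action required by the pending alarm sources."""
    worst = Action.NONE
    for source in alarms:
        # Sources missing from the policy are treated conservatively.
        worst = max(worst, policy.get(source, Action.FAIL_SAFE))
    return worst
```

Keeping the policy in a table separate from the decision logic reflects the statement that the unit is configurable by the user according to the architectural safety requirements.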

The proposed platform-based technology is hardware-centric, i.e. the major role is played by the hardware supervisors. However, in order to provide the best tradeoff between costs and benefits, it includes a set of SW fRIPs: these are very compact SW Built-In Self-Tests (BISTs) that make use of HW resources guaranteed by the HW fRIPs.

Results

Typical automotive applications requiring the highest safety integrity levels (SIL2 or SIL3) are passive and active safety systems, especially the latter. For instance, in ESP (Electronic Stability Program) and EPS (Electric Power Steering), the MCU plays a crucial role in the steering and braking control loop. The same holds for ACC (Adaptive Cruise Control) or ADC (Active Dynamic Control). For all these functions, safety integrity must be guaranteed and therefore a comprehensive safety-monitoring system is of fundamental importance.

The simplest architecture that uses the proposed platform-based technology is a single-CPU asymmetric solution [8], where a standard CPU-based microcontroller is complemented by the fRCPU and by some instances of other fRIPs. Results of a demonstrator based on the PHILIPS SJA2510 microcontroller and on an ARM968ES CPU show that SIL3 is reachable with this architecture. The fRCPU protection costs 20%-40% of the CPU gate count (Figure 4).

If system behaviors are clearly defined or if the application is fixed, “profiles” can be used to add more conditions to the SRS and therefore to deeply optimize the fRCPU. During development, a SW profiler is used to extract a profile for the range of applications of interest (e.g. detailed statistics about the instructions used in such applications). This profile is used to customize the SRS and to generate the best HW configuration for the fRCPU. After production, the fRCPU CMU is used to check whether a specific SW application is fully covered by the customized fRCPU. If it is not fully covered, the application can be adapted through the SW adapter to cover such “holes”.
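The profile-then-check flow can be sketched as follows, assuming that a profile is simply an instruction histogram and that a customized fRCPU is characterized by the set of instructions it covers. Both assumptions, as well as the function names and mnemonics, are simplifications made for this example.

```python
from collections import Counter


def profile(trace):
    """Instruction histogram of an executed trace (a list of mnemonics)."""
    return Counter(trace)


def coverage_holes(app_profile, covered_ops):
    """Return (covered fraction, holes) for an application profile.

    `holes` maps each instruction outside `covered_ops` to its use count.
    """
    total = sum(app_profile.values())
    holes = {op: n for op, n in app_profile.items() if op not in covered_ops}
    if total == 0:
        return 1.0, holes
    return 1.0 - sum(holes.values()) / total, holes


# Illustrative trace: ten executed instructions, one of them not covered.
TRACE = ["add"] * 6 + ["ld"] * 3 + ["div"]
```

Calling `coverage_holes(profile(TRACE), {"add", "ld"})` reports the single `div` as a hole, which is the information an adaptation step would need in order to rewrite the application around the uncovered instruction.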

Figure 4 : benchmark with dual-core

Conclusion

The proposed platform-based technology is suitable for multiprocessor architectures as well. If higher availability is needed, it can be used to enhance lockstep and mutually redundant solutions or other multiprocessor solutions such as [9]. In such a case, the fRSCU acts like the comparator of a dual-core approach, while two fRCPUs (one for each CPU) provide enough “clues” for the fRSCU to determine which of the two CPUs is faulty in case of a mismatch, allowing a fail-operational state.
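The arbitration idea can be sketched as a small decision function. A plain dual-core comparator can only tell that the two outputs differ; the per-CPU alarms supply the missing information about which side to distrust. The function name, the state labels and the rule that ambiguous cases fall back to the fail-safe state are assumptions of this sketch.

```python
def arbitrate(out_a, out_b, alarm_a, alarm_b):
    """Return (state, trusted_output) for one comparison step.

    out_a, out_b: outputs of the two CPUs.
    alarm_a, alarm_b: True if the supervisor of that CPU raised an alarm.
    """
    if alarm_a and not alarm_b:
        return "cpu_a_faulty", out_b   # continue on CPU B (fail-operational)
    if alarm_b and not alarm_a:
        return "cpu_b_faulty", out_a   # continue on CPU A (fail-operational)
    if not alarm_a and not alarm_b and out_a == out_b:
        return "ok", out_a
    # Mismatch without a clue, or both supervisors alarmed: cannot isolate.
    return "fail_safe", None
```

Without the alarm inputs, every mismatch would end in the last branch, which is why a comparator alone yields a fail-safe but not a fail-operational system.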

In summary, the proposed platform-based technology mainly aims at reducing the HW and SW costs needed to implement a fault-robust MCU in adherence with IEC 61508 SIL3. This is achieved by implementing an optimized HW CPU fault detection, by providing dedicated HW to replace, support or supplement SW tests, and by distributing robustness across the whole SoC. Being platform-based, it also targets scalability and flexibility by implementing a portable and reusable architecture.

References

[1] H. Tahne, “Safe and Reliable Computer Control: Systems Concepts and Methods”, Mech. Lab, Univ. Stock, 1996

[2] CEI International Standard IEC 61508, 1998-2000

[3] T. Fruehling, “Delphi Secured Microcontroller Architecture”, SAE Technical Paper, 2000-01-1052

[4] US Patent n.5436837 and n.5880568, and DE Patent n.19933086 by Robert Bosch GmbH

[5] A. SanGiovanni Vincentelli, “Platform-based Design and Software Design Meth. for Embedded Systems”, IEEE Design and Test, Nov-Dec 2001

[6] R. Mariani, M. Chiavacci, S. Motto, “Dependable microcontroller, method for designing a dependable microcontroller and computer program product therefor”, European Patent, EP1496435

[7] R. Mariani, G. Boschi, “A System Level Approach for Embedded Memory Robustness”, JSSE Special Issue: Papers selected from the 1st International Conference on Memory Technology and Design - ICMTD’05

[8] R. Mariani, P. Fuhrmann, B. Vittorelli, “Cost-effective Approach to Error Detection for an Embedded Automotive Platform”, SAE 2006 World Congress & Exhibition, April 2006, Detroit, MI, USA

[9] M. Peri, S. Pezzini, A. Ferrari, A. Sangiovanni-Vincentelli, M. Baleani, “Fault Tolerant Platforms for Automotive Safety Critical Applications”, CASES’03, Oct. 30–Nov. 2, 2003, San Jose, California
