Lorenzo Fasanelli (Ericsson); Lorenzo Lupini and Massimo Quagliani (Marconi S.p.A.)
embedded.com (March 23, 2014)
In technical literature, there are many valid ways to define a system, and an embedded system in particular. For the purposes of this article we will use one of the most general and classic models of systems theory: a system is an interconnection of subunits that can be modeled by data and control inputs, a state machine, and data and control outputs.
What turns this into an embedded system is that most of it is hidden and inaccessible, and it is often subject to real-time constraints: a sort of black box that must react to a set of stimuli with an expected behavior according to the end user's (customer's) perception of the system. Any deviation from this behavior is reported by the customer as a defect, a fault.
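As a minimal illustration of this model, consider the C sketch below. It is not taken from any real product; the type and function names (sys_state_t, sys_step, and so on) are assumptions made purely for the example, showing data and control inputs driving a state machine that produces data and control outputs.

```c
/* Illustrative sketch of the system model used in this article: data and
 * control inputs feed a state machine that produces data and control
 * outputs. All names are hypothetical, not part of any real API. */
#include <stdint.h>

typedef enum { ST_IDLE, ST_RUNNING, ST_FAULT } sys_state_t;

typedef struct {
    uint32_t data_in;     /* data input sampled this cycle     */
    uint8_t  ctrl_in;     /* control input (e.g. start/stop)   */
} sys_inputs_t;

typedef struct {
    uint32_t data_out;    /* data output produced this cycle   */
    uint8_t  ctrl_out;    /* control output (e.g. ready flag)  */
} sys_outputs_t;

/* One step of the state machine: next state and outputs are a
 * function of the current state and the inputs. */
static sys_state_t sys_step(sys_state_t state,
                            const sys_inputs_t *in,
                            sys_outputs_t *out)
{
    switch (state) {
    case ST_IDLE:
        out->ctrl_out = 0;
        return in->ctrl_in ? ST_RUNNING : ST_IDLE;
    case ST_RUNNING:
        out->data_out = in->data_in * 2u;   /* placeholder processing */
        out->ctrl_out = 1;
        return in->ctrl_in ? ST_RUNNING : ST_IDLE;
    default:
        return ST_FAULT;                    /* unexpected state */
    }
}
```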
At the time of their occurrence, faults are characterized by their impact on product functionality and by the chain of events that led to their manifestation. Deciding how to handle a fault when it pops up, and how to compensate for its effects, is typically a static design issue dealing with allowed tolerance levels and system functional constraints. On the other hand, collecting run-time information about the causes of the system misbehavior should be a dynamic process, as flexible and scalable as possible.
As a matter of fact, fault handling strategies are commonly defined at the system design stage. They result from balancing the drawbacks that naturally arise when a minimal degree of divergence from the expected behavior is allowed, and from sizing the threshold above which the effects have to be considered unacceptable.
When the degradation of performance resulting from the fault is such that countermeasures are needed, recovery actions have to be well defined by specifications. The occurrence of an unexpected behavior, by its nature unwanted, does not necessarily mean a complete loss of functionality; that is why establishing back-off solutions is typically part of the design process.
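One common way to encode such a design-time threshold is sketched below. This is only a hedged illustration: the counter, the FAULT_THRESHOLD value, and the recover_service() hook are assumptions made for the example, not a specification of any particular product.

```c
/* Hedged sketch: a design-time tolerance threshold above which a
 * recovery action is triggered. The threshold, the counter and the
 * recovery hook are illustrative, not taken from a real product. */
#include <stdint.h>
#include <stdbool.h>

#define FAULT_THRESHOLD  3u   /* tolerated consecutive deviations */

static uint32_t fault_count;

/* Hypothetical recovery hook defined elsewhere by the specification. */
extern void recover_service(void);

void report_deviation(bool within_tolerance)
{
    if (within_tolerance) {
        fault_count = 0;              /* behavior back to nominal */
        return;
    }
    if (++fault_count >= FAULT_THRESHOLD) {
        recover_service();            /* countermeasure defined at design time */
        fault_count = 0;
    }
}
```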
Depending on the system's nature, recovery actions can be handled by implementing redundancy or by allowing temporary degradation of the service while corrections are performed.

By contrast, defining strategies to collect data for debug purposes is a process that is often left out, trusting the filtering applied at test stages. This deficiency is usually due to strict timing constraints in the development phase and a lack of system resources, resulting from fitting a maximum of functions into the product. Often, when a malfunction has to be dealt with in the field, strategies are decided and measures are set up on the fly.
One of the problems with embedded systems is that they are indeed embedded; that is, information accessibility is usually far from granted. During the several test phases a product goes through, designers and troubleshooters can usually make use of intrusive tools, like target debuggers and oscilloscopes, to isolate a fault. When the product is in service, instead, it is often impossible to use such instruments, and the available investigation tools may not be sufficient to identify the root cause of the problem within a time that is reasonable from the customer's perspective. Moreover, establishing some sort of strict synchronization between recording instruments and internal fault detection is not always possible, with the result that data collected at the inputs/outputs cannot clearly be tied to the fault occurrence itself and has to be correlated manually.
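One low-cost way to ease that correlation, sketched below under assumed platform hooks, is to emit an externally visible marker (a debug pin pulse plus a timestamped log line) when the fault is detected internally, so that oscilloscope or bus-analyzer captures can be aligned with it. The gpio_set(), gpio_clear() and monotonic_us() helpers are hypothetical, not a real API.

```c
/* Hedged sketch: make an internal fault detection visible to external
 * recording instruments. Platform hooks below are hypothetical. */
#include <stdint.h>
#include <stdio.h>

extern void     gpio_set(int pin);
extern void     gpio_clear(int pin);
extern uint64_t monotonic_us(void);   /* free-running microsecond clock */

#define FAULT_MARKER_PIN  7           /* illustrative debug pin */

void mark_fault(uint32_t fault_code)
{
    uint64_t now = monotonic_us();

    gpio_set(FAULT_MARKER_PIN);       /* visible edge on the scope trace */
    printf("FAULT %lu at %llu us\n",
           (unsigned long)fault_code, (unsigned long long)now);
    gpio_clear(FAULT_MARKER_PIN);
}
```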
The other problem is fault localization. As the system grows in complexity, possible deviations from the expected operation increase. This is one reason (but not the only one) why the more complex the system is, the greater the distance between the fault and its symptoms can be. Symptoms alone are not sufficient to identify the root cause of a problem. The relevant information is hidden in the inputs and in the status of the system when the fault occurred, but in most cases this information is gone forever.
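One way to keep that information from vanishing is to retain the recent inputs and state in a small circular buffer, frozen at the moment a fault is detected. The sketch below is purely illustrative: the buffer depth, field names and freeze policy are assumptions for the example, not a prescribed implementation.

```c
/* Hedged sketch: a ring buffer of recent inputs and state, frozen when
 * a fault is caught, so the history leading up to it can be dumped
 * later instead of being lost. Sizes and fields are illustrative. */
#include <stdint.h>

#define HISTORY_DEPTH 16u

typedef struct {
    uint32_t inputs;      /* raw input word sampled this cycle */
    uint16_t state;       /* state-machine state at that time  */
    uint64_t timestamp;   /* when the sample was taken         */
} trace_entry_t;

static trace_entry_t history[HISTORY_DEPTH];
static uint32_t      head;
static int           frozen;          /* set once a fault is caught */

void trace_record(uint32_t inputs, uint16_t state, uint64_t timestamp)
{
    if (frozen)
        return;                       /* preserve the pre-fault history */
    history[head].inputs    = inputs;
    history[head].state     = state;
    history[head].timestamp = timestamp;
    head = (head + 1u) % HISTORY_DEPTH;
}

void trace_freeze_on_fault(void)
{
    frozen = 1;                       /* buffer later dumped for analysis */
}
```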
Furthermore, symptoms may help narrow down the faulty area, but they can also mislead the investigation. It can happen that a system degenerates before the user perceives it, so that the symptoms noticed are only secondary effects of the uncontrolled process set off by the fault. Thus, providing some mechanism to enhance self-awareness in the system is a real benefit in the error localization process. A specific section of this paper will be dedicated to this aspect.
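As a taste of what such a self-awareness mechanism might look like, the hedged sketch below records where an internal invariant first broke instead of waiting for secondary symptoms to surface. The SELF_CHECK macro and the record layout are assumptions made for illustration, not a standard facility.

```c
/* Hedged sketch of a lightweight runtime self-check: unlike assert(),
 * it records the first violation and lets the system keep running (or
 * degrade gracefully) rather than halting. Names are illustrative. */
#include <stdint.h>

typedef struct {
    const char *file;
    int         line;
    uint32_t    detail;   /* implementation-defined context word */
} self_check_record_t;

static self_check_record_t first_violation;
static int violation_seen;

static void self_check_failed(const char *file, int line, uint32_t detail)
{
    if (violation_seen)
        return;                        /* keep the earliest evidence */
    first_violation.file   = file;
    first_violation.line   = line;
    first_violation.detail = detail;
    violation_seen = 1;
}

#define SELF_CHECK(cond, detail) \
    do { if (!(cond)) self_check_failed(__FILE__, __LINE__, (detail)); } while (0)
```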
To summarize, the fault handling process typically puts the accent on fault prevention (techniques to minimize the number of failures) and on fault tolerance (how the system should react to avoid loss of performance after a failure), since these are undoubtedly the core of a system's reliability. The techniques that speed up troubleshooting during integration tests and maintenance phases, which constitute the core of the error detection process, are often left to a heuristic approach that the following chapters of this paper will try to systematize within an organic view.
The aim of this paper is to suggest a simple approach to the problem of fault detection and to provide some hints on how to design debug features into embedded systems that have real-time constraints and scarce memory resources. We will discuss:
- Fault localization: which elements are necessary to isolate a fault
- How to collect useful data from a fault
- How to trace runtime events
- Post-mortem debug and diagnostics