Need a watchdog for improved system fault tolerance?

By Suhas Chakravarty, Rohit Tomar, Mohit Arora, Freescale Semiconductor
CommsDesign (10/22/08, 11:41:00 PM EDT)

Embedded electronic control units are finding their way into increasingly complex safety-critical and mission-critical applications. Many of these applications operate in adverse conditions, which can cause code runaway in the embedded control units and put them into unknown states. A watchdog timer is the best way to bring the system out of an unknown state into a safe state. Given its importance, the watchdog has to be designed carefully to reduce the chances of its operation being compromised by runaway code. This paper outlines the need for a robust watchdog and the guidelines that must be considered when designing a fault-tolerant system monitor, also known as a watchdog. It describes new methods for refreshing a watchdog, a write-protection mechanism, early detection of code runaway and a quick self-test of the watchdog.

The need for fault tolerant systems

Electronic control units (ECUs) are fast becoming ubiquitous. Among other areas, they are increasingly finding their way into safety-critical and mission-critical applications, such as automobile safety systems, aircraft fly-by-wire controls and spacecraft thrust controls. These control systems are expected to work reliably under all environmental conditions. In a real environment, however, the software running on the ECU does experience faults, which may lead to a partial or total system crash. It is therefore of the utmost importance that the system displays a high degree of fault tolerance, so that if and when faults such as software crashes happen, the system is able to recover quickly and return to a safe state.

A good example of a mission-critical and safety-critical application is spacecraft thrust control. One of the most delicate operations carried out in outer space is the docking of two spacecraft. Precise direction control and maneuvering are required to line up the two bodies properly so that they can dock. The system controlling the spacecraft's thrusters must work flawlessly. A software crash in the thrusters' ECU could result in the thrusters firing for too long, or at the wrong angle, or both, and a collision would result instead of a docking. A safety mechanism must be in place that can detect faults and put the ECU into a safe state before the thrusters start firing unpredictably.

Another critical application is the use of a robotic arm in surgery, which is becoming commonplace in advanced medical facilities. These systems can enhance the ability of physicians to perform complex procedures with minimal intervention. During an operation, the physician initiates a particular procedure, say a fine incision in a vital organ, and control then passes completely to the robotic arm wielding the scalpel. If a software failure happens while the robot is at work, the robotic arm could behave unpredictably, posing a risk to the patient. If the system is able to recover quickly from such crashes, the robotic operation can halt and the physician can take appropriate further action. The operating room of the future is envisioned as a fully automated cell, in which surgery would be carried out by robotic arms under remote supervision from anywhere around the globe. Fault tolerance then becomes much more critical, owing to the increased dependency on the system.

The above examples serve to highlight the need for fault-tolerant systems. Looking ahead, it is not just automotive, industrial, aeronautical, medical and space applications that need fault tolerance. With the introduction of the IEC 60730 standards, even automatic electronic controls in household appliances are required to ensure the safe and reliable operation of their components.
