Avoid corruption in nonvolatile memory

Absolute power corrupts absolutely. But if you're only concerned about possible corruption of code or data stored in a nonvolatile memory, you can decrease the odds of that happening. This article will empower you to do just that.

Embedded systems often include in-system programmable nonvolatile memory to store both program contents and relevant operating data. The strength and utility of in-system programmable memory is also its major weakness: the contents can be changed by the system in which it resides, leaving the system vulnerable to corruption during normal operation. Data stored in the nonvolatile memory is usually critical to proper system operation, and corruption of that data can lead to system failure, hardware damage, and even unsafe operating conditions.

My personal experience indicates that corruption of nonvolatile data is inevitable, and it's impractical for a designer to assume that a system will never experience it. The design of an embedded system with nonvolatile memory is, therefore, an exercise in minimizing the occurrence and impact of data corruption. The level of confidence required varies by system and depends on the desired reliability as well as the repercussions of possible failures induced by a corrupting event.

In this article, I will familiarize you with likely corruption mechanisms as well as methods that minimize the probability and impact of corrupted nonvolatile data in an embedded system.

Nonvolatile memory

Parallel interface memories connect directly to the native processor address and data bus and include flash, EEPROM, battery-backed SRAM, and NVRAM chips. Serial interface memories are divided into the subgroups of "two-wire" and "three-wire" interfaces (because their behavior differs dramatically) and are usually found in EEPROM devices. Finally, custom interface memories are devices with proprietary interfaces.
One example of a custom interface memory is a software-controlled potentiometer like an EEPOT using a two-wire up/down interface.

I won't spend a lot of time on the particulars of the various technologies since you can easily research them on your own. In Table 1, however, I've listed the various types of nonvolatile memory commonly found in embedded systems and included characteristics for each type of memory relating to concepts in this article.

Table 1: Characteristics of selected nonvolatile memory types

Keep in mind that the entries are relative judgments based on my own experience using these memories in real systems designed by engineers of average competence. Also note that the characteristics listed are typical for the most commonly used parts under each memory type; parts do exist with or without the features listed, but engineers don't use these parts as often.

Corruption sources, types

Real embedded systems have power supplies that rise and fall, logic that goes inactive or invalid, noise transients, and numerous other "out of specification" regions of operation during everyday use. Data sheets only specify IC performance under controlled conditions and avoid most of the imperfect operating possibilities imposed by use in a real system. What does a 400kHz two-wire EEPROM do when the supply voltage is at half its specified value and random 20ns pulses are feeding into its inputs? What happens when a write operation is posted to an EEPROM and power is removed before the specified maximum store time? The behavior of a part under these transient conditions is not and cannot be specified or accurately predicted.

Most conditions that lead to corruption occur during power up and power down of the system. Some causes are internal to the memory; others derive from the devices connected to the memory, which explains why the interface type often influences the probability and type of corruption that occurs.
The transitions and levels of power, chip enable, write/output enable, data, address, and hardware protection pins can all contribute to placing the memory in a corruptible state. Analyzing and measuring the state of these pins as you apply and remove power may help you understand which lines are vulnerable, but often you won't identify the actual cause of corruption. Prevention, therefore, is key.

Corruption also occurs when a properly executed write request is interrupted by the removal of power before the internal operation completes within the chip, leading to partially erased blocks or corrupt bytes. An update to a data structure that consists of multiple bytes may also be interrupted partway through, leaving a corrupt structure in memory when the system is powered up on the next cycle.

Software errors in the system code are one of the most often overlooked sources of corruption. Coding errors can cause violations of write/erase cycle timing constraints or data structure corruptions. These may result from unexpected behavior during multitasking and interrupt handler collisions.

Corruption protection

To explain this protection process, I define a nonvolatile corrupting event as any change in memory contents that adversely affects a critical system function. Corruption protection is added to the system to prevent corrupting events. Such protection consists of additional software and hardware and is implemented through a combination of two basic methods: preventing unintentional changes to memory contents (data integrity protection) and preventing unintended data changes from affecting system functionality (data robustness protection). All systems should incorporate methods of both types because the strengths of one cover the weaknesses of the other. While the two methods increase the immunity of the embedded system, they do so at the expense of simplicity: both add complexity to either the hardware or software designs.
The true art of designing corruption protection is to find the correct balance between protection and added complexity. Table 2 presents an overview of some common corruption protection methods and general characteristics based on my own experience using these methods in actual embedded systems. Again, these are judgment calls; your assessment and success with a particular method may vary from my own.

Table 2: Characteristics of common corruption protection methods

Data integrity

Whenever possible, I use parts that have both software and hardware data integrity protection, and I always require both when using serial interface parts (a one-bit change in the serial stream to a two-wire interface part can turn a read into a write). I also consider protection that disables memory accesses as a function of the power-supply voltage a highly desirable feature for any memory selected, and I always try to select parts that offer the added security of software write protection (based on unique unlock sequences).

After memory type selection, the second most effective method of preventing corruption is to incorporate a well-designed reset circuit for the microprocessor. Data integrity in the nonvolatile-memory subsystem is greatly enhanced by properly designed reset circuitry, which prevents undefined processor operation during power up, power down, and transient conditions. Vendor-supplied single-chip solutions are usually the best choice, provided they're used intelligently. Proper threshold, reset time, output levels, and any external signal inputs must all be considered when selecting a reset supervisor. Reset thresholds for single-supply systems are usually set to the highest absolute minimum operating-supply voltage for all the components in a circuit, and multiple supplies require multiple reset-threshold circuits.
You must set the reset time to be significantly longer than the time it takes the power supply, clocks, and other components (such as configurable FPGAs) to settle and begin normal operation. Output levels and drive need to match the requirements of the processor and all other circuitry attached to the reset output, from significantly below the minimum operating voltage up to the maximum expected power-supply output voltage. You can add external inputs from other devices, such as FPGA configuration done signals, to the reset circuitry. External inputs are also useful for removing power-up sequence ambiguity that can lead to corruption.

A power-fail interrupt, found on some reset supervisors, can also help if the drivers that write to the memory are properly written. The driver can detect the power failure, finish pending writes to memory, prevent partially updated data structures, and issue commands to lock the nonvolatile memory before the system enters the undefined operating regions that cause most of the corrupting events.

You can also use hardware protection pins, such as write disable and block protection pins, to increase data integrity in embedded systems, but their use is often less than straightforward. The problem with using such pins is that you must characterize the pin-drive circuitry far outside the normal specified operating region of the logic and memory contained in the system. The behavior of an I/O port pin on the processor may be specified for the 3 to 3.6V operating input voltage range, but can you determine what that pin is doing at a 2V power-supply input to the part? Additional logic only complicates the picture and usually leads me to reject hardware pins as protection devices unless they're directly controlled by a power supervisor with defined behavior below the minimum functional voltage (which is much lower than the "specified operating voltage") of the memory.
I also resist the use of gating circuitry that's not properly "power-supply managed" for the same reasons.

Using software write-locked memories greatly enhances system-data integrity. A software write-locked memory requires a special sequence of writes and reads to occur before the part allows any write requests to proceed. Another sequence should be available in the device to lock the part after writing is complete. The longer and more unusual the unlock sequence, the better the part is protected. Software write protection is much more effective if the memory is automatically forced back into the locked state at some time before system power down, leaving the memory contents secure during both the power-up and power-down intervals.

The last and most widely overlooked hardware protection method to improve data integrity in an embedded system is to follow the recommendations of the memory manufacturer. Data sheets and application notes often show suggested connections, pull-ups or pull-downs, and other "recommended" features to be incorporated in the design. Consider such recommendations as requirements and implement them exactly as shown whenever possible. Having a problem with a particular part? Look in the latest application notes and data sheets for modifications to the previously published documents and call the company's application engineer immediately for clarification and assistance.

Data robustness

Avoid corruptible locations

Nonvolatile memories usually corrupt with the address pins all high or all low, and so preferentially corrupt at the top and bottom of their address range. Paged devices are more likely to corrupt at the top and bottom of each page, and corruption is more likely on the first and last pages. Avoiding high-corruption locations is usually only feasible in memories used for data storage with extra capacity and is less useful in systems requiring direct execution from memory.
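
As a rough illustration of steering data away from the locations named above, the sketch below checks a candidate storage range against the first and last pages of a paged device and against the bytes at the top and bottom of each page. The geometry (32 pages of 64 bytes), the guard width, and the name `nv_offset_is_preferred` are all assumptions chosen for the example; they don't describe any particular part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical geometry: a 2KB device organized as 32 pages of 64 bytes. */
#define NV_PAGE_SIZE   64u
#define NV_PAGE_COUNT  32u
#define NV_EDGE_GUARD  4u   /* bytes skipped at each end of every page (assumed) */

static_assert(NV_PAGE_SIZE > 2u * NV_EDGE_GUARD, "guard bytes leave no usable space");

/* Returns true if the range [offset, offset + len) stays out of the first
 * and last pages and out of the guard bytes at both ends of its page.
 * Ranges that would span a page boundary are rejected as well. */
static bool nv_offset_is_preferred(uint32_t offset, uint32_t len)
{
    uint32_t page    = offset / NV_PAGE_SIZE;
    uint32_t in_page = offset % NV_PAGE_SIZE;

    if (len == 0u || len > NV_PAGE_SIZE - 2u * NV_EDGE_GUARD) {
        return false;               /* empty, or cannot fit between the guards */
    }
    if (page == 0u || page >= NV_PAGE_COUNT - 1u) {
        return false;               /* first page, last page, or out of range */
    }
    if (in_page < NV_EDGE_GUARD) {
        return false;               /* starts in the bottom guard of the page */
    }
    if (in_page + len > NV_PAGE_SIZE - NV_EDGE_GUARD) {
        return false;               /* runs into the top guard of the page */
    }
    return true;
}
```

A storage allocator could call a check like this when laying out records, which costs capacity: with these assumed numbers, only 56 of every 64 bytes on 30 of the 32 pages remain usable.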
It can also be difficult to identify which locations may corrupt (although investigating the cause of corruption in a system can generate a lot of empirical data).

Add error correction codes

Detailing the effective use of error correcting codes in a corruptible nonvolatile-memory environment is beyond the scope of this article. Literature concerned with error correction in communications and storage media is a useful starting point, but you must be cautious in applying these codes because the assumptions made when analyzing them (such as the independence of corruption occurring on consecutive bits) are often violated in the embedded nonvolatile-memory subsystem. You can find some promising candidates in literature regarding error-correction circuitry in RAM subsystems, but I haven't yet attempted to use them in a practical system.

Create data redundancy

Once again, this method is less useful if the stored data requirements approach the memory size, or if the data stored is a program that executes directly out of memory. However, the data-redundancy approach does offer the advantage of updating one data set at a time, leaving previous valid sets available if the currently written set is corrupted while being rewritten, a form of automatic checkpointing.

Provide data correction

Examples

The servo design contains two red flags that should be apparent to you now that you've read this article thus far. The use of an RC reset circuit and the selection of a two-wire EEPROM interface both leave a design highly corruptible. Removing the RC reset circuitry and replacing it with a proper reset supervisor dropped the corrupting event probability significantly but still didn't eliminate all corruption issues in the system. Further investigation revealed that the two-wire EEPROM didn't include software write-protection features.
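
Software write protection of the kind discussed under data integrity can be pictured with a small simulation. The sketch below models a memory that ignores writes until it has seen a complete unlock sequence and that relocks itself after every write. The three-byte sequence, the 16-byte size, the relock-after-one-write policy, and every `nv_` name are assumptions made for illustration; real parts define their own sequences and locking behavior in their data sheets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NV_SIZE 16u

/* Hypothetical unlock sequence: three command bytes that must arrive in order. */
static const uint8_t NV_UNLOCK_SEQ[3] = { 0xA5u, 0x5Au, 0xC3u };

typedef struct {
    uint8_t cells[NV_SIZE];
    uint8_t seq_pos;    /* how many bytes of the unlock sequence have matched */
    bool    unlocked;
} nv_dev;

static void nv_init(nv_dev *d) { memset(d, 0, sizeof *d); }

/* Feed one command byte. A byte that breaks the sequence resets the match
 * and leaves the part locked. */
static void nv_command(nv_dev *d, uint8_t cmd)
{
    if (cmd == NV_UNLOCK_SEQ[d->seq_pos]) {
        if (++d->seq_pos == 3u) {
            d->unlocked = true;
            d->seq_pos = 0u;
        }
    } else {
        d->seq_pos = 0u;
        d->unlocked = false;
    }
}

/* Data write: honored only while unlocked; the part relocks afterward. */
static bool nv_write(nv_dev *d, uint8_t addr, uint8_t value)
{
    bool allowed = d->unlocked && addr < NV_SIZE;
    d->unlocked = false;            /* relock whether or not the write ran */
    d->seq_pos = 0u;
    if (allowed) {
        d->cells[addr] = value;
    }
    return allowed;
}

/* Driver-side helper: unlock, write one byte, then verify by reading back. */
static bool nv_protected_write(nv_dev *d, uint8_t addr, uint8_t value)
{
    for (size_t i = 0; i < 3u; i++) {
        nv_command(d, NV_UNLOCK_SEQ[i]);
    }
    return nv_write(d, addr, value) && d->cells[addr] == value;
}
```

In this model, a stray `nv_write` issued without the full sequence leaves the cells untouched, which is the property that makes a glitch-induced write during power up or power down far less likely to land.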
Replacing the nonprotected EEPROM with a software-protected, pin-compatible part (and modifying the embedded code to use it properly) reduced the probability of corruption to a level so low that it was no longer an issue on this project.

The second design is actually a subsystem interfaced to a processor over a serial interface. The subsystem consists of an ASIC interfaced to a two-wire serial EEPROM and two EEPOTs. The EEPROM and EEPOTs were used to hold factory calibrations and were not modified dynamically in the field. The ASIC read and used calibration data from the EEPROM on power up, and the EEPOTs controlled the bias voltages to critical subcircuitry. Both the EEPROM and the EEPOTs corrupted randomly, leading to complete system failure that required a return to the factory. Once again, the two-wire EEPROM didn't incorporate software write protection; replacing it with a protected version eliminated the corruption problem for that device, leaving only EEPOT corruption as an issue. Consultation with the application notes and an application engineer led us to change the resistance up and down inputs on some of the hardware. The unit was sufficiently immune to corruption after we implemented these changes.

Absolute power

Christopher Leddy has been programming computers for over 25 years and specializes in embedded systems hardware and software design. He holds an MSEE from the University of Southern California and a BSEE from the State University of New York. Christopher is currently a senior principal systems engineer at Raytheon. His e-mail address is caleddy@raytheon.com.

Copyright 2005 © CMP Media LLC