Avoid corruption in nonvolatile memory

By Christopher Leddy, Courtesy of Embedded Systems Programming
Aug 20 2003 (18:00 PM)
URL: http://www.embedded.com/showArticle.jhtml?articleID=13100883

Absolute power corrupts absolutely. But if you're only concerned about possible corruption of code or data stored in a nonvolatile memory, you can decrease the odds of that happening. This article will empower you to do just that.

Embedded systems often include in-system programmable nonvolatile memory to store both program contents and relevant operating data. The strength and utility of in-system programmable memory is also its major weakness: the contents can be changed by the system in which it resides, leaving the system vulnerable to corruption during normal operation. Data stored in the nonvolatile memory is usually critical to proper system operation, and corruption of that data can lead to system failure, hardware damage, and even unsafe operating conditions. My personal experience indicates that corruption of nonvolatile data is inevitable, and it's impractical for a designer to assume that a system will never experience it.

The design of an embedded system with nonvolatile memory is, therefore, an exercise in minimizing the occurrence and impact of data corruption. The level of confidence required varies by system and depends on desired reliability as well as repercussions of possible failures induced by a corrupting event.

In this article, I will familiarize you with likely corruption mechanisms as well as methods that minimize the probability and impact of corrupted nonvolatile data in an embedded system.

Nonvolatile memory
Before we discuss corruption, we need to understand the common types of corruptible nonvolatile memory used in embedded systems. A majority of corruption mechanisms seem to be common to a particular interface, so I'll divide nonvolatile memory into groups based on their interface to the processor.

Parallel interface memories connect directly to the native processor address and data bus and include flash, EEPROM, battery-backed SRAM, and NVRAM chips.

Serial interface memories get divided into the subgroups of "two-wire" and "three-wire" interfaces (due to dramatically different behavior) and are usually found in EEPROM devices.

Finally, custom interface memories are devices with proprietary interfaces. One example of a custom interface memory is a software-controlled potentiometer like an EEPOT using a two-wire up/down interface.

I won't spend a lot of time on the particulars of the various technologies since you can easily research them on your own. In Table 1, however, I've listed the various types of nonvolatile memory commonly found in embedded systems and included characteristics for each type of memory relating to concepts in this article.

Table 1: Characteristics of selected nonvolatile memory types

Technology	Density/Cost	Hardware interface complexity	Software interface complexity	Suitable for program execution	Suitable for program RAM	Internal software protection¹	Internal hardware protection	Corruption rate
Parallel Flash	High/medium	Medium	High	Yes	No	Usually	No	Low
Parallel EEPROM	Medium/medium	Medium	Medium	Yes	No	Rare	No	Medium
Two-wire EEPROM	Low/low	Low	Medium	No	No	Varies	Varies²	Medium to high
Three-wire EEPROM	Low/low	Low	Medium	No	No	Varies	Varies²	Medium to high
Parallel battery-backed SRAM	Medium/high	High	Low	Yes	Yes	No	Varies³	Medium to high
Parallel NVRAM	Low/high	Medium	Low	Yes	Yes	Varies	Yes⁴	Medium
Custom devices and EEPOTs	NA/low	Varies by interface	Varies by interface	No	No	No	No	Varies by interface
1. Protected by a unique sequence of "standard" read/write accesses
2. Protected by exercising dedicated protection inputs to the memory
3. Protection usually provided by power monitoring functions on required support circuitry
4. Protected by internal power monitoring circuitry

Keep in mind that the entries are relative judgments based on my own experience using these memories in real systems designed by engineers of average competence. Also note that the characteristics listed are typical for the most commonly used parts under each memory type; parts do exist with or without features listed, but engineers don't use these parts as often.

Corruption sources, types
Most engineers look at a nonvolatile memory-component data sheet and see a device rated for something like ten years of data storage and one million writes. They wonder how a device can meet those specifications and still exhibit data corruption when used in their system. The answer is quite simple: the specification test conditions are much kinder to the part than the embedded system is.

Real embedded systems have power supplies that rise and fall, logic that goes inactive or invalid, noise transients, and numerous other "out of specification" regions of operation during everyday use. Data sheets only specify IC performance under controlled conditions and avoid most of the imperfect operating possibilities imposed by use in a real system. What does a 400kHz two-wire EEPROM do when the supply voltage is at half its specified value and random 20ns pulses are feeding into its inputs? What happens when a write operation is posted to an EEPROM and power is removed before the specified maximum store time? The behavior of a part under these transient conditions is not and cannot be specified or accurately predicted.

Most conditions that lead to corruption occur during power up and power down of the system. Some causes are internal to the memory; others derive from the devices connected to the memory, which explains why the interface type often influences the probability and type of corruption that occurs. The transitions and levels of power, chip enable, write/output enable, data, address, and hardware protection pins can all contribute to placing the memory in a corruptible state. Analyzing and measuring the state of these pins as you apply and remove power may help you understand what lines are vulnerable, but often you won't identify the actual cause of corruption. Prevention, therefore, is key.

Corruption also occurs when properly executed write requests are interrupted before the internal operation completes within the chip due to the removal of power, leading to partially erased blocks or corrupt bytes. Data structures that consist of multiple bytes may also update partially and then be interrupted, leading to a corrupt structure in memory when the system is powered up on the next cycle.

Software errors in the system code are among the most often overlooked sources of corruption. Coding errors can violate write/erase cycle timing constraints or corrupt data structures, often as a result of unexpected behavior during multitasking or collisions between interrupt handlers.

Corruption protection
The key to using nonvolatile memory effectively in an embedded system is to add layers of protection against corruption to the system, so the probability of a corrupting event occurring becomes negligibly small.

To explain this protection process, I define a nonvolatile corrupting event as any change in memory contents that adversely affects a critical system function. Corruption protection is added to the system to prevent corrupting events. Such protection consists of additional software and hardware and is implemented through a combination of two basic methods: preventing unintentional changes to memory contents (data integrity protection) and preventing unintended data changes from affecting system functionality (data robustness protection).

Every system should incorporate methods of both types because the strengths of one cover the weaknesses of the other. While the two methods increase the immunity of the embedded system, they do so at the expense of simplicity: both add complexity to either the hardware or software designs. The true art of designing corruption protection is to find the correct balance between protection and added complexity.

Table 2 presents an overview of some common corruption protection methods and general characteristics based on my own experience using these methods in actual embedded systems. Again, these are judgment calls; your assessment and success with a particular method may vary from my own.

Table 2: Characteristics of common corruption protection methods

Protection Method	Group	Component	Effectiveness	Memory Types
Memory type selection	Data integrity	Hardware	High	All
System reset supervisors	Data integrity	Hardware	High	All
Power fail interrupts	Data integrity	Hardware	Medium	All
Hardware protect lines	Data integrity	Hardware/software	Low	All
Chip enable gating	Data integrity	Hardware	Low	All but two wire
Boot zone protection	Data integrity	Hardware/software	Medium	Flash, EEPROM (parallel)
Software write protection	Data integrity	Hardware/software	High	All
Manufacturer's recommendations	Data integrity	Hardware	High	All
Avoid corruptible locations	Data robustness	Software	Medium	All except executables¹
Error correction codes	Data robustness	Software	Medium	All except executables¹
Data redundancy	Data robustness	Software	High	All except executables¹
Data correction	Data robustness	Software	Medium	All except executables¹
1. Actually, you can use these methods with executables, too, so long as the executables are treated as data and not running at the time. For example, in a system with a bootloader and application code both stored in flash, the bootloader can't be protected in this way; but the application code can.

Data integrity
The designer makes his or her most important choice in preventing data corruption when selecting the type of nonvolatile memory used in the system. Memory selection is influenced by a number of system constraints, but often more than one memory type can provide the needed density, access times, and cost. See Table 1 for my assessment of the relative inherent data integrity of various memory types.

Whenever possible, I use parts that have both software and hardware data integrity protection, and I always require both when using serial interface parts (a one-bit change in the serial stream to a two-wire interface part turns a read into a write). I also consider protection that disables memory accesses as a function of the power-supply voltage a highly desirable feature for any memory selected, and I always try to select parts that offer the added security of software write protection (based on unique unlock sequences).
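To make the one-bit remark concrete, here is a minimal sketch that assumes the two-wire part follows the common I2C convention, in which the first byte of a transfer carries a 7-bit device address followed by a read/write bit (1 for read, 0 for write). The article does not name a specific protocol, and the function names and the 0x50 device address are illustrative assumptions.

```c
#include <stdint.h>

#define I2C_WRITE 0u
#define I2C_READ  1u

/* Build the first byte of a transfer: 7-bit device address, then the R/W bit. */
uint8_t i2c_control_byte(uint8_t addr7, uint8_t rw)
{
    return (uint8_t)(((addr7 & 0x7Fu) << 1) | (rw & 1u));
}

/* Nonzero if the control byte requests a write. */
int i2c_is_write(uint8_t control)
{
    return (control & 1u) == I2C_WRITE;
}
```

Under these assumptions a read request to device 0x50 is the byte 0xA1, and flipping only its least significant bit gives 0xA0, which the part would treat as the start of a write.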

After memory type selection, the second most effective method of preventing corruption is to incorporate a well-designed reset circuit for the microprocessor. Data integrity in the nonvolatile-memory subsystem is greatly enhanced by properly designed reset circuitry, which prevents undefined processor operation during power up, power down, and transient conditions. Vendor-supplied single-chip solutions are usually the best choice, provided they're used intelligently. Proper threshold, reset time, output levels, and any external signal inputs must all be considered when selecting a reset supervisor.

Reset thresholds for single-supply systems are usually set to the highest absolute minimum operating-supply voltage for all the components in a circuit, and multiple supplies require a reset threshold circuit for each supply. You must set reset time to be significantly longer than the time it takes the power supply, clocks, and other components (such as configurable FPGAs) to settle and begin normal operation. Output levels and drive need to match the requirements of the processor and all other circuitry attached to the reset output from significantly below the minimum operating voltage up to the maximum expected power-supply output voltage. You can add external inputs from other devices, such as FPGA configuration done signals, to the reset circuitry. External inputs are also useful for removing power up sequence ambiguity that can lead to corruption.
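The threshold rule for a single-supply system amounts to taking the largest of the components' minimum operating voltages. The sketch below shows that arithmetic; the part_t type, the function name, and the use of millivolts are assumptions made for illustration, not anything the article specifies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one component on the supply rail. */
typedef struct {
    const char *name;
    uint16_t    vmin_mv;   /* absolute minimum operating voltage, in mV */
} part_t;

/* Return the lowest usable reset threshold: the highest of the
   components' minimum operating voltages. */
uint16_t reset_threshold_mv(const part_t *parts, size_t count)
{
    uint16_t threshold = 0;
    for (size_t i = 0; i < count; i++) {
        if (parts[i].vmin_mv > threshold)
            threshold = parts[i].vmin_mv;
    }
    return threshold;
}
```

With made-up minimums of 3,000mV, 2,700mV, and 3,135mV, the supervisor would need to hold reset until the rail is above 3,135mV.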

A power-fail interrupt, found on some reset supervisors, can also help if the drivers that write to the memory are properly written. The driver can detect the power failure, finish pending writes to memory, prevent partially updated data structures, and issue commands to lock the nonvolatile memory before the system enters the undefined operating regions that cause most of the corrupting events.
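A driver that cooperates with a power-fail interrupt might be organized roughly as follows. This is a simulated sketch, not a real driver: the memory is a RAM array, all names are hypothetical, and it assumes the supply holds up long enough after the warning for the queued writes to complete.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_LEN 4u

typedef struct { uint16_t addr; uint8_t value; } nv_write_t;

uint8_t nv_mem[256];                  /* stands in for the real device */
static nv_write_t queue[QUEUE_LEN];   /* writes accepted but not yet stored */
static size_t queued;
static volatile bool power_failing;   /* set by the power-fail interrupt */
static bool nv_locked;

/* Hypothetical ISR wired to the supervisor's power-fail output. */
void power_fail_isr(void)
{
    power_failing = true;
}

/* Accept a write only while power is good and the device is unlocked. */
bool nv_queue_write(uint16_t addr, uint8_t value)
{
    if (power_failing || nv_locked || queued == QUEUE_LEN ||
        addr >= sizeof nv_mem)
        return false;
    queue[queued++] = (nv_write_t){ addr, value };
    return true;
}

/* Called from the main loop: finish pending writes, then lock the
   device if a power failure has been signaled. */
void nv_service(void)
{
    for (size_t i = 0; i < queued; i++)
        nv_mem[queue[i].addr] = queue[i].value;
    queued = 0;
    if (power_failing)
        nv_locked = true;   /* on real hardware: issue the lock command */
}

bool nv_is_locked(void)
{
    return nv_locked;
}
```

The point of the design is that no new update starts once the warning arrives, so a data structure is either fully written or not touched at all.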

You can also use hardware protection pins, such as write disable and block protection pins, to increase data integrity in embedded systems, but their use is often less than straightforward. The problem with using such pins is that you must characterize the pin-drive circuitry far outside the normal specified operating region of the logic and memory contained in the system.

The behavior of an I/O port pin on the processor may be specified for the 3 to 3.6V operating input voltage range, but can you determine what that pin is doing at a 2V power-supply input to the part?

Additional logic only complicates the picture and usually leads me to reject hardware pins as protection devices unless they're directly controlled by a power supervisor with defined behavior below the minimum functional voltage (which is much lower than the "specified operating voltage") of the memory. I also resist the use of gating circuitry that's not properly "power-supply managed" for the same reasons.

Using software write-locked memories greatly enhances system-data integrity. A software write-locked memory requires a special sequence of writes and reads to occur before the part allows any write requests to proceed. Another sequence should be available in the device to lock the part after writing is complete. The longer and more unusual the unlock sequence, the better the part is protected. Software write protection is much more effective if the memory is automatically forced back into the locked state at some time before system power down, leaving the memory contents secure during both the power up and power down intervals.
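The behavior of a software write-locked part can be modeled in a few lines. In this sketch the three-write unlock sequence is patterned loosely on the sequences some parallel EEPROM and flash parts use, but the exact addresses and data values here are assumptions; the data sheet for the actual part is the only authority. The memory is simulated in RAM and all function names are hypothetical.

```c
#include <stdint.h>

#define MEM_SIZE 0x8000u

static uint8_t mem[MEM_SIZE];   /* simulated memory array */
static int unlock_step;         /* 0..2 = sequence progress, 3 = unlocked */

/* Assumed unlock sequence: three specific writes, in order. */
static const uint16_t seq_addr[3] = { 0x5555, 0x2AAA, 0x5555 };
static const uint8_t  seq_data[3] = { 0xAA, 0x55, 0xA0 };

/* One bus write cycle as seen by the simulated memory chip. */
void bus_write(uint16_t addr, uint8_t data)
{
    if (unlock_step == 3) {
        mem[addr % MEM_SIZE] = data;   /* the single permitted write */
        unlock_step = 0;               /* relock automatically */
    } else if (addr == seq_addr[unlock_step] && data == seq_data[unlock_step]) {
        unlock_step++;                 /* sequence is progressing */
    } else {
        /* Stray write: ignore it and restart the sequence. */
        unlock_step = (addr == seq_addr[0] && data == seq_data[0]) ? 1 : 0;
    }
}

/* Driver-level write: unlock, write one byte, part relocks itself. */
void nv_write_byte(uint16_t addr, uint8_t data)
{
    for (int i = 0; i < 3; i++)
        bus_write(seq_addr[i], seq_data[i]);
    bus_write(addr, data);
}

uint8_t nv_read_byte(uint16_t addr)
{
    return mem[addr % MEM_SIZE];
}
```

A spurious write cycle during power up or power down would have to reproduce all three steps in order before it could change anything, which is why longer and more unusual sequences protect better.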

The last and most widely overlooked hardware protection method to improve data integrity in an embedded system is to follow the recommendations of the memory manufacturer. Data sheets and application notes often show suggested connections, pull-ups or pull-downs, and other "recommended" features to be incorporated in the design. Consider such recommendations as requirements and implement them exactly as shown whenever possible. Having a problem with a particular part? Look in the latest application notes and data sheets for modifications to the previously published documents and call the company's application engineer immediately for clarification and assistance.

Data robustness
No matter how effective hardware data protection methods are, they can only reduce the probability of corruption and not eliminate it. As such, it's critical to include additional corruption protection in the software design. Software-protection methods enhance system data robustness, enabling a system to function normally even when a certain number of bytes within the nonvolatile-memory device are corrupted. All methods rely on one or more of the following: avoidance of corruptible locations, data redundancy, data-error detection and correction, and self repair. Combining all of these can produce a robust system capable of functioning properly even with multiple corrupt bytes in the memory.

Avoid corruptible locations
A simple corruption-protection method avoids locations that are likely to be corrupted. A particular type of memory usually has a set of locations that are most likely to corrupt. You can determine such locations analytically or experimentally.

Nonvolatile memories usually corrupt with the address pins all in the high or low state, so they preferentially corrupt at the top and bottom of their address range. Paged devices are more likely to corrupt at the top and bottom of each page, and corruption is more likely on the first and last pages.

Avoiding high-corruption locations is usually only feasible in memories used for data storage with extra capacity and is less useful in systems requiring direct execution from memory. It can also be difficult to identify which locations may corrupt (although investigating the cause of corruption in a system can generate a lot of empirical data).
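One way to apply this in a data-storage driver is to translate logical offsets into physical addresses that skip the suspect regions. The sketch below assumes a 4KB device organized as 64 pages of 64 bytes, and it avoids the first and last pages plus four guard bytes at each end of every page. All of these numbers, and the names, are hypothetical; the real exclusion map should come from analysis or measurement of the actual part.

```c
#include <stdint.h>

#define NV_PAGE_SIZE    64u   /* bytes per page (assumed) */
#define NV_NUM_PAGES    64u   /* pages in the device (assumed) */
#define NV_GUARD_BYTES   4u   /* bytes skipped at each end of a page */

#define NV_USABLE_PER_PAGE (NV_PAGE_SIZE - 2u * NV_GUARD_BYTES)
#define NV_USABLE_PAGES    (NV_NUM_PAGES - 2u)   /* skip first and last page */
#define NV_USABLE_TOTAL    (NV_USABLE_PER_PAGE * NV_USABLE_PAGES)
#define NV_ADDR_INVALID    0xFFFFFFFFu

/* Map a logical byte offset onto a physical address that avoids the
   first and last pages and the guard bytes at the edges of each page. */
uint32_t nv_logical_to_physical(uint32_t logical)
{
    if (logical >= NV_USABLE_TOTAL)
        return NV_ADDR_INVALID;

    uint32_t page   = 1u + logical / NV_USABLE_PER_PAGE;
    uint32_t offset = NV_GUARD_BYTES + logical % NV_USABLE_PER_PAGE;
    return page * NV_PAGE_SIZE + offset;
}
```

With these assumed numbers, 3,472 of the 4,096 bytes remain usable, which shows the capacity cost the method carries.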

Add error correction codes
Adding error correction codes (ECCs) to stored data can also be highly effective in preventing corrupting events. Append ECCs to the data and use algorithms to detect and correct the corruption of a limited number of bytes in a data set.

Detailing the effective use of error correcting codes in a corruptible nonvolatile-memory environment is beyond the scope of this article. Literature concerned with error correction in communications and storage media is a useful starting point, but you must be cautious in applying these codes because the assumptions made when analyzing them are often violated (such as the independence of corruption occurring on consecutive bits) in the embedded nonvolatile-memory subsystem. You can find some promising candidates in literature regarding error-correction circuitry in RAM subsystems, but I haven't yet attempted to use them in a practical system.
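For readers who have not worked with an ECC, the following sketch shows the smallest classic example, a Hamming(7,4) code, which stores four data bits in a seven-bit codeword and can correct any single flipped bit. This code is chosen here purely for illustration; the article does not recommend a particular code, and the caution above about correlated errors applies to it in full. Function names are hypothetical.

```c
#include <stdint.h>

/* Encode the low 4 bits of 'nibble' into a 7-bit codeword.
   Codeword positions 1..7 (position 1 is bit 0): p1 p2 d1 p3 d2 d3 d4. */
uint8_t hamming74_encode(uint8_t nibble)
{
    uint8_t d1 = nibble & 1u,        d2 = (nibble >> 1) & 1u;
    uint8_t d3 = (nibble >> 2) & 1u, d4 = (nibble >> 3) & 1u;
    uint8_t p1 = d1 ^ d2 ^ d4;   /* parity over positions 1, 3, 5, 7 */
    uint8_t p2 = d1 ^ d3 ^ d4;   /* parity over positions 2, 3, 6, 7 */
    uint8_t p3 = d2 ^ d3 ^ d4;   /* parity over positions 4, 5, 6, 7 */

    return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) |
                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

/* Decode a codeword, correcting at most one flipped bit.
   *corrected is set to 1 if a bit was repaired, otherwise 0. */
uint8_t hamming74_decode(uint8_t cw, int *corrected)
{
    uint8_t b[8];
    for (int i = 1; i <= 7; i++)
        b[i] = (cw >> (i - 1)) & 1u;

    /* Each syndrome bit rechecks one parity group; together they give
       the position (1..7) of the bad bit, or 0 if all checks pass. */
    uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
    uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
    uint8_t s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
    uint8_t syndrome = (uint8_t)(s1 | (s2 << 1) | (s3 << 2));

    if (syndrome != 0)
        cw = (uint8_t)(cw ^ (1u << (syndrome - 1)));
    *corrected = (syndrome != 0);

    return (uint8_t)(((cw >> 2) & 1u) | (((cw >> 4) & 1u) << 1) |
                     (((cw >> 5) & 1u) << 2) | (((cw >> 6) & 1u) << 3));
}
```

Two flipped bits in one codeword are miscorrected silently, so a code this small is only a teaching aid; it also nearly doubles the storage required.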

Create data redundancy
You can also use data redundancy to dramatically increase system reliability. Have multiple copies of data stored and wrapped with checksums, CRCs (cyclic redundancy codes), or error detection codes to validate their contents.^[1]

Once again, this method is less useful if the stored data requirements approach the memory size, or if the data stored is a program that executes directly out of memory. However, the data-redundancy approach does offer the advantage of updating one data set at a time, leaving previous valid sets available if the currently written set is corrupted while being rewritten—a form of automatic checkpointing.
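A minimal sketch of that checkpointing scheme keeps two copies of a record, each wrapped with a sequence number and a CRC, and always overwrites the older or invalid copy. The record layout, the names, the choice of a CRC-16 with polynomial 0x1021, and the RAM array standing in for the device are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: payload, update counter, then CRC over both. */
typedef struct {
    uint8_t  payload[16];
    uint16_t seq;
    uint16_t crc;
} nv_record_t;

nv_record_t nv_slots[2];   /* stands in for two regions of the device */

static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

static int record_valid(const nv_record_t *r)
{
    return crc16_ccitt((const uint8_t *)r, offsetof(nv_record_t, crc)) == r->crc;
}

/* Index of the valid slot with the newest sequence number, or -1. */
static int newest_valid_slot(void)
{
    int best = -1;
    for (int i = 0; i < 2; i++) {
        if (!record_valid(&nv_slots[i]))
            continue;
        uint16_t ahead = (best < 0) ? 1u
            : (uint16_t)(nv_slots[i].seq - nv_slots[best].seq);
        if (ahead != 0 && ahead < 0x8000u)   /* tolerate counter wraparound */
            best = i;
    }
    return best;
}

/* Write a new copy into the slot that does NOT hold the newest valid data. */
void nv_store(const uint8_t payload[16])
{
    int cur = newest_valid_slot();
    nv_record_t r;

    memcpy(r.payload, payload, sizeof r.payload);
    r.seq = (uint16_t)((cur >= 0 ? nv_slots[cur].seq : 0u) + 1u);
    r.crc = crc16_ccitt((const uint8_t *)&r, offsetof(nv_record_t, crc));
    nv_slots[cur == 0 ? 1 : 0] = r;   /* real hardware: device write here */
}

/* Copy out the newest valid payload; returns its slot index or -1. */
int nv_load(uint8_t out[16])
{
    int slot = newest_valid_slot();
    if (slot >= 0)
        memcpy(out, nv_slots[slot].payload, sizeof nv_slots[slot].payload);
    return slot;
}
```

If power fails partway through nv_store, the slot being written fails its CRC on the next power up and nv_load falls back to the previous copy.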

Provide data correction
One last useful method is data correction, which actually corrects the data in memory if corruption does occur. You can derive the correction data from the existing data set (if ECC or redundant copies are present) or simply use a fail-safe data set that lets the system limp along until the contents can be properly restored by reprogramming or recalibration. This method is complex because it requires support from one of the other methods previously listed (ECC or redundant copies) to identify and correct corrupted data.
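As one possible shape for such a routine, the sketch below keeps three copies of a small calibration block, takes a byte-by-byte majority vote, rewrites any copy that disagrees with the result, and falls back to fail-safe defaults when no majority exists. Majority voting is a technique chosen here for illustration rather than one the article prescribes, and the names, block size, default values, and simulated storage are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CAL_LEN 4u

typedef enum { CAL_OK, CAL_REPAIRED, CAL_FAILSAFE } cal_status_t;

uint8_t nv_copy[3][CAL_LEN];   /* stands in for three copies in the device */

/* Hypothetical conservative defaults that let the system limp along. */
static const uint8_t cal_failsafe[CAL_LEN] = { 0x80, 0x80, 0x00, 0x00 };

/* Vote across the three copies, repair any copy that lost the vote,
   and report what happened. On CAL_FAILSAFE the stored copies are left
   untouched so they can be examined or reprogrammed later. */
cal_status_t cal_load_and_repair(uint8_t out[CAL_LEN])
{
    cal_status_t status = CAL_OK;

    for (size_t i = 0; i < CAL_LEN; i++) {
        uint8_t a = nv_copy[0][i], b = nv_copy[1][i], c = nv_copy[2][i];
        if (a == b || a == c) {
            out[i] = a;
        } else if (b == c) {
            out[i] = b;
        } else {
            /* All three disagree: no way to tell which is right. */
            memcpy(out, cal_failsafe, CAL_LEN);
            return CAL_FAILSAFE;
        }
    }

    for (int k = 0; k < 3; k++) {
        if (memcmp(nv_copy[k], out, CAL_LEN) != 0) {
            memcpy(nv_copy[k], out, CAL_LEN);   /* real hardware: rewrite */
            status = CAL_REPAIRED;
        }
    }
    return status;
}
```

A vote cannot catch two copies that were corrupted identically, so pairing it with a checksum or CRC on each copy would be a reasonable strengthening.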

Examples
Here are two real-world example systems to help illustrate how you can use these concepts to improve system reliability. The first design, a simple servo control board, represents a standard low-cost processor/nonvolatile-memory configuration. It included an 8051 processor with an RC reset circuit connected to a two-wire serial EEPROM. Required system nonvolatile data was about 3KB out of the 4KB available within the device selected. As configured above, the system experienced frequent, catastrophic failures caused by corruption of the nonvolatile data required for proper operation.

The servo design contains two red flags that should be apparent now that you've read this far: the RC reset circuit and the two-wire EEPROM interface both leave the system highly vulnerable to corruption. Replacing the RC reset circuitry with a proper reset supervisor dropped the corrupting event probability significantly but still didn't eliminate all corruption issues in the system. Further investigation revealed that the two-wire EEPROM didn't include software write-protection features. Replacing the nonprotected EEPROM with a software protected, pin-compatible part (and modifying the embedded code to use it properly) reduced the probability of corruption to a level so low that it was no longer an issue on this project.

The second design is actually a subsystem interfaced to a processor over a serial interface. The subsystem consists of an ASIC interfaced to a two-wire serial EEPROM and two EEPOTs. The EEPROM and EEPOTs were used to hold factory calibrations and were not modified dynamically in the field. The ASIC read and used calibration data from the EEPROM on power up, and the EEPOTs controlled the bias voltages to critical subcircuitry. Both the EEPROM and the EEPOTs corrupted randomly, leading to complete system failure that required a return to the factory. Once again, the two-wire EEPROM didn't incorporate software write protection; replacing it with a protected version eliminated the corruption problem for that device, leaving only EEPOT corruption as an issue. Consultation with the application notes and an application engineer led us to change the resistance up and down inputs on some of the hardware. The unit was sufficiently immune to corruption after we implemented these changes.

Absolute power
The information presented in this article can be valuable when you're correcting system corruption issues, but it is even more useful when you incorporate the concepts into an embedded system during the initial design phase. Careful attention to both the hardware and software aspects of nonvolatile data integrity and robustness issues at the start of a design can prevent major redesign efforts and eliminate intermittent system failures and customer returns due to corruption incidents.

Christopher Leddy has been programming computers for over 25 years and specializes in embedded systems hardware and software design. He holds an MSEE from the University of Southern California and a BSEE from the State University of New York. Christopher is currently a senior principal systems engineer at Raytheon. His e-mail address is caleddy@raytheon.com.

Endnotes
1. A helpful three-part article on CRCs:
Part 1:
Barr, Michael, "Connecting: Leveraging the 'Net," Embedded Systems Programming, November 1999, pp. 45 to 52 (beginning at "Checking up").

Part 2:
Barr, Michael, "Connecting: For the Love of the Game," Embedded Systems Programming, December 1999, pp. 48 to 54 (beginning at "Strength in numbers").

Part 3:
Barr, Michael, "Connecting: Slow and Steady Never Lost the Race," Embedded Systems Programming, January 2000, pp. 38 to 56 (beginning at "Easier said than done").