Phoenix - Those nasty neutrons that have plagued memory chip designers for the past two decades are now giving logic designers a headache, too. But while error correction coding has reduced soft-error rates (SERs) in DRAMs and SRAMs, no such quick fix exists for logic, and all current solutions involve extra cost and a drag on performance.
"Logic SER may become as significant as SRAM error rates," predicted Hans Stork, the chief technology officer at Texas Instruments Inc. (Dallas), in a keynote speech here last week at the International Reliability Physics Symposium.
Soft errors in logic devices are a growing concern for mission-critical systems such as servers, automotive ICs and networking equipment. Logic chip vendors already are working with system customers on ways to guard against the effects of cosmic rays and alpha particles emitted from packaging. However, reliability engineers at last week's symposium said no easy solutions exist.
In the case of ASIC designs, the implications are different and the countermeasures reach deeper into the system design.
"Upset has been an issue for years," said Ronnie Vasishta, vice president of technical marketing at LSI Logic Corp. (Milpitas, Calif.). "In the 0.18-micron days the issue was alpha particles. But we eventually isolated the alpha issue to solder and addressed it. That left us with the problem of neutrons. It is a much smaller issue than the alpha radiation had been, but for large communications systems it is a concern."
The problem is getting worse at smaller geometries, he said. "The good news is that as you make the device smaller, you decrease the capture area into which a neutron would have to pass to cause trouble," Vasishta said. "The bad news is that operating voltage is also going down, and that is outweighing the reduction in area. Devices are getting more sensitive as we move from 130 nanometers to 90 [nm] and below."
Sensitivity is rising with each new process generation, agreed Rob Aiken, senior architect at library developer Artisan Components Inc. (Sunnyvale, Calif.). "It's not going up by an order of magnitude each time, but it does keep going up."
First discovered in 1979 by separate teams working at Intel Corp. and Bell Labs, soft errors were viewed back then as a failure mode in DRAMs. Much later, the industry realized they were affecting SRAMs as well. Sun Microsystems Inc. was caught up in an SER-related furor in 1999 when some of its mission-critical servers faltered because of soft errors in cache SRAMs. And controversy is now swirling around the issue of soft-error rates in FPGAs based on SRAM cells (see April 19, page 1).
The industry dealt with SER in memories through error correction coding (ECC), which can reduce soft-error rates by several orders of magnitude at the cost of only a few extra bits. For logic, unfortunately, no such quick fix exists.
Rob Baumann, the top SER expert at Texas Instruments, said he is working with several customers concerned about SER in 90-nm logic devices bound for the kinds of mission-critical applications that cannot afford even a few moments of downtime. Most others who talk to TI about the issue go away reassured that the problem is, at worst, a minor one for their applications.
"Your chip might have an FIT [failure-in-time rate] that results in one failure per year at the system level. And that is if the system is operating 24 hours a day," Baumann said. "If you are talking about a cell phone, who cares? But those same failure rates would be unacceptable for the DSPs used by the hundreds in basestations operating out in some remote area."
Why is SER becoming a visible concern for logic devices now, 25 years after the problem was first identified in DRAMs? System-on-chip solutions with hundreds of millions of transistors and lower operating voltages at the 90-nm node are two of the reasons.
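Baumann's failure-in-time example comes down to simple arithmetic. The Python sketch below is an illustration only: it relies on the standard definition of one FIT as one failure per billion device-hours, and the function names and the 300-chip basestation are hypothetical choices, not figures from TI.

```python
HOURS_PER_YEAR = 24 * 365   # continuous operation, as in Baumann's example
FIT_HOURS = 1e9             # one FIT = one failure per 10^9 device-hours


def failures_per_year(fit_per_chip: float, chips: int = 1) -> float:
    """Expected failures per year for `chips` identical parts running 24 hours a day."""
    return fit_per_chip * chips * HOURS_PER_YEAR / FIT_HOURS


def fit_for_annual_rate(failures_per_yr: float) -> float:
    """FIT rate that corresponds to a given number of failures per year."""
    return failures_per_yr * FIT_HOURS / HOURS_PER_YEAR


# FIT rate equivalent to one failure per year of round-the-clock operation.
one_per_year_fit = fit_for_annual_rate(1.0)
print(f"{one_per_year_fit:,.0f} FIT")  # about 114,155 FIT

# Hypothetical basestation holding 300 such parts (an assumed count).
print(f"{failures_per_year(one_per_year_fit, chips=300):.0f} failures per year")
```

The same per-chip rate that goes unnoticed in a handset turns into a near-daily event once hundreds of parts share one system, which is the contrast Baumann draws.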
At 90 nm, the critical charge (Qcrit) needed to upset a logic node becomes small enough that a flip-flop can go haywire after being struck by even relatively low-energy particles, said Shigehisa Yamamoto, a reliability engineer working in Mizuhara, Japan, for Renesas Technology Corp.
The lower Qcrit is mitigated by the fact that each logic cell's area shrinks as devices scale. In effect, the target becomes harder to hit, but once struck it is more easily knocked down. Those factors offset each other, to a large degree. But 90-nm chips that follow Moore's Law contain roughly twice as many devices as 130-nm chips. So the bottom line is that chip-level SER roughly doubles with each logic generation.
"SER in logic can no longer be ignored," Yamamoto said. "As technology scales down, the number of circuits grows. At 90-nm design rules, very small node charges can affect a 90-nm flip-flop. As the industry goes to 65 nm, the flip-flop soft-error rate will increase."
Ethan Cannon, a reliability engineer working at IBM Corp.'s Burlington, Vt., facility, said IBM's analysis also shows that the alpha particles emitted from packaging materials are an important factor in logic SER, more important than the neutrons from cosmic rays.
IBM has used silicon-on-insulator (SOI) technology since the 130-nm node, and Cannon said the company calculates that its soft-error rates are five times lower in SOI-based logic than in bulk CMOS, largely because charge collects only in the thin silicon layer above the buried oxide. On the other hand, radiation events can more easily upset SOI-based circuits because they have less junction capacitance than logic devices created in bulk silicon.
Cannon said that logic vendors should make sure their packaging suppliers are taking contaminants out of their materials. "Alpha emissions from packaging account for a significant portion of the problems in most cases," he said.
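The scaling argument attributed to Yamamoto can be put in numbers. The sketch below assumes, as the article's reasoning does, that the per-cell error rate stays roughly flat because lower Qcrit and smaller cell area cancel, while the cell count doubles at each node. The function name, the default factors and the normalized starting value of 1.0 are hypothetical, and the 65-nm figure is an extrapolation of the same rule rather than a measured result.

```python
def project_chip_ser(base_chip_ser, nodes_nm, cell_growth=2.0, per_cell_change=1.0):
    """Project chip-level SER across process nodes.

    cell_growth:      factor by which the cell count grows per generation (assumed 2x)
    per_cell_change:  factor by which per-cell SER changes per generation
                      (assumed 1.0: lower Qcrit and smaller area cancel out)
    """
    results = {}
    ser = base_chip_ser
    for i, node in enumerate(nodes_nm):
        if i > 0:
            ser *= cell_growth * per_cell_change
        results[node] = ser
    return results


# Chip-level SER relative to the 130-nm node, under the assumptions above.
relative = project_chip_ser(1.0, [130, 90, 65])
for node, ser in relative.items():
    print(f"{node} nm: {ser:.1f}x")
```

Setting `per_cell_change` above 1.0 models the less favorable case, described by Vasishta, in which falling voltage outweighs the shrinking capture area.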
Packaging, and the solder bumps used in flip-chips, are going lead-free, instead using more environmentally friendly (but more brittle) tin-based compounds. But even lead-free packaging materials contain traces of uranium and thorium. "The assumption everyone should take," said TI's Baumann, "is that all metals are dirty."
Hardening the bit cells in a design, a seemingly obvious solution, is no panacea. Vasishta said that LSI Logic "will only use hardened elements where it is necessary to meet our FIT-rate commitments to our customers." Hardening means increasing the size, and sometimes decreasing the speed, of the cells.
"There are things we can do at the bit cell level to improve resistance," said Dhrumil Gandhi, senior vice president of technology at Artisan. "You can add N-wells, for example, and improve the upset resistance by a factor of two."
"But at the 130- and 90-nm nodes, we have chosen to take a different approach: error-correcting circuitry around the memories instead of more robust cells," Aiken added, citing reasons of both area and performance. "The area overhead to add ECC to a large memory block is less than 10 percent. And the reduction in single-bit errors is very great, an order of magnitude or more," he said. "The area penalty for enlarging each bit cell would be much more than 10 percent, and the benefit would be much less."
Gandhi said he expects a new class of special libraries of hardened devices to emerge. "It will be very much analogous to the situation we have now, where designers work with two sets of libraries, one optimized for speed and the other for leakage current," he said. "Tools help them to select which library elements to use where." Unfortunately, no such tools exist today for hardened cells, Gandhi observed.
Semiconductor vendors are beginning to look at what they can do at the process level to protect against SER. Y.Z. Xu and colleagues at Cypress Semiconductor Corp. (San Jose, Calif.) tested soft-error rates in SRAM chips made in bulk silicon wafers with varying layers of epitaxial silicon. While the epi layer did relatively little to protect against soft errors, Cypress saw a bigger gain from more radical surgery: The process was adjusted to increase the junction capacitance and an extra capacitor was added, an approach similar to the one STMicroelectronics recently used for SER protection in its SRAMs. Xu concluded that the extra capacitor provided the greatest SER protection, but at additional cost.
And that is the rub: cost. Chips sent up to space have needed protection from SER for decades, but the solutions, such as watchdog timers or voter-type logic, are expensive, said Paul Dodd, a scientist working at Sandia National Laboratories.
Baumann said TI is starting to tell its customers that certain types of flip-flops and latches provide more protection than other, usually less complex, logic elements. In about 90 percent of cases, customers are told they needn't worry, said Baumann, a distinguished member of the technical staff at TI in Dallas. But for the microprocessors used in workstations, servers, cellular basestations and other networking gear, customers designing 90-nm chips already are taking precautions, he said.
One engineer at the reliability symposium noted that controllers used in brake-by-wire systems also are called into play in the stabilization systems seen in some expensive automobiles. If SER momentarily disturbs a stabilizing controller on a car hurtling down the autobahn at 160 kilometers an hour, serious accidents could occur, he noted.
Though they may be a decade away, FinFETs offer one cause for optimism: they are much less susceptible to soft errors than planar logic devices, largely because the vertical structures have a much smaller silicon body, and so present a smaller target, than do planar devices.
For the near term, said Norbert Seifert, a senior reliability engineer at Intel, techniques such as self-checking circuits are being studied.
But all known solutions include die-area penalties and drags on performance.
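The voter-type logic Dodd mentions shows where those penalties come from. The Python sketch below is a generic illustration of triple-modular redundancy, not any vendor's circuit; the function names and the 8-bit register value are hypothetical. Three copies of a register are kept and a bitwise two-of-three vote masks an upset in any one of them.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise two-of-three vote: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a single-event upset by inverting one bit of a stored word."""
    return word ^ (1 << bit)


stored = 0b10110010  # hypothetical 8-bit register contents

# One particle strike corrupts one of the three copies: the vote recovers the value.
single_upset = [stored, flip_bit(stored, 5), stored]
print(majority(*single_upset) == stored)   # True

# The same bit upset in two copies defeats the vote.
double_upset = [stored, flip_bit(stored, 5), flip_bit(stored, 5)]
print(majority(*double_upset) == stored)   # False
```

The protection costs three storage elements plus a voter for every protected bit, and the voter sits in the signal path, which is the area and speed trade-off the engineers describe.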