Movies on car dashboards, animated 3-D displays on car navigation systems. Camera phones that double as MP3 players, and PDAs as camcorders. The media-everywhere scenario brings lots of new bells and whistles to the consumer — and a whole set of mostly unfamiliar tasks to the world of the mobile-systems designer. A flurry of processor and architecture announcements early this month showed the direction in which vendors of CPU cores are moving to meet the challenge.

The processing tasks themselves range in complexity and processing load from the mildly annoying to the crushing. For example, ARC International estimates that an MP3 decode would require about 17 million operations/second on a modern embedded RISC CPU. MPEG-4 AAC encoding would take 39 million operations/second, ARC said. In contrast, video computations can be quite complex, involving discrete cosine transforms and on-the-fly interpolation of pixel values during the searches used in motion detection. And 3-D graphics rendering is rocket science. Whole sequences of computations must be applied to each pixel — location, shading, lighting, transparency, fog, surface texturing — none of them trivial. The only solutions for even moderate resolutions and frame rates require custom, highly parallel rendering engines.

Finding the hardware

There are many ways to provision hardware to handle media-processing tasks. If cost and power are not constraints, almost anything short of 3-D rendering can be dumped onto a sufficiently monstrous DSP core fed by a sufficiently powerful memory hierarchy. The approach has the advantages of architectural simplicity — you buy the hard parts — and the not-inconsiderable benefit that the algorithms — slippery devils that they are — stay in software, where their implementation can be easily modified.

A closely related approach is to reduce some or all of the critical loops to register-transfer-level, rather than DSP, code, and configure an FPGA to do the heavy lifting. This job has been considerably eased by the FPGA vendors, which have thoughtfully embedded numerous high-speed multiply-accumulator (MAC) blocks in their logic fabrics.

A more design-intensive alternative that's pretty much available only to system-on-chip (SoC) designers is to isolate critical loops, as above, convert them to RTL and use the code to create an autonomous coprocessor coupled to the host CPU. Fortunately for users of this approach, the inner loops in most media-processing algorithms are much less subject to change than the rest of the code, so hardwiring the inner loops — or optimizing a firmware-driven processor around them — is less risky than developing hardware for an entire codec.

In some cases, though, all this would be overkill. Mobile systems often have tight space and power constraints, and everyone short of the military has to be concerned about cost and, therefore, about die area. These issues lead many architects to first ask, "Can we do this in software?" Often the answer is no. But "no" is different from "heck, no!" If only a few very specific loops cannot meet their deadlines when executed on a reasonable host CPU core, there may be a solution that is more silicon-efficient and easier to program than a coupled accelerator or an external FPGA.
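To make the notion of a "critical loop" concrete, here is a minimal sketch, in plain C, of the sum-of-absolute-differences kernel at the heart of block-based motion search — the kind of loop designers end up isolating. The function and parameter names are invented for this illustration and do not come from any particular codec.

    /* Hypothetical sketch: the SAD inner loop of a block-matching motion search.
     * Compares a 16x16 block of the current frame against a candidate block in
     * the reference frame. 'stride' is the frame-buffer width in bytes. */
    #include <stdint.h>
    #include <stdlib.h>

    static uint32_t sad_16x16(const uint8_t *cur, const uint8_t *ref, size_t stride)
    {
        uint32_t sad = 0;
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x++)
                sad += (uint32_t)abs((int)cur[x] - (int)ref[x]);
            cur += stride;   /* step down one row in each frame */
            ref += stride;
        }
        return sad;
    }

A motion search evaluates a loop like this for many candidate positions per macroblock, which is why such kernels are the first candidates for special instructions, a coprocessor or an FPGA.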
"Even in some low-end systems, it is possible to reduce both the code size and the processing footprint on the SoC by integrating specialized instructions into the CPU execution units," said Gerard Williams, chief technical officer at the Austin, Texas, design center of ARM Ltd. The idea is to isolate the critical loops and then to create special instructions that reduce the number of cycles required to traverse the loop. Simulate and repeat until the algorithm can meet its deadlines. Not that this is necessarily easy. Despite the emergence of software tools designed to do just this sort of isolation, extraction and translation to RTL, finding the right things to accelerate is not always a walk in the park. One manager, at the end of a partitioning exercise on an MPEG decoder, estimated that formal approaches would have gotten the team only halfway to the level of performance they eventually achieved. "The rest required detailed understanding of the algorithm and the hardware — especially the way the data moved between the CPU and the acceleration hardware," he said. Simple to complex These instructions can range from the relatively simple to the quite complex, as a series of recent CPU announcements indicated. In the case of ARC International's ARCSound core, the instruction-set extensions comprise little more than a powerful MAC instruction. With the native speed of ARC's processor, this is sufficient to meet the deadlines for a wide range of audio codecs. While ARC was announcing a completed implementation, ARM and MIPS Technologies Inc. both unveiled architectural designs — that is, instruction-set extensions — to add signal-processing horsepower to their CPUs. In both cases the processor architects specified single-instruction, multiple-data (SIMD) instructions that partitioned the processor data path, permitting a couple of 32-bit words or a whole bunch of bytes to be processed simultaneously. The SIMD instruction sets included arithmetic, MAC and data-manipulation instructions, giving the pipelines many of the capabilities of a typical DSP core's execution unit. In many ways, the idea of extending the instruction set has advantages. It does not add a lot to the area requirements of the CPU, but it can potentially eliminate a DSP core or a block of hardwired signal-processing logic from the SoC. Even in designs that don't actually need hardware acceleration to meet their deadlines, ARM's Williams pointed out, reducing the number of instruction cycles necessary to meet the deadline made it possible to reduce both the clock frequency and — using a technique such as ARM's Intelligent Energy Manager — the operating voltage of the CPU, resulting in big energy savings. Alternatively, in a CPU design — such as the most recent Tensilica core — that has highly flexible clock-gating control, you can run the CPU at full speed, finish early and put the whole processor to sleep until it's time to start in on the next packet. Yet, no good idea is without its fine print. If power is a serious issue, executing an inner loop on a Harvard-architecture processor is just about the least energy-efficient way to do the job. "Programmable processors will never be competitive with dedicated hardware on energy efficiency," said Jeff Bier, general manager of Berkeley Design Technology Inc. (BDTI). "The difference is multiple orders of magnitude." 
Not only does the data have to be moved in and out of a series of main-memory locations, cache locations and register files, but each operating cycle requires the storing and decoding of an instruction — pure overhead as far as energy analysis is concerned. And that added hardware for holding, fetching and decoding instructions contributes more than its share to leakage current.

Thomas Petersen, MIPS' director of product marketing, pointed out an important exception, however. "In CPUs running in the 500-MHz range, pipeline efficiencies on the order of 50 percent are not unusual. That means half the time, roughly, the processor is sitting there waiting for memory. If you have efficient multithreading support, you can take advantage of that downtime to do signal-processing computations with relatively little increase in power," Petersen said. "You are just using cycles that would have happened anyway, perhaps with an increase in switching activity, depending on how cleverly the CPU can gate clocks while it is stalled."

Also, there is the matter of programming. Adding SIMD instructions to a CPU doesn't eliminate the need for expert — usually handcrafted — code. It may make matters worse. "SIMD in particular is a challenge for compilers," warned BDTI's Bier. "Using SIMD instructions effectively usually requires hand-optimized code, no matter how good the vendor claims the compiler to be. And that is an unfamiliar task for most general-purpose processor users."

The other big issue is memory. As experienced signal-processing architects are painfully aware, no amount of processing power will get a task done faster than the memory system can deliver and remove the data. Dedicated hardware is often designed in a flow-through manner, so that data comes into the data path, traverses it once and goes back to memory: a single memory structure and only two transfers.

But move the DSP operations into a CPU, and everything changes. CPUs don't have data-flow architectures: They have data caches. And caches, as it happens, are nearly the worst possible structures for dealing with streaming data, especially in the presence of other, competing tasks. Systems have failed simply because of cache thrashing, even when the computing speed and the theoretical memory bandwidth were more than sufficient for the task.

"If a design doesn't have the L1 [Level 1 cache] memory bandwidth to support the processing unit, you may see little benefit from new instructions or accelerators," Bier stated. "Memory bandwidth separates the men from the boys in these applications."

One solution is simply bigger, fancier caches. "In the algorithms we have analyzed, classical caching is adequate to handle the data flow," reported ARM's Williams. "But that won't always be true. The architect will have to realize what the data sets look like and how the data will be moving in the particular algorithm, and create memory structures accordingly."

The ARCSound design gives a good example of this. ARC's reference design uses a typical instruction cache, but the data stream is fed not into a data cache but into X and Y memory blocks of the type used in DSP chips. The X and Y blocks are scratchpad RAMs, not caches.

Thus, significant amounts of signal processing can be successfully executed on CPUs with extended instruction sets tailored to the inner loops of the code. The technique can save board or SoC real estate, and possibly power.
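The scratchpad approach usually goes hand in hand with explicit, double-buffered data movement. The sketch below shows the general ping-pong pattern in C; the DMA routines and the processing kernel are hypothetical placeholders for whatever the SoC's driver and firmware actually provide, not ARC's or anyone else's API.

    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK 256  /* samples per transfer; chosen arbitrarily for this sketch */

    /* Two scratchpad buffers used ping-pong fashion. On a real part these would
     * be placed in the X/Y RAMs via a linker section attribute. */
    static int16_t buf[2][BLOCK];

    /* Hypothetical DMA driver and processing kernel supplied elsewhere. */
    extern void dma_start(int16_t *dst, size_t nsamples);
    extern void dma_wait(void);
    extern void process_block(const int16_t *src, size_t nsamples);

    void stream_loop(size_t nblocks)
    {
        int fill = 0;                        /* buffer the DMA is currently filling */
        dma_start(buf[fill], BLOCK);         /* prime the first transfer */
        for (size_t i = 0; i < nblocks; i++) {
            dma_wait();                      /* wait for the fill to complete */
            int work = fill;                 /* that buffer is now ours to process */
            fill ^= 1;
            if (i + 1 < nblocks)
                dma_start(buf[fill], BLOCK); /* overlap the next fill with processing */
            process_block(buf[work], BLOCK);
        }
    }

Because the streaming data never passes through the data cache, it cannot evict the working sets of competing tasks, and the memory traffic stays at the two transfers per sample that flow-through hardware would make.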
But power analysis is not trivial, and the creation of memory architectures that can actually support the signal-processing operations is another big job.