New DSP architectures work harder
By Jennifer Eyre, DSP Analyst, Jeff Bier, General Manager/Co-founder, Berkeley Design Technology Inc., Berkeley, Calif., EE Times
April 11, 2000 (3:49 p.m. EST)
URL: http://www.eetimes.com/story/OEG20000411S0036

Until recently, incremental enhancements accounted for most DSP processor design improvements; new DSPs tended to maintain a close resemblance to their predecessors. In the past few years, however, DSP architectures have become much more interesting, with a number of vendors announcing new architectures that are completely different from preceding generations.

Processor designers who want higher DSP performance than can be squeezed out of traditional architectures have come up with a variety of performance-boosting strategies. To improve performance beyond the increase afforded by faster clock speeds, the processor must perform more useful work in each clock cycle. This is accomplished by increasing the number of operations that are performed in parallel, which can be implemented in two ways: by increasing the number of operations performed by each instruction, or by increasing the number of instructions that are executed in every instruction cycle.

DSP processors traditionally have used complex, compound instructions that allow the programmer to encode multiple operations in a single instruction. For example, some DSPs allow the programmer to specify all of the operations and data moves needed to compute one tap of an FIR filter in a single instruction. In addition, DSP processors traditionally issue and execute only one instruction per instruction cycle. This single-issue, complex-instruction approach allows DSP processors to achieve very strong DSP performance without requiring a large amount of program memory.
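To make the "one tap per instruction" idea concrete, the following Python sketch spells out the separate operations that a conventional DSP typically folds into a single compound instruction when it computes an FIR filter. The function name `fir_output` and the sample data are illustrative assumptions, not taken from any vendor's tools or instruction set.

```python
def fir_output(coeffs, delay_line):
    """Compute one FIR output sample from coefficients and past inputs.

    Each loop iteration is one filter "tap". On a conventional DSP the
    four steps in the loop body are typically encoded in one instruction.
    """
    if len(coeffs) != len(delay_line):
        raise ValueError("coeffs and delay_line must have the same length")
    acc = 0
    for k in range(len(coeffs)):
        c = coeffs[k]        # data move 1: fetch the coefficient
        x = delay_line[k]    # data move 2: fetch the delayed input sample
        product = c * x      # multiply
        acc += product       # accumulate
    return acc


# Hypothetical 4-tap filter and delay line, chosen only for illustration.
coeffs = [1, 2, 3, 4]
delay_line = [10, 20, 30, 40]
result = fir_output(coeffs, delay_line)
```

Written this way, a general-purpose instruction set would need several instructions per tap, which is the program-memory cost that compound DSP instructions avoid.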

One method of increasing the amount of work performed by each instruction while maintaining the basics of the traditional DSP architecture and instruction set is to augment the data path with extra execution units. For example, instead of having one multiplier, some high-end DSPs have two. We refer to processors that follow this approach as "enhanced conventional DSPs." The basic architecture is similar to that of previous generations of DSPs, but has been substantially enhanced with the addition of execution units. Of course, the instruction set must also be enhanced to allow the programmer to specify more parallel operations in each instruction in order to take advantage of the extra hardware.

Parallel support

In some cases, the processor uses wider instructions to support the encoding of more parallel operations. A good example of this approach is the Lucent Technologies DSP16xxx family, introduced in September 1997. This processor's architecture is based on that of the earlier DSP16xx, but Lucent added a second multiplier, an adder (separate from the ALU), and a bit manipulation unit. To support more parallel operations and keep the processor supplied with data, Lucent also increased the data bus widths from 16 bits (in the DSP16xx) to 32 bits, allowing the processor to transfer pairs of 16-bit data words rather than single 16-bit words.

The DSP16xxx instruction set uses both 32-bit and 16-bit instructions, in contrast to the DSP16xx, which used only 16-bit instructions. A key result of these enhancements is that the DSP16xxx processor is able to sustain a throughput of two multiply-accumulates (MACs) per instruction cycle, double the MAC rate of the DSP16xx.
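The effect of a second multiplier and paired data fetches can be sketched as a loop that retires two MACs per iteration. This is only a rough model under stated assumptions: the function name `fir_dual_mac` is hypothetical, the iteration count stands in loosely for instruction cycles, and nothing here reflects the actual DSP16xxx instruction set.

```python
def fir_dual_mac(coeffs, samples):
    """Model a two-multiplier data path: two MACs per loop iteration.

    Assumes an even number of taps so that operands can be fetched in
    pairs. Returns (result, iterations); iterations is a stand-in for
    instruction cycles in this simplified model.
    """
    if len(coeffs) != len(samples) or len(coeffs) % 2:
        raise ValueError("need equal-length inputs with an even tap count")
    acc = 0
    iterations = 0
    for k in range(0, len(coeffs), 2):
        c0, c1 = coeffs[k], coeffs[k + 1]    # one wide fetch: two 16-bit words
        x0, x1 = samples[k], samples[k + 1]  # likewise for the sample pair
        acc += c0 * x0 + c1 * x1             # two multiplies, both accumulated
        iterations += 1
    return acc, iterations


# Hypothetical 4-tap example: the loop body runs twice instead of four times.
total, cycles = fir_dual_mac([1, 2, 3, 4], [10, 20, 30, 40])
```

The even-tap restriction in the sketch hints at the cost: the programmer has to arrange data in pairs to keep both multipliers busy.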

The "enhanced conventional DSP" approach produces a strong (though perhaps not dramatic) performance boost while maintaining cost and power-consumption levels similar to those of earlier generations of DSPs.

Single instruction multiple data (SIMD) is an architectural approach that allows one instruction to operate on multiple sets of data and produce multiple results. Thus, for example, a single MAC instruction causes the processor to perform two MAC operations and produce two independent results. There are two different approaches processor architects typically use to implement SIMD: splitting the processor's execution units (so that one unit produces multiple results), or duplicating execution units.

To implement SIMD via split execution units, the processor typically treats long input operands as multiple shorter operands; e.g., a 32-bit register can be treated as two 16-bit registers, or four 8-bit registers. These shorter input operands are then operated upon in parallel by a single instruction. This style of SIMD is commonly used in high-end general-purpose processors (GPPs), such as the Pentium and PowerPC, to increase their execution speed on computations associated with multimedia and signal-processing tasks. The split-execution-unit technique works well on these processors because they typically have wide resources (registers, buses) that lend themselves to being logically split into shorter resources.
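The register-splitting idea can be illustrated in software by treating one 32-bit word as two independent 16-bit lanes. The sketch below is an assumption-laden model, not any processor's actual instruction: the helper names are hypothetical, and it uses unsigned wraparound arithmetic within each lane.

```python
MASK16 = 0xFFFF


def pack16(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack16(word):
    """Split a 32-bit word back into its (hi, lo) 16-bit lanes."""
    return (word >> 16) & MASK16, word & MASK16


def add_packed16(a, b):
    """Add two 32-bit words as two independent 16-bit lanes.

    The carry out of the low lane is discarded instead of rippling into
    the high lane, which is what distinguishes a split add from an
    ordinary 32-bit add.
    """
    lo = (a + b) & MASK16
    hi = ((a >> 16) + (b >> 16)) & MASK16
    return (hi << 16) | lo


# The low lanes overflow (0xFFFF + 1) without disturbing the high lanes.
a = pack16(1, 0xFFFF)
b = pack16(2, 1)
lanes = unpack16(add_packed16(a, b))
```

An ordinary 32-bit add of the same two words would carry into the upper half and corrupt the second result, so the split has to be supported by the execution unit itself.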

These processors typically have additional instructions to support SIMD; two examples of CPUs with SIMD instruction set extensions are the Pentium with MMX and SSE, and the PowerPC with AltiVec. SIMD extensions for high-performance GPPs vary in terms of their flexibility, data types supported (fixed-point, floating-point, or both) and other features.

In contrast to splitting execution units, some processors provide duplicate execution units or duplicate data paths to implement SIMD. In this case, a single instruction controls both execution units, and they operate in lock-step. Instead of splitting registers, there may be duplicate sets of registers, one set for each data path. This approach is used on Analog Devices' ADSP-2116x and TigerSHARC.

SIMD is mainly useful in applications with a high level of data parallelism; that is, in applications that can process blocks of data in parallel rather than processing data serially. It often increases program memory use, because the programmer must include additional instructions to arrange data in memory so that the processor can retrieve multiple operands at a time.

In addition, it is often necessary to unroll loops to take full advantage of SIMD capabilities. Using SIMD often reduces the generality of algorithms, since making effective use of SIMD requires groups of operations to be executed together. Thus, for example, if a processor performs four MACs in parallel as part of a SIMD instruction, a loop that includes the SIMD MAC instruction can only produce results in multiples of four.
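The "multiples of four" constraint is easiest to see in a dot product. In the sketch below, the inner four-lane loop stands in for one SIMD MAC instruction, and a separate scalar loop handles whatever elements are left over. The width of four and the function name `dot_simd4` are assumptions made for illustration.

```python
LANES = 4  # assumed SIMD width: four MACs per instruction


def dot_simd4(a, b):
    """Dot product modeled as 4-wide SIMD MACs plus a scalar cleanup loop."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    acc = [0] * LANES                    # one partial sum per lane
    full = len(a) - len(a) % LANES       # largest multiple of LANES
    for i in range(0, full, LANES):
        # These four MACs together represent a single SIMD instruction.
        for lane in range(LANES):
            acc[lane] += a[i + lane] * b[i + lane]
    total = sum(acc)                     # combine the per-lane partial sums
    for i in range(full, len(a)):        # leftover elements, one at a time
        total += a[i] * b[i]
    return total


# Six elements: one 4-wide step, then two scalar cleanup steps.
example = dot_simd4([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1])
```

The cleanup loop and the final lane reduction are exactly the kind of extra code that makes SIMD programs larger and less general than their scalar equivalents.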

RISC-like instructions

DSP processors have traditionally used complex, compound instructions that are issued and executed at the rate of one instruction per instruction cycle. Recently, however, a number of DSP vendors have abandoned that tradition in an attempt to boost processor performance far above the performance achieved by conventional (or even enhanced conventional) DSPs. Instead of the usual single-issue, complex-instruction architecture, designers have opted for a more RISC-like instruction set coupled with an architecture that supports execution of multiple instructions per instruction cycle. For example, the Texas Instruments TMS320C62xx family, the StarCore SC140 core, and the Infineon Carmel core all use very-long-instruction-word (VLIW) architectures. These processors contain many more execution units than conventional DSP architectures, allowing them to perform much more work in every clock cycle.

In addition, the use of simple, RISC-like instructions allows the processors to operate at higher clock speeds than conventional DSPs because instruction decoding is simplified. As an example of the VLIW approach, consider the TMS320C62xx. This processor fetches a 256-bit instruction "packet," parses the packet into eight 32-bit instructions and routes them to its eight independent execution units. In the best case, the C6xxx executes eight instructions simultaneously, at the high clock rate of 250 MHz. The combination of high clock rate and multiple parallel instructions results in an extremely high Mips rating (e.g., 2,000 native Mips for a 250 MHz TMS320C62xx).

By way of comparison, the fastest enhanced conventional DSP currently available, the DSP16xxx, operates at 120 native Mips. The two processors' Mips ratio might lead you to expect the C6xxx to execute programs approximately 16 times faster than the DSP16xxx. But there's a catch: Each C6xxx instruction is far simpler than a typical DSP16xxx instruction. Thus, even if the C6xxx could sustain an execution rate of eight instructions per cycle (which it often cannot, because of data dependencies, memory bandwidth limitations and other limiting factors), it would not be producing eight times as much useful DSP work as a DSP16xxx operating at the same clock speed. It often takes two or more C6xxx instructions to accomplish the same amount of work as is accomplished by a single DSP16xxx instruction, so the performance difference between these two processors cannot be accurately determined by comparing their Mips ratings alone.
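The arithmetic behind this caution can be written out in a few lines. The clock rate, issue width and Mips figures come from the text; the sustained issue fraction of 0.5 is a purely hypothetical value chosen to show the shape of the calculation, not a measurement of either processor, and the function names are illustrative.

```python
def native_mips(clock_mhz, instructions_per_cycle):
    """Peak native Mips: clock rate times maximum instructions per cycle."""
    return clock_mhz * instructions_per_cycle


def effective_speedup(mips_ratio, sustained_fraction, instrs_per_equivalent):
    """Scale a raw Mips ratio by how much useful work it represents.

    sustained_fraction: share of peak issue rate actually achieved (0 to 1).
    instrs_per_equivalent: simple instructions needed to match the work
    of one compound instruction on the other processor.
    """
    return mips_ratio * sustained_fraction / instrs_per_equivalent


c6x_mips = native_mips(250, 8)        # 2,000 native Mips, as in the text
dsp16k_mips = 120                     # figure quoted in the text
mips_ratio = c6x_mips / dsp16k_mips   # roughly 16.7

# HYPOTHETICAL inputs: half of peak issue rate sustained, and two simple
# instructions per compound instruction (the text says "two or more").
estimate = effective_speedup(mips_ratio, 0.5, 2)
```

Under those assumed inputs the apparent 16-to-1 advantage shrinks to roughly 4-to-1, which illustrates why application benchmarks are a better yardstick than Mips ratings.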

While the C6xxx is a fast DSP to be sure, it requires noticeably more energy and program memory than previous generations of DSPs, making it unsuitable for many portable applications. Instead, the TMS320C6xxx family targets line-powered applications, where it has the processing muscle to replace multiple less-powerful DSPs.

Like VLIW processors, superscalar processors issue and execute multiple instructions in parallel. Unlike VLIW processors, in which the programmer (or code-generation tool) explicitly specifies which instructions will be executed in parallel, superscalar processors have specialized hardware that performs dynamic instruction scheduling. This hardware determines "on the fly" which instructions will be executed concurrently based on the processor's available resources, on data dependencies and on a variety of other factors.
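The grouping decision itself can be sketched as a small greedy algorithm. A superscalar processor makes this kind of decision in hardware at run time, while on a VLIW processor the programmer or code-generation tool makes an equivalent decision ahead of time. The model below is a deliberate simplification under stated assumptions: instructions are hypothetical `(dest, sources)` tuples, issue is strictly in order, and only register dependencies on earlier instructions in the same group are checked, with execution-unit availability and other factors ignored.

```python
def issue_groups(program, width):
    """Greedily group instructions that could issue in the same cycle.

    program: list of (dest_register, source_registers) tuples, in order.
    width:   maximum number of instructions issued per cycle.
    An instruction starts a new group if the current group is full or if
    it reads or writes a register written earlier in the current group.
    """
    groups = []
    current, written = [], set()
    for dest, srcs in program:
        hazard = dest in written or any(s in written for s in srcs)
        if current and (hazard or len(current) == width):
            groups.append(current)
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


# Hypothetical four-instruction program.
program = [
    ("r1", ("r0",)),        # independent
    ("r2", ("r0",)),        # independent of the first
    ("r3", ("r1", "r2")),   # needs both earlier results
    ("r4", ("r0",)),        # independent of r3
]
four_wide = issue_groups(program, width=4)
single_issue = issue_groups(program, width=1)
```

Even with four issue slots, the dependency on `r1` and `r2` forces this program into two groups, so the achieved rate falls well short of the peak.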

Superscalar architectures have long been used in high-performance general-purpose processors such as the Pentium and PowerPC. In 1998, the first commercial superscalar DSP, the ZSP164xx, was demonstrated in silicon by ZSP Corp. (ZSP, and the ZSP164xx, have since been acquired by LSI Logic, and the processor has been renamed the LSI401Z.) The LSI401Z can issue and execute up to four RISC-like instructions per instruction cycle. This processor executes at 200 MHz and provides very strong DSP performance.

One drawback to using superscalar architectures for DSP applications is their lack of execution-time predictability. Because superscalar processors dynamically determine which instructions will execute in parallel, it may be difficult to predict exactly how long a given segment of software will require to execute. This can pose a problem for software developers who need to implement a system with hard real-time constraints and may also complicate software optimization.

Many applications require a mixture of control-oriented software and DSP software. The digital cellular phone, for example, must implement both supervisory tasks and voice-processing tasks. In general, microcontrollers provide good performance in controller tasks and poor performance in DSP tasks; dedicated DSP processors have the opposite characteristics.

Hence, until recently, combination controller/signal processing applications were typically implemented using one of each. In the past couple of years, however, a number of microcontroller vendors have begun to offer DSP-enhanced versions of their controllers as an alternative to the dual-processor solution.

Using one processor to implement both types of software is attractive, because it can potentially simplify the design task, save board space, and reduce power consumption and system cost.

Microcontroller vendors have added DSP functionality to existing designs in a number of ways. Hitachi, for example, offers a DSP-enhanced version of the SH-2 microcontroller, the SH-DSP, that adds a separate 16-bit fixed-point DSP data path to the original data path of the SH-2. Although the two data paths are separate, they cannot execute multiple instructions in parallel because they are part of a single processor that processes a single instruction stream.

ARM took a different approach, adding DSP-oriented instruction-set extensions to the ARM9 microcontroller to create the ARM9E. These new instructions are supported by modest hardware enhancements to the ARM9's data path, rather than by adding a separate DSP data path. In contrast, Infineon designed an all-new hybrid architecture, TriCore, a single processor core that processes a single instruction stream. It is currently the only hybrid DSP/microcontroller that uses a VLIW architecture.

THIS ARTICLE IS EXCERPTED FROM A PAPER PREPARED FOR A CLASS PRESENTED AT DSP WORLD SPRING CONFERENCE 2000, TITLED "UNDERSTANDING THE NEW DSP PROCESSOR ARCHITECTURES."