A case for superscalar DSPs

By Nagendra Goel, Staff Applications Engineer, Shannon Wichman, Principal DSP Architect, Neal Stollon, ZSP Marketing Manager, LSI Logic Corp., Plano, Texas, ngoel@lsil.com, EE Times
November 15, 2001 (4:05 p.m. EST)
URL: http://www.eetimes.com/story/OEG20011115S0060

Selecting the right DSP for a particular application involves much more analysis than simply picking the one that touts the highest clock speed or offers the most arithmetic resources. Furthermore, choosing a DSP means more than choosing the architecture. The customer is actually selecting an entire development platform, a fundamental decision that has far-reaching implications.

Those who want to develop better products must consider several competing and often conflicting factors when choosing a DSP, such as its performance and power consumption; software and hardware design and engineering costs; time-to-market; and the future recurring engineering costs.

For example, with system-on-chip (SoC) designs gaining popularity, it is worth checking whether the DSP is offered not just as a stand-alone chip but also as a core, ideally a synthesizable one; whether it offers standard bus interfaces to support efficient SoC integration; and whether it comes with the tool chain needed for system-level co-verification of the design. Weighing all of these factors, scalable superscalar architectures do seem to provide the sweet spot in terms of performance vs. ease of use.

Current approaches to support higher levels of parallelism can be placed into two prominent architectural categories: superscalar and very long instruction word (VLIW).

Superscalar architectures use hardware-based approaches to determine the number of instructions that can be executed simultaneously. For example, LSI Logic's ZSP is a superscalar processor, since it has dedicated logic in the instruction sequencing unit that schedules a group of instructions to the core resources, including dual load/store units, dual arithmetic logic units and dual MAUs (multiply-accumulate/ALU units).

Alternatively, VLIW architectures take the approach of controlling as many DSP resources as possible in a single large instruction word, which may consist of several smaller RISC-like instructions. Which instructions go into each word is determined at compile time. Superscalar processors are easier to program, more amenable to maintaining backward code compatibility and need less memory than their VLIW counterparts.

Studies have shown up to 80 percent of all embedded product development costs are associated with software development. This is largely due to reliance on assembly-level coding. Although it is desirable to use C compilers, as of today, much of the computation-intensive code is still hand-written in assembly in order to meet the performance requirements.

One key advantage of superscalar processors is that the programmer does not have to account for data hazards caused by parallel execution, or for pipeline latencies, in order to write code that functions correctly. Therefore, it is possible to break down the problem of code development into two subtasks. The first is that of writing software in a linear fashion and debugging it. The second is that of reorganizing the software instructions in the computationally intensive inner loops of the algorithms to achieve the maximum amount of instruction-level parallelism.
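The two subtasks can be illustrated with a small Python sketch. It assumes a hypothetical dual-issue, in-order machine whose hardware closes an issue group at the first instruction that depends on an earlier instruction in the same group. The function name `count_issue_cycles`, the `(dest, srcs)` tuple encoding and the register names are illustrative assumptions, not the ZSP instruction set or its actual grouping rules.

```python
def count_issue_cycles(program, width=2):
    """Count issue cycles on an assumed in-order machine.

    Each cycle issues up to `width` consecutive instructions and stops
    at the first one that reads or rewrites a register written earlier
    in the same group. Each instruction is a (dest, srcs) tuple.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    cycles, i = 0, 0
    while i < len(program):
        written, issued = set(), 0
        while i < len(program) and issued < width:
            dest, srcs = program[i]
            if dest in written or any(s in written for s in srcs):
                break  # hardware ends the group; no programmer action needed
            written.add(dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles


# Step 1: code written in a linear fashion (load, use, load, use).
linear = [("a", ("p",)), ("x", ("a",)), ("b", ("q",)), ("y", ("b",))]

# Step 2: the same instructions reordered so independent ones are adjacent.
reordered = [("a", ("p",)), ("b", ("q",)), ("x", ("a",)), ("y", ("b",))]

print(count_issue_cycles(linear), count_issue_cycles(reordered))
```

Both orderings are valid programs under this model; reordering only changes how many instructions the hardware can group per cycle, which is why it can be deferred until the code is already debugged.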

The compiler design process is similarly simplified, as the developers can focus more on compiler performance-related issues rather than pipeline-related issues. Thus the superscalar nature of the processor accelerates the software development process.

Equipment manufacturers that use high-performance DSPs are under constant pressure to upgrade their equipment and to provide additional functionality. For a convenient and cost-effective product evolution plan, it is desirable for software to maintain compatibility with the next generation of hardware. Because instruction scheduling is performed in hardware, a superscalar architecture facilitates code migration from one generation of architecture to the next. Maintaining binary or even assembly-level code compatibility is difficult with processors that are not superscalar.

As silicon vendors migrate to deep-submicron process technologies, the area occupied by the DSP core is a shrinking fraction of the die area. The SoC die size, and therefore its cost, is increasingly dominated by the peripheral logic and, in particular, by on-chip memory. Large memory systems require more sophisticated support logic to achieve high speed than smaller memory systems, with associated increases in both gate count and power consumption. Therefore, to minimize overall system costs, it is critical to control the performance and associated power consumption in any DSP memory system design.

By treating each instruction independently, a superscalar DSP can achieve relatively high code density. Instruction memory size is usually an issue with VLIW processors, where a long instruction word consisting of several operations is fetched and executed every clock cycle. If the programmer cannot squeeze a sufficient number of useful instructions into this instruction word, whether because of data hazards or resource conflicts, the void has to be filled with placeholder no-operation instructions. Benchmarks comparing the code densities of superscalar and VLIW architectures show that superscalar architectures often require less than half the memory required by VLIW architectures.
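The effect of no-operation padding on code size can be made concrete with a rough Python sketch. It assumes an uncompressed, fixed-width VLIW format in which every word stores all of its slots, and it uses the same operation size for both machines so that only the padding differs. The slot count, operation size and sample schedule are made-up illustrative values, not measurements of any real processor.

```python
def vliw_bytes(ops_per_cycle, slots=4, op_bytes=4):
    """Code size when every cycle stores a full word of `slots` operations.

    Slots without a useful operation are filled with NOPs, which still
    occupy instruction memory (assumed uncompressed format).
    """
    if any(n < 0 or n > slots for n in ops_per_cycle):
        raise ValueError("each cycle must hold between 0 and `slots` ops")
    return len(ops_per_cycle) * slots * op_bytes


def sequential_bytes(ops_per_cycle, op_bytes=4):
    """Code size when only the useful operations are stored in sequence.

    Grouping is left to the hardware at run time, so no NOPs are stored.
    """
    return sum(ops_per_cycle) * op_bytes


# Hypothetical schedule: number of useful operations found for each cycle.
schedule = [1, 3, 2, 1, 1]

print(vliw_bytes(schedule), sequential_bytes(schedule))
```

With this particular schedule the padded encoding takes 80 bytes against 32 bytes for the sequential one. The ratio depends entirely on how well the slots can be filled, so the figures say nothing about any specific benchmark.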

It is often assumed that a superscalar architecture will require complex logic for pipeline control, and therefore will have high overhead in terms of power consumption. This assumption stems from the fact that general-purpose processors typically employ such superscalar techniques as operand scoreboards, reservation stations and register renaming. Those techniques enable the processor to issue and execute instructions out of order, taking advantage of the fact that not all sequential operations have operand dependencies and that different types of operand hazards can be resolved at different stages of the pipeline.

Superscalar architectures, however, do not necessarily support out-of-order execution. A superscalar architecture is simply one in which the hardware is responsible for resolving operand and resource hazards and that has the resources to achieve a throughput greater than one instruction per clock.

For example, in the ZSP architecture, logic dedicated to pipeline control is kept to a minimum by enforcing in-order execution and by isolating the control to a single stage at the head of the pipeline. This stage issues sequential groups of instructions that have no data dependencies or other resource conflicts. Once a group of instructions has been issued, they advance through the pipeline in lock step. In addition, the address generation units are kept separate from the arithmetic logic units, with their own set of registers. This further simplifies the resource allocation problem, thus reducing the pipeline control logic.
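A single in-order issue stage of this kind can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual ZSP grouping logic: the `Instr` record, the unit counts in `UNITS`, the maximum group width of four and the register names are all hypothetical. Write-after-read hazards are ignored on the assumption that instructions moving in lock step read their operands before any of them writes.

```python
from collections import namedtuple

# Hypothetical instruction record (not the real ZSP encoding):
# dest is the register written (or None), srcs the registers read,
# unit the kind of functional unit the instruction needs.
Instr = namedtuple("Instr", "dest srcs unit")

# Assumed number of units of each kind available per cycle.
UNITS = {"ls": 2, "alu": 2, "mau": 2}


def issue_groups(program, units=UNITS, max_width=4):
    """Split a sequential program into in-order issue groups."""
    if max_width < 1:
        raise ValueError("max_width must be at least 1")
    groups, i = [], 0
    while i < len(program):
        group, written, free = [], set(), dict(units)
        while i < len(program) and len(group) < max_width:
            ins = program[i]
            if free.get(ins.unit, 0) < 1:
                if not group:
                    # No unit of this kind exists, so it could never issue.
                    raise ValueError(f"no unit available for {ins.unit!r}")
                break  # resource conflict: close the group
            if ins.dest in written or any(s in written for s in ins.srcs):
                break  # depends on an earlier instruction in this group
            group.append(ins)
            if ins.dest is not None:
                written.add(ins.dest)
            free[ins.unit] -= 1
            i += 1
        groups.append(group)
    return groups


program = [
    Instr("r0", ("a0",), "ls"),
    Instr("r1", ("a1",), "ls"),
    Instr("r4", ("a2",), "ls"),        # third load: only two load/store units
    Instr("r2", ("r0", "r1"), "mau"),
    Instr("r3", ("r2",), "alu"),       # needs r2 from the previous instruction
]

print([len(g) for g in issue_groups(program)])
```

Because the groups are always consecutive runs of the program, the check only ever looks at the instructions already in the current group, which is what keeps the control logic confined to one stage.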

Since the pipeline control logic is significantly simplified, it consumes little power as compared with other components of the system. Also, since the pipeline protection logic is responsible for the allocation of processor resources, the same logic is used to implement dynamic clock gating schemes that effectively turn off the unused resources on a cycle-by-cycle basis, thus providing a very efficient method of reducing power consumption.
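The link between resource allocation and clock gating can be sketched in Python. The idea is that once the issue logic knows which units a group uses, the complement of that set is the list of units whose clocks can be held off for that cycle. The unit names in `UNITS`, the helper names and the sample schedule are hypothetical, and the model ignores the fact that a real pipeline would gate each unit in the stage where the group actually reaches it.

```python
# Hypothetical unit instances, loosely following the dual load/store,
# dual ALU and dual MAU resources described in the article.
UNITS = ("ls0", "ls1", "alu0", "alu1", "mau0", "mau1")


def clock_enables(allocated):
    """Return {unit: enabled} for one cycle.

    `allocated` is the set of units the issue logic assigned to the
    current group; every other unit has its clock gated off.
    """
    unknown = set(allocated) - set(UNITS)
    if unknown:
        raise ValueError(f"unknown units: {sorted(unknown)}")
    return {u: (u in allocated) for u in UNITS}


def gated_fraction(schedule):
    """Fraction of unit-cycles in which a unit's clock is off."""
    if not schedule:
        return 0.0
    total = len(schedule) * len(UNITS)
    active = sum(sum(clock_enables(g).values()) for g in schedule)
    return 1 - active / total


# Made-up allocation for three consecutive cycles.
schedule = [{"ls0", "ls1"}, {"mau0"}, {"alu0"}]

print(clock_enables(schedule[0]))
print(round(gated_fraction(schedule), 3))
```

No extra analysis is needed to derive the enables in this sketch; they fall out of information the issue stage already has, which is the point the article makes about reusing the same logic.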