New Needs and Applications Drive DSPs
By John Setel O'Donnell, Co-Founder and Chief Technology Officer, Equator Technologies Inc., Campbell, Calif., EE Times
November 15, 2001 (3:41 p.m. EST)
URL: http://www.eetimes.com/story/OEG20011115S0057
Just over a decade ago, POTS line modems delivered 9,600 bits per second, at most, in the dial-up voice channel. Their modulation and compression functions were implemented by ASICs. Three years later, DSP-based modems were delivering more than three times the data rate in the same voice channel; shortly thereafter, another doubling of the data rate was delivered using the same DSP hardware. When modem developers moved their core algorithms from hardware to DSP software implementations, they were able to conceive and deliver more complex algorithms that extracted more from the channel, and their new algorithms were brought to market faster. As a result, DSP-based modems captured essentially 100 percent of the world market within four years of their market entry. Today, products ranging from the cellular phone to the MP3 player have been made possible by similar "simple DSP" technology.

Today's DSP landscape sees straightforward extensions of the voice and data communications applications of a decade ago. Devices run at higher density, lower cost and lower power per channel. Cellular basestations, modem concentrators, audio and similar applications have been addressed by this evolutionary improvement.

New applications have become practical as DSP performance has increased. Examples include image processing in office equipment (copiers and printers), video security, network multimedia, videoconferencing, streaming video, HDTV and digital video recorders, and physical-layer signal processing in high-speed and wireless communications. Rising processor power has opened these problem areas to software-based solutions, allowing ASIC development cycles to be skipped and allowing more complex, variable, adaptive processing techniques to be incorporated in these devices.

Such applications are driving a need for much more powerful DSP platforms. Standard-definition digital video compression, for example, requires more than 100 times the DSP processing power needed for cell phone voice compression or POTS data modem functions. Along with the need for more processing power, significant improvements in software development methodologies are urgently needed to support the new applications. The DSP approach of a decade ago, in which assembly-language core algorithm development was augmented with some high-level-language control software (which ran at low efficiency), is a poor match for today's shorter product development cycles, expectations of richer product functionality and more complex fundamental algorithms.

As we move beyond the classic approach to DSP to higher-performance designs, there is a wide spectrum of alternatives to consider. The first dimension is single-processor vs. multiprocessor.

Several vendors offer multiple-processor approaches to high-performance DSP. Instead of building a single high-performance processor, an array of medium-performance processors is provided as a single-chip solution. The approach greatly simplifies the DSP chip developer's task, reducing it to a medium-performance design with a "step-and-repeat" layout. At the same time, however, it greatly complicates the application software developer's task. To extract performance from an array of processors, numerous problems in task distribution, communications overhead vs. computation, and load balancing must be solved. Programming of the core loops becomes only a small portion of the overall development. As algorithms become more sophisticated and adaptive, the problem becomes even more severe.
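A rough sketch of that trade-off is given by the toy model below: a fixed amount of DSP work is divided across P processors, and each additional processor is charged a flat communication and synchronization cost. The cycle counts are invented for illustration only and are not drawn from any real device.

    /* Toy model of the array-DSP trade-off: splitting a fixed workload across
     * P processors divides the compute time, but every extra processor adds
     * communication/synchronization overhead.  All numbers are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        const double compute = 1.0e6;  /* cycles for the whole job on one DSP (assumed) */
        const double comm    = 5.0e4;  /* per-processor exchange/sync cost (assumed)    */

        for (int p = 1; p <= 16; p *= 2) {
            double t = compute / p + comm * (p - 1);   /* ideal split plus overhead */
            printf("%2d processors: %8.0f cycles  speedup %.2fx\n", p, t, compute / t);
        }
        return 0;
    }

Beyond some processor count, the overhead term dominates and the speedup flattens or reverses; the application developer, not the chip vendor, owns that balancing act.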
For problems with substantial data independence, or with the ability to carry out computations on multiple data elements with minimal dependencies or communication among elements, array-style DSPs may be worth considering. For other problems, the engineer should first seek a fast single processor to meet the need.

Over time, improvements in semiconductor fabrication have raised the available clock rate for existing architectures. But frequency scaling is constrained by the relationship between memory and logic speeds; when the two improve at different rates, overall improvement is limited. As a result, new architectures have been developed to go beyond the classic one-MAC DSP. They include simple extensions of the "classic DSP," architectures that incorporate new features designed to speed a class of problems, and even architectures designed primarily for a particular task.

Fundamental tools available to the DSP architect include the very long instruction word (VLIW) architecture, which provides a framework for issuing and executing multiple arbitrary instructions per clock cycle at the lowest possible hardware cost. In a "pure" VLIW, all machine management is handled by the compiler or the assembly language developer. VLIW architectures can directly deliver two- to eightfold performance improvements over single-issue DSPs on a wide range of algorithms, at the cost of greatly increased demands on the compiler or significantly increased assembly-language development complexity. Deep pipelining is another such tool; its cost shows up in complex pipeline-latency management at the assembly-code level, further increasing the demand for a highly effective compiler. These ideas can collectively yield a large (10x to 100x) performance improvement over simpler DSPs.

Control issues

Beyond pure processor speed, another key issue to consider in choosing a DSP approach is how control and DSP algorithms will interoperate. The classic "pure DSP" approach has been to require the application developer to split the overall task into "control" portions, which typically run on a microcontroller or RISC microprocessor, and "DSP" portions. Numerous examples of this approach can be found in MP3 players, cell phones and other DSP-powered devices. While the approach may be well suited to simple signal-processing applications, it poses problems for sophisticated image processing and physical-layer processing, and it greatly complicates the task of moving advanced applications from desktop PC platforms to embedded appliances. The overhead of synchronization and the burden of code partitioning can be severe, for example, in sophisticated MPEG processing, which combines data-dependent filtering and search techniques with brute-force DCT-type operations.

An alternative approach makes system partitioning and software porting much more straightforward: unifying the DSP and microprocessor into a single device. That does not mean placing two "cores" on a single die; it means adding DSP instructions to the microprocessor instruction set, or adding RISC operations and processor features (fast context switching, caches, memory protection) to the DSP. That allows the entire application and operating system to execute natively on the fused DSP/microprocessor, shortening time-to-development and greatly broadening the software choices available to the developer.

Every one of the ideas and trends discussed above increases the demands placed on the high-level language compiler for the high-speed DSP.
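A sense of what such a compiler must handle comes from the plain-C sketch of a direct-form FIR filter below, the archetypal DSP inner loop. On a classic single-MAC DSP it maps to roughly one multiply-accumulate per cycle; on a deep-pipelined VLIW machine the compiler must unroll and software-pipeline it to keep several functional units busy every cycle. The code is generic C for illustration and is not tied to any particular device.

    /* Direct-form FIR filter in Q15 fixed point.  The inner loop is one
     * multiply-accumulate per tap; x must hold n_out + taps - 1 samples. */
    #include <stdint.h>

    void fir_q15(const int16_t *x, const int16_t *h, int16_t *y,
                 int n_out, int taps)
    {
        for (int i = 0; i < n_out; i++) {
            int32_t acc = 0;                      /* 32-bit accumulator */
            for (int t = 0; t < taps; t++)
                acc += (int32_t)x[i + t] * h[t];  /* the multiply-accumulate */
            y[i] = (int16_t)(acc >> 15);          /* scale Q30 result back to Q15 */
        }
    }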
In the past, DSP vendors have provided compilers that handled only the "control" portion of applications and have required assembly language coding for the critical DSP loops. As algorithms become more adaptive and complex, and as high-performance DSPs embody more of the techniques discussed above, that mixed C/assembly methodology becomes less and less tenable.

Equator set out to build a broadband signal processor family to transform digital video products in the same manner that DSPs transformed modems a decade ago. Equator's goal was to handle a range of functions, including three-stream SDTV recording and HDTV decoding, purely in C, with native support for the Linux and VxWorks operating systems.

Achieving those goals required high parallelism embedded in a general-purpose processor, with deep pipelines, VLIW architecture, SIMD operations and specialized operation units all coming into play. Today, Equator's MAP-CA and MAP-BSP-15 processors execute more than 100 pixel-level operations per clock cycle and carry out 16 multiply-add (MAC/MADD) operations per clock cycle. Specialized functional units alongside the CPU carry out the high-speed memory transfers, encryption, video filtering and entropy coding steps needed in digital video systems, greatly improving overall performance at modest area cost. The mixed design approach has allowed Equator's processors to match the efficiency and cost characteristics of hardwired solutions while providing the benefits of complete programmability.
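As a rough, generic illustration of the pixel-level parallelism involved (not Equator's instruction set or intrinsics), the C sketch below saturate-adds two 8-bit image buffers. A SIMD or partitioned-ALU unit performs many of these byte operations in a single instruction, which is where per-cycle pixel-operation counts in the dozens come from.

    /* Illustrative only: saturating add of 8-bit pixels, the kind of
     * partitioned operation a SIMD unit applies to many pixels at once. */
    #include <stdint.h>

    static inline uint8_t sat_add_u8(uint8_t a, uint8_t b)
    {
        unsigned s = (unsigned)a + b;
        return (uint8_t)(s > 255 ? 255 : s);   /* clamp instead of wrapping */
    }

    void add_frames(const uint8_t *a, const uint8_t *b, uint8_t *out, int n_pixels)
    {
        for (int i = 0; i < n_pixels; i++)
            out[i] = sat_add_u8(a[i], b[i]);   /* one pixel per iteration in C;
                                                  a SIMD unit handles 8-16 at once */
    }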
Operating at 333 MHz, the MAP-BSP-15 delivers 5.8 billion MACs per second and more than 33 Gops of image-processing power at less than 2 watts of core power. Processing CIF MPEG-4 requires less than 400 milliwatts.
MAP is the only high-performance DSP that natively runs full Linux. Linux on a high-performance DSP means that a huge space of custom products can be developed rapidly, with minimal porting effort, using a large base of readily available software, with massive DSP power for key functions. That enables, for example, interactive television appliances that combine network stacks, user interfaces, digital-music management, full Web access, HDTV picture processing, digital recording and image enhancement in a single platform.