New Needs and Applications Drive DSPs
By John Setel O'Donnell, Co-Founder and Chief Technology Officer, Equator Technologies Inc., Campbell, Calif., EE Times
November 15, 2001 (3:41 p.m. EST)
URL: http://www.eetimes.com/story/OEG20011115S0057
Just over a decade ago, POTS line modems delivered at most 9,600 bits per second in the dial-up voice channel, with their modulation and compression functions implemented by ASICs. Three years later, DSP-based modems were delivering more than three times that data rate in the same voice channel; shortly thereafter, another doubling of the data rate was delivered on the same DSP hardware. When modem developers moved their core algorithms from hardware to DSP software, they could conceive and deliver more complex algorithms that extracted more from the channel, and they brought those new algorithms to market faster. As a result, DSP-based modems captured essentially 100 percent of the world market within four years of their introduction. Today, products ranging from the cellular phone to the MP3 player have been made possible by similar "simple DSP" technology.

Today's DSP landscape includes straightforward extensions of the voice and data communications applications of a decade ago: devices running at higher density, lower cost and lower power per channel. Cellular basestations, modem concentrators, audio and similar applications have been addressed by this evolutionary improvement.

New applications have also become practical as DSP performance has increased. Examples include image processing in office equipment (copiers and printers), video security, network multimedia, videoconferencing, streaming video, HDTV and digital video recorders, and physical-layer signal processing in high-speed and wireless communications. Rising processor power has opened these problem areas to software-based solutions, allowing ASIC development cycles to be skipped and allowing more complex, variable, adaptive processing techniques to be incorporated in these devices. Such applications are driving the need for much more powerful DSP platforms.
Standard-definition digital video compression, for example, requires more than 100 times the DSP processing power needed for cell-phone voice compression or POTS data modem functions. Along with the need for more processing power, significant improvements in software development methodology are urgently needed to support the new applications. The DSP approach of a decade ago, in which assembly-language core algorithm development was augmented with some high-level-language control software (running at low efficiency), is a poor match for today's shorter product development cycles, expectations of richer product functionality and more complex fundamental algorithms.

As we move beyond the classic DSP approach to higher-performance designs, there is a wide spectrum of alternatives to consider. The first dimension is single-processor vs. multiprocessor. Several vendors offer multiple-processor approaches to high-performance DSP: instead of building a single high-performance processor, they provide an array of medium-performance processors as a single-chip solution. The approach greatly simplifies the DSP chip developer's task, reducing it to a medium-performance design with a "step-and-repeat" layout. At the same time, however, it greatly complicates the application software developer's task. To extract performance from an array of processors, numerous problems in task distribution, communication overhead vs. computation, and load balancing must be solved; programming the core loops becomes only a small portion of the overall development. As algorithms become more sophisticated and adaptive, the problem becomes even more severe.

For problems with substantial data independence, or with the ability to carry out computations on multiple data elements with minimal dependencies or communication among elements, array-style DSPs may be worth considering. For other problems, the engineer should first seek a fast single processor to meet the need.
Over time, improvements in semiconductor fabrication have raised the available clock rate for existing architectures. But frequency scaling is limited by fixed relationships between memory and logic speeds, and such unbalanced improvement limits the overall gain. As a result, new architectures have been developed to go beyond the classic one-MAC DSP. They include simple extensions of the "classic DSP," architectures that incorporate new features designed to speed a class of problems, and even architectures designed primarily for a particular task.

Fundamental tools available to the DSP architect include the very long instruction word (VLIW) architecture, which provides a framework for issuing and executing multiple arbitrary instructions per clock cycle at the lowest possible hardware cost. In a "pure" VLIW, all machine management is handled by the compiler or the assembly-language developer. VLIW architectures can directly deliver two- to eightfold performance improvements over single-issue DSPs on a wide range of algorithms, at the cost of greatly increased demands on the compiler or significantly increased assembly-language development complexity. The cost of deep pipelines shows up in complex pipeline-latency management at the assembly-code level, further increasing the demand for a highly effective compiler.

Control issues

Beyond pure processor speed, another key issue to consider in choosing a DSP approach is how control and DSP algorithms will interoperate. The classic "pure DSP" approach has been to require the application developer to split the overall task into "control" portions, which typically run on a microcontroller or RISC microprocessor, and "DSP" portions. Numerous examples of this approach can be found in MP3 players, cell phones and other DSP-powered devices.
While the approach may be well-suited to simple signal-processing applications, it poses problems for sophisticated image processing and physical-layer processing, and it greatly complicates the task of moving advanced applications from desktop PC platforms to embedded appliances. The overhead of synchronization and the burden of code partitioning can be severe, for example, in sophisticated MPEG processing, which combines data-dependent filtering and search techniques with brute-force DCT-type operations.

An alternative approach makes system partitioning and software porting much more straightforward: unifying the DSP and the microprocessor in a single device. That does not mean placing two "cores" on a single die; it means adding DSP instructions to the microprocessor's instruction set, or adding RISC operations and processor features (fast context switching, caches, memory protection) to the DSP. The entire application and operating system can then execute natively on the fused DSP/microprocessor, speeding time-to-development and greatly broadening the software choices available to the developer. Every one of the ideas and trends discussed above increases the demands placed on the high-level-language compiler for a high-speed DSP.

Equator set out to build a broadband signal processor family to transform digital video products in the same way that DSPs transformed modems a decade ago. Equator's goal was to handle a range of functions, including three-stream SDTV recording and HDTV decoding, purely in C, with native support for the Linux and VxWorks operating systems. Achieving those goals required high parallelism embedded in a general-purpose processor, with deep pipelines, VLIW architecture, SIMD operations and specialized operation units all coming into play.
Today, Equator's MAP-CA and MAP-BSP-15 processors execute more than 100 pixel-level operations per clock cycle and carry out 16 multiply-add (MAC/MADD) operations per clock cycle. Specialized functional units alongside the CPU carry out the high-speed memory transfers, encryption, video filtering and entropy-coding steps needed in digital video systems, greatly speeding overall performance at modest area cost. This mixed design approach has allowed Equator's processors to match the efficiency and cost characteristics of hardwired solutions while providing the benefits of complete programmability. Operating at 333 MHz, the MAP-BSP-15 delivers 5.8 billion MACs per second and more than 33 Gops of image-processing power at less than 2 watts of core power; processing CIF MPEG-4 requires less than 400 milliwatts.

MAP is the only high-performance DSP that natively runs full Linux. Linux on a high-performance DSP means a huge space of custom products can be developed rapidly, with minimal porting effort, using a large base of readily available software, with massive DSP power for key functions. That enables, for example, interactive television appliances that combine network stacks, user interfaces, digital-music management, full web access, HDTV picture processing, digital recording and image enhancement in a single platform.
These ideas can collectively yield a large (10x to 100x) performance improvement over simpler DSPs.
In the past, DSP vendors have provided compilers that handled only the "control" portion of applications and have required assembly-language coding for the critical DSP loops. As algorithms become more adaptive and complex, and as high-performance DSPs embody more of the techniques discussed above, the mixed C/assembly methodology becomes less and less tenable.