Adaptable computing right for MPEG-4
By Paul Master, Vice President of Technology, QuickSilver Technology Inc. San Jose, Calif., EE Times
August 14, 2000 (2:58 p.m. EST)
URL: http://www.eetimes.com/story/OEG20000814S0033
The arrival of "Internet time" is straining current design methodologies, just like putting a 1,500-horsepower engine into a Model T would break the structure around it. The arrival of Internet time on new algorithms also applies to digital signal processors (DSPs)-it's rapidly outpacing the ability of DSP silicon designers to produce DSPs that run the latest generation of algorithms efficiently. The fundamental problem is that today's algorithms are running on DSP chips that have been specified, conceived, designed and etched into silicon two years in the past. A key example of this technology evolution is the rapid development in multimedia, especially as it is envisioned to operate on mobile communications devices. Current second-generation (2G) personal digital assistants, as well as the new 2.9G and 3G mobile wireless handsets/terminals, call for MPEG-4 streaming video, the latest-generation video compression/decompression standard pri marily targeted at devices with medium-speed data communication links. A complexity analysis of MPEG-4 shows that roughly 20 to 30 percent of the MPEG-4 computations are in the discrete cosine transform (DCT) function; 20 to 30 percent are in Huffman encoding; and another 20 to 30 percent of the computations are in motion estimation. Such streaming video demands considerably higher levels of processing performance. A dual streaming video MPEG-4 data stream (one encode channel and one decode channel) at full color, 160 x 160-pixel resolution and at 15 frames per second requires about 3.3 billion operations per second. We will take a more detailed look at DCT, since it is also used in MPEG-1, MPEG-2 and JPEG. OEM handset engineers are expected to encounter major design issues even armed with today's most advanced DSP technology. In order to reduce the computations required, developers could deploy multiple DSPs, which consume inordinate power; build custom ASIC acceleration circuitry or a combi nation of DSPs and ASIC acceleration circuitry; or degrade the video quality. The problems lie squarely in the inflexibility of DCT algorithms implemented for both DSPs and ASICs. In both instances, the systems engineer experiences failures because these devices were designed in earlier years with a particular set of algorithms in mind. And they are not flexible enough to change with the latest generation of algorithms. Adaptive computing machine (ACM) technology, on the other hand, is an example of the newer, more efficient system-on-chip configurable design approaches that give system designers wide latitude for implementing virtually any algorithm, old and new. ACM enables a wireless handset to be packed with an endless number of different functions.
Let's illustrate the problems of rigid silicon hardware. All current-generation DSPs have been optimized to perform finite impulse response (FIR) filtering very well; it was the obvious killer application two to three years ago, when those products were designed. The FIR filter is used heavily in digital signal processing. At its core, it requires the ability to multiply data against a coefficient and store the result in an accumulator. What sets a DSP apart from a standard RISC processor is the additional silicon added to accelerate FIR operations. A modified Harvard architecture allows the reading, decoding and executing of two separate instructions in parallel.
Two memory banks, along with additional buses to access them, allow two simultaneous data accesses per clock cycle. Specialized address-mode hardware allows the memories to be read sequentially without executing separate instructions. In addition, specialized hardware allows zero-overhead loops. With those additions, the DSP can sustain two simultaneous multiply-accumulate (MAC) operations every clock cycle.
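To make this concrete, here is a minimal C sketch of the FIR inner loop that all of this hardware is built to accelerate (a generic illustration, not any vendor's code; the function name and Q15 scaling are assumptions of mine):

```c
/* FIR filter: y[n] = sum over k of h[k] * x[n-k].
 * On a dual-MAC DSP the inner loop collapses to one MAC per tap:
 * the two memory banks supply h[] and x[] in parallel, the address
 * generators step the pointers, and zero-overhead loop hardware
 * absorbs the loop bookkeeping. In portable C, each of those
 * costs explicit instructions. */
void fir(const short *x, const short *h, short *y,
         int n_samples, int n_taps)
{
    for (int n = n_taps - 1; n < n_samples; n++) {
        long acc = 0;                      /* the accumulator */
        for (int k = 0; k < n_taps; k++)   /* zero-overhead loop on a DSP */
            acc += (long)h[k] * x[n - k];  /* one MAC per tap */
        y[n] = (short)(acc >> 15);         /* assumes Q15 coefficients */
    }
}
```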
Building block
Note the process that occurred here: the FIR algorithm was used to add to and change the architecture of the DSP so that this one operation could be performed efficiently.
Because there is a two-year lag between an Internet standard appearing and DSPs designed to run that standard efficiently, expect to see DCT-optimized DSPs in two years. In a design example, a 1-D DCT is used as a building block for a 2-D DCT. The overall performance of the 2-D DCT depends on how the 1-D DCTs are organized; the usual row-column arrangement is sketched below.
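As a hedged illustration (the function names are mine, and any 8-point 1-D DCT routine could stand in for dct_1d_8), the row-column organization looks like this in C:

```c
/* Any 8-point 1-D DCT, assumed to transform v[] in place
 * (for example, a 29-addition/5-multiplication fast DCT). */
void dct_1d_8(short v[8]);

/* 2-D 8x8 DCT built from 1-D DCTs: transform each row,
 * then each column of the row-transformed block. */
void dct_2d_8x8(short block[8][8])
{
    short col[8];
    for (int r = 0; r < 8; r++)   /* eight row passes */
        dct_1d_8(block[r]);
    for (int c = 0; c < 8; c++) { /* eight column passes */
        for (int r = 0; r < 8; r++)
            col[r] = block[r][c];
        dct_1d_8(col);
        for (int r = 0; r < 8; r++)
            block[r][c] = col[r];
    }
}
```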
The JPEG still-image compression standard, the H.261 and H.263 videoconferencing standards and the MPEG-1, MPEG-2 and MPEG-4 digital video standards all use the DCT. In those standards, a 2-D DCT is applied to 8 x 8 blocks of pixels in the image being compressed.
The 64 coefficients produced by the DCT are then quantized to provide the actual compression. In typical images, most DCT coefficients for an 8 x 8 block of pixels are small and become zero after quantization. That property of the DCT on real-world images is critical to the compression schemes.
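A sketch of that quantization step (simplified to truncating division; real coders round and use standard-specific quantizer tables, so q[] here is a placeholder):

```c
/* Quantize 64 DCT coefficients. Coefficients smaller in magnitude
 * than their quantizer step become zero, and those zeros are what
 * the entropy coder exploits for compression. */
void quantize(short coeff[64], const short q[64])
{
    for (int i = 0; i < 64; i++)
        coeff[i] /= q[i];  /* integer division truncates toward zero */
}
```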
The complexity of this algorithm significantly exceeds that of the FIR filter, and a DSP architecture designed to do FIR operations efficiently fails to do DCT operations efficiently. The complexity shows in the richness of the interconnection, the heavy use of MACs and the large number of specialized arithmetic logic unit (ALU) operations required. Each 1-D DCT requires on the order of 29 additions and five multiplications.
The latest modified Harvard architecture DSPs, with dual-instruction issue and dual MACs, highlight these issues. A DCT operation can be implemented on them in two different ways, keeping in mind that two instructions can be executed in one clock cycle.
On a DSP of that type, for regular address modes, four data words are pulled from the dual data memories, the dual MACs operate on the data and the specialized zero-overhead loop hardware keeps track of the looping. However, the addressing modes required for the DCT are nonregular, so the hardware in place for FIR-filter addressing is the wrong type for DCT addressing.
MAC limitations
This nonregularity of addressing forces the programmer to explicitly calculate an address for each MAC operation. The data accesses for the DCT follow a nonregular address scheme, not a regular increment-by-N structure. The eight pixels of data are all addressed multiple times in a nonsequential, mixed order until the DCT is finished, and then the next eight pixels are processed.
There are two ways of structuring the instructions to process the DCT. In the first, each clock cycle executes two instructions: one (calculate-address & get-data) paired with one (MAC operation). In the second, the first clock cycle performs (calculate-address & get-data) + (calculate-address & get-data), and the second clock cycle performs (MAC operation) + (MAC operation). In both cases, because of the advanced DSP's inherent architectural limitations, only one MAC can be used in a sustained manner. In effect, the hardware architecture, developed at great time and cost, is wasted.
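The first scheduling option, written out as a C sketch (the permutation table is purely illustrative, not the actual DCT access order), shows why each MAC drags an explicit address calculation along with it:

```c
/* DCT-style inner product with nonregular addressing: the operand
 * index comes from a table rather than a regular increment, so the
 * FIR address generators cannot produce it, and every MAC must be
 * paired with an explicit address calculation. */
static const int dct_order[8] = {0, 7, 3, 4, 1, 6, 2, 5}; /* illustrative only */

long dct_tap_sum(const short *x, const short *c)
{
    long acc = 0;
    for (int k = 0; k < 8; k++) {
        int addr = dct_order[k];      /* calculate-address & get-data */
        acc += (long)x[addr] * c[k];  /* MAC operation */
    }
    return acc;
}
```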
Based on one of these latest-generation DSPs, execution time for one 8 x 8 DCT has been estimated at 1,853 CPU cycles. Entering and exiting the DCT function requires 25 cycles, or 1 percent of the total; the DCT on rows requires 926 cycles, or 50 percent; and the DCT on columns requires 902 cycles, or 49 percent. Loop frequencies are taken into account to obtain accurate power and clock-cycle estimates.
Studies show that 15 images per second is an acceptable frame rate. Some 600 basic DCTs are required to process one frame of 160 x 160-pixel full-color image, which leads to allocating approximately 17 MHz of the DSP's available clock cycles just to perform the DCT portion of MPEG-4. Increasing the resolution to get a larger picture will, of course, drive the processing requirements up steeply, since the load grows with the square of the linear picture dimensions.
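Those figures are self-consistent (my arithmetic, assuming 4:2:0 chrominance subsampling): a 160 x 160 frame holds 400 8 x 8 luminance blocks plus about 200 chrominance blocks, or 600 DCTs per frame, and 600 DCTs x 1,853 cycles x 15 frames/s comes to roughly 16.7 million cycles per second, hence the 17-MHz figure.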
The DCT algorithm is influencing the DSPs currently being designed, which will come to market in about two years. Designers can expect to see much more complicated addressing hardware, multiported data memories feeding multiple MAC units, even more ALUs and specialized ALU instruction modes to perform the paired add/subtract operations found in the DCT.
With rigid hardware, as long as the rate of innovation is relatively slow, the time it takes to design and build new DSP hardware is not critical. However, the rate of innovation is increasing.
The latest-generation DCT algorithm coming out of the research community is the BIN (binary) DCT algorithm, which takes the previous DCT implementation of adders and multipliers and replaces the multipliers with adders. Adders require fewer gates and are considerably simpler structures. BIN DCT consists of two major parts: the adder matrix realization and the add-and-shift process.
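The substitution works because multiplying by a fixed coefficient decomposes into shifts and adds. A minimal illustration in C (the constant is chosen for clarity and is not an actual BIN DCT coefficient):

```c
/* Multiply by the constant 13 = 8 + 4 + 1 using only shifts and
 * adds: the basic trick that lets BIN DCT replace every hardware
 * multiplier with a handful of adders. */
static inline int mul13(int x)
{
    return (x << 3) + (x << 2) + x;  /* 8x + 4x + x = 13x */
}
```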
The adder matrix is a two-level adder butterfly structure. The first level consists of 12 eight-bit adders working in parallel. The eight DCT inputs are fed to this level, and its outputs are 12 nine-bit numbers, each the sum of two inputs. Each output is routed to more than one adder input at the next level.
The second level consists of 22 nine-bit adders working in parallel. Outputs are the 22 different four-input sums (each 10 bits wide) needed by BIN DCT to generate the final eight outputs of the DCT. Outputs are routed to the next level, which is the add-and-shift process.
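In dataflow terms, the adder matrix is two parallel ranks of adders. The C sketch below shows the shape of the computation; the wiring tables are placeholders, since the actual BIN DCT pairings are fixed by the algorithm and not reproduced here:

```c
/* Shape of the BIN DCT adder matrix: level 1 forms 12 two-input
 * sums of the 8 inputs; level 2 pairs level-1 outputs to form the
 * 22 four-input sums. w1 and w2 are placeholder wiring tables. */
void adder_matrix(const short in[8], short l1[12], short l2[22],
                  const int w1[12][2], const int w2[22][2])
{
    for (int i = 0; i < 12; i++)              /* 12 adders in parallel */
        l1[i] = in[w1[i][0]] + in[w1[i][1]];  /* nine-bit two-input sums */
    for (int i = 0; i < 22; i++)              /* 22 adders in parallel */
        l2[i] = l1[w2[i][0]] + l1[w2[i][1]];  /* 10-bit four-input sums */
}
```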
There are many ways to perform the add-and-shift of the 12 partial results. The most straightforward is to use one adder with its output fed back to one input, performing the process serially.
A faster technique is to use a 4-2 compressor tree, shared by all the DCT outputs. A level of 12 eight-to-one muxes follows the adder matrix, selecting the proper 12 partial results to route to the compressor tree. The 12 partial results are then compressed into two numbers, and a final adder generates the final DCT output. The process is repeated until all the DCT outputs are calculated.
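For readers unfamiliar with compressor trees, here is a word-level C sketch of the idea (hardware 4-2 compressors work bit-slice by bit-slice; this models the same arithmetic with carry-save addition):

```c
/* Carry-save adder: reduces three numbers to two whose sum is the
 * same, with no carry propagation across bit positions. */
static void csa(unsigned a, unsigned b, unsigned c,
                unsigned *sum, unsigned *carry)
{
    *sum   = a ^ b ^ c;
    *carry = ((a & b) | (a & c) | (b & c)) << 1;
}

/* 4-2 compressor from two carry-save stages: four partial results
 * in, two numbers out. A tree of these reduces the 12 partial
 * results to two numbers, after which one conventional
 * carry-propagate adder produces the final DCT output. */
static void compress_4_2(unsigned a, unsigned b, unsigned c, unsigned d,
                         unsigned *s, unsigned *t)
{
    unsigned s1, c1;
    csa(a, b, c, &s1, &c1);
    csa(s1, c1, d, s, t);
}
```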
BIN DCT is expected to outperform all previous DCT architectures. However, its processing requirements differ radically from the standard DCT algorithm's: instead of multipliers, a large number of additional ALUs is required, and the interconnection scheme between those units is even more complicated than the standard DCT's. This is, of course, only one of many new algorithms vying to be used. Its development shows that new algorithms are being created faster than rigid hardware can keep up.
Implementing second- or third-generation DCT in an adaptive computing machine gives the system engineer a considerably more efficient design. ACM technology refers to microchips that can change at blazingly fast speeds, on the fly, via downloads of different software algorithms. Adaptive computing gives the designer the ability to bring into existence computational units, and the appropriate interconnect, to achieve parallelism. Here, more MACs can be assembled to attain coarse-grained parallelism, and thus the ACM-turbocharged DCT will operate considerably faster. The ability to specialize the hardware at run-time allows an ACM-based IC to process the most demanding multimedia applications. An adaptive computing machine IC enables an algorithm to dictate the exact hardware it requires.
Specifically, in the standard DCT example, a specialized addressing unit customized to the particular DCT algorithm can be brought into existence to do the x1-to-x7 addressing and feed the MACs. This alone would double the performance of the DCT algorithm, from roughly 1,800 clocks to 900. And since an ACM's interconnect structure is not pre-allocated, more MACs and ALUs can be added to speed up the algorithm further: adding four more ALUs over the conventional DSP nearly quadruples the ACM's performance. The algorithm should not be dictated by an architecture designed two years ago to solve old problems; rather, the architecture should be dictated by the algorithm at run-time.
The ability to change with, and track, the constantly shifting standards of today's Internet-driven world, such as the latest multimedia algorithms, requires a new silicon technology: the adaptive computing machine. In essence, the ACM lets the architecture track the changing standards, eliminating the need to run today's algorithms on yesterday's hardware architectures.