
Reconfigurable approach supersedes VLIW/ superscalar

By Alan Marshall, Chief Technology Officer, Elixent, Bristol, England, EE Times
November 15, 2001 (3:30 p.m. EST)
URL: http://www.eetimes.com/story/OEG20011115S0054

As the complexity of advanced DSPs increases, it is becoming much more difficult to program them effectively to meet system performance goals. Assembler-level programming becomes ever more unrealistic, and compilers have become extremely complex. Consequently, adding new parallel execution units to a typical VLIW machine does not lead to a linear increase in performance. The effective use of the processor is also highly dependent on the algorithm that is being executed.

Reconfigurable signal processing architectures are an effective way of achieving high performance while overcoming these issues. For example, the architectures can be easily targeted toward any number of defined algorithms that can be run in parallel or sequentially. All of the allocated compute resource will be fully used, independent of the type of algorithm being implemented. Reconfigurability also provides algorithmic flexibility, such as swapping between DCT, Quantizer and Huffman functions as required. This flexibility has the potential to save silicon area. The designer can trade off system performance against final implementation area, substantially reducing silicon cost.

In the world of digital signal processing, designers have a number of ways of increasing performance but the majority of these either sacrifice die area or are currently unfeasibly complicated to implement successfully.

Traditional processor designs for very high-performance applications are known as "superscalar." In superscalar designs, the processor or compiler determines if a group of instructions may be executed simultaneously or whether they must be executed in a particular sequence. The processor can then use multiple execution units to carry out two or more independent instructions simultaneously. This is known as instruction-level parallel processing and it can be a powerful tool for dealing with high-level, very complex algorithms.

Parallel processing has the ultimate objectives of obtaining unlimited scalability (handling an increasing number of interactions at the same time) while simultaneously reducing execution time. However, difficulties arise with using it in a DSP environment: as design complexity rises, programming the DSP accurately and quickly becomes steadily harder. Programming at assembler level becomes increasingly unrealistic, and the only other way of implementing the code, a compiler, also increases in complexity.

Many designers and chip architects have looked at alternatives to superscalar designs, finding few viable options. One option, that of very long instruction word (VLIW), has been seen to show promise. VLIW compilers extract a parallel instruction stream from an application program and distribute those instructions to numerous execution units. Unlike a superscalar processor, there is no further reordering of instructions undertaken by the VLIW hardware and the hardware is therefore simpler. Sadly, even the "smartest" compiler software cannot always extract sufficient parallelism from the application to keep all the parallel execution units running all the time in a complex VLIW. Clearly, a different approach is needed.

Elixent's engineers have taken the idea of an easily scalable, reconfigurable architecture and turned it into reality to help solve the performance and flexibility issues plaguing designers. The company's reconfigurable ALU array (RAA) provides an innovative design alternative to both superscalar and VLIW architectures. It uses an array of 4-bit arithmetic logic units (ALUs) and register/buffer blocks that may be combined to support variable data word widths, something that VLIW architectures currently do not do well. The ALUs are positioned on the array in the style of a chessboard, alternating with adjacent "switch boxes" that can serve as either a crossbar switch or 64 bits of user memory.

An additional 256-byte memory block may be inserted for every 16 ALU blocks. The ALU is complex enough to support a useful and broad instruction set while still providing area-efficient arithmetic for the 8- to 24-bit data lengths that are typically needed for multimedia applications.
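To illustrate how 4-bit ALUs might be combined into the wider 8- to 24-bit datapaths mentioned above, here is a minimal Python sketch. It assumes the slices are ganged by rippling a carry from one nibble to the next; the article does not describe the actual combining mechanism, and the names `alu4_add` and `chained_add` are hypothetical.

```python
def alu4_add(a, b, carry_in=0):
    """Model one 4-bit ALU slice: add two nibbles plus a carry bit.

    Returns (sum_nibble, carry_out). Illustrative model only, not the
    real RAA instruction set.
    """
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4


def chained_add(x, y, width_bits):
    """Add two words by chaining width_bits / 4 nibble slices.

    Assumption: slices are combined by passing the carry from the
    least-significant nibble upward.
    """
    if width_bits <= 0 or width_bits % 4:
        raise ValueError("width must be a positive multiple of 4 bits")
    result, carry = 0, 0
    for i in range(width_bits // 4):
        shift = 4 * i
        nibble, carry = alu4_add((x >> shift) & 0xF, (y >> shift) & 0xF, carry)
        result |= nibble << shift
    return result, carry


if __name__ == "__main__":
    # A 16-bit add built from four slices, then a 24-bit add from six.
    print(chained_add(0x1234, 0x0FFF, 16))
    print(chained_add(0xABCDEF, 0x111111, 24))
```

In this model a 24-bit operation simply occupies six slices and an 8-bit operation two, which is the sense in which a nibble-wide array avoids wasting datapath width on narrow multimedia data.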

This scheme allows extremely flexible interconnectivity, with each ALU having input and output buses on all four sides, enabling each to exchange data with any of the eight surrounding ALUs or with a longer-distance routing network. It greatly simplifies interconnect and routing, minimizing the silicon overhead necessary for programmability. A further advantage of the architecture is its rapid reconfigurability: multiple complex algorithms can be run sequentially and it is easy to swap between them whenever needed. This means that all the algorithms can be run on the same piece of silicon, minimizing die area.
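The chessboard arrangement can be pictured with a small Python sketch. It rests on two assumptions that the article does not confirm: that ALUs occupy the squares where row plus column is even, and that the "eight surrounding ALUs" are the four diagonal ALUs plus the four ALUs two cells away along a row or column (the cells directly adjacent being switch boxes). All names are hypothetical.

```python
def is_alu(row, col):
    # Assumption: ALUs sit on the squares where row + col is even;
    # the remaining squares hold switch boxes.
    return (row + col) % 2 == 0


def board(rows, cols):
    """Render the array as text: 'A' for an ALU, 'S' for a switch box."""
    return ["".join("A" if is_alu(r, c) else "S" for c in range(cols))
            for r in range(rows)]


# Assumed neighbourhood: four diagonal ALUs plus four ALUs two cells
# away horizontally or vertically.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 1), (1, -1), (1, 1),
                     (-2, 0), (2, 0), (0, -2), (0, 2)]


def nearest_alus(row, col, rows, cols):
    """List the ALU positions reachable from the ALU at (row, col)."""
    if not is_alu(row, col):
        raise ValueError("position holds a switch box, not an ALU")
    return [(row + dr, col + dc) for dr, dc in NEIGHBOUR_OFFSETS
            if 0 <= row + dr < rows and 0 <= col + dc < cols]


if __name__ == "__main__":
    print("\n".join(board(5, 5)))
    print(nearest_alus(2, 2, 5, 5))
```

Under these assumptions an interior ALU sees eight neighbouring ALUs, while ALUs at the edge of the array see fewer and would rely on the longer-distance routing network.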

The choice of a nibble-sized ALU and bus structure means that only a few bytes of memory are required to configure each ALU and its associated wiring. This small amount of configuration memory is a major contributor to the density of the RAA and to its rapid reconfiguration. It allows a 512-ALU RAA to be completely reconfigured in a few tens of microseconds over a DRAM interface, for example, and partial reconfiguration can occur in even less time. This makes the updating of an RAA extremely rapid and straightforward.

Having many small ALUs and a very flexible interconnection structure also helps solve the compiler problem. Instead of trying to find parallelism in the application that matches the capabilities of the execution units in a VLIW or superscalar processor, an RAA compiler can identify the actual parallelism in the application, and then connect up the ALUs to match it.
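As a toy illustration of what "identifying the actual parallelism" can mean, the Python sketch below groups the operations of a small dataflow graph into stages whose members have no dependencies on one another. This is not Elixent's compiler; the graph format, the function name `dataflow_levels` and the four-tap example are all assumptions made for the sake of the example.

```python
def dataflow_levels(graph):
    """Group operations into stages that can execute in parallel.

    `graph` maps each operation name to the list of operations whose
    results it consumes. Operations with no inputs land in stage 0;
    every other operation lands one stage after its latest input.
    """
    level = {}

    def depth(node):
        if node not in level:
            level[node] = 1 + max((depth(d) for d in graph[node]), default=-1)
        return level[node]

    for node in graph:
        depth(node)

    stages = {}
    for node, lv in level.items():
        stages.setdefault(lv, []).append(node)
    return [sorted(stages[lv]) for lv in sorted(stages)]


if __name__ == "__main__":
    # Hypothetical four-tap filter: four multiplies feeding an adder tree.
    fir = {
        "m0": [], "m1": [], "m2": [], "m3": [],
        "a0": ["m0", "m1"],
        "a1": ["m2", "m3"],
        "out": ["a0", "a1"],
    }
    for stage in dataflow_levels(fir):
        print(len(stage), "parallel ops:", stage)
```

The stage widths (here four, two and one) are a property of the algorithm itself. A fixed VLIW machine has to fit them into whatever execution units it happens to have, whereas an array of small ALUs could, in principle, be wired to match them directly.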

With the RAA architecture, it is possible to create highly flexible SoC designs and, depending on the system requirements, performance can be easily traded off against die size. Extra room can be left for future implementation of additional algorithms, upgrades and emerging standards, for example. This in itself is a very powerful ability: the shelf lives of entire product families can be extended significantly because they can be reconfigured to include the most up-to-date algorithms.

Current digital still cameras typically use an ASIC, coupled with a number of peripheral chips, including an image sensor, and DRAM and flash memory. The ICs usually contain a preview engine, DSP, RISC processor and interfaces, performing a sequence of processing operations that includes:

1. Generation of the preview image and the capture and buffering of that image in DRAM as the picture sequence is started.

2. Running a number of complex algorithms that perform operations such as correcting for lens aberrations, interpolating bad pixels and colors, autofocus, edge enhancement, image compression and writing to flash memory.

The performance of the DSP element of the controller is critical to the overall usability of the camera, and a slow DSP can introduce a range of problems including long focus delay, long shot-to-shot delays and poor replay rates. Often, the DSP technology integrated into digital-still-camera controller chips is a spin-off from standard general-purpose DSP products, with fixed multiplier capabilities, which achieves throughputs of around 2 Mpixels/second in a typical case. This equates to an overall performance of perhaps 1 frame/s for a midpriced camera.

If the DSP segment of the controller IC is implemented with an RAA containing 1,024 ALUs, a number of advantageous architectural changes can be made. The RAA can increase performance for each pixel processing stage by reconfiguring its datapath, an operation that takes fractions of a millisecond. This allows each image processing stage to be performed by an array configured especially for the task.

Throughput is increased dramatically because of datapath optimization and because all of the silicon can be applied in either massive parallelism or via multiple signal paths. Apply these gains throughout the entire RAA-based pixel processing chain, and an overall performance of around 60+ Mpixels/s can be realized, over an order of magnitude greater than the traditional solution. This performance is so far in excess of that required to enhance camera operation that it opens up the additional possibility of eliminating other elements of the controller chip set.
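The throughput figures quoted here can be turned into frame rates with a few lines of Python. The 2 and 60 Mpixels/s values come from the article; the 2-Mpixel sensor is an assumption chosen to match the "perhaps 1 frame/s" figure for a midpriced camera, and the 6-Mpixel sensor is borrowed from the DRAM discussion that follows.

```python
def frames_per_second(throughput_pixels_per_s, pixels_per_frame):
    """Frame rate sustained by a pixel pipeline of the given throughput."""
    return throughput_pixels_per_s / pixels_per_frame


DSP_THROUGHPUT = 2e6    # conventional DSP core, pixels/s (from the article)
RAA_THROUGHPUT = 60e6   # RAA-based chain, pixels/s (from the article)

MIDPRICED_SENSOR = 2e6  # assumed sensor size, pixels
LARGE_SENSOR = 6e6      # 6-Mpixel sensor used in the DRAM example

if __name__ == "__main__":
    print("speedup:", RAA_THROUGHPUT / DSP_THROUGHPUT)
    print("DSP, 2 Mpixel:", frames_per_second(DSP_THROUGHPUT, MIDPRICED_SENSOR), "frame/s")
    print("RAA, 6 Mpixel:", frames_per_second(RAA_THROUGHPUT, LARGE_SENSOR), "frame/s")
```

On these numbers the ratio is a factor of 30, consistent with "over an order of magnitude", and a 6-Mpixel sensor could in principle be serviced at about 10 frames/s.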

In a digital still camera, there are two obvious areas for improvement. The "preview engine" block that displays a smaller, lower-resolution version of the image on the LCD is usually implemented via a dedicated piece of logic. The RAA can produce a preview at virtually any point in the pixel processing sequence, making a separate preview engine redundant. The second area is in the DRAM. To make sure that cameras can shoot multiple shots quickly, DRAM is typically used to buffer images before pixel processing takes place, with JPEG compressed images stored in flash memory. Because of the pixel processing required, a 6-Mpixel camera might require 16 Mbytes of DRAM for each frame of storage, and two or more frames are usually required for a reasonable degree of usability. The RAA can dispense with this silicon: it has the computational bandwidth and on-chip memories to compress images on the fly, so further images can be taken from the sensor at the clock rate.

In general, IC reconfigurability carries a penalty, which when compared to a standard chip set or ASIC, can be summarized as larger silicon area and higher power consumption. However, with the performance increases and chip-set savings already mentioned, it is easy to see that the penalties described above can be offset, making Elixent's RAA technology more than viable.

As the use of the RAA eliminates the preview engine and DRAM array, any marginal increase in the size of the DSP core in relation to the overall real estate of the controller is easily compensated for. Each RAA ALU requires sixteen bytes of data per configuration change. So, a 10-step pixel processing sequence for a 1,024-ALU design would require a few hundred kbytes, a trivial amount of additional memory in the overall system.
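The configuration-memory estimate can be checked with a short Python calculation. The sixteen bytes per ALU, the 1,024-ALU array and the 10-step sequence are taken from the text; the helper name `config_bytes` is hypothetical, and the sketch counts only the raw per-ALU data, ignoring any other overhead a real configuration image might carry.

```python
BYTES_PER_ALU_CONFIG = 16  # per-ALU configuration data, from the article


def config_bytes(num_alus, num_configs=1):
    """Raw configuration storage for `num_configs` full-array setups."""
    return num_alus * BYTES_PER_ALU_CONFIG * num_configs


if __name__ == "__main__":
    camera = config_bytes(1024, 10)  # 10-step sequence on a 1,024-ALU array
    single = config_bytes(512)       # one full load of a 512-ALU array
    print(camera, "bytes =", camera / 1024, "kbytes")
    print(single, "bytes =", single / 1024, "kbytes")
```

The raw product comes to about 160 kbytes, the same order as the figure quoted above, and a single full configuration of a 512-ALU array is only 8 kbytes, which helps explain why reconfiguration over a DRAM interface can be so quick.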

For a given function, an RAA core will consume more power than the equivalent optimized ASIC, but when considered at the system level with the elimination of other silicon taken into account, it more than balances out. However, when compared with the DSP element of an ASSP controller, the RAA typically consumes 25 percent less power. This is because DSP cores traditionally used are variants of standard designs and come with a range of general-purpose resources. Since many of these resources are redundant during multimedia operations, they consume unnecessary power.

Reconfigurable signal processor architectures offer an effective and more than viable alternative to using VLIW architectures when trying to meet high-performance system requirements. In addition, the RAA format eliminates many of the software problems associated with both superscalar and VLIW options.

See related chart
