How to extend configurable CPU performance

By Steve Leibson, Technology Evangelist, Tensilica Inc., San Jose, Calif., EE Times
February 8, 2002 (11:50 a.m. EST)
URL: http://www.eetimes.com/story/OEG20020208S0043

Embedded systems generally have a few well-identified processing choke points. Systemic responses to increased processing needs apply a broad brush to a problem that can likely be solved by a more focused solution that uses less power and costs less.

One way to add processing power for specific tasks that represent a design's choke point is by adding hardware accelerators. This is a proven approach and is common when the processing need can easily be encapsulated. MPEG-2 decoders are examples of such accelerators used in the past.

Extensible, configurable (tailored) processors now provide another way of solving processing performance problems for system-on-chip-based designs. Tailored processors offer many of the benefits of hardware accelerators (adding hardware for specific processing problems), while solving some of the problems associated with the design of hardware accelerators.

The practice of developing hardware accelerators started when microprocessor instruction-set architectures (ISAs) were fixed and immutable. Although every fixed-ISA packaged processor contains control and sequencing logic that could operate the hardware accelerator, that logic is embedded in the processor and is not available for other uses. Even in the ASIC era, most processor cores tend to come packaged so that the processor's internal control and sequencing logic is not exposed for other uses.

Now that synthesizable processors are available, this need not be. It is possible to develop tailorable processors that can be customized to meet processing needs not foreseen by the processor's designers. This approach maximizes transistor use and results in system designs that meet processing goals at the lowest cost in terms of transistors, power dissipation and silicon use. In addition, using the processor's proven control and sequencing logic to manage acceleration hardware greatly simplifies the task of designing and debugging that acceleration hardware, and simplifies the task of harnessing the accelerator in software.

Customized processing

A good example of this is the implementation of the Data Encryption Standard (DES) specification, which first appeared in 1977 as a U.S. federal standard and is based on a 56-bit encryption key. A variation of DES called Triple-DES iterates the algorithm three times with three different keys, for a combined key length of 168 bits.

Many Triple-DES designs employ a hardware accelerator to achieve the desired level of encryption and decryption performance. Tailored microprocessors provide another way to achieve desired performance levels by allowing the design team to add the specialized registers, instructions and function units needed to execute the DES algorithm while reusing the microprocessor's proven sequencing and control logic.

Triple-DES tailoring of a microprocessor requires the addition of three special-purpose registers: a 64-bit register (DATA) to hold the input data for encryption or decryption, and two 28-bit registers (called C and D in this example) to hold the split-key values. In addition, adding four relatively simple machine instructions to the base set of processor instructions collapses the inner loops of the Triple-DES algorithm. These are:

STKEY, to move the key from general-purpose registers to the C and D registers,
STDATA, to move a 64-bit data word from general-purpose registers to the DATA register,
LDDATA, to move the high or low 32 bits from DATA to a general-purpose register,
DES, to rotate the C and D registers by 1 or 2 bits for encryption or decryption.
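
As a rough illustration of the added state, the three registers and four instructions can be modeled in software. This is only a sketch under stated assumptions, not the actual instruction definitions: the class name, method signatures and the way the key halves are passed in are hypothetical, and the rotation direction (left for encryption, right for decryption) is assumed from the standard DES key schedule. Only the register traffic and key rotation described above are modeled, not a full DES round.

```python
MASK28 = (1 << 28) - 1
MASK32 = (1 << 32) - 1


def _rotl28(value, shift):
    """Rotate a 28-bit value left by `shift` bits."""
    return ((value << shift) | (value >> (28 - shift))) & MASK28


class DesRegisterModel:
    """Hypothetical model of the added state: DATA (64 bits), C and D (28 bits)."""

    def __init__(self):
        self.data = 0
        self.c = 0
        self.d = 0

    def stkey(self, c_half, d_half):
        # STKEY: copy the split-key halves from general-purpose registers.
        self.c = c_half & MASK28
        self.d = d_half & MASK28

    def stdata(self, high, low):
        # STDATA: load a 64-bit word supplied as two 32-bit register values.
        self.data = ((high & MASK32) << 32) | (low & MASK32)

    def lddata(self, high):
        # LDDATA: read back the high or low 32 bits of DATA.
        return (self.data >> 32) & MASK32 if high else self.data & MASK32

    def des_rotate(self, bits, decrypt=False):
        # DES (rotation part only): rotate C and D by 1 or 2 bits.
        # Assumed direction: left to encrypt, right to decrypt.
        if bits not in (1, 2):
            raise ValueError("rotation must be 1 or 2 bits")
        shift = 28 - bits if decrypt else bits
        self.c = _rotl28(self.c, shift)
        self.d = _rotl28(self.d, shift)
```

In this model, rotating by one bit for encryption and then by one bit for decryption returns C and D to their starting values, which is the property the key schedule relies on.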

Each of these new instructions executes in one machine cycle just like the microprocessor's native RISC instructions. Each of the new machine instructions replaces a C-level call to a subroutine that performed the same task in the original Triple-DES program written in C, so the original program flow is not affected. The addition of the three new registers and four new instructions has a profound effect on program execution.

With the acceleration factors observed, a tailored 200-MHz 32-bit RISC processor can encrypt or decrypt a 10-Mbit/second MPEG-2 data stream using less than 3 percent of its available bandwidth while adding just 5,000 gates to the core processor design. An untailored 200-MHz processor could not even attempt to handle the 10-Mbit/second data stream in real time.
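
To put those figures in perspective, the cycle budget they imply can be worked out directly. The sketch below assumes the stream is processed in 64-bit blocks (the width of the DATA register) and treats "3 percent of available bandwidth" as 3 percent of the clock cycles; the function name and parameters are illustrative only.

```python
def cycle_budget_per_block(clock_hz=200_000_000, stream_bits_per_s=10_000_000,
                           block_bits=64, load_percent=3):
    """Upper bound on cycles available per block at the stated processor load."""
    blocks_per_second = stream_bits_per_s / block_bits   # 64-bit blocks in the stream
    cycles_available = clock_hz * load_percent / 100     # cycles/second at that load
    return cycles_available / blocks_per_second
```

With the default values this works out to 156,250 blocks per second and about 38 cycles per 64-bit block, which is the ceiling implied by the quoted numbers rather than a measured figure.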

Such image-compression algorithms as JPEG and JPEG 2000 generally start by converting RGB image data into the YCrCb color space, which separates luminance information from color information. The reason for the conversion is that color information is much less important, and can be stored at lower resolution, than luminance information, resulting in substantial image compression. Conversion requires passing each RGB pixel through a matrix.

The conversion of each pixel requires 18 operations: nine multiplications and nine additions or subtractions. A conventional processor will need to perform each of those operations separately.
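
The operation count can be checked with a short software version of the conversion. This is a sketch only: it assumes the full-range JFIF coefficients commonly used with JPEG (the article does not give the matrix), returns the components in Y, Cb, Cr order, and counts each per-component offset as one of the nine additions, which is one plausible way of arriving at 18 operations.

```python
# Assumed coefficients: full-range JFIF RGB -> YCbCr, used for illustration.
MATRIX = (
    (0.299, 0.587, 0.114),            # Y
    (-0.168736, -0.331264, 0.5),      # Cb
    (0.5, -0.418688, -0.081312),      # Cr
)
OFFSET = (0.0, 128.0, 128.0)


def rgb_to_ycbcr(r, g, b):
    """Return ((y, cb, cr), multiplies, adds) for one pixel."""
    pixel = (r, g, b)
    out = []
    multiplies = adds = 0
    for row, offset in zip(MATRIX, OFFSET):
        acc = offset
        for coeff, channel in zip(row, pixel):
            acc += coeff * channel    # one multiply and one add per term
            multiplies += 1
            adds += 1
        out.append(acc)
    return tuple(out), multiplies, adds
```

Each call performs nine multiplications and nine additions as separate steps, which is the work a conventional processor has to issue instruction by instruction.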

A tailored processor allows the designer to create one instruction, a matrix-multiplication instruction with pre- and postscaling, that performs the entire operation.

A general-purpose color-space-conversion instruction for Tensilica's synthesizable Xtensa processor, for example, adds 7,500 gates to the base processor design, resulting in an instruction that converts one pixel into YCrCb at the rate of one conversion in two clock cycles.

The net acceleration for this function is 8.3 times. That's the equivalent of running a 200-MHz processor at roughly 1.66 GHz. This version of the color-space-conversion instruction implements the matrix constants as registered variables to allow for conversion to multiple color spaces. If that feature isn't needed, the constants could be hard-coded into the instruction, thereby reducing the number of gates needed for implementation.
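
The clock-rate equivalence follows from simple multiplication, as the sketch below shows. The second function is an inference rather than a figure from the article: it assumes the 8.3 times speedup is measured per pixel against the two-cycle instruction, and both function names are illustrative.

```python
def equivalent_clock_mhz(base_mhz, speedup):
    """Clock rate an untailored processor would need to match the speedup."""
    return base_mhz * speedup


def implied_baseline_cycles(tailored_cycles, speedup):
    """Cycles per pixel the untailored code would take, under the stated assumption."""
    return tailored_cycles * speedup
```

For a 200-MHz processor and a speedup of 8.3, the first function gives about 1,660 MHz, in line with the figure quoted above; the second suggests a baseline of roughly 16.6 cycles per pixel.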