Parallel DSPs: speed at a price
By Argy Krikelis, Chief Technology Officer, Aspex Technology Ltd., Uxbridge, United Kingdom, EE Times
March 12, 2001 (1:36 p.m. EST)
URL: http://www.eetimes.com/story/OEG20010312S0088

Parallel processors, collections of processing units that operate concurrently on a single application, are a natural way to increase processing performance in digital computing devices, particularly in high-performance arenas like communications, networking and imaging. But the acceptance of parallel processors in mainstream computing has been slow. Such devices are considered suitable only for highly specialized uses and are thought too complex to be mainstream. Moreover, they have a history of being difficult to program.

Two approaches for parallel-processing DSP cores are in the mainstream today: the very long instruction word (VLIW) and single-instruction, multiple-data (SIMD) architectures. The capabilities of both approaches are described here, along with a modified SIMD architecture based on associative-string processing that holds the promise of better programmability and performance for system designers.

VLIW architectures achieve a form of parallelism by packing multiple operations into a single instruction word, which is then executed as one very wide instruction. Instructions are grouped and scheduled at compile time rather than at execution time. Because the compiler fixes and optimizes the high-performance instruction sequences in advance, VLIW implementations are poor at handling the data-dependent, customized tasks required in developing end-user applications.

Indeed, experience from existing VLIW implementations indicates that while they achieve very high performance for tasks associated with standard mathematical functions, they do poorly when performing user-dependent tasks. For example, many systems need to deploy VLIW processors together with high-performance microprocessors to address the overall application requirements.
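
The compile-time packing described above can be illustrated with a minimal sketch. It is not any vendor's scheduler: the toy instruction format (a destination name plus a list of source names), the function name pack_vliw and the assumption that each destination is written only once are all hypothetical choices made for illustration.

```python
def pack_vliw(ops, width):
    """Greedily pack operations into fixed-width instruction words.

    ops   -- list of (dest, sources) tuples in program order; each dest is
             assumed to be written exactly once (hypothetical toy format)
    width -- number of operation slots per instruction word
    Returns a list of bundles, each a list of dest names issued together.
    """
    bundles = []
    ready_at = {}  # value name -> index of the first bundle allowed to read it
    for dest, sources in ops:
        # An operation cannot issue before all of its inputs are available.
        slot = max((ready_at.get(s, 0) for s in sources), default=0)
        # Skip bundles that are already full.
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(dest)
        ready_at[dest] = slot + 1
    return bundles


# Independent operations fill the wide word...
independent = [("a", ["x", "y"]), ("b", ["x", "z"]),
               ("c", ["a", "b"]), ("d", ["y", "z"])]
# ...while a dependent chain leaves most slots empty.
chain = [("a", ["x"]), ("b", ["a"]), ("c", ["b"])]
```

With two slots per word, the first sequence packs into two full bundles. The dependent chain needs three bundles even with four slots available, which hints at why code whose dependencies are known only at run time is hard to schedule statically.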

Specialized knowledge
The programming limitations of VLIW processors with high-level languages require engineers to use very low-level assembly programming or precoded functions to achieve high performance. Such programming requires specialized knowledge of the hardware architecture, which can be extremely complex in VLIW processor designs. The result is a solution forced to fit the hardware rather than one that exploits the application's properties, especially its natural parallelism.

The SIMD model assumes a number of processing elements, each executing exactly the same sequence of instructions on its own local data. The key advantages of this approach are a reduction in overall hardware complexity, design regularity, the enhancement of computing resources and a simplified path to software development. These stem from the fact that only a single instruction-decode-and-dispatch unit is required, leaving the majority of transistors in the design free for useful computation.

SIMD is a good fit for the computational demands of signal- and visual-processing applications, namely the linear and numerical processing of deterministic data structures. Indeed, SIMD is the main feature in architectures like Intel's MMX, AMD's 3DNow!, the new Streaming SIMD Extensions in the Pentium III processors and Analog Devices' Sharc processors. MMX, for example, allows a processor to exploit up to eight-way parallelism: it can manipulate up to eight data items at once, performing eight simultaneous (identical) computation steps.
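
As a rough illustration of eight-way parallelism, the sketch below adds eight 8-bit lanes packed into one 64-bit word, with wraparound inside each lane and no carry between lanes. It emulates the idea in plain Python integers and is not Intel's implementation; the helper names pack, unpack and packed_add are hypothetical.

```python
LANES = 8          # eight data items per word
LANE_BITS = 8      # each item is one byte
LANE_MASK = 0xFF
HIGH_BITS = 0x8080808080808080   # top bit of every lane
LOW_BITS = 0x7F7F7F7F7F7F7F7F    # lower seven bits of every lane


def pack(values):
    """Pack eight small integers into one 64-bit word, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word back into its eight byte lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a, b):
    """Add all eight lanes in one pass; each lane wraps modulo 256."""
    # Add the low seven bits of every lane; the sum cannot spill into a neighbor.
    low = (a & LOW_BITS) + (b & LOW_BITS)
    # Fold the top bit of each lane back in with XOR, discarding its carry-out.
    return low ^ ((a ^ b) & HIGH_BITS)


a = pack([1, 2, 3, 4, 250, 6, 7, 255])
b = pack([10, 20, 30, 40, 10, 60, 70, 1])
print(unpack(packed_add(a, b)))   # [11, 22, 33, 44, 4, 66, 77, 0]
```

One instruction-like step produces all eight results; the lanes holding 250 and 255 wrap around independently of their neighbors.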

In addition, a number of high-performance DSP device designs are adopting a VLIW or wide-structured SIMD processor architecture to increase their computing performance. Among them are the VelociTI C6xxx series from Texas Instruments, Philips' Trimedia, Infineon's Carmel processor architecture, the ManArray architecture from BOPS, Equator's MAP architecture and Sun's MAJC architecture.

A new approach, the Modular-MPC (for "massively parallel computing"), offers a compelling alternative to VLIW and traditional SIMD architectures. Aspex is developing cores and standalone DSPs based on Modular-MPC technology and is quickly discovering it delivers the highest performance and scalability of any programmable architecture on the market.

Simply described as a "deep" SIMD structure, the Modular-MPC comprises a number of identical "processing channels," each supporting its own external I/O that can be implemented to support any standardized or customized external interface to address the requirement for scalable I/O bandwidth. If a single-channel interface can cope with the external data bandwidth required by the media-processing application, then there is no requirement for additional processing channels. However, if a single interface is not adequate, then an appropriate number of processing channels can be included in the system to help balance the data bandwidth by evenly distributing the data stream among the different channels.
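
The channel-count reasoning above amounts to a small calculation. The sketch below is only an illustration under stated assumptions: bandwidths are given as integers in the same unit, the stream is spread round-robin, and the function names channels_needed and distribute are hypothetical rather than part of any Aspex tool.

```python
def channels_needed(required_bw, channel_bw):
    """Smallest number of channels whose combined I/O covers the requirement.

    Both arguments are positive integers in the same unit (for example MB/s).
    """
    return max(1, -(-required_bw // channel_bw))   # ceiling division


def distribute(stream, n_channels):
    """Spread a data stream evenly over the channels, round-robin."""
    return [stream[i::n_channels] for i in range(n_channels)]


# One channel suffices when a single interface can carry the stream.
print(channels_needed(80, 100))        # 1
# Otherwise enough channels are added to balance the bandwidth.
print(channels_needed(250, 100))       # 3
print(distribute(list(range(7)), 3))   # [[0, 3, 6], [1, 4], [2, 5]]
```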

Each processing channel comprises two elements. The first is an ASP (associative-string processor) Module: a programmable, homogeneous and fault-tolerant SIMD parallel processor incorporating a string of identical processing units, a reconfigurable intercommunication network and a vector data buffer for fully overlapped data I/O. The second is a Storage Module, which holds the data to be processed or the results of media-related processing. The Storage Module is implemented as a collection of memory modules, which can be scaled according to the application and system requirements and independently of the size of the ASP Module.

The processing elements in the ASP architecture are called associative-processing elements (APEs). Each processing unit incorporates a data register and a bit-serial arithmetic logic unit. The size of the data register can vary from implementation to implementation. Besides storing data for arithmetic operations involving the local ALU, the data register can support associative-processing operations, providing direct support for logical and relational operations.
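
A bit-serial ALU processes operands one bit per step rather than a whole word at a time. The sketch below shows the general technique of bit-serial addition in software; it is not a description of the APE's actual circuit, and the function name bit_serial_add is hypothetical.

```python
def bit_serial_add(a, b, width):
    """Add two unsigned operands one bit per step, least significant bit first.

    Returns (sum modulo 2**width, carry_out).
    """
    carry = 0
    result = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry                     # full-adder sum bit
        carry = (x & y) | (carry & (x ^ y))   # full-adder carry bit
        result |= s << i
    return result, carry


print(bit_serial_add(5, 6, 4))       # (11, 0)
print(bit_serial_add(200, 100, 8))   # (44, 1)
```

The operand width is just a loop bound, so a bit-serial unit handles operands of any length at a cost of one step per bit.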

Under program control, the data register can be dynamically partitioned into fields that store processing operands. The partitioning can be arbitrary and need not fall on byte boundaries. To use an analogy from traditional processor architectures, the data register can be seen as a pool of registers of varying length, used to store operands that can be processed in the local associative-processing element.
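
The idea of carving one register into fields of arbitrary width can be sketched with ordinary bit masking. The field names and widths below, the 32-bit total and the helper names are all hypothetical; the article does not specify a register size or layout.

```python
def make_fields(layout):
    """Assign consecutive bit ranges to named fields.

    layout -- list of (name, width_in_bits) pairs, lowest bits first
    Returns ({name: (offset, width)}, total_bits_used).
    """
    fields, offset = {}, 0
    for name, width in layout:
        fields[name] = (offset, width)
        offset += width
    return fields, offset


def read_field(reg, fields, name):
    """Extract one field from the register value."""
    offset, width = fields[name]
    return (reg >> offset) & ((1 << width) - 1)


def write_field(reg, fields, name, value):
    """Return the register with one field replaced (value truncated to fit)."""
    offset, width = fields[name]
    mask = ((1 << width) - 1) << offset
    return (reg & ~mask) | ((value << offset) & mask)


# A 10-bit, a 3-bit and a 19-bit operand share one register; none is byte-aligned.
fields, total_bits = make_fields([("pixel", 10), ("flag", 3), ("acc", 19)])
reg = write_field(0, fields, "pixel", 1000)
reg = write_field(reg, fields, "flag", 5)
print(total_bits, read_field(reg, fields, "pixel"), read_field(reg, fields, "flag"))
# 32 1000 5
```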

The processing units are connected through an intercommunication network, a flexible system that supports data transfers and navigation of data structures. The network can be dynamically reconfigured, in a programmable and user-transparent way, thus providing a cost-effective emulation of common network topologies. The network implements a simply scalable, fault-tolerant and dynamically reconfigurable processing-element interconnection strategy.

Processing hierarchy
The topology of the network is derived from a shift register and a chordal ring. The latter enables the network to be implemented as a hierarchy of processing groups. Thus, communication times are significantly reduced through automatic bypassing of those processing-element groups that do not include destination-processing elements. In a similar way, bypassing of faulty groups of processing units gives a useful level of fault tolerance.
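
The benefit of group bypassing can be seen in a toy cost model. The sketch below assumes a linear string of processing elements split into equal-sized groups, one step per element-to-element shift and one step to skip a whole group; these costs, the single level of hierarchy and the function names are illustrative assumptions, not figures from the ASP design.

```python
def hops_flat(src, dst):
    """Steps to move data along a plain shift register."""
    return abs(dst - src)


def hops_bypass(src, dst, group_size):
    """Steps when whole groups without the destination are bypassed in one step."""
    lo, hi = sorted((src, dst))
    g_lo, g_hi = lo // group_size, hi // group_size
    if g_lo == g_hi:
        return hi - lo                        # same group: ordinary shifting
    leave = (g_lo + 1) * group_size - lo      # shift to the start of the next group
    skipped = g_hi - g_lo - 1                 # one step per bypassed group
    enter = hi - g_hi * group_size            # shift inside the destination group
    return leave + skipped + enter


print(hops_flat(3, 70))         # 67
print(hops_bypass(3, 70, 16))   # 22
```

The same mechanism suggests how a faulty group could be skipped: it is treated as a group that never contains a destination.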

For data-parallel operations, data is distributed over the processing units and stored in the local data registers. Successive computational tasks are performed on the stored data and the results are output. The ASP supports a form of set-associative processing, in which a subset of active APEs (those that associatively match broadcast scalar information) support scalar-vector (between a scalar and data registers) and vector-vector (within data registers) operations. Matching APEs either are activated directly or source inter-processing-element communications that indirectly activate other APEs. The control interface provides feedback on whether none or some of the processing units match. The instruction set for the ASP is based on four basic operations: match, add, read and write. More complicated functionality is achieved by combining those operations.
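
The four basic operations can be modeled in a few lines. The sketch below is a simplified software model, not the ASP instruction set: it assumes one integer per APE, equality matching against a broadcast scalar and directly activated APEs only, and the class and method names are hypothetical.

```python
class AspString:
    """Toy model of a string of APEs, each holding one value and an activity flag."""

    def __init__(self, data):
        self.data = list(data)
        self.active = [False] * len(self.data)

    def match(self, scalar):
        """Broadcast a scalar; APEs holding an equal value become active.

        Returns True if some APE matched, False if none did.
        """
        self.active = [v == scalar for v in self.data]
        return any(self.active)

    def add(self, scalar):
        """Scalar-vector add, applied in the active APEs only."""
        self.data = [v + scalar if a else v
                     for v, a in zip(self.data, self.active)]

    def write(self, scalar):
        """Store the broadcast scalar in every active APE."""
        self.data = [scalar if a else v
                     for v, a in zip(self.data, self.active)]

    def read(self):
        """Collect the values held by the active APEs."""
        return [v for v, a in zip(self.data, self.active) if a]


s = AspString([3, 12, 7, 20, 12])
if s.match(12):        # two APEs respond
    s.add(100)         # only those two are updated
print(s.data)          # [3, 112, 7, 20, 112]
print(s.read())        # [112, 112]
```

A conditional update over the whole data set takes one match and one add, regardless of how many elements the string holds.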

The ASP architecture is tuned to support application development in high-level programming languages. Specialized tools are provided to Aspex customers, allowing them to embed into the C source code ASP core-processing intrinsics that represent an abstract view of the ASP architecture. They do not reflect the details appearing in various implementations of the architecture, but that does not compromise the performance of the compiled code, since the Aspex development tools optimize the code according to the implementation constraints.

Compilers for the majority of high-performance digital signal processors provide a mechanism, called "asm" statements, that allows direct embedding of assembly language instructions in a C program and in-line expansion to insert low-level code. But asm statements are not an efficient approach to optimization, because they are not portable from one device generation to the next: they require the programmer to manually allocate registers and schedule instructions that are specific to a device implementation.

Programming tools
Aspex programming tools accept the ASP core-processing intrinsics just like any other C operators (for example, addition), and the compiler generates code according to the target device. As a result, coded programs can be recompiled for new generations of devices and exploit all the additional features of the device without rewriting any code.

In general, DSP software developers jump into coding at the assembly-language level at the project start, without understanding either the product requirements or the alternative design solutions. Issues with product functions, performance and interfaces with hardware are not resolved and cannot be reworked until found during testing.

Adopting an application-development process methodology, one in which the requirements, high-level design and low-level algorithm implementation phases are complete before linear assembly coding begins, will greatly alleviate these problems.

See related chart