Beefy parallel processor packs 128 cores

By Peter Clarke, EE Times
October 9, 2000 (11:08 a.m. EST)
URL: http://www.eetimes.com/story/OEG20001009S0026

LONDON — German startup Pact GmbH will attempt to leapfrog the growing field of highly parallel processors targeting communications when it rolls out a complex 30-million-transistor CPU that integrates 128 thirty-two-bit processors at this week's Microprocessor Forum.

Pact will launch the Extreme Processor Platform (XPP) at the forum this week in San Jose, Calif., where a handful of competing startups also will leverage a mix of parallelism and reconfigurability to attack the toughest processing jobs in broadband communications.

Pact (Munich, Germany) aims to use sheer muscle to break out of the pack. Each of the XPP's 128 processor cores sports its own 32-bit fixed-point multiplier, yielding a theoretical output of 12.8 billion multiply-accumulate operations per second at an expected clock frequency of 100 MHz. Pact claims the architecture will scale to produce devices capable of more than 400 giga-operations/s in 2002 and into the peta-ops range within a decade.
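The 12.8 billion figure follows directly from the core count and clock rate quoted above. As a rough check, the sketch below multiplies them out; it assumes each core completes one multiply-accumulate per clock cycle, which the quoted peak implies but the article does not spell out. The function and constant names are illustrative only.

```python
# Peak multiply-accumulate (MAC) throughput from the figures quoted in the article.
CORES = 128                   # processor cores, each with its own multiplier
CLOCK_HZ = 100_000_000        # expected clock frequency: 100 MHz
MACS_PER_CORE_PER_CYCLE = 1   # assumption: one MAC per core per clock cycle


def peak_macs_per_second(cores: int, clock_hz: int, macs_per_cycle: int = 1) -> int:
    """Theoretical peak: every core issues a MAC on every cycle."""
    return cores * clock_hz * macs_per_cycle


peak = peak_macs_per_second(CORES, CLOCK_HZ, MACS_PER_CORE_PER_CYCLE)
print(f"{peak / 1e9:.1f} billion MAC/s")  # prints "12.8 billion MAC/s"
```

This is a ceiling, not a prediction: it assumes no core ever waits for data.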

David Salisbury, business development director at Pact, said that the processor architecture could be tuned to address a variety of disciplines but that he expects to seek partners in broadband, networking, 3G and 4G wireless communications, speaker-independent voice recognition and encryption.

A huge die size, which indicates high manufacturing costs and potential issues for production yields, is one clear downside to the new architecture. The chip, which measures 2 cm on a side, will be showcased on a binder at the Microprocessor Forum, although developers are not slated to present a paper on the device there.

The current die, fabricated in a 0.21-micron CMOS process by the Korean foundry of Amkor Technologies Inc. (Chandler, Ariz.), represents only a proof of concept. Although the prototype chips might be used as engineering samples to send to prospective partners, Salisbury said he expects production devices will use a smaller version of the two 8 x 8 arrays on that device until Pact can move to finer geometries.

By stripping the problem down to the number of compute elements on the die, the Pact chip gains a muscular architecture but one that will be difficult to program, potentially narrowing its market, said Linley Gwennap, technology analyst at the Linley Group (Sunnyvale, Calif.). "It offers a tremendous amount of compute power, and there will be certain applications that demand that," he said. "But it will be pretty tricky to program this thing, and you have to keep in mind it is not a general-purpose architecture."

"They've got a lot of horsepower there but you've got to have a lot of reins to control it," said Will Strauss of Forward Concepts (Tempe, Ariz.), a market research firm. "The proof of the pudding is going to be in the software support," he added.

Phenomenally difficult

Strauss said that mapping software processes to available resources, reconfiguring the array appropriately, and making sure it is all done without error and without creating deadlocks is a phenomenally difficult problem. Even if the abundance of resources makes the problem easier to deal with, there is still the question of whether the result is computationally efficient, or whether many processors are kept waiting for a result from elsewhere in the array, he added.

"We adapt the array to the code. We make the problem easier, not harder," said Martin Vorbach, chief technology officer at Pact and the company's founder.

While an on-chip configuration manager works automatically, the XPP requires that programmers have knowledge of one of two unorthodox languages for programming.

One is Pact's own Native Mapping Language (NML). Designs are entered in flow-graph style using NML. Each node of the flow graph is automatically mapped to a processing element in the array by Pact's Xmap compiler. Edges of the flow graph are mapped to high-speed elements of the internal routing network, creating connections between processing elements.

The point-to-point connection architecture of the internal network eliminates the need for explicit programmed data moves that classical microprocessors require.

Arithmetic operations are executed in one clock cycle, and automatic data-flow synchronization ensures nodes compute only if all inputs are valid.
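The firing rule described here, in which a node computes only once all of its inputs are valid, can be illustrated with a small software model. The sketch below is not NML and does not reflect Pact's hardware or the Xmap compiler; the `Node` class, its methods and the example graph are hypothetical names chosen for illustration. Results are pushed straight to connected nodes, loosely mirroring the point-to-point connections that replace explicit data moves.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical flow-graph node that fires only when every input is valid."""
    name: str
    op: Callable[..., int]
    n_inputs: int
    inputs: List[Optional[int]] = field(default_factory=list)
    consumers: List[Tuple["Node", int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # None marks an input slot that does not yet hold valid data.
        self.inputs = [None] * self.n_inputs

    def connect(self, dest: "Node", port: int) -> None:
        """Model a graph edge: this node's result feeds dest's input port."""
        self.consumers.append((dest, port))

    def receive(self, port: int, value: int) -> None:
        self.inputs[port] = value

    def ready(self) -> bool:
        return all(v is not None for v in self.inputs)

    def fire(self) -> int:
        if not self.ready():
            raise RuntimeError(f"{self.name}: not all inputs are valid")
        result = self.op(*self.inputs)
        self.inputs = [None] * self.n_inputs   # inputs are consumed
        for dest, port in self.consumers:      # forward along the edges
            dest.receive(port, result)
        return result


# Example graph computing (a * b) + c.
mul = Node("mul", lambda a, b: a * b, 2)
add = Node("add", lambda x, c: x + c, 2)
mul.connect(add, 0)

mul.receive(0, 3)
add.receive(1, 10)
stalled = not mul.ready() and not add.ready()  # both still wait for an input
mul.receive(1, 4)
mul.fire()             # 3 * 4 = 12 is forwarded to add
result = add.fire()    # 12 + 10
print(result)          # prints 22
```

In this model the adder simply waits until the multiplier's result arrives, so no explicit synchronization code is needed in the program itself.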

An alternative approach is to use the Lela language, which allows the programming of streaming applications on a more abstract level. Equations are entered and are compiled to a set of native processing elements. Lela was developed by professor Niklaus Wirth, formerly of Zurich, author of the Pascal programming language.

Programs expressed in either NML or Lela can be compiled and loaded into the array, but XPP has no C compiler yet.

Compiler promised

Analyst Strauss said that lack would not necessarily hurt Pact's chances of commercial success. "C is absolutely not the optimal language for DSP and video programming. Unfortunately it is the de facto standard. But people in this area would love to find an alternative," he said. Nonetheless, Vorbach pledged that XPP would have a C compiler within a year.

The XPP is a mixture of a parallel-processing array with an interconnect architecture similar in style to that of an FPGA. But Vorbach said the second crucial element is the transparent run-time reconfiguration technology that dynamically controls the processing resources. Vorbach said this technology automatically makes changes to the array interconnect, assigning processes to clusters of processors based on internal or external events.

In another sense the XPP is a reconfigurable platform because the array is set up to allow data to be sent between processing elements as a particular task is invoked. While these processors are operating, other tasks may be starting elsewhere in the array. And when the first task is retired and the result written out to an external memory, the processing elements become available for use by a new task.
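The task life cycle described above (claim a group of processing elements, run, retire, free them for the next task) can be modeled in a few lines. The sketch below is a toy first-fit allocator over one 8 x 8 array; it is an illustration only, not Pact's configuration manager, and the class name, the rectangular-cluster policy and the task names are all assumptions.

```python
from typing import List, Optional, Tuple


class ArrayAllocator:
    """Toy model: hand out rectangular clusters of processing elements."""

    def __init__(self, rows: int = 8, cols: int = 8) -> None:
        self.rows, self.cols = rows, cols
        # owner[r][c] is the task holding that element, or None if free.
        self.owner: List[List[Optional[str]]] = [
            [None] * cols for _ in range(rows)
        ]

    def _fits(self, r0: int, c0: int, h: int, w: int) -> bool:
        return all(self.owner[r][c] is None
                   for r in range(r0, r0 + h)
                   for c in range(c0, c0 + w))

    def allocate(self, task: str, h: int, w: int) -> Optional[Tuple[int, int]]:
        """First-fit search; returns the cluster's top-left corner or None."""
        for r0 in range(self.rows - h + 1):
            for c0 in range(self.cols - w + 1):
                if self._fits(r0, c0, h, w):
                    for r in range(r0, r0 + h):
                        for c in range(c0, c0 + w):
                            self.owner[r][c] = task
                    return (r0, c0)
        return None

    def retire(self, task: str) -> int:
        """Free every element held by the task; returns how many were freed."""
        freed = 0
        for r in range(self.rows):
            for c in range(self.cols):
                if self.owner[r][c] == task:
                    self.owner[r][c] = None
                    freed += 1
        return freed

    def free_count(self) -> int:
        return sum(cell is None for row in self.owner for cell in row)


array = ArrayAllocator()
first = array.allocate("task_a", 4, 8)    # takes the top half of the array
second = array.allocate("task_b", 4, 8)   # starts in the bottom half meanwhile
blocked = array.allocate("task_c", 2, 2)  # no free elements: returns None
array.retire("task_a")                    # task_a done, its elements are freed
retry = array.allocate("task_c", 2, 2)    # now succeeds in the freed region
print(first, second, blocked, retry, array.free_count())
```

Keeping each task in one contiguous block is a simplification chosen here for brevity; the article says only that clusters are usually contiguous.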

"It's really a new class of multiprocessor architecture. It's not a data-flow machine although it's good at data flow. But we're also good at sequential operations," Vorbach said. "We can do multiple sequential operations in the array."

Salisbury said the XPP can, in theory, be set up to transfer data between any two processors in an array, using a system of virtual communications channels based on the passage of packets between processors. Processing usually goes on between a contiguous cluster of devices, Salisbury said, because it is more efficient and minimizes the time that nodes have to wait for a result from another processor.

Although the 32-bit processor at each processing node is proprietary, it is a full-featured device with an instruction set running to about 70 instructions. These include 32-bit multiply-accumulate with 64-bit result, 32-bit add, 32-bit subtract, shift, 32-bit count, Boolean functions, compare, range, sort, output-shift, and add or subtract in the vertical backward channel. Vorbach noted, though, that this arithmetic-logic unit is not a fixed feature of the architecture.
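To make the "32-bit multiply-accumulate with 64-bit result" concrete, the sketch below models the arithmetic in software. The article does not give the instruction's exact semantics, so two things are assumed here: operands are signed two's-complement values, and the 64-bit accumulator wraps around on overflow rather than saturating. The function names are illustrative.

```python
from typing import Iterable


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac32(acc: int, a: int, b: int) -> int:
    """acc + a * b with 32-bit operands and a 64-bit wrapping accumulator."""
    product = to_signed(a, 32) * to_signed(b, 32)  # always fits in 64 bits
    return to_signed(acc + product, 64)            # assumption: wrap, not saturate


def dot(xs: Iterable[int], ys: Iterable[int]) -> int:
    """Dot product built from repeated MACs, as in a filter kernel."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac32(acc, x, y)
    return acc


print(dot([1, 2, 3], [4, 5, 6]))  # prints 32
```

The wide accumulator is what lets a long run of 32-bit products be summed without losing the upper bits of each product.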

"We could replace the ALUs with floating-point units, for example," Vorbach said. Similarly, nodes in the array could be used to accommodate local SRAM.

Salisbury said simulation had predicted that at 100 MHz the XPP-128 would consume a maximum of 33 watts. Tests on real-world devices showed that number was an overestimate and 18 W was a more accurate figure.
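Combining these power figures with the peak throughput quoted earlier gives a rough energy cost per operation. The sketch below assumes, optimistically, that the chip sustains its theoretical 12.8 billion multiply-accumulates per second while drawing the stated power; the article does not say under what workload the 18 W was measured, so treat the results as lower bounds on energy per operation.

```python
PEAK_MACS_PER_SECOND = 12.8e9   # theoretical peak quoted for 128 cores at 100 MHz
SIMULATED_WATTS = 33.0          # maximum predicted by simulation
MEASURED_WATTS = 18.0           # figure reported from tests on real devices


def nanojoules_per_mac(watts: float, macs_per_second: float) -> float:
    """Energy per operation in nJ, assuming the peak rate is sustained."""
    return watts / macs_per_second * 1e9


simulated_nj = nanojoules_per_mac(SIMULATED_WATTS, PEAK_MACS_PER_SECOND)
measured_nj = nanojoules_per_mac(MEASURED_WATTS, PEAK_MACS_PER_SECOND)
print(f"simulated: {simulated_nj:.2f} nJ/MAC, measured: {measured_nj:.2f} nJ/MAC")
```

On those assumptions, the measured figure works out to roughly 1.4 nanojoules per multiply-accumulate, against about 2.6 for the simulated maximum.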

Vorbach left the University of Karlsruhe in 1996 to form Pact, linking up with Marcel Kreutler, an entrepreneur who became chief executive of the fledgling company. Since then Pact has built up a staff of about 30 and has now produced one of the most complex processor chips ever built.

— Additional reporting by Rick Merritt.