New degrees of parallelism in SoCs
Rising development costs motivate companies to design fewer systems-on-chip, but to make each one they do design more flexible and programmable. Doing so makes it possible to reuse designs, take advantage of economies of scale and shorten time-to-market. Moreover, programmability allows companies to keep products in the market longer, boosting profits over a product's lifetime.

For programmability to be an option, however, embedded-processor cores must deliver the type and level of performance needed to implement functions that today require hardwired logic blocks or specialized (difficult-to-program) processors. Delivering this level of performance in a cost-effective, power-efficient, easy-to-program processor will require different architectures and techniques than those commonly used today.

High-performance embedded processors have traditionally relied mainly on clock frequency and superscalar instruction issue to boost performance. Caches have played an important part in enabling the potential of these techniques in the face of increasing memory latencies. While frequency and superscalarity have served the industry well and will continue to be used, they have limitations that will diminish the gains they deliver in the future.

The gains in operating frequency, which have historically come at a rate of about 35 percent per year, are attributable to two factors, each contributing roughly half: semiconductor feature scaling and deeper pipelining. But each of these factors is approaching the point of diminishing returns. Superscalar techniques are similarly nearing their limits, in part because of the exponential increase in the complexity of dispatch logic with increasing issue width. Even if these problems can be solved, gains from superscalarity are ultimately limited by the instruction-level parallelism inherent in the code being executed.
Very-long-instruction-word machines were devised to reduce the complexity of issuing multiple instructions in parallel. But for a variety of reasons, in practice VLIW implementations have not been convincingly less complex than superscalars of similar performance.

That doesn't mean we're at a dead end. Fortuitously, 90-nanometer semiconductor technology will arrive in time to enable some new techniques to pick up where pipelining and superscalar techniques leave off. Three such techniques that will come to the fore are vector processing, multithreading and chip multiprocessing (CMP). These techniques have two characteristics in common: they exploit different levels of parallelism than pipelining and superscalar issue, and they are transistor-intensive. Unlike pipelining and superscalar techniques, which are extraordinarily complex, vectors, threading and CMP are simpler, relying more on arrayed data-path elements than on complex control structures.

Pipelining and superscalar techniques both exploit fine-grain instruction-level parallelism: pipelining by temporal means and superscalar issue by spatial means. Vector processing, in contrast, exploits fine-grain data-level parallelism; multithreading exploits medium-grain thread-level parallelism; and chip multiprocessing exploits coarse-grain process-level parallelism. Each of these techniques can be deployed in its own way to exploit these additional opportunities for parallelism, and each can be used in conjunction with traditional pipelining and superscalar techniques. Multithreading, for example, will exploit thread-level parallelism to boost throughput and execution-unit efficiency despite increasing memory latency. Vector processing, in the modern form of SIMD execution units, will be added to conventional embedded RISC cores to exploit the data-level parallelism abundant in multimedia streams.
Chip multiprocessing will exploit process-level parallelism to raise task capacity and improve real-time response, while at the same time offering a superior architecture for more scalable, more programmable SoC devices. These new techniques will work together to apply vastly increased transistor budgets in the pursuit of greater computing power. And they will use the increased transistor count far more effectively than continued reliance on the traditional approaches of superscalarity and pipelining would.

Keith Diefendorff is vice president of product strategy and Yannick Duquesne is a field applications engineer at MIPS Technologies Inc. (Mountain View, Calif.).
All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved. |