Reconfiguring chip design
For the last couple of years, QuickSilver Technology has been tempting us with promises of delectation and delight with regard to their soon-to-be-released Adaptive Computing Machine (ACM), which they herald as a revolutionary new form of digital IC. According to QuickSilver, the ACM's architecture is specifically designed to allow it to be adapted "on the fly" tens or hundreds of thousands of times a second, while consuming very little power. That is, any part of the device - from a few small portions all the way up to the full chip - can be adapted blazingly fast, in many cases within a single clock cycle. This architecture allows dynamic software algorithms to be directly converted into dynamic hardware resources, which supposedly results in the most efficient use of hardware in terms of cost, size, performance, and power consumption. Due to the way the ACM performs its magic, it can dramatically reduce - or eliminate - the need for custom ASIC material in a typical high-performance system-on-chip (SoC).

Since the folks at QuickSilver commenced their quest, my interest has been two-fold. First of all, the technology they described in hushed tones seemed to be incredibly cool, even though - until now - they've provided only the vaguest of hints and the occasional crumb of information as to how they intended to achieve their objectives. Secondly, I've been pondering how one would go about creating EDA tools geared to designing devices whose internal functionality can be "transmogrified" on a clock-by-clock basis.

Algorithmic evaluation

And so we come to the ACM, which features an incredibly cunning "coarse-grained" architecture. Before we consider this architecture, however, it's worth spending a few moments discussing the algorithmic evaluation QuickSilver performed in order to determine the underlying requirements.
When QuickSilver first started to consider Adaptive Computing (AC) in the late 1990s, they realized that the conventional Reconfigurable Computing (RC) doctrine was considering the problem either at too macro a level (at the level of entire applications or algorithms, resulting in inefficient use of resources) or at too micro a level (at the level of individual gates or FPGA blocks, resulting in hideously difficult application programming). In reality, it is only necessary to consider the problem at the level of algorithmic elements. Based on this viewpoint, they asked themselves: "just how many core algorithmic elements are there?"

They started by considering the elements used in word-oriented algorithms. In this case, they commenced with the compute-intensive Time Division Multiple Access (TDMA) algorithm used in digital wireless transmission. Variants such as Sirius, XM Radio, EDGE, and so forth form a subset of this algorithmic class, so an AC architecture that could handle high-end TDMA should also be able to handle its less sophisticated cousins.

Figure 1 - Algorithm space

Having evaluated word-oriented algorithms, they took a 180-degree turn to consider their bit-oriented counterparts, such as Wideband Code Division Multiple Access (W-CDMA) - used for wideband digital radio communications of Internet, multimedia, video, and other capacity-demanding applications - and its sub-variants such as CDMA2000, IS-95A, etc. They also looked at algorithms comprising various mixes of word-oriented and bit-oriented components, such as MPEG, voice and music compression, and so forth.

The results of these evaluations revealed that algorithms are heterogeneous in nature, which means that if you take a bunch of diverse algorithms, their constituent elements are wildly different.
In turn, this indicates that the homogeneous architecture associated with traditional FPGAs (the same lookup table replicated tens of thousands of times) is not appropriate for many algorithmic tasks. The solution is to use a heterogeneous architecture that fully addresses the heterogeneous nature of the algorithms it is required to implement.

A cool beans architecture
Figure 2 - A node-based architecture

They started with five types of nodes: arithmetic, bit-manipulation, finite state machine, scalar, and configurable input/output used to connect to the outside world (the latter is not shown in the above figure). Each node consists of a number of computational units - which can themselves be adapted on the fly, but that's another story - and its own local memory (approximately 75% of a node is in the form of memory). Additionally, each node includes some configuration memory. Unlike FPGAs with their serial configuration bit-stream, an ACM has a dedicated 128-bit or 256-bit bus to carry the data used to adapt the device.

It's important to understand that each node performs tasks at the level of complete algorithmic elements. For example, a single arithmetic node can be used to implement different (variable-width) linear arithmetic functions such as a FIR filter, a Discrete Cosine Transform (DCT), a Fast Fourier Transform (FFT), etc. Such a node can also be used to implement (variable-width) non-linear arithmetic functions such as ((1/sine A) x (1/x)) to the 13th power. Similarly, a bit-manipulation node can be used to implement different (variable-width) bit-manipulation functions, such as a Linear Feedback Shift Register (LFSR), Walsh code generator, GOLD code generator, TCP/IP packet discriminator, etc. A finite state machine node can, not surprisingly, be used to implement any class of FSM. Last but not least, a scalar node can be used to execute legacy code, while a configurable input/output node can be used to implement I/O in the form of a UART or bus interfaces such as PCI, USB, or FireWire.

Once again, a key point is that any part of the device - from a few small portions all the way up to the full chip - can be adapted blazingly fast, in many cases within a single clock cycle. This allows for a radical change in the way in which algorithms are implemented.
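To make "algorithmic element" a little more concrete, here's a minimal sketch - in Python, purely illustrative, since QuickSilver hasn't published its node programming model at this level of detail - of the kind of function a single bit-manipulation node might be configured to implement: a 16-bit maximal-length Fibonacci LFSR, the workhorse of pseudo-random sequence generation in spread-spectrum systems.

```python
def lfsr16(state, taps=(16, 14, 13, 11)):
    """Advance a 16-bit Fibonacci LFSR by one clock.

    The taps (1-indexed bit positions) correspond to the primitive
    polynomial x^16 + x^14 + x^13 + x^11 + 1, giving the maximal
    period of 2^16 - 1 = 65535 states (the all-zero state excluded).
    """
    feedback = 0
    for t in taps:
        feedback ^= (state >> (t - 1)) & 1
    # Shift left, feeding the XOR of the tapped bits into bit 0.
    return ((state << 1) | feedback) & 0xFFFF

# Generate a few pseudo-random states from a non-zero seed.
state = 0xACE1
stream = []
for _ in range(5):
    state = lfsr16(state)
    stream.append(state)
print([hex(s) for s in stream])
```

In an ACM, the point is that the same node silicon that computes this one cycle could be adapted to compute a Walsh or GOLD code generator the next.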
As opposed to passing data from function to function, the data can remain resident in a node while the function of the node changes on a clock-by-clock basis. It also means that, unlike an ASIC implementation in which algorithms are effectively "frozen in silicon," the ACM's ability to be adapted tens or hundreds of thousands of times a second means that only those portions of an algorithm that are actually being executed need to be resident in the device at any one time. This provides for tremendous reductions in silicon area and power consumption.

Another key point is that the ACM's architecture is incredibly scalable. A level 1 node cluster is formed from an arithmetic, bit-manipulation, FSM, and scalar node connected via a Matrix Interconnection Network (MIN). A level 2 node cluster is formed from four level 1 clusters linked by their own MIN, while a level 3 cluster is formed from four level 2 clusters linked by their own MIN, and so forth. Different ACMs can contain from one to thousands of node clusters, as required.

The proof of the pudding

In the case of a CDMA2000 searcher with 2x sampling using 512-chip complex correlations, with captured data processed at 8x chip rate (equivalent to 16 parallel correlators running in real time), the best-in-class ASIC took 3.4 seconds, while the ACM took only 1.0 second (3.4x faster). In the case of a CDMA2000 pilot search with the same parameters as above, the ASIC took 184 ms, while the ACM took only 55 ms (3.3x faster). And in the case of a W-CDMA searcher with 1x sampling using 256-chip correlations with streaming data, an ASIC took 533 µs while the ACM took only 232 µs (2.3x faster). Furthermore, in addition to out-performing the three ASIC implementations, the ACM counterpart (one ACM was used to perform all three tasks) used significantly less silicon real estate and consumed only a fraction of the power!

Other things to ponder
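One thing worth pondering is what this execution model looks like from the programmer's side. Here's a toy software analogy in Python - the Node class and its adapt/step methods are my own invention, not QuickSilver's API - showing data staying resident in a node's local memory while the function the node computes is swapped stage by stage, instead of streaming the data through three separate fixed-function blocks.

```python
class Node:
    """Toy model of an adaptive node: resident data, swappable function."""

    def __init__(self, data):
        self.memory = list(data)  # data stays resident in the node
        self.function = None

    def adapt(self, fn):
        """Reconfigure the node; in an ACM this can take one clock cycle."""
        self.function = fn

    def step(self):
        """Apply the currently configured function to the resident data."""
        self.memory = self.function(self.memory)

# Three stages of a toy pipeline, applied in place.
scale  = lambda xs: [2 * x for x in xs]
offset = lambda xs: [x + 1 for x in xs]
clip   = lambda xs: [min(x, 10) for x in xs]

node = Node([1, 4, 7])
for stage in (scale, offset, clip):
    node.adapt(stage)  # the "hardware" changes...
    node.step()        # ...the data never leaves the node
print(node.memory)     # [3, 9, 10]
```

An ASIC would dedicate silicon to all three stages at once; here, only the stage actually executing "exists" at any moment, which is the intuition behind the claimed area and power savings.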
All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved.