Reconfiguring chip design
For the last couple of years, QuickSilver Technology has been tempting us with promises of delectation and delight with regard to their soon-to-be-released Adaptive Computing Machine (ACM), which they herald as a revolutionary new form of digital IC. According to QuickSilver, the ACM's architecture is specifically designed to allow it to be adapted "on the fly" tens or hundreds of thousands of times a second, while consuming very little power. That is, any part of the device - from a few small portions all the way up to the full chip - can be adapted blazingly fast, in many cases within a single clock cycle. This architecture allows dynamic software algorithms to be directly converted into dynamic hardware resources, which supposedly results in the most efficient use of hardware in terms of cost, size, performance, and power consumption. Due to the way the ACM performs its magic, it can dramatically reduce - or eliminate - the need for custom ASIC material in a typical high-performance system-on-chip (SoC).

Since the folks at QuickSilver commenced their quest, my interest has been two-fold. First of all, the technology they described in hushed tones seemed to be incredibly cool, even though - until now - they've provided only the vaguest of hints and the occasional crumb of information as to how they intended to achieve their objectives. Secondly, I've been pondering how one would go about creating EDA tools geared to designing devices whose internal functionality can be "transmogrified" on a clock-by-clock basis.

Algorithmic evaluation

And so we come to the ACM, which features an incredibly cunning "coarse-grained" architecture. Before we consider this architecture, however, it's worth spending a few moments discussing the algorithmic evaluation QuickSilver performed in order to determine the underlying requirements.
When QuickSilver first started to consider Adaptive Computing (AC) in the late 1990s, they realized that the conventional Reconfigurable Computing (RC) doctrine was considering the problem either at too macro a level (at the level of entire applications or algorithms, resulting in inefficient use of resources) or at too micro a level (at the level of individual gates or FPGA blocks, resulting in hideously difficult application programming). In reality, it is only necessary to consider the problem at the level of algorithmic elements. Based on this viewpoint, they asked themselves: "just how many core algorithmic elements are there?"

They started by considering the elements used in word-oriented algorithms. In this case, they commenced with the compute-intensive Time Division Multiple Access (TDMA) algorithm used in digital wireless transmission. Variants such as Sirius, XM Radio, EDGE, and so forth form a subset of this algorithmic class, so an AC architecture that could handle high-end TDMA should also be able to handle its less sophisticated cousins.

Figure 1 - Algorithm space

Having evaluated word-oriented algorithms, they took a 180-degree turn to consider their bit-oriented counterparts, such as Wideband Code Division Multiple Access (W-CDMA) - used for wideband digital radio communications of Internet, multimedia, video, and other capacity-demanding applications - and its sub-variants such as CDMA2000, IS-95A, etc. They also looked at algorithms comprising various mixes of word-oriented and bit-oriented components, such as MPEG, voice and music compression, and so forth.

The results of these evaluations revealed that algorithms are heterogeneous in nature, which means that if you take a bunch of diverse algorithms, their constituent elements are wildly different.
In turn, this indicates that the homogeneous architecture associated with traditional FPGAs (the same lookup table replicated tens of thousands of times) is not appropriate for many algorithmic tasks. The solution is to use a heterogeneous architecture that fully addresses the heterogeneous nature of the algorithms it is required to implement.

A cool beans architecture
Figure 2 - A node-based architecture

They started with five types of nodes: arithmetic, bit-manipulation, finite state machine, scalar, and configurable input/output used to connect to the outside world (the latter is not shown in the above figure). Each node consists of a number of computational units - which can themselves be adapted on the fly, but that's another story - and its own local memory (approximately 75% of a node is in the form of memory). Additionally, each node includes some configuration memory. Unlike FPGAs with their serial configuration bit-stream, an ACM has a dedicated 128-bit or 256-bit bus to carry the data used to adapt the device.

It's important to understand that each node performs tasks at the level of complete algorithmic elements. For example, a single arithmetic node can be used to implement different (variable-width) linear arithmetic functions such as a FIR filter, a Discrete Cosine Transform (DCT), a Fast Fourier Transform (FFT), etc. Such a node can also be used to implement (variable-width) non-linear arithmetic functions such as ((1/sine A) x (1/x)) to the 13th power. Similarly, a bit-manipulation node can be used to implement different (variable-width) bit-manipulation functions, such as a Linear Feedback Shift Register (LFSR), Walsh code generator, GOLD code generator, TCP/IP packet discriminator, etc. A finite state machine node can, not surprisingly, be used to implement any class of FSM. Last but not least, a scalar node can be used to execute legacy code, while a configurable input/output node can be used to implement I/O in the form of a UART or bus interfaces such as PCI, USB, or FireWire.

Once again, a key point is that any part of the device - from a few small portions all the way up to the full chip - can be adapted blazingly fast, in many cases within a single clock cycle. This allows for a radical change in the way in which algorithms are implemented.
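To make "algorithmic element" a little more concrete, here's a minimal sketch - in Python, purely illustrative, since QuickSilver hasn't published its node programming model at this level of detail - of the kind of function a single bit-manipulation node might be configured to implement: a 16-bit maximal-length Fibonacci LFSR, the workhorse of pseudo-random sequence generation in spread-spectrum systems.

```python
def lfsr16(state, taps=(16, 14, 13, 11)):
    """Advance a 16-bit Fibonacci LFSR by one clock.

    The taps (1-indexed bit positions) correspond to the primitive
    polynomial x^16 + x^14 + x^13 + x^11 + 1, giving the maximal
    period of 2^16 - 1 = 65535 states (the all-zero state excluded).
    """
    feedback = 0
    for t in taps:
        feedback ^= (state >> (t - 1)) & 1
    # Shift left, feeding the XOR of the tapped bits into bit 0.
    return ((state << 1) | feedback) & 0xFFFF

# Generate a few pseudo-random states from a non-zero seed.
state = 0xACE1
stream = []
for _ in range(5):
    state = lfsr16(state)
    stream.append(state)
print([hex(s) for s in stream])
```

In an ACM, the point is that the same node silicon that computes this one cycle could be adapted to compute a Walsh or GOLD code generator the next.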
As opposed to passing data from function to function, the data can remain resident in a node while the function of the node changes on a clock-by-clock basis. It also means that, unlike an ASIC implementation in which algorithms are effectively "frozen in silicon," the ACM's ability to be adapted tens or hundreds of thousands of times a second means that only those portions of an algorithm that are actually being executed need to be resident in the device at any one time. This provides for tremendous reductions in silicon area and power consumption.

Another key point is that the ACM's architecture is incredibly scalable. A level 1 node cluster is formed from an arithmetic, bit-manipulation, FSM, and scalar node connected via a Matrix Interconnection Network (MIN). A level 2 node cluster is formed from four level 1 clusters linked by their own MIN, while a level 3 cluster is formed from four level 2 clusters linked by their own MIN, and so forth. Different ACMs can contain from one to thousands of node clusters, as required.

The proof of the pudding

In the case of a CDMA2000 searcher with 2x sampling using 512-chip complex correlations, with captured data processed at 8x chip rate (equivalent to 16 parallel correlators running in real time), the best-in-class ASIC took 3.4 seconds, while the ACM took only 1.0 second (3.4x faster). In the case of a CDMA2000 pilot search with the same parameters as above, the ASIC took 184 ms, while the ACM took only 55 ms (3.3x faster). And in the case of a W-CDMA searcher with 1x sampling using 256-chip correlations with streaming data, an ASIC took 533 µs while the ACM took only 232 µs (2.3x faster). Furthermore, in addition to out-performing the three ASIC implementations, the ACM counterpart (one ACM was used to perform all three tasks) used significantly less silicon real estate and consumed only a fraction of the power!

Other things to ponder
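One thing worth pondering is what this execution model looks like from the programmer's side. Here's a toy software analogy in Python - the Node class and its adapt/step methods are my own invention, not QuickSilver's API - showing data staying resident in a node's local memory while the function the node computes is swapped stage by stage, instead of streaming the data through three separate fixed-function blocks.

```python
class Node:
    """Toy model of an adaptive node: resident data, swappable function."""

    def __init__(self, data):
        self.memory = list(data)  # data stays resident in the node
        self.function = None

    def adapt(self, fn):
        """Reconfigure the node; in an ACM this can take one clock cycle."""
        self.function = fn

    def step(self):
        """Apply the currently configured function to the resident data."""
        self.memory = self.function(self.memory)

# Three stages of a toy pipeline, applied in place.
scale  = lambda xs: [2 * x for x in xs]
offset = lambda xs: [x + 1 for x in xs]
clip   = lambda xs: [min(x, 10) for x in xs]

node = Node([1, 4, 7])
for stage in (scale, offset, clip):
    node.adapt(stage)  # the "hardware" changes...
    node.step()        # ...the data never leaves the node
print(node.memory)     # [3, 9, 10]
```

An ASIC would dedicate silicon to all three stages at once; here, only the stage actually executing "exists" at any moment, which is the intuition behind the claimed area and power savings.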
All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved.