Monterey, Calif. - Moderate-scale systems-on-chip (SoCs) have grown so similar that most can be described with a single block diagram. A central processor emits a processor-specific high-speed bus that connects it to a fast DRAM interface and perhaps a few other processing sites. A bridge originates a lower-speed peripheral bus, again usually processor-specific, that connects to a series of lower-speed I/O controllers and functional blocks. The whole thing could have been lifted off the data sheet for any reasonably mainstream single-board computer of 15 years ago.

But increasingly, designers are asking whether the method that was used to interconnect chips on a circuit board in the days of the Z80 should be used to interconnect intellectual-property (IP) blocks on a 90-nanometer SoC. At least one company, startup Arteris SA (Paris), says the answer is no.

Arteris cites two primary issues with using shared-bus architectures, or, more commonly, their on-chip surrogate, the multiplexer tree, for SoC interconnect. The first is a chip-design issue. The more complex the SoC gets, the more difficult it becomes to close timing on the globally synchronous high-speed portion of the system bus. In practice, this problem at the very least limits layout options by forcing all the blocks on the high-speed bus to be clustered in the floor plan, lest the router be presented with a problem on which it will wax creative. At worst, it can force the design into a costly series of place-route-extraction iterations to close timing after it is too late to change the floor plan. Arteris believes this problem will only be accentuated by the growing disparity between unloaded transistor switching delay, which is nearing zero in 90-nm and 65-nm processes, and the actual propagation delay of signals through the fully loaded interconnect, especially after taking interconnect variability into account.

The second issue is more system-level than chip-level: bus latency. "The latency figures for on-chip buses always look great on paper, the way the chip designer sees them," said Philippe Martin, product-marketing director for Arteris. "But when the application developers have the drivers in place and you see the bus fully loaded with all its contention issues showing, then the latencies are horribly longer than anyone suggested and, even worse, are very unpredictable."

Keeping an eye on the growing body of academic research on globally asynchronous, locally synchronous interconnect architectures, affectionately known as GALS, Arteris has developed an alternative approach. Without going to the extreme of imposing asynchronous design techniques on users, the company has devised a standardized, message-passing interconnect system that can drop into an SoC almost as a replacement for a licensable bus interconnect structure, but with major advantages in timing, power and end-system performance. First described publicly a year ago this month, the scheme is now available for licensing (see March 1, 2004, page 4).

Architecturally, the Arteris scheme looks more like a point-to-point network than a bus. Packet transport controllers, each with its own local memory, are embedded in each of the major blocks of the SoC. Point-to-point connections are created between these controllers to interconnect the blocks.
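To make the contrast with a shared bus concrete, the C++ fragment below sketches how a network-interface controller might packetize a bus-style read request into flits sized to the link rather than to the data bus. The 16-bit link width, header layout and field names are assumptions made purely for illustration; they do not describe Arteris's actual packet format.

```
// Hypothetical sketch: packetizing a bus-style read request into narrow,
// link-width flits at a network-interface controller. Field layout and
// widths are illustrative assumptions, not the Arteris Danube format.
#include <cstdint>
#include <cstdio>
#include <vector>

// A conventional bus-style request, as a master block would issue it.
struct BusRequest {
    uint32_t address;
    uint16_t burst_len;   // number of data words requested
    uint8_t  master_id;   // which block issued the request
};

// Serialize the request into flits sized to the link (16 bits assumed here),
// independent of the width of the data being transported.
std::vector<uint16_t> packetize(const BusRequest& req, uint8_t tag) {
    std::vector<uint16_t> flits;
    flits.push_back(static_cast<uint16_t>((req.master_id << 8) | tag)); // header
    flits.push_back(static_cast<uint16_t>(req.address >> 16));          // address, high half
    flits.push_back(static_cast<uint16_t>(req.address & 0xFFFF));       // address, low half
    flits.push_back(req.burst_len);                                     // payload size
    return flits;
}

int main() {
    BusRequest req{0x80040000u, 8, 3};
    for (uint16_t f : packetize(req, 0x5))
        std::printf("flit 0x%04x\n", static_cast<unsigned>(f));
}
```

The point of the sketch is simply that once a transaction is a packet, the link carrying it can be as narrow or as wide as the required bandwidth dictates.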
Critically, since this is a packet protocol, the width and speed of these connections are determined by the required bandwidth and the signal-integrity environment between the blocks, not by the width of the data to be transported. The controllers issue transaction requests strictly in order, but can accept out-of-order fulfillment and any number of outstanding requests. The controllers can also store and consolidate transactions, applying bursting or load-store transactions as appropriate to maximize available bandwidth.

Because the point-to-point packet network is a simple, dedicated, unidirectional source-synchronous design, it can operate in excess of 750 MHz in 90-nm processes without imposing Herculean challenges on the timing-closure process. The arbitration scheme is straightforward and is tuned to the SoC configuration, so the average arbitration latency is two cycles at the 250-MHz arbitration frequency, instead of the tens of cycles typical on buses. And Arteris claims that the total gate count for the interconnect system, including the packet transport wrappers, is about half that of an on-chip bus system with its arbitration logic, with similar dynamic-power consumption.

The IP is part of the Danube Network-on-Chip library, which includes the switches and links that form the physical interconnect; the network-interface units that wrap the functional blocks of the SoC; Denali-derived memory and Databahn I/O controllers; an embedded network analyzer for system debug; and a service bus for run-time configuration and error recovery. The library is accessed through a system-exploration tool and a compiler that generates SystemC and RTL models, floor plans and synthesis scripts, physical-design data, and data plus scripts for FPGA prototyping.

The company has in hand test silicon implementing a system of 81 masters and 47 slaves in a core with a total area of 100 mm². This is the total chip area, not the area occupied by the interconnect. The chip was designed with the abovementioned tools in a standard digital design flow using Artisan 90 GT libraries and shuttled through Taiwan Semiconductor Manufacturing Co.'s standard 90-nm process. The tool set and library license are available now, Arteris said.
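For readers who want the transaction-ordering rule described above in concrete terms, the C++ sketch below models a controller that issues tagged requests strictly in order while retiring completions in whatever order they return. The class, method names and tag scheme are assumptions made for illustration; they do not describe Arteris's actual controllers.

```
// Minimal sketch of in-order issue with out-of-order fulfillment:
// requests leave strictly in order, each carrying a tag, and completions
// are matched back to their request by tag whenever they arrive.
// Names and structure are illustrative assumptions only.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

class RequestTracker {
    uint8_t next_tag_ = 0;
    std::unordered_map<uint8_t, uint32_t> outstanding_;  // tag -> request address
public:
    // Issue requests in program order; any number may remain outstanding.
    uint8_t issue(uint32_t address) {
        uint8_t tag = next_tag_++;
        outstanding_[tag] = address;
        std::printf("issued  tag %u for addr 0x%08x\n",
                    static_cast<unsigned>(tag), static_cast<unsigned>(address));
        return tag;
    }
    // Completions may arrive out of order; match them by tag.
    void complete(uint8_t tag) {
        auto it = outstanding_.find(tag);
        if (it == outstanding_.end()) return;  // unknown tag: ignored in this sketch
        std::printf("retired tag %u (addr 0x%08x)\n",
                    static_cast<unsigned>(tag), static_cast<unsigned>(it->second));
        outstanding_.erase(it);
    }
};

int main() {
    RequestTracker t;
    uint8_t a = t.issue(0x1000);
    uint8_t b = t.issue(0x2000);
    uint8_t c = t.issue(0x3000);
    t.complete(c);   // responses return out of order...
    t.complete(a);
    t.complete(b);   // ...but every request is accounted for by its tag
}
```

Tagging is what lets the interconnect keep many requests in flight at once instead of stalling the issuing block until each response returns, which is one reason the latency behavior differs from that of a shared, arbitrated bus.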