Synthesizable Switching Logic For Network-On-Chip Designs on 90nm Technologies

By Tapani Ahonen, Tampere University of Technology

Abstract: A study on synthesizable switch structures for network-on-chip designs is presented. Sum-of-products one-hot multiplexing is identified as an appealing general-purpose choice with low latency and acceptable resource utilization on both ASIC and FPGA. On average, one-hot multiplexing was observed to save 18% in area and provide 26% faster operation in comparison to conventional multiplexing on a 90nm ASIC process. Conventional multiplexers provide the lowest FPGA resource utilization, but present higher latencies than one-hot multiplexers. For small ASIC area and potentially low power consumption, tri-state buffers could be considered. The cost of link capacity sharing techniques is discussed using a short case study. From a purely architectural point of view it seems more attractive to provide physically independent network resources.

1. INTRODUCTION

The trend towards more scalable, plug-and-play communication architectures has become apparent as SoCs have grown in complexity. However, there is currently no widely accepted on-chip communication mechanism that would offer unlimited scalability. Existing multiprocessor SoCs use hierarchical buses such as AMBA [1] and CoreConnect [2]. Sonics Micronetwork [3] and STBus [4] present a step towards more flexible bus architectures. Recent research in the field of design productivity has suggested that communication networks should be utilized on-chip. These so-called networks-on-chip (NoCs) [5] [6] are envisioned to serve as the ultimate backplanes for ever-growing levels of integration. They have been designed to enable the decoupling of computation from communication, i.e., to serve as a platform between computation modules. Besides tackling the scalability issues, NoCs also offer better energy efficiency than segmented buses [7].
As NoCs are in their infancy, the architectures deployed in the near future are very likely to offer a low cost comparable to that of a hierarchical bus: the first two NoC startups, Arteris [8] and Silistix [9], both promote low-cost solutions.

Switching logic forms an important part of any NoC architecture. In processor design terms, the switch serves as the functional unit (FU) and is located on the data path of a router. Continuing with this analogy, the arbitration logic controls the data path while the routing logic is responsible for instruction decoding. Implementation of the switch has a significant impact on the overall cost as well as the performance of a NoC router. This is especially true with simple control logic, e.g. with deterministic routing, which is analogous to RISC-type processors. Complex control logic, such as that required by adaptive routing, may make the contribution of the switching logic to the performance metrics seem insignificant.

A cost-oriented study on synthesizable switching logic for NoC designs is presented in this paper. The main motivation for this study was to come up with a switch structure for NoC implementation that is portable between different technology platforms. The goal was to ensure straightforward reusability of the design without significant area or performance penalties. The results of the study were also used as background material to reflect on the cost of added parallelism through link capacity sharing.

2. TECHNOLOGY PLATFORMS

Standard cell ASIC designs are made using a library of functional logic instances supplied by the technology vendor. These ASIC cells can be placed on cell rows and routed using the Manhattan routing style with 90-degree corners. Currently available FPGAs are based on look-up tables (LUTs), synchronous registers (flip-flops), multiplexers, and programmable routing.
The programmability of the routing is most often realized using static random access memory (SRAM) cells to indicate whether a given connection is used or not. In addition to these freely programmable logic structures, modern FPGAs provide on-chip random access memories (RAMs) and computation logic, commonly known as DSP blocks, programmable to varying degrees. The RAMs are organized in a hierarchical manner to ensure fast operation.

Until recently, there has been a strong consensus on the efficiency of a fixed four-input LUT. This consensus has been based on the observed average input degree of a logical function in a large design. It has now been challenged by Altera, the company that developed the so-called adaptive logic module (ALM). An ALM is based on a six-input fracturable LUT. It can host one logical function with up to seven inputs or it can be used for two separate logical functions of varying input degree. The adaptivity of an ALM also extends to the propagation delay, which depends on the input used. Thus underutilized ALMs operate faster than fully utilized ones. The traditional four-input LUT always presents a constant propagation delay regardless of the input utilization. The ALM structure was introduced in the Stratix II family of devices [10].

Xilinx has introduced the first 65nm FPGAs under the name Virtex 5. The Virtex 5 architecture is based on six-input LUTs that can be used as dual five-input LUTs with separate outputs. It thus seems that the industry is accepting the configurable 6-LUT on a wider front.

3. SWITCH STRUCTURES

Three different types of switch structures were evaluated on 90nm technologies. The considered switch implementation styles are conventional multiplexing with a sequentially encoded select signal, sum-of-products multiplexing with a one-hot encoded select signal, and tri-state buffering.
Figure 1 shows straightforward standard logic realizations for 3-input multiplexers with one-hot and sequentially encoded select signals. A sum-of-products based N-to-1 multiplexer with a one-hot encoded select signal is realized with a row of N AND gates at the input and an N-input OR gate at the output. The outputs from the N AND gates are routed as inputs to the N-input OR gate. A multiplexer with a sequentially encoded select signal is essentially a multiplexer with a one-hot encoded select signal preceded by sequential to one-hot decoding logic. As depicted in Figure 1, this decoding logic can be realized with inverters for the expected zero bits in the sequentially encoded select signal, followed by AND gates that form the one-hot select (enable) bits.

The actual implementation of these multiplexing functions depends heavily on the target technology and its resources. The best way to find near-optimal implementations among the possible combinations of target structures is logic synthesis. Tri-state buffers can be directly instantiated from standard cell technology libraries, but the replication of their behavior requires special structures on FPGAs. Macro cells for conventional multiplexers with sequential select signal encoding are also readily available in the ASIC libraries up to a certain library-dependent limit on input width. Beyond that limit the synthesis tool composes the wider multiplexing structures from the finer-grain logic cells of the library. Conventional multiplexers usually occupy the general-purpose look-up table (LUT) logic resources on FPGAs, but some devices offer dedicated multiplexer structures. The one-hot multiplexing logic always maps to the LUT resources on the FPGAs.

The focus of this synthesis-based study is on the latency and device real-estate characteristics of the three switch types. To keep the effort manageable, the scope of the study was limited and power consumption was omitted.
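The two multiplexer realizations described in this section can be illustrated with a small bit-level model. The following Python sketch is not taken from the study; the function names (`onehot_mux`, `decode_sequential`, `sequential_mux`, `muxes_agree`) are hypothetical, each data line is assumed to be a single bit, and bit 0 of the sequential select value is assumed to be the least significant bit. A W-bit switch would replicate the same structure W times.

```python
from itertools import product


def onehot_mux(inputs, select_onehot):
    """Sum-of-products N-to-1 multiplexer for single-bit data lines.

    A row of N AND gates gates each input with its enable bit, and one
    N-input OR gate combines the product terms.
    """
    assert len(inputs) == len(select_onehot)
    products = [d & s for d, s in zip(inputs, select_onehot)]  # AND row
    out = 0
    for p in products:  # N-input OR gate
        out |= p
    return out


def decode_sequential(select, n_inputs):
    """Sequential (binary) to one-hot decoder.

    For enable bit i, every select bit that is expected to be zero passes
    through an inverter, and all resulting bits are ANDed together.
    """
    width = max(1, (n_inputs - 1).bit_length())
    sel_bits = [(select >> b) & 1 for b in range(width)]
    onehot = []
    for i in range(n_inputs):
        enable = 1
        for b in range(width):
            expected = (i >> b) & 1
            bit = sel_bits[b] if expected else 1 - sel_bits[b]  # inverter
            enable &= bit
        onehot.append(enable)
    return onehot


def sequential_mux(inputs, select):
    """Conventional multiplexer: decoder followed by the one-hot multiplexer."""
    return onehot_mux(inputs, decode_sequential(select, len(inputs)))


def muxes_agree(n_inputs):
    """Exhaustively check that the decoded mux forwards the selected input."""
    for data in product((0, 1), repeat=n_inputs):
        for sel in range(n_inputs):
            if sequential_mux(list(data), sel) != data[sel]:
                return False
    return True
```

In this model the sequentially encoded multiplexer is literally the one-hot multiplexer with a decoder in front of it, which makes the extra logic on its select path visible. An unused select code (for example 3 with three inputs) asserts no enable bit, so the modelled output is 0.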
Readers interested in the power consumption of NoC switches may want to consult [11].

Three target technologies were chosen. A 90nm low power standard cell library was used for the ASIC realization. FPGAs were represented by the Xilinx Virtex 4 and the Altera Stratix II device families.

4. SWITCHES ON A STANDARD CELL ASIC

Figures 2 and 3 summarize the standard cell synthesis results. The results were derived for a slow-slow process corner with a temperature of 125 degrees Celsius and a supply voltage of 0.95 V. Five metal layers were assumed and the heaviest wire load model available was used. Synthesis was directed to optimize for the fastest possible operation, which results in the instantiation of large cells with high drive strength.
Fig. 2. Latency on a standard cell library.
5. SWITCHES ON VIRTEX 4 AND STRATIX II

The figures presented in this section were extracted for the Xilinx Virtex 4 and Altera Stratix II devices using the same synthesis tool and applying the same constraints. Physical synthesis as well as place and route were omitted in order to cut down the most time-consuming part of the required effort. Besides, the effectiveness of physical synthesis and place and route algorithms is very much application and target device dependent. Their inclusion would thus make this study application and/or device specific. We tried to avoid such limitations of scope. Devices with the slowest speed grade available were used from both manufacturers for comparable latency results. Even though the results provide just rough estimates, the general trend and device features can be judged with fair accuracy. The tri-state switch structures were evaluated using the tri-state propagation feature of the synthesis tool.

Figure 4 illustrates the approximate switch structure latencies on the Virtex 4 and Stratix II devices. The latency figures of the tri-state realizations have been omitted to improve readability. The tri-state latency was consistently very close to the one-hot multiplexer latency on both devices.
As expected, the highest resource utilization on both devices was encountered with the tri-state propagating logic structure. The tri-state propagating structure maps onto the LUT and register resources on both devices. This structure is useful for ASIC emulation on FPGA, since it ensures functional equivalence to the design's behavior in a logic simulator. Due to the architectural differences between Virtex 4 and Stratix II, the resource utilization figures are not directly comparable. Such strongly subjective comparisons are left to the reader.

6. SWITCH STRUCTURE CHARACTERISTICS

Table I summarizes the characteristics of the different switch realizations. The figures have been normalized to the highest cost in each row for area, latency, or leakage to enable relative comparison of the logic structures. The structures are presented with their averages over the studied range of I/O degrees.
These average figures are representative of the switch structures on the target platforms. Due to the similarity of standard cell libraries and static CMOS processes, the ASIC results are fairly general. The FPGA results, on the other hand, are representative of the two architectures and thus cannot be generalized to device families with different architectures.

Tri-state buffers offer low ASIC area and leakage power, in the range of approximately one third of the figures for comparable multiplexers with sequential select signal encoding. The latency of a tri-state buffer implementation is also acceptable. Thus the problems with tri-states on ASIC culminate in the management of short circuit current. Multiplexers with sequential select signal encoding provide the lowest FPGA resource utilization, but present higher latencies than multiplexers with one-hot select signal encoding on all technologies. Multiplexers with one-hot select signal encoding are ideally suited for high performance applications independently of the technology. The area of a one-hot multiplexer based switch is also moderate, making it an appealing general-purpose choice in the light of this study. For small area and potentially low power consumption, tri-state buffers could be considered for ASICs and multiplexers with sequential select signal encoding for FPGAs.

7. SWITCHES FOR LINK CAPACITY SHARING

NoC architectures often incorporate techniques to share link capacity between different sources. This is done to ease traffic management at the application level. Parallel connections are enabled on a single physical link through either time or space division. With space division the term lane is used to describe a connection. The term virtual channel is used to describe a time-divided connection. Virtual channels require individual queues/buffers dedicated to each channel, resulting in a large overall buffer space and thus router area.
In some architectures the virtual channels are multiplexed before the main router switch and de-multiplexed after it. On top of the increased buffer space and router area, this approach adds the area and latency of the channel multiplexers and de-multiplexers. With this approach only one switch input at a time can be connected to the multiple queues (channels) of an output link through the regular size switch (e.g. NxN, N being the I/O degree of a symmetric router). The concept and the resulting architecture are thus equivalent to multi-queueing.

Some NoCs, like MANGO [12], use a switch of size Nx(NxM), where M is the number of virtual channels on a link. This enables multiple switch inputs to access separate queues (channels) on a single output link at the same time. The channel de-multiplexers are eliminated since their functionality is incorporated into the switch. Multiple input queues/channels still cannot be served in parallel. Utilization of link bandwidth is slightly increased through this technique. In many designs, however, the resulting increase in router latency may reduce the usable link bandwidth by nearly as much as, or more than, the achieved increase in bandwidth utilization.

To realize the benefits of virtual channeling to their full extent, or to implement fully connected space (lane-based) division on the links, a switch of size (NxM)x(NxM) is required. Such a switch imposes no restrictions on connection (channel or lane) parallelism. This switch type forms the focus of the study presented in this section.

Table II: Impact of Link Capacity Sharing on a Fully Connected Switch
Table II summarizes the impact of link capacity sharing on the cost of a fully connected switch with the studied technologies. This information has been derived from the data presented previously in this paper using the following approach. As a starting point, a fully connected switch matrix with an I/O degree of two was chosen to represent the cost of a single-channel or single-lane realization. This reference cost was calculated as the average of the three structures: OH-MUX, SQ-MUX, and TRI-STATE. Since a fully connected switch was assumed, the link capacity shared versions effectively multiply the I/O degree by the number of virtual channels/lanes. Thus the figures for the switches with 2, 4, and 8 virtual channels/lanes were retrieved from the data for I/O degrees of 4, 8, and 16, respectively. All quoted figures were computed as averages of the three structures divided by the corresponding reference cost of a single-channel/single-lane realization.
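The derivation of the relative figures described above can be sketched in a few lines of Python. This is only an illustration of the procedure: the helper names (`shared_switch_degree`, `relative_cost`) are hypothetical, and the numbers in `example_costs` are arbitrary placeholders chosen for easy arithmetic, not data from this study.

```python
def shared_switch_degree(base_degree, channels):
    """I/O degree of a fully connected (N*M) x (N*M) switch.

    base_degree is N (the router I/O degree) and channels is M (the number
    of virtual channels or lanes per link).
    """
    return base_degree * channels


def relative_cost(cost_by_degree, base_degree, channels):
    """Average cost of the three structures, relative to the single-channel case.

    cost_by_degree maps an I/O degree to {structure name: cost}. The cost of
    the shared switch is looked up at degree N*M and divided by the average
    cost at degree N.
    """
    def average(degree):
        costs = cost_by_degree[degree]
        return sum(costs.values()) / len(costs)

    shared = average(shared_switch_degree(base_degree, channels))
    reference = average(base_degree)
    return shared / reference


# Placeholder costs in arbitrary units (hypothetical, NOT the paper's results).
example_costs = {
    2: {"OH-MUX": 1.0, "SQ-MUX": 1.0, "TRI-STATE": 1.0},
    4: {"OH-MUX": 3.0, "SQ-MUX": 4.0, "TRI-STATE": 2.0},
    8: {"OH-MUX": 10.0, "SQ-MUX": 12.0, "TRI-STATE": 8.0},
    16: {"OH-MUX": 36.0, "SQ-MUX": 44.0, "TRI-STATE": 28.0},
}

# Relative cost for 2, 4, and 8 channels/lanes on a router with I/O degree 2.
example_relative = {m: relative_cost(example_costs, 2, m) for m in (2, 4, 8)}
```

With real synthesis data, the same lookup would be repeated separately for each cost metric (for example area and latency) and for each target technology.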
CONCLUSION

Synthesizable switch structures for network-on-chip designs were studied. Target technologies included a 90nm ASIC process as well as the Altera Stratix II and Xilinx Virtex 4 FPGA devices. Sum-of-products one-hot multiplexing resulted in acceptable resource utilization on both ASIC and FPGA while consistently providing low latency. On average, one-hot multiplexing was observed to save 18% in area and provide 26% faster operation in comparison to conventional multiplexing on the 90nm ASIC process. Conventional multiplexers provide the lowest FPGA resource utilization, but present higher latencies than one-hot multiplexers. Tri-state buffers offer small ASIC area and potentially low power consumption, provided that short circuit currents are properly managed.

Link capacity sharing techniques are used to ease network traffic management at the application level. The cost of these techniques was discussed using a limited-scope case study based on the assumption of a fully connected switch, for which relative figures were quoted. From a purely architectural point of view, physically independent network resources seem more attractive than link capacity sharing.

REFERENCES

[1] “AMBA home page,” URL: http://www.arm.com/products/solutions/AMBAHomePage.html.
[2] “The CoreConnect bus architecture home page,” URL: http://www.ibm.com/chips/products/coreconnect/.
[3] “Sonics home page,” URL: http://www.sonicsinc.com/.
[4] S. Murali and G. De Micheli, “An application-specific design methodology for STBus crossbar generation,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), vol. 2, ICM, MESSE Munich, Germany, March 2005, pp. 1176–1181.
[5] W. J. Dally and B. P. Towles, Principles and Practices of Interconnection Networks. San Francisco, California, USA: Elsevier Inc. / Morgan Kaufmann Publishers, 2004.
[6] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Öberg, M. Millberg, and D. Lindqvist, “Network-on-a-Chip: An architecture for billion transistor era,” in Proceedings of NORCHIP 2000, Turku, Finland, November 2000.
[7] P. T. Wolkotte, G. J. M. Smit, N. Kavaldjiev, J. E. Becker, and J. Becker, “Energy model of networks-on-chip and a bus,” in Proceedings of the IEEE International Symposium on System-on-Chip, Tampere, Finland, November 2005.
[8] “Arteris Network on Chip Company,” URL: http://www.arteris.net/.
[9] J. Bainbridge and S. Furber, “Chain: A delay-insensitive chip area interconnect,” IEEE Micro, vol. 22, no. 5, September-October 2002.
[10] D. Lewis, E. Ahmed, G. Baeckler, V. Betz, M. Bourgeault, D. Cashman, D. Galloway, M. Hutton, C. Lane, A. Lee, P. Leventis, S. Marquardt, C. McClintock, K. Padalia, B. Pedersen, G. Powell, B. Ratchev, S. Reddy, J. Schleicher, K. Stevens, R. Yuan, R. Cliff, and J. Rose, “The Stratix II logic and routing architecture,” in Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, Monterey, California, USA, February 2005, pp. 14–20.
[11] T. T. Ye, L. Benini, and G. De Micheli, “Analysis of power consumption on switch fabrics in network routers,” in Proceedings of the 39th Design Automation Conference (DAC), New Orleans, Louisiana, USA, June 2002, pp. 524–529.
[12] T. Bjerregaard and J. Sparso, “A router architecture for connection-oriented service guarantees in the MANGO clockless network-on-chip,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), March 2005, pp. 1226–1231.