Synthesizable Switching Logic For Network-On-Chip Designs on 90nm Technologies
By Tapani Ahonen, Tampere University of Technology
Tampere, Finland
Abstract:
A study on synthesizable switch structures for network-on-chip designs is presented. Sum-of-products one-hot multiplexing is identified as an appealing general-purpose choice with low latency and acceptable resource utilization on both ASIC and FPGA. On average, one-hot multiplexing was observed to save 18% in area and provide 26% faster operation in comparison to conventional multiplexing on a 90nm ASIC process. Conventional multiplexers provide the lowest FPGA resource utilization, but present higher latencies than one-hot multiplexers. For small ASIC area and potentially low power consumption, tri-state buffers could be considered. The cost of link capacity sharing techniques is discussed using a short case study. From a purely architectural point of view, it seems more attractive to provide physically independent network resources.
1. INTRODUCTION
The trend towards more scalable, plug-and-play communication architectures has become apparent as SoCs have grown in complexity. However, there is currently no widely accepted on-chip communication mechanism that would offer unlimited scalability. Existing multiprocessor SoCs use hierarchical buses such as AMBA [1] and CoreConnect [2]. Sonics Micronetwork [3] and STBus [4] present a step towards more flexible bus architectures.
Recent research in the field of design productivity has suggested that communication networks should be utilized on-chip. These so-called networks-on-chip (NoCs) [5] [6] are envisioned to serve as the ultimate backplanes for ever-growing levels of integration. They have been designed to enable the decoupling of computation from communication, i.e., to serve as a platform between computation modules. Besides tackling the scalability issues, NoCs also offer better energy efficiency than segmented buses [7]. As NoCs are in their infancy, the architectures deployed in the near future are very likely to offer low cost comparable to that of a hierarchical bus: the first two NoC startups, Arteris [8] and Silistix [9], both promote low-cost solutions.
Switching logic forms an important part of any NoC architecture. In processor design terms, the switch serves as the functional unit (FU) and is located on the data path of a router. Continuing with this analogy, the arbitration logic controls the data path while the routing logic is responsible for instruction decoding.
The implementation of the switch has a significant impact on the overall cost as well as the performance of a NoC router. This is especially true with simple control logic, e.g. with deterministic routing, which is analogous to RISC-type processors. Complex control logic, such as that required by adaptive routing, may make the contribution of the switching logic to the performance metrics seem insignificant.
A cost-oriented study on synthesizable switching logic for NoC designs is presented in this paper. The main motivation for this study was to come up with a switch structure for NoC implementation that is portable between different technology platforms. The goal was to ensure straightforward reusability of the design without significant area or performance penalties. The results of the study were also used as background material to reflect on the cost of added parallelism through link capacity sharing.
2. TECHNOLOGY PLATFORMS
Standard cell ASIC designs are made using a library of functional logic instances supplied by the technology vendor. These ASIC cells can be placed on cell rows and routed using the Manhattan routing style with 90 degrees corners.
Currently available FPGAs are based on look-up tables (LUTs), synchronous registers (flip-flops), multiplexers, and programmable routing. The programmability of the routing is most often realized using static random access memory (SRAM) cells to indicate whether a given connection is used or not. In addition to these freely programmable logic structures the modern FPGAs provide on-chip random access memories (RAMs) and computation logic, commonly known as DSP blocks, programmable to varying degrees. The RAMs are organized in a hierarchical manner to ensure fast operation.
Until recently, there has been a strong consensus on the efficiency of a fixed four-input LUT. It has been based on the observation of the average input degree of a logical function in a large design. This conception has now been challenged by Altera, the company that developed the so-called adaptive logic module (ALM). An ALM is based on a six-input fracturable LUT. It can host one logical function with up to seven inputs or it can be used for two separate logical functions of varying input degree.
The adaptivity of an ALM also includes the propagation delay, which depends on the input used. Thus underutilized ALMs operate faster than fully utilized ones. The traditional four-input LUT always presents a constant propagation delay regardless of the input utilization. The ALM structure was introduced in the Stratix II family of devices [10].
Xilinx has introduced the first 65nm FPGAs under the name Virtex 5. The Virtex 5 architecture is based on six-input LUTs that can be used as dual 5-input LUTs with separate outputs. It thus seems that the industry is accepting the configurable 6-LUT on a wider front.
3. SWITCH STRUCTURES
Three different types of switch structures were evaluated on 90nm technologies. The considered switch implementation styles include conventional multiplexing with a sequentially encoded select signal, sum-of-products type multiplexing with a one-hot encoded select signal, and tri-state buffering. Figure 1 shows straightforward standard logic realizations for 3-input multiplexers with one-hot and sequentially encoded select signals. A sum-of-products based N-to-1 multiplexer with a one-hot encoded select signal is realized with a row of N AND gates at the input and an N-input OR gate at the output. The outputs from the N AND gates are routed as inputs to the N-input OR gate. A multiplexer with a sequentially encoded select signal is essentially a multiplexer with a one-hot encoded select signal preceded by sequential to one-hot decoding logic. As depicted in Figure 1, this sequential to one-hot decoding logic can be realized with inverters for the expected zero bits in the sequentially encoded select signal followed by AND gates to form the one-hot select (enable) bit.
Fig. 1. Example standard logic realizations for 3-input multiplexers with one-hot encoded select signal (on the left) and sequentially encoded select signal (on the right).
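The two multiplexer styles described above can be illustrated with a small bit-level model. This is only a behavioral sketch and not the synthesized netlist used in the study; the function names oh_mux, seq_to_onehot, and sq_mux are hypothetical, and each data line is assumed to be a single bit.

```python
from typing import List, Sequence


def oh_mux(data: Sequence[int], select_onehot: Sequence[int]) -> int:
    """Sum-of-products multiplexer: N AND gates feeding one N-input OR gate."""
    if len(data) != len(select_onehot):
        raise ValueError("data and select must have the same width")
    # Row of N two-input AND gates, one per data input.
    products = [d & s for d, s in zip(data, select_onehot)]
    # N-input OR gate combining the product terms.
    out = 0
    for p in products:
        out |= p
    return out


def seq_to_onehot(select: int, n: int) -> List[int]:
    """Decode a binary (sequentially encoded) select into n enable bits."""
    width = max(1, (n - 1).bit_length())
    enables = []
    for i in range(n):
        enable = 1
        for b in range(width):
            sel_bit = (select >> b) & 1
            expected = (i >> b) & 1
            # Inverter on the bits expected to be zero, then AND them together.
            enable &= sel_bit if expected else (sel_bit ^ 1)
        enables.append(enable)
    return enables


def sq_mux(data: Sequence[int], select: int) -> int:
    """Sequentially encoded multiplexer: decoder followed by a one-hot mux."""
    return oh_mux(data, seq_to_onehot(select, len(data)))


if __name__ == "__main__":
    inputs = [0, 1, 0]
    for sel in range(len(inputs)):
        print(sel, seq_to_onehot(sel, len(inputs)), sq_mux(inputs, sel))
```

In this model an out-of-range select value enables no input, so the output falls to zero. That is a property of the sketch, not necessarily of any particular library cell.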
The actual implementation of these multiplexing functionalities depends heavily on the target technology and its resources. The best way to find near-optimal implementations among the possible combinations of target structures is by logic synthesis. Tri-state buffers can be directly instantiated from standard cell technology libraries, but the replication of their behavior requires special structures on FPGAs. Macro cells for conventional multiplexers with sequential select signal encoding are also readily available in the ASIC libraries up to a certain library-dependent limit on input width. Beyond that limit, the synthesis tool composes the wider multiplexing structures from the finer-grain logic cells of the library. Conventional multiplexers usually occupy the general-purpose look-up table (LUT) logic resources on FPGAs, but some devices offer dedicated multiplexer structures. The one-hot multiplexing logic always maps to the LUT resources on the FPGAs.
The focus of this synthesis-based study is on the latency and device real-estate characteristics for the three switch types. In order to facilitate the process the scope of the study was limited and power consumption was omitted. However, those interested in power consumption of NoC switches may want to take a look at [11].
Three target technologies were chosen. A 90nm low power standard cell library was used for ASIC realization. FPGAs were represented by the Xilinx Virtex 4 and the Altera Stratix II device families.
4. SWITCHES ON A STANDARD CELL ASIC
Figures 2 and 3 summarize the standard cell synthesis results. The results were derived for a slow-slow process corner with a temperature of 125 degrees Celsius and supply of 0.95 Volts. Five metal layers were assumed and the heaviest wire load model available was used. Synthesis was directed to optimize for fastest possible operation, which results in the instantiation of large high drive strength cells.
Fig. 2. Latency on a standard cell library.
Fig. 3. Gate count on a standard cell library.
The tri-state buffers present a load to each other when driving the same signal, leading to linear latency growth. As expected, the shallow logic structure of the one-hot multiplexer (denoted by OH-MUX in these and the following figures) provides faster operation than a conventional multiplexer with sequential select signal encoding (denoted by SQ-MUX in the figures). The area of the OH-MUX grows linearly with input degree. In this particular technology library there are a couple of suitable macro cells for the AND-OR logic of the OH-MUX. They were utilized by the synthesis tool with input degrees of five and six, which shows up as a deviation from the linear slope in the area chart.
The latency of the conventional multiplexer exhibits a stepping pattern where a step higher takes place every time a power of two is exceeded. A less pronounced stepping pattern can be seen with the one-hot multiplexer as well. This happens because two-input logic provides the fastest operation and an additional level of two-input logic is needed every time a power of two is exceeded. With conventional multiplexers also an additional select bit is required resulting in a deeper tree of AND gates in the select signal decoding logic.
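The stepping pattern can be reproduced with a simple depth count. The sketch below assumes an idealized balanced tree of two-input gates and ignores gate sizing and wire load, so it only models where the steps occur, not the measured latencies. The helper names tree_levels and select_bits are hypothetical.

```python
def tree_levels(fan_in: int, gate_inputs: int = 2) -> int:
    """Levels of gate_inputs-input gates needed to combine fan_in signals."""
    levels = 0
    signals = fan_in
    while signals > 1:
        signals = -(-signals // gate_inputs)  # ceiling division
        levels += 1
    return levels


def select_bits(n_inputs: int) -> int:
    """Width of a sequentially (binary) encoded select signal."""
    return max(1, (n_inputs - 1).bit_length())


if __name__ == "__main__":
    print("inputs  two-input levels  select bits")
    for n in range(2, 18):
        print(f"{n:6d}  {tree_levels(n):16d}  {select_bits(n):11d}")
```

Both columns increase by one each time the input degree exceeds a power of two, which is where the steps described above occur.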
5. SWITCHES ON VIRTEX 4 AND STRATIX II
The figures presented in this section were extracted for the Xilinx Virtex 4 and Altera Stratix II devices using the same synthesis tool and applying the same constraints. Physical synthesis as well as place and route were omitted in order to facilitate this study and cut down the most time-consuming part of the required effort. Besides, the effectiveness of physical synthesis and place and route algorithms is very much application and target device dependent. Thus their inclusion would make this study application and/or device specific. We tried to avoid such limitations of scope.
Devices with the slowest speed grade available were used from both manufacturers for comparable latency results. Even though the results provide just rough estimates, the general trend and device features can be judged with fair accuracy. The tri-state switch structures were evaluated using the tri-state propagation feature of the synthesis tool.
Figure 4 illustrates the approximate switch structure latencies on the Virtex 4 and Stratix II devices. The latency figures of the tri-state realizations have been omitted to improve readability. The tri-state latency was consistently very close to the one-hot multiplexer latency on both devices.
Fig. 4. Latency on Virtex 4 and Stratix II.
The switch latency on Virtex 4 increases in steps with each additional level of look-up tables required. Latencies grow smoothly with input degree on Stratix II. This is due to a fundamental difference in the device architectures. The latency of an adaptive look-up table (ALUT) in Stratix II depends on the transitioning input line. Hence if an ALUT is only partially utilized, the inputs providing the lowest latencies are used. In contrast, the look-up table (LUT) latency in Virtex 4 is constant for all inputs.
The switch latencies are generally lower for Stratix II in comparison to Virtex 4. Once again, this is caused by the architectural difference. An ALUT of Stratix II can have a maximum of seven inputs whereas the fixed LUT structure of Virtex 4 has four inputs. The use of wide ALUTs on Stratix II results in a more shallow switching logic in comparison to the 4-input LUT trees on Virtex 4.
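The effect of LUT width on logic depth can be approximated with a rough counting model. The sketch below is an assumption-laden estimate, not a description of how the synthesis tool maps the logic: it assumes that a first-level k-input LUT absorbs floor(k/2) data/select pairs of a one-hot multiplexer and that the remaining levels form an OR tree of k-input LUTs. It ignores mixed packing, dedicated multiplexer resources, and routing delay. The function name oh_mux_lut_levels is hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def oh_mux_lut_levels(n_inputs: int, lut_inputs: int) -> int:
    """Estimated LUT levels for an n_inputs-to-1 one-hot multiplexer."""
    # Each first-level LUT handles this many (data, select) pairs.
    pairs_per_lut = lut_inputs // 2
    signals = ceil_div(n_inputs, pairs_per_lut)
    levels = 1
    # Remaining levels OR the partial results together.
    while signals > 1:
        signals = ceil_div(signals, lut_inputs)
        levels += 1
    return levels


if __name__ == "__main__":
    print("inputs  4-input LUT levels  6-input LUT levels")
    for n in (2, 3, 4, 8, 16):
        print(f"{n:6d}  {oh_mux_lut_levels(n, 4):18d}  {oh_mux_lut_levels(n, 6):18d}")
```

Under these assumptions a 16-input one-hot multiplexer needs three levels of 4-input LUTs but only two levels of 6-input LUTs, which is consistent with the shallower logic described above.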
Realizations of wide multiplexers with sequential select signal encoding are large and slow with 4-input LUTs. However, the Virtex 4 architecture provides some multiplexer resources that can be utilized for logic mapping. The latency chart for multiplexing with sequentially encoded select signal on Virtex 4 exhibits dips as a result of the usage of these multiplexer resources.
Fig. 5. Utilization of programmable logic resources (LUTs) on Virtex 4.
Fig. 6. Utilization of programmable logic resources (ALUTs) on Stratix II.
Figures 5 and 6 summarize the utilization of look-up table resources for the different switch structures on Virtex 4 and Stratix II respectively. As in the latency chart the dips caused by the utilization of the multiplexer resources on Virtex 4 show up for the multiplexers with sequential select signal encoding. The LUT utilization on Virtex 4 is roughly the same for both multiplexing styles. On Stratix II however, multiplexing with one-hot encoded select signal results in ALUT utilization overhead in comparison to multiplexing with sequentially encoded select signal. This becomes more prominent with higher input degrees because of the growing difference in the select signal width.
As expected, the highest resource utilization on both devices was encountered with the tri-state propagating logic structure. The tri-state propagating structure maps onto the LUT and register resources on all devices. This structure is useful for ASIC emulation on FPGA, since it ensures functional equivalence to the design's behavior in a logic simulator. Due to the architectural differences between Virtex 4 and Stratix II, the resource utilization figures are not directly comparable. Such strongly subjective comparisons are left for the reader.
6. SWITCH STRUCTURE CHARACTERISTICS
Table 1 summarizes the characteristics for the different switch realizations. The figures have been normalized to the highest cost in each row for area, latency, or leakage to enable relative comparison of the logic structures. The structures are presented with their averages over the studied range of I/O degrees.
Table 1: Summary of Switch Characteristics
Char. | Tech. | OH-MUX | SQ-MUX | TRI-STATE |
Area | ASIC | 0.82 | 1.00 | 0.37 |
Area | Stratix II | 0.68 | 0.44 | 1.00 |
Area | Virtex 4 | 0.68 | 0.64 | 1.00 |
Latency | ASIC | 0.72 | 0.97 | 1.00 |
Latency | Stratix II | 0.99 | 1.00 | 0.99 |
Latency | Virtex 4 | 0.88 | 1.00 | 0.89 |
Leakage | ASIC | 0.66 | 1.00 | 0.31 |
Average | ASIC | 0.73 | 0.99 | 0.56 |
Average | Stratix II | 0.83 | 0.72 | 1.00 |
Average | Virtex 4 | 0.78 | 0.82 | 0.95 |
These average figures are representative of the switch structures on the target platforms. Due to the similarity of standard cell libraries and static CMOS processes, the ASIC results are fairly general. The FPGA results on the other hand are representative of the two architectures and cannot thus be generalized to device families with different ones.
Tri-state buffers offer low ASIC area and leakage power in the range of approximately one third of the figures for comparable multiplexers with sequential select signal encoding. The latency of a tri-state buffer implementation is also acceptable. Thus the problems with tri-states on ASIC culminate in the management of short circuit current. Multiplexers with sequential select signal encoding provide the lowest FPGA resource utilization, but present higher latencies than multiplexers with one-hot select signal encoding on all technologies.
Multiplexers with one-hot select signal encoding are ideally suited for high performance applications independently of the technology. The area of a one-hot multiplexer based switch is also moderate making it an appealing general purpose choice in the light of this study. For small area and potentially low power consumption, tri-state buffers could be considered for ASICs and multiplexers with sequential select signal encoding for FPGAs.
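The per-technology averages of Table 1 and the savings quoted for the ASIC process can be recomputed from the normalized figures. The sketch below assumes that the quoted percentages are relative reductions of the OH-MUX figure with respect to the SQ-MUX figure; the names TABLE1, tech_average, and reduction are hypothetical.

```python
# Normalized figures from Table 1: (OH-MUX, SQ-MUX, TRI-STATE).
TABLE1 = {
    ("Area", "ASIC"): (0.82, 1.00, 0.37),
    ("Area", "Stratix II"): (0.68, 0.44, 1.00),
    ("Area", "Virtex 4"): (0.68, 0.64, 1.00),
    ("Latency", "ASIC"): (0.72, 0.97, 1.00),
    ("Latency", "Stratix II"): (0.99, 1.00, 0.99),
    ("Latency", "Virtex 4"): (0.88, 1.00, 0.89),
    ("Leakage", "ASIC"): (0.66, 1.00, 0.31),
}
STRUCTURES = ("OH-MUX", "SQ-MUX", "TRI-STATE")


def tech_average(tech: str) -> tuple:
    """Average every available characteristic of one technology per structure."""
    rows = [values for (_, t), values in TABLE1.items() if t == tech]
    return tuple(sum(column) / len(rows) for column in zip(*rows))


def reduction(row: tuple, better: int = 0, baseline: int = 1) -> float:
    """Relative reduction of row[better] with respect to row[baseline]."""
    return 1.0 - row[better] / row[baseline]


if __name__ == "__main__":
    for tech in ("ASIC", "Stratix II", "Virtex 4"):
        averages = ", ".join(
            f"{name}={value:.3f}"
            for name, value in zip(STRUCTURES, tech_average(tech))
        )
        print(f"{tech}: {averages}")
    print(f"ASIC area reduction: {100 * reduction(TABLE1[('Area', 'ASIC')]):.0f}%")
    print(f"ASIC latency reduction: {100 * reduction(TABLE1[('Latency', 'ASIC')]):.0f}%")
```

The recomputed averages agree with the Average rows of the table to within rounding of the two-decimal entries.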
7. SWITCHES FOR LINK CAPACITY SHARING
NoC architectures often incorporate techniques to share link capacity between different sources. This is done to ease traffic management at the application level. Parallel connections are enabled on a single physical link through either time or space division. With space division the term lane is used to describe a connection. The term virtual channel is used to describe a time-divided connection. Virtual channels require individual queues/buffers dedicated to each channel, resulting in a large overall buffer space and thus router area.
In some architectures the virtual channels are multiplexed before the main router switch and de-multiplexed after the main router switch. On top of the increased buffer space and router area this approach results in the added area and latency of the channel multiplexers and de-multiplexers. Only one switch input at a time can be connected to the multiple queues (channels) of an output link through the regular size (e.g. NxN, N being the I/O degree of a symmetric router) switch using this approach. The concept and the resulting architecture are thus equivalent to multi-queueing.
Some NoCs, like MANGO [12], use a switch of size Nx(NxM), where M is the number of virtual channels on a link. This enables multiple switch inputs to access separate queues (channels) on a single output link at the same time. The channel de-multiplexers are eliminated since their functionality is incorporated into the switch. Multiple input queues/channels still cannot be served in parallel. Utilization of link bandwidth is slightly increased through this technique. In many designs however, the resulting increase in router latency may decrease the usable link bandwidth close to or more than the achieved increase in bandwidth utilization.
To realize the benefits of virtual channeling to their full extent or to implement fully connected space (lane-based) division on the links, a switch of (NxM)x(NxM) is required. Such a switch imposes no restrictions on connection (channel or lane) parallelism. This switch type forms the focus of the study presented in this section.
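The three switch organizations discussed in this section differ only in their port counts. The sketch below tabulates them for a symmetric router of I/O degree N with M channels or lanes per link. The labels used as dictionary keys and the names SwitchSize and switch_sizes are hypothetical, and the model assumes that every switch output is driven by one multiplexer whose input degree equals the number of switch inputs.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SwitchSize:
    inputs: int
    outputs: int

    @property
    def mux_degree(self) -> int:
        """Input degree of the multiplexer driving each output."""
        return self.inputs


def switch_sizes(n: int, m: int) -> Dict[str, SwitchSize]:
    """Port counts for router degree n and m channels/lanes per link."""
    return {
        # Channels multiplexed before and de-multiplexed after the switch.
        "NxN": SwitchSize(n, n),
        # De-multiplexers absorbed into the switch outputs.
        "Nx(NxM)": SwitchSize(n, n * m),
        # No restriction on channel or lane parallelism.
        "(NxM)x(NxM)": SwitchSize(n * m, n * m),
    }


if __name__ == "__main__":
    for label, size in switch_sizes(n=2, m=4).items():
        print(f"{label:12s} inputs={size.inputs:2d} outputs={size.outputs:2d} "
              f"mux degree={size.mux_degree:2d}")
```

For N = 2 and M = 4 the fully connected case yields an 8x8 switch, i.e. the I/O degree is multiplied by the number of channels or lanes.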
Table 2: Impact of Link Capacity Sharing on a Fully Connected Switch
Char. | Tech. | 2 VCs/lanes | 4 VCs/lanes | 8 VCs/lanes |
Area | ASIC | 1.67x | 3.56x | 7.73x |
Area | Stratix II | 1.50x | 3.25x | 6.75x |
Area | Virtex 4 | 2.25x | 4.25x | 8.75x |
Latency | ASIC | 1.56x | 2.41x | 3.43x |
Latency | Stratix II | 1.25x | 1.52x | 1.77x |
Latency | Virtex 4 | 1.38x | 1.44x | 1.77x |
Leakage | ASIC | 1.20x | 2.96x | 5.62x |
Table 2 summarizes the impact of link capacity sharing on the cost of a fully connected switch with the studied technologies. This information has been derived from the data presented previously in this paper using the following approach. As a starting point, a fully connected switch matrix with an I/O degree of two was chosen to represent the cost of a single-channel or single-lane realization. This reference cost was calculated as an average of the three structures - OH-MUX, SQ-MUX, and TRI-STATE. Since a fully connected switch was assumed, the link capacity shared versions effectively multiply the I/O degree by the number of virtual channels/lanes. Thus the figures for the switches with 2, 4, and 8 virtual channels/lanes were retrieved from the data for I/O degrees of 4, 8, and 16 respectively. All quoted figures were computed as averages of the three structures divided by the corresponding reference cost of a single-channel/single-lane realization.
The results suggest that the switch area is roughly multiplied by the number of virtual channels/lanes regardless of the technology. The lowest latency is achieved on ASIC with two-input logic, whereas FPGAs are generally more effective with four-input logic. Due to this, the ASIC latency grows approximately with the number of virtual channels/lanes divided by two and the FPGA latency grows roughly with the number of virtual channels/lanes divided by four. These figures were as expected.
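The area trend can be read directly from Table 2 by dividing each cost ratio by the number of virtual channels/lanes. The sketch below does this for every row of the table; the names CHANNELS, TABLE2, and per_channel are hypothetical, and the values are copied from the table without any additional data.

```python
# Numbers of virtual channels/lanes in the columns of Table 2.
CHANNELS = (2, 4, 8)

# Cost ratios from Table 2, relative to a single-channel realization.
TABLE2 = {
    ("Area", "ASIC"): (1.67, 3.56, 7.73),
    ("Area", "Stratix II"): (1.50, 3.25, 6.75),
    ("Area", "Virtex 4"): (2.25, 4.25, 8.75),
    ("Latency", "ASIC"): (1.56, 2.41, 3.43),
    ("Latency", "Stratix II"): (1.25, 1.52, 1.77),
    ("Latency", "Virtex 4"): (1.38, 1.44, 1.77),
    ("Leakage", "ASIC"): (1.20, 2.96, 5.62),
}


def per_channel(ratios: tuple) -> tuple:
    """Divide each cost ratio by the matching number of channels/lanes."""
    return tuple(ratio / m for ratio, m in zip(ratios, CHANNELS))


if __name__ == "__main__":
    for (characteristic, tech), ratios in TABLE2.items():
        normalized = "  ".join(f"{value:.2f}" for value in per_channel(ratios))
        print(f"{characteristic:8s} {tech:11s} {normalized}")
```

A value near 1.0 means that the cost grows in proportion to the number of channels/lanes. The area rows stay between roughly 0.75 and 1.13, whereas the latency rows fall well below 1.0 as the number of channels grows.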
However, the behavior of the ASIC library used shows an interesting trend. Its leakage power tends to be lower in relation to area at higher I/O degrees. This stems from the initial assumption of a heavy parasitic load at the output requiring high drive strength for acceptable latency. Higher drive strength is obtained by increasing the cut-through area of the CMOS channel, which causes more leakage. Increased I/O degree requires more complex switching logic where a smaller portion of the transistors is driving output loads. Hence the lower leakage power in relation to area with higher I/O degrees.
In conclusion, link capacity sharing requires a considerable amount of additional device resources. Even though the link capacity sharing techniques increase the effective bandwidth of wormhole or virtual cut-through routed NoCs, it is very probable that the use of link capacity sharing results in a decreased overall communication bandwidth per device resources used. Bandwidth is decreased through higher switch and control latency. The control overhead is not in the scope of this study, but its significance should be obvious.
In this light it seems more attractive to use the same device resources to provide physically independent network resources. One common form of physically independent networks separates the request and response infrastructures. This separation has the added benefit of avoiding certain types of possible deadlock conditions.
CONCLUSION
Synthesizable switch structures for network-on-chip designs were studied. Target technologies included a 90nm ASIC process as well as the Altera Stratix II and Xilinx Virtex 4 FPGA devices.
Sum-of-products one-hot multiplexing resulted in acceptable resource utilization on both ASIC and FPGA while consistently providing low latency. On average, one-hot multiplexing was observed to save 18% in area and provide 26% faster operation in comparison to conventional multiplexing on the 90nm ASIC process. Conventional multiplexers provide the lowest FPGA resource utilization, but present higher latencies than one-hot multiplexers. Tri-state buffers offer small ASIC area and potentially low power consumption, provided that short circuit currents are properly managed.
Link capacity sharing techniques are used to ease network traffic management at the application level. The cost of these techniques was discussed using a limited-scope case study based on the assumption of a fully connected switch, for which relative figures were quoted. From a purely architectural point of view, physically independent network resources seem more attractive in comparison to link capacity sharing.
REFERENCES
[1] “AMBA home page,” URL: http://www.arm.com/products/solutions/AMBAHomePage.html.
[2] “The CoreConnect bus architecture home page,” URL: http://www.ibm.com/chips/products/coreconnect/.
[3] “Sonics home page,” URL: http://www.sonicsinc.com/.
[4] S. Murali and G. De Micheli, “An application-specific design methodology for STBus crossbar generation,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), vol. 2, ICM, MESSE Munich, Germany, March 2005, pp. 1176–1181.
[5] W. J. Dally and B. P. Towles, Principles and Practices of Interconnection Networks. San Francisco, California, USA: Elsevier Inc. / Morgan Kaufmann Publishers, 2004.
[6] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Öberg, M. Millberg, and D. Lindqvist, “Network-on-a-Chip: An architecture for billion transistor era,” in Proceedings of NORCHIP 2000, Turku, Finland, November 2000.
[7] P. T. Wolkotte, G. J. M. Smit, N. Kavaldjiev, J. E. Becker, and J. Becker, “Energy model of networks-on-chip and a bus,” in Proceedings of the IEEE International Symposium on System-on-Chip, Tampere, Finland, November 2005.
[8] “Arteris Network on Chip Company,” URL: http://www.arteris.net/.
[9] J. Bainbridge and S. Furber, “Chain: A delay-insensitive chip area interconnect,” IEEE Micro, vol. 22, no. 5, September-October 2002.
[10] D. Lewis, E. Ahmed, G. Baeckler, V. Betz, M. Bourgeault, D. Cashman, D. Galloway, M. Hutton, C. Lane, A. Lee, P. Leventis, S. Marquardt, C. McClintock, K. Padalia, B. Pedersen, G. Powell, B. Ratchev, S. Reddy, J. Schleicher, K. Stevens, R. Yuan, R. Cliff, and J. Rose, “The Stratix II logic and routing architecture,” in Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, Monterey, California, USA, February 2005, pp. 14–20.
[11] T. T. Ye, L. Benini, and G. De Micheli, “Analysis of power consumption on switch fabrics in network routers,” in Proceedings of the 39th Design Automation Conference (DAC), New Orleans, Louisiana, USA, June 2002, pp. 524–529.
[12] T. Bjerregaard and J. Sparso, “A router architecture for connection-oriented service guarantees in the MANGO clockless network-on-chip,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), March 2005, pp. 1226–1231.