A comparison of Network-on-Chip and Busses
by Arteris (www.arteris.com)

A number of research studies have demonstrated the feasibility and advantages of Network-on-Chip (NoC) over traditional bus-based architectures. This whitepaper summarizes the limitations of traditional bus-based approaches, introduces the advantages of the generic concept of NoC, and provides specific data about Arteris’ NoC, the first commercial implementation of such architectures. Using a generic design example, we provide detailed comparisons of the scalability, performance and area of traditional busses or crossbars versus Arteris’ NoC.

1. Trends

Busses have successfully been implemented in virtually all complex System-on-Chip (SoC) silicon designs. Busses have typically been handcrafted around either a specific set of features relevant to a narrow target market, or support for a specific processor. Several trends have forced evolutions of system architectures, in turn driving evolutions of the required busses. These trends are:
These trends have driven the evolution of many new bus architectures. These include the introduction of split and retry techniques, the removal of tri-state buffers and multi-phase clocks, the introduction of pipelining, and various attempts to define standard communication sockets. However, history has shown that there are conflicting tradeoffs between compatibility requirements, driven by IP block reuse strategies, and the introduction of the necessary bus evolutions driven by technology changes: in many cases, introducing new features has required many changes in the bus implementation and, more importantly, in the bus interfaces (for example, the evolution from AMBA ASB to AHB 2.0, then AMBA AHB-Lite, then AMBA AXI), with major impacts on IP reusability and new IP design.

Busses do not decouple the activities generally classified as transaction, transport and physical layer behaviors. This is the key reason they cannot adapt to changes in the system architecture or take advantage of the rapid advances in silicon process technology. Consequently, changes to the bus physical implementation can have serious ripple effects upon the implementations of higher-level bus behaviors. Replacing tri-state techniques with multiplexers has had little effect upon the transaction level. Conversely, the introduction of flexible pipelining to ease timing closure has massive effects on all bus architectures up through the transaction level. Similarly, system architecture changes may require new transaction types or transaction characteristics. Recently, new transaction types such as exclusive accesses have been introduced nearly simultaneously within the OCP 2.0 and AMBA AXI socket standards. Out-of-order response capability is another example. Unfortunately, such evolutions typically impact the intended bus architectures down to the physical layer, if only by the addition of new wires or op-codes. Thus, the bus implementation must be redesigned. As a consequence, bus architectures cannot closely follow process evolution, nor system architecture evolution. Bus architects must always make compromises between the various driving forces, and resist change as much as possible.

In the data communications space, LANs and WANs have successfully dealt with similar problems by employing a layered architecture. By relying on the OSI model, upper and lower layer protocols have independently evolved in response to advancing transmission technology and transaction-level services. The decoupling of communication layers using the OSI model has successfully driven commercial network architectures, and enabled networks to follow very closely both physical layer evolutions (from the Ethernet multi-master coaxial cable to twisted pairs, ADSL, fiber optics, wireless...) and transaction-level evolutions (TCP, UDP, streaming voice/video data). This has produced incredible flexibility at the application level (web browsing, peer-to-peer, secure web commerce, instant messaging, etc.), while maintaining upward compatibility (old-style 10 Mb/s or even 1 Mb/s Ethernet devices are still commonly connected to LANs). Following the same trend, networks have started to replace busses in much smaller systems: PCI Express is a network-on-a-board, replacing the PCI board-level bus. Replacement of SoC busses by NoCs will follow the same path, when the economics prove that the NoC either:
In each case, if all other criteria are equal or better, the NoC will replace SoC busses. This paper describes how the NoC architecture affects these economic criteria, focusing on performance and manufacturing cost comparisons with traditional-style busses. The other criteria mostly depend on the maturity of the tools supporting the NoC architecture and will be addressed separately.

2. NoC Architecture

The advanced Network-on-Chip developed by Arteris employs system-level network techniques to solve on-chip traffic transport and management challenges. As discussed in the previous section and shown in Figure 1, synchronous bus limitations lead to system segmentation and tiered or layered bus architectures.
Contrast this with the Arteris approach illustrated in Figure 2. The NoC is a homogeneous, scalable switch fabric network. This switch fabric forms the core of the NoC technology and transports multi-purpose data packets within complex, IP-laden SoCs. Key characteristics of this architecture are:
2.1 NoC layers

IP blocks communicate over the NoC using a three-layered communication scheme (Figure 3), referred to as the Transaction, Transport, and Physical layers.
The Transaction layer defines the communication primitives available to the interconnected IP blocks. Special NoC Interface Units (NIUs), located at the NoC periphery, provide transaction-layer services to the IP blocks with which they are paired. This is analogous, in data communications networks, to the Network Interface Cards that source/sink information to the LAN/WAN media. The transaction layer defines how information is exchanged between NIUs to implement a particular transaction. For example, a NoC transaction is typically made of a request from a master NIU to a slave NIU, and a response from the slave back to the master. However, the transaction layer leaves the implementation details of the exchange to the transport and physical layers. NIUs that bridge the NoC to an external protocol (such as AHB) translate transactions between the two protocols, tracking transaction state on both sides. For compatibility with existing bus protocols, the Arteris NoC implements traditional address-based Load/Store transactions, with their usual variants including incrementing, streaming and wrapping bursts, and so forth. It also implements special transactions that allow sideband communication between IP blocks.

The Transport layer defines the rules that apply as packets are routed through the switch fabric. Very little of the information contained within the packet (typically, within the first cell of the packet, a.k.a. the header cell) is needed to actually transport the packet. The packet format is very flexible and easily accommodates changes at the transaction level without impacting the transport level. For example, packets can include byte enables, parity information, or user information depending on the actual application requirements, without altering packet transport or physical transport. A single NoC typically utilizes a fixed packet format that matches the complete set of application requirements. However, multiple NoCs using different packet formats can be bridged together using translation units. The Transport layer may be optimized to application needs. For example, wormhole packet handling decreases latency and storage but might lead to lower system performance when crossing local throughput boundaries, while store-and-forward handling has the opposite characteristics. The Arteris architecture allows such optimizations to be made locally: wormhole routing is typically used within synchronous domains in order to minimize latency, but some amount of store-and-forward is used when crossing clock domains.

The Physical layer defines how packets are physically transmitted over an interface, much like Ethernet defines 10 Mb/s, 1 Gb/s, etc. physical interfaces. As explained above, protocol layering allows multiple physical interface types to coexist without compromising the upper layers. Thus, NoC links between switches can be optimized with respect to bandwidth, cost, data integrity, and even off-chip capabilities, without impacting the transport and transaction layers. In addition, Arteris has defined a special physical interface that allows independent hardening of physical cores, and then connection of those cores together, regardless of each core's clock speed and of the physical distance between the cores (within reasonable limits guaranteeing signal integrity). This enables true hierarchical physical design practices.

A summary of the mapping of the protocol layers onto NoC design units is illustrated by the following figure.
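To make the division of responsibilities more concrete, the short sketch below shows how a Load/Store request might be carried as a packet whose header cell is the only part the switch fabric ever inspects. This is purely illustrative: the field names, widths and cell layout are assumptions chosen for clarity, not the actual Arteris packet format.

```python
# Illustrative sketch only: field names and layout are assumptions, not the
# actual Arteris NoC packet format.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class HeaderCell:
    """First cell of a packet: the only cell the transport layer inspects."""
    route_id: int    # transport-layer routing information (target NIU)
    opcode: str      # transaction-layer semantics, e.g. "LOAD" or "STORE"
    address: int     # transaction-layer address, opaque to the switches
    burst_len: int   # number of payload cells that follow

@dataclass
class Packet:
    header: HeaderCell
    payload: List[int] = field(default_factory=list)  # data cells (byte enables, parity, ... could be added)

def store_request(route_id: int, address: int, data: List[int]) -> Packet:
    """Transaction layer: turn a Store request into a packet."""
    return Packet(HeaderCell(route_id, "STORE", address, len(data)), list(data))

def output_port(packet: Packet, routing_table: Dict[int, int]) -> int:
    """Transport layer: a switch forwards a packet using only the header."""
    return routing_table[packet.header.route_id]

pkt = store_request(route_id=3, address=0x1000, data=[0xA, 0xB, 0xC, 0xD])
print(output_port(pkt, routing_table={3: 1}))  # the switch forwards this packet on port 1
```

The point of the sketch is the separation of concerns: the switches route on header information alone, so transaction-level fields can evolve without touching the transport or physical layers.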
2.2 NoC Layered approach benefits

The benefits of this layered approach can be summarized as follows:
2.3 NoC pitfalls

In spite of the obvious advantages, a layered strategy for on-chip communication must not model itself too closely on data communications networks. In data communication networks the transport medium (i.e., optical fiber) is much more costly than the transmitter and receiver hardware, and often employs “wave pipelining” (i.e., multiple symbols in flight on the same wire, in the case of fiber optics or controlled-impedance wires). Inside the SoC, the relative cost and performance of wires and gates is different, and wave pipelining is too difficult to control. As a consequence, NoCs will not, at least for the foreseeable future, serialize data over single wires, but will instead find an optimal trade-off between clock rate (100 MHz to 1 GHz) and number of data wires (16, 32, 64...) for a given throughput. Further illustrating the contrast, data communications networks tend to focus on meeting bandwidth-related quality-of-service requirements, while SoC applications also focus on latency constraints. Moreover, a direct on-chip implementation of traditional network architectures would lead to significant area and latency overheads. For example, the packet dropping and retry mechanisms that are part of TCP/IP flow control require significant data storage and complex software control. The resulting latency would be prohibitive for most SoCs. Designing a NoC architecture that excels in all domains compared to busses requires a constant focus on appropriate trade-offs.

3. Comparison with traditional busses

In this section we will use an example to quantify some advantages of the NoC approach over traditional busses. The challenge is that such comparisons depend strongly on the actual SoC requirements. We will first describe an example we hope is general enough that we may apply the results more broadly to a class of SoCs. The “design example” comprises 72 IP blocks: 36 masters and 36 slaves (the ratio between slaves and masters does not really matter, but the slaves usually define the upper limit of system throughput). The total number of IP blocks implies a hierarchical interconnect scheme; we assume that the IP blocks are divided into 9 clusters of 8 IP blocks each. Within each cluster, IP blocks are locally connected using a local bus or a switch, and the local busses or switches are themselves connected together at the SoC level. With a regular, hierarchical floorplan, the two architectures look somewhat like Figure 5.
The SoC is assumed to be 9 mm square, the clusters 3 mm square, and the IP blocks each about 1 mm square. Let us also assume a 90 nm process technology and an associated standard cell library where an unloaded gate delay is 60 ps, and the DFF traversal time (setup + clock-to-output) is 0.3 ns. Based on electrical simulations, we can also estimate that a properly buffered wire running all along the 9 mm of the design would have a propagation delay of at least 2 ns. With the chosen structure, it then takes approximately 220 ps for a wire transition to propagate across an IP block, and 660 ps across a cluster.

In the bus case, cluster-level busses connect to 4 master IP blocks, 4 slave IP blocks, and the SoC-level bus, which adds a master and a slave port to each cluster-level bus. Thus, each cluster-level bus has 5 master and 5 slave ports, and the SoC-level bus has 9 master and 9 slave ports. The length of wire necessary to connect the 9 cluster ports of the top-level bus is at least the half-perimeter of the SoC-level interconnect area, i.e. approximately between 2 and 4 cluster sides (between 6 and 12 mm), depending on the actual position of the connection ports to the cluster busses.

Similarly, in the NoC case, two 5x5 (5 inputs, 5 outputs) switches are required in each cluster: one to handle requests between the cluster IP blocks and the SoC-level switch, and another identical one managing responses. The SoC-level switches are 9x9. However, since the NoC uses point-to-point connections, the maximum wire length between the center of the SoC, where the 9x9 switch resides, and the ports of the cluster-level switches is at worst only half of the equivalent bus length, i.e. 1 to 2 cluster sides, or between 3 and 6 mm.

Actual SoC designs differ from this generic example, but using it to elaborate comparison numbers, and correlating these to commonly reported numbers on actual SoC designs, provides valuable insight into the superior fundamentals of the NoC.

3.1 Maximum frequency estimation

From a physical implementation perspective, a major difference between a bus (or a crossbar) and the Arteris NoC is that the NoC uses a point-to-point, globally asynchronous, locally synchronous (GALS) approach, while the bus is synchronous and multipoint. As we shall see, NoC implementations can sustain far higher clock frequencies.

3.1.1 Maximum bus frequency

For some busses, control signals must traverse the bus several times within a clock cycle: assuming arbitration is pipelined and happens in an earlier cycle, data must be OR-ed between the masters and then fanned out to all the slaves. Slaves must decode the request, decide whether it is targeted at them, and then issue an acknowledgment that must be registered by all masters in the same cycle. This is quite typical of traditional busses. With more recent pipelined busses, control signals need only traverse the bus a single time within a clock cycle, usually at the expense of more pipeline stages and thus added cycles of latency. Due to process and test restrictions, recent busses do not utilize tri-states; they use MUXes or ORs to combine inputs together, and then fan out the results. Busses tend to have many wires (a hundred or more) and create congestion as these wires converge upon the MUX block. The MUXing can also be performed in a distributed fashion with less wire congestion. However, this lengthens the wires and increases the number of logic stages (at least one OR gate per bus port). When unconstrained, physical synthesis tools are likely to find some solution in between these bounds.
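Before estimating frequencies, it is convenient to gather the floorplan-derived numbers above in one place. The short sketch below simply recomputes them from the stated assumptions; the variable names are ours.

```python
# Recomputes the floorplan-driven figures quoted above. All constants are the
# stated assumptions of the design example; only the variable names are ours.

DIE_SIDE_MM     = 9.0                    # SoC die: 9 mm x 9 mm
CLUSTER_SIDE_MM = 3.0                    # 9 clusters of 3 mm x 3 mm
IP_SIDE_MM      = 1.0                    # IP blocks about 1 mm x 1 mm
WIRE_NS_PER_MM  = 2.0 / DIE_SIDE_MM      # buffered wire: ~2 ns across 9 mm

def propagation_ps(length_mm: float) -> float:
    return length_mm * WIRE_NS_PER_MM * 1000.0

print(f"across an IP block: {propagation_ps(IP_SIDE_MM):.0f} ps")       # ~220 ps
print(f"across a cluster  : {propagation_ps(CLUSTER_SIDE_MM):.0f} ps")  # ~660 ps

# Wire spans of the two SoC-level interconnect styles (half-perimeter estimates):
soc_bus_span_mm  = (2 * CLUSTER_SIDE_MM, 4 * CLUSTER_SIDE_MM)   # shared bus: 6 to 12 mm
noc_link_span_mm = (1 * CLUSTER_SIDE_MM, 2 * CLUSTER_SIDE_MM)   # point-to-point link: 3 to 6 mm
print("SoC-level bus wire span (mm):        ", soc_bus_span_mm)
print("longest NoC point-to-point link (mm):", noc_link_span_mm)
```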
For our design, the inter-cluster bus presents the greatest challenge, since it spans the most distance and the most masters/slaves. Let us assume a 9 mm total wire length to reach all the connection points of Figure 5, and about 12 gate stages for MUXing the data and performing the necessary control and decode operations. This leads to an approximate transport time of 2 + 12*0.06 = 2.7 ns. To compute the maximum operating frequency, we must also account for clock skew and DFF setup/output delay times. In a synchronous bus that must span the entire chip, clock skew can be very significant, around 1.0 ns. The maximum frequency for the bus-based transport is thus estimated to be 1/(2.7+1.0+0.3) = 250 MHz. This rule-of-thumb estimate matches operating frequencies reported for existing bus-based SoCs. Those using standard, lightly pipelined interconnects are usually in the range of 80 to 150 MHz, while the highest reported frequencies for pipelined interconnects are inching towards 250 MHz. (In this analysis we discount heavily handcrafted busses that reportedly run at 500 MHz; these are application-specific and require physical design techniques such as differential signaling.)

3.1.2 Maximum NoC frequency

In the NoC case, point-to-point links and GALS techniques greatly simplify the timing convergence problem at the SoC level. Synchronous clock domains typically need only span individual clusters. Arteris has demonstrated a mesochronous link technique operable at 800 MHz for an unlimited link length, at the expense of some latency, which will be accounted for later in this analysis. Thus only the switches and clusters limit the maximum frequency of our generic design. Within the synchronous clusters, point-to-point transport does not exceed 2 mm. Arteris has taken care to optimize the framing signals accompanying the packets so as to reduce the logic that decides to latch the data after its transport to 3 gates or fewer. Thus the transport time is no more than 2*2/9 + 3*0.06 = 0.6 ns. Within a cluster, skew is more easily controlled than at the SoC level and is typically about 0.3 ns. Taking the DFF into account, we compute a maximum operating frequency of 1/(0.6+0.3+0.3) = 800 MHz. In fact, this estimate is rather pessimistic, because within a synchronous cluster the switch pipeline stages tend to be distributed (this may be enforced by the physical synthesis tools) such that there should never be cluster-level wires spanning 2 mm. Experiments using a standard physical synthesis tool flow show that proper pipelining of the switches enables NoC operating frequencies of 800 MHz for 3x3 mm clusters. The very simple packet transport and carefully devised packet framing signals of the Arteris NoC architecture enable such pipelining (most pipelining stages are optional, in order to save latency cycles when high operating frequencies are not required). A layered, GALS, point-to-point architecture currently has a 3x frequency advantage over busses, and this advantage will increase as wire delays come to dominate gate delays with evolving silicon processes. Conversely, busses in large SoCs will continue to be limited to about 250 MHz.
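As a sanity check, both estimates can be reproduced with the same rule-of-thumb cycle-time formula. The sketch below uses only the constants stated above; the rounding to 250 MHz and 800 MHz is the same as in the text.

```python
# Rule-of-thumb maximum frequency: 1 / (wire delay + gate delay + skew + DFF).
# 90 nm assumptions from the text: 60 ps per gate stage, 0.3 ns DFF setup plus
# output delay, ~2 ns of buffered wire per 9 mm.

GATE_DELAY_NS  = 0.06
DFF_NS         = 0.3
WIRE_NS_PER_MM = 2.0 / 9.0

def f_max_mhz(wire_mm: float, gate_stages: int, skew_ns: float) -> float:
    transport_ns = wire_mm * WIRE_NS_PER_MM + gate_stages * GATE_DELAY_NS
    return 1000.0 / (transport_ns + skew_ns + DFF_NS)

# Synchronous SoC-level bus: ~9 mm of wire, ~12 gate stages, ~1.0 ns chip-wide skew.
print(f"bus: {f_max_mhz(9.0, 12, 1.0):.0f} MHz")   # ~250 MHz

# NoC link within a cluster: <= 2 mm of wire, <= 3 gate stages, ~0.3 ns local skew.
print(f"NoC: {f_max_mhz(2.0, 3, 0.3):.0f} MHz")    # ~800 MHz
```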
3.2 Peak throughput estimation

For the remainder of this analysis we assume frequencies of 250 MHz for the bus-based architecture, and 750 MHz for the NoC-based one. This relationship scales: for example, a set of implementations employing limited pipelining might run at 166 MHz vs. 500 MHz.

Assuming all busses carry 4-byte data, the aggregate throughput of the entire SoC (9 clusters) is 250*4*9 = 9 GB/s, assuming one transfer at a time per cluster. The NoC approach uses crossbar switches with 4-byte links. Aggregate peak throughput is limited by the rate at which the masters and slaves can send and receive data. Here, however, we must take into account two factors:
The NoC has a potential 10x throughput advantage over the bus-based approach. The actual ratio may be lower if multi-layered busses are used at the cluster level; however, because multi-layer busses are similar to crossbars, the added complexity could limit the target frequency.

3.3 Minimum latency

Latency is a difficult comparison criterion, because it depends on many application-specific factors: are we interested in minimum latency on a few critical paths, or in statistical latency over the entire set of dataflows? The overall system-level SoC performance usually depends on only a few latency-sensitive dataflows (typically, processor cache refills), while for most other dataflows only the achievable bandwidth matters. But even for the latter dataflows, latency does matter in the sense that high average latencies require intermediate storage buffers to maintain throughput, potentially leading to area overhead. Let us first analyze minimum latency for our two architectures. We assume that all slave IP blocks have a typical latency of 2 clock cycles @250 MHz, i.e. 8 ns. This translates into 6 clock cycles @750 MHz (for fairness of comparison we assume that the IP blocks run at the same speed as in the bus case).

3.3.1 Cluster-level latency

Bus-based transactions require the following tasks to be performed: arbitration, transport, decode and issue to the target, target processing, and transport of the response back to the initiator. Heavily pipelined busses capable of reaching 250 MHz will take at least 4 cycles, not counting target processing. At lower frequencies it might take fewer cycles, but the transaction duration remains approximately constant, i.e. a total of 4+2 = 6 cycles @250 MHz. For the heavily pipelined NoC able to reach 750 MHz, packet transport and switching takes 3 cycles, and the conversion between the NoC protocol and the IP protocol, together with the retiming of signals between the two clocks (the NoC clock is assumed to be a synchronous multiple of the IP block clock), takes on average one cycle @250 MHz. The request path is thus 2 cycles @250 MHz, and the response path is the same. Together with 2 cycles in the slave, we arrive at 6 cycles @250 MHz, identical to the bus architecture example. Again, achievable latency will vary according to a number of factors. For example, the NoC packet latency overhead is reduced and throughput is doubled when using 8-byte wide links rather than 4-byte links. Also, in many cases IP communication clock rates are limited more by their bus socket implementation than by the IP itself. This is very apparent with CPUs that run internally at a multiple of their communication socket frequency. For such IP, latency could be minimized by running the IP interface and the NoC at a much higher frequency.

3.3.2 SoC-level latency

When communication spans different clusters, we must consider the latency within each cluster, between the clusters, and within the target IP block. In the case of busses, the minimum round-trip latency is 4+4+4+2 = 14 cycles @250 MHz. But this is an ideal case: an access through the inter-cluster bus would otherwise keep that bus busy during the full target-cluster round trip, i.e. 6 cycles, so split/retry techniques must be used to sustain proper bandwidth utilization. Retry cycles typically add another inter-cluster bus round trip, ending up at 18 cycles @250 MHz. Yet another approach is to use multi-layered implementations, but these are even more challenging at the SoC level, especially with many masters.
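The cycle counts above can be tallied as follows, in cycles of the 250 MHz IP clock. This is merely a bookkeeping of the stated assumptions, not a timing model; the corresponding NoC figure at the SoC level is derived next.

```python
# Minimum round-trip latency tally (cycles @250 MHz), using the cycle counts
# stated in the text.

SLAVE_CYCLES = 2   # typical slave IP latency: 2 cycles @250 MHz

# Cluster-level access over a heavily pipelined bus: 4 bus cycles + target.
bus_cluster = 4 + SLAVE_CYCLES                  # -> 6

# Cluster-level access over the NoC: request and response each take ~2 cycles
# @250 MHz once the 750 MHz switch cycles and resynchronization are folded in.
noc_cluster = 2 + 2 + SLAVE_CYCLES              # -> 6, identical to the bus

# SoC-level access through the source cluster bus, the inter-cluster bus and
# the target cluster bus.
bus_soc_ideal = 4 + 4 + 4 + SLAVE_CYCLES        # -> 14 (ideal, no split/retry)
bus_soc_retry = bus_soc_ideal + 4               # extra inter-cluster round trip -> 18

print(bus_cluster, noc_cluster, bus_soc_ideal, bus_soc_retry)   # 6 6 14 18
```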
For requests, the NoC implementation takes 3 cycles @750 MHz in each pipelined switch, on average 1.5 resynchronization cycles on each top-level asynchronous link, and one cycle @250 MHz to resynchronize with the IP blocks. Traversing 3 switches, 2 links and 2 IP boundaries thus totals 3*3 + 2*1.5 + 2*1 = 14 cycles @750 MHz, i.e. less than 5 cycles @250 MHz. Doubling this number for response propagation and adding the slave latency gives 2*5+2 = 12 cycles @250 MHz. Thus, while packetization and GALS overhead introduce several cycles of latency, these are more than compensated for by the higher frequency. Clearly, the NoC advantage increases with the number of hierarchy levels, i.e. as the number of IP blocks scales up.

3.4 System throughput and average latency

Peak throughput and minimum latency are indications of potential system performance. In reality, conflicts in accessing shared resources (whether busses, point-to-point links or slaves) create arbitration delays that increase latency and lower throughput. These effects are very specific to dataflow patterns and thus very hard to quantify without simulation; special architectural exploration tools and models are necessary. Nevertheless, we can still estimate and refine our evaluations. Let us first assume that 20% of the traffic is inter-cluster, while 80% is cluster-level.

3.4.1 System throughput

Assuming that the inevitable retry or busy cycles limit the inter-cluster bus to 50% efficiency, it can handle 250*4*50% = 0.5 GB/s. This is the system bottleneck and limits overall traffic to 2.5 GB/s, each of the 9 clusters having a local traffic of 2500/9 = 277 MB/s, far from the potential peak. Inter-cluster peak traffic could be increased, typically by making the bus wider. Doubling the inter-cluster bus width would increase the total average traffic up to 5 GB/s, but at the expense of area, congestion, and inter-cluster latency. Similarly, a lower ratio of inter-cluster traffic, for example 10% instead of 20%, also leads to 5 GB/s total system throughput. For reasonable traffic patterns, the achievable system throughput is thus limited to much lower sustainable rates than the theoretical peak throughput, because the backbone performance does not scale with the traffic complexity requirements. Within the NoC architecture, the inter-cluster crossbar switches are much less limiting to system-level traffic. Assuming
The achievable system throughput is (750*4*9*2*50%)/20% ≈ 130 GB/s. This is higher than the peak throughput that the initiators and targets can handle, clearly illustrating the intrinsic scalability of the hierarchical NoC approach.

3.4.2 Average latency

A well-known behavior of communication systems with flow control (as opposed to strict TDMA without flow control) is that when the requested throughput approaches a certain limit (the so-called ‘knee of the curve’), throughput saturates and latency skyrockets to multiple orders of magnitude above the minimum latency. Analysis of a simple statistical model of a bus shared by 9 initiators, each requiring 10% of the bus bandwidth, shows that arbitration between transactions increases average latency by 3 times the transaction duration, with peaks exceeding 10 times. This assumes identical transaction types from all initiators, a random (Poisson) arrival distribution, and fair arbitration. As an example, at 90% bus load, arbitration between 4-word bursts takes on average 12 cycles between request and grant. Again, these simple estimates match quite well the characteristics observed in actual systems. They force bus-based architectures to limit burst sizes, typically to 16 or 32 bytes (the opposite trend, driving burst sizes upwards, is memory target efficiency), and also impose a practical limit on the achievable bus utilization, with 90% being an upper bound. According to this congestion modeling, for 16-byte bursts on a 32-bit bus, an average arbitration latency of 3x4 = 12 cycles @250 MHz is a realistic expectation for the inter-cluster bus. The same modeling shows that 8 cycles can be expected on the cluster-level busses with 4 initiators if they are loaded at the same rate. An inter-cluster request will be arbitrated in the source cluster, in the destination cluster and on the inter-cluster bus, raising the average latency from the minimum of 14 cycles to 14+8+12+8 = 42 cycles @250 MHz, and even more if each transaction must be arbitrated several times because of the retry mechanisms. These skyrocketing average latencies explain why:
The NoC approach, using full crossbar switches, does not encounter high latencies unless several ports of the inter-cluster switch reach high utilization. With our assumption that inter-cluster traffic is 20% of the total traffic or less, inter-cluster crossbar port utilization stays below 20%, so there is a lot of room to increase inter-cluster traffic before arbitration latency has a significant effect. Statistical modeling shows that at 50 GB/s aggregate throughput, the average arbitration latency is less than the average packet duration, in our case around 6 cycles @750 MHz = 2 cycles @250 MHz for an average packet size of 16 bytes, well below the arbitration overhead of traditional busses, even with a much higher aggregate traffic. In addition, if the ports between a cluster and the top-level switch are heavily loaded, we can easily duplicate these ports and split the dataflows between the parallel ports thus created, effectively doubling the available bandwidth between the cluster and the inter-cluster switches. This selectively reduces the arbitration latency on those paths, with limited impact (approximately 10%) on the cluster and inter-cluster switch areas.

3.5 Area and power comparison

3.5.1 Area

Traditional busses have been perceived as very area-efficient because of their shared nature. As we have already discussed, this shared nature drives both operating frequency and system performance scalability down. Some techniques have been introduced in recent busses to fix these issues:
Pipelining and buffering add up to 250K gates. Adding the bus MUXes, arbiters, address decoders, and all the state information necessary to track transaction retries within each bus, the total gate count for a system throughput of less than 10 GB/s is higher than 400K gates. The NoC implementation uses two 4x5 32-bit wide switches in each cluster; including three levels of pipelining, each amounts to about 8K gates. Because arbitration latency is much smaller than for busses, intermediate buffers are not needed in the switch fabric. The two inter-cluster switches are approximately 30K gates each, for a total of 9*8*2 + 2*30 ≈ 210K gates. Thus, for a smaller gate count, the NoC is able to handle an order of magnitude more aggregate traffic: up to 100 GB/s.

3.5.2 Dynamic power

With respect to power dissipation, the main difference between busses and the NoC is that busses fan out their wires to all targets, as we have seen in section 3.1, while the NoC uses point-to-point links. In the NoC case, proper floorplanning of the switches leads to a smaller wire length, and thus to less switched capacitance per transaction than for the bus, resulting in lower dynamic power. Also, the power-wasting bus retry cycles do not exist in the NoC case. It is easy, in a packet-based point-to-point architecture, to implement a power-saving strategy: for example, the framing signals around the packet can be used to compute clock gating signals for all pipelining stages, so that very little switching power is used on quiet paths (only the state machines that monitor the framing signals need the clock). A fair comparison between the bus and NoC approaches requires actual layout experiments on a real application, as well as consideration of the exact amount of clock gating that each architecture allows. But we expect dynamic power consumption to usually be lower for the NoC than for busses at equivalent system performance levels.

3.5.3 Static power

Silicon processes below 0.13 µm drain considerable static power. Static power consumption is roughly proportional to silicon area, and we have seen in section 3.5.1 that, for our example, area is lower for the NoC than for traditional busses. Therefore static power consumption should also be lower by the same factor. Again, comparison numbers can only be meaningful for actual layouts of implementations with the same system performance level.

3.5.4 Power management

The modular, point-to-point NoC approach enables several power management techniques that are difficult to implement with traditional busses:
Quantifying the power consumption improvements due to these techniques is too closely tied to the application to be estimated on our generic example.

4. Comparison to crossbars

In the previous section the NoC was compared and contrasted with traditional bus structures. We pointed out that system-level throughput and latency may be improved in bus-based architectures by employing pipelined crossbars or multi-layer busses. However, because traditional crossbars still mix the transaction, transport and physical layers in a way similar to traditional busses, they present only partial solutions. They continue to suffer from the following:
Common crossbars also lack additional services that the Arteris NoC offers and which are outside the scope of this whitepaper, such as error logging, runtime reprogrammable features, and so forth.

5. Summary and conclusion

Table 1 summarizes the comparison results described in section 3:
  Criterion                              Traditional busses          Arteris NoC
  Maximum operating frequency            ~250 MHz                    ~750-800 MHz
  Aggregate peak throughput              ~9 GB/s                     ~10x the bus
  Sustainable system throughput          2.5 to 5 GB/s               ~100 GB/s
  Minimum SoC-level latency              14 to 18 cycles @250 MHz    12 cycles @250 MHz
  Average SoC-level latency under load   ~42 cycles @250 MHz         close to the minimum
  Interconnect area                      >400K gates                 ~210K gates

Table 1: Comparison table summary

This table shows that, for designs of the complexity level used for the comparison, the NoC approach has a clear advantage over traditional busses for nearly all criteria, most notably system throughput. As discussed in section 4, hierarchies of crossbars or multi-layered busses have characteristics somewhere in between traditional busses and the NoC; however, they fall far short of the NoC with respect to performance and complexity. Detailed comparison results necessarily depend on the SoC application, but with increasing SoC complexity and performance, the NoC is clearly the best IP block integration solution for high-end SoC designs today and for the foreseeable future.
More info: www.arteris.com

Copyright © 2005 Arteris. All rights reserved. Arteris, the Arteris logo, NocCompiler, and NoCexplorer are trademarks of Arteris S.A. All other trademarks are property of their respective owners.