SuperHyway provides SoC backbone
Recent technology improvements have made it cost-effective to integrate components previously connected on a printed-circuit board onto a single piece of silicon. These so-called systems-on-chip (SoC) generally comprise most of the blocks commonly found on a computer motherboard plus some application-specific intellectual property (IP). This means that design issues formerly the province of systems designers are now within the realm of the chip architect. As a result, interconnection schemes common at the system and network level, such as packet switching, must now be considered at the SoC level.

This introduces a whole slew of issues into chip design, since bus architectures designed for boards are seldom efficient on silicon. For example, in conventional board design, signal count is directly related to cost, so shared tristate buses and multiplexed designs are common. On-chip, signal count matters less, so it is possible to take advantage of the higher performance and design simplifications that nonmultiplexed signaling can offer. It also avoids the problems tristate buses pose for the current generation of synthesis and production test tools.

Paramount among the concerns for building integration backplanes is making the component IP reusable in other applications. Central to this task is providing an infrastructure for interconnecting components to a common address space. Many existing approaches are skewed toward providing performance efficiently at the expense of simple reuse, as can be seen from the number of chip architectures that have a profusion of different bus structures linked by nontrivial bridges.

Verification is frequently a major challenge for SoC designs. Complex boards are often debugged with a logic analyzer, which can catch software bugs, particularly those sensitive to timing, but that strategy is difficult to continue with superintegrated chips. It is a distinguishing characteristic of good system-on-chip interconnect architectures that they are designed to enable debugging of the interaction of integrated components.

In an effort to solve such problems, STMicroelectronics and Hitachi Ltd. have developed a new chip-level interconnection architecture that provides a clear separation between implementation concerns that affect performance (bus widths, arbitration, pipelining and so on) and the architectural issues of interfaces and compatibility. This approach allows systems to hit a particular cost/performance point while retaining simple and verifiable interoperability.

Dubbed SuperHyway, the interconnection architecture was designed within the Virtual Socket Interface Alliance (VSIA) standard so that it would act as the backbone for all superintegrated devices based on the SuperH and ST20 processor core families. The architecture uses a packet-based transaction protocol layered on top of a highly integrated physical communications infrastructure. Most implementations turn out to be richly connected networks that are efficient at controlling the delivery of bandwidth and latency; this contrasts sharply with the monolithic structures commonly found in many other on-chip interconnect schemes. SuperHyway is a scalable, general-purpose interconnect that handles all of its attached modules in the same way.
This approach insulates the module designer from the system issues over which he or she has no control, while allowing the system designer to tune system specifics for performance through the system topology and arbitration logic.

Essentially, the new interconnection scheme grew out of our experience with an earlier protocol that STMicro has used for years, called the Request Grant Valid protocol. It also conforms to the VSIA's specifications, incorporating all the major features required for IP to be included in VSIA-level products without additional cost.

Both the VSIA spec and SuperHyway implement a layered communications model in which a transaction defines the communication between two modules in a system. Most communications involve data accesses. A data access may involve the transfer of a single byte or a sequence of bytes.

Packet pair

The packet layer implements each operation as a "request response" pair. These packets carry a fixed quantum of data, which maps directly onto the interconnect implementation. The cell layer breaks these packets into a series of cells, each of the right width to match the route through the interconnect that the packet will take. Finally, the physical layer is responsible for physical encoding of these cells, adding framing and flow-control information (see the sketch below).

Simple protocol verification may be performed at each layer in the hierarchy. But the crucial advantage of using a strongly layered model is that it allows designers to develop complex systems that can trade performance, functionality and implementation cost. These systems are able to use a mix-and-match policy for devices that have differing concepts of word length or data path widths.

Within these constraints, the new interconnection scheme supports four basic types of interface. The simplest, the peripheral interface, supports a subset of the full transaction set and is targeted at small, low-data-rate modules with no requirements for features such as split transactions and large or complex operations. The next level up is the basic interface, which adds support for split transactions to the peripheral interface. To simplify module design, all request/response packets are symmetrical and ordering of all operation sequences is enforced for the module by the system. The advanced interface extends the basic implementation to a full split-transaction packet implementation with asymmetric request/response packets and full error support. It also lets the system relax request/response packet-ordering properties to allow modules to take advantage of concurrency in the system. Finally, the pipelined interface adds arbitration pipelining to the advanced interface, allowing arbitration delays to be hidden across large systems.

The ports on the packet-based interconnection scheme provide access points to the interconnect for the various peripheral types. The ports standardize the interface to the bus, reduce system design risk and enable the reuse of previously qualified modules. These modules use the interface to communicate through a predefined set of communication primitives, or transactions, which are then mapped onto the interface. Each interface is tailored to a set of functions suitable for that type of module. These interfaces are standardized and usually vary in port width, in whether the module is an initiator or a target, or in the advanced options available to that type of module. The current SuperHyway supports two to 32 ports in a system.
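To make the layering concrete, the following Python sketch shows how a data access might be carried as a packet with a fixed payload and then split by the cell layer into cells sized to the link it will traverse. The class names, field names and 32-byte payload are illustrative assumptions, not the actual SuperHyway packet encoding.

# Minimal sketch of packet-to-cell splitting (assumed names and sizes).
from dataclasses import dataclass
from typing import List

@dataclass
class Packet:
    kind: str        # "request" or "response"
    address: int
    data: bytes      # fixed quantum of data carried by the packet

def split_into_cells(pkt: Packet, link_width_bytes: int) -> List[bytes]:
    """Cell layer: break a packet's payload into cells matched to the link width."""
    payload = pkt.data
    return [payload[i:i + link_width_bytes]
            for i in range(0, len(payload), link_width_bytes)]

# A 32-byte request sent over an 8-byte-wide route becomes 4 cells;
# the same packet over a 16-byte-wide route becomes 2 cells.
req = Packet(kind="request", address=0x40000000, data=bytes(32))
assert len(split_into_cells(req, 8)) == 4
assert len(split_into_cells(req, 16)) == 2

Because only the cell layer knows the link width, the same request/response packets can cross routes of different widths, which is what lets modules with differing data-path widths be mixed and matched.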
Each of these ports supports data widths of 8 to 128 bits, with the capability to extend this in the future.

In this scheme, an initiator begins a transaction by creating a request packet and sending it to a target port; the target then responds with a response packet to close the transaction. Not all ports are the same: some can be both initiators and targets, while others are only targets. Typically the CPU, PCI controller and DMA controller are initiators in a system, whereas memory subsystems, slow peripherals and I/O are targets. An MPEG decoder would be both initiator (for access to external memory) and target (for setup and decode packets). Depending on the type of target, the initiator may send several requests before receiving the first response, and in some cases the responses may come back out of order. This pipelined approach allows the system to maximize bandwidth and hide latency in the various subsystems.

The VSI-based interconnection scheme itself implements no expensive hardware for snooping or intercepting memory transactions to keep multiple copies of data coherent. However, it does provide transactions that may be sent to a CPU or other device containing copies of data to cause it to change the state of a line. This gives another memory user coherent access to main memory.

As at the system and network levels, arbitration schemes are typically designed on a per-system basis, since they may be closely related to the real-time data-flow requirements of the application. In an initial implementation of the scheme, the system designer may choose from a number of arbitration schemes that can be automatically generated. These include a fixed-priority scheme, a weighted-round-robin scheme and a weighted-fair scheme, along with other latency/bandwidth control systems. For many systems a weighted-fair scheme is selected, as it gives each port a guaranteed bandwidth and latency while ensuring that unused resources remain available to all devices in the system (see the sketch below).

Virtually all the data traffic on this interconnection architecture is split-transaction based, so in a busy system multiple transactions can be posted to a target module and queued, waiting for completion. This transaction pipelining improves latency and maximizes the utilization of target modules. The pipeline queue may be different for each module.

A particular advantage of this VSI-derived interconnection architecture is that it supports a wide set of operations and operation sizes, ranging from simple memory primitives (load and store) to more complex operations such as cache-coherency operations (flush and purge). These have been defined in such a way that a module is only required to support the operations it needs for correct operation. This ensures that a simple device such as a UART is not burdened with the overhead required for efficient cache coherency or burst operation. It also ensures future compatibility, since it is possible to extend the architectural operation set without changes to any existing IP, including both the module and interconnect implementations. The supported set of operations includes loads of between 1 and 32 bytes, stores of up to 32 bytes, read-modify-write and swap atomic primitives, cache operations and access-aggregation transfers of known groups.
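As an illustration of how a weighted scheme can combine guaranteed shares with reuse of idle bandwidth, the Python sketch below grants one requesting port per cycle in proportion to assigned weights, letting an idle port's unused share flow to whoever is busy. The port names, weights and credit-based policy are assumptions for illustration only; the arbiters generated for SuperHyway are hardware implementations, not this code.

# Illustrative weighted arbiter sketch (assumed policy, not the generated logic).
from typing import Dict, List, Optional

class WeightedArbiter:
    def __init__(self, weights: Dict[str, int]):
        self.weights = weights                 # relative bandwidth share per port
        self.credit = {p: 0 for p in weights}  # accumulated entitlement

    def grant(self, requesting: List[str]) -> Optional[str]:
        """Pick one requesting port for this cycle."""
        if not requesting:
            return None
        # Every port accrues credit in proportion to its weight ...
        for p in self.weights:
            self.credit[p] += self.weights[p]
        # ... and the requesting port with the most credit wins, so unused
        # shares are eventually spent by whichever ports are actually busy.
        winner = max(requesting, key=lambda p: self.credit[p])
        self.credit[winner] = 0
        return winner

# Example: CPU gets the largest share, DMA the next, UART the smallest.
arb = WeightedArbiter({"cpu": 3, "dma": 2, "uart": 1})
trace = [arb.grant(["cpu", "dma", "uart"]) for _ in range(12)]
print(trace)   # the CPU is granted most often, but no port is starved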
Another advantage of this architecture is that it can be constructed with multiple levels, with the highest-performance modules connecting through ports directly on the SuperHyway and slower peripherals grouped together on a special Peripheral Bus (Pbus), where they share access to the SuperHyway through a single port. A multilevel hierarchy can be constructed by dedicating wider data paths to high-bandwidth transactions and narrower ones to lower-bandwidth modules. This "wiring" is at the physical level, and the overall logical function of the architecture remains the same irrespective of the implementation hierarchy.

We designed the architecture so that it could be tuned for performance, expanding bus width and bus clock speed to increase bandwidth. Bus widths up to 128 bits can easily be supported and clock speeds beyond 200 MHz are achievable; current modules support peak bandwidths of 3.2 Gbytes/s.

We also designed it so that it could be tuned for cost, optimizing the hierarchical structure so that the high-performance (wide and fast) part of the infrastructure is limited to specific sections of the chip, lower-performance peripherals are multiplexed through a single SuperHyway port and data paths are honed to the narrowest widths that suit the system load. A minimal system could consist of a high-performance segment of SuperHyway between the CPU and cache, with the rest of the peripheral I/O on a smaller Pbus.
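The quoted peak figure follows directly from the maximum width and clock rate cited above: a 128-bit (16-byte) data path moving one word per cycle at 200 MHz transfers 16 bytes x 200 million cycles/s. A quick check, assuming one transfer per cycle:

# Peak bandwidth of the widest, fastest configuration cited above.
width_bytes = 128 // 8          # 128-bit data path = 16 bytes per transfer
clock_hz = 200_000_000          # 200 MHz, one transfer per cycle assumed
peak_bytes_per_s = width_bytes * clock_hz
print(peak_bytes_per_s / 1e9)   # 3.2 Gbytes/s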