Addressing System-Level Challenges in High-Speed Comm Chips
To realize the benefits of dense integration in high-speed communication chips for wireline applications, design teams must cope not only with a wide variety of functional and architectural complexities, but also with rapidly evolving standards. Integrating sub-systems that perform critical functions, such as packet processing, quality-of-service (QoS), framing, and mapping, into a single chip requires expertise in these technologies and a thorough understanding of the relevant standards. In addition, the volume of data processed in these types of chips often requires complex memory-management functions, and the chip's memories must be sized and accessed for optimal performance in the context of those functions.

This complexity poses cost and schedule risks to the development of the system-on-chip (SoC) designs that are typically the key components of high-speed communication systems. Designers can mitigate these risks by using system-level design methods to evaluate architectural options and verify functionality and standards conformance as early as possible in the design cycle.

Partitioning Wireline Designs

Historically, communication-system developers have implemented lower-layer functions in an SoC and higher-layer functions in programmable processors. However, this partitioning is becoming blurred as systems combine different standards and as more functions and memory can be integrated into single chips. There is a trend to merge protocol layers and sub-layers. In general, processing functions are candidates for integration when they use the same silicon process, operate at similar frequencies, perform similar or contiguous processing steps, operate on the same or easily mappable data types, or use the same resources or interfaces.
This merging of functions reduces the number of partitions and interfaces in the system, the complexity of the system, and the complexity and time required for system verification. For example, in the 10 Gigabit Ethernet physical and data-link layers, the clock rates of the physical coding sublayer (PCS), reconciliation sublayer (RS), and media access control (MAC) sublayer are within about 2X of each other. These sublayers are candidates for implementation in the same silicon technology, possibly on the same chip. Figure 2 shows a line-card design that integrates the dataplane functions into only three partitions:
Figure 3 shows examples of line-card designs that apply the above three partitions in progressively higher levels of chip integration and functional diversity or aggregation.
Integration trade-offs often compare per-port cost, footprint, or power against the time and cost of development and the flexibility to alter partitioning in future revisions. Higher integration together with more ports usually results in lower per-port cost. For example, options C, D, and E in Figure 3 typically result in lower-cost chips, per port, than options A or B, and option C will have lower cost than four copies of option A. Some functions, such as QoS packet classification and VLANs, are only needed when a chip supports multiple ports or high-bandwidth ports that aggregate traffic from several lower-bandwidth streams.

Frame Processing

Some framing-related functions change as line speeds increase. For example, the carrier-sense multiple-access with collision detection (CSMA/CD) function used in half-duplex 10/100/1000 Ethernet MACs was removed by the IEEE in the 10 Gigabit Ethernet standard, which is full-duplex only. Other framing-related functions, such as mapping and forward error correction (FEC), become more complex as line speeds increase.

FEC is used at high data rates to reduce the bit error rate. It is one of the largest framing-related blocks and, except for repeater systems, must be integrated with frame processing. Most FEC algorithms are implemented with cyclic or turbo (product) coding. An integrated cyclic coder can achieve up to 7 dB of gain, while an integrated turbo coder can achieve 11 dB or more of gain and support transmission over distances up to a few hundred kilometers. FEC functions use an encoder and decoder, with associated registers or memory (see the embedded memory section below), and turbo coders can be large (tens to hundreds of thousands of gates) and power-hungry. Recent chip-manufacturing processes make it possible to integrate high-performance turbo coders and decoders.

New variations on legacy protocols can require substantial framing-related processing.
Some of these protocols, like virtual concatenation (Vcat) and the generic framing procedure (GFP), as well as mapping functions, imply a chip with multiple or aggregated ports (options C, D, and E in Figure 3). The Vcat protocol supports fine-grained aggregation of synchronous payload envelopes (SPEs). The link-capacity adjustment scheme (LCAS) protocol dynamically manages the resizing of Vcat bandwidth allocation and link connectivity in a way that does not disrupt traffic. Both Vcat and LCAS involve complex processing that benefits from highly integrated framing functions.

GFP allows diverse types of traffic to be encapsulated into frames on a synchronous optical network. GFP supports diverse network topologies, channel sharing, and scalable bandwidth, making it well suited for various levels of integration. The resilient packet ring (RPR) protocol defines a frame that can be encapsulated into GFP frames and supports use of all bandwidth in both fibers of a ring-topology network.

Mapping and demapping are closely related to framing, with which they can be tightly integrated. Mapping places a packet, frame, or cell into a different-sized payload, such as placing IP packets into ATM cells and then into Sonet virtual-tributary (VT) payloads, or placing many plesiochronous digital hierarchy (PDH) signals into one Sonet payload. Demapping performs the complementary function and is more complex than mapping because it must handle synchronization of the payload with the recovered clock and correction for signal jitter, wander, and related problems.

Packet Processing

Figure 3 above shows three options (B, D, and E) that merge packet processing with frame processing. For example, the upper and lower portions of Layer 2 protocols, such as 10 Gigabit Ethernet's logical link control (LLC) and MAC sublayers, respectively, may benefit from merging frame and packet processing functions.
This eliminates the interface between the partitions, which can result in higher speed and lower cost. However, these advantages must be weighed against the reduced flexibility to modify functions in future chip revisions.

Traffic management functions are needed when a chip supports multiple or aggregated ports (options C, D, and E in Figure 3). The functions can include packet classification, queuing writes to the switch, and scheduling reads from the switch. Algorithms such as virtual output queuing (VOQ) or combined input-output queuing (CIOQ), plus their associated buffers, can be implemented on the chip. Quality-of-service (QoS) tasks may be performed across queuing, switching, scheduling, and other traffic-management functions. Firewall processing may look deeply into packet payloads and thus benefit from integration with other packet-processing functions.

Embedded Memory

Embedded SRAM, content-addressable memory (CAM), SDRAM, ROM, RAM, and register-file memories are available as IP blocks. SRAM is especially common in communication-system chips. Most memory compilers are capable of generating almost any SRAM size, data-type size, and row count. CAM is often used to improve packet-classification performance, although pipeline stages may be needed if there is not enough CAM to hold an adequate set of packet headers, or if the memory is not wide enough to hold entire headers. SDRAM can be used instead of SRAM when the RAM size makes the cost of SRAM prohibitive.

Communication tasks often need memory with odd data-type sizes. For example, packet-classification algorithms use data types that are tens of bytes wide to store packet headers. In such cases, the use of commercial memory chips, which are designed for standard data types, may result not only in low bandwidth but also in unused parts of each memory chip. Embedded CAM memory may be a better choice.
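The cost of fitting a wide, odd-sized entry into standard-width memory words can be estimated with simple arithmetic. The following is a minimal sketch, not a sizing tool: the 37-byte key and the 64-bit word width are hypothetical values chosen only for illustration, and `storage_fit` is an invented helper name.

```python
def storage_fit(entry_bits: int, word_bits: int) -> dict:
    """Describe how one entry maps onto fixed-width memory words."""
    # Ceiling division: number of word accesses needed to read one entry.
    words = -(-entry_bits // word_bits)
    allocated = words * word_bits
    return {
        "words_per_entry": words,
        "wasted_bits": allocated - entry_bits,   # padding that stores nothing
        "utilization": entry_bits / allocated,
    }


# Hypothetical 37-byte (296-bit) packet-classification key.
KEY_BITS = 37 * 8

for width in (64, KEY_BITS):  # standard-width word vs. a custom embedded width
    print(width, storage_fit(KEY_BITS, width))
```

Under these assumed numbers, a 64-bit-wide memory needs five accesses per key and leaves 24 bits unused in every entry, while an embedded memory generated at the key's own width needs one access and wastes nothing.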
Embedded memories can be large (several megabytes) and fast (300 to 400 MHz, single cycle) with data paths in the hundreds of bits. When parts of packets are processed only by one block, such as a mapping function, that block can have a dedicated memory, sized according to its task. For large packets, a single multi-port memory can service multiple processing blocks. FEC is a candidate for this configuration. Single-ported memories, however, run faster and use less power than equivalent-sized multi-ported memories.

System-Level Design Methodology

Based on a chip's functional specification, an executable specification can be developed in the SystemC language using the transaction-level modeling (TLM) technique, which models events (such as interrupts) and exchanges of data (such as register or memory accesses). Under this approach, an untimed specification is first developed, in which event ordering is specified but without timing values. Then, the untimed specification is refined to a timed specification, in which events are assigned latencies and initiation intervals.

Running simulations with the timed executable specification reveals system bottlenecks and provides quantitative data on how to improve the system architecture. It also supports rapid exploration and validation of architectural alternatives in the quest for optimal system performance. Existing RTL models can be verified together with the SystemC executable specification, enabling the reuse of existing designs, which is often mandatory for successful management of the design effort.

For embedded software development, the TLM-based executable specification provides a very fast hardware prototype. Software can begin running on the untimed specification and move to the timed specification, where it can achieve bit-accurate register and memory mappings with realistic hardware performance. From the timed executable specification, the design is refined to an RTL description.
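The difference between the untimed and timed views can be sketched in a few lines. This is an illustrative model in plain Python, not SystemC and not the flow's actual API: the stage names, latencies, and initiation intervals are hypothetical, and the model assumes a saturated input and unbounded buffering between stages (no backpressure).

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    latency: int              # cycles from accepting an item to emitting it
    initiation_interval: int  # minimum cycles between successive accepts


def run_untimed(stages, n_items):
    """Untimed view: only the order of events is defined, no cycle counts."""
    return [(item, s.name) for item in range(n_items) for s in stages]


def run_timed(stages, n_items):
    """Timed view: cycle at which each item leaves the last stage."""
    next_free = [0] * len(stages)  # earliest cycle each stage can accept again
    done = []
    for _ in range(n_items):
        t = 0  # every item is offered at cycle 0 (saturated input)
        for i, s in enumerate(stages):
            start = max(t, next_free[i])          # wait for data and for the stage
            next_free[i] = start + s.initiation_interval
            t = start + s.latency
        done.append(t)
    return done


# Hypothetical three-stage data path.
PIPELINE = [Stage("framer", 4, 1), Stage("classifier", 10, 3), Stage("scheduler", 2, 1)]

print(run_untimed(PIPELINE, 1))
print(run_timed(PIPELINE, 4))   # [16, 19, 22, 25]
```

With these assumed values, items leave the pipeline three cycles apart, which matches the initiation interval of the hypothetical classifier stage and so identifies it as the bottleneck. The untimed view cannot show this, because it carries ordering only. A timed model of this kind is the representation that is then refined to an RTL description.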
This refinement is typically done using hardware description languages such as Verilog and, in the future, SystemVerilog. Using co-simulation, each module is verified against its design representation within the context of the whole system. Verification progresses module by module until all modules are verified. This bottom-up RTL verification methodology maintains consistency between the design representations and provides simulation efficiency. Finally, the RTL modules are integrated to perform chip-level tests.

The ability to validate functionality and architectural performance at the earliest possible phase results in a predictable project schedule with less risk of expensive and late scope changes or design iterations. However, if scope changes are needed, the executable specification methodology provides a means to quickly validate the impact of changes.

The executable specification works particularly well with complex functions, such as packet classification, route look-up algorithms and data structures, data-buffer management, and traffic management. These functions need to be exercised with real-world test stimuli long before the architecture is frozen. Rapid system definition together with fast simulation provides the basis for realistic verification.

Standards Conformance Verification

Complex interactions, such as QoS functions in multi-protocol environments, can be particularly difficult and time-consuming to thoroughly verify and debug. Without good tools, the verification process can be very slow, with one minute of realistic traffic potentially requiring several days of simulation. It is common for broadband systems that support QoS to miss their performance specifications by 10 to 20 percent due to problems with verification. Verification requires chip-level simulation using stimulus generators and response checkers to verify worst-case and typical traffic conditions.
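The stimulus-generator and response-checker arrangement can be sketched as follows. This is a minimal illustration under stated assumptions, not a conformance suite: the queue-selection rule, the function names, and the choice of back-to-back minimum-length frames as the "worst case" are all hypothetical, and `dut` is a stand-in for the simulated design. The 64- and 1518-byte bounds are the usual minimum and maximum untagged Ethernet frame sizes.

```python
import random

# Stand-in design: a lookup table that should agree with the reference rule.
QUEUE_TABLE = [b >> 5 for b in range(256)]


def stimulus(seed, n_typical, n_worst):
    """Yield (label, frame): random 'typical' frames, then fixed worst-case ones."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    for _ in range(n_typical):
        length = rng.randint(64, 1518)
        yield "typical", bytes(rng.randrange(256) for _ in range(length))
    for _ in range(n_worst):
        yield "worst", bytes([0xFF] * 64)  # minimum-length, back-to-back


def reference_model(frame):
    """Expected response: hypothetical rule mapping the first byte to 8 queues."""
    return frame[0] >> 5


def dut(frame):
    """Placeholder for the design under test."""
    return QUEUE_TABLE[frame[0]]


def run_checker(dut_fn, seed=1, n_typical=100, n_worst=100):
    """Drive dut_fn with the stimulus and collect every mismatch."""
    mismatches = []
    for idx, (label, frame) in enumerate(stimulus(seed, n_typical, n_worst)):
        expected, actual = reference_model(frame), dut_fn(frame)
        if expected != actual:
            mismatches.append((idx, label, expected, actual))
    return mismatches


print(run_checker(dut))  # an empty list means no mismatches were observed
```

Keeping the reference model separate from the design stand-in is the point of the structure: the checker only compares the two, so the same stimulus can be replayed against a later RTL model in place of `dut`.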
Figure 4 illustrates a test bench configuration that can verify compliance in a standard simulation environment.
Figure 4 shows stimulus data streams generated by a simulation environment being compared with expected outputs provided by the testbench. Conformance testing can require extremely long test sequences, working across several protocol layers and hardware/software interfaces. The analysis of output allows conformance testing at all hierarchical design levels, including system specification, algorithm design, implementation, and system integration.

About the Authors

Vincent Thibault is a technical marketing manager for Synopsys Professional Services in Synopsys' IP and Design Services group. He received his master's degree in electronics from the University of Paris, Orsay. Vincent can be reached at vincent.thibault@synopsys.com.
All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved.