Going to the Max on Bus Efficiency

Going to the Max on Bus Efficiency
By Todd Dinkelman, Micron, CommsDesign.com
December 16, 2003 (10:10 a.m. EST)
URL: http://www.eetimes.com/story/OEG20031216S0015

While many designers will point to network processors and traffic management devices as the bottlenecks in high-speed networking design, the biggest culprit of data path slowdowns could be the memory interface. Applications such as metro/access, base station, and core network systems require maximized performance of their numerous memory buses. It is imperative therefore to understand how the available memory types perform under each unique command and data flow pattern.

This article provides an overview of the applicable memory device types—including Quad Data Rate (QDR) SRAM, double data rate (DDR) SRAM, DDR SDRAM, and DDR2 SDRAM, as well as reduced latency DRAM II (RLDRAM II)—with emphasis on understanding how the command and data buses are utilized to achieve the desired levels of performance. Building on an understanding of command and data flows, graphical tools are provided to enable selection of the optimal memory architecture f or a given access pattern. The article also discusses how different memory access patterns are optimized and concludes with ideas on how to make designs flexible enough to support a range of performance and cost targets.

SRAM Demands
It is important to understand the demands of SRAMs due to their use in numerous areas of network applications. SRAMs are classified in terms of the nature of traffic needing accommodation. Lookup-style accesses entail mostly READ cycles and occasional WRITEs. Packet buffering-style accesses entail balanced READ and WRITE operations. Packet classification or quality of service (QoS) processing touches the packet more than once resulting in an unbalanced ratio between READs and WRITEs, which vary depending on network traffic. In addition to the variety of bus traffic, higher bandwidths must be achieved to satisfy signal integrity requirements and maintain pin efficiency.

This diversity results in a solution requiring three seemingly different, yet complementary SRAM architectures. The first of the new architectures, QDR SRAM, was named to reflect how many data transfers occur during one clock cycle. This SRAM comprises separate data-in and data-out buses. Each bus runs concurrently and at double data rate (i.e. transfers two data bits per pin per clock cycle) resulting in quad data rate. It allows simultaneous READ and WRITE operations. This device is optimized in applications where READ and WRITE operations are balanced.

DDR SRAM and separate I/O (SIO) DDR SRAM are two other architectures developed. DDR SRAM is named to reflect the number of data transfers it produces during one clock cycle. It is comprised of a single common I/O (CIO) data bus. Therefore, it is more accurately referred to as the common I/O DDR SRAM. The bus runs at a double data rate, and the device is optimized for applications in which the bus operates in one direction for long periods of time. Unlike QDR SRAM, only one read or one write is in flight at one time. The goal of the DDR arch itecture is to minimize the required number of bus turnaround operations and to maintain the highest efficiency possible on the data bus.

SIO DDR SRAM devices feature individual data input and output buses, much like those found in a QDR device. As with CIO DDR SRAM, only one read or write is in progress at any one time. Unlike CIO DDR SRAM, there is no need to insert idle cycles for bus turnarounds, allowing the device to accept one random read or write for each clock cycle. All three SRAM architectures operate at either 1.5- or 1.8-V nominal I/O levels and are designed to meet the needs of higher frequencies, requiring them to sacrifice small die size and cost-effectiveness in favor of performance.

To illustrate the differences between the variety of SRAM architectures, Figure 1 shows the sustainable bandwidth per pin for each architecture in relation to the ratio of their respective READ versus WRITE operations.

Figure 1: Comparison of SRAM bus utilization .
The challenges inherent in most high-speed network designs include balancing raw bandwidth, latency, a wide variety of bus-use models, and, of course, cost and performance. For example, the highest-defined current OC specification stipulates a transfer rate of 40 Gbit/s. If a single OC-768 full-duplex port is bombarded with a stream of minimum-sized packets at 40 Gbit/s, one packet is received and one packet is transmitted every 10 ns. If the stipulation is made that the port cannot drop a single packet, the ideal memory for this application would have a latency (tRC ) of 5 ns.
Presently, the usual method of meeting these performance parameters is through the use of SRAM technology. However, depending on the density requirements, this may not be cost-effective. On the other hand, standard SDRAM provides the lowest cost per bit but does not meet the performance requirements (Figure 2). Therefore, the ideal solution combines SRAM's high performance with DRAM's high density and cost-efficiency. To that end, Micron Technology and Infineon Technologies developed a new DRAM architecture called reduced latency DRAM (RLDRAM) that does just that.

Figure 2: Diagram illustrating cost/bit vs. latency.
RLDRAM: What it is?
RLDRAM is a high-bandwidth, low-latency DRAM designed to enable network infrastructure systems up to OC-768 rates and provide the scalability needed to make it acceptable for future applications. The first RLDRAM generation features a top clock frequency of 300 MHz, a data rate of 600 Mbit/s per pin and a latency of just 25 ns at 300 MHz clock. The technology's second generation, RLDRAM II, raises the performance level dramatically with a data rate of 800 Mbit/s per pin, a latency of 20 ns, and, of course, the increased densities of its DRAM architecture .
RLDRAM II includes eight internal banks versus the four found in DDR SDRAM, enabling high bus efficiency with cyclic bank addressing. By pipelining these operations, a data request rate equal to the device clock rate is sustained when data ordering in the banks is permitted. With RLDRAM's small minimum packet size, this data rate is easily achieved for writes, and fewer stalls are experienced for reads than on any other type of DRAM due to the increased bank availability.
The RLDRAM II architecture offers both SIO and CIO options. The SIO device features separate read and write ports to eliminate bus turnaround cycles and contention and is optimized for near-term read and write balance, enabling it to achieve full bus utilization.
The RLDRAM II CIO device includes a shared read/write port, which requires one additional cycle to turn the bus around. Its architecture is optimized for data streaming, in which the near-term bus operation is either 100 percent READ or 100 percent WRITE, indepe ndent of the long-term balance. The version preferred by the customer should provide the optimal compromise between performance and utilization for the target application.
Figure 3 shows the differences in performance between the SIO and CIO devices when using different burst lengths. As illustrated, the RLDRAM II SIO part provides 100-percent bus utilization for 4- and 8-word bursts at a 1:1 ratio. At the extremes, this falls to 50 percent. On the other hand, the CIO device begins with 67 percent utilization for the burst of 2 with a read-to-write ratio of 1:1 and improves from there. The 4-word burst begins with 80-percent utilization, while the 8-word burst begins with 89 percent. Back-to-back operations resulting in 16 words and 32 words of data demonstrate even higher bus utilization (94 percent and 97 percent respectively).

Figure 3: Comparison of DRAM bus utilization. center>
The RLDRAM II device provides numerous benefits over other network-specific DRAM devices. The 9-, 18-, and 36-bit wide data bus options provide extra bits to implement parity/ECC. OC-768 systems need RLDRAM II's 400 MHz clock and 800 Mbit/s per pin data rate to run efficiently.
For applications needing deep buffers, a 9-bit wide configuration is available. Its latency, which can be set as low as 20 ns, enables a higher effective bandwidth.
The on-chip delay lock loop (DLL) provided on an RLDRAM II device removes a significant amount of on-chip skew between the clock and data. This DLL's 1.8-V core enables a more power-efficient design, while the 144-ball FBGA package possesses a small footprint of just 198 mm² and facilitates high-speed signaling. A multiplexed address mode is available for applications sensitive to additional pins on the controller, and a great deal of flexibility is also provided through availability of both 1.5- and 1.8-V I/O voltages and programmable ou tput impedance, which allows compatibility with both HSTL and SSTL I/O schemes.
Addressing Schemes and More
RLDRAM II has two addressing schemes: a multiplexed mode similar to an SDRAM's, as well as a non-multiplexed mode similar to that found on SRAM devices. In addition, other features—such as on-die termination (ODT), unidirectional data strobes for high speed and low-skew clocking, data mask capabilities that mask data on WRITE commands issued to the device, and a data valid signal that allows easy data capture and simple controller data buffer design—address both the memory requirements of today's applications, including graphics and L3 cache, and the future needs of the networking community.
Depending on the specific processing strategy, the memory bus interfacing a network processor or ASIC system in a seven-layer packet process may be required to run up to eight times faster than the link rate to enable full READ and WRITE operations and throughput. For an OC-768 appl ication, the data rate needed may reach as high as 40 Gbit/s.
Comparing RLDRAM II's write-read--write access at 300 MHz to that of DDR2 SDRAM reveals the data rate of each architecture is 20.95 Gbit/s and 4.15 Gbit/s, respectively. To meet the 40-Gbit/s requirement, designers would need 20 RLDRAM devices as compared with 97 DDR2 SDRAM devices. Obviously, using the RLDRAM devices results in significant power and space savings. At 400 MHz, the number of RLDRAM II devices is reduced even further.
Figure 4 compares an RLDRAM II CIO device with DDR2 as packet memory. In this example, four 4-word bursts are read, then another four 4-word bursts are written. Each scenario makes optimal assumptions for its respective device. For DDR, this means all data comes from one open page, while for RLDRAM II, it means no bank conflicts exist.

Figure 4: RLDRAM II and DDR2 SDRAM: Best case CIO c locking sequence as acket memory.
As shown in Figure 4, RLDRAM II requires only 33 clock cycles to perform these operations at 300 MHz clock (600 Mbit/s data), thus achieving a bus efficiency of 97 percent (33 cycles are required to transfer 32 pieces of data). DDR2 SDRAM, on the other hand, delivered a bus efficiency of only 43 percent. (74 clock cycles are required to transfer the same amount of data).
Figure 5 compares the command and data sequencing between QDR SRAM and RLDRAM II SIO. RLDRAM II has a longer initial latency than the QDR device, however it achieves full bus utilization when refresh commands are scheduled in a way that avoids bank conflicts. Bank conflicts can also be avoided by controlling the bank sequence of data, which is difficult in some applications. A refresh operation takes the same time as any other DRAM access, one full tRC period.

Figur e 5: RLDRAM II versus QDR SRAM: Best case SIO clocking sequence as packet memory.
Running RLDRAM II at 1.8V I/O makes it feasible to control either RLDRAM or DDR2 SDRAM devices. It is recommended the controller be designed with programmable impedance outputs that have either a continuously variable drive or several coarse points to select from. Also, the input receivers should be designed to operate with tight trip points. Table 1 shows a quick comparison between RLDRAM II and DDR2.

Table 1: Comparison of RLDRAM II and DDR2

Figure 6 shows the layout of a high-performance network line card. The look-up tables use either DDR SRAM or RLDRAM II CIO because of their high read-to-write ratios, with RLDRAM II providing the high density. Either QDR or RLDRAM II SIO is used for the link list and packet counting operations due to their 1:1 operations rat io. Finally, many options exist for the packet memory, including QDR SRAM for maximum performance, RLDRAM SIO for high performance and high density, DDR2 SDRAM for low cost, or RLDRAM CIO to allow higher performance than DDR2 SDRAM offers.

Figure 6: Example line card.
Wrap Up
As network line rates continue to increase, both the performance and the density of the memory devices become increasingly important in the network processor's ability to do packet processing quickly and efficiently. The RLDRAM II architecture addresses these higher performance requirements. It combines the performance-critical features networking and cache applications need, including high density, high bandwidth, and fast SRAM-like random access.
The RLDRAM device possesses an innovative circuit design minimizing the time between the beginning of the access cycle and the availability of t he first data. This produces latency less than half the 45 ns required by standard DDR SDRAMs. The RLDRAM architecture is designed to close the performance gap between DDR SDRAM and SRAM. In those cases requiring a higher density, more cost-effective solution, it can even replace SRAM.
About the Author
Todd Dinkelman is a DRAM applications engineer in Micron's Networking and Communications Group. He received a BSEE from Southern Illinois and is currently seeking his MSEE. Todd can be reached at tdinkelman@micron.com.