Network DRAMs Shine in Datapath Designs

Choosing the right memory device can be a daunting task when building today's networking architectures. Engineers must choose among ternary content-addressable memories (TCAMs), multiple DRAM flavors, quad-data-rate (QDR) SRAMs, SDRAMs, and more. And the task doesn't get any easier going forward: with network speeds moving into the 10-Gbit/s range and beyond, choosing the right memory architecture will become even more challenging.

Traditionally, designers have relied on TCAM/SRAM combinations to meet the storage needs of networking architectures. While this combination still thrives today, many have questioned whether it can meet the performance demands of 10-Gbit/s networking boxes and beyond. Seeing a potential opportunity, DRAM vendors have attacked the problem and created two flavors of devices: the Reduced-Latency DRAM (RLDRAM) and the Network DRAM (also known as the Fast-Cycle RAM). While both architectures have their merits, this article provides performance data that shows why the Network DRAM thrives in next-generation networking designs.

Some Basics

Building on the capabilities delivered in the first generation, a second generation of Network DRAM devices has been developed. These parts deliver a 288-Mbit density, ECC-friendly widths of x9, x18, and x36, on-die termination, a 20-ns tRC, and a simpler unidirectional pair of data strobes (DS, QS) for ease of high-speed design. The second generation also includes a 576-Mbit device that delivers 8 banks, a target row cycle time (tRC) of 18 ns, and a better-than-800-Mbit/s data rate. The Network DRAM is similar to a DDR or DDR2 part in that it has a 4-bit prefetch architecture.
To achieve its greatly reduced latency over a standard DDR part, the Network DRAM has shorter bit lines (resulting in fewer column addresses) and a faster sense amplifier than a standard DDR or DDR2 part. Table 1 shows how the Network DRAM compares to the standard DRAMs that designers are familiar with.

Performance Characteristics
ASIC, ASSP, microprocessor, and network-processor designers making memory choices today should consider that by late 2004 or early 2005, when silicon now in design comes to market, DDR2 should be on the verge of mainstream usage in PCs. As most people familiar with the DRAM industry know, a particular DRAM type achieves its lowest cost-per-bit when it becomes mainstream in the PC industry. To that end, we will take a short look at the performance characteristics of three DRAM types expected to be common then: DDR2, Network DRAM II, and RLDRAM II. When considering a choice among these parts, it is useful to remember that any ECC width requirement will necessitate an additional DDR2 part to fill out a 72-/144-/288-bit bus, while the Network DRAM II and RLDRAM II support those widths by design. An additional point is that with a BL=4 part, any bus that is 18 bits wide automatically supports ECC, since each minimum transfer is 72 bits. As another data point, many network systems currently in design are planned to operate at 250 MHz. The timing chart in Figure 1 compares how the Network DRAM II and the RLDRAM II perform in that situation.
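The ECC-width arithmetic above can be checked in a few lines of Python. The bus width and burst length come from the article; the script itself is only an illustration:

```python
# With a burst length of 4, an 18-bit-wide bus moves 72 bits per
# minimum transfer -- 64 data bits plus 8 ECC bits -- so ECC fits
# without adding a device to widen the bus.
bus_width = 18                  # x18 Network DRAM II part
burst_length = 4                # BL=4 (4-bit prefetch)
transfer_bits = bus_width * burst_length
data_bits = 64
ecc_bits = transfer_bits - data_bits
print(transfer_bits, ecc_bits)  # 72 8
```

The same scaling gives the 144- and 288-bit ECC buses from x36 parts, with no filler device required.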
As Figure 1 points out, at 250 MHz the Network DRAM II has a higher effective bandwidth than the RLDRAM II. (Information on the RLDRAM II is derived from Micron's 288-Mbit datasheet dated 5/3/2003.) Table 3 shows a bandwidth comparison under some worst-case scenarios for DDR2, Network DRAM II, and RLDRAM II. As designers can see, in the worst-case situation of repeated reads and writes to the same bank, the Network DRAM has slightly higher performance than the other parts in a couple of situations. And in every case, the Network DRAM delivers 30 to 100 percent higher performance than DDR2. With this information, we can begin to evaluate the performance of the Network DRAM in various applications.
Switch and Router Implementation

At a 500-Mbit/s data rate and a 20-ns tRC, the bit-striped bandwidth of a Network DRAM is 444 Mbit/s per pin. To write a minimum-size 40-byte packet with ECC, a memory device must write 360 bits in 32 ns, equating to a bandwidth of 11.25 Gbit/s. This is more than met by a 36-bit bus and the lowest-speed Network DRAM II. Standard DDR and DDR2 memory types can also meet this bandwidth requirement if the memory bus width is increased, but the memory controller design can be simplified if the tRC is shorter than the minimum packet time. Figure 2 indicates what DRAM bandwidth, frequency, and bus width are needed for various line speeds. This graph takes an idealized bandwidth and then increases the required bandwidth by a factor of 4 to account for write-to-read turnaround and other latency hits. If the worst-case scenario is considered (a read to the same bank is in progress when the packet is being written), a 200-Mbit/s-per-pin device can achieve this with a 144-bit bus.
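As a quick sanity check on the numbers above, here is a minimal Python sketch. The packet size, ECC overhead, and per-pin rate are taken from the article; the variable names are mine:

```python
# A minimum 40-byte packet plus 1 ECC bit per data byte must be
# written within the 32-ns packet-arrival window at OC-192c.
packet_bits = 40 * 8                 # 320 data bits
ecc_bits = packet_bits // 8          # 40 ECC bits -> 360 bits total
window_ns = 32.0                     # minimum packet time at ~10 Gbit/s
required_gbps = (packet_bits + ecc_bits) / window_ns
print(required_gbps)                 # 11.25 Gbit/s

# A 36-bit bus at the 444-Mbit/s-per-pin striped rate comfortably
# exceeds that requirement.
bus_gbps = 36 * 0.444                # ~16 Gbit/s
print(bus_gbps > required_gbps)      # True
```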
Designers must remember that the bus bandwidth in the case shown in Figure 2 is 100 Mbit/s of read and 100 Mbit/s of write. Thus, an 11.25-Gbit/s read is more than met by the Network DRAM's 144-bit bus, which is a fairly common memory bus width in switches and routers.

IPv6 and Network DRAM

For IPv6 over Ethernet, the minimum packet size is 64 bytes. For IPv6 over POS, the minimum packet size is 60 bytes (the 40-byte IPv4 minimum plus the 20 additional bytes of IPv6 header). With this in mind, let's perform the same minimum-packet-size-at-line-speed analysis for IPv6 that has historically been done for IPv4 memory-performance analysis. A stream of 64-byte packets over OC-192c would arrive every 51.2 ns. Using a 144-bit bus and a burst-of-four write, the entire 64-byte packet plus ECC can be written into the packet buffer in 20 ns at even the slowest Network DRAM II frequency. Or, with a narrower 72-bit bus, two bursts of four writes can be done even to the same bank in 40 ns, which is well under the 51.2-ns target. Within the 51.2-ns window per smallest IPv6 packet, two independent Network DRAM accesses can occur; this allows much more processing of both header and payload in an IPv6-based switch or router. For a POS IPv6 packet of 60 bytes, the critical time is 48 ns. Even with the smallest IPv6 packet, two memory accesses can occur within the critical packet time. Here is an area where the low latency of the Network DRAM allows performance a step beyond anything achievable with even a very wide bus and standard DDR or DDR2 DRAMs.

Dealing with Multi-Threading and Multi-Cores

A common rule of thumb for cache sizing is that as one goes from Level 1 to Level 2 to Level 3, each cache should be at least 8x larger than the level below. For a processor with 2 to 4 Mbytes of L2 cache, that means an L3 should be in the range of 16 to 32 Mbytes.
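The IPv6 timing argument can be replayed numerically. This sketch uses the article's figures (OC-192c line rate, 144-bit bus, 20-ns tRC); nothing here is device-specific beyond those quoted numbers:

```python
# Minimum IPv6-over-Ethernet packets arrive every 51.2 ns at OC-192c.
line_rate_gbps = 10.0
packet_bits = 64 * 8
window_ns = packet_bits / line_rate_gbps  # 51.2 ns between packets

# One burst-of-four access on a 144-bit bus moves 576 bits:
# the full 512-bit packet plus 64 bits of ECC, in a single tRC.
bus_bits, burst = 144, 4
bits_per_access = bus_bits * burst
t_rc_ns = 20
accesses_per_window = int(window_ns // t_rc_ns)
print(window_ns, accesses_per_window)     # 51.2 2
```

Two same-bank accesses fit in every packet window, which is the headroom the article credits to the Network DRAM's short tRC.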
That size of cache is very expensive if built in SRAM, but more reasonably priced if an appropriate DRAM solution is available. The servers that these newer processors target are also power-sensitive, and at the same frequencies SRAMs are quite power-hungry compared to DRAMs, so DRAMs offer an attractive solution here. Due to its low latency, the Network DRAM II is the most compelling DRAM solution for L3 caches in these multi-threaded, multi-core processors. To show how Network DRAM II devices thrive in multi-threading and multi-core implementations, two interesting metrics can be evaluated: power per Mbyte per Mbit/s per pin, and power per Mbyte per nanosecond of latency. The Network DRAM II has a power/Mbyte/Mbit/s of (2.5 V * 180 mA)/36 Mbytes/800 Mbit/s = 0.015. In comparison, a high-speed cache-type DDR SRAM delivers a power/Mbyte/Mbit/s of (2.5 V * 850 mA)/2.25 Mbytes/800 Mbit/s = 1.18. Thus, the Network DRAM II device is roughly two orders of magnitude better than a DDR SRAM in this metric. Looking at the second metric, the Network DRAM delivers a power/Mbyte/latency of (2.5 V * 180 mA)/36 Mbytes/20 ns = 0.625. For the SRAM, this works out to (2.5 V * 750 mA)/2.25 Mbytes/4 ns = 208, again showing a significant advantage for the Network DRAM device. The Network DRAM creates a compelling advantage, however, only in very large servers that feature clock speeds on the order of 2 GHz. In these designs, off-chip latency is roughly 40 cycles for the Network DRAM II (a 0.5-ns cycle time and a 20-ns read latency yield 40 cycles). This is much better than the roughly 80 cycles designers would get with a DDR2 DRAM as an L3, based on its roughly 40-ns first access. The penetration of the Network DRAM into this area will ultimately depend upon the success of multi-threaded, multi-core processors. A few large challenges loom for the architects of these processors.
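The two power metrics work out as follows. The voltage, current, density, and speed figures come from the article; the helper function is my own framing:

```python
def power_metric(volts, current_ma, mbytes, divisor):
    """mW per Mbyte per unit (Mbit/s per pin, or ns of latency)."""
    return (volts * current_ma) / mbytes / divisor

# power/Mbyte/(Mbit/s per pin)
dram_bw = power_metric(2.5, 180, 36, 800)     # ~0.016 (article rounds to 0.015)
sram_bw = power_metric(2.5, 850, 2.25, 800)   # ~1.18

# power/Mbyte/(ns of latency)
dram_lat = power_metric(2.5, 180, 36, 20)     # 0.625
sram_lat = power_metric(2.5, 750, 2.25, 4)    # ~208
```

The absolute values are arbitrary figures of merit; what matters is the ratio, which favors the DRAM by well over an order of magnitude on both metrics.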
First, they must prove in silicon that multi-threaded, multi-core designs actually achieve improved results over heavily pipelined single-threaded processors. Processor vendors must also show that these architectures can meet the cost targets of the price-sensitive communications sector. Once these challenges get sorted out, designers can begin tapping Network DRAMs for other multi-thread/multi-core processor architectures.

Working with Offload Engines

The memory requirements for TCP offload engines (TOEs) are no different from the switch and router requirements outlined above. Therefore, as shown above, Network DRAM I and II devices will thrive in these applications.

Wrap Up

Since the Network DRAM was conceived as a "seed product" in the development of low-latency DRAM, manufacturers will continue to enhance the product. Some targets include a latency of less than 10 ns and the adoption of a multi-data clock such as QDR/ODR for higher bandwidth.
All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved.