Micro-threaded Row and Column Operations in a DRAM Core

by Frederick A. Ware and Craig Hampel

Abstract: The technique of micro-threading may be applied to the core of a DRAM to reduce the row and column access granularity. This results in a significant performance benefit for applications that deal with small data objects.

I. INTRODUCTION

DRAM interface speeds have improved dramatically in the last decade. However, DRAM core speeds have seen much smaller improvements. As a result, the access granularity of DRAM components has steadily increased, limiting performance in some applications. Table I shows historical data that supports this observation.

TABLE I: Historical Trend of DRAM Row Granularity
“Row granularity” refers to the amount of data that can be transferred through the interface during a row cycle interval (tRR). The corresponding “column granularity” metric refers to the amount of data that can be transferred during a column cycle interval (tCC).

This granularity problem may be addressed with the architectural technique of “micro-threading”, which permits several small accesses to take the place of a single large access. Micro-threading requires some modification to the interface of the DRAM, but does not add significant cost for many DRAM cores, since multiple independent row decode circuits and multiple independent column decode circuits are often already present.

Fig. 1 shows an example of the performance benefit of micro-threading. Two DRAM cores are compared in a graphics application in which a range of triangle sizes is accessed. The cores of the two DRAMs have identical speeds but are organized differently, and the interface circuitry of the two DRAMs has identical speeds. The micro-threaded core achieves two to four times the effective triangle access rate. The remaining sections of this paper describe how the performance benefits of Fig. 1 are achieved.
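These granularity definitions reduce to a simple relation: granularity equals the number of data links multiplied by the number of bit intervals that fit in one cycle interval. The following minimal Python sketch illustrates the relation (the function and parameter names are illustrative, not from the paper):

```python
def access_granularity(num_links, cycle_interval_ns, bit_interval_ns):
    """Bytes transferred through the interface during one cycle interval."""
    bits_per_link = cycle_interval_ns / bit_interval_ns
    return num_links * bits_per_link / 8  # 8 bits per byte

# Row granularity uses tRR as the cycle interval;
# column granularity uses tCC.
```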
II. TYPICAL DRAM CORE

Fig. 2 shows the internal details of a typical DRAM core. This component has eight independent banks. One bank (gray) consists of an “A” half and a “B” half, connected to the “A” data pins and the “B” data pins, respectively. The two bank halves operate in parallel in response to row and column commands. A row command selects a single row (white) within each bank half, and two column commands select two column locations “x” and “y” (black) within each row half. Note that in this example, each group of four bank halves (a “quadrant”) has its own set of row and column decoder circuits. However, these resources are operated in parallel, with each transaction utilizing two diagonal quadrants and leaving the other two quadrants idle.

Fig. 3 shows the timing of a transaction for this typical DRAM component. The “row pins” and “column pins” show the row and column commands directed to the DRAM. Note that most DRAM components multiplex the row and column commands onto the same interface pins; this timing diagram separates them to make the sequence easier to follow. After a row command (“row”) is received, the selected row is accessed (sensed and latched). A time tRR must elapse before another bank can perform a row access; this interval represents the time the bank’s row circuitry is occupied. After a column command (“col x”) is received, the selected column is accessed. In the case of a read command, the data at location “x” is driven onto the data pins, and in the case of a write command the data on the data pins is stored into location “x”. A time tCC must elapse before the bank can perform another column access (“col y”); this interval represents the time the bank’s column circuitry is occupied.

The tCC interval in this example is 4 ns, yielding a column access rate of 250 MHz. The bit transport interval is 0.25 ns, so 16 bits are transported on each link during a column access. With 16 data links, the column granularity is 32 bytes. Because tRR is twice tCC, the row granularity is 64 bytes.

Fig. 3. Transaction timing for a typical DRAM core. A row command selects a single row within a bank, and two column commands select two column locations “x” and “y” within that row. The row circuitry for the two affected half-banks is engaged for an interval of tRR after the row command, and the column circuitry for the two affected half-banks is engaged for an interval of tCC after each column command. In this example, 64 bytes are accessed in each tRR interval and 32 bytes in each tCC interval; these values are the row and column access granularity, respectively.
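The arithmetic of this example can be checked directly. A short sketch using the values stated above (variable names are illustrative):

```python
t_cc_ns = 4.0    # column cycle interval
t_rr_ns = 8.0    # row cycle interval (twice tCC in this example)
t_bit_ns = 0.25  # bit transport interval
num_links = 16   # data links

column_rate_mhz = 1e3 / t_cc_ns                             # 250 MHz
bits_per_link = t_cc_ns / t_bit_ns                          # 16 bits per link
column_granularity = num_links * bits_per_link / 8          # 32 bytes
row_granularity = column_granularity * (t_rr_ns / t_cc_ns)  # 64 bytes
```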
III. MICRO-THREADED DRAM CORE

Fig. 4 shows the internal details of a micro-threaded DRAM core. This component has 16 independent banks, each equivalent to a half-bank of the typical DRAM core (Section II). Even-numbered banks connect to the “A” data pins and odd-numbered banks connect to the “B” data pins. Each bank quadrant operates independently in response to row and column commands (a quadrant is a group of four banks with dedicated row and column circuitry). Furthermore, a column access of an upper quadrant is interleaved with the corresponding column access of the lower quadrant.

Fig. 5 shows the timing of a transaction for this micro-threaded DRAM component. After a row command (“r0”) is received, the selected row (in bank 0) is accessed. A time tRR must elapse before another bank in the same quadrant can perform a row access. However, banks in the other three quadrants may be accessed during this interval; row commands r1, r2, and r3 are directed to banks 1, 2, and 3, respectively. After a column command (“c0x”) is received, the selected column is accessed (column 0x of row 0 of bank 0). A time tCC must elapse before this bank can receive another column access command (“c0y”). However, banks in the other three quadrants may be column-accessed during this interval; column commands c1x, c2x, and c3x are directed to banks 1, 2, and 3, respectively.

As in the typical DRAM core example, the tCC interval is 4 ns and the bit transport interval is 0.25 ns. However, each column access transports data for only half the tCC interval, and each column access uses only 8 of the 16 data links, resulting in a column granularity of 8 bytes, one-quarter of the previous value. The row granularity is 16 bytes, again one-quarter of the previous value.

The benefit of finer granularity comes at the cost of increased command bandwidth: there are four times as many row and column commands in each tRR and tCC interval. In most DRAM components the command bandwidth is much lower than the data bandwidth, so accommodating an increase in the command signaling rate or in the number of command pins will not impact the DRAM area significantly. In the DRAM cores that have been analyzed, the area increase has been under one percent.

Finer granularity also requires independent row and column circuitry for each quadrant of banks. In some DRAM implementations this circuitry is already present, and utilizing it will not impact the DRAM cost. In other implementations it may not be present, and adding it may increase the area and cost of the DRAM component.
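The same check applies to the micro-threaded core, with the half-interval and half-link factors described above (again, variable names are illustrative):

```python
t_cc_ns, t_rr_ns, t_bit_ns = 4.0, 8.0, 0.25
links_per_access = 8        # each access uses 8 of the 16 data links
transport_ns = t_cc_ns / 2  # data is transported for half the tCC interval

bits_per_link = transport_ns / t_bit_ns                     # 8 bits per link
column_granularity = links_per_access * bits_per_link / 8   # 8 bytes
row_granularity = column_granularity * (t_rr_ns / t_cc_ns)  # 16 bytes
```

Both values are one-quarter of the typical core's 32-byte and 64-byte granularities.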
IV. GRAPHICS APPLICATION EXAMPLE

A. Assumptions

A graphics application example will illustrate the benefits of finer access granularity. Fig. 6 shows the assumptions used in this example. It is assumed that the column locations of each row of each bank are mapped into a two-dimensional rectangle of pixels in order to maximize the spatial locality of a small graphical object. This increases the probability that the pixels of such a graphical object (here a six-pixel triangle) lie within a single row. The rectangular map shown in the figure is appropriate for the typical DRAM core of Fig. 2.

It is also assumed that the graphics application generates one transaction per triangle, and that the transaction stream accesses a particular row of a particular bank only intermittently; i.e., the temporal locality is low. Consequently, the DRAM controller practices a closed-page policy in which each transaction performs a precharge (close) operation after the last column access. Bank and row addresses are mapped to ensure that transactions have a low probability of accessing banks that are already busy (handling an earlier transaction).

B. Triangle Alignment

Fig. 7 shows the alignment cases for a six-pixel triangle for the micro-threaded DRAM core and the typical DRAM core. This analysis permits triangle size (in pixels) to be converted to column accesses.

The typical DRAM core (right side of Fig. 7) has a column access granularity of 32 bytes, or eight pixels. There are eight alignment cases for a six-pixel triangle, requiring from two to four column accesses. The average is 2.625 column accesses, meaning that (on average) 21 pixels are accessed in order to get the six pixels forming the triangle. Fifteen pixels are discarded, yielding an efficiency of about 29%.

The micro-threaded DRAM core (left side of Fig. 7) has a column access granularity of eight bytes, or two pixels. There are two alignment cases for a six-pixel triangle, requiring either four or five column accesses. The average is 4.500 column accesses, meaning that (on average) nine pixels are accessed in order to get the six pixels forming the triangle. Three pixels are discarded, yielding an efficiency of about 67%.
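The efficiency figures above follow directly from the stated alignment averages. A brief sketch, taking the per-core averages from Fig. 7 as given (the helper function is illustrative):

```python
TRIANGLE_PIXELS = 6
PIXEL_BYTES = 4

def fetch_efficiency(avg_column_accesses, column_bytes):
    """Fraction of fetched pixels that actually belong to the triangle."""
    pixels_fetched = avg_column_accesses * column_bytes / PIXEL_BYTES
    return TRIANGLE_PIXELS / pixels_fetched

print(fetch_efficiency(2.625, 32))  # typical core: 6/21, about 0.29
print(fetch_efficiency(4.5, 8))     # micro-threaded core: 6/9, about 0.67
```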
C. Object Size versus Object Access Rate

Fig. 8 shows the general relationship between the size of an object and the rate at which it can be accessed. The x-axis shows the number of column accesses needed to access a graphical object (such as a triangle). The y-axis shows the access rate of the graphical object in objects per second. The relationship is an inverse one: an object size of n column accesses produces an object access rate of 1/(n × tCC), where tCC is the column cycle interval. The object access rate is also limited to a maximum value of 1/tRR, where tRR is the row cycle interval. In the micro-threaded DRAM core, the effective tCC and tRR intervals are one-quarter the values for the typical DRAM core, yielding object access rates that are up to four times greater.

V. GRAPHICS APPLICATION - PERFORMANCE

Fig. 9 shows the general relationship of triangle size versus triangle access rate for the typical and micro-threaded DRAM cores. The uppermost x-axis shows triangle size in units of 4-byte pixels. The y-axis shows the access rate in 10^6 triangles per second. The upper curve (8 bytes per column access) is for the micro-threaded DRAM core, and the lower curve (32 bytes per column access) is for the typical DRAM core. The middle x-axis converts from triangle size to column accesses for the micro-threaded DRAM core (for example, a six-pixel triangle requires 4.500 column accesses). Likewise, the lowest x-axis converts from triangle size to column accesses for the typical DRAM core (for example, a six-pixel triangle requires 2.625 column accesses).

The Fig. 9 graph provides a clear view of the performance advantage of the micro-threaded DRAM over the typical DRAM. This advantage is present even though the two DRAMs have interfaces with the same transfer bandwidth, and cores with the same access intervals. The performance advantage comes from the reduction of access granularity.
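The rate model of Fig. 8 can be applied to the six-pixel triangle to quantify this advantage. A sketch assuming the effective micro-threaded intervals are one-quarter of the typical values, as stated in Section IV-C (function and variable names are illustrative):

```python
def object_rate_per_s(n_accesses, t_cc_ns, t_rr_ns):
    """Objects per second: 1/(n*tCC), capped at 1/tRR (the Fig. 8 model)."""
    return min(1.0 / (n_accesses * t_cc_ns), 1.0 / t_rr_ns) * 1e9

typical = object_rate_per_s(2.625, t_cc_ns=4.0, t_rr_ns=8.0)  # ~95e6/s
micro = object_rate_per_s(4.5, t_cc_ns=1.0, t_rr_ns=2.0)      # ~222e6/s
print(micro / typical)  # ~2.3x for a six-pixel triangle
```

For this triangle size the model predicts roughly a 2.3x advantage, consistent with the two-to-four-times range cited in the introduction.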
VI. OTHER APPLICATIONS

A number of other applications are sensitive to the granularity of memory accesses. The modeling of a continuous physical medium (a fluid, for example) on a massively parallel supercomputer is one of these applications. This class of problems typically utilizes a linear system of equations that is sparse; only a small fraction of the terms are non-zero. Because of this, there will be limited locality of reference, as in the graphics example already discussed. Reduced-granularity memory accesses will make more efficient use of the memory system.

Another application is network switching, which involves the temporary storage of the small packets that form a network transfer. These packets are small and there is no locality of reference (the packet stream is mixed randomly with the packets from other simultaneous transfers). Again, reduced-granularity memory accesses will make more efficient use of the memory system.

A final application is general-purpose processing. The cache hierarchy utilized by advanced microprocessors chooses a line size (a replaceable block) that is equal to the row granularity; i.e., the product of the width of the memory system and the tRR parameter, divided by the tBIT interval. This permits fully random access at the maximum bandwidth of the memory system, but at the cost of a line size that grows over time as the tBIT interval shrinks while tRR stays relatively constant. Reduced-granularity memory accesses would permit the line size to be set to a value that is appropriate for the processor, and not to a value that is convenient for the memory system.

VII. CONCLUSIONS

The historical trend of increasing row and column access granularity of DRAM components will increasingly limit performance in certain classes of applications. The architectural technique of micro-threading may be applied to existing DRAM cores with relatively low incremental cost. This technique permits the DRAM component to perform several small-granularity accesses in place of a single large-granularity access. Performance improvements of 2x to 4x are possible in some applications.

ACKNOWLEDGMENT

The authors wish to acknowledge the contributions of Wayne Richardson, Chad Bellows, and Larry Lai to the analysis of micro-threading in a DRAM component. All three are full-time employees of Rambus, Inc.