Practical Design and Implementation of a Configurable DDR2 PHY
Update: GigOptix, Inc. Announces Acquisition of ChipX (November 10, 2009)
By Lior Amarilio, ChipX
Abstract:
As speed and design complexity increase, so does the need for more memory storage. System-on-a-Chip (SoC) designers can choose to embed more memory into the device, at the expense of silicon area and cost. Depending on memory requirements, a more economical approach might be to use off-chip memories. The evolution of Dynamic Random Access Memory (DRAM), which has targeted the commodity market during the last couple of decades, has provided high-performance, cost-effective off-chip memory. However, the drive for simple and inexpensive DRAM for the commodity market has left most of the design and protocol complexity in the memory controller and physical interface (PHY) drivers. As a result, SoC physical design teams that need to interface with Double Data Rate 2 (DDR2) DRAM face challenges in design expertise, design flow, and Electronic Design Automation (EDA) tools.
To reduce the hassles presented to SoC designers by the DDR2 interface, many problems have been resolved by DDR2 PHY IP development. A DDR2 high speed PHY block is almost always developed as a full custom mixed signal design. There are many good reasons for implementing a full custom design, where every cell and every signal route is fully controlled. Such pre-defined, hard designs offer a way to deal with the tight timing budget of DDR2, which is in the range of a few tens of picoseconds. Another reason is the physical dimensions in which this block must fit.
This paper presents, through examples, the methods selected while performing the physical implementation of the IP.
The DDR2 interface supports a wide range of interface settings. Start with the type and number of external SDRAM devices, move on to the logical data or address bus widths, and of course the topology of the SDRAM, such as point-to-point or multipoint connections. Each of these options implies a unique physical implementation of the PHY. This paper demonstrates how, with an advanced design process, the same DDR2 PHY IP design can handle different configurations without area or power overhead costs.
Selecting a Structured ASIC or Hybrid ASIC as the platform for implementing the PHY introduces great flexibility, both in performance and in silicon area cost for different configurations. However, the design effort spent up front to meet the required performance while limited to the Structured ASIC fabric is significant. This paper presents a view of the Structured ASIC advantages and design challenges as they relate to DDR2 PHY design.
This paper highlights the basic architecture of the DDR2 interface from the SoC side, showing the controller, the master-slave delay-locked loop (DLL), the PHY, and the SSTL I/Os. It describes the different configurations of such an interface with a view to the Structured ASIC advantages. It then discusses the physical design considerations in depth, provides examples of the adopted approach, and summarizes the achieved performance.
About DDR2
DDR2 is part of a large family of synchronous DRAM (SDRAM) interface technologies, which in turn is one of many DRAM implementations. DDR2 SDRAM is an evolutionary improvement over its predecessor, DDR SDRAM. This family of memories is the leading off-chip memory solution in the market today, and is used on PC Motherboards.
The primary benefit of this type of memory is its ability to read or write two words of data over the wide parallel data bus every clock cycle -- one word for the rising edge of the clock strobe, and a second word on the falling edge of the clock strobe. Hence the name Double Data Rate (DDR) memory.
The DDR2 interface is specified by JEDEC to operate at rates of 400–800 Mbps, and a few suppliers support even higher rates of up to 1066 Mbps. The interface has a wide parallel bidirectional data bus, using SSTL1.8 I/Os and a single bidirectional strobe signal for each group of 8 data bits. The strobe signal is not a free-running clock, but is transmitted along with the relevant active data. Moreover, logically yet unfortunately, JEDEC has defined the system to shift design complexity into the memory controller and PHY, to keep DRAMs as inexpensive as possible. This mandate left complexity in the development of the DDR2 PHY, resulting in significant challenges in terms of design expertise, design flows, and EDA tools.
Figure 1 demonstrates the timing relationship between the DQ (data bus) and DQS (strobe) signals for Read and Write operations. When writing data, it is the responsibility of the DDR2 PHY to center-align the DQS with DQ while tracking PVT changes, taking advantage of DLL circuitry and delay lines. In the opposite operation of reading, the DRAM transmits the data and strobe edge-aligned, and it is again the DDR2 PHY's responsibility to shift the incoming DQS by 90 degrees relative to DQ, also while tracking PVT changes.
Figure 1 Write and Read Operation, Wave View
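The relationship between data rate, clock period, and the 90 degree DQS shift can be made concrete with a little arithmetic. The following is a minimal sketch, assuming that two bits are transferred per clock cycle on each pin and that the 90 degree shift is one quarter of the clock period; the function name and dictionary keys are illustrative and do not come from the PHY design.

```python
def ddr_timing(data_rate_mbps: int) -> dict:
    """Derive basic DDR timing figures, in picoseconds, from the per-pin data rate."""
    bit_time_ps = 1_000_000 / data_rate_mbps   # duration of one data bit
    clock_period_ps = 2 * bit_time_ps          # two bits per clock cycle (rising + falling edge)
    dqs_shift_ps = clock_period_ps / 4         # 90 degrees of the clock period
    return {
        "bit_time_ps": bit_time_ps,
        "clock_period_ps": clock_period_ps,
        "dqs_shift_ps": dqs_shift_ps,
    }


# The two ends of the JEDEC range quoted in the text.
for rate in (400, 800):
    t = ddr_timing(rate)
    print(f"{rate} Mbps: bit time {t['bit_time_ps']:.0f} ps, "
          f"clock period {t['clock_period_ps']:.0f} ps, "
          f"90-degree shift {t['dqs_shift_ps']:.0f} ps")
```

At 800 Mbps this gives a bit time of 1250 ps and a 90 degree shift of 625 ps, which places the shifted strobe edge at the nominal center of the data bit.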
DDR2 PHY Configurability Options
Due to the nature of a high-speed interface and the clear desire to control signal integrity effects in order to increase product reliability and yield, many of the DDR2 PHY IP blocks are compiled into optimized hard macros. Typically, each DDR2 PHY is constructed from several hard macro cells.
Having a hard-macro library limits the ability to choose a different interface configuration, because a different set of hard macros is used for each configuration. In the case presented here, a set of predefined interface configurations was taken into account while designing the DDR2 PHY architecture. This feature of configurability forces a designer to divide the hard macro into several small hard macros that can be abutted with one another to form the desired configuration, and still meet the tight timing constraints which are imposed by the DDR protocol. In addition, in order to facilitate straightforward connections between on-chip PHY macro building blocks as well as off-chip signals, a physical dimension requirement exists on the hard macros.
Using the configurable set of hard macros, the following configuration options are available without any area, power, or speed penalties:
- Data bus width (DQ)—can be any multiple of 8 bits (byte).
- Number of strobes (DQS)—differential or single-ended, one set per each data byte
- Address width—can be 12 to 15 address signals.
- Number of CS, WE, ODT—in order to support rank topology and multipoint ordering.
- Number of differential clock outputs—best used in wide rank topology.
- Delay-Locked-Loop (DLL) type and frequency.
- On-Die-Terminations (ODT) values per IO groups are dynamically set.
Figure 2 DDR2 PHY Block Diagram
The colors in this figure represent the various hard macros that form the complete configurable DDR2 PHY. The hard macro PHY block includes:
- Three types of SSTL1.8V I/O, optimized for DDR2
- Multiple Data Byte macro-cell blocks, each with 8 DQ signals and their respective DQS and DM signals (at least one Data Byte block is required).
- A single configurable Address/Command macro-cell abuts to a Data Byte macro, and interfaces the address and control signals to the SDRAM.
- Additional single-address-bit macro-cells abut to the Address/Command macro to form a wider address bus, which allows single address bits to be added with no timing penalty.
- A similar minimal macro-cell is responsible for adding extra clock drivers.
- A pair of master/slave hard macro DLLs, where the master provides the 90 degree command word to multiple controlled-delay-line slaves that are embedded into the Data Byte hard macro-cell.
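To illustrate how a requested configuration maps onto the macro-cells listed above, here is a minimal sketch. It assumes that the Address/Command macro alone covers the minimum address width of 12 bits and that each additional address bit or clock output costs one minimal macro-cell; the class, the constant, and the dictionary keys are hypothetical names chosen for this example, not part of the actual IP.

```python
from dataclasses import dataclass

# Assumption: the Address/Command macro covers the minimum (12-bit) address bus.
BASE_ADDRESS_BITS = 12


@dataclass(frozen=True)
class PhyConfig:
    dq_width: int                 # data bus width in bits
    address_width: int            # number of address signals
    extra_clock_outputs: int = 0  # additional differential clock drivers


def macro_list(cfg: PhyConfig) -> dict:
    """Validate a configuration and return how many macro-cells of each kind it needs."""
    if cfg.dq_width <= 0 or cfg.dq_width % 8 != 0:
        raise ValueError("DQ width must be a positive multiple of 8 bits")
    if not 12 <= cfg.address_width <= 15:
        raise ValueError("address width must be between 12 and 15")
    if cfg.extra_clock_outputs < 0:
        raise ValueError("extra clock outputs cannot be negative")
    return {
        "data_byte": cfg.dq_width // 8,                         # one macro per byte lane
        "address_command": 1,                                   # single configurable macro
        "address_bit": cfg.address_width - BASE_ADDRESS_BITS,   # one per extra bit
        "clock_driver": cfg.extra_clock_outputs,
    }


print(macro_list(PhyConfig(dq_width=32, address_width=14, extra_clock_outputs=1)))
```

A 32-bit data bus with 14 address bits therefore needs four Data Byte macros and two extra address-bit macros under these assumptions.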
Physical Implementation Challenges
The physical implementation of the DDR2 interface is divided into two levels. High-level integration constructs a PHY from prebuilt hard macro-cells by placing them adjacent to one another, providing the best power connections and signal integrity. Lower-level implementation is the creation of the firm macro-cells themselves.
The challenge of this design approach is implementing a configurable firm macro-cell that meets the following requirements:
- The exact physical dimensions dictated by the I/Os and abutment macros.
- The tight timing requirement imposed by the DDR2 protocol.
- The design rules introduced by both the Structured ASIC and cell-based technology.
- High test coverage, using design for test (DFT) structures that do not impact the required performance.
Because the location of the I/Os and the abutment of the macro-cells dictate specific physical dimensions, a very tight timing constraint has to be met within a fixed area.
Figure 3 demonstrates one of the timing budget calculations the read path has to meet. Operating at a data transfer rate of 800 Mbps does not leave much timing budget. It can be observed that a total theoretical data window of less than 200 ps is left for correctly capturing the data. This small window shrinks further due to the following parameters:
- SDRAM device skew (tDQSQ)
- Board trace skew
- DLL jitter
- Asymmetry of the I/O rise/fall times
- Setup/hold timing requirement
Figure 3 Demonstration of Read Data Window
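The way the capture window shrinks can be expressed as a simple subtraction budget. The sketch below is illustrative only: the 200 ps starting point is taken as the upper bound quoted above, and every loss value is a placeholder chosen to make the arithmetic visible, not a measured or specified figure. The function and key names are hypothetical.

```python
def read_margin_ps(ideal_window_ps: int, losses_ps: dict[str, int]) -> int:
    """Subtract every loss term from the ideal capture window.

    A negative result means the timing budget cannot be met.
    """
    return ideal_window_ps - sum(losses_ps.values())


# Placeholder values only; real numbers come from the SDRAM datasheet,
# board analysis, and circuit simulation.
example_losses_ps = {
    "sdram_skew_tDQSQ": 50,
    "board_trace_skew": 20,
    "dll_jitter": 30,
    "io_rise_fall_asymmetry": 20,
    "setup_hold": 40,
}

margin = read_margin_ps(200, example_losses_ps)
print(f"remaining margin: {margin} ps")
```

Keeping each contributor as a named entry makes it easy to see which term dominates when the margin goes negative.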
In the next sections, specific steps of the design flow are discussed, and a description of each challenge and an example of the solution chosen to overcome it is presented.
Floorplan and Cluster Placement
The DDR2 PHY has strict physical dimensions, and the design is constructed from several different and repeatable modules. The designer knows the optimum location of each module inside the fabric. By following a few simple steps, it is possible to allocate groups of cells to a cluster and to force the tool to place the cells related to each cluster in a desired location. These steps are:
- Identify a set of cells that have a close relationship.
- Collect the dimensions of the library cells in that group.
- Define a cluster attribute.
- Specify the best location of the specific cluster in the fabric, making sure the dimensions of the cluster are large enough to include all relevant cells.
- Link all the cells in that group to the specific cluster.
- Execute “fix cell” once the structured placement is complete, so that the placed cells cannot move.
set cluster [ data create cluster region $m central_cluster "336u 0u 252u 156u" ]
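Before a cluster region is committed in the tool, it helps to confirm that the region can actually hold the cells linked to it. The following is a minimal sketch of that check, assuming rectangular cells and a hypothetical maximum utilization factor of 0.8; the class, the function, and all dimensions are illustrative and are not taken from the design or from the EDA tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    name: str
    width_um: float
    height_um: float


def cluster_fits(cells, region_w_um, region_h_um, max_utilization=0.8):
    """Return True if the total cell area fits in the region at the given utilization."""
    cell_area = sum(c.width_um * c.height_um for c in cells)
    return cell_area <= region_w_um * region_h_um * max_utilization


# Hypothetical example: ten identical 4 um x 10 um cells in a 20 um x 30 um region.
dq_capture_cells = [Cell(f"dq0_ff{i}", 4, 10) for i in range(10)]
print(cluster_fits(dq_capture_cells, region_w_um=20, region_h_um=30))
```

An area check like this is only a first filter; the placement tool still has to legalize the cells within the region.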
In the Data-Byte part of the DDR2 PHY, more than 20 different locations of clusters were defined and implemented. Figure 4 demonstrates the different locations of the different clusters. It can be observed that next to each I/O of DQ the exact same rectangular cluster is defined, in order to be able to repeat one implementation over and over in the same way. Other clusters can be observed on the top row below the I/Os, dedicated for the JTAG boundary scan topology. One cluster is observed for DQS generation and handling along with the masking mechanism. A few more serve for the delay line locations and more.
Figure 4 Cluster Locations
Abutment Concept
Forming the top level DDR2 PHY requires connection of several prebuilt firm-macro-cells. In order to meet the tight timing constraints every macro cell has signal pins that tightly connect to the adjacent cell. All interface signals have strict physical locations on the boundary of the firm macro. The location is defined in the X,Y orientation and also in the specific metal layer and design rules with which each signal is routed.
For each input or output signal, the designer has to specify a specific location on the boundary of the macro following these steps:
- Identify all interface pins to other blocks, according to their types.
- Calculate exact pin location.
- Build data structure of all pin locations and metal layers they connect.
- Execute a Tcl command that forces the location of all pins, for example “force plan pin”.
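The pin data structure described in these steps can also be used to check that two macros will connect by abutment before any pin is forced in the tool. The sketch below assumes pins are described by an offset along the shared edge, in integer nanometers to avoid rounding issues, plus a metal layer name; all names and values are hypothetical.

```python
from typing import NamedTuple


class Pin(NamedTuple):
    name: str
    offset_nm: int   # position along the shared macro edge
    layer: str       # metal layer the pin is routed on


def abutment_mismatches(right_edge_pins, left_edge_pins):
    """Return names of pins on one macro's right edge that have no pin at the
    same offset and on the same layer on the neighbouring macro's left edge."""
    targets = {(p.offset_nm, p.layer) for p in left_edge_pins}
    return [p.name for p in right_edge_pins
            if (p.offset_nm, p.layer) not in targets]


# Hypothetical pins on the facing edges of two adjacent macros.
macro_a_right = [Pin("dq_out", 1000, "M3"), Pin("clk", 2000, "M4")]
macro_b_left = [Pin("dq_in", 1000, "M3"), Pin("clk", 2000, "M3")]

print(abutment_mismatches(macro_a_right, macro_b_left))
```

Here the clock pins sit at the same offset but on different layers, so the check reports "clk" as a pin that would not form a continuous route.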
In addition, special attention is needed when designing the power mesh of the top-level PHY. A prebuilt partial power mesh exists within each macro-cell, and the top-level power mesh is formed by placing all firm macro-cells adjacent to one another. The width of the power rails, for both VSS and VDD core power, must be taken into account in order to meet the IR-drop requirements.
Figure 5 is a zoom-in view of two firm macro-cells, showing the pin locations on the boundary of each one. The pins are located so that they connect by abutment and form a continuous routing path. The wide rails are the VSS and VDD power rails, arranged alternately; they are also abutted when the two macro-cells are placed side by side, forming the top-level power mesh.
Figure 5 Bottom and Top View of Two Hard Macros Showing Abutment Pin Locations
DFT
Design for testability (DFT) is a critical factor in today’s SoCs that most designers tend to leave for last. Due to the delicate design of DDR2 and the tight timing requirements, a dedicated plan for both logic scan insertion and boundary (JTAG) scan has to be made early in the design cycle to ensure high coverage.
Two contradicting requirements exist for the firm macro-cell. One is to have the highest manufacturing test coverage, which leads to a high number of scan operability points and adds uncontrolled loads on different paths. The other is for test coverage to have no effect on critical timing paths.
In order to solve this dilemma, the scan chain is not built automatically by the EDA tools. Instead, its order and the specific cells used are predefined by the designer, who performs the following actions:
- Identify the different clock domains in the design.
- Identify the logic group operating on each polarity of the clock (rise/fall).
- Build a data structure of all logic cells with respect to the clock type and polarity, and the cluster to which they belong, from the floorplan.
- Based on the floorplan and placement, set the order of the chain.
- Add a lock-up latch between the two clock domains.
- Verify equal loading of all cells, to achieve the exact same timing effect.
- Fix the chain, by adding loads where needed, to equalize timing effects between the paths.
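The ordering part of this procedure can be sketched in a few lines. The example below groups flops by clock domain and clock edge, orders them by a placement index taken from the floorplan, and inserts a lock-up latch marker wherever the chain crosses from one clock domain to another. Placing falling-edge flops before rising-edge flops within a domain is an assumption of this sketch, as are all names; load equalization is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flop:
    name: str
    clock: str          # clock domain name
    edge: str           # "rise" or "fall"
    cluster_order: int  # position derived from the floorplan clusters


LOCKUP = "LOCKUP_LATCH"  # marker for a lock-up latch in the chain


def build_chain(flops):
    """Return the scan order as a list of names, with lock-up latch markers
    inserted at every clock-domain crossing."""
    # Sorting by edge name puts "fall" before "rise" within each domain.
    ordered = sorted(flops, key=lambda f: (f.clock, f.edge, f.cluster_order, f.name))
    chain = []
    for prev, cur in zip([None] + ordered, ordered):
        if prev is not None and prev.clock != cur.clock:
            chain.append(LOCKUP)
        chain.append(cur.name)
    return chain


# Hypothetical flops from two clock domains.
example_flops = [
    Flop("b1", "clk_b", "rise", 0),
    Flop("a2", "clk_a", "rise", 1),
    Flop("a1", "clk_a", "fall", 0),
]
print(build_chain(example_flops))
```

The resulting list can then be used to drive a predefined scan order in the design flow rather than letting the tool choose one.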
Clock Mesh, Zero Skew
In order to meet the timing requirements presented by the DDR2 interface, a zero skew clock topology is preferred. One effective approach to achieve zero skew on a relatively narrow clock tree is by forming a clock-mesh.
A clock mesh is constructed when two or more driver cells are connected in parallel (all driver inputs are shorted together and all driver outputs are shorted together) to drive a wide metal bus, achieving extremely low skew (close to zero). One big drawback of this approach is that common EDA tools cannot calculate the timing delay of a mesh accurately: because all driver inputs are shorted together and all driver outputs are shorted together, the timing engine is not capable of providing the correct path delay. Accurate results therefore require circuit simulation with a standalone analog simulator such as SPICE. Moreover, during the physical implementation of such a structure, all timing calculations have to be handed to an external SPICE engine and the results fed back into the flow.
Inside the firm macro-cells there are several clock-mesh implementations. For each one the following steps are performed:
- Identify all cells that belong to the same clock and for which a zero skew is required.
- Extract the exact physical location of such cells.
- Generate an accurate Netlist, including parasitic values and input loads for the SPICE simulator.
- Analyze structure and form a mesh clock circuit using symmetric drive cells.
- Update netlist inside the generic EDA flow with a new clock mesh structure.
- Perform structured-placement of all cells in the clock mesh.
- Perform parasitic extraction of the netlist again, including the clock mesh.
- Simulate the clock mesh using SPICE to obtain:
  - Exact path delay from the root to each cell’s clock pin.
  - Exact slew at the input pin.
- Update the actual path delay and transition for all leaf pins.
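Once SPICE has produced a delay and slew for every leaf pin, the results can be checked and prepared for back-annotation with a small script. This is a minimal sketch, assuming the simulation results have already been collected into a dictionary of integer picosecond values; the pin names, the numbers, and the 5 ps skew limit are hypothetical.

```python
def mesh_skew_ps(results):
    """Return the skew: the largest minus the smallest root-to-leaf delay."""
    if not results:
        raise ValueError("no leaf pins supplied")
    delays = [delay for delay, _slew in results.values()]
    return max(delays) - min(delays)


def annotation_records(results):
    """Return (pin, delay_ps, slew_ps) tuples sorted by pin name, ready to be
    written out in whatever format the timing flow expects."""
    return [(pin, delay, slew) for pin, (delay, slew) in sorted(results.items())]


# Hypothetical SPICE results: leaf clock pin -> (path delay in ps, slew in ps).
spice_results = {
    "dq0_ff/CK": (152, 40),
    "dq1_ff/CK": (150, 41),
    "dq2_ff/CK": (153, 40),
}

skew = mesh_skew_ps(spice_results)
print(f"mesh skew: {skew} ps, within limit: {skew <= 5}")
for record in annotation_records(spice_results):
    print(record)
```

Checking the skew before annotation catches an unbalanced mesh early, while the sorted records keep the back-annotation step reproducible between runs.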
Figure 6 Circuit example of one clock mesh
The firm macro-cells include such clock-mesh structures. All timing paths related to the clock mesh are hand coded and represented accurately in the timing model. Once this structure is implemented, the designer does not have to worry about the tool’s behavior and can freely benefit from this feature.
Conclusion
Implementing a DDR2 interface from scratch requires significant and delicate design work. The approach of implementing several small firm macro-cells that together can form a variety of DDR2 interfaces, as presented, reduces design time, cost, and risk when building a new DDR2 interface.
This paper provided guidelines and several design methodologies for the designer who is about to implement such firm macro-cells. Particular attention was given to meeting the high performance required by the DDR2 interface.
Today firm macro-cells that are used to build a DDR2 PHY interface are available in the industry and one can relatively easily configure a required solution. Such advanced structures are available at ChipX for integration in its family of Structured ASIC, Hybrid ASIC, and in hard macro form also in Standard Cell products.