Stepping Up to PCI-Express
For more than a decade the PCI bus has been the backbone of personal computers. Other systems, such as telephony and networking equipment, adopted the technology for its cost and performance advantages. But now the PCI bus is reaching its performance limits. No longer able to scale in either clock frequency or bus width, the PCI bus has no more bandwidth gains to make. It is time for a replacement.
Personal computer (PC) clock speeds have pushed to multiple gigahertz. Networking and telephony applications are merging and demanding system throughputs in the hundreds of megabytes per second. Meanwhile, the PCI bus that connects processors to their I/O resources is dragging along at a mere 133MHz. Something must change, yet the large installed base of hardware and software for PCI bus-based systems prevents designers from simply switching to a faster bus architecture. Systems need more bandwidth, but it must come with backward compatibility.
The PCI Special Interest Group (PCI SIG) has solved this dilemma with a new bus that offers higher throughput and frequency scalability while retaining software compatibility. The new bus, PCI-Express (PCIe), uses serial communications with a switched point-to-point connection at the hardware level. From a software standpoint, however, PCIe looks the same as traditional PCI to applications, drivers, and operating systems.
The need for a new bus comes from the inherent limitations of the traditional PCI bus. The bus was originally designed for use on a PC motherboard to connect add-in cards to the processor. A supplement to, and eventual replacement for, the original ISA bus, PCI provided a comparatively high-bandwidth connection to peripherals. Like its predecessor, the PCI bus matched the word size of the PC's CPU, starting at a 32-bit width and later moving to 64 bits. As processor clock speeds increased, so did the PCI bus speed. Starting with a 33MHz system clock, the bus moved to 66MHz and then, with the PCI-X extension, to 100MHz and its present 133MHz speed.
Meanwhile, the PCI bus found a home in other applications. It serves as the backbone of network routers: even though routers do not follow the PC architecture, they use the PCI bus to take advantage of low-cost Ethernet-to-PCI adapters first developed for the PC. Similarly, PCI-to-Fibre Channel adapters have given rise to the storage area network (SAN), which likewise uses the PCI bus as its backbone.
PCI OUT OF STEAM
The problem is that the PCI bus is running out of performance headroom for these applications. At 133MHz, the 64-bit PCI bus achieves a throughput of 1 Gbyte/sec. Individual Ethernet and Fibre Channel links typically run at 1 Gbit/sec, and system bandwidth must aggregate several such channels for networking to be effective. This means that the current PCI bus is operating right at the limit of application needs. With links now transitioning to 10 Gbit/sec rates, the PCI bus becomes a bottleneck.
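The arithmetic behind that claim is easy to check. The short C sketch below works through it; the link counts are chosen only as an example of the aggregation described above, not figures from any particular system.

/* Back-of-the-envelope PCI headroom check; link counts are illustrative. */
#include <stdio.h>

int main(void)
{
    double bus_bits   = 64.0;                      /* PCI data width      */
    double clock_hz   = 133e6;                     /* PCI bus clock       */
    double pci_gbit_s = bus_bits * clock_hz / 1e9; /* about 8.5 Gbit/sec  */

    printf("PCI peak: %.1f Gbit/sec (about %.0f Gbyte/sec)\n",
           pci_gbit_s, pci_gbit_s / 8.0);
    printf("Eight aggregated 1 Gbit/sec links need %.0f Gbit/sec\n", 8 * 1.0);
    printf("A single 10 Gbit/sec link needs 10 Gbit/sec\n");
    return 0;
}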
Yet the PCI clock rate cannot be extended much further. It is already difficult to meet the timing constraints of the PCI bus in a system design. Requirements for backward compatibility with older hardware impose setup and hold conditions that occupy as much as 1.7nsec of the bus cycle period. The clock-to-valid-output delay of the transmit driver is a maximum of 3.3nsec. At 133MHz that leaves a window of only 2.5nsec to accommodate system timing variations such as skew and propagation delay. Increasing the clock speed would only aggravate the problem.
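To make that timing budget concrete, the small calculation below plugs in the numbers quoted above; it is a rough illustration rather than a complete PCI timing analysis.

/* Rough PCI timing budget at 133MHz, using the figures quoted above. */
#include <stdio.h>

int main(void)
{
    double period_ns     = 1e9 / 133e6;  /* about 7.5nsec cycle time       */
    double setup_hold_ns = 1.7;          /* legacy setup/hold requirement  */
    double clk_to_out_ns = 3.3;          /* max clock-to-valid-output time */
    double margin_ns     = period_ns - setup_hold_ns - clk_to_out_ns;

    printf("Window left for skew and propagation: %.1f nsec\n", margin_ns);
    return 0;
}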
Neither can the PCI bus width be readily extended. Designing to the timing constraint is already difficult because of skew across the bus width. Even with careful layout and circuit loading restricted to a fan-out of only two or three, developers have had to struggle to make the PCI bus work at a 64-bit width. Many developers feel that a 128-bit width would impose too many restrictions to be a viable solution. In addition, the bi-directional nature of the PCI bus prevents concurrent read and write operations.
Because these limitations stem from the parallel structure of the PCI bus, the PCI SIG chose to develop the PCI Express bus around a full-duplex serial structure (see Figure 1). The serial link is similar to the XAUI interface defined for 10 Gigabit Ethernet hardware. It moves data across unidirectional serial "lanes," which can be stacked into 1-, 2-, 4-, 8-, 16-, and 32-lane configurations. Each lane carries one byte of a word, so the bus can handle 8-bit to 256-bit data widths, and data integrity is protected by CRC (cyclic redundancy check) error detection.
Figure 1. Full Duplex Point-to-Point 16-Lane PCI-Express Connection
The serial bit streams move data at 2.5 Gbits/sec using an 8b/10b encoding scheme that is self-clocking. Because each lane is self-clocking, and because stacking lanes requires data buffering to reassemble words, the scheme is highly tolerant of lane-to-lane skew.
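Conceptually, the transmitter stripes consecutive bytes across the available lanes and the receiver buffers each lane to put the bytes back together. The C fragment below is a deliberately simplified illustration of that striping for a hypothetical four-lane link; it ignores 8b/10b encoding, framing, training, and CRC.

/* Simplified illustration of byte striping across PCIe lanes.
 * Real hardware adds 8b/10b encoding, framing, and CRC; this only
 * shows how consecutive bytes map onto consecutive lanes.
 * Assumes len does not exceed NUM_LANES * FIFO_DEPTH. */
#include <stdint.h>
#include <stddef.h>

#define NUM_LANES  4         /* a x4 link, chosen as an example */
#define FIFO_DEPTH 256       /* per-lane buffer, illustrative   */

void stripe_bytes(const uint8_t *data, size_t len,
                  uint8_t lane_fifo[NUM_LANES][FIFO_DEPTH],
                  size_t lane_len[NUM_LANES])
{
    for (size_t lane = 0; lane < NUM_LANES; lane++)
        lane_len[lane] = 0;

    for (size_t i = 0; i < len; i++) {
        size_t lane = i % NUM_LANES;       /* byte i travels on lane i mod 4 */
        lane_fifo[lane][lane_len[lane]++] = data[i];
    }
}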
At the upper levels of the system, PCIe retains compatibility with standard PCI. The two buses use the same configuration register definitions and the same addressing scheme for accessing and controlling bus nodes. They use the same protocols for setup and acknowledgement of data transfers. This makes them look identical to the operating system and BIOS (basic input/output system), which in turn makes them look identical to application software. The interface hardware handles all the differences between the two bus structures and all the operating details, leaving the system-level functions undisturbed.
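That compatibility is concrete at the register level: a PCIe device presents the same standard configuration header that existing PCI enumeration code already parses. The C struct below sketches the familiar type-0 header layout; the field names are descriptive rather than taken from any particular driver framework, and the offsets assume a packed, naturally aligned layout.

/* Standard type-0 PCI configuration header as software sees it.
 * A PCIe device presents this same layout, which is why existing
 * operating systems and BIOS code can enumerate it unchanged. */
#include <stdint.h>

typedef struct {
    uint16_t vendor_id;           /* offset 0x00 */
    uint16_t device_id;           /* 0x02 */
    uint16_t command;             /* 0x04 */
    uint16_t status;              /* 0x06 */
    uint8_t  revision_id;         /* 0x08 */
    uint8_t  class_code[3];       /* 0x09 */
    uint8_t  cache_line_size;     /* 0x0C */
    uint8_t  latency_timer;       /* 0x0D */
    uint8_t  header_type;         /* 0x0E */
    uint8_t  bist;                /* 0x0F */
    uint32_t bar[6];              /* 0x10: base address registers */
    uint32_t cardbus_cis_ptr;     /* 0x28 */
    uint16_t subsystem_vendor_id; /* 0x2C */
    uint16_t subsystem_id;        /* 0x2E */
    uint32_t expansion_rom_base;  /* 0x30 */
    uint8_t  capabilities_ptr;    /* 0x34: PCIe capabilities hang here */
    uint8_t  reserved[7];         /* 0x35 */
    uint8_t  interrupt_line;      /* 0x3C */
    uint8_t  interrupt_pin;       /* 0x3D */
    uint8_t  min_gnt;             /* 0x3E */
    uint8_t  max_lat;             /* 0x3F */
} pci_config_header_type0;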
PCI EXPRESS HAS PERFORMANCE HEADROOM
The PCIe bus structure gives it considerable headroom for increasing performance over time without straining design resources. The present definition achieves a top system bandwidth of 8 Gbytes/sec (16 lanes, counting both directions), but the definition is readily extended. Because each lane uses only two differential signal pairs and its timing is independent of the other lanes, the bus does not impose significant circuit layout restrictions. Self-clocking means that the bus is scalable in frequency; nothing prevents the scheme from operating at the logic's maximum clock rates. The relative independence of lanes means that the scheme can be scaled in width to match system performance. The protocol provides a method to realign data striped across multiple lanes.
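Those headroom figures follow directly from the lane rate. The sketch below works the arithmetic, treating the quoted 8 Gbytes/sec as the sum of both directions of a 16-lane link.

/* PCIe bandwidth arithmetic from the 2.5 Gbit/sec lane rate. */
#include <stdio.h>

int main(void)
{
    double raw_gbit_per_lane  = 2.5;                         /* per direction   */
    double data_gbit_per_lane = raw_gbit_per_lane * 8 / 10;  /* 8b/10b overhead */
    int    lanes              = 16;

    double per_dir_gbyte  = data_gbit_per_lane * lanes / 8.0; /* 4 Gbyte/sec */
    double both_dir_gbyte = 2.0 * per_dir_gbyte;              /* 8 Gbyte/sec */

    printf("x%d link: %.0f Gbyte/sec per direction, %.0f Gbyte/sec total\n",
           lanes, per_dir_gbyte, both_dir_gbyte);
    return 0;
}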
A serial bus structure does have the restriction that it can only operate point-to-point. Multiple devices cannot share the same physical connection as they do in the multi-drop PCI bus. PCIe allows multiple devices on the same logical bus, however, by employing a switched-fabric architecture. The fabric can establish a point-to-point link between any two bus nodes, giving that link sole control of the bus for the duration of its data transaction.
This switched fabric provides an opportunity to implement quality of service (QoS) controls on data traffic by controlling the switching. It also opens the possibility of peer-to-peer transactions across the bus without the need for bus contention and arbitration. Further, the fabric may be able to support multiple simultaneous transactions when those transactions do not share any part of the data path, further boosting system throughput.
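At its simplest, the routing decision inside such a fabric amounts to matching a transaction's target address against the address window each downstream port claims. The fragment below is a deliberately reduced illustration of that idea; a real PCIe switch also routes by device ID and enforces the ordering and flow-control rules of the specification.

/* Deliberately reduced picture of address routing in a switch fabric.
 * Each downstream port claims an address window; a memory transaction
 * is forwarded to the port whose window contains the target address. */
#include <stdint.h>

#define NUM_PORTS 4

typedef struct {
    uint64_t base;    /* lowest address the port claims  */
    uint64_t limit;   /* highest address the port claims */
} port_window;

int route_port(const port_window win[NUM_PORTS], uint64_t addr)
{
    for (int p = 0; p < NUM_PORTS; p++)
        if (addr >= win[p].base && addr <= win[p].limit)
            return p;          /* forward to this downstream port */
    return -1;                 /* no match: send upstream         */
}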
With all the throughput advantages that PCIe offers, many high-bandwidth applications are migrating to the new standard. One example is a host bus adapter (HBA) for a storage area network (SAN). The operating system's (OS) file server provides the host adapter with commands that identify the device, the transfer size, the relative location, and the operation to be performed. The job of the adapter is to take these file-oriented commands from the host and translate them into specific addressing and data transfer operations.
Once a file transfer request is made, the adapter is responsible for issuing the appropriate commands to the proper storage device. The adapter also moves the data between the storage device and the host system. Typically the adapter will try to maximize system throughput by having several transfer requests active at one time, each for a different storage device.
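A hypothetical bookkeeping structure makes the adapter's job easier to picture. The field names below are invented for illustration and simply mirror the command contents described above (device, location, size, operation); they are not taken from any real HBA interface.

/* Hypothetical bookkeeping for outstanding HBA transfer requests.
 * Field names are illustrative only. */
#include <stdint.h>
#include <stdbool.h>

typedef enum { XFER_READ, XFER_WRITE } xfer_op;

typedef struct {
    uint32_t device_id;      /* which storage device the request targets */
    uint64_t start_block;    /* relative location within that device     */
    uint32_t length_bytes;   /* transfer size                            */
    xfer_op  op;             /* operation to be performed                */
    bool     active;         /* request currently in flight              */
} hba_request;

#define MAX_OUTSTANDING 16   /* several requests active at one time */
static hba_request request_table[MAX_OUTSTANDING];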
SAN adapters typically use Fibre Channel (FC) links to the storage devices, and such links currently operate at 1 to 4 Gbits/sec full duplex, reading and writing at the same time. Because it communicates with several storage devices simultaneously, the host adapter aggregates those separate links, imposing a data-rate requirement of as much as 20 Gbits/sec on the host interface. PCIe can achieve this bandwidth with an eight-lane design.
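The sizing works out as follows. The link count and per-link rate below are example values chosen to reach the 20 Gbits/sec aggregate quoted above, and both sides of the comparison count traffic in both directions.

/* Rough check that an eight-lane link covers the aggregated FC traffic.
 * Link count and rate are example values only. */
#include <stdio.h>

int main(void)
{
    double fc_links       = 5;     /* example: five storage links          */
    double fc_gbit_each   = 2.0;   /* example: 2 Gbit/sec, full duplex     */
    double fc_aggregate   = fc_links * fc_gbit_each * 2;   /* 20 Gbit/sec  */

    double lane_data_gbit = 2.5 * 8 / 10;    /* 2 Gbit/sec after 8b/10b    */
    double x8_aggregate   = 8 * lane_data_gbit * 2;        /* 32 Gbit/sec  */

    printf("FC aggregate: %.0f Gbit/sec; x8 PCIe: %.0f Gbit/sec\n",
           fc_aggregate, x8_aggregate);
    return 0;
}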
RAPIDCHIP IMPLEMENTS PCI EXPRESS INTERFACE
Such designs are already available. Figure 2 shows the functional block diagram of a possible Fibre Channel host bus adapter (FC HBA) design in LSI Logic's RapidChip Platform ASIC. RapidChip Platform ASICs are LSI Logic designs that contain both hardwired logic and a user-configurable area, allowing customization of the base design without incurring the cost penalty of a full custom ASIC. Each of the hard cores is a proven design, and their integration into the RapidChip Platform ASIC has already resolved the layout, interface, and timing-closure issues of their operation. All that remains for the developer is to determine the interconnections of the configurable logic blocks to create a customized part that is ready for production.
Figure 2. Fibre Channel Host Bus Adapter
The example FC HBA RapidChip Platform ASIC consists of several main blocks, including the storage interface section, the PCIe interface section, and the user-configurable logic block with control processor. GigaBlaze® serializer/deserializer (SerDes) cores sit on both sides of the chip to handle the serial interface signaling.
The GigaBlaze cores are multi-purpose SerDes devices that support a wide variety of speeds and standards. They can be configured to handle InfiniBand, PCI Express, Serial ATA (I and II), Serial Attached SCSI, Gigabit and 10 Gigabit Ethernet, and Fibre Channel at data rates from 1.0625 Gbits/sec to 4.25 Gbits/sec. This flexible core is a well-proven design that has been in mass production for more than seven years, across five ASIC technology generations.
On the storage side, a Fibre Channel core handles the interface protocols for connecting to the storage devices. This block is a proven design that has been tested at the University of New Hampshire Interoperability Test Lab and has demonstrated the ability to operate with any Fibre Channel-compatible peripheral. The GigaBlaze cores on this side are configured to operate at 1.0625, 2.125, or 4.25 Gbits/sec.
On the PCIe side are several cores, including the PCIe transaction core, the PCI Express x8 or x16 Lane Link core, and GigaBlaze SerDes cores. The GigaBlaze cores convert data between serial and parallel forms and handle the power-saving options of PCIe. These options include the ability of the transmitter to detect when a remote receiver is connected: the transmitter can remain in an electrical idle mode and reconnect quickly when data transfers are initiated, so it transmits only when there is a receiver to process the data. This conserves power whenever the PCIe bus is not connected to another device.
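The behavior reduces to a simple rule: transmit only when a receiver has been detected and data is waiting. The sketch below captures that rule; the state names are invented for illustration and are not the formal PCIe link power states.

/* Simplified view of the transmitter power behavior described above.
 * State names are illustrative, not the formal PCIe link power states. */
#include <stdbool.h>

typedef enum {
    TX_ELECTRICAL_IDLE,   /* no receiver, or nothing to send: stay quiet */
    TX_ACTIVE             /* receiver present and data pending: transmit */
} tx_state;

tx_state tx_next_state(bool receiver_detected, bool data_pending)
{
    if (receiver_detected && data_pending)
        return TX_ACTIVE;          /* reconnect quickly and send       */
    return TX_ELECTRICAL_IDLE;     /* conserve power the rest of the time */
}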
The PCI Express Link core handles the signaling details of the PCIe link. The link layer processes PCIe signaling protocols, provides link initialization and training (for clock recovery), performs lane alignment, and assembles 64-bit words from the byte-wide outputs of the individual GigaBlaze devices. The link layer also implements flow control as required by the transaction layer and provides the buffers and automatic retry needed for recovery from serial transmission errors.
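Word assembly itself is straightforward once the lanes are aligned. The fragment below shows the idea for an eight-lane configuration, with lane 0 arbitrarily carrying the least-significant byte; training, alignment, and error handling are omitted.

/* Simplified word assembly: gather one byte from each of eight aligned
 * lanes to rebuild a 64-bit word.  The lane-to-byte ordering is chosen
 * here only for illustration. */
#include <stdint.h>

uint64_t assemble_word(const uint8_t lane_byte[8])
{
    uint64_t word = 0;
    for (int lane = 0; lane < 8; lane++)
        word |= (uint64_t)lane_byte[lane] << (8 * lane);  /* lane 0 = LSB */
    return word;
}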
The PCIe transaction core manages the movement of data blocks from the HBA to the host processor as well as processing memory, message, and configuration transactions. This is where the buffering and error-checking of data words takes place. The PCI Express Link core provides parallel data in word-wide format so that the transaction layer core can handle word-oriented error detection and correction before moving the next word in a block. The transaction layer core also manages flow control and traffic arbitration, allowing dynamic control of traffic priorities and bandwidth utilization over PCIe.
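PCI Express flow control is credit-based: the receiver advertises buffer credits, and the sender may issue a transaction only when enough credits remain. The fragment below is a minimal sketch of that gating, not the transaction core's actual interface.

/* Minimal sketch of credit-based flow-control gating.  This illustrates
 * the principle only; it is not the RapidChip core's real interface. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t credits_available;   /* credits advertised by the receiver */
} fc_state;

bool fc_can_send(const fc_state *fc, uint32_t credits_needed)
{
    return fc->credits_available >= credits_needed;
}

void fc_consume(fc_state *fc, uint32_t credits_used)
{
    fc->credits_available -= credits_used;      /* on transmission        */
}

void fc_replenish(fc_state *fc, uint32_t credits_returned)
{
    fc->credits_available += credits_returned;  /* on flow-control update */
}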
At the user-configurable interface, the PCIe transaction core behaves as a DMA channel. This simplifies the developer's effort when customizing the FC HBA RapidChip Platform ASIC: the developer does not have to be concerned with the operation of either the PCIe or the Fibre Channel interface, since both operate independently and automatically. Instead, the developer concentrates on the network management function of the FC HBA. The user-configurable block includes both an ARM processor and customizable logic that support this function. The management of individual storage devices, file handling, fault tolerance, redundancy, and security are all functions to be implemented by the developer. They represent the value-added design effort that turns the RapidChip Platform ASIC into a unique design.
With the advent of PCIe cores and platform ASICs such as the RapidChip family, the move to PCI Express will gain momentum. The need is growing as application speeds increase beyond what the traditional PCI bus can handle. While the serial bus structure is new to many designers, the existence of proven cores that handle the bus details reduces the barriers to PCIe-based designs. The availability of platform ASICs such as the RapidChip significantly reduces the cost and risk of these designs, removing the last of the barriers to adoption. Now it is up to developers to move forward and stake out their share of the new market for PCI Express designs.