The appropriate design choice of interconnect is critically tied to the end application. In applications that have an interactive or "slalom" nature, latency is a major factor in application performance. An analysis of two specific usage scenarios compares the impact of latency on three major chip-to-chip interconnect standards. The analysis shows that in certain applications, HyperTransport (HT) delivers a measurably lower transaction latency than competing technologies such as PCI Express (PCIe) and serial RapidIO (sRIO), thanks in part to its cut-through architecture.

Latency, the time delay between when a message enters the link at one end and when it emerges at the other, is often a critical aspect of application performance. It is a function of the link bandwidth, the protocol overhead and the interconnect logic design.

Latency in any of these interconnects can be compared to the responsiveness of an automobile's steering. The time between the driver's initiation of a turn and the moment the car is actually headed in the new direction directly determines the car's performance in certain settings, such as a slalom course. A full-sized luxury car typically has a larger engine than a sports sedan, yet the sports sedan will consistently outperform it on such a course.

By analogy, many of today's embedded applications have the compute requirements of a slalom course. In such applications, a series of decisions may be made based upon data that is being updated in real time. Another slalom-like scenario occurs when the selection of data is unpredictable enough to defeat caching strategies. In these cases, the latency of the link is key to application performance.

Inside the application

At the most basic level, a CPU subsystem exchanges data with an application-specific device (ASD) over a point-to-point interconnect, which may be PCIe, HT or sRIO. With any of these interconnects, the ASD may read data from or write data to the CPU's memory. We'll look closely at reads initiated by the ASD.

Peering deeper into a more detailed breakdown of the system, we see that the interconnect is implemented as a series of logic blocks within both the CPU and the ASD. The blocks correspond to the layers of the relevant specification; each of the interconnects has a roughly three-layered structure. To read data from the CPU memory, the ASD begins by forming a memory-read request packet. The interconnect end-point logic then constructs and appends headers to the packet at each of its three layers before the physical layer (PHY) transmits the packet on the wire. A non-optimized implementation may require full packet buffers at the logical, transport and physical layers. Within the CPU, these layers must be traversed in reverse order, and another potential buffer point is the internal bus shim within the CPU.

A low-latency design seeks to reduce the delays between these layers. One way is to reduce the number of separate points, or layers, at which packets are buffered. Another is to allow packets to cut through between layers and buffers, rather than being accumulated in full at each layer prior to transmission. Cut-through can play a key role in reducing delays by eliminating the wait for packet accumulation. The degree to which it is safe to cut through depends upon the system and, in particular, upon the link reliability.
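Before turning to reliability, it is worth putting rough numbers on what full-packet buffering costs. The C sketch below is a back-of-the-envelope model, not a measurement from this analysis: the usable bandwidth, the header size needed before forwarding can begin and the number of buffer points are all illustrative assumptions.

    /*
     * Back-of-the-envelope model of the buffering delay added by
     * store-and-forward versus cut-through end-point logic.  All of the
     * figures below (link bandwidth, header size, number of buffer
     * points) are illustrative assumptions, not values from the tables.
     */
    #include <stdio.h>

    /* Nanoseconds needed to clock 'bytes' through a 'gbps' link
     * (1 Gbit/s == 1 bit/ns). */
    static double transfer_ns(double bytes, double gbps)
    {
        return bytes * 8.0 / gbps;
    }

    int main(void)
    {
        const double gbps          = 10.0; /* assumed usable bandwidth at each buffer point    */
        const int    buffer_points = 3;    /* e.g. logical, transport and physical layers      */
        const double header_bytes  = 16.0; /* assumed bytes needed before forwarding can start */
        const double payloads[]    = { 8.0, 2048.0 };

        for (int i = 0; i < 2; i++) {
            /* Store-and-forward: each buffer point waits for the whole packet. */
            double saf = buffer_points * transfer_ns(header_bytes + payloads[i], gbps);
            /* Cut-through: each buffer point waits only for the header.        */
            double ct  = buffer_points * transfer_ns(header_bytes, gbps);
            printf("%4.0f-byte payload: store-and-forward adds ~%6.1f ns, cut-through ~%5.1f ns\n",
                   payloads[i], saf, ct);
        }
        return 0;
    }

Even with these rough figures, accumulating a 2-kbyte packet at every layer costs several microseconds, while cutting through costs only the time to land a header at each layer, which is why the buffering strategy figures so prominently in the comparisons that follow.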
Clock-forwarded interfaces, such as those used on synchronous static RAMs (SSRAMs), System Packet Interface 4.2 (SPI-4.2) and HT, are often regarded as inherently reliable, because the chance of error in a well-designed interface is much smaller than other system error sources. In such interfaces, the link is typically reset upon detection of a bit error. Numerous HyperTransport designs use cut-through, even in the end point.

Serializer/deserializer (serdes)-based interfaces are not regarded as inherently reliable in this sense, although modern serdes designs achieve bit-error ratios of 10⁻¹⁵ or better. Retry-on-error algorithms using per-packet cyclic redundancy checks (CRCs) are available in all three technologies. Because the final CRC is available only at the end of the packet, cut-through is inherently complex: the CRC may ultimately reveal that the packet being cut through is faulty. Cut-through is still useful, particularly at switches and intermediate buffer points, but some strategy must be employed to handle packets whose CRC check fails. All three interconnect technologies support the strategy of so-called packet stomping; in addition, HT and PCIe support data poisoning.

Understanding the tables

In our analysis, we assessed the latency as the ASD read data from CPU memory over each of the PCIe, HT and sRIO interconnects, comparing cases with both short packets and long packets. Link speed determined the time to actually transmit the packet across the wire. For comparison, we used interconnects of similar link bandwidth, choosing the x4 link at the maximum available frequency in each case. The link speeds used:

* PCIe: x4 at 2.5 Gbits/s per lane, yielding 8.0 Gbits/s throughput after line coding;
* HT: x4 at 2.8 Gbits/s per lane, yielding 11.2 Gbits/s throughput;
* Serial RapidIO: x4 at 3.125 Gbits/s per lane, yielding 10.0 Gbits/s throughput after line coding.

Controller speed determined the time for the IP core to process data and header fields. For a fair comparison, we operated each of the end points at 64-bit width and 250 MHz. The memory access is to an open page in 333-MHz SDRAM.

The first case, that of short packets, is shown in table a. The ASD reads 8 bytes of data from the CPU's memory. The amount of data read is very small, but for our slalom type of application it is important that the data be returned very quickly. The table is meant to compare typical latencies among the different interconnects; actual latencies will depend on the details of each implementation.

The Store-and-forward columns show the cases in which packets were buffered in each of the Tx application layer, the Tx data link layer and the Rx data link layer. The columns entitled Cut-through (C-T) are optimized to buffer only in the Rx data link layer, where buffering is needed to drive the retry algorithms. For each column, the value of Max_Payload_Length that minimized the latency was chosen. This value had no effect on the short-packet case, but it was key for the long-packet case: for the Store-and-forward columns a smaller length was best, while for the Cut-through columns a longer length was best, although values above 256 bytes increased the latency because of the buffering in the Rx data link layer. The serdes+PMA+PCS+MAC rows capture the sum of the latencies experienced by those layers, including encoding, scrambling and lane de-skew in each technology.

For short packets, the table shows that the HyperTransport interface yields the best latencies under this set of assumptions.
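Part of this result is simple serialization arithmetic. As a rough cross-check, not a reproduction of the tables, the C sketch below derives the effective throughput of each x4 link from its per-lane rate and line-coding overhead, then computes the bare wire time of the 8-byte and 2-kbyte payloads; protocol headers, controller processing and serdes alignment delays are deliberately ignored.

    /*
     * Arithmetic cross-check of the quoted effective throughputs and the
     * raw wire time for the 8-byte and 2-kbyte payloads.  Protocol
     * headers, controller latency and serdes alignment are ignored, so
     * these are lower bounds, not the table values.
     */
    #include <stdio.h>

    struct link {
        const char *name;
        double      lane_gbps;  /* signaling rate per lane               */
        int         lanes;
        double      coding;     /* fraction of raw bits carrying payload */
    };

    int main(void)
    {
        /* PCIe (Gen1) and serial RapidIO carry 8b/10b line coding, so only
         * 80 percent of the raw bit rate is usable; HyperTransport's
         * clock-forwarded link carries no such coding overhead. */
        const struct link links[] = {
            { "PCI Express x4",    2.5,   4, 0.8 },
            { "HyperTransport x4", 2.8,   4, 1.0 },
            { "Serial RapidIO x4", 3.125, 4, 0.8 },
        };
        const double payload_bytes[] = { 8.0, 2048.0 };

        for (int i = 0; i < 3; i++) {
            double gbps = links[i].lane_gbps * links[i].lanes * links[i].coding;
            printf("%-17s %5.1f Gbit/s effective:", links[i].name, gbps);
            for (int j = 0; j < 2; j++)
                printf("  %4.0f bytes in %6.1f ns", payload_bytes[j],
                       payload_bytes[j] * 8.0 / gbps);  /* 1 Gbit/s == 1 bit/ns */
            printf("\n");
        }
        return 0;
    }

For the 8-byte read, the bare wire times differ by only a couple of nanoseconds across the three links, so the fixed per-layer costs dominate the short-packet comparison; for the 2-kbyte case, the serialization differences grow to a few hundred nanoseconds.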
The HyperTransport advantage for short packets is mainly due to the higher latency of the serdes layer used in PCIe and serial RapidIO for character alignment, lane alignment and other functions. The results also demonstrate the benefit of cut-through implementations.

A comparison of the three technologies in long-packet applications is also shown (table b). It illustrates the read-request-to-read-completion latency for a 2-kbyte read, a figure that matters in bulk-transfer applications. As expected, the latency for the long-packet case is higher than for an 8-byte read. Here again, HyperTransport provides a slightly lower latency than the other interfaces, and the value of cut-through is reaffirmed.

Duncan Bees (duncan_bees@pmc-sierra.com) is technical adviser and Brian Holden is principal engineer at PMC-Sierra Inc. (Santa Clara, Calif.). Holden is also chair of the Technical Working Group of the HyperTransport Consortium.

Related chart: An analysis of latency across HyperTransport, PCI Express and serial RapidIO, showing latency in Short Packet Applications (a) and Long Packet Applications (b). Source: PMC-Sierra