Bulletproofing PCIe-based SoCs with Advanced Reliability, Availability, Serviceability (RAS) Mechanisms
1. Introduction
As silicon manufacturing process nodes keep shrinking and transistors get smaller, Systems-on-Chip (SoCs) are increasingly subject to failures caused by changing external conditions such as temperature, EMI, power surges, and Hot Plug events. The transition to PCIe 4.0 and 5.0, with their higher signaling speeds (16GT/s and 32GT/s), also increases the risk of errors due to tightening timing budgets inside the SoC and electrical issues outside the SoC (e.g. crosstalk, line attenuation, jitter). Furthermore, the ever-growing number of PCIe components and systems designed to different revisions of the Specification increases the risk of interoperability issues. As a result, chip designers who use PCIe as the main communication interface in their SoCs are looking for ways to bulletproof their designs by implementing advanced Reliability, Availability, and Serviceability (RAS) mechanisms that go beyond those included in the PCI Express Base Specification.
We start this article by defining “RAS” in the context of PCIe interfacing and looking at the provisions for RAS mechanisms in the PCIe Specification. We then explore some potential PCIe hazards SoC designers can face and the RAS mechanisms that can be implemented to detect, recover from, or prevent these hazards. We conclude with recommendations for choosing a PCIe silicon IP solution that helps mitigate these risks.
2. RAS in the context of PCIe
In the context of a SoC’s PCI Express interface, the 3 components of RAS can be defined as follows:
3. RAS features in PCIe protocol
The PCIe Base Specification defines a set of mechanisms primarily intended for link-level reliability. These include:
These mechanisms provide basic protection against PCIe link-level hazards but prove insufficient for mission-critical deployments in markets such as automotive, HPC/Cloud computing, and enterprise storage/networking. Here are a few examples of potentially hazardous conditions that require more advanced RAS mechanisms to be detected, reported, and/or corrected:
4. Best practices for enhanced reliability
To improve the reliability of PCIe communications, PCIe designs have integrated a set of best practice mechanisms that are not part of the PCIe Base Specification. These include:
The combination of Parity and ECC does a good job of protecting PCIe payload in flight and at rest. However, mission-critical applications demand more advanced RAS mechanisms, which we discuss in the following section.
5. Proposed advanced RAS mechanisms
In this section we describe some PCIe-related issues observed in production environments that could be quickly detected, reported, and/or corrected using advanced RAS mechanisms. For each problem we propose a specific solution, then generalize it in terms of a Reliability, Availability, and/or Serviceability mechanism to be implemented inside and/or around the PCIe interface logic. The terminology used throughout this section is illustrated in Figure 1.
5.1. Non-compliant component
The following examples illustrate issues observed due to non-compliance of either link partner.
5.1.1. Equalization timeout due to PHY-specific FOM timing
In this scenario, the link should initialize at 16GT/s (Gen4) because both link partners support this line rate, but it instead initializes at 2.5GT/s (Gen1) without triggering any errors. The cause is the time required by the SoC’s PHY Rx circuitry to compute a Figure Of Merit (FOM) for a given preset in Equalization Phase 2: it exceeds the Preset timer for EQ Phase 2 in the SoC’s PCIe controller, which prevents the next preset from being tested (possibly to obtain a better Bit Error Rate, or BER) and in some cases forces the link to fall back to Gen1 speed. The problem can be detected by monitoring the appropriate timeout condition, and can be resolved by increasing the EQ Phase 2 timeout counter (up to 50% as allowed by the PCIe Specification) so that multiple presets can be tested and an optimal BER achieved. The EQ timeout counters can be further increased beyond PCIe Specification recommendations for even greater margin. The solution can be generalized with a Reliability mechanism that includes:
5.1.2. Excessive replays due to link partner’s ACK latency
In this scenario, the SoC is able to transmit write packets and read requests, but the observed throughput is lower than expected given the link speed and number of active lanes. The cause is the link partner’s ACK latency, which exceeds the recommended maximum latency defined in the PCIe Specification, resulting in transmission replays that degrade performance. The problem can be detected by monitoring the number of Replays initiated, and can be resolved by increasing the ACK timeout counter to accommodate the extra latency. The solution can be generalized with the Reliability mechanism proposed in the previous section. It should be noted, however, that the size of the Replay Buffer may limit the amount by which the ACK timeout can be increased, in order to avoid buffer overflow.
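The interaction between the ACK timeout and the Replay Buffer size can be illustrated with a rough back-of-the-envelope model. The sketch below is not taken from the PCIe Specification or from any particular controller: the function names, the 4096-byte buffer, and the assumption that the transmitter sends back-to-back TLPs at the full 128b/130b payload rate (ignoring DLLP, framing, and header overhead) are all hypothetical and chosen for illustration only.

```python
def link_bytes_per_ns(gt_per_s: float, lanes: int) -> float:
    """Approximate payload bandwidth of a 128b/130b link, in bytes per ns.

    1 GT/s is 1 bit/ns per lane, and 128 of every 130 bits carry data.
    Assumption: DLLP, framing and TLP header overhead are ignored.
    """
    if gt_per_s <= 0 or lanes <= 0:
        raise ValueError("line rate and lane count must be positive")
    return gt_per_s * (128 / 130) / 8 * lanes


def replay_buffer_fill_time_ns(buffer_bytes: int, gt_per_s: float, lanes: int) -> float:
    """Time for back-to-back TLPs to fill the replay buffer if no ACK arrives."""
    if buffer_bytes <= 0:
        raise ValueError("buffer size must be positive")
    return buffer_bytes / link_bytes_per_ns(gt_per_s, lanes)


def clamp_ack_timeout_ns(requested_ns: float, buffer_bytes: int,
                         gt_per_s: float, lanes: int) -> float:
    """Limit a requested ACK timeout to the replay buffer fill time.

    In this simplified model, waiting longer than the fill time means the
    buffer has no room left for unacknowledged TLPs.
    """
    return min(requested_ns, replay_buffer_fill_time_ns(buffer_bytes, gt_per_s, lanes))


if __name__ == "__main__":
    # Hypothetical configuration: 4096-byte replay buffer, x4 link at 16GT/s.
    print(replay_buffer_fill_time_ns(4096, 16, 4))
    print(clamp_ack_timeout_ns(1000.0, 4096, 16, 4))
```

Under these assumptions, a 4096-byte buffer on a x4 link at 16GT/s fills in about 520 ns, so a requested timeout of 1000 ns would be clamped to that value. A real design would derive the bound from the controller's actual buffer organization and traffic profile.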
5.2. Tolerance to errors and error injection
In this scenario, a deadlock is observed after a few days of normal operation: packets are no longer transmitted on the link because insufficient Tx credits are available to the SoC’s application logic. The cause is malformed TLPs transmitted on the link by the PCIe controller, possibly the result of an uncorrectable ECC or parity error at the Transmit Buffer level. Each malformed TLP is discarded by the link partner’s receiver and the associated credits are lost. Deadlock occurs as a consequence of this credit leakage once Tx credits are no longer available, which can take several days depending on the frequency of the errors. The credit leakage can be identified with an indication from the PCIe controller of the number of malformed TLPs transmitted, and can be corrected by ensuring that the PCIe controller does not update the associated credits.
A generalization of the solution involves the implementation of a Reliability mechanism in which the PCIe interface logic is able to transmit errored TLPs on the PCIe link without incurring side effects such as credit leakage. This allows testing of the system hardware and system software response to errors. Similarly, on the receive path, the ability of the PCIe interface to generate errored packets on the user interface without side effects allows testing of the application logic’s and the SoC’s response to errors. The mechanism typically requires a dedicated set of registers and an interface to the PCIe interface logic allowing for:
The number of injectable error types is ultimately a tradeoff between gate count, implementation complexity, and likelihood of occurrence. Common errors that may be supported include LCRC Errors, Sequence Number Errors, Nullified TLPs, Malformed Packet Errors, Blocked DLLPs (e.g. ACK/NAK), Forced DLLPs (e.g. NAKs), Symbol/Framing errors, and Flow Control errors (e.g. nonsensical values, blocked updates).
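The credit leakage described in this section can be reproduced with a toy model. The sketch below is a deliberately simplified illustration, not a description of how any PCIe controller tracks flow control: the class name, the `charge_errored_tlps` flag, the single credit pool, and the "one malformed TLP in every four" error rate are all hypothetical. Real flow control uses separate header and data credit types and cumulative counters.

```python
class TxCreditCounter:
    """Toy transmit-side credit tracker with a single credit pool."""

    def __init__(self, advertised: int, charge_errored_tlps: bool):
        self.available = advertised
        self.in_flight = 0   # credits held by good TLPs, returned by the receiver
        self.leaked = 0      # credits charged for TLPs the receiver discarded
        self.charge_errored_tlps = charge_errored_tlps

    def send(self, cost: int, malformed: bool = False) -> bool:
        """Try to send a TLP; return False if blocked for lack of credits."""
        if cost > self.available:
            return False
        if malformed and not self.charge_errored_tlps:
            return True  # errored TLP sent without touching the credit state
        self.available -= cost
        if malformed:
            self.leaked += cost      # receiver drops it, credits never return
        else:
            self.in_flight += cost
        return True

    def receive_update(self) -> None:
        """Receiver returns credits for every good TLP it has processed."""
        self.available += self.in_flight
        self.in_flight = 0


def tlps_sent_before_deadlock(charge_errored_tlps: bool, advertised: int = 8,
                              error_every: int = 4, limit: int = 1000) -> int:
    """Count TLPs sent before the transmitter blocks, up to `limit`."""
    counter = TxCreditCounter(advertised, charge_errored_tlps)
    for n in range(1, limit + 1):
        malformed = (n % error_every == 0)
        if not counter.send(1, malformed):
            return n - 1
        counter.receive_update()
    return limit


if __name__ == "__main__":
    print(tlps_sent_before_deadlock(charge_errored_tlps=True))   # leaks credits
    print(tlps_sent_before_deadlock(charge_errored_tlps=False))  # no side effects
```

With credits charged for malformed TLPs, the model blocks after 32 TLPs, once all 8 advertised credits have leaked. With errored TLPs excluded from credit accounting, it runs to the 1000-TLP limit without blocking, which is the "no side effects" property an error injection mechanism needs.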
5.3. Layer-based monitoring and troubleshooting
In one instance, the PCIe link does not come up and the LTSSM does not progress past the Detect state, possibly due to a problem during the Receiver Detect sequence. In another instance, the link is unstable, with the LTSSM frequently transitioning to the Recovery state. This issue could have multiple causes, originating from either the SoC’s PCIe interface or the link partner’s PCIe interface, including a problem at the PHY/MAC level during the training (TSx) sequence, or a problem at the Transaction level such as the reception of an Unsupported TLP. These issues can be detected by probing relevant signals and events at the various layers of the PCIe interface. Bringing the PIPE interface out on the application layer interface can help with PHY/MAC layer issues. It should be noted that PIPE receive data (RxData) must go through 128b/130b descrambling (for PCIe Gen3 and Gen4) to be readable. This mechanism can be further extended to the PCIe interface’s Physical Coding Sublayer (PCS), Data Link Layer (DLL), and Transaction Layer (TL), whereby relevant signals and events are brought out on the application layer’s interface for easy monitoring and troubleshooting. These signals/events may include:
6. Closing statement
As PCI Express is further deployed into mission-critical applications in the Automotive, AI, and Enterprise markets, the need for higher levels of reliability, availability, and serviceability is increasing. Whether they are using a homegrown PCIe interface solution or licensing a PCIe IP solution, SoC designers are looking for mechanisms that provide better visibility into the PCIe interface’s behavior and better control over its operation. Implementing programmable timers and timeouts inside the PCIe interface logic, as well as mechanisms for generating errors without side effects, improves the reliability of the system. Having dedicated monitoring, status, and control interfaces for each of the PCIe interface’s functional layers (PHY, PCS, MAC, DLL, TL) allows SoC designers to flag specific events and errors, and improves the overall serviceability and availability of the system.
All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved.