|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PCIe error logging and handling on a typical SoCUmesh Pratap Singh, Truechip Solutions Pvt. Ltd. Introduction: In Today’s high speed systems PCI Express (PCIe-Peripheral Component Interconnect-express) has become the backbone. PCIe is a third generation high performance I/O bus used to interconnect peripheral devices in applications such as computing and communication platforms. It is used to provide the connections between motherboard peripherals like graphics card, Ethernet card to the CPU and main memory. The study of PCIe error handling on SoC has become crucial part because of PCIe’s applications. Here are the details for PCIe error handling on a typical SoC(system on chip).PCIe provides rich set of mechanisms for error logging and handling where error handling may involve only hardware, device-specific software, or system software. This paper describes the errors associated with the PCIe interface and error while delivery of transactions between transmitter and receiver. Here are details of errors associated with each layer of PCIe, advanced error reporting (AER), advisory errors and recommendations for multiple error handling. This paper details first PCIe errors, error logging and then the error handling on a typical SoC. An Itinerary to PCIe errors and handling mechanisms: Pcie errors corresponding to each layer: PCIe is a packet-based serial bus, provides a high-speed, high-performance, point-to-point, dual simplex, differential signaling link for interconnecting devices. PCIe has three layered architecture for communication between two devices. Here are the details of the errors found at each layer. Transaction layer errors: This is upper layer, where packet is formed .The transaction layer checks are done end to end device, i.e. only by the requestor and completer and no checks at switch or bridge for below errors.
Data Link Layer Errors: This is middle layer, which is responsible for packet error and response handling .The below errors are checked at DL layer of requester, switch and completer i.e. these errors are checked at requester, switch and completer.
Physical Layer Errors: This is third layer which is responsible for link training and transaction handling at interface level. These errors are checked at requester, switch and completer.
PCIe error Classification: Based on severity, PCIe errors are categorized as below
Correctable errors are the errors which may have an impact on performance (like latency, bandwidth), but no data/information is lost and PCIe fabric remains reliable. Such errors are corrected by hardware and no software intervention is required. Examples: Bad TLP (bad LCRC or incorrect sequencer number), Bad DLLP − Replay timer timeout, Receiver error (for example, Framing error). Uncorrectable Non-fatal errors are the errors which don’t have impact on integrity of the PCI Express fabric, but data/information is lost. Non-fatal errors are corrupted transactions that can’t be corrected by PCIe hardware. However, the PCI Express fabric continues to function correctly and other transactions are unaffected, only particular transaction is affected. Recovery from a non-fatal error may or may not, depends on device-specific software associated with the requester that initiated the transaction. Examples: Poisoned TLP received, Unsupported Request (UR), Completion Timeout (CTO), Completer Abort (CA), and Unexpected Completion. Uncorrectable fatal errors are the errors which have impact on integrity of the PCI Express fabric i.e. PCIe link is no more reliable and data/information is lost. Recovery from fatal errors is done by resetting the component and link. Examples: Malformed TLP Error, Link Training Error, DLL Protocol Error, Receiver Overflow, Flow Control Protocol Error. Such classification provides to related hardware or software, a method to recover the error without resetting the components on the link and disturbing other transactions in progress. Table1:PCIe error classification
Description of common PCIe errors:
PCIe defines the transaction rules at each layer. Any transaction/packet violating these rules considered as malformed TLP. Examples: Data payload exceeds max payload size, the actual data length does not match data length specified in the header, TC to VC Mapping violation/errors.
Data poisoning is optional and indicates that data in packet is corrupted .If data is corrupted then the “EP” bit in packet header is set. The data poisoning is used in conjunction with memory, I/O, and configuration transactions that have a data payload. Data poisoning is done at the transaction layer of a device. For example when requester performs a Memory write transaction, the data (to be written) fetched from local memory, can have parity error. In Such case requester send the memory write transaction with setting “EP” field in packet header. For corrupted data, the packet is sent to recipient with “EP” bit set. The recipient will drop or process the packet, depends on implementation.
This ECRC is termed as end-to-end (ECRC) and ECRC is checked and reported by the ultimate recipient of the transaction. ECRC generation and checking is optional. If any device or system supports ECRC, it must implement advanced error reporting (AER). Examples of ECRC error are: ECRC in request packet: The completer will drop the packet and no completion will be returned .That will result in a completion time-out within the requesting device and the requester will reschedule the same transaction. ECRC in completion packet: The requester will drop the packet and error reported to the function's device driver via a function-specific interrupt.
The TL layer of PCIe provides the credit based flow control feature i.e. the transaction layer checks flow control credits( before sending packet to RX,DL layer) to ensure that the receive buffers have sufficient space to hold the transaction. There can be flow control protocol errors which will prevent transactions from being sent. These errors reported to the root complex (RC) and are considered uncorrectable. For example:
Completion transaction errors: The completion packet header has the field “cmpl status” which indicates the status of completion transaction. There are the below errors in completion transactions.
Some time, the receiver may get the completion that was not expected as per the tag /id for the packet sent by it. The typical reason for this unexpected completion is that the completion was mis-routed on its journey back to the intended requester.
As per the PCIe, the completion must be returned in specified time for the request else there will be completion timeout. The completion time-out mechanism is implemented by any device that initiates requests and require completions to be returned. The reason for completion time out can be that the completion is wrongly routed or the PHY at completer side is drop the packet. PCIe Error reporting and handling mechanisms: How the errors are reported and handled Fig1:PCIe error handling flow PCIe error reporting: Pcie provides mainly two ways for error reporting:
Error reporting by Completion Status: The completion TLP have “compl status ” field to report the error from completer to requester. Error reporting by Message TLP: The message kind of TLP introduced in PCIe to serve many purpose such as error reporting, interrupt handling etc. For error reporting, this includes identification of the device that detected the error and an indication of the severity of each error. In message TLP, there is message “code field” which gives the information about the objective of message transactions.
NOTE: Message TLPs are always routed to RC. Pcie error handling: PCIe provides two mechanisms for error handling.
Base line error reporting is done by PCI-compatible registers and PCI Express Capability registers while advanced error reporting (AER) is done by the Advanced Error Reporting registers that are mapped into extended configuration address space i.e. error reporting is done through configuration registers which are mapped into three distinct regions of configuration space.
PCI-Compatible or legacy error handling mechanism: PCIe provides registers mapping to support PCI related error. The PCI error reporting mechanism involves the assertion of signals PERR# (data parity errors) and SERR# (unrecoverable errors). The PCI Express mechanisms for handling these events are via the split transaction mechanism (transaction completions) and virtual SERR# signaling via error messages. This involves enabling error reporting and setting status bits that can be read by PCI-compliant software. There is the configuration status and command registers, which have error related bits. Below are the details of some important registers required for PCI compatible error handling.
PCI Express /native devices Error handling mechanism This is PCI Express Baseline Error Handling mechanism which has PCI Express Capability Register Set. These registers include error detection and handling bit fields regarding the nature of an error that is supplied with standard PCI error handling. The baseline capability register space is different for RC and EP mode. Fig2: PCIe Baseline capability registers structure These registers provide support for:
Below are the details of some important registers required for baseline error handling.
Setting the corresponding bit in the device control register enables the generation of the corresponding error message which reports errors associated with each classification. Unsupported Request errors are specified as Non-Fatal errors and are reported via a Non-Fatal Error Message, but only when the UR Reporting Enable bit is set.
An error status bit is set any time an error associated with its classification is detected. These bits are set irrespective of the setting of the error reporting enable bits within the device control register. Because Unsupported Request errors are by default considered Non-Fatal Errors, when these errors occur both the Non-Fatal Error status bit and the Unsupported Request status bit will be set. Note that these bits are cleared by software when writing a one (1) to the bit field.
The physical link connecting two devices may fail causing a variety of errors. Link failures are typically detected within the physical layer and communicated to the Data Link Layer. Because the link has incurred errors, the error cannot be reported to the host via the failed link. Therefore, link errors must be reported via the upstream port of switches or by the Root Port itself. Also the related fields in the PCI Express Link Control and Status registers are only valid in Switch and Root downstream ports (never within endpoint devices or switch upstream ports). This permits system software to access link-related error registers on the port that is closest to the host. Advanced Error Reporting Mechanism (this is optional) Importance of AER: AER provides the granularity and pinpoint details of correctable and uncorrectable errors. There are registers to define the error severity, error logging, error mask ability and to identify source of error. Fig3: PCIe advanced error reporting register structure Below are the details of some important registers required for advanced error handling.
When a correctable error occurs the corresponding bit within the advanced correctable error status register is set, independent of the mask register setting. These bits are automatically set by hardware and are cleared by software when writing a "1" to the bit position.
The correctable errors can also be masked by setting the corresponding bit in the register. Only affects the error reporting not the status bits. The masked errors are not logged in header log register and are not reported to RC.
These errors can selectively cause the generation of an uncorrectable error message being sent to the host system. Those uncorrectable errors that are selected to be non fatal will result in a nonfatal error message being delivered and those selected as fatal errors will result in a fatal error message delivered. However, whether or not an error message is generated for a given error is specified in the advanced uncorrectable mask register.
When an uncorrectable error occurs the corresponding bit within the advanced uncorrectable error status register bit is set, independent of the mask register setting. These bits are automatically set by hardware and are cleared by software when writing a "1" to the bit position. Advanced Uncorrectable Error severity register: AER mechanism defines the error severity handling for uncorrectable errors whether which one error is the more severe.
The uncorrectable errors can also be masked by setting the corresponding bit in the register. The default condition is to generate error messages for each type of error. Only affects the error reporting not the status bits. The masked errors are not logged in header log register and are not reported to RC.
The root complex is the target of all error messages issued by devices within the PCI Express fabric. Errors received by the RC result in status registers being updated and the error being conditionally reported to the appropriate software handler or handlers.
When RC receives an error message, it sets status bits within the root error status register. This register indicates the types of errors received and also indicates when multiple errors of the same type have been received.
The root error command register enables interrupt generation for correctable or uncorrectable errors. Basic flow chart for error handling:
Fig4: Basic flow chart for PCIe error handling Note: in above diagram: ANF:-Advisory non fatal error and DC reg:- device control register Advisory Non-Fatal errors: The error are reported and signaled as ERR_COR, ERR_NONFATAL, ERR_FATAL or not signaled at all, depending upon the role of the agent that detects the error and whether the agent implements AER. But in some cases detecting agent is not the appropriate agent to determine the ultimate disposition of the error, than the detecting agent with AER can signal the non-fatal error with ERR_COR, which serves as an advisory notification to software. For example a receiver that’s not the ultimate destination for a TLP (detects a non-fatal error with the TLP and severity is non fatal), than this “intermediate” receiver, handle this case as an Advisory Non-Fatal error and receiver with AER, signals the error (if enabled) by sending an ERR_COR message. A receiver without AER sends no error message for this case. If the severity is fatal, the error is not an Advisory Non-Fatal Error and must be signaled (if enabled) with ERR_FATAL. Other case may be where, it is required to have continue operation for uncorrectable non fatal error, than such scenario is handled as advisory non-fatal error by sending ERR_COR. For example a poisoned TLP is received by its ultimate destination, if the severity is non-fatal and the receiver deals with the poisoned data in a manner that permits continued operation, the receiver handle this case as an Advisory Non-Fatal Error. The receiver with AER, signals the error (if enabled) by sending an ERR_COR message and without AER sends no error message for this case. If the severity is fatal, the error is not an Advisory Non-Fatal Error, and must be signaled (if enabled) with ERR_FATAL. Nullified packet: This feature also called switch cut through, is development in PCIe over it’s earlier PCI. Earlier the packet at ingress port (incoming port) of switch is not sent to egress port (out going port) of switch until the tail end of packet is received and checked for CRC. In PCIe, the packet is passed from ingress port to egress port without waiting for tail end. If there is CRC error is detected on receiving tail end of TLP, than the TLP’s END is replaced with EDB (bad TLP) at egress port of switch and CRC is inverted with what it should be. The switch sends NACK for this and when reaches to end point (EP), it is discarded by EP, this is nullified TLP, EP doesn’t send any NACK for this nullified TLP(TLP with EDB tail end). After receiving the NACK, the requester again send the same TLP. PCIe error handling on a typical SoC: A typical SoC(System on Chip) consists of a core(CPU), memory blocks(RAM/FLASH), timing sources, PLL, reset handling, external/off-chip interface, industry standards peripherals such as USB/Ethernet/SPI/PCIE/ UART etc, analog interfaces like ADC/DAC,s and voltage regulators and power management controllers. The core communicates (provides stimulus in hex/binary format) with the modules (slave like PCIe) through an interface as the application layer. Here is the typical case of PCIe error handling on SoC. Core generates a MRd transaction to EP and suppose for EP, this is an unsupported request. So EP will return the completion with status field “UR” to RC. EP may also return an ERR_NONFATAL message, if enabled in EP’s Device Control Reg . And the EP logs this error in its:
For this “UR” completion packet, RC terminates the MRd transaction and returns an internal completion to the requester i.e. core .The result of such transaction is marked as error and “Bad Data” to core. And RC logs this error in its: - Secondary Status Register( for received UR completion) and Root Error Status Register , if receiving an ERR_NONFATAL message Core will not complete the instruction with the error status/“Bad Data” and core’s instruction execution will paused and core’s execution pointer jumps to interrupt handler (corresponding to the error). Now how the core will proceed further with recovery options, depends on application and vendor/implementation. Similarly core jump to interrupt handler (corresponding to error) for other errors of PCIe and take the implementation dependent actions. Requirements and recommendations for reporting multiple errors: Error pollution can occur if error conditions or root cause of error for a transaction can’t be ensured. For example suppose the DL layer detects an error, subsequent errors which occur for the same packet will not be reported by the transaction layer or suppose physical layer detects a receiver error, to avoid having this error propagate and cause subsequent errors at upper layers (for example, a TLP error at the Data Link Layer), making it more difficult to determine the root cause of the error. For such case It is required and recommended that no more than one error is reported for a single received TLP, and the below precedence (from highest to lowest) is used:
Conclusion: PCIe provides the very descriptive error reporting and handling methods. There are the various registers for handling different kinds of errors. Here the error handling methods for legacy and native devices are detailed. The actions taken by a function when an error is detected is governed by the type of error and the settings of the error-related configuration registers. The resultant actions for PCIe errors on SoCs are application and implementation specific. References: https://www.kernel.org/doc/Documentation/PCI/pcieaer-howto.txt Book:PCI Express System Architecture, Ravi Budruk, Don Anderson, Tom Shanley, MindShare, Inc.,2006 If you wish to download a copy of this white paper, click here
|
Home | Feedback | Register | Site Map |
All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved. |