PCIe error logging and handling on a typical SoC

Umesh Pratap Singh, Truechip Solutions Pvt. Ltd.

Introduction:

In today's high-speed systems, PCI Express (PCIe, Peripheral Component Interconnect Express) has become the backbone. PCIe is a third-generation, high-performance I/O bus used to interconnect peripheral devices in applications such as computing and communication platforms. It connects motherboard peripherals such as graphics cards and Ethernet cards to the CPU and main memory.

Because PCIe is so widely used, PCIe error handling has become a crucial topic on SoCs (systems on chip). PCIe provides a rich set of mechanisms for error logging and handling, where error handling may involve only hardware, device-specific software, or system software. This paper describes the errors associated with the PCIe interface and the errors that occur while transactions are delivered between transmitter and receiver. It covers the errors associated with each PCIe layer, advanced error reporting (AER), advisory errors, and recommendations for handling multiple errors.

This paper first details PCIe errors and error logging, and then error handling on a typical SoC.

An overview of PCIe errors and handling mechanisms:

PCIe errors corresponding to each layer:

PCIe is a packet-based serial bus that provides a high-speed, high-performance, point-to-point, dual-simplex, differential-signaling link for interconnecting devices. PCIe uses a three-layer architecture for communication between two devices. The errors detected at each layer are detailed below.

Transaction layer errors:

This is the upper layer, where the packet is formed. Transaction layer checks are made end to end, i.e. only by the requester and the completer; switches and bridges do not check for the errors below.

The transaction layer (TL) is responsible for checking the following errors:
ECRC check failure (optional check based on end-to-end CRC and AER)
Malformed TLP (error in packet format)
Completion Time-outs during split transactions
Flow Control Protocol errors (optional)
Unsupported Requests
Data Corruption (reported as a poisoned packet)
Completer Abort (optional)
Unexpected Completion (completion does not match any Request pending completion)
Receiver Overflow (optional check)

Data Link Layer Errors:

This is the middle layer, which is responsible for detecting packet errors and handling the response. The errors below are checked at the data link (DL) layer of the requester, switch and completer.

LCRC check failure for TLPs
Sequence Number check for TLPs
LCRC check failure for DLLPs
Replay Time-out
Replay Number Rollover
Data Link Layer Protocol errors

Physical Layer Errors:

This is the third and lowest layer, which is responsible for link training and for transaction handling at the interface level. These errors are checked at the requester, switch and completer.

Receiver errors
Link errors

PCIe error Classification:

Based on severity, PCIe errors are categorized as follows:

Correctable errors: handled by hardware
Uncorrectable errors: classified as fatal and non-fatal
- Uncorrectable non-fatal errors: handled by device-specific software
- Uncorrectable fatal errors: handled by system software

Correctable errors may affect performance (for example latency or bandwidth), but no data or information is lost and the PCIe fabric remains reliable. Such errors are corrected by hardware, and no software intervention is required.

Examples: Bad TLP (bad LCRC or incorrect sequence number), Bad DLLP, Replay Timer time-out, Receiver Error (for example, a framing error).

Uncorrectable non-fatal errors do not affect the integrity of the PCI Express fabric, but data or information is lost. Non-fatal errors are corrupted transactions that cannot be corrected by PCIe hardware.

However, the PCI Express fabric continues to function correctly and other transactions are unaffected; only the particular transaction is affected. Whether recovery from a non-fatal error is possible depends on the device-specific software associated with the requester that initiated the transaction.

Examples: Poisoned TLP received, Unsupported Request (UR), Completion Timeout (CTO), Completer Abort (CA), and Unexpected Completion.

Uncorrectable fatal errors affect the integrity of the PCI Express fabric, i.e. the PCIe link is no longer reliable and data or information is lost. Recovery from fatal errors requires resetting the component and the link.

Examples: Malformed TLP Error, Link Training Error, DLL Protocol Error, Receiver Overflow, Flow Control Protocol Error.

This classification gives the related hardware or software a way to recover from an error, where possible, without resetting the components on the link or disturbing other transactions in progress.

Table1: PCIe error classification

Type of error	Error example	PCIe layer at which the error is found
Correctable	Receiver Error	Physical
Correctable	Bad TLP	Data Link
Correctable	Bad DLLP	Data Link
Correctable	Replay Time-out	Data Link
Correctable	Replay Number Rollover	Data Link
Uncorrectable - Non Fatal	Poisoned TLP Received	Transaction
Uncorrectable - Non Fatal	ECRC Check Failed	Transaction
Uncorrectable - Non Fatal	Unsupported Request	Transaction
Uncorrectable - Non Fatal	Completion Time-out	Transaction
Uncorrectable - Non Fatal	Completer Abort	Transaction
Uncorrectable - Non Fatal	Unexpected Completion	Transaction
Uncorrectable - Fatal	Training Error	Physical
Uncorrectable - Fatal	DLL Protocol Error	Data Link
Uncorrectable - Fatal	Receiver Overflow	Transaction
Uncorrectable - Fatal	Flow Control Protocol Error	Transaction
Uncorrectable - Fatal	Malformed TLP	Transaction
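
The classification in Table 1 lends itself to a simple lookup structure. The following Python sketch transcribes the table into a dictionary and attaches the handler named for each severity class in the classification above. The names Severity, HANDLER, ERROR_TABLE and classify are illustrative only and are not part of any PCIe software interface.

```python
from enum import Enum


class Severity(Enum):
    """Severity classes used in the classification above."""
    CORRECTABLE = "correctable"
    NONFATAL = "uncorrectable non-fatal"
    FATAL = "uncorrectable fatal"


# Who is expected to handle each class (from the classification above).
HANDLER = {
    Severity.CORRECTABLE: "hardware",
    Severity.NONFATAL: "device-specific software",
    Severity.FATAL: "system software",
}

# Error name -> (severity class, layer at which it is found), from Table 1.
ERROR_TABLE = {
    "Receiver Error": (Severity.CORRECTABLE, "Physical"),
    "Bad TLP": (Severity.CORRECTABLE, "Data Link"),
    "Bad DLLP": (Severity.CORRECTABLE, "Data Link"),
    "Replay Time-out": (Severity.CORRECTABLE, "Data Link"),
    "Replay Number Rollover": (Severity.CORRECTABLE, "Data Link"),
    "Poisoned TLP Received": (Severity.NONFATAL, "Transaction"),
    "ECRC Check Failed": (Severity.NONFATAL, "Transaction"),
    "Unsupported Request": (Severity.NONFATAL, "Transaction"),
    "Completion Time-out": (Severity.NONFATAL, "Transaction"),
    "Completer Abort": (Severity.NONFATAL, "Transaction"),
    "Unexpected Completion": (Severity.NONFATAL, "Transaction"),
    "Training Error": (Severity.FATAL, "Physical"),
    "DLL Protocol Error": (Severity.FATAL, "Data Link"),
    "Receiver Overflow": (Severity.FATAL, "Transaction"),
    "Flow Control Protocol Error": (Severity.FATAL, "Transaction"),
    "Malformed TLP": (Severity.FATAL, "Transaction"),
}


def classify(error_name):
    """Return (severity, layer, handler) for an error named in Table 1."""
    severity, layer = ERROR_TABLE[error_name]
    return severity, layer, HANDLER[severity]
```

For instance, classify("Malformed TLP") yields the fatal class, the transaction layer, and system software as the handler.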

Description of common PCIe errors:

Malformed packets:

PCIe defines transaction rules at each layer. Any packet that violates these rules is considered a malformed TLP.

Examples: the data payload exceeds the maximum payload size, the actual data length does not match the data length specified in the header, or a TC-to-VC mapping violation.
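
The first two examples can be expressed as simple checks. The Python sketch below is a simplified illustration, not a complete malformed-TLP checker: the function name and parameters are hypothetical, the header length is assumed to be given directly in doublewords (4 bytes each), and special encodings of the length field are ignored.

```python
def malformed_reasons(length_dw, payload, max_payload_size):
    """Return the list of rule violations found in one TLP with data.

    length_dw        -- length from the header, in doublewords (4 bytes each)
    payload          -- the data payload actually received, as bytes
    max_payload_size -- maximum payload size in bytes configured for the device
    """
    reasons = []
    # Rule: the actual data length must match the length given in the header.
    if len(payload) != length_dw * 4:
        reasons.append("payload length does not match header length")
    # Rule: the payload must not exceed the maximum payload size.
    if len(payload) > max_payload_size:
        reasons.append("payload exceeds max payload size")
    return reasons
```

An empty list means that neither of the two rules was violated; any non-empty result would mark the packet as a malformed TLP.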

Corrupted or poisoned data errors (also called error forwarding):

Data poisoning is optional and indicates that the data in a packet is corrupted. If the data is corrupted, the “EP” bit in the packet header is set. Data poisoning is used with memory, I/O, and configuration transactions that carry a data payload, and is done at the transaction layer of a device.

For example, when a requester performs a memory write transaction, the data to be written, fetched from local memory, may have a parity error. In such a case the requester sends the memory write transaction with the “EP” bit set in the packet header.

The packet carrying the corrupted data is thus sent to the recipient with the “EP” bit set. Whether the recipient drops or processes the packet depends on the implementation.

ECRC error:

The end-to-end CRC (ECRC) is checked and reported by the ultimate recipient of the transaction. ECRC generation and checking are optional. Any device or system that supports ECRC must implement advanced error reporting (AER).

Examples of ECRC errors are:

ECRC error in a request packet: The completer drops the packet and no completion is returned. This results in a completion time-out within the requesting device, and the requester may reschedule the same transaction.

ECRC error in a completion packet: The requester drops the packet and the error is reported to the function's device driver via a function-specific interrupt.

Flow control-related errors:

PCIe provides credit-based flow control: before a packet is handed to the data link layer for transmission, the transaction layer checks the flow control credits to ensure that the receiver's buffers have sufficient space to hold the transaction.

Flow control protocol errors can prevent transactions from being sent. These errors are reported to the root complex (RC) and are considered uncorrectable.

For example:

The maximum number of data payload credits that can be reported is restricted to 2048 unused credits and 128 unused credits for headers. Exceeding these limits is considered an FC protocol error.
During flow control (FC) initialization, receivers are allowed to advertise infinite FC credits. FC update DLLPs (data link layer packets) follow the FC initialization DLLPs. For a credit type advertised as infinite, FC updates are still allowed provided that the credit value field is set to zero, in which case the recipient ignores it. If the field contains any value other than zero, it is considered an FC protocol error.

Completion transaction errors:

The completion packet header has a “completion status” field, which indicates the status of the completion transaction. The following errors can occur in completion transactions.

Unsupported Request error:
When the receiver at the other end receives a transaction that it does not support, it returns a completion with unsupported request (UR) in the “completion status” field of the packet header.

A few possible cases of an unsupported request are:
- A message request is received with an unsupported or undefined message code.
- The request does not reference address space mapped within the device.
- A Type 1 configuration request is received at an endpoint.

Completer Abort error:
This error is optional and implementation dependent. A completer that aborts a request may report the error to the root complex (RC) with a non-fatal error message, and it returns a completion with completer abort (CA) in the completion status field of the packet header.

A possible scenario for a completer abort is:

A completer receives a request that it cannot complete because the request violates the programming rules for the device. For example, some devices may be designed to permit access to a single location within a specific doubleword, while any attempt to access the other locations within the same doubleword fails.

Unexpected Completion:

Sometimes a requester receives a completion that it did not expect, i.e. one whose tag/ID does not match any request it has outstanding.

The typical reason for an unexpected completion is that the completion was misrouted on its journey back to the intended requester.

Completion Time-out:

Per the PCIe specification, the completion for a request must be returned within a specified time; otherwise a completion time-out occurs. The completion time-out mechanism is implemented by any device that initiates requests that require completions to be returned.

A completion time-out can occur because the completion was misrouted or because the PHY at the completer side dropped the packet.

PCIe Error reporting and handling mechanisms: How the errors are reported and handled

Fig1: PCIe error handling flow

PCIe error reporting:

PCIe provides two main ways of reporting errors:

By the completion status field: used by the completer to report errors to the requester; the completer or requester may be an EP or the RC.
By error message transactions: used to report errors to the host/RC.

Error reporting by Completion Status:

The completion TLP has a “completion status” field with which the completer reports an error to the requester.
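
As a minimal sketch of how software might interpret this field, the Python snippet below decodes a 3-bit completion status value. The encodings shown (000b, 001b, 010b, 100b) reflect the values commonly documented for the PCIe completion header and should be verified against the specification revision in use; the function names are illustrative.

```python
# Completion status encodings (3-bit field in the completion header).
COMPLETION_STATUS = {
    0b000: "SC (Successful Completion)",
    0b001: "UR (Unsupported Request)",
    0b010: "CRS (Configuration Request Retry Status)",
    0b100: "CA (Completer Abort)",
}


def decode_completion_status(status):
    """Map a 3-bit completion status value to a readable name."""
    return COMPLETION_STATUS.get(status & 0b111, "reserved")


def is_error_completion(status):
    """True when the completer is reporting an error (UR or CA)."""
    return (status & 0b111) in (0b001, 0b100)
```

A requester would typically treat UR and CA as errors for the associated request, while a successful completion carries the requested data or acknowledgement.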

Error reporting by Message TLP:

The message TLP was introduced in PCIe to serve many purposes, such as error reporting and interrupt handling. An error message identifies the device that detected the error and indicates the severity of the error.

A message TLP has a “message code” field that identifies the purpose of the message transaction.

Message Code	Name	Description
30h	ERR_COR	used when a PCI Express device detects a correctable error
31h	ERR_NONFATAL	used when a device detects a non-fatal, uncorrectable error
33h	ERR_FATAL	used when a device detects a fatal, uncorrectable error

NOTE: Error message TLPs are always routed to the RC.
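
The message codes in the table above can be decoded with a small lookup. This Python sketch uses only the three codes listed in the table; the dictionary and function names are illustrative.

```python
# Message code -> (message name, meaning), from the table above.
ERROR_MESSAGES = {
    0x30: ("ERR_COR", "correctable error"),
    0x31: ("ERR_NONFATAL", "non-fatal, uncorrectable error"),
    0x33: ("ERR_FATAL", "fatal, uncorrectable error"),
}


def decode_error_message(message_code):
    """Return (name, meaning) for an error message code, or None otherwise."""
    return ERROR_MESSAGES.get(message_code)
```

A result of None means the message code is not one of the three error messages, so the message serves some other purpose.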

PCIe error handling:

PCIe provides two mechanisms for error handling:

Baseline error handling mechanism. It can be further categorized as:
- PCI-compatible/legacy error handling mechanism: supports software or devices that have no knowledge of PCIe.
- PCI Express/native device error handling mechanism: supports software or devices that have knowledge of PCIe.
Advanced error reporting mechanism.

Baseline error reporting is done through the PCI-compatible registers and the PCI Express Capability registers, while advanced error reporting (AER) is done through the Advanced Error Reporting registers, which are mapped into extended configuration address space. In other words, error reporting is done through configuration registers that are mapped into three distinct regions of configuration space:

Error logging using PCI-compatible registers: This method provides backward compatibility with existing PCI-compatible software and is enabled via the PCI configuration Command Register. These errors are logged in the PCI-compatible error registers.
Error logging using PCIe capability registers: This method is used for error reporting by PCIe native devices. Error reporting is enabled via the PCI Express Device Control Register, which is mapped within PCI-compatible configuration space.
Error logging using PCIe Advanced Error Reporting registers: This optional method uses registers that are mapped into the extended configuration address space. Here PCIe allows reporting to be enabled or disabled for individual errors via the Error Mask registers.

PCI-Compatible or legacy error handling mechanism:

PCIe provides register mappings to support PCI-related errors. The PCI error reporting mechanism involves asserting the signals PERR# (data parity errors) and SERR# (unrecoverable errors). PCI Express handles these events via the split transaction mechanism (transaction completions) and virtual SERR# signaling via error messages.

This involves enabling error reporting and setting status bits that can be read by PCI-compliant software. The configuration Status and Command registers contain the error-related bits.

Some important registers required for PCI-compatible error handling are detailed below.

PCI-Compatible Configuration Command Register

Bit name in PCI	Description in PCIe
SERR# Enable	Setting this bit (1) enables the generation of the appropriate PCI Express error messages to the Root Complex. Error messages are sent by the device that has detected either a fatal or non-fatal error.
Parity Error Response	This bit enables poisoned TLP reporting. This error is typically reported as an Unsupported Request (UR) and may also result in a non-fatal error message if SERR# enable=1b. Note that reporting in some cases is device-specific.

PCI-Compatible Status Register (Error-Related Bits): This register provides bits that indicate the type of error, such as a system error or a target abort.
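
To make the register description more concrete, the following Python sketch decodes the error-related bits of the two PCI-compatible registers. The bit positions are those of the conventional PCI Command and Status register layout (Parity Error Response at bit 6 and SERR# Enable at bit 8 of Command; error bits 8 and 11 to 15 of Status) and should be checked against the specification; the function names are illustrative.

```python
# Error-related enable bits in the PCI-compatible Command register.
CMD_PARITY_ERROR_RESPONSE = 1 << 6
CMD_SERR_ENABLE = 1 << 8

# Error-related bits in the PCI-compatible Status register.
STATUS_ERROR_BITS = {
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}


def decode_status_errors(status):
    """Return the names of the error bits set in a 16-bit Status value."""
    return [name for bit, name in sorted(STATUS_ERROR_BITS.items())
            if status & (1 << bit)]


def error_messages_enabled(command):
    """True when SERR# Enable permits error messages to the Root Complex."""
    return bool(command & CMD_SERR_ENABLE)
```

Legacy software that only knows these two registers can still observe PCIe errors through them, which is the purpose of the PCI-compatible mechanism.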

PCI Express /native devices Error handling mechanism

This is the PCI Express baseline error handling mechanism, which uses the PCI Express Capability register set. These registers include error detection and handling bit fields that describe the nature of an error in more detail than standard PCI error handling provides. The baseline capability register layout differs between RC and EP mode.

Fig2: PCIe Baseline capability registers structure

These registers provide support for:

Enabling/disabling error reporting (Error Message Generation)
Providing error status
Providing status for link training errors
Initiating link re-training

Some important registers required for baseline error handling are detailed below.

Device Control Register:

Setting an error reporting enable bit in the Device Control register enables generation of the error message that reports errors of the corresponding classification. Unsupported Request errors are classified as non-fatal errors and are reported via a non-fatal error message, but only when the UR Reporting Enable bit is set.

Device Status Register:

An error status bit is set any time an error of its classification is detected, irrespective of the setting of the error reporting enable bits in the Device Control register. Because Unsupported Request errors are considered non-fatal errors by default, both the Non-Fatal Error status bit and the Unsupported Request status bit are set when such an error occurs. Software clears these bits by writing a one (1) to the bit field.
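
The interaction between the enable bits, the status bits and the write-one-to-clear behavior can be sketched as follows. This Python model assumes the usual layout in which bits 0 to 3 of both registers correspond to correctable, non-fatal, fatal and Unsupported Request; the names are illustrative and the model is not a register-accurate implementation.

```python
# Error reporting enable bits in the Device Control register.
DEVCTL_CORRECTABLE_EN = 1 << 0
DEVCTL_NONFATAL_EN = 1 << 1
DEVCTL_FATAL_EN = 1 << 2
DEVCTL_UR_EN = 1 << 3

# Error detected bits in the Device Status register (same bit order).
DEVSTA_BITS = {
    0: "Correctable Error Detected",
    1: "Non-Fatal Error Detected",
    2: "Fatal Error Detected",
    3: "Unsupported Request Detected",
}


def decode_device_status(status):
    """Return the names of the error bits set in a Device Status value."""
    return [name for bit, name in sorted(DEVSTA_BITS.items())
            if status & (1 << bit)]


def write_one_to_clear(current, written):
    """Model a software write: every bit written as 1 is cleared."""
    return current & ~written & 0xFFFF


def ur_message_enabled(devctl):
    """A UR is reported by message only when UR Reporting Enable is set."""
    return bool(devctl & DEVCTL_UR_EN)
```

After an Unsupported Request, a status value of 0b1010 would be expected (non-fatal plus UR detected), and writing 0b1010 back clears both bits.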

Link Errors: Link Control and Link Status registers

The physical link connecting two devices may fail causing a variety of errors. Link failures are typically detected within the physical layer and communicated to the Data Link Layer. Because the link has incurred errors, the error cannot be reported to the host via the failed link. Therefore, link errors must be reported via the upstream port of switches or by the Root Port itself. Also the related fields in the PCI Express Link Control and Status registers are only valid in Switch and Root downstream ports (never within endpoint devices or switch upstream ports). This permits system software to access link-related error registers on the port that is closest to the host.

Advanced Error Reporting Mechanism (this is optional)

Importance of AER: AER provides fine-grained, pinpoint detail about correctable and uncorrectable errors. It has registers to define error severity, to log errors, to mask errors and to identify the source of an error.

Fig3: PCIe advanced error reporting register structure

Some important registers required for advanced error handling are detailed below.

Advanced Correctable Error status register:

When a correctable error occurs, the corresponding bit in the advanced correctable error status register is set, independent of the mask register setting. These bits are set automatically by hardware and are cleared by software by writing a "1" to the bit position.

Advanced Correctable Error mask register:

A correctable error can be masked by setting the corresponding bit in this register. Masking affects only error reporting, not the status bits. Masked errors are not logged in the header log register and are not reported to the RC.
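
The relationship between the correctable status and mask registers can be illustrated with a short Python sketch. The bit positions follow the commonly documented AER correctable error layout (for example bit 0 for Receiver Error and bit 6 for Bad TLP) and should be verified against the specification; the function name is illustrative.

```python
# Bit positions in the AER correctable error status and mask registers.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "Replay Number Rollover",
    12: "Replay Timer Time-out",
    13: "Advisory Non-Fatal Error",
}


def reportable_correctable(status, mask):
    """Return the errors that are logged in status and not masked.

    The status bits stay set regardless of the mask; the mask only
    decides whether an error is reported to the RC.
    """
    unmasked = status & ~mask
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if unmasked & (1 << bit)]
```

With Bad TLP and Replay Timer Time-out both logged and only the latter masked, only Bad TLP would be reported.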

Advanced Uncorrectable Error handling registers:

Uncorrectable errors can selectively cause an uncorrectable error message to be sent to the host system. Errors selected as non-fatal result in a non-fatal error message, and errors selected as fatal result in a fatal error message. Whether an error message is generated at all for a given error is specified in the advanced uncorrectable error mask register.

Advanced Uncorrectable Error status register:

When an uncorrectable error occurs, the corresponding bit in the advanced uncorrectable error status register is set, independent of the mask register setting. These bits are set automatically by hardware and are cleared by software by writing a "1" to the bit position.

Advanced Uncorrectable Error severity register:

The severity register lets software select, for each type of uncorrectable error, whether the error is treated as fatal (bit set) or non-fatal (bit clear).

Advanced Uncorrectable Error mask register:

An uncorrectable error can be masked by setting the corresponding bit in this register. By default, error messages are generated for each type of error. Masking affects only error reporting, not the status bits. Masked errors are not logged in the header log register and are not reported to the RC.
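
How the three uncorrectable error registers combine can be sketched in Python as below. The bit positions follow the commonly documented AER uncorrectable error layout and should be verified against the specification. The sketch deliberately ignores the Device Control enable bits and the advisory non-fatal cases; the function name is illustrative.

```python
# Bit positions shared by the uncorrectable status, mask and severity registers.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    12: "Poisoned TLP Received",
    13: "Flow Control Protocol Error",
    14: "Completion Time-out",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Check Failed",
    20: "Unsupported Request",
}


def messages_for(status, mask, severity):
    """Return {error name: message} for every logged, unmasked error.

    A severity bit of 1 selects ERR_FATAL, 0 selects ERR_NONFATAL.
    Masked errors keep their status bit but produce no message.
    """
    result = {}
    for bit, name in UNCORRECTABLE_BITS.items():
        flag = 1 << bit
        if status & flag and not mask & flag:
            result[name] = "ERR_FATAL" if severity & flag else "ERR_NONFATAL"
    return result
```

For example, with Malformed TLP configured as fatal and Poisoned TLP Received masked, a status that logs both plus an Unsupported Request yields one ERR_FATAL and one ERR_NONFATAL message.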

Root Complex Error Tracking and reporting

The root complex is the target of all error messages issued by devices within the PCI Express fabric. Errors received by the RC result in status registers being updated and the error being conditionally reported to the appropriate software handler or handlers.

Root Complex Error Status register:

When the RC receives an error message, it sets status bits in the root error status register. This register indicates the types of errors received and also indicates when multiple errors of the same type have been received.

Root Error Command Register:

The root error command register enables interrupt generation for correctable or uncorrectable errors.
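
A minimal Python model of these two root registers is sketched below. It assumes the commonly documented layout in which the low bits of the root error status register record a first and a repeated ERR_COR, then a first and a repeated uncorrectable message, and bits 0 to 2 of the root error command register enable interrupts for correctable, non-fatal and fatal errors. The names are illustrative.

```python
# Interrupt enable bits in the root error command register.
ROOT_CMD_COR_EN = 1 << 0
ROOT_CMD_NONFATAL_EN = 1 << 1
ROOT_CMD_FATAL_EN = 1 << 2

# Bits in the root error status register.
ROOT_STA_COR_RECEIVED = 1 << 0
ROOT_STA_MULTI_COR_RECEIVED = 1 << 1
ROOT_STA_UNCOR_RECEIVED = 1 << 2
ROOT_STA_MULTI_UNCOR_RECEIVED = 1 << 3


def record_message(status, message):
    """Return the new root error status after receiving an error message."""
    if message == "ERR_COR":
        first, multiple = ROOT_STA_COR_RECEIVED, ROOT_STA_MULTI_COR_RECEIVED
    elif message in ("ERR_NONFATAL", "ERR_FATAL"):
        first, multiple = ROOT_STA_UNCOR_RECEIVED, ROOT_STA_MULTI_UNCOR_RECEIVED
    else:
        raise ValueError("not an error message: %r" % (message,))
    # A repeat of an already-logged type sets the "multiple" bit instead.
    return status | (multiple if status & first else first)


def raises_interrupt(command, message):
    """True when the root error command register enables an interrupt."""
    enable = {
        "ERR_COR": ROOT_CMD_COR_EN,
        "ERR_NONFATAL": ROOT_CMD_NONFATAL_EN,
        "ERR_FATAL": ROOT_CMD_FATAL_EN,
    }[message]
    return bool(command & enable)
```

The status is updated whether or not an interrupt is enabled, which matches the text above: errors are always recorded but only conditionally reported to software.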

Basic flow chart for error handling:

Fig4: Basic flow chart for PCIe error handling

Note: In the above diagram, ANF stands for advisory non-fatal error and DC reg for device control register.

Advisory Non-Fatal errors:

Errors are signaled as ERR_COR, ERR_NONFATAL or ERR_FATAL, or are not signaled at all, depending on the role of the agent that detects the error and on whether the agent implements AER. In some cases, however, the detecting agent is not the appropriate agent to determine the ultimate disposition of the error. A detecting agent with AER can then signal the non-fatal error with ERR_COR, which serves as an advisory notification to software. For example, when a receiver that is not the ultimate destination of a TLP detects an error in the TLP and the severity is non-fatal, this “intermediate” receiver handles the case as an Advisory Non-Fatal Error. A receiver with AER signals the error (if enabled) by sending an ERR_COR message; a receiver without AER sends no error message for this case. If the severity is fatal, the error is not an Advisory Non-Fatal Error and must be signaled (if enabled) with ERR_FATAL.

Another case is where operation should continue after an uncorrectable non-fatal error; such a scenario is also handled as an Advisory Non-Fatal Error by sending ERR_COR. For example, when a poisoned TLP is received by its ultimate destination, the severity is non-fatal, and the receiver deals with the poisoned data in a manner that permits continued operation, the receiver handles the case as an Advisory Non-Fatal Error. A receiver with AER signals the error (if enabled) by sending an ERR_COR message; a receiver without AER sends no error message for this case. If the severity is fatal, the error is not an Advisory Non-Fatal Error and must be signaled (if enabled) with ERR_FATAL.
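
The signaling rules for the two advisory cases described above reduce to a small decision function. The Python sketch below is a simplification under stated assumptions: a single reporting_enabled flag stands in for the "if enabled" conditions, and the function and parameter names are hypothetical.

```python
def message_to_signal(severity_is_fatal, advisory_case, has_aer,
                      reporting_enabled=True):
    """Return the error message an agent sends for an uncorrectable error.

    severity_is_fatal -- the error is configured or classified as fatal
    advisory_case     -- one of the advisory situations applies, e.g. an
                         intermediate receiver, or a poisoned TLP handled
                         in a way that permits continued operation
    has_aer           -- the agent implements advanced error reporting
    Returns "ERR_FATAL", "ERR_NONFATAL", "ERR_COR", or None (no message).
    """
    if not reporting_enabled:
        return None
    if severity_is_fatal:
        # A fatal error is never advisory.
        return "ERR_FATAL"
    if advisory_case:
        # Advisory non-fatal: ERR_COR with AER, nothing without AER.
        return "ERR_COR" if has_aer else None
    return "ERR_NONFATAL"
```

The notable design point is that the same underlying non-fatal error can produce three different outcomes (ERR_NONFATAL, ERR_COR, or silence) depending on the role and capabilities of the detecting agent.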

Nullified packet: This feature, also called switch cut-through, is an improvement in PCIe over store-and-forward operation, in which a packet arriving at the ingress (incoming) port of a switch is not sent to the egress (outgoing) port until the tail end of the packet has been received and its CRC checked. With cut-through, the packet is passed from the ingress port to the egress port without waiting for the tail end. If a CRC error is detected when the tail end of the TLP is received, the switch replaces the TLP's END symbol with EDB (end bad) at the egress port and inverts the CRC relative to its correct value. The switch sends a NAK for the bad TLP to the device it received the TLP from. When the TLP with the EDB tail end reaches the endpoint (EP), the EP discards it; this is a nullified TLP, and the EP does not send a NAK for it. After receiving the NAK, the original transmitter sends the same TLP again.
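
The receiver-side decision for a nullified TLP can be sketched as follows. This Python example is only an illustration of the rule: it uses zlib.crc32 as a stand-in for the PCIe LCRC (the real LCRC differs in bit ordering and coverage), represents the end-of-packet framing symbol as a plain string, and uses hypothetical function names.

```python
import zlib


def lcrc(data):
    """Stand-in for the link CRC; NOT the real PCIe LCRC computation."""
    return zlib.crc32(data) & 0xFFFFFFFF


def nullify(data):
    """What a cut-through switch emits for a TLP found to be bad:
    the inverted CRC and the EDB end symbol."""
    return data, lcrc(data) ^ 0xFFFFFFFF, "EDB"


def receive(data, crc, end_symbol):
    """Decide what a receiver does with an incoming TLP.

    Returns "accept", "discard" (nullified TLP, no NAK is sent),
    or "nak" (bad TLP, the transmitter must replay it).
    """
    expected = lcrc(data)
    if end_symbol == "EDB" and crc == expected ^ 0xFFFFFFFF:
        return "discard"
    if end_symbol == "END" and crc == expected:
        return "accept"
    return "nak"
```

A TLP produced by nullify is silently discarded by receive, while a TLP with a plain CRC mismatch is answered with a NAK, which mirrors the behavior of the EP and the switch ingress port described above.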

PCIe error handling on a typical SoC:

A typical SoC (system on chip) consists of a core (CPU), memory blocks (RAM/flash), timing sources, PLLs, reset handling, external/off-chip interfaces, industry-standard peripherals such as USB, Ethernet, SPI, PCIe and UART, analog interfaces such as ADCs and DACs, voltage regulators and power management controllers. The core communicates with the modules (slaves such as PCIe) through an interface, acting as the application layer and providing stimulus in hex/binary format. A typical case of PCIe error handling on an SoC follows.

The core generates an MRd transaction to the EP; suppose that for the EP this is an unsupported request.

The EP then returns a completion with status “UR” to the RC. The EP may also send an ERR_NONFATAL message, if enabled in the EP's Device Control register. The EP logs this error in its:

Device Status Register
Uncorrectable Error Status Register
Header Log Register

On receiving this “UR” completion, the RC terminates the MRd transaction and returns an internal completion to the requester, i.e. the core. The result of the transaction is marked as an error and the data is flagged as “bad data” to the core. The RC logs this error in its:

- Secondary Status Register (for the received UR completion), and Root Error Status Register if an ERR_NONFATAL message is received

The core does not complete the instruction that received the error status/“bad data”; instruction execution is paused and the core's execution pointer jumps to the interrupt handler corresponding to the error.

How the core proceeds with recovery from there depends on the application and on the vendor/implementation.

Similarly, for other PCIe errors the core jumps to the corresponding interrupt handler and takes implementation-dependent actions.

Requirements and recommendations for reporting multiple errors:

Error pollution can occur when the root cause of an error for a transaction cannot be determined unambiguously because a single fault is reported as several errors. For example, if the DL layer detects an error in a packet, subsequent errors that occur for the same packet should not be reported by the transaction layer. Likewise, if the physical layer detects a receiver error, that error should not propagate and cause subsequent errors at the upper layers (for example, a TLP error at the Data Link Layer), since this makes it more difficult to determine the root cause of the error.

For such cases it is recommended that no more than one error be reported for a single received TLP, and that the following precedence (from highest to lowest) be used:

Uncorrectable internal error
Receiver Overflow
Flow Control Protocol Error
Malformed TLP
ECRC Check Failed
AtomicOp Egress Blocked
TLP Prefix Blocked
ACS Violation
MC Blocked TLP
Unsupported Request (UR), Completer Abort (CA), or Unexpected Completion
Poisoned TLP Received or Poisoned TLP Egress Blocked
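
Selecting the single error to report for a TLP can be sketched directly from the precedence list above. In this Python example the list is transcribed as given, with errors that share a line given the same rank; the names PRECEDENCE, RANK and error_to_report are illustrative.

```python
# Precedence from highest to lowest; errors on one line share a rank.
PRECEDENCE = [
    ("Uncorrectable Internal Error",),
    ("Receiver Overflow",),
    ("Flow Control Protocol Error",),
    ("Malformed TLP",),
    ("ECRC Check Failed",),
    ("AtomicOp Egress Blocked",),
    ("TLP Prefix Blocked",),
    ("ACS Violation",),
    ("MC Blocked TLP",),
    ("Unsupported Request", "Completer Abort", "Unexpected Completion"),
    ("Poisoned TLP Received", "Poisoned TLP Egress Blocked"),
]

# Error name -> rank (0 is the highest precedence).
RANK = {name: rank
        for rank, group in enumerate(PRECEDENCE)
        for name in group}


def error_to_report(detected):
    """Return the one error to report for a single TLP, or None."""
    return min(detected, key=RANK.__getitem__, default=None)
```

If both a poisoned TLP and an ECRC failure are detected for the same TLP, only the ECRC failure is reported, which avoids the error pollution described above.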

Conclusion:

PCIe provides very descriptive error reporting and handling methods, with various registers for handling different kinds of errors. This paper has detailed the error handling methods for legacy and native devices.

The actions taken by a function when an error is detected are governed by the type of error and the settings of the error-related configuration registers. The resulting actions for PCIe errors on SoCs are application and implementation specific.

References:

https://www.kernel.org/doc/Documentation/PCI/pcieaer-howto.txt

Book: PCI Express System Architecture, Ravi Budruk, Don Anderson, Tom Shanley, MindShare, Inc., 2006
