Tutorial: The SA Forum's Hardware Platform Interface

By David Fick, Embedded.com
Sep 13 2005 (12:00 PM)
URL: http://www.embedded.com/showArticle.jhtml?articleID=170702087

A standard that is gaining broad acceptance and adoption in embedded systems across a variety of markets is the Service Availability Forum’s Hardware Platform Interface (HPI). It is one of three interface specifications the Forum has proposed; the others cover Application Interfaces and System Management Interfaces.

While embedded system designers may have a passing knowledge of HPI, and may even have heard of customers requiring it in the products they purchase, they likely still have many questions about it. What exactly is HPI, and why is it important in the industry and the marketplace? What hardware management capabilities does HPI provide? And how does HPI fit into a typical system architecture and design?

As shown in Figure 1 below, the Hardware Platform Interface defines a standard interface by which applications can discover, manage, and monitor the hardware resources in the system. The Application Interfaces, as defined by the Application Interface Specification (AIS), define interfaces by which applications can use the availability management and application services provided by the AIS interface implementer, such as a management middleware component.

Figure 1 - SA Forum Standard Interfaces

The System Management Interfaces, which will be defined by the forthcoming System Management Specification (SMS), cover standard SNMP- and CIM-based access to network management data related to the HPI and AIS capabilities in the system. In addition, SMS will define a Notification Service (NTF) API for sending and receiving system-level events using ITU X.73x-style notifications.

The benefits of using HPI within a system design are numerous and can be grouped into two categories: functional and business.

From a functional perspective, HPI benefits system designers in three primary areas. First, HPI allows the hardware entities in the system to be discovered, which makes it easier to represent them in the System Model. This can lead to a cohesive availability management solution that takes into account the relationships between all system resources, both hardware and software. Second, HPI enables more comprehensive system management policies by providing the facilities for fault, alarm, and hot swap management policies and actions. Third, because it is a standardized interface, HPI can be used across multiple platform form factors, such as AdvancedTCA, CompactPCI, and rack-mounted servers.

From a business perspective, a standardized interface reduces the effort required to move platform management components from one HPI-enabled platform to another. This allows system designers to focus on the information available through HPI rather than on redesigning platform management components to get the required data through another hardware management interface.

HPI Capabilities
The SA Forum initially released the Hardware Platform Interface (HPI) specification, version A.01.01 (“HPI A”), in 2002 to address the need for a standard hardware resource management interface to discover, monitor, and manage hardware resources. In 2004, an updated version of the HPI specification, version B.01.01 (“HPI B”), was released to extend the interface and to address issues in the A.01.01 version of the specification. It is important to note that implementations of the HPI B.01.01 release are not backwards compatible with the A.01.01 release. With the improvements and additional functionality provided by the latest HPI specification, most platform vendors have shifted their focus to providing support for HPI B. The HPI capabilities discussed in the remainder of this article are based on the HPI B specification.

The capabilities of HPI are focused on providing management applications with access to the information required to discover, manage, and monitor the hardware entities in a system from the point at which the system is powered on to when it is powered off. To meet these requirements, HPI is defined to provide nine critical capabilities: resource and entity discovery; reset state management; power state management; managed hot swap; alarm management; event notification; configuration; system and resource event logging; and management instruments associated with HPI entities.

Resource and entity discovery
HPI allows users to enumerate the set of hardware elements that are manageable within the system along with the set of management capabilities those elements have. Within HPI there are three essential concepts related to hardware elements:
1) Entity – A hardware element that has a unique identifier and a set of associated management capabilities.
2) Resource – Provides access to the management capabilities of one or more hardware entities. There is a primary entity associated with an HPI resource, and in some cases the resource may also provide management access to other HPI entities. For example, the management capabilities of a mezzanine card may be exposed through the HPI resource for the host card to which it is attached. An HPI resource that is not a FRU (see next hardware element type) is considered a fixed resource that cannot be inserted into or removed from the system while the system is powered on.
3) Field Replaceable Unit (FRU) – A special type of HPI resource which can be inserted into and extracted from the system while power is applied to the system, i.e., it can be hot swapped. Through HPI, the hot swap state of FRUs can be monitored and even managed.

By allowing users to discover the hardware elements that are available in the system, HPI simplifies the task of creating a model of the system under management for the purposes of defining the appropriate availability and fault management policies for the system.
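As a rough illustration of how discovered resources might feed a system model, the sketch below uses plain Python data classes. It is not the HPI C API: every class, field, and entity path here is a hypothetical stand-in chosen only to mirror the entity, resource, and FRU concepts described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """A hardware element with a unique identifier (hypothetical model)."""
    path: str
    capabilities: List[str] = field(default_factory=list)


@dataclass
class Resource:
    """Gives management access to a primary entity and, optionally, others."""
    resource_id: int
    primary: Entity
    extra_entities: List[Entity] = field(default_factory=list)
    is_fru: bool = False  # True if the resource can be hot swapped

    def entities(self) -> List[Entity]:
        return [self.primary, *self.extra_entities]


def build_system_model(resources: List[Resource]) -> dict:
    """Map every discovered entity path to the resource that manages it."""
    model = {}
    for res in resources:
        for ent in res.entities():
            model[ent.path] = {
                "resource_id": res.resource_id,
                "resource_is_fru": res.is_fru,
                "capabilities": list(ent.capabilities),
            }
    return model


# Made-up discovery result: a board that also exposes its mezzanine card,
# plus a fan tray.
board = Entity("chassis-1/slot-3/board", ["reset", "power", "sensors"])
mezz = Entity("chassis-1/slot-3/board/mezzanine-1", ["sensors"])
fan = Entity("chassis-1/fan-tray-1", ["sensors"])
discovered = [
    Resource(1, board, extra_entities=[mezz], is_fru=True),
    Resource(2, fan, is_fru=True),
]
model = build_system_model(discovered)
```

The mezzanine card ends up in the model under resource 1, which reflects the case where a host card's resource exposes the management capabilities of an attached entity.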

Managing reset and power states
Many HPI resources (especially FRUs) support the capability for their reset state to be monitored and controlled through HPI. For resources that support this capability, an HPI user can get the reset state of the resource (asserted or de-asserted) as well as perform a reset action on the hardware resource. Reset actions include asserting the reset line, de-asserting the reset line, and performing a cold or warm reset of the resource. This capability is critical for fault management policies that attempt to repair a fault condition either by resetting the resource or by using the reset capability as part of a component replacement procedure.

Similar to reset state management, many HPI-enabled resources support the capability for their power state to be monitored and controlled through HPI. For such resources, an HPI user can get the power state of the resource and can perform a power state action on the resource. Valid power state actions include powering the resource on or off, as well as power cycling the resource. This capability is critical for fault management policies that attempt to repair a fault condition either by restarting the resource through a power cycle or by using the power control capability as part of a procedure to replace a faulty component.
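One way a fault management policy might combine the reset and power capabilities is an escalation: try a reset, then a power cycle, and finally power the resource off to isolate it. The sketch below is only an illustration of that idea. The `FakeResource` class, the action names, and the escalation order are all assumptions, not part of HPI.

```python
class FakeResource:
    """Stand-in for a resource with reset and power capabilities."""

    def __init__(self, heals_after=None):
        # Name of the action that clears the simulated fault, or None
        # if nothing will clear it.
        self.heals_after = heals_after
        self.healthy = False
        self.log = []

    def apply(self, action):
        self.log.append(action)
        if action == self.heals_after:
            self.healthy = True


# Hypothetical escalation order, least to most disruptive.
REPAIR_STEPS = ["cold_reset", "power_cycle"]


def repair(resource):
    """Try each repair step in turn; isolate the resource if none works.

    Returns the step that repaired the resource, or None if it was isolated.
    """
    for step in REPAIR_STEPS:
        resource.apply(step)
        if resource.healthy:
            return step
    resource.apply("power_off")  # isolate the faulty component
    return None


recoverable = FakeResource(heals_after="power_cycle")
fixed_by = repair(recoverable)

dead = FakeResource()
dead_result = repair(dead)
```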

Hot swap management
As described previously, in HPI there is a special class of HPI-enumerated resources called FRUs that support “hot swap”, i.e., they can be inserted into and extracted from the system while power is applied. FRUs support one of two hot swap state models, where the model defines the valid hot swap states and state transitions for the FRU. The first model is the simplified hot swap model that has only two states, one indicating the FRU is present and one indicating the FRU is no longer present. The user has no ability to control hot swap sequences for FRUs that support the simplified hot swap model. Examples of hardware elements that typically support the simplified hot swap model include fans, fan trays, and power supplies.

The second hot swap model in HPI is called the managed hot swap model. FRUs that support this model transition through additional hot swap states, allowing HPI users greater control over the insertion and extraction of such resources. When a hot swap insertion or extraction sequence begins for a FRU, the HPI service waits a configurable amount of time to see if any user wants to take control of the hot swap sequence. If no user requests control within that timeout period, the HPI service for the platform will perform the default hot swap sequence actions for the resource as it transitions through the managed hot swap model. Examples of actions that are taken during a hot swap sequence include changing the power and reset states of the resource, as well as updating the resource’s hot swap indicator (typically an LED) to signal the operator when the hot swap sequence has been completed.

More significantly, HPI also allows a user to take control of the hot swap sequence of a managed hot swap FRU, which increases the manageability of the system. For example, if an operator attempts to remove a FRU from a system that has software resources running on it with no associated standby resources, the HPI user can reject the extraction sequence. Or if a system administrator attempts to insert a FRU of the wrong type or version into the system, the HPI user can detect the improper configuration and reject the insertion sequence.
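The managed model can be pictured as a small state machine in which a pending insertion or extraction is either granted or rejected. The sketch below is a simplified illustration: the state and event names are chosen for readability rather than taken from the specification, surprise removal is not modeled, and the timeout is simulated with plain numbers instead of a real clock.

```python
# (current state, event) -> next state; names are illustrative only.
TRANSITIONS = {
    ("not_present", "inserted"): "inactive",
    ("inactive", "insertion_requested"): "insertion_pending",
    ("insertion_pending", "granted"): "active",
    ("insertion_pending", "rejected"): "inactive",
    ("active", "extraction_requested"): "extraction_pending",
    ("extraction_pending", "granted"): "inactive",
    ("extraction_pending", "rejected"): "active",
    ("inactive", "removed"): "not_present",
}


def step(state, event):
    """Return the next hot swap state, or raise if the move is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state!r}") from None


def resolve_pending(state, timeout_s, user_decision=None, claimed_at_s=None):
    """Settle a pending sequence.

    If a user claimed control within the timeout, their decision applies.
    Otherwise the service performs the default action, modeled as a grant.
    """
    in_time = claimed_at_s is not None and claimed_at_s <= timeout_s
    if user_decision is not None and in_time:
        return step(state, user_decision)
    return step(state, "granted")


pending = step("active", "extraction_requested")
# A user claims control after 1 s of a 5 s window and rejects the extraction.
kept = resolve_pending(pending, 5.0, user_decision="rejected", claimed_at_s=1.0)
# Nobody claims control, so the default sequence completes.
released = resolve_pending(pending, 5.0)
```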

Managing alarms
The HPI alarm management service maintains a list of the current active alarm conditions in the system. Alarm conditions are automatically added to and removed from the alarm list based on sensor and resource failure event notifications, and users can access and monitor the active alarm list through HPI. Additionally, users can add their own alarm conditions to the active alarm list, with the appropriate severity, and remove them again. However, depending on the HPI implementation, there may be a limit to the number of alarms a user can add to the active alarm list. HPI users can also acknowledge alarm conditions, which may affect the manner in which the alarm condition is announced.

An HPI service may optionally update the platform and individual entity alarm annunciation devices, such as LEDs, LCD displays, and audible indicators, to reflect the severity of the alarms currently in the active alarm list. When the HPI service provides this functionality, the user does not need to determine how to annunciate alarm conditions unless they have specific requirements that are not met by the default annunciation policies.

The benefit of the HPI alarm management capabilities is that an HPI user can easily determine the current alarm conditions, e.g., at system startup, and add their own alarm conditions to the list. This reduces the need for a user to develop a custom alarm management service, since one is already provided with HPI.
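To make the behavior of an active alarm list concrete, here is a minimal sketch. The `AlarmList` class, its method names, and the limit of two user alarms are assumptions made for illustration and do not reflect the HPI function signatures.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Alarm:
    alarm_id: int
    severity: str  # "minor", "major" or "critical"
    text: str
    user_defined: bool
    acknowledged: bool = False


class AlarmList:
    """Toy active alarm list; names and limits are illustrative only."""

    def __init__(self, max_user_alarms=2):
        self._ids = count(1)
        self._alarms = {}
        self.max_user_alarms = max_user_alarms

    def add(self, severity, text, user_defined=False):
        if user_defined:
            in_use = sum(a.user_defined for a in self._alarms.values())
            if in_use >= self.max_user_alarms:
                raise RuntimeError("user alarm limit reached")
        alarm = Alarm(next(self._ids), severity, text, user_defined)
        self._alarms[alarm.alarm_id] = alarm
        return alarm.alarm_id

    def acknowledge(self, alarm_id):
        self._alarms[alarm_id].acknowledged = True

    def remove(self, alarm_id):
        del self._alarms[alarm_id]

    def active(self, unacknowledged_only=False):
        return [a for a in self._alarms.values()
                if not (unacknowledged_only and a.acknowledged)]


alarms = AlarmList()
# An alarm the service would add on its own after a sensor event.
temp_id = alarms.add("major", "slot 3 temperature above threshold")
# An alarm added by the HPI user for an application-level condition.
app_id = alarms.add("minor", "database replica lagging", user_defined=True)
alarms.acknowledge(temp_id)
```

At startup, a management application could read `alarms.active()` once to learn the current conditions and then rely on events for later changes.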

Performing event notifications
HPI defines an event delivery mechanism by which interested applications can receive event notifications through HPI that identify changes in the state of the system or individual entities. An HPI implementation can generate many different types of event notifications, but the most commonly utilized are:
1) Sensor events – Identify the change in the state of a sensor, such as a temperature sensor exceeding or dropping below one of its threshold levels
2) Hot swap events – Identify a change in the hot swap state of a simplified or managed hot swap FRU
3) Resource failure events – Identify whether a resource has failed, has been restored to a healthy state, or has been added to the system

The listed HPI event types are very useful in triggering system and availability management policies.
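The way these event types can trigger management policies is sketched below with a small publish and subscribe helper. The `EventBus` class, the event type strings, and the handlers are hypothetical; a real application would receive events through the HPI event functions instead.

```python
from collections import defaultdict


class EventBus:
    """Minimal dispatcher that routes events to handlers by event type."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, source, detail):
        """Deliver one event to every handler registered for its type.

        Returns the number of handlers that were called.
        """
        handlers = self._handlers[event_type]
        for handler in handlers:
            handler(source, detail)
        return len(handlers)


actions = []  # records what each policy decided to do
bus = EventBus()
bus.subscribe("sensor", lambda src, d: actions.append(f"raise alarm: {src} {d}"))
bus.subscribe("hot_swap", lambda src, d: actions.append(f"update model: {src} is {d}"))
bus.subscribe("resource_failure", lambda src, d: actions.append(f"fail over services on {src}"))

bus.publish("sensor", "slot-3/temp", "upper major threshold crossed")
bus.publish("resource_failure", "slot-3", "failed")
```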

Configuration and event logging
For resources that provide the capability, HPI provides the ability to read and write configuration settings for a specified entity. HPI defines a set of mandatory configuration settings, such as the auto-insertion timeout period for hot swap extraction sequences, whether associated sensors are enabled, and sensor threshold values. There are also optional configuration settings that may be supported, such as the state and modes of associated controls, watchdog configuration settings, and the entity inventory data records.

Configuration settings have factory default values that can be modified through HPI’s configuration parameter API and then saved to non-volatile memory to override the factory defaults. Additionally, configuration settings can be reset to their factory default values after new configuration values have been stored.
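The relationship between factory defaults, saved values, and a reset to defaults can be sketched as follows. The `ParameterStore` class and the setting names are invented for the example, and the choice that a factory reset also discards the saved values is an assumption about one reasonable behavior.

```python
class ParameterStore:
    """Toy model of configuration settings with factory defaults."""

    def __init__(self, factory_defaults):
        self._factory = dict(factory_defaults)
        self._saved = {}  # stands in for non-volatile memory
        self.current = dict(factory_defaults)

    def set(self, name, value):
        if name not in self._factory:
            raise KeyError(name)
        self.current[name] = value

    def save(self):
        """Persist current values so they override the factory defaults."""
        self._saved = dict(self.current)

    def power_on(self):
        """Simulate a restart: factory defaults overlaid with saved values."""
        self.current = {**self._factory, **self._saved}

    def reset_to_factory(self):
        """Discard saved overrides and return to the factory defaults."""
        self._saved = {}
        self.current = dict(self._factory)


store = ParameterStore({"temp_upper_major_c": 70, "sensor_enabled": True})
store.set("temp_upper_major_c", 65)
store.save()
store.power_on()
after_restart = store.current["temp_upper_major_c"]
store.reset_to_factory()
after_reset = store.current["temp_upper_major_c"]
```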

HPI-enabled platforms that retain historical HPI events at the system or resource level can expose those event logs through HPI as system or resource-level event logs. These event logs can be used by an operator or developer to analyze a hardware entity failure and see the sequence of HPI events that occurred before and after the failure.

How HPI manages entity instrumentation
To provide the critical measurement and assessment capabilities any system will need, HPI is defined such that an entity may have one or more management instruments associated with it relating to sensors, controls, inventory data, watchdog timers and annunciators.

Sensor instruments provide information on an HPI entity through the measurement of a critical hardware entity attribute, such as voltage sensors that indicate the voltage level on critical power lines or temperature sensors that indicate the temperature level on different components in the system.

As the state of a sensor changes, HPI will also send event notifications to all interested subscribers identifying the change in sensor state, such as a temperature sensor exceeding a critical temperature threshold. Sensors serve as an essential mechanism for monitoring the health of entities in the system, and sensor-related events can be utilized to drive fault and alarm management policies.
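A threshold sensor can be thought of as comparing each new reading with the previous one and reporting which thresholds were crossed. The function below is a simplified sketch that handles upper thresholds only; the threshold names and values are made up for the example, and real sensors typically add hysteresis, which is omitted here.

```python
def threshold_events(previous, current, thresholds):
    """Return (threshold name, change) pairs for each upper threshold crossed.

    A threshold is asserted when the reading rises to or above its limit and
    deasserted when the reading drops back below it.
    """
    events = []
    for name, limit in thresholds.items():
        if previous < limit <= current:
            events.append((name, "asserted"))
        elif current < limit <= previous:
            events.append((name, "deasserted"))
    return events


# Hypothetical temperature thresholds in degrees Celsius.
TEMP_THRESHOLDS = {"upper_minor": 60, "upper_major": 70, "upper_critical": 80}

rising = threshold_events(55, 72, TEMP_THRESHOLDS)
falling = threshold_events(72, 65, TEMP_THRESHOLDS)
steady = threshold_events(65, 66, TEMP_THRESHOLDS)
```

Here `rising` reports that the minor and major thresholds were asserted, `falling` reports that the major threshold was deasserted, and `steady` is empty because no limit was crossed.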

HPI also makes provisions for control instruments, which provide read and potentially write access to control devices associated with a hardware entity, such as LEDs, dry contact closures, LCD displays, and audible alarm indicators. Controls allow an HPI user to customize the manner in which information such as alarms is communicated to the system administrator.

It also makes allowances for inventory data instruments, which provide inventory management information about the hardware entity. This usually includes information such as the manufacturer ID, product name, product version, serial number, and part number for the chassis, product, or an individual entity.

In some HPI-enabled systems, the HPI user can also update or add to the inventory data associated with an entity. Inventory data accessible through HPI improves the manageability of the system by allowing management policies to incorporate inventory data checks to ensure the proper entities and versions of those entities are being used in the system.

Watchdog timers and annunciators
Within HPI, watchdog timers provide a mechanism to monitor the health of a system by ensuring that critical aspects of the system are progressing, such as BIOS operations or the loading of the operating system. Watchdog timers can have both an associated pre-timer interrupt action and timer expiration actions, such as power off, power cycle, and reset actions.

Pre-timer interrupt actions effectively serve as a warning that the watchdog timer is nearing expiration, allowing any additional management actions to be applied. Once the watchdog timer expires, the associated action is applied to the entity. Watchdog timers are another means by which management applications can monitor and react to changes in the health of hardware entities in the system.
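The interplay between the pre-timer interrupt and the expiration action is easier to see in a small simulation. The `Watchdog` class below is a hypothetical model that advances time manually through `tick()`; it is not the HPI watchdog API, and the parameter names are assumptions.

```python
class Watchdog:
    """Simulated watchdog timer with a pre-timer interrupt."""

    def __init__(self, timeout_s, pre_timeout_s, expire_action):
        self.timeout_s = timeout_s
        self.pre_timeout_s = pre_timeout_s  # warn this long before expiry
        self.expire_action = expire_action  # e.g. "reset" or "power_cycle"
        self.remaining = timeout_s
        self.pre_fired = False
        self.expired = False

    def kick(self):
        """Restart the countdown; software calls this while it is healthy."""
        if not self.expired:
            self.remaining = self.timeout_s
            self.pre_fired = False

    def tick(self, elapsed_s):
        """Advance simulated time and return what happened, in order."""
        if self.expired:
            return []
        self.remaining -= elapsed_s
        happened = []
        if not self.pre_fired and self.remaining <= self.pre_timeout_s:
            self.pre_fired = True
            happened.append("pre-timer interrupt")
        if self.remaining <= 0:
            self.expired = True
            happened.append(self.expire_action)
        return happened


wd = Watchdog(timeout_s=10, pre_timeout_s=3, expire_action="reset")
quiet = wd.tick(5)    # 5 s left, nothing happens
wd.kick()             # healthy software restarts the countdown
warning = wd.tick(8)  # 2 s left, inside the pre-timer window
expiry = wd.tick(2)   # countdown reaches zero
```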

Under HPI, annunciators are abstract control elements, each of which can have a set of alarm conditions associated with it. They ensure, based on the severity of the associated alarm conditions, that the alarms are properly annunciated through the platform’s and the entity’s alarm indicators. This eliminates the need for a user to know about the alarm annunciation devices for a platform or entity. Instead, the user just adds and removes alarm conditions on annunciator management instruments, and the HPI service applies the appropriate alarm annunciation for the system.
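The idea that the user manages conditions while the service decides what to display can be sketched as below. The `Annunciator` class, the three severity levels, and the mapping from severity to indicator are assumptions for illustration; the actual mapping is up to the platform.

```python
SEVERITY_RANK = {"minor": 1, "major": 2, "critical": 3}

# Hypothetical mapping from the worst active severity to the indication.
INDICATION = {
    None: "alarm LEDs off",
    "minor": "minor LED on",
    "major": "major LED on",
    "critical": "critical LED on, audible alarm sounding",
}


class Annunciator:
    """Holds alarm conditions and derives the indication from the worst one."""

    def __init__(self):
        self._conditions = {}  # condition name -> severity

    def add(self, name, severity):
        if severity not in SEVERITY_RANK:
            raise ValueError(severity)
        self._conditions[name] = severity

    def remove(self, name):
        self._conditions.pop(name, None)

    def indication(self):
        """Return what the platform should show for the worst condition."""
        if not self._conditions:
            return INDICATION[None]
        worst = max(self._conditions.values(), key=SEVERITY_RANK.get)
        return INDICATION[worst]


panel = Annunciator()
panel.add("fan-2 slow", "minor")
panel.add("slot-3 overtemp", "critical")
shown = panel.indication()
```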

In general, the management instruments associated with a resource are related to the primary hardware entity with which the resource is associated. But in some cases, resources will have management instruments that are accessible through the resource that are actually associated with a different HPI entity. For example, a simple hardware entity that cannot be directly accessed by the platform management service, such as a mezzanine card attached to a single-board computer, may have its management instruments made accessible through the HPI resource that contains the simple entity.

Using HPI in an embedded system
While conceptually standard interfaces are always appealing, the true test of a standard interface is how well it fits into the designs of the systems, embedded and otherwise. Depending on system requirements, time available in the schedule, etc., there are innumerable ways to potentially design a system using a standard interface such as HPI. As an example, consider the system design shown in Figure 2 below, already being used in embedded telecommunications applications.

Figure 2 - System Management View using HPI

As this system design example will illustrate, HPI enables sophisticated system and availability management policies to be easily defined using a standard and portable hardware management interface. In this application, the platform management component illustrated is actually below the level at which HPI would be relevant. It is shown here to illustrate that the platform management logic provided with the platform is typically responsible for managing the critical power and thermal subsystems.

The HPI service component provides an HPI client library that supports the HPI API functions and which can be linked into HPI user applications. The architecture of the HPI service may vary from platform to platform, but since the HPI user applications rely only on the HPI API functions, the HPI service architecture is not relevant.

In this design, the management middleware component plays a number of critical roles. Most importantly it provides a System Model to represent the state of both hardware and software resources along with the dependency and redundancy relationships between the various resources. The System Model is used to drive the availability management policies for the system, such as failing over to standby resources if a resource actively providing a service fails.

This component also represents the hardware entities discovered through HPI in the System Model, including configuration and dependency information for the hardware entities, and updates the state of the hardware entities in the System Model as changes are identified through HPI. It also provides other services useful to embedded system applications, such as cluster management and distributed messaging services.
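A heavily simplified picture of how a System Model might drive failover when a hardware entity fails is given below. The `SystemModel` class, the blade and service names, and the one-standby-per-service layout are all hypothetical. The sketch also ignores the case where the failed entity was only hosting a standby.

```python
class SystemModel:
    """Toy model linking services to the hardware entities they run on."""

    def __init__(self):
        self.hardware = {}  # entity name -> "up" or "failed"
        self.services = {}  # service name -> {"active": ..., "standby": ...}

    def add_hardware(self, name):
        self.hardware[name] = "up"

    def add_service(self, name, active, standby=None):
        self.services[name] = {"active": active, "standby": standby}

    def hardware_failed(self, name):
        """Mark an entity failed and fail over services that ran on it."""
        self.hardware[name] = "failed"
        outcome = {}
        for svc, roles in self.services.items():
            if roles["active"] != name:
                continue
            standby = roles["standby"]
            if standby is not None and self.hardware.get(standby) == "up":
                roles["active"], roles["standby"] = standby, None
                outcome[svc] = f"failed over to {standby}"
            else:
                roles["active"] = None
                outcome[svc] = "service lost"
        return outcome


system = SystemModel()
system.add_hardware("blade-1")
system.add_hardware("blade-2")
system.add_service("call-control", active="blade-1", standby="blade-2")
system.add_service("logging", active="blade-1")
result = system.hardware_failed("blade-1")
```

In this run the service with a healthy standby moves to the second blade, while the service without one is reported as lost, which is the kind of information an availability policy would act on.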

The Fault Management component is responsible for detecting and managing fault conditions as they occur in the system, whether they relate to hardware or software resources. When it detects or is notified that a fault has occurred, it applies the fault management policies that have been defined for the system. The policies performed by this component involve fault detection, isolation, recovery, repair, and notification.

Its first function is to analyze the HPI and other system event notifications to identify fault conditions and isolate faulty components from the remainder of the system, e.g., by powering off or indefinitely asserting the reset line of failed hardware components. It also drives fault recovery policies by updating the state of the failed resources in the System Model to allow the Management Middleware to implement the defined availability management policies for the resource. It then attempts to repair the failed component (e.g., power cycle or reset a failed hardware entity) and notifies the appropriate system entities of the existence of failed components.

The Alarm Management component in this application maintains the active alarm list for the system and allows users to add and remove custom alarm conditions and to acknowledge alarm conditions. It also annunciates the current alarm conditions on the platform alarm annunciation devices. The component potentially builds upon the alarm list maintained within HPI to provide much of this functionality. But the component extends beyond the HPI alarm management capabilities by automatically adding alarms for failed software resources to the active alarm list and ensuring that all active alarms are properly annunciated on the platform.

The Hot Swap Management component manages the hot swap sequences of FRUs. As mentioned earlier, HPI allows a user to override the default hot swap sequence actions that would otherwise be performed by the HPI service. But when an HPI user takes control of a hot swap sequence, it is required to perform all of the default actions for the resource as well as any custom actions or policies. Under such conditions, this component will be responsible for three critical operations: hot swap sequence grant policies, system actions, and resource actions.

Hot swap sequence grant policies, generally different for insertion and extraction sequences, determine whether a hot swap sequence should be allowed when it is first initiated. The hot swap management component provides default grant policies, while also allowing the user to provide their own custom policies.

The hot swap management component is designed to perform defined system actions in reaction to the hot swap sequence being granted. A prime example is that the system resources that depend on a FRU being extracted are gracefully switched over to their associated standby resources on other nodes. This allows the software resources running on the FRU being extracted to gracefully switch the service over to other software resources.

Once an HPI user takes control of a hot swap sequence, it also needs to perform the appropriate actions against the entity. As a result, the hot swap management component also needs to take these hardware entity actions to complete the hot swap sequence.
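The three operations of the hot swap management component (grant policy, system actions, resource actions) can be lined up in a single function, as in the sketch below. All names here are hypothetical, and the resource actions are reduced to log strings standing in for the power and indicator changes the component would request through HPI.

```python
def require_standby(fru, services):
    """Example grant policy: refuse if a service on the FRU has no standby."""
    return all(svc["standby"] is not None
               for svc in services.values() if svc["active"] == fru)


def handle_extraction(fru, services, grant_policy=require_standby):
    """Run one user-controlled extraction; return the ordered actions taken."""
    # 1. Grant policy: decide whether the sequence may proceed at all.
    if not grant_policy(fru, services):
        return ["reject extraction"]
    log = ["grant extraction"]
    # 2. System actions: move dependent services to their standbys.
    for name, svc in services.items():
        if svc["active"] == fru:
            svc["active"], svc["standby"] = svc["standby"], None
            log.append(f"switch {name} to {svc['active']}")
    # 3. Resource actions: the default hardware steps the user now owns.
    log += [f"power off {fru}", f"set hot swap indicator on {fru}"]
    return log


protected = {"call-control": {"active": "blade-1", "standby": "blade-2"}}
granted_log = handle_extraction("blade-1", protected)

unprotected = {"logging": {"active": "blade-1", "standby": None}}
rejected_log = handle_extraction("blade-1", unprotected)
```

A custom policy can be supplied through the `grant_policy` argument, for example one that always grants extraction during a maintenance window.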

More information about HPI and the other interface standards can be obtained at the Service Availability Forum web site. There you can also find information on products that are Service Availability Forum Registered (including HPI and others).

David Fick is a System Architect at GoAhead Software.