Optimizing System Management in the Platform SoC Era

Howard Pakosh, ChipStart
November 2010

Introduction

Consumer focused SoCs have evolved into platform architectures that are now driven by requirements from operating systems such as Android, iPhone, Linux, and Windows, and by the thousands of applications they support. Over time, more of the system is moving into silicon. As a result, system management functions have moved into the SoC. Traditional feature based regression testing at the silicon level must now be increasingly complemented with complex system level testing in order to maintain a high level of system coverage across SoC road maps.

Balancing price, performance, and power against high system level test coverage therefore creates complex system management design challenges that affect both hardware and software operation. System management must now be considered a central feature and responsibility of the SoC architecture, not just a tactical design consideration in the development of each individual SoC. System management should provide adequate synchronization of hardware state changes driven by software, maintain a reasonable time to market, and maximize system test coverage and support.

The remainder of this paper discusses design considerations and compares three system management architectures. The first is ad hoc system management, which combines hardware and software elements that serve a dual purpose: normal operation and system management. The second includes system management as part of the on-chip interconnect implementation. The third introduces a control plane approach to system management that complements the data centric global interconnect.

Finally, the paper discusses the growing importance of integrated subsystem design and IP for SoCs, and how system level partitioning will play a growing role in achieving efficient system management.

System Management Design Considerations

One of the key challenges associated with designing SoC system management schemes stems from the growing number of programmable devices on-chip. Programmable devices exponentially increase the number of combinations of software operations that drive hardware state changes in real time. This in turn complicates the system level testing needed to achieve reasonable test coverage. Optimizing the SoC design for a single operating system provides little relief, because the diversity of applications running on the SoC continues to multiply the testing complexities at the system level.
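The exponential growth described above can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each programmable core exposes a fixed number of independent hardware states (for example power, clock, and operating modes); the figure of 8 states per core is hypothetical and does not come from any particular SoC.

```python
from math import prod

def combined_states(states_per_core):
    """Upper bound on joint hardware states, assuming each core's
    states can occur independently of every other core's states."""
    return prod(states_per_core)

# Hypothetical figure: every programmable core exposes 8 distinct states.
for n_cores in (1, 2, 4, 8):
    print(n_cores, combined_states([8] * n_cores))
# 1 8
# 2 64
# 4 4096
# 8 16777216
```

Under these assumptions, doubling the number of cores squares the number of joint states a system level regression suite would have to consider.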

System level testing via traditional silicon level functional and data path regressions must now be augmented by system functional test suites that include the programmable elements and their impact on hardware state changes. Each programmable core can be isolated and tested to achieve a high level of code coverage, and each execution path through the different core combinations can be tested. However, the combinations of hardware state changes that application behavior requires make it almost impossible to achieve adequate system level coverage solely by testing the cores and the buses in isolation, or even in pseudo random combinations.

It is at this point that compromises are often made in the SoC design. How much risk is affordable when trading off the cost and time of building these complex system level regression suites against the test coverage actually achieved? As volumes grow, the answer is that risk must be mitigated, so it becomes essential to minimize these tradeoffs.

This paper challenges the increasing “tax” that current system management architecture assumptions impose on project costs in order to balance adequate system level test coverage against risk.

Specifically, instead of continuing to grow regression suites and make risk choices based on the assumption that the levels of hardware and system testing are tightly coupled, abstraction layers can be inserted into the architecture to decouple the hardware, operating system, and application support functions. Furthermore, each of these components can be tested through independent elements introduced into the SoC architecture.

In fact, this trend has already begun. The growing use of decoupled global interconnect structures, such as those that employ OCP or similar features, provides a proven example of how to ease chip architecture design as it evolves from single to multicore or multi-layer. By “abstracting” the data plane, and allowing the associations between the IP cores to become linked through the independent global interconnect structure, system performance at the hardware level becomes more predictable and tunable (CPU to off-chip memory, for example). This predictability affords opportunities to streamline the design process because these loosely coupled associations are less affected by specific design changes. This leads to more rapid timing closure even though the complexity of the data plane has grown significantly.

Similar abstraction techniques can be applied to system management. The software and hardware layers, the system management, and the functional operation of the SoC can be decoupled, making it easier to test each component of the system level architecture while still accounting for hardware state changes driven at the system level. This results in a system level design that is more easily understood and has better test coverage. This approach also abstracts the operational complexities of system management between hardware and software, even as the number of applications grows.

The next section of the paper discusses three potential methods of abstraction that optimize system management to varying degrees.

System Management Scheme Comparisons

Given that the objective is to reduce overall system management complexity, there are three baseline characteristics against which system management schemes should be benchmarked:

How well does the approach achieve independence between the silicon, operating system, and application layers?
How flexibly does the approach adapt to each derivative design in a SoC road map?
How much test coverage does the resultant system management scheme achieve for the SoC architecture?

By applying these benchmark criteria, three methods can be evaluated.

Method 1: Using a single operating system hosted on a “master” CPU. This has been a popular approach to perform system management because silicon elements already required for real time operation also execute system management functions.

When SoC complexity is relatively low, this scheme is very efficient: it requires no extra silicon and only some extra software development, which is very containable.

However, the growth in complexity associated with today’s multicore consumer SoCs has weakened the effectiveness of this approach. As system tasks become distributed, and therefore more interdependent as cores are added to the SoC, the visibility and control that any one core has over the others is reduced with each new element. Visibility and control become more dependent on the global interconnect as well as on the cores, which adds even more complexity to executing control functions. The global interconnect must be included in system testing in this case because it controls access to external memory, a key element in system operations.

If the master CPU can no longer manage and verify the hardware state changes of the other core elements, the growing number of possible states results in unpredictable coverage and the methodology no longer has value. Extending the scheme to add system test then does not return meaningful dividends on the potentially massive investment in developing the tests and verification infrastructure.

Applying the criteria then to this method for today’s platform SoCs:

This approach fundamentally breaks down for multicore SoCs because it does not allow operating system and application level system test layers to be constructed economically.
The second criterion, flexibility across derivatives, is considered inconsequential given that the method fails the first criterion.
This approach yields extremely low system test coverage as complexity grows, so its usefulness depends directly on the complexity of the SoC.

Method 2: Introducing global interconnect structures and additional logic to support pseudo-control plane system management functions. This approach is an extension of Method 1 because the host CPU often continues to act as the system management master. Side band signaling, either contained in the interconnect or designed separately, is used for the control functions.

Mixing data plane and control functions introduces abstraction levels that aid in achieving higher system test coverage, as long as the SoC does not drive the interconnect requirements to become so complex that the control functions become a small, lower priority part of the overall mix of functions. When this occurs, the control tasks are executed sub-optimally: priority choices between functional operations and system management tasks cause delays, both from complex arbitration sequences and from communication held up behind blocked hierarchical buses.
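A toy model can illustrate how such delays accumulate. The sketch below assumes a deliberately simple arbiter that serves requests in FIFO order, so a control request waits for every data burst already queued ahead of it; the function name, the 4-cycle cost of the control request, and the burst lengths are all hypothetical. Real interconnect arbitration is considerably more elaborate.

```python
def control_latency_cycles(pending_data_bursts, control_cycles=4):
    """Cycles until a control request completes when it waits, in FIFO
    order, behind data bursts already queued on the shared interconnect
    (an assumed, simplified arbitration model)."""
    return sum(pending_data_bursts) + control_cycles

light_traffic = [16, 16]               # hypothetical burst lengths in cycles
heavy_traffic = [256, 512, 512, 1024]  # hypothetical burst lengths in cycles

print(control_latency_cycles(light_traffic))  # 36
print(control_latency_cycles(heavy_traffic))  # 2308
```

In this model the control request itself never changes, yet its completion time moves from tens of cycles to thousands of cycles purely as a function of the data traffic it shares the interconnect with.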

Applying the criteria then to this method for today’s SoCs:

This approach introduces levels of abstraction which make the approach feasible for some multicore SoCs.
However, the approach also has a ceiling of usefulness, which is normally reached when extra logic is required to manage “special” cases for each of the derivatives in the SoC road map and the inefficiencies tolerated to minimize time to market begin to mount. One area where this occurs is when the system management master, usually the host CPU, requests that another core power down. Inefficiencies can occur when complex arbitration schemes and blocked requests delay the actual powering down of the core. These delays can often be measured in thousands of cycles, during which power is consumed for no useful system function and is therefore wasted.
As a result of this ceiling on the benefits of the approach, overall coverage is directly dependent on the complexity of the SoC, and the approach is therefore useful only within a range of SoC complexity.
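The cost of a delayed power-down, as described in the second point above, can be estimated with simple arithmetic: the energy wasted is the core's power draw multiplied by the time the delay lasts. The sketch below uses hypothetical values (a 2,000 cycle delay, a 500 MHz clock, and a core drawing 200 mW while it waits); none of these figures come from a measured design.

```python
def wasted_energy_joules(delay_cycles, clock_hz, core_power_watts):
    """Energy consumed by a core that stays powered while its
    power-down request is delayed: E = P * (cycles / f)."""
    return core_power_watts * delay_cycles / clock_hz

# Hypothetical values, chosen only to show the order of magnitude.
energy = wasted_energy_joules(delay_cycles=2000,
                              clock_hz=500e6,
                              core_power_watts=0.2)
print(f"{energy * 1e6:.2f} microjoules per delayed power-down")
```

A single event is small, but the total scales linearly with how often cores are powered down, which matters in designs that change power state frequently.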

Method 3: Introducing a control plane that complements a data plane global interconnect.

This approach differs from the first two methods because it does not extend the traditional host CPU system master approach. Rather, it introduces a separate control plane and an independent system controller to perform system management tasks.

An independent control plane essentially abstracts the system management tasks from any one entity. As such, it can be controlled by any or all SoC elements as required, and therefore offers multiple layers of abstraction. System testing can be developed by software, hardware, verification, and system engineers and applied using a common framework with equal effectiveness.
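As a rough software analogy, the idea of an independent system controller that any SoC element can direct might be sketched as follows. The class, method names, and core names are hypothetical and chosen only for illustration; they do not describe any specific controller IP or its interface.

```python
from dataclasses import dataclass, field
from enum import Enum

class PowerState(Enum):
    ON = "on"
    OFF = "off"

@dataclass
class SystemController:
    """Hypothetical control plane endpoint that owns the power state of
    every registered core, independent of any host CPU."""
    states: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def register(self, core):
        # Every core starts powered on in this simplified model.
        self.states[core] = PowerState.ON

    def request(self, initiator, target, new_state):
        """Any registered core may request a state change on any other."""
        if initiator not in self.states or target not in self.states:
            raise KeyError("unknown core")
        self.states[target] = new_state
        # One central record of who changed what.
        self.log.append((initiator, target, new_state))
        return new_state

ctrl = SystemController()
for name in ("cpu", "dsp", "gpu"):
    ctrl.register(name)

# The DSP powers down the GPU without routing the request through the CPU.
ctrl.request("dsp", "gpu", PowerState.OFF)
```

Because every state change passes through one controller, the same request interface can be exercised from software, hardware, or verification test benches, and the central log gives a single place to observe hardware state transitions.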

This approach is also advantageous because it separates targeted control tasks, which are ideally executed with low latency, from longer, more complex, and often performance sensitive data plane tasks. This separation is often necessary when complexity is high, because traditional approaches reach the ceiling of effectiveness discussed under Method 2.

Applying the criteria then to this method for today’s SoCs:

This approach creates maximum levels of abstraction for system management, at the cost of introducing new control plane functionality.
This approach introduces high levels of flexibility as both control and data plane functions can be tuned for each SoC derivative without changing the base architecture.
This approach also maximizes the achievable coverage because any source can direct the system management; operations (applications) can therefore be isolated and tested within the approach without compromising overall coverage.

Summary:

While Method 3 introduces new control plane functionality, it also enables SoCs of virtually any complexity to be tested and operated with maximum efficiency using the same approach. As such, it is best suited to road maps that span a wide range of complexity, or to cases where extreme flexibility is required of the SoC architecture. The ability to direct the system controller from any SoC core is especially noteworthy because it allows multiple applications to control the hardware states directly, in real time, when needed and without the overhead of channeling their requests through other entities, thus avoiding inter-function dependencies, complexities, and delays.

The Impact of SoC Subsystems on System Management

The basic theme in achieving better system management is successful partitioning in order to reach adequate levels of system test coverage. This is why Method 3 was chosen as the most effective for today’s system management needs.

It stands to reason, then, that the use of subsystems further abstracts the system management tasks. However, creating systems within systems also introduces hierarchies of complexity and, as such, renders traditional methods of system management increasingly ineffective.

The growing use of subsystems over the next generations of SoC design will therefore accelerate the adoption of control plane based system management as the preferred architecture. It allows hierarchical levels of complexity to be absorbed into the system management architecture while maintaining a common architecture that provides flexibility and scalability. It also minimizes the risks and costs of expensive architecture redesigns, which will otherwise grow as system requirements continue to become more complex.