Using dynamic run-time scheduling to improve the price-performance-power efficiency of heterogeneous multicore SoCs
Oxford, UK
Abstract
In embedded systems, scheduling policies control the behaviour of systems which exhibit genuine hardware parallelism and/or software concurrency. These policies may control the order of execution of tasks in a safety-critical hard real-time system, ensure quality of service and fairness in a multi-user environment, or simply ensure that a system appears responsive to a user through a graphical user interface.
In response to the increasing demands created by multicore architectures, this paper defines “platform-centric scheduling” and “application-centric scheduling” as distinct system management tasks that must be applied when managing software running on heterogeneous hardware. Platform-centric scheduling comprises the scheduling policies that define how an application is bound to a heterogeneous multicore hardware platform at runtime; it is therefore concerned with the efficient management of hardware parallelism and heterogeneity. Application-centric scheduling deals with the management of concurrent tasks based on application-level dependencies and priorities.
This paper demonstrates how a system-wide scheduling approach, combining application-centric and platform-centric scheduling, may be adopted to ensure that scheduling policies apply coherently to both hardware- and software-based devices.
The paper goes on to describe the opportunities presented by platform-centric scheduling for dynamically managing performance and power efficiency at runtime. Having established the benefits of such an abstraction, its runtime performance is then explored, first using a traditional software implementation and then using an innovative client-server operational model implemented in silicon as an IP core.
1 Introduction
There are numerous methods of analysing systems to ensure that real-time deadlines are met; examples include EDF (earliest deadline first), RMS (rate-monotonic scheduling) and various other stochastic methodologies. These approaches tend to be application specific and possibly proprietary in nature. However, in all cases, the result of such scheduling analysis is a set of scheduling policies (e.g. priority, FIFO, round-robin, weighted fair queuing), which must be deployed efficiently at runtime.
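As a simple illustration of how such policies translate into runtime mechanisms, the sketch below (written in C, with hypothetical types and names that are not taken from any particular product) shows how a ready-queue comparator might encode either a fixed-priority or an EDF policy; the surrounding queueing machinery can then deploy whichever policy the offline analysis selected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task record used only for this illustration. */
typedef struct {
    uint32_t priority;          /* static priority: lower value = more urgent  */
    uint32_t absolute_deadline; /* deadline in system ticks, used by EDF       */
} task_t;

/* Fixed-priority policy: the task with the lower priority value wins. */
static bool fp_more_eligible(const task_t *a, const task_t *b)
{
    return a->priority < b->priority;
}

/* EDF policy: the task with the earlier absolute deadline wins. */
static bool edf_more_eligible(const task_t *a, const task_t *b)
{
    return a->absolute_deadline < b->absolute_deadline;
}

/* The queueing machinery can be parameterised by the comparator alone, so the
 * same runtime deploys whichever policy the scheduling analysis produced.     */
typedef bool (*eligibility_fn)(const task_t *, const task_t *);
```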
Heterogeneous multicore architectures are increasingly being deployed in flexible ASSPs and “off the shelf” programmable SoCs. The software programming models (which in general encapsulate scheduling and synchronisation) for such devices have been inherited from those deployed in simpler architectures and fail to deliver the abstraction and flexibility required by the application programmer. Approaches which do deliver abstraction and flexibility to software developers traditionally compromise the efficiency obtained from the underlying hardware platform.
Any long-term viable solution to multicore programming must acknowledge that runtime system management, abstraction and the programming model are inextricably linked in multicore architectures and must seek to meet these challenges without compromising efficiency.
2 Multicore programming models
This section reviews the migration and scalability challenges of traditional programming models when faced with the complexity of multicore hardware platforms.
Figure 1 uses a pipelined application example to illustrate how software architectures may be mapped onto an underlying hardware platform. Figure 1 (a) illustrates a type of architecture typically found in end equipment today, comprising a heterogeneous mix of processing instances, including a RISC processor, a DSP and IO devices. Software is typically mapped onto such an architecture using software executives associated with the instances of instruction set architecture (ISA) based processing resources. Any communication between heterogeneous cores is generally facilitated by link handling facilities embodied within the application space. Processing resources which cannot run an executive (hardware accelerators, IO devices etc.) are managed by proxy using those which can, typically through device drivers.
Figure 1: Multicore programming models
When faced with complex multicore designs, a number of questions must be asked of this programming model:
- Does the model allow the application to scale across multiple platforms containing different numbers of such cores as a chip vendor iterates through a device family? This property will be referred to as static scalability.
- Does the model allow for a varying number of cores to be employed for a given type of processing according to presented load? This property will be referred to as dynamic scalability.
- How efficient is the model at runtime? A measure of efficiency is the percentage of the processing power in the underlying architecture which is made available to the application through the programming model.
Figure 1 (b) shows a more contemporary architecture, typical of those being designed today. Note that this architecture contains both heterogeneous processing elements and homogeneous resources.
Figure 1 (b) also shows the result if contemporary programming models are simply redeployed onto these architectures – the so-called “communicating islands” approach. Each processing resource executes a unique executive and any communication between these resources is, once again, handled using link handlers at the application layer. In general, two forms of design-time static partitioning are adopted in these scenarios:
- Static load distribution, where multiple users are statically distributed across multiple cores.
- Static application partitioning, where the application is split across multiple cores.
When measured against the three key criteria established previously, and without further middleware, the static allocation paradigm of the “communicating islands” model fails to deliver sufficient abstraction to enable a given application to scale across multiple platforms without re-partitioning. It also precludes the transient deployment of additional resources to a feature of the application which is particularly demanding at a given instant in time. Conversely, the number of active processing resources cannot be reduced to save power in low-load scenarios.
The efficiency of the approach is also questionable. Additional memory footprint will certainly be consumed to support the data structures, if not the images, of multiple executives. Furthermore, it is unlikely that the application partition will result in an evenly distributed load, which ultimately produces bottlenecks in one processor while spare capacity may exist in others. If additional middleware is deployed to address the scalability challenges, runtime efficiency is further compromised.
3 The SystemWeaver abstraction layer
This section introduces a new platform abstraction which, by decoupling the programming model from the hardware architecture, seeks to address the scalability issues of the traditional approaches.
When considering the resource management of multicore architectures, a system-wide “unit of currency” must first be established with which desired application behaviour may be expressed. To be truly system-wide, this “unit of currency” must form a key part of the software programming model whilst providing sufficient simplicity for hardware implementations – indeed, the abstraction should effectively manage both hardware and software operation from a holistic system-level.
The simple concept of the computer science “task” (defined as a single sequential flow of control within a program, containing a unique combination of context, instructions and data) can provide a suitably abstract “unit of currency” for such a model.
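By way of illustration only, a minimal task descriptor capturing this definition might look as follows; the field names are hypothetical and do not represent the actual SystemWeaver descriptor format.

```c
#include <stdint.h>

/* Hypothetical task descriptor: a single sequential flow of control,
 * carrying its own context, instructions and data.                          */
typedef struct task_descriptor {
    void     (*entry_point)(void *);  /* instructions: the function to run     */
    void      *data;                  /* data: arguments / working buffers     */
    void      *context;               /* context: saved registers, stack, etc. */
    uint32_t   metrics[2];            /* scheduling metrics (e.g. priority,    */
                                      /* deadline) interpreted by the policy   */
    uint32_t   entry_node;            /* point at which the task is queued     */
                                      /* (see section 4.2)                     */
    struct task_descriptor *next;     /* queue linkage within a scheduling node*/
} task_descriptor_t;
```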
Figure 2: Task state diagram
An alternative to the approaches of Figure 1 (a) and (b) is shown in Figure 1 (c). In this case an additional layer of abstraction is inserted beneath the traditional executives. This platform abstraction virtualises multiple processing resources into pools and makes a runtime (dynamic) binding decision of a given instance of an application task onto a given processing resource instance, according to scheduling rules developed by the system designer.
The remainder of this paper considers the features of this platform abstraction, as embodied in Ignios’ SystemWeaver™ product. The presented abstraction uses simple tasks, or threads, to manage systems at runtime from a holistic platform level. The state diagram of such a task is shown in Figure 2.
Figure 3: Application and platform scheduling
The SystemWeaver abstraction layer provides a control-plane-only virtualisation of the underlying hardware platform, upon which further software abstractions may be supported or built (operating systems, middleware abstractions, language based abstractions etc.). It also enables system designers and application programmers to describe system behaviour in a manner decoupled from the hardware or software embodiment of the task execution.
Although conceptually similar to traditional multi-tasking abstractions, the SystemWeaver abstraction layer differs in two crucial ways:
- It accommodates explicitly defined scheduling hierarchies to enable tasks to be properly distributed among the various processing resources. The definition of the scheduling hierarchies and policies is orthogonal to the specification of the application concurrency. The scheduling definition may therefore be modified – both statically (boot time) and dynamically (run-time).
- It also makes no assumptions about the uniformity or otherwise of system memory.
These two differences make the SystemWeaver abstraction layer uniquely suited to the functional demands of heterogeneous multicore architectures.
4 Multicore scheduling
In a fully flexible system, the platform based abstraction of the previous section must allow tasks created or synchronised by any core to be issued for execution to any other core. Furthermore, to avoid the inefficiencies of unnecessary synchronisation, the platform based abstraction should be available simultaneously to any or all of the cores in the system. Finally, the system may develop a backlog of tasks for any or all of the processing resources during periods of high load. All of these factors, as well as the high event frequency of multicore systems, place onerous requirements on the scheduling within multicore systems.
In an ideal multicore system, a system-wide scheduler provides tasks at the optimum time to the optimum resources according to a set of predefined rules (the “schedule”). The scheduling hierarchy must be defined explicitly by the system designer and should be flexible, both at design and runtime.
Scheduling is required by both the application and the underlying hardware platform:
- Application scheduling comprises synchronisation and eligibility. Synchronisation ensures that resources within the system can be shared without compromising system integrity. Eligibility ensures that ready tasks are issued to the processing resources in a manner consistent with the needs of the overall application, as expressed in the scheduling policy. This hierarchy of scheduling policies is referred to here as the scheduling cone; it is application specific and must therefore be defined as part of the application development process.
- Platform based/distribution scheduling defines the policy by which application tasks are distributed amongst the appropriate processing resource instances. This may imply sharing a processing resource (or resources) amongst multiple users and/or multiple different algorithms. This hierarchy of scheduling policies is referred to here as the distribution cone and is platform specific. Platform-specific scheduling may be defined statically (as part of the hardware design), but increased flexibility results from defining (or refining) the platform-centric scheduling policies during the software development process.
Figure 4 shows a representation of the scheduling points with respect to the queuing points which they engender. From left to right, the first queuing point encountered is the pending queue. Blocked tasks are stored in the pending queue according to their priority and are liberated by synchronisation or timer events. The second queuing point is the ready queue, which comprises both the application and distribution scheduling stages. Conceptually, between these two scheduling stages is a point at which all the currently ready application tasks have been sorted according to their eligibility (expressed in user defined metrics and scheduling policies). This point is referred to as the distribution node.
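A much-simplified sketch of these two queuing points is given below. It assumes a single eligibility metric and simple array-backed queues, neither of which reflects the actual implementation, but it illustrates how a synchronisation or timer event liberates a blocked task into the eligibility-sorted ready queue whose head is presented at the distribution node.

```c
#include <stddef.h>

#define MAX_TASKS 16

/* Hypothetical, much-simplified model of the two queuing points: a pending
 * queue for blocked tasks and a ready queue kept sorted by a single
 * eligibility metric (plain priority here, lower value = more eligible).    */
typedef struct { int id; int eligibility; } qtask_t;

typedef struct {
    qtask_t pending[MAX_TASKS]; size_t n_pending;
    qtask_t ready[MAX_TASKS];   size_t n_ready;    /* ready[0] = head         */
} queues_t;

/* A synchronisation or timer event liberates pending task i and inserts it
 * into the ready queue in eligibility order (assumes n_ready < MAX_TASKS).  */
static void liberate(queues_t *q, size_t i)
{
    qtask_t t = q->pending[i];
    q->pending[i] = q->pending[--q->n_pending];    /* remove from pending     */

    size_t j = q->n_ready++;
    while (j > 0 && q->ready[j - 1].eligibility > t.eligibility) {
        q->ready[j] = q->ready[j - 1];             /* insertion-sort step     */
        j--;
    }
    q->ready[j] = t;   /* ready[0] is now the most eligible task, i.e. the    */
                       /* candidate presented at the distribution node        */
}
```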
4.1 Scheduling and distribution cones
In this paper, two types of “cones” are used to describe heterogeneous multicore scheduling behaviour; scheduling and distribution cones.
4.1.1 Scheduling Cones
Scheduling cones (shown within the application-centric hierarchy in Figure 4) define the “application decision node” hierarchies, which are driven by the needs of the application. Scheduling cones are many-to-one mappings (from many “entry” points to a single aggregated point), which define the rules by which multiple classes of task and multiple instances of task classes compete for system resources.
4.1.2 Distribution Cones
Distribution cones (shown within the platform-centric hierarchy in Figure 4) define the “distribution decision node” hierarchies, which are driven primarily by the properties of the underlying hardware platform. Distribution cones define the rules by which the most eligible candidates of the scheduling cones are distributed amongst the available and appropriate processing resources, describing a hierarchy of schedulers which diverge from a single aggregated point to multiple “dispatch” points – a one-to-many mapping.
Figure 4: Heterogeneous multicore scheduling structures
4.2 Primary Scheduling Nodes
There are three primary nodes used to describe scheduling configuration; entry nodes, distribution nodes and dispatch nodes.
4.2.1 Entry Nodes
Entry nodes define the points at which new tasks are queued. Entry nodes typically map many-to-one onto distribution nodes, forming the two extremes of a scheduling cone. Entry nodes may be associated with specific classes of task or allocated according to other application-derived policies. A given entry node may only map onto a single distribution node.
4.2.2 Distribution Nodes
Distribution nodes define the delineation between scheduling and distribution cones. They are typically representative of a class of processing resource. Scheduling cones typically map one or more entry nodes onto a single distribution node; distribution cones typically map a single distribution node onto multiple dispatch nodes, and therefore ultimately onto processing resource instances.
4.2.3 Dispatch Nodes
Dispatch nodes define an exit point associated with an individual processing resource instance. They typically map one-to-one onto the processing resources which exist within the hardware platform. Multiple distribution cones may map onto individual dispatch nodes.
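To make the relationship between the three node types concrete, the following sketch shows a purely illustrative boot-time description of a small hierarchy; the structures and names are assumptions made for this example and are not the SystemWeaver configuration format. Two entry nodes feed a single distribution node, which in turn distributes across three dispatch nodes.

```c
#include <stdint.h>

#define MAX_DISPATCH_NODES 8

/* Purely illustrative boot-time hierarchy description (not the SystemWeaver
 * configuration format).                                                     */
typedef struct {
    uint8_t distribution_node;       /* each entry node maps onto exactly one
                                        distribution node                     */
} entry_node_t;

typedef struct {
    uint8_t dispatch_nodes[MAX_DISPATCH_NODES]; /* processing resource
                                                   instances reachable from
                                                   this distribution node     */
    uint8_t n_dispatch;
} distribution_node_t;

/* Example: two entry nodes (say, a "control" and a "data" task class) feed
 * one distribution node representing a pool of three DSP instances.          */
static const entry_node_t entry_nodes[2] = {
    { .distribution_node = 0 },
    { .distribution_node = 0 },
};
static const distribution_node_t dsp_pool = {
    .dispatch_nodes = { 0, 1, 2 },
    .n_dispatch     = 3,
};
```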
4.3 Application programmer’s view
When issuing a thread within the system, the application programmer refers to an entry node in the function call which creates the thread. Provided the entry node reference does not change, the processing resource class to which the entry node (and hence that element of the application) is bound may be altered without affecting the source code. Although this generally only applies to core types which already exist in the system, the feature is useful in two key areas (a sketch of such a call appears after the list below):
- As a chip vendor creates derivative devices and changes the number of processing resources, the application remains source code compatible – no application re-partitioning is required.
- Since the absolute binding of an application task to a processing resource instance is left to the scheduler, an appropriately chosen scheduling policy can match the number of “active” processing resources to the presented load at any instant in time.
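The sketch below illustrates the programmer’s view described above. The function name, signature and entry-node identifier are hypothetical, introduced purely for illustration and not taken from the actual SystemWeaver API; the point is that the application names an entry node, never a specific core, so rebinding that entry node to a different or larger resource pool requires no source change.

```c
#include <stdint.h>

/* Hypothetical application-level call (illustrative only, not the actual
 * SystemWeaver API): the programmer names an entry node, never a core.
 * The stub body simply stands in for the real task-issue mechanism.        */
static int sw_create_thread(uint32_t entry_node,
                            void (*entry_point)(void *),
                            void *data,
                            uint32_t priority)
{
    (void)entry_node; (void)priority;
    entry_point(data);               /* placeholder: run inline              */
    return 0;
}

#define ENTRY_NODE_FILTER 1u         /* bound to, say, a DSP pool by the
                                        platform-centric hierarchy           */

static void fir_filter(void *block) { (void)block; /* process one block */ }

void issue_filter_work(void *block)
{
    /* If a derivative device maps ENTRY_NODE_FILTER onto a larger or
     * smaller pool of cores, this source line does not change.             */
    sw_create_thread(ENTRY_NODE_FILTER, fir_filter, block, /* priority */ 4);
}
```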
5 Behaviour and Performance
Having established the basic principles of operation of the SystemWeaver abstraction layer, the behaviour and performance should be measured against the key criteria established previously; static scalability, dynamic scalability and efficiency.
Both static and dynamic scalability are provided by delegating the binding decision between an application task and a platform resource to a runtime scheduler. Static scalability enables chip vendors to offer seamless application migration between derivative products in a given roadmap. Furthermore, by selecting an appropriate scheduling algorithm, the scheduler can allocate available processing resources in a uniform manner starting from a single point – processing resource instances furthest from the start point will therefore observe lower load and consequently will have more power-saving opportunities, enabled by dynamic scalability.
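One hypothetical distribution policy with this property is sketched below: cores are always filled from a fixed starting index, so higher-indexed cores receive work only under high load and can otherwise be clock- or power-gated. The names and structure are illustrative assumptions, not the SystemWeaver implementation.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_CORES 4

/* Hypothetical distribution policy: cores are always filled from index 0,
 * so higher-indexed cores only see work under high load and can otherwise
 * be clock- or power-gated.                                                 */
static bool core_idle[NUM_CORES] = { true, true, true, true };

int select_core(void)
{
    for (size_t i = 0; i < NUM_CORES; i++)
        if (core_idle[i])
            return (int)i;          /* lowest-indexed idle core              */
    return -1;                      /* no core idle: the task remains queued
                                       at the distribution node              */
}
```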
Conceptually, then, there are compelling advantages to the SystemWeaver abstraction layer; however, these benefits must be weighed against the price of this abstraction in terms of runtime system efficiency.
5.1 Measuring performance
To measure the performance of an embodiment of the SystemWeaver abstraction layer, Ignios built a cycle-accurate model using ESL (Electronic System Level) technology.
To simplify the test, and since the technology is agnostic to processing resource type, we used a homogeneous environment comprising up to four ARM920T processors with instruction and data caches enabled. All processors shared a common 32-bit AMBA AHB bus and a common memory which exhibited a 3-cycle latency.
The test application comprised multiple instances of a simple “producer”, “processor”, “consumer” task structure, each of which had a variable load granularity.
Two cases were considered; one where the SystemWeaver abstraction layer was implemented in software running on the ARM processors and another, where “native” hardware support was provided (the Ignios SystemWeaver server core). A single resource management entity, which implements the SystemWeaver abstraction layer, existed in both cases. In the software case, a vectored interrupt controller was also present.
Figure 5: SystemWeaver abstraction layer implemented in software
The “measure of goodness” (expressed on the y-axis of Figure 5 and Figure 6) defines the proportion of the potential platform performance (instructions) made available to the application (as opposed to the resource management activity) over the runtime of the test – this is, therefore, a measure of efficiency.
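Expressed as a formula (with notation introduced here rather than taken from the original figures):

\[
\text{efficiency} = \frac{I_{\mathrm{app}}}{I_{\mathrm{platform}}}
\]

where \(I_{\mathrm{app}}\) is the number of instructions executed on behalf of the application during the test and \(I_{\mathrm{platform}}\) is the total number of instructions the platform could have executed over the same interval; the remainder is consumed by the resource management activity.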
Figure 5 shows the results for the software implementation. For a single processor with coarse task granularity, around 75% of the potential processing power is made available to the application (25% of processing power consumed in system management). At the other extreme, less than 10% is available for four processors with fine task granularity. The rapid roll-off of efficiency as the number of processors is increased at any given task granularity is also notable.
Figure 6: SystemWeaver abstraction layer implemented in hardware
Figure 6 shows the same test run using the SystemWeaver server IP core. In contrast to the software implementation, efficiency ranges from 65% to 90%, and the non-linear roll-off of efficiency as the number of processors is scaled is eliminated.
6 Conclusion
This paper has discussed the features of a platform based abstraction, comprising a programming model and a scheduling paradigm, which uses a class based approach to decouple the application programmer from the underlying hardware platform. The abstraction supports the binding of software components, expressed as tasks, to hardware cores dynamically at runtime.
This abstraction enables chip vendors to vary the number of cores within a given product family whilst maintaining application source code compatibility with previous versions of the platform. The abstraction also enables the number of active cores to be matched dynamically to the presented load, hence power can be efficiently managed at runtime.
The paper then goes on to explore whether the benefits of the abstraction can be delivered through a software embodiment and concludes that the inefficiencies of such an approach make it untenable. In contrast, a hardware-assisted abstraction layer, using the SystemWeaver IP core, combines the benefits of the abstraction with runtime efficiency.