Choosing between dual and single core media processor configurations in embedded multimedia designs

Embedded multimedia - choosing between a dual or a single core media processor
By David Katz and Rick Gentile, Embedded.com
Aug 3 2005 (2:50 AM)
URL: http://www.embedded.com/showArticle.jhtml?articleID=167100412

Selecting a processor for an embedded multimedia application is a complex endeavor. It involves a thorough analysis of the processor's core architecture and peripheral set, a solid understanding of how video and audio data flows through the system, and an appreciation for what level of processing is attainable at an acceptable level of power dissipation.

Until recently, the standard approach to this problem was to split it into a "control domain" handled by a microcontroller (MCU) chip, and a "computation domain" handled by a digital signal processor (DSP). Both RISC MCUs and DSPs have traditionally served media-rich embedded applications. However, they are not used interchangeably; rather, they work in concert. MCU architectures are well suited for efficient asynchronous control flow, while the bread-and-butter of DSP architectures is synchronous, constant-rate data flow (for example, filtering and transform operations).

Because both sets of functionality are necessary in today’s media processing applications, engineers often use separate MCU and DSP chips. This combination offers a good processing engine for a wide variety of multimedia applications, but with the added complexity of multi-processing design, multiple development tool sets, and heterogeneous architectures to learn and debug.

To attempt to alleviate these problems, chip vendors have tried different solutions. Various MCU makers have integrated some signal processing functionality, such as instruction-set extensions and multiply-accumulate (MAC) units, but this effort often lacks the essential architectural basis required for advanced signal processing applications. Similarly, DSP manufacturers have included limited MCU functionality, but with inevitable compromises in the system control domain.

Recently, another option has emerged - the single and dual core Embedded Media Processor (EMP) architecture, which presents MCU and DSP functionality in a unified design that allows flexible partitioning between control and signal processing needs. If the application demands, the EMP can act as 100% MCU (with code density on par with industry standards), 100% DSP (with clock rates at the leading edge of DSP technology), or some combination in-between.

Evaluating a single core EMP architecture
An example of a single core EMP architecture is shown in Figure One, below, which combines a 32-bit RISC instruction set, dual 16-bit MAC units and an 8-bit video processing engine. Its variable-length instruction set extends up to 64-bit opcodes used in DSP inner loops, but is optimized so that 16-bit opcodes represent the most frequently used instructions. Thus, compiled code density figures are competitive with industry-leading MCUs, while its interlocked pipeline and algebraic instruction syntax facilitate development in both C/C++ and assembly.

Figure One

Like MCUs, EMPs have both protected and unprotected operating modes; the protected mode prevents users from accessing or affecting shared parts of the system. In addition, they provide memory management capabilities that define separate application development spaces while preventing distinct code sections from being overwritten. They also allow both asynchronous interrupts and synchronous exceptions, as well as programmable interrupt priorities. Thus, an EMP is well suited as a target for embedded operating systems (heretofore the MCU’s domain).

On the DSP side, the EMP is structured for efficient data flow and very high performance, with a peripheral set that supports high-speed serial and parallel data movement. In addition, the EMP contains an advanced power management feature set that allows system architects to craft a design with the lowest dynamic power profile.

Single core development approaches
In today’s design paradigm, MCU and DSP programmers often partition into two totally separate groups, interacting only at the “system boundary” level where their two functional worlds meet. This makes some sense, as the two groups of developers have evolved their own sets of design practices. For instance, the signal processing developer may relish the nitty-gritty details of the processor architecture and thrive on implementing tips and tricks to improve performance.

On the other hand, the MCU programmer might well prefer a model of just turning on the device and letting it do all the work. This is why the EMP supports both DMA and cache memory controllers for moving data through a system. Multiple high-speed DMA channels shuttle data between peripherals and memory systems, allowing the fine-tuning controls sought by DSP programmers without using up valuable core processor cycles. Conversely, on-chip configurable instruction and data caches allow a hands-off approach to managing code and data in a manner very familiar to MCU programmers. Often, at the system integration level, a combination of both approaches is ideal.

Another reason for the historical separation of MCU and DSP development groups is that the two processors have two separate sets of design imperatives. From a technical standpoint, engineers responsible for architecting a system sometimes hesitate to mix a "control" application with a signal processing application on the same processor. Their most common fear is that non-real-time tasks will interfere with hard real-time tasks. For instance, the programmer who handles tasks such as the graphical user interface (GUI) or the networking stack shouldn’t have to worry about hampering the real-time signal processing activities of the system. Of course, the definition of "real-time" will vary based on the specific application. In an embedded application, the focus is on the time required to service an interrupt. For this purpose, we assume a time frame of less than 10 microseconds between the occurrence of an interrupt and the point at which the system context is saved at the start of the service routine.
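To put the 10-microsecond figure in perspective, it can be converted into a budget of core clock cycles. The sketch below is illustrative arithmetic only: the function name is hypothetical, and the clock rates are simply the frequencies quoted elsewhere in this article, not measured interrupt latencies.

```python
def latency_budget_cycles(core_clock_mhz, latency_us=10.0):
    """Return how many core clock cycles fit inside the latency window."""
    # 1 MHz equals one cycle per microsecond, so MHz * microseconds = cycles.
    return int(round(core_clock_mhz * latency_us))


# Clock rates taken from figures quoted in the article (illustrative only).
for mhz in (300, 600, 750):
    print(f"{mhz} MHz core: {latency_budget_cycles(mhz)} cycles in 10 us")
```

At 600 MHz, for example, the window corresponds to 6000 core cycles in which the interrupt must be recognized and the context saved.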

While MCU control code is usually written in C and is library-based, real-time DSP code is typically assembly-based and is handcrafted to extract the maximum possible performance for a given application. Unfortunately, this optimization also limits the portability of an application and therefore propagates the need for divergent skill sets and tool suites between the two programming teams on future projects.

With the introduction of the EMP, however, a C/C++-centric unified code base can be realized. This allows developers to leverage the enormous amount of existing application code developed from previous efforts. Because the EMP is optimized for both control and signal processing operations, compilers can generate code that is both “tight” (from a code density standpoint) and efficient (for computationally intensive signal processing applications). Any gap in compiler performance is closed by the EMP’s high operating frequencies - in excess of 750 MHz - which are at the leading edge of mainstream DSP capability today. Additionally, targeted assembly coding is still an option for optimizing critical processing loops.

While moving to an EMP can greatly reduce the requirement of writing code in assembly, this fact alone doesn’t necessarily justify the switch to this unified platform. Operating system (OS) support is also key. By supporting an operating system or real-time kernel, several layers of tasking can be realized. To ensure targeted performance is still achievable, an interrupt controller that supports multiple priority levels is a necessity. Context switching must be attainable through hardware-based stack and frame pointer support. This allows developers to create systems that include both worlds - control and real-time signal processing - on the same device.

In addition, the EMP’s memory management facility allows OS support for memory protection. This allows one task, via a paging mechanism, to block memory or instruction accesses by another task. An exception is generated whenever unauthorized access is made to a "protected" area of memory. The kernel will service this exception and take appropriate action.

The high processing speeds achievable by the EMP translate into several tangible benefits. The first is time to market - there can be considerable savings in reducing or bypassing the code optimization effort if there’s plenty of processing headroom to spare. A second key benefit is reduced software maintenance, which can otherwise dominate a product’s lifecycle cost. Finally, for scalable EMP architectures, there’s the possibility of designing a system on the most capable processing family member, and then "rightsizing" the processor for the computational footprint of the final application.

When a single core EMP is not enough
As processing demands continue to increase, there comes a point at which even a 600 MHz EMP will not suffice for certain applications. This is one reason to consider using a dual-core EMP, such as the ADSP-BF561 (see Figure Two, below). Adding another processor core not only effectively doubles the computational load of which the processor is capable, but also has some surprising structural benefits that aren’t immediately obvious.

Figure Two

The traditional use of a dual-core processor employs discrete and often different tasks running on each core. For example, one core might perform all control-related tasks - like graphics and overlay functionality, networking, interfacing to bulk storage, and overall flow control. This core is also where the operating system or kernel will likely reside. Meanwhile, the second core can be dedicated to the high intensity processing functions of the application. For example, compressed data packets might be transferred over a network interface to the first core, which preprocesses them and hands them over to the second core for audio and video decoding.

Figure Three

This model is preferred by developers who employ separate software development teams. The ability to segment these types of functions allows a parallel design process, eliminating critical path dependencies in the project. This programming model also aids the testing and validation phases of the project. For example, if code changes on one core, it does not necessarily invalidate testing efforts already completed on the other core.

Symmetric versus asymmetric multiprocessing
To understand what makes this dual-core approach exciting, we need to first discuss "Symmetric Multiprocessing (SMP)." This refers to an architecture where two similar (or identical) processors are connected via a high-speed path and share a common set of peripherals and memory space. This is in contrast to "Asymmetric Multiprocessing (AMP)" approaches, which combine two dissimilar processors, usually an MCU and a DSP, into a hybrid architecture. A limitation of the AMP approach is that the designer must partition to a "50/50" share of control and DSP functions; once the DSP is "maxed out," for instance, the MCU will be unable to take up the computational slack. The SMP structure has no such limitation, since the two processor cores are identical and can be partitioned as the application requires, even to the point of 100% DSP or 100% MCU operation. What’s more, the symmetric processor enjoys the advantage of providing a common, integrated design environment. Only one set of development tools is required, and there’s less of a burden on training the development team for a single platform.
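The partitioning difference can be illustrated with a deliberately simplified capacity check. Everything in this sketch is an assumption made for illustration: the function names, the idea of expressing load as MHz of processing demand, and the premise that work on a symmetric device can be divided freely between cores.

```python
def fits_amp(control_load, dsp_load, mcu_capacity, dsp_capacity):
    """Asymmetric case: each kind of work must fit on its own processor."""
    return control_load <= mcu_capacity and dsp_load <= dsp_capacity


def fits_smp(control_load, dsp_load, core_capacity, cores=2):
    """Symmetric case: either kind of work may run on either core.

    Simplifying assumption: the combined load can be split freely.
    """
    return control_load + dsp_load <= core_capacity * cores


# Hypothetical workload: light control, heavy signal processing (in MHz).
control, dsp = 100, 800
print(fits_amp(control, dsp, mcu_capacity=600, dsp_capacity=600))  # False
print(fits_smp(control, dsp, core_capacity=600))                   # True
```

In this hypothetical case the asymmetric pair fails because the DSP is oversubscribed while the MCU sits mostly idle; the symmetric pair has the same total capacity but can lend the idle cycles to signal processing.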

As an SMP-friendly device, the BF561 contains high-speed L1 instruction and data memory local to each core, as well as a 128 KB L2 memory shared between the two cores. Each core has equal access to a wide range of peripherals - video ports, serial ports, timers, and the like. This arrangement expands the configurability of the device, allowing it to function in one of several equally valid frameworks. These models (see Figure Four, below) can be summed up as "MCU/DSP Separation," "Serial Processing" and "Split Processing."

Figure Four

Choosing the appropriate data model
The "MCU/DSP Separation" model involves discrete and often different tasks running on each core. One core is assigned all "MCU-type" activities, such as graphics overlays, networking management, and flow control. Additionally, this core houses the operating system, if one is used. Meanwhile, the second core is dedicated to the high-intensity DSP functions of the application. For example, compressed data may transfer over the network via the first core. Received packets can then feed the second core, which in turn performs audio and video processing.

This model is well suited to companies that employ separate task-based teams for their software development. The ability to have a "Control Team" and a "DSP Team" allows development to be accomplished in parallel, reducing critical-path dependencies in the project. This programming model also aids the testing and validation phases of the design. For example, code changes on one core do not necessarily invalidate testing efforts already completed on the other core. In addition, having identical cores available allows for the possibility of re-allocating any "unused" processing bandwidth on either core across different functions and tasks.

In the "Serial Processing" usage model, the first core performs a number of control and computational stages on the input data set, and then passes an intermediate data stream to the second core for final processing. A variation of this method partitions tasks across each core in such a way that intermediate data actually moves between cores several times before the final result is achieved. An algorithm well suited to this usage model is MPEG encoding or decoding.
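The shape of the serial model can be sketched as two chained stages, with the output of the first feeding the second. This is a structural illustration only: the stage names and the placeholder operations (offset removal, then clamping) are hypothetical stand-ins, not the stages of any real codec, and on actual hardware the hand-off would go through shared memory or DMA rather than a Python generator.

```python
def core_a_stage(samples):
    """First core: placeholder front-end stage that removes the DC offset."""
    samples = list(samples)
    if not samples:
        return
    mean = sum(samples) / len(samples)
    for s in samples:
        yield s - mean  # intermediate data handed to the second core


def core_b_stage(intermediate):
    """Second core: placeholder final stage that clamps to a 16-bit range."""
    for x in intermediate:
        yield max(-32768, min(32767, round(x)))


def serial_pipeline(samples):
    """Run the input through the first stage, then the second."""
    return list(core_b_stage(core_a_stage(samples)))


print(serial_pipeline([10, 20, 30]))
```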

The "Split Processing" model provides a more balanced usage of each core. Because there are two identical cores in a symmetric processor, traditional compute-intensive applications can be divided equally across each. Architectural features like copious on-chip memory, wide internal data paths and high-bandwidth DMA controllers all contribute to the success of systems based on "split processing." In this model, code running on each core is identical; only the data being processed is different. In a channel-streaming application, this would mean that half of the channels are processed by the first core, while the other half are processed by the second core. As another example, in a video or imaging application, alternate frames may be processed by each core.
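The two partitioning schemes mentioned above, half the channels per core and alternate frames per core, can be sketched as follows. The function names and the placeholder `process` routine are hypothetical; the point is only that both cores run the same code on different slices of the data.

```python
def split_by_halves(items):
    """Channel streaming: first half to core 0, second half to core 1."""
    mid = (len(items) + 1) // 2
    return items[:mid], items[mid:]


def split_alternating(items):
    """Video or imaging: even-numbered frames to core 0, odd to core 1."""
    return items[0::2], items[1::2]


def process(item):
    """Placeholder for the identical routine that runs on each core."""
    return item * 2


channels = list(range(8))
core0, core1 = split_by_halves(channels)
print([process(c) for c in core0], [process(c) for c in core1])

frames = list(range(5))
print(split_alternating(frames))
```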

Even when an application fits on a single core processor, the dual-core system can be exploited to reduce overall energy consumption. As an example, if an application requires a 600 MHz clock speed to run on a single-core processor like the ADSP-BF533, it must also operate at a higher voltage (1.2V) in order to attain that speed. However, if the same application is partitioned across a dual-core device like the BF561, each core can run at around 300 MHz, and the voltage on each core can be lowered significantly - to 0.8V! Because power dissipation is proportional to frequency and to the square of operating voltage, this voltage reduction from 1.2V to 0.8V (coupled with the frequency reduction from 600 MHz to 300 MHz) can have a marked impact, actually saving energy compared with the single-core solution.
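The saving can be estimated directly from the stated proportionality. The sketch below is a first-order comparison in arbitrary units; it assumes the same switched capacitance per core and ignores static (leakage) power and any shared-logic overhead, and the function name is hypothetical.

```python
def relative_dynamic_power(voltage, freq_mhz, cores=1):
    """Dynamic power in arbitrary units, using P ~ cores * V^2 * f."""
    return cores * voltage ** 2 * freq_mhz


single = relative_dynamic_power(1.2, 600)         # one core at 600 MHz, 1.2 V
dual = relative_dynamic_power(0.8, 300, cores=2)  # two cores at 300 MHz, 0.8 V
print(f"dual/single power ratio: {dual / single:.2f}")
```

Under these assumptions the two-core configuration dissipates about 4/9 of the single-core dynamic power, since the total clock rate is unchanged (2 x 300 MHz = 600 MHz) and the voltage term falls from 1.44 to 0.64.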

David Katz and Rick Gentile are senior DSP applications engineers at Analog Devices, Inc.