For many applications, allocating performance among all of the tasks in a system-on-chip (SoC) design is much easier and provides greater design flexibility with multiple CPUs than with just one control processor and multiple blocks of logic. Just using bigger control processors will not satisfy the widely varying computational demands of many of today’s designs because bigger processors often require too much power, especially for consumer devices. Multiple processor design changes the role of processors, allowing programmability to be designed into many functions while keeping power budgets under control. Multiple processor designs are now found in a number of applications ranging from cellular phones and ink-jet printers all the way up to huge network routers.

The biggest advantage of using multiple processors as SoC task blocks is that they’re programmable, so changes can be made in software after the chip design is finished. This means that complex state machines can be implemented in firmware running on the processor, significantly reducing verification time. And one SoC can often be used for multiple products, turning features on and off as necessary.

Multiple processor design promotes much more efficient use of memory blocks. A multiple processor-based approach makes most of the memories processor-visible, processor-controlled, processor-managed, processor-tested, and processor-initialized. Additionally, this reduces overall memory requirements while promoting the flexible sharing and reuse of on-chip memories.

But how do you pick the right embedded processors for multiple CPU designs? How do you partition your design to take maximum advantage of multiple processors? How do you manage the software between all of the processors? How do you connect them and manage communications in the hardware?

Partitioning the multiple processor SoC design

At the conceptual level, the entire system can be treated as a constellation of concurrent, interacting subsystems or tasks. Each task communicates with other subsystems and shares common resources (memory, shared data structures, network points). Developers start from a set of tasks for the system and exploit the parallelism by applying a spectrum of techniques, including four basic actions:

1. Allocate (mostly) independent tasks to different processors, with communications among tasks expressed via shared memory and messages.
2. Speed up each individual task by optimizing the processor on which it runs, using a configurable processor.
3. For particularly performance-critical tasks, decompose the task into a set of parallel tasks running on a set of optimized, inter-communicating processors.
4. Combine multiple low-bandwidth tasks on one processor by time-slicing. This approach degrades parallelism, but may improve SoC cost and efficiency if the processor has enough available computation cycles.

These methods interact with one another, so iterative refinement is often essential, particularly as the design evolves.

When a system’s functions are partitioned into multiple interacting function blocks, there are several possible organizational forms or structures, including:

- Heterogeneous tasks — Distinct, loosely coupled subsystems that can be implemented largely independently of each other. Figure 1 shows a system where networking, video and audio processing tasks are implemented in separate processors, sharing common memory, bus and I/O resources.
Figure 1- Simple heterogeneous system partitioning.

- Parallel tasks — Communications equipment, for example, often supports multiple communications ports, voice channels, or wireless frequency-band controllers, as shown in Figure 2. Even when the parallelism isn’t obvious, many system applications still lend themselves to parallel implementation. For example, in an image-processing system the operations on one part of a frame may be largely independent of operations on another part of that same frame. Creating a two-dimensional array of sub-image processors may achieve high parallelism without substantial algorithm redesign.
Figure 2- Parallel task system partitioning.

- Pipelined tasks — Phases of the algorithm can naturally be performed on one block of data while a subsequent phase is performed on an earlier block (also called a systolic-processing array). Figure 3 shows a pipelined architecture with multiple steps to produce the final decoded video stream.
Figure 3- Pipelined task system partitioning.

- Hybrids — Real systems usually require a mixture of these partitioning styles.
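To make the pipelined style concrete, here is a minimal single-file C sketch. The three-stage decode flow, the frame type, and the stage functions are invented placeholders, not taken from the designs above; on a real SoC each stage would run on its own processor with queues between the stages, rather than in one software-pipelined loop.

```c
/* Minimal sketch of the pipelined-task style.  On a multiprocessor SoC the
 * three stages would execute concurrently on three processors connected by
 * queues; here one loop models how stage N works on frame i while stage
 * N+1 works on the previous frame.  All names are illustrative. */
#include <stdio.h>

#define FRAMES 8

typedef struct { int id; int data; } frame_t;

static frame_t parse(int id)     { frame_t f = { id, id * 10 }; return f; }  /* stage 1 */
static frame_t decode(frame_t f) { f.data += 1; return f; }                  /* stage 2 */
static void    filter(frame_t f) { printf("frame %d -> %d\n", f.id, f.data); }/* stage 3 */

int main(void)
{
    frame_t s2 = { 0, 0 }, s3 = { 0, 0 };
    for (int i = 0; i < FRAMES + 2; i++) {
        if (i >= 2)                filter(s3);   /* stage 3 on frame i-2 */
        if (i >= 1 && i <= FRAMES) s3 = decode(s2); /* stage 2 on frame i-1 */
        if (i < FRAMES)            s2 = parse(i);   /* stage 1 on frame i   */
    }
    return 0;
}
```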
Early system modeling

If the tasks are represented as algorithms in a programming language such as C, early system modeling can verify the functionality and measure the data transfers between tasks. At this stage, tasks have not been allocated to processors, and communications among tasks are still expressed abstractly, through either a message-passing or a shared-memory programming paradigm.

An early abstract system simulation model serves as the basis for sizing the computational demands of each task. This information is not exact, but it can yield important insights into both computational and communications hot spots. Using system simulation throughout the design process has two advantages: 1) an early start to simulation provides insight into bottlenecks, and 2) the model’s role as a performance predictor gradually evolves into a role as a verification test bench. To test a subsystem, a designer replaces the subsystem’s high-level model with a lower-level implementation model.

Assigning tasks to processors

The mapping of tasks to a SoC implementation raises some complex issues. Choosing to implement a specific task in a processor, in logic, or in software is a very important decision. There are two guidelines for mapping tasks to processors:

1. The processor must have sufficient computational capacity to handle the task.
2. Tasks with similar requirements should be allocated to the same processor, as long as the processor has the computational capacity to accommodate all of the tasks.

The process of determining the right number of processors cannot be separated from the process of determining the right processor type and configuration. Traditionally, a real-time computation task is characterized with a “MIPS requirement”: how many millions of execution cycles per second are required. Figure 4 shows a set of tasks with the initial rough estimate of the MIPS requirements for a 3G wireless SoC platform.

Figure 4- Baseline task performance requirements.

A control task needs substantially more cycles if it’s running on a simple DSP than on a RISC processor. A numerical task usually needs more cycles running on a RISC CPU than on a DSP. However, most designs contain no more than two different types of processors, because mixing RISC processors and DSPs requires working with multiple software development tools.

Configurable processors can be modified to provide 10 to 50 times higher performance than general-purpose RISC processors. This often allows configurable processors to be used for tasks that previously were implemented in hardware using Verilog or VHDL. Figure 5 shows the acceleration possible with a configurable processor, reducing MIPS requirements. Staying with a single configurable processor family allows sharing the same software development tools for all the processors.

Figure 5- Task requirements after processor configuration.
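To show how the two allocation guidelines might be applied mechanically, the following C sketch packs tasks onto processors only while each processor has cycle headroom left. The task names, MIPS figures, and processor budgets are made up for illustration; real numbers would come from profiling the system model and from estimates like those in Figures 4 and 5.

```c
/* Back-of-the-envelope allocation check: every task must fit on some
 * processor (guideline 1), and tasks share a processor only while its MIPS
 * budget holds (guideline 2).  First-fit is used for simplicity; all names
 * and numbers are illustrative only. */
#include <stdio.h>

typedef struct { const char *name; int mips_needed; } task_t;
typedef struct { const char *name; int mips_budget; int mips_used; } cpu_t;

int main(void)
{
    task_t tasks[] = {
        { "control",       60 },
        { "audio_codec",  180 },
        { "video_filter", 300 },
        { "protocol",      90 },
    };
    cpu_t cpus[] = {
        { "risc_ctrl", 200, 0 },   /* general-purpose control processor    */
        { "dsp_a",     400, 0 },   /* configurable processor tuned for DSP */
    };

    for (size_t t = 0; t < sizeof tasks / sizeof tasks[0]; t++) {
        int placed = 0;
        for (size_t c = 0; c < sizeof cpus / sizeof cpus[0] && !placed; c++) {
            if (cpus[c].mips_used + tasks[t].mips_needed <= cpus[c].mips_budget) {
                cpus[c].mips_used += tasks[t].mips_needed;
                printf("%-12s -> %s\n", tasks[t].name, cpus[c].name);
                placed = 1;
            }
        }
        if (!placed)
            printf("%-12s -> no headroom; add a processor or re-configure one\n",
                   tasks[t].name);
    }
    return 0;
}
```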
Application acceleration — a common problem

Many standard, general-purpose 32-bit RISC processors aren’t fast enough to handle critical parts of some applications. The standard approach partitions the application between software running on a processor and a hardware accelerator block, but this approach has serious limitations:

1. Methods for designing and verifying large hardware blocks are labor-intensive, error-prone and slow.
2. Requirements for the portion of the application running on the accelerator may change late in the design or after the SoC is built, mandating a new silicon design.
3. Adaptation of the application software to the hardware and verification of the combined hardware/software solution may be awkward and difficult.
4. Moving data back and forth among the processor, accelerator, and memory may slow total application throughput, offsetting much or all of the benefit derived from hardware acceleration.

Ironically, the promise of concurrency between the processor and the accelerator is also often unrealized because the application, by the nature of the way it is written, may force the processor to sit idle while the accelerator performs necessary work. In addition, the accelerator will be idle during application phases that cannot exploit it.

Configurable and extensible processors offer two big advantages for accelerator design:

1. Incorporating the accelerator function into the processor eliminates the processor-accelerator communication overhead and often reduces the total silicon cost. It makes the accelerator functions far more programmable, significantly simplifies integration and testing of the total application, and gives the acceleration hardware intimate access to all of the processor’s resources.
2. Converting the accelerator to a separate processor configured for application acceleration allows the second task to run in parallel with the general-purpose processor, receiving commands through registers or through shared data memory.

Processor interface and interconnect

Four questions capture the most essential system-communications performance issues:

1. Required bandwidth — What sustained input and output data bandwidths are necessary?
2. Sensitivity to latency — What response latency (average and worst case) is required for a functional block’s requests on other memory or logic functions?
3. Data granularity — What is the typical size of a request: a large data block or a single word?
4. Blocking or non-blocking communications — Can the computation be organized so that the function block can make a request and then proceed with other work without waiting for the response to the request?

There are three basic interfaces:

1. A memory-mapped, wide interface, typically implemented as a local-memory connection; ideal where high-bandwidth, low-latency data access is required.
2. A memory-mapped, block-sized connection, typically implemented as a bus connection (the most popular, traditional processor interface).
3. An instruction-mapped, arbitrary-sized connection, implemented as a direct point-to-point connection. Instruction-mapped connections can range from a single bit to thousands of bits wide. This connection allows RTL-equivalent data-transfer speeds, but few processors support this type of connection.

Traditional processor cores provide only the block-oriented, general-bus interface. Configurable and extensible processors allow faster, more flexible communications, using direct processor-to-processor connections to reduce cost and latency.
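The difference between a bus-attached accelerator and an instruction-mapped extension shows up directly in the software. The C sketch below is hypothetical: the base address, the register layout, and the mac16_accum() intrinsic are invented placeholders standing in for whatever a particular processor-extension tool flow would actually generate.

```c
/* Two ways software might drive the same accelerated multiply-accumulate.
 * Addresses, register layout, and the intrinsic name are made-up
 * placeholders for illustration only. */
#include <stdint.h>
#include <stdio.h>

/* --- Style 1: memory-mapped, bus-attached accelerator ------------------ */
#define ACC_BASE ((volatile uint32_t *)0x40001000u)   /* hypothetical address */
enum { ACC_OP_A, ACC_OP_B, ACC_START, ACC_STATUS, ACC_RESULT };

static uint32_t mac_via_bus(uint32_t a, uint32_t b)
{
    ACC_BASE[ACC_OP_A]  = a;          /* every access crosses the system bus */
    ACC_BASE[ACC_OP_B]  = b;
    ACC_BASE[ACC_START] = 1;
    while ((ACC_BASE[ACC_STATUS] & 1u) == 0)
        ;                             /* processor idles while polling       */
    return ACC_BASE[ACC_RESULT];
}

/* --- Style 2: instruction-mapped extension ------------------------------ */
/* Plain-C stand-in for a tool-generated intrinsic; on a configurable
 * processor this body would become a single custom instruction with the
 * accumulator held in processor state, so there is no bus traffic at all. */
static uint32_t mac16_accum(uint32_t a, uint32_t b)
{
    static uint32_t acc;
    acc += (a & 0xffffu) * (b & 0xffffu);
    return acc;
}

int main(void)
{
    (void)mac_via_bus;                /* shown for contrast; needs real hardware */
    printf("accumulated: %u\n", (unsigned)mac16_accum(3, 4));
    return 0;
}
```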
Choosing the right communications structure

Once the rough number and types of processors are known and tasks are tentatively assigned to the processors, basic communications-structure design starts. The goal is to discover the least expensive communications structure that satisfies the bandwidth and latency requirements of the tasks.

When low cost and good flexibility are most important, a shared-bus architecture, in which all resources are connected to one bus, may be most appropriate. The glaring liability of the shared bus is long and unpredictable latency, particularly when a number of bus masters contend for access to different shared resources.

A parallel communications network provides high throughput with flexibility. The most common example is a crossbar connection with a two-level hierarchy of buses, shown in Figure 6.

Figure 6- General-purpose parallel communications style: on-chip mesh network.

Figure 7 shows the direct connections that can be made when the communications between the processors are well understood and will not change.

Figure 7- Optimized direct parallel communications.

Communications = software mode + hardware interconnect

Intertask communications are built on two foundations: the software communications mode and the corresponding hardware mechanism. The three basic styles of software communications between tasks are message passing, shared memory and device drivers.

Message passing makes all communication between tasks overt. All data is private to a task except when operands are sent by one task and received by another task. The send/receive model implies a queue; messages cannot be sent if the output queue is full and cannot be received if the input queue is empty. Hardware queues give the lowest latency and processor overhead, especially for small, fixed-length messages such as simple operands. Message passing is generally easier to code than shared memory when the tasks are largely independent, but it is often harder to code efficiently when the tasks are very tightly coupled.

With shared-memory communications, only one task reads from or writes to the data buffer in memory at a time, requiring explicit access synchronization. Embedded software languages, such as C, typically include features that ease shared-memory programming. (A code sketch contrasting the message-passing and shared-memory styles appears below, following the discussion of non-processor building blocks.)

The hardware-device-plus-software-device-driver model is most commonly used with complex I/O interfaces, such as networks or storage devices. The device-driver mode combines elements of message passing and shared-memory access. The principles of the device driver can be applied to almost any pair of communicating tasks, especially where the interface between tasks looks like a series of requests and responses.

Non-processor building blocks in SoCs

Processors must interface with memories, I/O interfaces and RTL blocks. These guidelines may help designers take better advantage of RAMs:

1. Off-chip RAM is much cheaper than on-chip RAM, at least for large memories.
2. Use caches and shared memories when on-chip RAM requirements are uncertain.
3. Perform a system-performance sanity check by assessing memory bandwidth. Look at the memory-transfer requirements of each task to ensure that the processor’s local memories can handle the traffic. Watch for contention latency in memory access. Increase memory width or increase the number of memories that can be active to overcome contention bottlenecks. Pay particular attention to tasks that must move data from off-chip memory, through the processor, and back to off-chip memory; these tasks can quickly consume all available bandwidth.

High-bandwidth interfaces such as LAN and WAN network connections and video ports present a design challenge. There are two approaches to efficient integration of high-bandwidth I/O:

1. Use an autonomous DMA (direct memory access) engine.
2. Couple the I/O interface tightly to a processor, allowing the function to be software controlled.
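As promised above, here is a minimal C sketch contrasting the message-passing and shared-memory styles. The queue type, buffer sizes, and ownership flag are illustrative; on a real SoC the queue could be a hardware FIFO between processors and the flag would be replaced by a proper synchronization primitive.

```c
/* Intertask communication sketches: message passing vs. shared memory.
 * Names and sizes are illustrative only. */
#include <stdint.h>
#include <stdio.h>

/* --- Message passing: bounded single-producer/single-consumer queue ---- */
#define QLEN 8
typedef struct {
    volatile uint32_t buf[QLEN];
    volatile unsigned head, tail;   /* producer advances head, consumer advances tail */
} msgq_t;

static int msg_send(msgq_t *q, uint32_t m)    /* returns 0 when the queue is full  */
{
    unsigned next = (q->head + 1) % QLEN;
    if (next == q->tail) return 0;
    q->buf[q->head] = m;
    q->head = next;
    return 1;
}

static int msg_recv(msgq_t *q, uint32_t *m)   /* returns 0 when the queue is empty */
{
    if (q->tail == q->head) return 0;
    *m = q->buf[q->tail];
    q->tail = (q->tail + 1) % QLEN;
    return 1;
}

/* --- Shared memory: one task owns the buffer at a time ----------------- */
typedef struct {
    volatile int ready;             /* 0 = producer owns the buffer, 1 = consumer owns it */
    uint32_t data[64];
} shared_buf_t;

int main(void)
{
    msgq_t q = { {0}, 0, 0 };
    uint32_t m;

    msg_send(&q, 42);               /* task A sends an operand...   */
    if (msg_recv(&q, &m))           /* ...task B receives it        */
        printf("received %u\n", (unsigned)m);

    static shared_buf_t sb;
    sb.data[0] = 7;                 /* producer fills the buffer,            */
    sb.ready = 1;                   /* then hands ownership to the consumer  */
    if (sb.ready)
        printf("shared value %u\n", (unsigned)sb.data[0]);
    return 0;
}
```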
Even though processors make a potent alternative to hardwired logic blocks, often RTL blocks have already been designed and verified, so reuse them if appropriate. Two interface mechanisms to RTL blocks include:

1. Map hardware registers into local memory space, which makes the hardware block look much like an I/O device and makes the controlling software look much like a standard device driver.
2. Extend the instruction set to directly stimulate hardware functions. With configurable processors, the designer can specify new processor instructions that take hardware-block outputs as instruction-source operands and use hardware-block inputs as instruction-result destinations, thus avoiding the use of intermediate registers and greatly accelerating the task by eliminating I/O overhead.

Processors as building blocks — it’s real

The move toward multiple processor SoC designs is very real. Multiple processors are used in consumer devices ranging from low-cost ink-jet printers to cellular phones. Most of the newest network processors are based on multiple processor designs. The CRS-1, the world’s fastest router, designed by Cisco Systems, employs 188 processors on a single chip and multiple chips within the system. The Cisco CRS-1 can seamlessly scale up to 92 terabits per second. At its highest capacity, and with the appropriate infrastructure, the CRS-1 system could run an 850 kilobit-per-second (Kbps) connection to every household in the U.S., transfer the entire collection of the U.S. Library of Congress in 4.6 seconds, or simultaneously connect three billion telephone calls. For more information on this powerful chip design, see the Cisco Web site.

As designers get comfortable with a processor-based approach, processors have the potential to become the next major building block for SoC designs, and SoC designers will turn to a processor-centric design methodology that has the potential to solve the ever-increasing hardware/software integration dilemma.

NOTE: Most of the concepts involved in multiple processor design are explained in greater detail in the book “Engineering the Complex SoC: Fast, Flexible Design with Configurable Processors” by Chris Rowen, published by Prentice Hall, 2004.

Ashish Dixit is vice president of hardware engineering at Tensilica Inc. He joined Tensilica in early 1998. Previously, he held numerous positions ranging from design engineer to director of engineering at Silicon Graphics, working on MIPS VLSI chip development. From 1983 to 1989, Ashish was a quality and reliability engineer as well as a logic design engineer at Intel Corp. He has been issued eight patents related to configurable processors and memory management in RISC and CISC processors.