For many applications, allocating performance among all of the tasks in a system-on-chip (SoC) design is much easier and provides greater design flexibility with multiple CPUs than with just one control processor and multiple blocks of logic. Just using bigger control processors will not satisfy the widely varying computational demands of many of today’s designs because bigger processors often require too much power, especially for consumer devices. Multiple processor design changes the role of processors, allowing programmability to be designed into many functions while keeping power budgets under control. Multiple processor designs are now found in a number of applications ranging from cellular phones and ink-jet printers all the way up to huge network routers.

The biggest advantage of using multiple processors as SoC task blocks is that they’re programmable, so changes can be made in software after the chip design is finished. This means that complex state machines can be implemented in firmware running on the processor, significantly reducing verification time. And one SoC can often be used for multiple products, turning features on and off as necessary.

Multiple processor design promotes much more efficient use of memory blocks. A multiple processor-based approach makes most of the memories processor-visible, processor-controlled, processor-managed, processor-tested, and processor-initialized. Additionally, this reduces overall memory requirements while promoting the flexible sharing and reuse of on-chip memories.

But how do you pick the right embedded processors for multiple CPU designs? How do you partition your design to take maximum advantage of multiple processors? How do you manage the software between all of the processors? How do you connect them and manage communications in the hardware?

Partitioning the multiple processor SoC design

At the conceptual level, the entire system can be treated as a constellation of concurrent, interacting subsystems or tasks. Each task communicates with other subsystems and shares common resources (memory, shared data structures, network points). Developers start from a set of tasks for the system and exploit the parallelism by applying a spectrum of techniques, including four basic actions:

1. Allocate (mostly) independent tasks to different processors, with communications among tasks expressed via shared memory and messages.
2. Speed up each individual task by optimizing the processor on which it runs, using a configurable processor.
3. For particularly performance-critical tasks, decompose the task into a set of parallel tasks running on a set of optimized, inter-communicating processors.
4. Combine multiple low-bandwidth tasks on one processor by time-slicing. This approach degrades parallelism, but may improve SoC cost and efficiency if the processor has enough available computation cycles.

These methods interact with one another, so iterative refinement is often essential, particularly as the design evolves.

When a system’s functions are partitioned into multiple interacting function blocks, there are several possible organizational forms or structures, including:

- Heterogeneous tasks — Distinct, loosely coupled subsystems that can be implemented largely independently of each other. Figure 1 shows a system where networking, video and audio processing tasks are implemented in separate processors, sharing common memory, bus and I/O resources.
Figure 1- Simple heterogeneous system partitioning.

- Parallel tasks — Communications equipment, for example, often supports multiple communications ports, voice channels, or wireless frequency-band controllers, as shown in Figure 2. Even when the parallelism isn’t obvious, many system applications still lend themselves to parallel implementation. For example, in an image-processing system the operations on one part of a frame may be largely independent of operations on another part of that same frame. Creating a two-dimensional array of sub-image processors may achieve high parallelism without substantial algorithm redesign.
Figure 2- Parallel task system partitioning.

- Pipelined tasks — Phases of the algorithm can naturally be performed on one block of data while a subsequent phase is performed on an earlier block (also called a systolic-processing array). Figure 3 shows a pipelined architecture with multiple steps to produce the final decoded video stream.
Figure 3- Pipelined task system partitioning.

- Hybrids — Real systems usually require a mixture of these partitioning styles.
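To make the pipelined style concrete, here is a minimal single-file C sketch. The three-stage decode flow, the frame type, and the stage functions are invented placeholders, not taken from the designs above; on a real SoC each stage would run on its own processor with queues between the stages, rather than in one software-pipelined loop.

```c
/* Minimal sketch of the pipelined-task style.  On a multiprocessor SoC the
 * three stages would execute concurrently on three processors connected by
 * queues; here one loop models how stage N works on frame i while stage
 * N+1 works on the previous frame.  All names are illustrative. */
#include <stdio.h>

#define FRAMES 8

typedef struct { int id; int data; } frame_t;

static frame_t parse(int id)     { frame_t f = { id, id * 10 }; return f; }  /* stage 1 */
static frame_t decode(frame_t f) { f.data += 1; return f; }                  /* stage 2 */
static void    filter(frame_t f) { printf("frame %d -> %d\n", f.id, f.data); }/* stage 3 */

int main(void)
{
    frame_t s2 = { 0, 0 }, s3 = { 0, 0 };
    for (int i = 0; i < FRAMES + 2; i++) {
        if (i >= 2)                filter(s3);   /* stage 3 on frame i-2 */
        if (i >= 1 && i <= FRAMES) s3 = decode(s2); /* stage 2 on frame i-1 */
        if (i < FRAMES)            s2 = parse(i);   /* stage 1 on frame i   */
    }
    return 0;
}
```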
Early system modeling

If the tasks are represented as algorithms in a programming language such as C, early system modeling can verify the functionality and measure the data transfers between tasks. At this stage, tasks have not been allocated to processors, and communications among tasks are still expressed abstractly, through either a message-passing or a shared-memory programming paradigm.

An early abstract system simulation model serves as the basis for sizing the computational demands of each task. This information is not exact, but it can yield important insights into both computational and communications hot spots. Using system simulation throughout the design process has two advantages: 1) an early start to simulation provides insight into bottlenecks, and 2) the model’s role as a performance predictor gradually evolves into a role as a verification test bench. To test a subsystem, a designer replaces the subsystem’s high-level model with a lower-level implementation model.

Assigning tasks to processors

The mapping of tasks to a SoC implementation raises some complex issues. Choosing to implement a specific task in a processor, in logic, or in software is a very important decision. There are two guidelines for mapping tasks to processors:

1. The processor must have sufficient computational capacity to handle the task.
2. Tasks with similar requirements should be allocated to the same processor, as long as the processor has the computational capacity to accommodate all of the tasks.

The process of determining the right number of processors cannot be separated from the process of determining the right processor type and configuration. Traditionally, a real-time computation task is characterized with a “MIPS requirement”: how many millions of execution cycles per second are required. Figure 4 shows a set of tasks with the initial rough estimate of the MIPS requirements for a 3G wireless SoC platform.

Figure 4- Baseline task performance requirements.

A control task needs substantially more cycles if it’s running on a simple DSP than on a RISC processor. A numerical task usually needs more cycles running on a RISC CPU than on a DSP. However, most designs contain no more than two different types of processors, because mixing RISC processors and DSPs requires working with multiple software development tools.

Configurable processors can be modified to provide 10 to 50 times higher performance than general-purpose RISC processors. This often allows configurable processors to be used for tasks that previously were implemented in hardware using Verilog or VHDL. Figure 5 shows the acceleration possible with a configurable processor, reducing MIPS requirements. Staying with a single configurable processor family allows sharing the same software development tools for all the processors.

Figure 5- Task requirements after processor configuration.
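To show how the two allocation guidelines might be applied mechanically, the following C sketch packs tasks onto processors only while each processor has cycle headroom left. The task names, MIPS figures, and processor budgets are made up for illustration; real numbers would come from profiling the system model and from estimates like those in Figures 4 and 5.

```c
/* Back-of-the-envelope allocation check: every task must fit on some
 * processor (guideline 1), and tasks share a processor only while its MIPS
 * budget holds (guideline 2).  First-fit is used for simplicity; all names
 * and numbers are illustrative only. */
#include <stdio.h>

typedef struct { const char *name; int mips_needed; } task_t;
typedef struct { const char *name; int mips_budget; int mips_used; } cpu_t;

int main(void)
{
    task_t tasks[] = {
        { "control",       60 },
        { "audio_codec",  180 },
        { "video_filter", 300 },
        { "protocol",      90 },
    };
    cpu_t cpus[] = {
        { "risc_ctrl", 200, 0 },   /* general-purpose control processor    */
        { "dsp_a",     400, 0 },   /* configurable processor tuned for DSP */
    };

    for (size_t t = 0; t < sizeof tasks / sizeof tasks[0]; t++) {
        int placed = 0;
        for (size_t c = 0; c < sizeof cpus / sizeof cpus[0] && !placed; c++) {
            if (cpus[c].mips_used + tasks[t].mips_needed <= cpus[c].mips_budget) {
                cpus[c].mips_used += tasks[t].mips_needed;
                printf("%-12s -> %s\n", tasks[t].name, cpus[c].name);
                placed = 1;
            }
        }
        if (!placed)
            printf("%-12s -> no headroom; add a processor or re-configure one\n",
                   tasks[t].name);
    }
    return 0;
}
```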
Application acceleration — a common problem

Many standard, general-purpose 32-bit RISC processors aren’t fast enough to handle critical parts of some applications. The standard approach partitions the application between software running on a processor and a hardware accelerator block, but this approach has serious limitations:

1. Methods for designing and verifying large hardware blocks are labor-intensive, error-prone and slow.
2. Requirements for the portion of the application running on the accelerator may change late in the design or after the SoC is built, mandating a new silicon design.
3. Adaptation of the application software to the hardware and verification of the combined hardware/software solution may be awkward and difficult.
4. Moving data back and forth among the processor, accelerator, and memory may slow total application throughput, offsetting much or all of the benefit derived from hardware acceleration.

Ironically, the promise of concurrency between the processor and the accelerator is also often unrealized because the application, by the nature of the way it is written, may force the processor to sit idle while the accelerator performs necessary work. In addition, the accelerator will be idle during application phases that cannot exploit it.

Configurable and extensible processors offer two big advantages for accelerator design:

1. Incorporating the accelerator function into the processor eliminates the processor-accelerator communication overhead and often reduces the total silicon cost. It makes the accelerator functions far more programmable, significantly simplifies integration and testing of the total application, and gives the acceleration hardware intimate access to all of the processor’s resources.
2. Converting the accelerator to a separate processor configured for application acceleration allows the second task to run in parallel with the general-purpose processor, receiving commands through registers or through shared data memory.

Processor interface and interconnect

Four questions capture the most essential system-communications performance issues:

1. Required bandwidth — What sustained input and output data bandwidths are necessary?
2. Sensitivity to latency — What response latency (average and worst case) is required for a functional block’s requests on other memory or logic functions?
3. Data granularity — What is the typical size of a request: a large data block or a single word?
4. Blocking or non-blocking communications — Can the computation be organized so that the function block can make a request and then proceed with other work without waiting for the response to the request?

There are three basic interfaces:

1. A memory-mapped, wide interface, typically implemented as a local-memory connection; ideal where high-bandwidth, low-latency data access is required.
2. A memory-mapped, block-sized connection, typically implemented as a bus connection (the most popular, traditional processor interface).
3. An instruction-mapped, arbitrary-sized connection, implemented as a direct point-to-point connection. Instruction-mapped connections can range from a single bit to thousands of bits wide. This connection allows RTL-equivalent data-transfer speeds, but few processors support this type of connection.

Traditional processor cores provide only the block-oriented, general-bus interface. Configurable and extensible processors allow faster, more flexible communications, using direct processor-to-processor connections to reduce cost and latency.
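The difference between a bus-attached accelerator and an instruction-mapped extension shows up directly in the software. The C sketch below is hypothetical: the base address, the register layout, and the mac16_accum() intrinsic are invented placeholders standing in for whatever a particular processor-extension tool flow would actually generate.

```c
/* Two ways software might drive the same accelerated multiply-accumulate.
 * Addresses, register layout, and the intrinsic name are made-up
 * placeholders for illustration only. */
#include <stdint.h>
#include <stdio.h>

/* --- Style 1: memory-mapped, bus-attached accelerator ------------------ */
#define ACC_BASE ((volatile uint32_t *)0x40001000u)   /* hypothetical address */
enum { ACC_OP_A, ACC_OP_B, ACC_START, ACC_STATUS, ACC_RESULT };

static uint32_t mac_via_bus(uint32_t a, uint32_t b)
{
    ACC_BASE[ACC_OP_A]  = a;          /* every access crosses the system bus */
    ACC_BASE[ACC_OP_B]  = b;
    ACC_BASE[ACC_START] = 1;
    while ((ACC_BASE[ACC_STATUS] & 1u) == 0)
        ;                             /* processor idles while polling       */
    return ACC_BASE[ACC_RESULT];
}

/* --- Style 2: instruction-mapped extension ------------------------------ */
/* Plain-C stand-in for a tool-generated intrinsic; on a configurable
 * processor this body would become a single custom instruction with the
 * accumulator held in processor state, so there is no bus traffic at all. */
static uint32_t mac16_accum(uint32_t a, uint32_t b)
{
    static uint32_t acc;
    acc += (a & 0xffffu) * (b & 0xffffu);
    return acc;
}

int main(void)
{
    (void)mac_via_bus;                /* shown for contrast; needs real hardware */
    printf("accumulated: %u\n", (unsigned)mac16_accum(3, 4));
    return 0;
}
```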
Choosing the right communications structure

Once the rough number and types of processors are known and tasks are tentatively assigned to the processors, basic communications-structure design starts. The goal is to discover the least expensive communications structure that satisfies the bandwidth and latency requirements of the tasks.

When low cost and good flexibility are most important, a shared-bus architecture, in which all resources are connected to one bus, may be most appropriate. The glaring liability of the shared bus is long and unpredictable latency, particularly when a number of bus masters contend for access to different shared resources.

A parallel communications network provides high throughput with flexibility. The most common example is a crossbar connection with a two-level hierarchy of buses, shown in Figure 6.

Figure 6- General-purpose parallel communications style: on-chip mesh network.

Figure 7 shows the direct connections that can be made when the communications between the processors are well understood and will not change.

Figure 7- Optimized direct parallel communications.

Communications = software mode + hardware interconnect

Intertask communications are built on two foundations: the software communications mode and the corresponding hardware mechanism. The three basic styles of software communications between tasks are message passing, shared memory and device drivers.

Message passing makes all communication between tasks overt. All data is private to a task except when operands are sent by one task and received by another task. The send/receive model implies a queue; messages cannot be sent if the output queue is full and cannot be received if the input queue is empty. Hardware queues give the lowest latency and processor overhead, especially for small, fixed-length messages such as simple operands. Message passing is generally easier to code than shared memory when the tasks are largely independent, but it is often harder to code efficiently when the tasks are very tightly coupled.

With shared-memory communications, only one task reads from or writes to the data buffer in memory at a time, requiring explicit access synchronization. Embedded software languages, such as C, typically include features that ease shared-memory programming. (A code sketch contrasting the message-passing and shared-memory styles appears below, following the discussion of non-processor building blocks.)

The hardware-device-plus-software-device-driver model is most commonly used with complex I/O interfaces, such as networks or storage devices. The device-driver mode combines elements of message passing and shared-memory access. The principles of the device driver can be applied to almost any pair of communicating tasks, especially where the interface between tasks looks like a series of requests and responses.

Non-processor building blocks in SoCs

Processors must interface with memories, I/O interfaces and RTL blocks. These guidelines may help designers take better advantage of RAMs:

1. Off-chip RAM is much cheaper than on-chip RAM, at least for large memories.
2. Use caches and shared memories when on-chip RAM requirements are uncertain.
3. Perform a system-performance sanity check by assessing memory bandwidth. Look at the memory-transfer requirements of each task to ensure that the processor’s local memories can handle the traffic. Watch for contention latency in memory access. Increase memory width or increase the number of memories that can be active to overcome contention bottlenecks. Pay particular attention to tasks that must move data from off-chip memory, through the processor, and back to off-chip memory; these tasks can quickly consume all available bandwidth.

High-bandwidth interfaces such as LAN and WAN network connections and video ports present a design challenge. There are two approaches to efficient integration of high-bandwidth I/O:

1. Use an autonomous DMA (direct memory access) engine.
2. Couple the I/O interface tightly to a processor, allowing the function to be software controlled.
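As promised above, here is a minimal C sketch contrasting the message-passing and shared-memory styles. The queue type, buffer sizes, and ownership flag are illustrative; on a real SoC the queue could be a hardware FIFO between processors and the flag would be replaced by a proper synchronization primitive.

```c
/* Intertask communication sketches: message passing vs. shared memory.
 * Names and sizes are illustrative only. */
#include <stdint.h>
#include <stdio.h>

/* --- Message passing: bounded single-producer/single-consumer queue ---- */
#define QLEN 8
typedef struct {
    volatile uint32_t buf[QLEN];
    volatile unsigned head, tail;   /* producer advances head, consumer advances tail */
} msgq_t;

static int msg_send(msgq_t *q, uint32_t m)    /* returns 0 when the queue is full  */
{
    unsigned next = (q->head + 1) % QLEN;
    if (next == q->tail) return 0;
    q->buf[q->head] = m;
    q->head = next;
    return 1;
}

static int msg_recv(msgq_t *q, uint32_t *m)   /* returns 0 when the queue is empty */
{
    if (q->tail == q->head) return 0;
    *m = q->buf[q->tail];
    q->tail = (q->tail + 1) % QLEN;
    return 1;
}

/* --- Shared memory: one task owns the buffer at a time ----------------- */
typedef struct {
    volatile int ready;             /* 0 = producer owns the buffer, 1 = consumer owns it */
    uint32_t data[64];
} shared_buf_t;

int main(void)
{
    msgq_t q = { {0}, 0, 0 };
    uint32_t m;

    msg_send(&q, 42);               /* task A sends an operand...   */
    if (msg_recv(&q, &m))           /* ...task B receives it        */
        printf("received %u\n", (unsigned)m);

    static shared_buf_t sb;
    sb.data[0] = 7;                 /* producer fills the buffer,            */
    sb.ready = 1;                   /* then hands ownership to the consumer  */
    if (sb.ready)
        printf("shared value %u\n", (unsigned)sb.data[0]);
    return 0;
}
```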
Even though processors make a potent alternative to hardwired logic blocks, often RTL blocks have already been designed and verified, so reuse them if appropriate. Two interface mechanisms to RTL blocks include:

1. Map hardware registers into local memory space, which makes the hardware block look much like an I/O device and makes the controlling software look much like a standard device driver.
2. Extend the instruction set to directly stimulate hardware functions. With configurable processors, the designer can specify new processor instructions that take hardware-block outputs as instruction-source operands and use hardware-block inputs as instruction-result destinations, thus avoiding the use of intermediate registers and greatly accelerating the task by eliminating I/O overhead.

Processors as building blocks — it’s real

The move toward multiple processor SoC designs is very real. Multiple processors are used in consumer devices ranging from low-cost ink-jet printers to cellular phones. Most of the newest network processors are based on multiple processor designs. The CRS-1, the world’s fastest router, designed by Cisco Systems, employs 188 processors on a single chip and multiple chips within the system. The Cisco CRS-1 can seamlessly scale up to 92 terabits per second. At its highest capacity, and with the appropriate infrastructure, the CRS-1 system could run an 850 kilobit-per-second (Kbps) connection to every household in the U.S., transfer the entire collection of the U.S. Library of Congress in 4.6 seconds, or simultaneously connect three billion telephone calls. For more information on this powerful chip design, see the Cisco Web site.

As designers get comfortable with a processor-based approach, processors have the potential to become the next major building block for SoC designs, and SoC designers will turn to a processor-centric design methodology that has the potential to solve the ever-increasing hardware/software integration dilemma.

NOTE: Most of the concepts involved in multiple processor design are explained in greater detail in the book “Engineering the Complex SoC: Fast, Flexible Design with Configurable Processors” by Chris Rowen, published by Prentice Hall, 2004.

Ashish Dixit is vice president of hardware engineering at Tensilica Inc. He joined Tensilica in early 1998. Previously, he held numerous positions ranging from design engineer to director of engineering at Silicon Graphics, working on MIPS VLSI chip development. From 1983 to 1989, Ashish was a quality and reliability engineer as well as a logic design engineer at Intel Corp. He has been issued eight patents related to configurable processors and memory management in RISC and CISC processors.