For many applications, allocating performance among all of the tasks in a system-on-chip (SoC) design is much easier, and provides greater design flexibility, with multiple CPUs than with just one control processor and multiple blocks of logic. Multiple-processor design changes the role of processors, making it possible to design programmability into many functions while keeping power budgets under control.

The biggest advantage of using multiple processors as SoC task blocks is that they're programmable, so changes can be made in software after the chip design is finished. This means that complex state machines can be implemented in firmware running on the processor, significantly reducing verification time. And one SoC can often be used for multiple products, turning features on and off as necessary.

Multiple-processor design promotes much more efficient use of memory blocks. A multiple-processor-based approach makes most of the memories processor-visible, processor-controlled, processor-managed, processor-tested and processor-initialized. Additionally, this reduces overall memory requirements while promoting the flexible sharing and reuse of on-chip memories.

But how do you pick the right embedded processors for multiple-CPU designs? How do you partition your design to take maximum advantage of multiple processors? How do you manage the software among all the processors? How do you connect them and manage communications in the hardware?

Four techniques

At the conceptual level, the entire system can be treated as a constellation of concurrent, interacting subsystems or tasks. Each task communicates with other subsystems and shares common resources (memory, data structures, network points).
Developers start from a set of tasks for the system and exploit the parallelism by applying a spectrum of techniques, including four basic actions:
- Allocate (mostly) independent tasks to different processors, with communications among tasks expressed via shared memory and messages.
- Speed up each individual task by optimizing the processor on which it runs using a configurable processor.
- For particularly performance-critical tasks, decompose the task into a set of parallel tasks running on a set of optimized, intercommunicating processors.
- Combine multiple low-bandwidth tasks on one processor by time-slicing. This approach degrades parallelism, but may improve SoC cost and efficiency if the processor has enough available computation cycles.
These methods interact with one another, so iterative refinement is often essential, particularly as the design evolves.

When a system's functions are partitioned into multiple interacting function blocks, there are several possible organizational forms or structures, including:
- Heterogeneous tasks: distinct, loosely coupled subsystems that can be implemented largely independently of each other.
- Parallel tasks: Communications equipment often supports multiple communications ports, voice channels or wireless frequency-band controllers, each of which can run as an independent instance of the same task.
- Pipelined tasks: Phases of the algorithms can naturally be performed on one block of data while a subsequent phase is performed on an earlier block (also called a systolic-processing array).
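The pipelined form can be illustrated with a small scheduling sketch. It is not taken from the article: the function name `pipeline_schedule` and the stage names are hypothetical, and each stage is assumed to take exactly one time step per block. It shows how, at any step, a later stage works on an earlier block while the first stage accepts new data.

```python
def pipeline_schedule(num_blocks, stages):
    """Return, for each time step, which data block each stage is processing.

    Assumption: every stage takes one time step per block. At step t,
    stage number s works on block t - s, so later stages handle earlier
    blocks while stage 0 takes in new ones.
    """
    schedule = []
    total_steps = num_blocks + len(stages) - 1  # fill + drain time
    for t in range(total_steps):
        step = {}
        for s, name in enumerate(stages):
            block = t - s
            if 0 <= block < num_blocks:
                step[name] = block
        schedule.append(step)
    return schedule


if __name__ == "__main__":
    # Hypothetical three-stage pipeline, one processor per stage.
    for t, step in enumerate(pipeline_schedule(4, ["filter", "transform", "encode"])):
        print(t, step)
```

Once the pipeline is full, all three hypothetical stages are busy in the same step, which is where the throughput gain of this organization comes from.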
Assigning tasks to processors

Some complex issues arise when tasks are mapped to an SoC implementation. Choosing to implement a specific task in a processor, in logic or in software is a very important decision. There are two guidelines for mapping tasks to processors:
- The processor must have enough computational capacity to handle the task.
- Tasks with similar requirements should be allocated to the same processor, as long as the processor has the computational capacity to accommodate all of the tasks.
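The two guidelines above can be sketched as a simple greedy mapping. This is only an illustration under stated assumptions, not the author's method: the function `assign_tasks`, the "kind" labels used to stand in for "similar requirements," and all cycle figures are hypothetical. Tasks are placed largest first onto the first processor of a matching kind that still has capacity.

```python
def assign_tasks(tasks, processors):
    """Greedy first-fit-decreasing mapping of tasks to processors.

    tasks:      list of (task_name, kind, required_mips)   -- hypothetical estimates
    processors: list of (cpu_name, kind, capacity_mips)    -- hypothetical budgets
    A task may only share a processor with tasks of the same kind, and only
    while the processor still has enough capacity left.
    """
    remaining = {cpu: cap for cpu, _kind, cap in processors}
    placement = {cpu: [] for cpu, _kind, _cap in processors}
    # Place the most demanding tasks first so they are not squeezed out.
    for task, kind, need in sorted(tasks, key=lambda t: t[2], reverse=True):
        for cpu, cpu_kind, _cap in processors:
            if cpu_kind == kind and remaining[cpu] >= need:
                placement[cpu].append(task)
                remaining[cpu] -= need
                break
        else:
            raise ValueError(f"no {kind} processor has capacity for {task}")
    return placement


if __name__ == "__main__":
    tasks = [("ui", "control", 40), ("protocol", "control", 90),
             ("fir", "signal", 150), ("fft", "signal", 120)]
    processors = [("cpu0", "control", 150),
                  ("dsp0", "signal", 200), ("dsp1", "signal", 200)]
    print(assign_tasks(tasks, processors))
```

If the function raises an error, the tentative processor count or configuration is too small for the task set, which is one reason the mapping usually has to be revisited as the design evolves.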
The process of determining the right number of processors cannot be separated from the process of determining the right processor type and configuration. Traditionally, a real-time computation task is characterized with a "MIPS requirement": how many millions of execution cycles per second are required. A control task needs substantially more cycles if it's running on a simple DSP rather than a RISC processor. A numerical task usually needs more cycles running on a RISC CPU than a DSP. However, most designs contain no more than two types of processors, because mixing RISC processors and DSPs requires working with multiple software development tools.

Configurable processors can be modified to provide 10 to 50 times higher performance than general-purpose RISC processors. This often allows configurable processors to be used for tasks that previously were implemented in hardware using Verilog or VHDL. Staying with a single configurable processor family allows the same software development tools to be shared for all the processors.

Once the rough number and types of processors are known and tasks are tentatively assigned to the processors, basic communications structure design starts. The goal is to discover the least expensive communications structure that satisfies the bandwidth and latency requirements of the tasks.

When low cost and flexibility are most important, a shared-bus architecture, in which all resources are connected to one bus, may be most appropriate. The glaring liability of the shared bus is long and unpredictable latency, particularly when a number of bus masters contend for access to different shared resources. A parallel communications network provides high throughput with flexibility. The most common example is a crossbar connection with a two-level hierarchy of buses. Also, direct connections can be made when the communications among the processors are well-understood and will not change.
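The search for the least expensive communications structure that still meets bandwidth and latency requirements can be sketched as a filter-then-minimize step. Everything below is an assumption for illustration: the `Interconnect` record, the `cheapest_feasible` function and all cost, bandwidth and latency figures are placeholders, not measurements of any real bus or crossbar.

```python
from dataclasses import dataclass


@dataclass
class Interconnect:
    name: str
    relative_cost: float       # placeholder cost units
    bandwidth_mb_s: float      # placeholder sustained bandwidth, MB/s
    worst_latency_cycles: int  # placeholder worst-case access latency


def cheapest_feasible(candidates, need_bandwidth_mb_s, max_latency_cycles):
    """Return the lowest-cost candidate meeting both requirements, or None."""
    feasible = [c for c in candidates
                if c.bandwidth_mb_s >= need_bandwidth_mb_s
                and c.worst_latency_cycles <= max_latency_cycles]
    return min(feasible, key=lambda c: c.relative_cost) if feasible else None


# Placeholder candidates; real figures would come from the design's own analysis.
CANDIDATES = [
    Interconnect("shared_bus", 1.0, 400.0, 60),
    Interconnect("direct_links", 2.0, 800.0, 4),
    Interconnect("crossbar", 3.0, 1600.0, 12),
]

if __name__ == "__main__":
    print(cheapest_feasible(CANDIDATES, 300.0, 100))   # relaxed latency bound
    print(cheapest_feasible(CANDIDATES, 1000.0, 20))   # tight latency, high bandwidth
```

With a relaxed latency bound the shared bus is selected because it is cheapest. Tightening the latency bound rules it out, which mirrors the trade-off described above.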
Intertask communications are built on two foundations: the software communications mode and the corresponding hardware mechanism. The three basic styles of software communications among tasks are message passing, shared memory and device drivers.

Message passing makes all communications among tasks overt. All data is private to a task except when operands are sent by one task and received by another. Message passing is generally easier to code than shared memory when the tasks are largely independent but often harder to code efficiently with tightly coupled tasks.

With shared-memory communications, only one task reads from or writes to the data buffer in memory at a time, requiring explicit access synchronization. Embedded-software languages, such as C, typically include features that ease shared-memory programming.

The hardware-device-plus-software-device-driver model is most commonly used with complex I/O interfaces, such as networks or storage devices. The device driver mode combines elements of message passing and shared-memory access.

Processors must interface with memories, I/O interfaces and RTL blocks. These guidelines may help designers take better advantage of RAMs:
- Off-chip RAM is much cheaper than on-chip RAM, at least for large memories.
- Use caches and shared memories when on-chip RAM requirements are uncertain.
- Do a system performance sanity check by assessing memory bandwidth. Look at the memory transfer requirements of each task to ensure that the processor's local memories can handle the traffic.
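The bandwidth sanity check in the last guideline comes down to simple arithmetic. The sketch below assumes a local memory that can complete one access of `width_bytes` per clock cycle. The function name, task names and traffic figures are hypothetical.

```python
def memory_utilization(task_traffic_bytes_per_s, width_bytes, clock_hz):
    """Fraction of peak local-memory bandwidth consumed by a set of tasks.

    Assumption: the memory completes one access of `width_bytes` per cycle,
    so peak bandwidth is width_bytes * clock_hz bytes per second.
    """
    peak = width_bytes * clock_hz
    return sum(task_traffic_bytes_per_s.values()) / peak


if __name__ == "__main__":
    # Hypothetical per-task transfer requirements in bytes per second.
    traffic = {"audio_decode": 40_000_000, "packet_filter": 120_000_000}
    used = memory_utilization(traffic, width_bytes=4, clock_hz=200_000_000)
    print(f"local memory utilization: {used:.0%}")
```

A result close to or above 1.0 suggests the local memory cannot carry the traffic, so the memory would have to be widened or the tasks split across more memories.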
Watch for contention latency in memory access. Increase memory width or increase the number of memories that can be active to overcome contention bottlenecks. Pay particular attention to tasks that must move data from off-chip memory through the processor and back to off-chip memory; these tasks can quickly consume all available bandwidth.

The move toward multiple-processor SoC designs is very real. Multiple processors are used in consumer devices ranging from low-cost inkjet printers to cell phones. As designers get comfortable with a processor-based approach, processors have the potential to become the next major building block for SoC designs, and SoC designers will turn to a processor-centric design methodology that has the potential to solve the ever-increasing hardware/software integration dilemma.

Ashish Dixit (adixit@tensilica.com) is vice president of hardware engineering at Tensilica Inc. (Santa Clara, Calif.).