Adding multiple processors to a system-on-chip may seem like an obvious way to boost performance, save power and leverage hardware acceleration where it's most needed. But multiprocessor SoCs raise design challenges that may not have obvious solutions. On the software side, the challenge is partitioning the design and assigning tasks. On the hardware side, it's finding the right communications infrastructure and memory hierarchy to ensure high-bandwidth communications among processors, memories and peripherals.

"For most people, the biggest challenge of an MPSoC is, 'Where do I get my tasks to run?' " said John Goodacre, program manager for multiprocessing at ARM Ltd. "With large heterogeneous designs, you are asking the programmer to split up all the code and run it in the right place at the right time and not waste silicon."

"Heterogeneous MPSoCs are a solution to the design productivity challenge," Alain Arteri, director of engineering at STMicroelectronics, said in July during a keynote speech at the MPSoC Conference in Margaux, France. "But SoC design then becomes a memory hierarchy and bus design issue."

Many MPSoC designs today are heterogeneous. For example, a general-purpose CPU might be combined with a DSP or a video or graphics accelerator. An example is STMicroelectronics' Nomadik multimedia platform, which combines an ARM CPU with DSP subsystems and hardware accelerators.

There is, however, a trend toward MPSoCs that include multiple iterations of identical, general-purpose processing elements. These open the door to symmetric multiprocessing (SMP) and, arguably, a simpler software and hardware design challenge. The ARM11 MPCore, which includes up to four tightly coupled CPUs, provides a building block for an SMP approach. Some heterogeneous approaches use multiple iterations of identical processors as well. IBM's Cell architecture, for example, includes one power processing element and eight identical synergistic processing elements.
Cradle Technologies' multiprocessing DSP architecture comprises "quads" that include up to four general-purpose processing cores and up to eight DSP engines with their associated memory.

Behind all MPSoC strategies is a drive to higher levels of performance. For many designers, however, saving power may be an even stronger motivation. Using what ST's Arteri called "multiple noninterfering domains of intense activity," designers can turn on subsystems as needed and can offload tasks that would otherwise run on power-hungry CPUs. But to take full advantage of such benefits, said Kourosh Amiri, director of marketing at Cradle Technologies, "system developers need an architecture that is designed with flexible application partitioning; intelligent resource sharing among the processing cores; and high bandwidth between the compute engines, memories and I/Os."

It starts with software

With a heterogeneous MPSoC design, designers start with a set of tasks and exploit parallelism by allocating tasks to different processors. Communications between tasks are typically expressed through shared memory and messages. Synchronization primitives determine such matters as how memory is shared at any given moment.

"I believe the worst [MPSoC problem] is the software one: the heterogeneous way programmers have to partition their code, the limitation of what they can scale to, and the portability and migration of code between designs," Goodacre said. "I'm hoping the MPCore and the trend to SMP will give these designs a road map."

With a heterogeneous or "asymmetric" MPSoC, Goodacre said, "you write a bit of code for this processor, a bit of code for that processor, and determine what runs where and how they communicate. With SMP, you write your tasks and just give them to the operating system."

Drew Wingard, CTO at Sonics Inc., agreed that "uniform" computing fabrics are nice from a software perspective because they make it easier to schedule resources.
But they're not as optimal from a performance, power or die size perspective, he said, because there's no dedicated hardware accelerator or processor to make tasks run faster.

"People think multiple processors will give you the benefit of more power, but the technical challenge is how you would split a task that would run on one processor into four," said Simon Davidmann, president and CEO of MPSoC tool startup Imperas Inc. "If we want to use multiple processing elements, we're going to have to write software in a more parallel way. There are going to be new programming models."

Once the processors are identified and the tasks are tentatively assigned, designers can build the underlying communications structure that links processors, memories and other on-chip resources. The goal is the least-expensive architecture that satisfies the bandwidth and latency requirements of the tasks.

Many MPSoCs use shared buses, which provide an inexpensive solution but can have latency problems. Parallel buses that allow more than one communication path between tasks can provide an advantage. Some designs also use crossbar switches, which can provide high-bandwidth dedicated paths. Then there's the "network-on-chip" (NoC), generally a switched interconnect fabric in which messages or packets are sent over dynamically chosen paths.

The MPSoC communications problem, Sonics' Wingard said, "is not just about bandwidth; it's about the right bandwidth to the right processor at the right time." When you have six programmable processors and distributed direct-memory access, you can end up with 40 or 50 logical data-flow sources on-chip, all needing to multiplex to the same DRAM. With its "smart interconnect" intellectual property (IP), Wingard said, Sonics provides "predictable interconnect fabrics that offer arbitration and provide hardware guarantees and quality-of-service, so customers can allocate bandwidth to specific data flows."
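Davidmann's point, splitting a task that would run on one processor into four, can be illustrated with a small data-parallel sketch. This is purely illustrative (the four-core count and the summing workload are hypothetical stand-ins, and Python threads stand in for SMP cores scheduled by the operating system):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_CORES = 4  # hypothetical: four identical processing elements, SMP-style

def partial_sum(data, lo, hi):
    """The per-core kernel: each core sums only its own slice of the input."""
    return sum(data[lo:hi])

def parallel_sum(data, workers=NUM_CORES):
    """Split one task (summing an array) into `workers` subtasks and hand
    them to the runtime scheduler, rather than assigning code to cores
    by hand as in an asymmetric design."""
    chunk = (len(data) + workers - 1) // workers
    bounds = [(i * chunk, min((i + 1) * chunk, len(data)))
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(partial_sum, data, lo, hi) for lo, hi in bounds]
        return sum(f.result() for f in futures)
```

The point of the sketch is the division of labor, not speed: the programmer expresses the subtasks, and the scheduler decides what runs where, which is the SMP model Goodacre contrasts with hand-partitioned heterogeneous code.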
The Sonics MX product offers features that help with MPSoC designs, including system-level power management that shuts off power to processors when they don't need to run. In addition to a shared link, Sonics MX has a crossbar switch, which can be used to minimize latency.

ARM is stretching beyond the shared-bus approach of its widely deployed Amba AHB bus. The new Amba AXI (Advanced eXtensible Interface), Goodacre said, allows multiple outstanding transactions on a single bus. It also supports a crossbar switch, separate address/control and data phases, separate read and write channels to enable DMA, and out-of-order transaction completion.

Configurable processors' role

Beatrice Fu, senior vice president of engineering at Tensilica Inc., said configurable processors can greatly simplify the communications problem. She said Tensilica processors offer options with respect to bandwidth requirements, memory interfaces and additional load/store units. Further, she noted, Tensilica processors offer the option of direct processor-to-processor communications without getting on a bus or going through memory. These connections can potentially involve hundreds of wires, Fu said. "Having direct wires into the execution unit can certainly unburden your memory system, or eliminate the need to design memory in some cases," she said.

One possible direction for MPSoCs is the network-on-chip. With its ability to reuse wire resources based on packet switching, the NoC can provide dramatically faster data transmission, more flexibility and easier IP reuse, its advocates say. At the MPSoC Conference, researchers described NoC lab implementations that offered higher bandwidth than shared buses, albeit with greater silicon area and latency. Marcello Coppola, head of ST's Grenoble, France, research lab, defined NoC as "a flexible and scalable packet-based on-chip micronetwork designed according to a layered methodology."
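The direct core-to-core links Fu describes can be modeled as a bounded hardware FIFO between two processors, bypassing the bus and external memory. The sketch below simulates that idea with threads and a queue; the FIFO depth, the doubling "work" done by the consumer, and all names are hypothetical illustration, not any vendor's API:

```python
import threading
import queue

def run_pipeline(samples):
    """Two 'cores' joined by a direct FIFO link: core A streams values in,
    core B drains and processes them, with no shared bus in the model."""
    link = queue.Queue(maxsize=4)  # stands in for a 4-entry hardware FIFO
    results = []

    def producer():
        # 'Core A': put() blocks when the hardware FIFO would be full,
        # which is exactly the backpressure a real queue port provides.
        for s in samples:
            link.put(s)
        link.put(None)  # end-of-stream marker

    def consumer():
        # 'Core B': process each value as it arrives off the link.
        while True:
            s = link.get()
            if s is None:
                break
            results.append(s * 2)  # stand-in for the consumer's real work

    a = threading.Thread(target=producer)
    b = threading.Thread(target=consumer)
    a.start(); b.start()
    a.join(); b.join()
    return results
```

Note that the data never lands in a shared buffer both sides poll; the blocking FIFO itself is the synchronization, which is why such links can "unburden the memory system."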
"With an MPSoC, you potentially get into a situation where everybody needs access to everybody's L1 cache," said Wayne Burleson, associate professor of electrical and computer engineering at the University of Massachusetts at Amherst. "That's where the network people come in. The idea is to pipeline these communications and provide lots of bandwidth, potentially at the expense of latency."

French startup Arteris SA offers a commercial tool that generates configurable IP for its NoC implementation, a three-layer packet network that claims operation at more than 750 MHz. "There is a fairly natural match between NoC and multiprocessing," said Philippe Martin, product-marketing director at Arteris. "If you have two or three processors that don't run too fast, a bus can make it. But if you're trying to scale up to dozens of processors running at the same time, traditional buses do not scale." Martin said Arteris provides users with choices so they can make the proper trade-offs.

Sonics calls its technology an NoC. "We use multithreading in our interconnect fabrics, and we share the same set of wires for many different conversations," Wingard said. "It's essentially the same technology as is used in a switched network."

Cache and carry

Whatever communications structure is used, MPSoC designers must decide how to organize their memory system. One question is whether and how to use caching. In his conference keynote, Arteri described a "smart caching" scheme implemented for ST's Nomadik processor that involves the allocation of memory to a shared L2 cache under software control.

But do you even want caches? "You could build an MPSoC like a PC and put in lots of caches, but you might pay too much to support an inappropriate programming model," said Kees Goossens, principal research scientist at Philips Research. "Or you could focus on run-time performance: no caches, no concurrency. Do some kind of streaming, keep the data local and resolve the memory bottleneck."
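Martin's scaling argument is easy to see in a mesh-style NoC: a shared bus serializes every transfer, while a packet network's hop latency grows only with the distance between the two cores, not with the total core count. The routing sketch below assumes a hypothetical 2D mesh with dimension-ordered (XY) routing, one common NoC routing discipline:

```python
def xy_route(src, dst):
    """Dimension-ordered (XY) routing on a 2D mesh NoC: a packet travels
    fully along X, then along Y. Returns the router coordinates visited."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def hops(src, dst):
    """Latency in router hops: the Manhattan distance between the cores.
    Adding more cores elsewhere on the mesh leaves this number unchanged,
    which is the scalability edge over a single shared bus."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])
```

A bus, by contrast, behaves like a single shared segment: every extra active core adds contention for the same wires, which is Burleson's trade of bandwidth for (per-hop) latency.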
"With asymmetric [MPSoCs], you need very fast communications and movement of data between cores, and that means today's designs are not cached," said ARM's Goodacre. With MPCore's support for coherent caching, "you can turn caching back on."

Arteris' Martin noted that most processors have local caches but that most MPSoCs don't have a shared cache and therefore don't have to manage cache coherency. That's just as well, he said, because cache coherency is difficult to manage with a crossbar switch or NoC. For one thing, snoopy caches, which rely on the parallel interrogation of individual caches, require a traditional bus and won't work with crossbar switches or NoCs. "It is an approach that is not scalable and is running out of steam," Martin said. What will work with a crossbar or NoC is directory-based caching. That may be the wave of the future, but it is typically more complex to design.

L2 caches are used in some high-end MPSoCs, Sonics' Wingard said, but they are "going very close to the main processor and are not treated as a shared resource, except in homogeneous clusters like the ARM MPCore." Or cache management is done in software, as in the ST Nomadik.

As for memory access and hierarchy, MPSoC designs pose inevitable complexity. "As a system becomes more distributed, everybody can't be zero cycles away from memory, and everyone can't have zero latency," Wingard said. "People have to start thinking about prioritizing, and that choice ripples through the architecture."

See related chart
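The directory-based alternative Martin mentions can be sketched in a few lines. A minimal, MSI-like model under simplifying assumptions (invalidate-on-write, no transient states, core and line identifiers hypothetical): the directory records which cores hold each line, so a writer consults one directory entry instead of broadcasting a snoop to every cache, which is what lets the protocol ride a crossbar or NoC.

```python
class Directory:
    """Toy directory for cache coherence: tracks, per cache line,
    the set of cores that currently hold a copy."""

    def __init__(self):
        self.sharers = {}   # line address -> set of core ids holding the line

    def read(self, core, line):
        """A core loads a line: record it as a sharer.
        Returns the current sharer list (sorted, for inspection)."""
        self.sharers.setdefault(line, set()).add(core)
        return sorted(self.sharers[line])

    def write(self, core, line):
        """A core writes a line: invalidate every other sharer so the
        writer becomes the sole owner. Returns the cores invalidated;
        only those caches see traffic, with no bus-wide broadcast."""
        others = self.sharers.get(line, set()) - {core}
        self.sharers[line] = {core}
        return sorted(others)
```

Real protocols add owned/exclusive states, transient states and acknowledgment traffic, which is why the article calls directory schemes "typically more complex to design."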