IP cores crowd SoCs

By Ron Wilson, EE Times
June 16, 2003 (12:16 p.m. EST)
URL: http://www.eetimes.com/story/OEG20030616S0083

SAN MATEO, Calif. — History may be repeating itself as chip designers head down the same path their computer brethren once trod—toward multiprocessing. Like the engineers who designed big, multiple-CPU minicomputers a decade or two ago, designers building embedded system-level ICs are finding that multiprocessing can be powerful, but it's rarely as easy as it sounds.

The growing trend toward putting several processor cores on the same chip is enabled by the expanded transistor budgets of 0.18- and 0.13-micron processes. Fueling this drive is the attractiveness of using standard, already-validated processor intellectual property (IP) and easily alterable software in place of application-specific hardware. The trend can be spotted in a number of technical papers to be delivered at this week's Embedded Processor Forum in San Jose, Calif.

The argument goes like this: CPU cores are now quite small and fast, and relatively power-efficient. By adding specialized instructions (and hence, usually, special-purpose data paths), these cores can be made to perform individually at nearly the speed of dedicated hardware. So the problem of designing a complex data path is replaced by the ostensibly simpler one of creating multiple instances of a block of standard IP. All the application-related risks, such as misunderstanding of the requirements, changes in specs or simply implementation errors, move to the software team, where an engineering change order only means a new software build, not a mask spin.

These arguments have attracted attention for a spectrum of designs, from network processing engines to communications signal processors to digital cameras. But as design teams gain more experience with multiprocessor chips, they are rediscovering a truth that nearly died with the last of the big multiprocessor minicomputers: Easier said than done.

In its pure form, the use of multiple processor cores in embedded design is delightfully elegant. Chris Rowen, president and CEO of Tensilica Inc. (Santa Clara, Calif.), said the technique works hand in glove with one of the most important trends in system-on-chip (SoC) design. "I think we are seeing a definite shift to earlier involvement in the design process by the software people," he said.

Finding parallelism

Rather than waiting for a hardware specification to arrive, Rowen suggested, the applications software team in many embedded designs is involved in system modeling from the beginning. That makes it possible to explore the design as a set of abstract tasks with functional and performance requirements, delaying the commitment to a particular implementation. And that, in turn, leaves open the possibility of decomposing the tasks into software that can be executed on a collection of embedded-processor cores.

This approach contains the implicit assumption that in some way the tasks can be decomposed to expose a good deal of parallelism in the application. If, to take the extreme case, the entire application were a single sequence of events that could not be separated into parallel paths, there would be no advantage to having more than one data path, and the design problem would boil down to making that one path as fast as possible.
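The limit described above is commonly expressed as Amdahl's law, which the article does not name. The following is a minimal sketch, assuming the work splits cleanly into a serial part and a perfectly parallel part; the function name and the 90 percent figure are illustrative, not taken from any design discussed here.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at full length; the parallel part is divided among the cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Extreme case from the text: a single sequence of events gains nothing from more cores.
print(amdahl_speedup(0.0, 8))            # 1.0
# Hypothetical application with 90 percent parallel work on 8 cores.
print(round(amdahl_speedup(0.9, 8), 2))  # 4.71
```

Even with 90 percent of the work parallel, eight cores deliver less than a fivefold gain under these assumptions, which is why the decomposition step matters so much.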

But there are opportunities for parallelism in most applications. Primarily, said Rowen, these are found in two sources: entirely independent functions within the application, or parallelism within the data.

If there is enough independence, at either the task or the data level, between parallel tasks in the application, then multiprocessor-chip design in fact can be fairly simple. Figure out a memory architecture, probably with at least some shared external memory in which to pull together results, and plop down processor and memory instances. Hook it all together with buses as defined by the processor IP providers, add some peripheral hardware and you are home free.

But often there is no obvious way to partition the problem to make the tasks that independent. If so, the traffic level between tasks goes up. Generally that means that simple, dedicated memory architectures will start to merge into a shared-memory scheme in which all the CPU cores have their own caches, but share a common memory hierarchy below that level.

Communication between tasks now becomes a major part of the design effort, first for the software team and, as things get more intense, for the hardware team. The overhead of copying data structures around, keeping track of ownership and permissions, and synchronizing things with semaphores, all in software, quickly gets too complex and too slow. The call goes out for a hardware coherency protocol to keep copies of the data in caches and main memory coherent. It's at this point the SoC design begins to resemble the multiprocessor minicomputer architectures of the '80s.

A striking example of the complexity, and the value, of multiprocessing is seen in the DirectPath architecture of Xiran (Irvine, Calif.). The DirectPath chip is intended to be a high-speed pipe between storage clusters and networks, supporting streaming data, video, voice and other time-critical formats that tend to saturate typical CPU-centric network connections.

Rather than provide dedicated hardware engines for each combination of protocols the chip might face in any mix of packet streams, Xiran opted for a symmetric-multiprocessing architecture based on several instances of the ARCTangent core from ARC International. The individual cores were modified to support extremely fast data transfer, and fewer than 10 cores (Xiran is rather cagey on the exact count) were instantiated on the chip.

"We extract parallelism by partitioning the protocol tasks across processors," said Mehran Ramezani, senior director of software development for Xiran.

Arduous programming

But the team found that multiprocessing made for arduous programming. Hardware support was necessary for task synchronization and data coherency. The memory architecture chosen included a shared, very high-bandwidth data cache and dedicated instruction caches, with an external SDRAM pool. Hardware snooping and invalidation were designed into the cache controller, mainly to snoop direct-memory-access transactions.

"We added some special features to the [system] bus to increase bandwidth," said Wei (Kevin) Tsai, Xiran technologist. "In a way, the way we matched bus bandwidth to the data and snooping requirements of the application kind of defines the platform." He likened it to "impedance matching between the control and data planes."

"The cache design was the most challenging part of the design overall," said Vahid Ordoubadian, vice president of advanced technology development. "We depended heavily on several supercomputing experts on the team, including Tsai."

The link to supercomputer architectures will become more apparent at the Embedded Processor Forum, as MIPS Technologies Inc. (Mountain View, Calif.) describes its most recent cut at a CPU core for the embedded market. Architecturally, Topaz looks a lot like the chips that were going into multiprocessor servers not too long ago. Topaz uses a scalar architecture without the modern complexities of superscalar pipelines, speculative execution and the like, but it is very much designed with complex multiprocessing architectures in mind.

Thomas Petersen, director of product strategy for the synthesizable core, said that MIPS chose to include a full hardware coherency protocol, specifically for applications that would use multiple instances of the core. The architecture permits not only snooping and invalidation, but also forwarding. That is, if one data cache detects a read operation from another processor for an address for which it has the current data, that cache controller will forward the data directly from its cache to the requesting CPU. This avoids forcing the owner cache to write the current value back to memory, and forcing the requesting cache to wait until after the write is complete to perform its read.
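The difference forwarding makes can be illustrated with a toy model. This is a minimal sketch, not the MIPS protocol: the class names, the single-word cache lines and the traffic counters are all assumptions made for illustration, and real designs track per-line coherency states that are omitted here.

```python
class Memory:
    """Main memory that counts how often it is accessed."""
    def __init__(self):
        self.data = {}
        self.reads = 0
        self.writes = 0

    def read(self, addr):
        self.reads += 1
        return self.data.get(addr, 0)

    def write(self, addr, value):
        self.writes += 1
        self.data[addr] = value


class Bus:
    """Shared bus on which every cache can snoop the others' requests."""
    def __init__(self, memory, forwarding):
        self.memory = memory
        self.forwarding = forwarding
        self.caches = []

    def invalidate_others(self, writer, addr):
        # Invalidation: drop stale copies held by every other cache.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)

    def fetch(self, requester, addr):
        for cache in self.caches:
            if cache is requester:
                continue
            line = cache.lines.get(addr)
            if line is not None and line[1]:  # another cache owns the current data
                value = line[0]
                if self.forwarding:
                    return value  # cache-to-cache transfer, memory untouched
                # No forwarding: the owner writes back, then the requester reads memory.
                self.memory.write(addr, value)
                cache.lines[addr] = (value, False)
                return self.memory.read(addr)
        return self.memory.read(addr)


class Cache:
    """Per-CPU data cache; each line is stored as (value, dirty)."""
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        bus.caches.append(self)

    def write(self, addr, value):
        self.bus.invalidate_others(self, addr)
        self.lines[addr] = (value, True)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = (self.bus.fetch(self, addr), False)
        return self.lines[addr][0]


def memory_traffic(forwarding):
    """One CPU writes a word, another reads it; return (value, memory accesses)."""
    memory = Memory()
    bus = Bus(memory, forwarding)
    cpu0, cpu1 = Cache(bus), Cache(bus)
    cpu0.write(0x100, 42)
    value = cpu1.read(0x100)
    return value, memory.reads + memory.writes


print(memory_traffic(forwarding=True))   # (42, 0)
print(memory_traffic(forwarding=False))  # (42, 2)
```

In this model the forwarded read completes without touching memory, while the non-forwarding path costs one write-back plus one memory read before the requester sees the data.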

Including such complexity in an otherwise straightforward embedded-processor core shows the importance MIPS attaches to the cache coherency issue. This reflects the input MIPS is getting from its licensees.

In a multiprocessing app built around MIPS' new core, multiple Topaz cores share a system bus.