Initiatives in parallel processing, instruction sets and novel architectures pose power/performance trade-offs
The chance to start a processor design with a clean sheet of paper doesn't come often. But IBM fellow Jim Kahle is getting his shot at revolutionizing the way computing systems work. As chief architect of the Cell processor, Kahle is among the handful of designers at top chip makers worldwide who are working to bring parallelism into devices designed for the consumer sector.
Back in 2000, engineers from IBM Corp., Sony Corp. and Toshiba Corp. sat down to discuss a possible collaboration. IBM had been among the first to get to a gigahertz-class processor, and the engineers around the planning table were "looking at more traditional organizations of machines," Kahle said. "But what we found was that they didn't give us the computational efficiencies that our partners, Sony and Toshiba, needed." Then the trio went to that proverbial clean sheet, drawing upon the symmetric-multiprocessing experience within IBM. The planning team "looked at the whole gamut of how to get to new levels of efficiencies," said Kahle.
With a few dozen workstations based on the Cell processor now in the hands of key videogame companies, expectations are high. The plan is that the experienced software developers in the videogame market, accustomed to parallel graphics processing, will exploit the multiprocessing Cell design. Cell's target has broadened from just the PlayStation to the digital entertainment center and even nonconsumer applications that call out for graphics-processing power, including life sciences. And the Cell team has multiplied. Besides the IBM, Sony and Toshiba engineers housed on three floors of IBM's Austin, Texas, complex, the Cell project has drawn upon engineers at remote IBM sites, including Rochester, Minn.; Yorktown Heights, N.Y.; Boeblingen, Germany; and Haifa, Israel.
In Oregon, meanwhile, Stephan Pawlowski, director of the microprocessor technology lab at Intel Corp.'s Hillsboro facility, is working on his own attempt at a revolution. Pawlowski's lab is challenged, among other things, to take the basic Intel architecture platform and figure out ways to solve digital consumer problems. Consumer gear has more limited thermal budgets than mobile computers or gaming systems, both of which can include small fans. "Our assumption is that a lot of new applications are coming out that could take advantage of the infrastructure of the X86 platform," Pawlowski said. But just dropping, say, 125 of Intel's most power-efficient mobile X86 processors on the same die would result in a 250-watt thermal budget, which is not acceptable. "There is nothing inherent in the X86 architecture that prevents it from being used in power-limited applications. There are some structures, some differences in terms of registers, but there is nothing inherently different from a power standpoint between what you can do in an ARM core and what you can do in an X86," Pawlowski said. "Technically, there is no reason it could not be in that space."
If Kahle at IBM started with a nearly clean sheet, Pawlowski starts with an instruction set. "We are trying to establish a new paradigm to run those instructions," he said, adding that "it's a personal mission, it's a pride thing for us. Developing for these power-restrictive environments, with highly threaded apps, is a hard problem. Going from an eight-way core to a 32-processor core is a challenge, starting with how to do a compiler that could extract the parallelism."
At Texas Instruments Inc., principal fellow Gene Frantz takes a different view of instruction-set architectures (ISAs). "We are beginning to see the slow death of instruction-set architectures," said Frantz, with the rise of "heterogeneous multiprocessing," that is, the use of cores optimized for their particular tasks, be it a DSP, an MCU or a hardwired circuit for a specialized job.
"We have enough ISAs. What we need to do is take the expert CPUs that do one thing very well and figure out how they can work together," Frantz said. Today, a system-on-chip (SoC) design team will divide itself into digital designers, analog designers, perhaps an RF group and so on. By rigidly defining the boundaries between those functions and how they will interact, each team can go work on its own little box. The problem is that this approach doesn't take advantage of the synergies available from modern process technologies, in which digital transistors can tackle analog and RF functions, Frantz said. Better system design languages are needed, so that design teams can do better "what if?" investigations before tackling a complex chip design. Engineers tend to be conservative in their design methodologies. And the design tool industry is "a little bit confused right now" about how to proceed with system design tools, Frantz said. In his view, the goal should be to become "more object-oriented, so that a design team can take an MP3 block and drop it into the system. The tools from Matlab, or LabView from National Instruments, are what I'm thinking of."
David Steer, director of segment marketing at ARM Inc. (Los Gatos, Calif.), would argue that ARM is well down the path that Frantz describes. In August, ARM purchased Axys Design Automation Inc., a provider of integrated processor and system-modeling and simulation solutions. "We do need to get a bigger head start on system design, and that means real co-development of hardware and software," Steer said. And with the recent acquisition of Artisan Components, the ARM Group's "strategy is to be a one-stop shop of IP [intellectual-property] providers, with much more than MPU cores to offer," he said.
[Photo: Cell teammates Masakazu Suzuoki, Sony Computer Entertainment vice president; Jim Kahle, IBM fellow; and Yoshio Masubuchi, director of engineering at Toshiba America.]
As multiple processing cores become common, ARM has moved to symmetric multiprocessing. Its MPCore product allows as many as four ARM cores to work together in one CPU subsystem, Steer said. MPCore makes it simpler to write software and increases both peak and real-time performance. For video processing and other data-intensive tasks, ARM has developed a configurable data engine called OptimoDE. The product allows a programmable core to be combined with a hardwired data engine. According to ARM, the design engineer can define the type and number of data path resource units, the data path widths, the instruction widths and the number of I/Os to the exact requirements of the application domain.
For its part, MIPS Technologies Inc. is trying to bring multithreading technology to the consumer space, working with its licensees to implement in silicon the multithreaded application-specific extension (MT-ASE) announced earlier this year. The MT-ASE defines how multithreading works in the MIPS environment and attempts to ensure software compatibility among the various MIPS vendors as they implement multithreading. Tom Petersen, a former PowerPC design engineer who now works for MIPS in technology marketing, said system designers, faced with strict cost limits, are looking to multithreading for efficiency. "On many applications, the core processor is stalled something like 50 percent of the time based on load latencies and cache misses," Petersen said. "Customers are telling us that that level of inefficiency is really a problem that adds cost. By adding multithreading, we can recover that dead space on a single core."
Petersen said Intel's implementation of what it calls hyperthreading, along with multithreading implementations from Sun Microsystems and IBM, has done an "excellent job educating the market. Those high-end processors are way ahead of us. What we want to do is bring those efficiencies that are present today at the high end down to the consumer space, drafting on the good work that others have done and making it available to the consumer market." Petersen said that multithreading is "a big investment for us. It is truly a disruptive technology that I believe is going to really shake things up." Depending on how inefficient a programmable processor is at a certain task, such as running an audio or networking algorithm that is prone to cache misses, multithreading can deliver twice the performance or better. "Some people like to think of MT as a hardware operating system," Petersen said. "Operating systems are responsible for swapping multiple tasks on a single context. Multithreading moves that into the hardware, so that task swapping is based on knowledge within the CPU pipeline." In some cases, multithreading will complement the multiprocessor approach already taken with MIPS-architecture products such as the Nexperia from Philips Semiconductors, the Xilleon digital television processor from ATI and various products within the VR processor family of NEC Electronics.
Moving to meet the needs of the mobile market, Philips Semiconductors plans to offer a lower-power version of its Nexperia PNX1500 processor, which runs at 300 MHz at 1.2 volts. Characterizing the VLIW processor at 0.9 V while slowing the clock to about 166 MHz brings the power draw down to about half a watt, compared with 1.5 W at 1.2 V and 300 MHz, said Chris Day, general manager and senior director of marketing for media processing at Philips' San Jose, Calif., operation. "For battery-operated systems, 1.5 W is a bit steep. The user could run at 0.9 V on the road and then, when they get back home, operate flat out," said Day.
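Day's numbers line up with the usual first-order rule for CMOS logic, in which switching power scales with the square of the supply voltage times the clock frequency. The short Python sketch below applies that rule to the figures quoted above. It is an illustration only: it assumes the full 1.5 W is dynamic switching power and ignores leakage, and the function name is made up for this example rather than taken from any Philips material.

```python
def scaled_dynamic_power(p_ref_w, v_ref, f_ref_mhz, v_new, f_new_mhz):
    """Scale a reference power figure by (V_new / V_ref)^2 * (f_new / f_ref).

    Assumption: all of p_ref_w is dynamic switching power (P ~ C * V^2 * f);
    leakage and I/O power are ignored.
    """
    voltage_factor = (v_new / v_ref) ** 2
    frequency_factor = f_new_mhz / f_ref_mhz
    return p_ref_w * voltage_factor * frequency_factor


# Figures quoted in the article: 1.5 W at 1.2 V and 300 MHz,
# scaled down to 0.9 V and about 166 MHz.
estimate = scaled_dynamic_power(1.5, 1.2, 300.0, 0.9, 166.0)
print(f"{estimate:.2f} W")  # prints "0.47 W"
```

Under those assumptions the estimate comes out just under half a watt, close to the figure Day cites. Most of the saving comes from the voltage term, which is why lowering the supply is worth the slower clock.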
One thrust for Philips is in the nascent market for portable media players, with a reference design based on a storage bank of two or more high-density NAND flash devices, an LCD controller and an optimized SoC solution based on the Nexperia PNX1500. The Nexperia design is a five-slot very-long-instruction-word (VLIW) engine that offers more performance for media-processing applications than traditional architectures running at the same clock speed, Day said. "With a superpipelined core like ours, we can both meet the high-performance demands of these media player markets and stay within the power budget," he said.
Overall, the processor design community has "reached the practical limits of bumping up the power consumption," said Dan Bouvier, PowerPC design manager at Freescale Semiconductor Inc. Even multithreading, as wonderful as it may be in terms of improving efficiencies, still causes more transistors to turn on, which burns more power, he noted. As scaling proceeds and additional hundreds of millions of transistors are added, "the industry faces a real challenge in the trade-offs between power and performance. There's a handful of tricks, but most of those being played out now are one-time techniques. During the next 10 years, we will need to be even more creative," Bouvier said.
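Petersen's figure of a core stalled about 50 percent of the time can be turned into a rough estimate of what hardware multithreading might recover. The Python sketch below uses a deliberately simple model that is an assumption of this example, not a description of the MIPS MT-ASE: each thread is stalled independently with the same probability on any given cycle, and the core does useful work whenever at least one thread is ready.

```python
def utilization(stall_fraction, threads):
    """Fraction of cycles in which at least one thread is ready to issue.

    Assumption: threads stall independently, each with probability
    stall_fraction, so all of them are stalled with probability
    stall_fraction ** threads.
    """
    return 1.0 - stall_fraction ** threads


def speedup(stall_fraction, threads):
    """Throughput relative to a single-threaded core with the same stalls."""
    return utilization(stall_fraction, threads) / utilization(stall_fraction, 1)


# A core stalled 50 percent of the time, as in Petersen's example.
for n in (1, 2, 4):
    print(n, round(speedup(0.5, n), 3))
# prints:
# 1 1.0
# 2 1.5
# 4 1.875
```

In this toy model the gain approaches, but never exceeds, 1 / (1 - stall fraction), which is 2x for a core stalled half the time. That is consistent with the claim that twice the performance or better is possible only on workloads that stall even more often, and it says nothing about the extra power Bouvier notes the added activity will burn.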