Processor boards: Architecture drives performance
By Rodger H. Hosking, Vice President, Pentek Inc., Upper Saddle River, N.J., EE Times
November 15, 2001 (3:36 p.m. EST)
URL: http://www.eetimes.com/story/OEG20011115S0055
Striving to deliver solutions for maximum overall system performance, embedded board vendors are now offering multiprocessor products featuring the most recent generation of DSP and RISC processors. Because of stunning computational speeds and I/O data transfer rates, it becomes more difficult to provide adequate memory, interprocessor communication and I/O data channel bandwidth as the number of processors on a board increases. Cutting back on such vital board resources can impose bottlenecks that leave these new processors starved for data and program code, thereby crippling computational speeds. By relying on processor benchmarks alone instead of taking board architecture into account, system designers can experience a significant and unexpected shortfall in the overall performance of the end application.

Contemporary DSPs such as the Texas Instruments TMS320C6415, the Motorola MPC7410 AltiVec G4 PowerPC and the Analog Devices ADSP-TS101S TigerSHARC all offer buses capable of peak I/O transfer rates of more than 1 Gbyte/second. While this rate may appear more than adequate, these new devices support multiple parallel operations during each processor clock cycle by virtue of enhancements to their on-chip processing architectures: specifically, the VLIW engine on the C6000, the AltiVec vector coprocessor on the 7410 and the superscalar engine on the TS101S. With peak processing rates of 4.8 billion fixed-point operations per second for 32-bit integers on the C6415, and 1 or 2 billion 32-bit IEEE floating-point operations per second on the TS101S and 7410, respectively, these devices can easily challenge the I/O capabilities of any board design.

Since the cycle time for external devices is restricted by board layout, memory speed and the electrical characteristics of I/O interface drivers, other strategies have been used to improve data transfers to peripherals. These include moving to wider buses of 64 bits or more, and using multiple buses, like the three found on the C6415. Further, since code fetches to external memory can seriously impact the availability of the external bus for I/O transfers, most of these new processors incorporate L1 and/or L2 cache memory within the chip, and make these resources as large as possible. The 7410 employs a separate 64-bit bus to support an external 2-Mbyte L2 cache, while the C6415 embeds its 1-Mbyte L2 cache right on the chip.

These optimized compute engines with their enhanced peripheral interfaces can demonstrate some incredible benchmarks when code and data are sitting in just the right place. However, unless they are adequately coupled to the real-time environment at the board and system level, the actual performance of these new processors can quickly be sacrificed to data-path bottlenecks to and from external memory and peripheral devices. To make matters worse, many of these new devices were developed for heavily embedded, single-processor applications such as workstations, set-top boxes and dedicated telecom switches. The hooks for interprocessor communication and multiprocessing, common requirements for open-architecture boards, are minimal and must be incorporated as part of the board design.

Open-architecture commercial off-the-shelf (Cots) solutions have been widely accepted by systems integrators as a strategy for shrinking time-to-market. Equally important are the significant savings in cost, easy access to new technology, improved maintainability and multivendor support.
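To put the mismatch between compute rates and external bus bandwidth in concrete terms, the short C program below compares the C6415's quoted peak operation rate against the roughly 1-Gbyte/s external bus figure cited above. The operand-traffic assumption (one fresh 4-byte operand fetched off-chip per operation) is purely illustrative and not a characterization of any particular device or workload.

    /*
     * Back-of-the-envelope bandwidth budget for a single DSP node.
     * Peak figures are those quoted in the article; the operand-traffic
     * assumption below is illustrative only.
     */
    #include <stdio.h>

    int main(void)
    {
        const double peak_ops      = 4.8e9;   /* C6415 peak 32-bit fixed-point ops/s */
        const double bus_bandwidth = 1.0e9;   /* ~1 Gbyte/s peak external bus rate   */

        /* Assume each operation needs, on average, one fresh 4-byte operand
           from off-chip memory -- a deliberately modest assumption. */
        const double bytes_per_op  = 4.0;
        const double demand        = peak_ops * bytes_per_op;      /* bytes/s required    */
        const double sustained_ops = bus_bandwidth / bytes_per_op; /* ops/s the bus feeds */

        printf("Off-chip demand at peak rate : %.1f Gbytes/s\n", demand / 1e9);
        printf("Bus can sustain              : %.2f Gops/s (%.0f%% of peak)\n",
               sustained_ops / 1e9, 100.0 * sustained_ops / peak_ops);
        return 0;
    }

Even under this modest assumption, the external bus sustains only a few percent of the device's peak rate, which is why on-chip caches, local memory and independent data paths matter so much at the board level.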
These solutions traditionally use one of the standard bus architectures such as VMEbus, PCI or CompactPCI. However, in trying to provide the best solution for the widest range of applications, Cots board vendors face conflicting trade-offs in features and costs when defining new board architectures. For these reasons, systems designers must choose carefully when selecting a Cots board for their application.

Equally important is the enormous difficulty of writing software that optimizes program execution to mesh with the available time slots for bus access across all processors in a multiprocessor configuration. When poorly coordinated, the processors may sit idle, waiting for bus access and wasting precious processing capacity. In high-performance, real-time systems such as a network processor, the I/O streams are often quite unpredictable, resulting in unacceptably wide variations in performance at the system level.

The first obvious improvement offered by many board vendors is the addition of local memory to each processor node. Apart from the local cache memory required by virtually all processors like the 7410, additional local memory is often required. This memory is usually much larger than the cache and is often implemented as cost-effective SDRAM. In this arrangement, larger blocks of code and data can be accessed during program execution without arbitration for the global shared memory. Although this improves operation, it still leaves the processor nodes contending for I/O over the remaining shared resources.

Depending on the application, communication paths for efficiently moving large blocks of data between processors can significantly boost performance, especially for pipelined processing applications. There are three popular strategies for tackling this requirement.

The first is a dedicated interprocessor link directly connecting two processors and permitting high-speed data transfers completely independent of traffic on the global bus. Some processor chips include built-in interprocessor links, like the TigerSHARC with its four 8-bit link ports, each supporting bidirectional data transfers at up to 180 Mbytes/s. Unfortunately, the C6000 and PowerPC processors have no such internal links to support multiprocessing and must rely instead upon board-level hardware connected to one of the external buses. One popular implementation of these board-level paths links a pair of processors with bidirectional FIFOs, allowing either processor to read or write data at its FIFO port without having to wait for the other processor to be free.

A second approach for interprocessor communication is an on-board crossbar switch or switch fabric joining the processors. Unlike dedicated links with their point-to-point connectivity, a switched fabric can reallocate signal paths as required to meet changing I/O requirements. An excellent example of such a board-level switch fabric is RACEway.

A third strategy involves adding an auxiliary bus dedicated to moving data between processors while independent transfers continue on the global bus. A new PowerPC node controller from Galileo Technology provides dual 64-bit, 66-MHz PCI buses, one for global transfers and the second for interprocessor transfers, each capable of peak rates of 528 Mbytes/s.

For data transfers between boards in a system, the backplane is usually the first choice. However, as interboard communication rates increase, auxiliary backplane buses and switched backplane fabrics can serve as extremely effective alternatives.
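The bidirectional-FIFO link described above can be pictured as a pair of memory-mapped data ports with full/empty status flags. The following C sketch shows one processor's side of such a link; the register addresses, flag bits and function names are hypothetical, since every board defines its own memory map.

    /*
     * Sketch of one side of a bidirectional-FIFO interprocessor link.
     * Addresses and bit assignments are hypothetical placeholders.
     */
    #include <stdint.h>
    #include <stddef.h>

    #define FIFO_TX_DATA  (*(volatile uint32_t *)0xA0000000u) /* outbound data port */
    #define FIFO_RX_DATA  (*(volatile uint32_t *)0xA0000004u) /* inbound data port  */
    #define FIFO_STATUS   (*(volatile uint32_t *)0xA0000008u) /* status flags       */
    #define TX_FULL       0x1u
    #define RX_EMPTY      0x2u

    /* Push as many words as the FIFO will currently accept; returns the count
       actually written, so the caller never stalls waiting on the far side. */
    size_t fifo_send(const uint32_t *buf, size_t nwords)
    {
        size_t sent = 0;
        while (sent < nwords && !(FIFO_STATUS & TX_FULL))
            FIFO_TX_DATA = buf[sent++];
        return sent;
    }

    /* Drain whatever the other processor has already queued up. */
    size_t fifo_recv(uint32_t *buf, size_t maxwords)
    {
        size_t got = 0;
        while (got < maxwords && !(FIFO_STATUS & RX_EMPTY))
            buf[got++] = FIFO_RX_DATA;
        return got;
    }

Because each routine moves only as much data as the FIFO can currently accept or supply, neither processor ever blocks on the other, which is precisely the decoupling the FIFO bridge is meant to provide.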
A strategy for improving the connectivity between processor nodes and backplane interfaces uses direct links to these resources. Like the interprocessor links discussed above, some of these links are built into the processors. Examples include C40 comm ports, SHARC link ports, RACEway ports, PCI ports and others. Emerging standards such as InfiniBand and RapidIO will soon appear as standard interfaces on next-generation processors. If processors lack these ports, then board-level designs can implement the interfaces using one of the external processor buses. As with the interprocessor links, using bidirectional FIFOs as the bridging device nicely decouples the processors from the backplane resources, so that either side can move blocks of data at any time, regardless of activity on the other side of the FIFO. For applications requiring low-latency data transfers between boards, these direct backplane links avoid the unpredictable arbitration delays for the global bus that are typical of simpler architectures.

For decades, the mezzanine board has proven itself one of the most popular strategies for connecting processors to a diverse range of peripherals and data channels. With dozens of standards available and literally thousands of compatible products, mezzanine boards allow system designers to tailor the I/O capabilities of the processor board to the specific needs of the application, simply by choosing the right combination of products. By drastically reducing the need for custom board designs, the mezzanine board is a vital cornerstone of virtually all Cots-based systems.

For both VMEbus and CompactPCI boards, one of the most popular mezzanines is the PMC, or PCI Mezzanine Card, which uses the PCI bus specification as the electrical connection to the processor board. For modules as well as processor boards, two important PMC design variables are the width of the PCI bus (32 or 64 bits) and the maximum bus clock speed (33 or 66 MHz). In many cases a PMC processor board will provide the wider or faster PCI bus but the PMC module will not support one or both of these features, requiring the combination to run in the lowest common configuration.

Each 6U board can have either one or two PMC sites. However, when two sites are provided, they often share a common PCI bus. In this case, only one of the four processors on a quad-processor board can perform mezzanine I/O at a time. Even when two PCI buses are provided on a board with two PMC sites, two processors must still share each module. This not only reduces the effective mezzanine I/O transfer rate but, since one processor may need to wait for another to finish, also leads to inefficient program execution.

Regardless of how many sites and buses support the PMC modules, a major architectural issue for PMC processor boards is whether or not data transfers to PMC modules use the shared global bus. If so, PMC I/O activity must be scheduled around other transfers on the board, once again leading to possible downtime for processors waiting for access. This limitation is common to many board offerings.

The Velocity Interface Mezzanine (VIM) architecture solves all of these problems by providing one mezzanine connection per processor. All transfers are totally independent of global bus activity and of the mezzanine I/O and processing tasks running on the other three processors.
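As a rough illustration of the "lowest common configuration" rule for PMC pairings described above, the C sketch below computes peak PCI throughput from the bus width and clock rate that both the carrier board and the module support. The function names are illustrative; the arithmetic simply follows the PCI width and clock figures given in the article.

    /*
     * The PCI link between a PMC site and a module runs at the lowest
     * configuration both sides support: the smaller of the bus widths
     * (32/64 bits) and the smaller of the clock rates (33/66 MHz).
     * Peak bandwidth = (width / 8) bytes per transfer x clock rate.
     */
    #include <stdio.h>

    static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

    /* Peak PCI throughput in Mbytes/s for a carrier/module pairing. */
    static unsigned pmc_peak_mbytes(unsigned carrier_bits, unsigned carrier_mhz,
                                    unsigned module_bits,  unsigned module_mhz)
    {
        unsigned bits = min_u(carrier_bits, module_bits);
        unsigned mhz  = min_u(carrier_mhz,  module_mhz);
        return (bits / 8) * mhz;
    }

    int main(void)
    {
        /* A 64-bit/66-MHz carrier paired with a 32-bit/33-MHz module drops
           to 32 bits at 33 MHz, i.e. 132 Mbytes/s, versus the 528 Mbytes/s
           available when both sides support 64 bits at 66 MHz. */
        printf("Best case : %u Mbytes/s\n", pmc_peak_mbytes(64, 66, 64, 66));
        printf("Mixed case: %u Mbytes/s\n", pmc_peak_mbytes(64, 66, 32, 33));
        return 0;
    }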
Although not yet as widely accepted as PMC, with speeds up to 400 Mbytes/s per processor, VIM offers an excellent mezzanine solution for high-performance applications.

Evaluating a high-performance, multiprocessor Cots board for any given application requires a careful analysis of several key architectural elements. System designers first need to assess all of the data-transfer requirements and then clearly identify the structures on the various candidate boards that support those operations. Critical issues include local and global memory access, interprocessor communication, backplane interboard communication and peripheral support through mezzanine modules. Only after mapping out strategies to solve these data-movement demands at the outset can system designers take maximum advantage of the performance of these exciting new processors.