An increasing amount of information (data, voice and video) is being transferred across Internet-based networks to a wide variety of embedded systems. But Internet protocols require data to be transferred in packets, which incur hefty processing-overhead costs. As the data-transfer rates required for advanced system performance increase, these Internet-related processing tasks are becoming a significant system bottleneck. Fortunately, a few relatively simple optimization methods can significantly boost system performance. For example, designers can cost-effectively improve performance thanks to the continuing evolution of field-programmable gate arrays (FPGAs). Specifically, implementing a "soft" CPU core in an FPGA offers several options that help the designer optimize an embedded system for Internet data transfer beyond what can be accomplished in software alone, and without potentially costly microprocessor upgrades.

Conventional approaches to opening this bottleneck generally come down to raising system clock speeds and turning to more powerful processors. Unfortunately, these approaches are usually costly and are limited by the availability of the high-speed memories and other devices required to support the more powerful processors.

An emerging, more cost-effective path to better system performance (and flexibility) solves the same problem by implementing repetitive software routines in hardware and by using hardware accelerators, direct memory access (DMA) controllers, FPGAs and FPGA-based soft-core processors. Indeed, better FPGA-based soft-core embedded processors have created new opportunities to optimize embedded system performance. In contrast to the conventional approaches, programmable logic gives the system designer additional options, including changing the system architecture, adding simultaneous multimastering capabilities or coding critical sections in hardware as custom instructions.
Using a soft-core processor, a designer can, for example, combine custom logic with a microprocessor on one chip.

Tool support

In recent years, soft-core processors have moved into the mainstream, supported by sophisticated hardware and software development tools. These tools have made it possible to create multiprocessor systems, integrate standardizable and user-designed peripherals, generate a customized bus architecture and implement specific I/O requirements. In addition, since the foundation of such a system is fully reprogrammable, all these characteristics can be changed well after system design begins, accommodating ever-changing requirements for performance and logic utilization.

Let's examine the performance possible with a Web server system using a 32-bit embedded RISC processor core running at a system clock speed of 33 MHz. A simple set of peripherals, including a 10Base-T Ethernet media access controller (MAC) and a single 256-kbit SRAM bank, is implemented in the FPGA. The Web server was designed to read a timer peripheral before, during and after servicing an HTTP request to log throughput calculations, which are then displayed on a dynamically generated Web page. A very simple read-only file system was implemented using flash memory to store static Web pages and images.

Once the Web server is running, a host PC is used to request several JPEG images of varying sizes stored in the file system. For each request, a transmission throughput calculation is performed. Transmission throughput is derived from the time that elapses between the moment the server starts to send the first TCP packet containing the HTTP response and the moment the file is completely sent. During each transfer, several snapshots of the timer peripheral are taken. After each transfer is completed, the throughput is automatically calculated and displayed on a Web page. This process is repeated for each enhancement made to the system.
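The throughput measurement described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code used in the test system: it assumes the timer is a free-running 32-bit up-counter clocked at the 33 MHz system clock, and the names TIMER_HZ, elapsed_ticks and throughput_kbps are hypothetical.

```c
#include <stdint.h>

/* Assumption: the timer ticks at the 33 MHz system clock. */
#define TIMER_HZ 33000000u

/* Ticks between two snapshots of a free-running 32-bit up-counter
   (an assumption about the timer). Unsigned subtraction remains
   correct across a single counter wraparound. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Throughput in kilobits per second for `bytes` sent in `ticks`
   timer ticks. Returns 0 if no time elapsed, to avoid dividing by zero. */
uint32_t throughput_kbps(uint32_t bytes, uint32_t ticks)
{
    if (ticks == 0) {
        return 0;
    }
    /* bits/s = bytes * 8 * TIMER_HZ / ticks; 64-bit math avoids overflow. */
    uint64_t bits_times_hz = (uint64_t)bytes * 8u * TIMER_HZ;
    return (uint32_t)(bits_times_hz / ((uint64_t)ticks * 1000u));
}
```

Taking one snapshot when the first TCP packet of the response is queued and another when the last byte of the file has been sent, then passing the file size and the tick difference to throughput_kbps, gives the figure the server would report.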
When determining the best candidate functions for system optimization, it is necessary to look at the entire process for serving a Web page. In this example, the CPU's data master port is used to read data memory and write to the Ethernet MAC. This occurs for each packet transmitted. With this process in mind, it is possible to explore several approaches for design performance optimization.

DMA control

Adding a DMA controller to our example system frees the CPU of routine data transfer tasks between peripherals. Since a DMA controller is an available option with the CPU used in this example, adding DMA offers the potential for an immediate performance boost with minimal engineering effort.

In addition to reducing the latency required to shuttle data about in the system, adding DMA to an embedded system in an FPGA makes it possible to fine-tune the bus architecture. This is a marked difference from conventional approaches using discrete components, where the printed-circuit board layout is determined up front with little room for any sort of bus architecture change. Usually, the only way to boost system performance toward the end of a design cycle is by fine-tuning software or increasing the clock speed. Using a soft-core processor and an FPGA offers greater flexibility because bus-arbitration schemes for sharing memory between DMA and the CPU are configurable. This increased flexibility is a result of implementing bus interconnects inside the FPGA, whose routing resources can accommodate data flow between logic elements and memory arrays throughout the device.

After adding the DMA controller to the system, software is modified in key areas where the CPU previously transferred data. These include the Ethernet MAC low-level driver and the TCP/IP stack software. Using the DMA to transfer packets between the Ethernet MAC and data memory frees the CPU from waiting in a "while loop" until all the packet data is sent or received.
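The difference between the CPU-driven copy and the DMA offload can be illustrated with a small host-side C sketch. Everything here is hypothetical: the dma_regs_t layout, the function names and the dma_sim_step routine (which stands in for the hardware so the example can run on a PC) are assumptions made for illustration, not the register map or driver of any real DMA controller.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical DMA register block; field names are illustrative only. */
typedef struct {
    const uint8_t *src;  /* read address */
    uint8_t *dst;        /* write address */
    size_t len;          /* bytes remaining */
    int busy;            /* nonzero while a transfer is in flight */
} dma_regs_t;

/* Baseline: the CPU moves every byte itself and is stuck in the loop. */
void cpu_copy_packet(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];
    }
}

/* Offload: the CPU only programs the transfer and returns immediately. */
void dma_start(dma_regs_t *dma, uint8_t *dst, const uint8_t *src, size_t len)
{
    dma->src = src;
    dma->dst = dst;
    dma->len = len;
    dma->busy = (len != 0);
}

/* Host-side stand-in for the hardware: moves up to `burst` bytes per call.
   On a real system this work happens in parallel with the CPU. */
void dma_sim_step(dma_regs_t *dma, size_t burst)
{
    if (!dma->busy) {
        return;
    }
    size_t n = dma->len < burst ? dma->len : burst;
    memcpy(dma->dst, dma->src, n);
    dma->src += n;
    dma->dst += n;
    dma->len -= n;
    if (dma->len == 0) {
        dma->busy = 0;
    }
}

/* Completion check; a real driver would more likely use an interrupt. */
int dma_done(const dma_regs_t *dma)
{
    return !dma->busy;
}
```

In the baseline, the caller cannot do anything else until cpu_copy_packet returns. With the offload, the caller invokes dma_start and is then free to run other code, checking dma_done (or waiting for a completion interrupt) before reusing the buffer.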
In addition, an analysis of the TCP/IP stack source code reveals two other loops for copying payload data for a packet under assembly. This task can also be offloaded from the CPU to the DMA controller, freeing the CPU for other tasks. Thus, adding the DMA controller and modifying the software library can effectively double the transmission throughput.

CPU latency can also be reduced by transferring the task of TCP checksum calculation, which must be done for every packet transmitted, from the CPU to a custom-built hardware peripheral. This is achieved with a checksum peripheral that is designed to read the payload contents directly out of data memory, perform the checksum calculation and store the result in a CPU-addressable register. A few pages of Verilog code are needed to create a new peripheral that performs checksum additions of 32 bits of data on each clock cycle.

The resulting checksum computation takes one-ninetieth of the cycles needed by a software approach. The payoff is a 40 percent improvement in transmission throughput. In addition, the CPU is freed to execute other tasks while the checksums are calculated, thus clearing the way for some amount of basic parallel processing.

Jesse Kempa (jkempa@altera.com) is an embedded-systems applications engineer at Altera Corp. (San Jose, Calif.).

See related chart
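As an illustration of the checksum offload discussed in the article, the C sketch below models the arithmetic in software. It is an assumption-laden sketch, not the article's Verilog peripheral or its stack code: checksum16 is a conventional word-at-a-time ones'-complement Internet checksum, while checksum32 accumulates 32 bits per step, the way a peripheral adding 32 bits per clock cycle might, and folds the result to 16 bits at the end. The function names are hypothetical, and a full TCP checksum would also cover the pseudo-header and TCP header, which are omitted here.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a wide accumulator down to 16 bits with end-around carry. */
static uint16_t fold16(uint64_t sum)
{
    while (sum >> 16) {
        sum = (sum & 0xFFFFu) + (sum >> 16);
    }
    return (uint16_t)sum;
}

/* Software reference: one 16-bit big-endian word per iteration. */
uint16_t checksum16(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;
    size_t i = 0;
    for (; i + 1 < len; i += 2) {
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    }
    if (i < len) {
        sum += (uint32_t)data[i] << 8;  /* odd trailing byte, zero padded */
    }
    return (uint16_t)~fold16(sum);
}

/* Model of a 32-bit-wide datapath: one 32-bit word per step. */
uint16_t checksum32(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;
    size_t i = 0;
    for (; i + 3 < len; i += 4) {
        sum += ((uint32_t)data[i] << 24) | ((uint32_t)data[i + 1] << 16) |
               ((uint32_t)data[i + 2] << 8) | data[i + 3];
    }
    if (i < len) {
        /* 1 to 3 trailing bytes, zero padded on the right to 32 bits. */
        uint32_t tail = 0;
        for (size_t k = 0; k < 4; k++) {
            tail <<= 8;
            if (i + k < len) {
                tail |= data[i + k];
            }
        }
        sum += tail;
    }
    return (uint16_t)~fold16(sum);
}
```

Both functions return the same value for the same buffer, because adding 32-bit words and folding the carries back in is equivalent to adding the two 16-bit halves separately in ones'-complement arithmetic. That property is what lets a wider hardware datapath process the payload in fewer steps without changing the result.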