The microprocessor’s impact on electronics and our society is greater than that of any other invention since the integrated circuit. What makes the CPU so powerful is its programmability from simple sets of instructions, which delivers the capabilities of silicon to a broad range of applications. Thanks to the scalability of silicon technologies, microprocessors have grown steadily in performance for the past 30 years. But what if a processor does not have sufficient performance to power an application? Adding more CPUs may not fit power or cost budgets, and more than likely still won’t provide the needed acceleration. Custom hardware accelerators such as co-processors add performance at lower power, but they are not easily programmable: hardware design requires special expertise, months of design time, and costly NRE development charges.

Enter ESL design tools

Today, with the combination of programmable logic and electronic system-level (ESL) design tools, certain algorithms can exploit the parallelism of hardware to run faster at lower clock speeds, using less power than is possible with serial processor operations. PLDs provide a reconfigurable option for hardware acceleration, and the addition of ESL design flows gives developers access to these devices through a software design flow. The ready availability of ESL design tools provides the key to opening up more applications to algorithm acceleration using programmable hardware.

Software-based design flows are now able to program FPGA devices directly as custom co-processors (see Figure 1). As with microprocessors, hardware algorithms on FPGAs can be programmed in high-level languages such as C, and re-programmed as many times as necessary to optimize the algorithm and the system implementation. The software design flow bypasses the lengthy traditional hardware design flow and the potential errors introduced by manual translation to RTL. Starting from software also helps developers create optimal solutions, since system-partitioning decisions remain fluid. With the ability to program both hardware and CPUs using common high-level languages, application developers have a new capability for making their systems better, faster, and cheaper.

Figure 1. Software Design Flows Directly Program FPGAs

Many existing applications can easily make use of PLDs to augment their CPU systems for acceleration, and the ability to easily use custom hardware will also spur the development of many new types of products. But which algorithms are best suited to hardware acceleration? What is the process for partitioning a system? What new skills are necessary to program the hardware co-processors? How much acceleration can you expect from adding an FPGA? This article answers these questions by discussing the best practices of system designers who use custom FPGA hardware to augment processors for algorithm acceleration.

Will hardware acceleration work for all algorithms?

When a general-purpose CPU or DSP alone cannot deliver the required performance, there are several options available (see Figure 2). Not all algorithms, however, lend themselves well to all types of acceleration. The first step in the process is determining whether the application will benefit from the parallelism of hardware.

Figure 2. System Acceleration Possibilities

Amdahl’s law, developed by computer architect Gene Amdahl in 1967, puts forward a pessimistic view of the acceleration possible from parallelism. Applied to systems using multiple CPUs, Amdahl’s law states that the achievable acceleration is limited by the proportion of total processing time taken up by the algorithm being accelerated. If the accelerated algorithm accounts for a fraction p of the total processing time and is spread across N processors, the potential speedup is 1/[(1-p) + p/N]. For example, if 20% of an algorithm is spread across 4 processors, the total processing time is at best reduced to 85% of the original. Given the additional cost and power required by 3 extra processors, a mere 1.2x speedup may not provide a big enough return.
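To make the arithmetic concrete, the short C sketch below evaluates Amdahl’s formula for the example above and for a couple of larger multipliers; the function name and the example values are illustrative only.

#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - p) + p / n)
   p = fraction of total processing time being accelerated
   n = speedup applied to that fraction (number of processors,
       or the multiplier of a hardware accelerator) */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* The example above: 20% of the algorithm spread across 4 processors */
    printf("4 CPUs, p=0.20:     %.2fx\n", amdahl_speedup(0.20, 4.0));
    /* A hardware accelerator with a much larger multiplier on the same 20% */
    printf("100x accel, p=0.20: %.2fx\n", amdahl_speedup(0.20, 100.0));
    /* The same accelerator applied to 95% of the runtime */
    printf("100x accel, p=0.95: %.2fx\n", amdahl_speedup(0.95, 100.0));
    return 0;
}

With p = 0.20 and N = 4 this prints roughly 1.18x, the 85% figure quoted above; the other cases hint at why both a large hardware multiplier and a large accelerated fraction are needed for dramatic gains.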
Using custom hardware accelerators, the N in Amdahl’s equation becomes the speedup multiplier for the accelerated algorithm. In hardware this multiplier can be orders of magnitude higher, while the power consumption and cost of FPGA accelerators are generally much lower. Even so, Amdahl’s law would seem to limit the total acceleration theoretically possible. Fortunately, in practice Amdahl’s law has proven too pessimistic, and some suggest that the proportion of work handled by the parallel hardware grows as more processing power is added, making the serial portion smaller by comparison. For certain systems, total accelerations of one, two, or even three orders of magnitude are possible.

What should go in hardware?

To accelerate a software application using PLD hardware, the developer must partition the design between the tasks that will remain on the CPU and those that will run in hardware. The first step in this process is to profile the performance of the algorithm running entirely on the CPU. Using standard software profiling tools, developers gain insight into where the application spends most of its runtime; the profiler report points to the functions, or even the lines of code, that are the performance bottlenecks. Free profiling tools such as gprof for the GNU development environment or Compuware’s DevPartner Profiler for Visual Studio are readily available (see the command sketch at the end of this section).

The developer analyzes the application code around these hot spots to determine opportunities for acceleration, and looks for sub-processes that can benefit from the parallelism of hardware. To accelerate the system, the developer moves portions of the implementation to hardware connected to the CPU, such as an FPGA on a PCI interface board. Embedded processors in an FPGA provide another possible approach, placing the reconfigurable fabric and the processor within the same device.

It is important during this analysis not to overlook the issues of data transfer and memory bandwidth. Without careful consideration, data transfer overhead eats into the gains made in processing time. Developers may need to expand the scope of the custom hardware implementation to minimize the amount of data transferred to or from the PLD. Processes that expand data going into the accelerated block, such as padding or windowing, and processes that contract data coming out of it, such as down-sampling, should all be included in the PLD design to reduce communication overhead.
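For the profiling step itself, a typical session with gprof on a GNU toolchain looks roughly like the following; the program and file names are hypothetical.

    gcc -O2 -pg -o myapp myapp.c        (compile with profiling instrumentation)
    ./myapp input.dat                   (run normally; execution writes gmon.out)
    gprof myapp gmon.out > profile.txt  (produce the flat profile and call graph)

A flat profile in which one function dominates the runtime, as back projection does in the example later in this article, marks that function as the prime candidate for hardware acceleration.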
Where is parallelism in the algorithm?

Hardware accelerates systems by taking large serial software functions and breaking them into many smaller processes that execute in parallel. The result is a highly optimized and very specific set of hardware functions. ESL tools are an important part of this process, allowing developers to express that parallelism from software descriptions and use PLDs as custom co-processors. Parallelism can be exploited at many levels when moving functionality from software into custom hardware.

Fine-grained parallelism is achieved by executing many independent simple statements simultaneously (e.g., variable assignments, counter increments, and basic arithmetic). Pipelining sequentially dependent statements is another form of fine-grained parallelism, achieved by arranging the individual stages in parallel so that each stage feeds the input of the next stage downstream. For example, by pipelining the inside of a loop, including the loop control logic, it may be possible to reduce each iteration to a single clock cycle. After the pipeline is primed with data, the loop executes in a number of clock cycles equal to the number of loop iterations, plus a relatively modest latency. This type of optimization often provides the greatest benefit when accelerating an algorithm in hardware.

Coarse-grained parallelism can also provide acceleration. Larger independent sub-processes, such as functions or large looped blocks that run sequentially on the processor, can run simultaneously in custom hardware. Similarly, unrolling loops either fully or partially adds parallelism: the processing logic inside the loop is replicated so that each instance works in parallel on different data, reducing the number of loop iterations. Pipelining of large sequentially dependent processes can also remove further cycles at the cost of some latency.

Another important form of coarse-grained parallelism occurs when software and hardware operate in tandem. For example, a system accelerated by transferring frames of data to custom hardware for co-processing might be structured so that while the hardware processes one frame, the software performs any requisite post-processing of the previous frame and/or pre-processing of the next frame. This kind of coordination is made practical by the common development environment that ESL provides for both hardware and software.

Parallelism can also be used to optimize memory bandwidth in a custom hardware accelerator. For example, the addition of two vectors may be written as a loop that reads an element of each vector from memory, adds the operands, and writes the resulting element of the sum vector back to memory. Each iteration of this loop requires three accesses to memory, and each access takes at least one cycle. Fortunately, modern PLDs incorporate embedded memory features that help. By storing each of the three vectors in a separate embedded memory structure, the developer can fully pipeline the summing loop so that each iteration takes only one cycle. This simple optimization is one example of how developers maximize system acceleration in hardware.
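The vector-sum example reads naturally as the following C loop; the names and the length are illustrative. Run on a processor, every iteration makes two reads and one write to the same memory; in a PLD, placing each array in its own embedded memory removes that bottleneck so a pipelined version of the loop can retire one element per clock.

#include <stdint.h>

#define N 512  /* illustrative vector length */

/* Serial form of the vector sum. In hardware, placing a[], b[] and c[]
   in three separate embedded memories removes the memory-port conflict,
   allowing the loop body to be pipelined to one iteration per cycle. */
void vector_sum(const int16_t a[N], const int16_t b[N], int16_t c[N])
{
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];  /* two reads, one add, one write per iteration */
    }
}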
How do accelerators handle floating point?

The use of floating-point mathematics is often the most important issue to resolve when creating custom hardware to accelerate a software application. Many software applications make liberal use of the high-performance floating-point capabilities of modern CPUs, whether or not the core algorithms require it. Although floating-point arithmetic can be implemented in PLD hardware, it tends to require a lot of resources. Generally, it is best either to leave floating-point operations in the CPU portion of the design or to convert them to fixed-point. Fortunately, many algorithms can be implemented effectively using fixed-point mathematics, and pre-built floating-point modules exist for algorithms that must implement floating point in hardware.

A detailed description of the conversion from floating-point to fixed-point operations can be highly algorithm-dependent and is beyond the scope of this article. Briefly, however, the process begins by analyzing the dynamic range of the data going into the algorithm and determining the minimum bit width that can express that range in integer form. Given the width of the input data, the developer can trace through the operations involved to determine how the data grows. For example, to sum the squares of two 8-bit numbers, the minimum width required to express the result without loss of information is 17 bits: the square of each input requires 16 bits, and the sum contributes one more bit. Knowing the desired precision of the output, the developer can also work backwards through the algorithm to deduce the internal bit widths. Well-designed fixed-point algorithms can be implemented efficiently in PLDs because the widths of internal data paths can be tailored exactly. Once the widths of the inputs, internal data paths, and outputs are known, converting the data to fixed-point is straightforward, leading to efficient implementation on both the hardware and software sides.

Occasionally, an application requires more complex mathematical functions (e.g., sin(), cos(), sqrt()) on the hardware side of the partition. These functions may operate on a discrete set of operands, or may be invoked inside loops with an argument that depends on the loop index. In such cases, the function can be implemented as a modestly sized lookup table held in embedded memory. Functions that take arbitrary inputs can use hardware lookup tables with interpolation, and if more precision is required, iterative or convergent techniques can be used at the cost of additional cycles. In all of these cases, the software compilation tools should make good use of the PLD’s existing hardware resources.
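As a rough sketch of both ideas working together, fixed-point data and a table-based function, the C fragment below builds a 256-entry sine table in Q1.15 format; the table size, names, and number format are assumptions made for illustration. In a PLD the table itself would map onto an embedded memory block, and interpolation between adjacent entries would recover extra precision.

#include <stdint.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Illustrative fixed-point sine: a 256-entry table covering one full period,
   with values stored as signed Q1.15 integers. The table is filled once in
   software using floating-point sin(); in hardware it becomes a small ROM. */
#define SIN_LUT_SIZE 256

static int16_t sin_lut[SIN_LUT_SIZE];

void sin_lut_init(void)
{
    for (int i = 0; i < SIN_LUT_SIZE; i++) {
        double angle = (2.0 * M_PI * i) / SIN_LUT_SIZE;
        sin_lut[i] = (int16_t)lround(sin(angle) * 32767.0);
    }
}

/* phase is an 8-bit index (0..255) standing in for an angle of 0..2*pi */
static inline int16_t fixed_sin(uint8_t phase)
{
    return sin_lut[phase];
}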
What skills are required to use hardware acceleration?

Hardware design from ESL tools has come a long way in the last few years. It is now possible to use C-based hardware languages to design the hardware portions of the system, so PLD hardware can be generated from the original software algorithms. Software tools can compile C descriptions directly into FPGA devices, and they commonly provide software APIs that abstract the details of the hardware/CPU connection away from application development. This allows users to treat the FPGA as a co-processor in the system and opens the door to the use of PLDs in new areas such as high-performance computing.

The languages used to compile software descriptions into PLD hardware are specifically designed to simplify system design alongside a CPU, adding the elements necessary to generate quality hardware (see Figure 3). Languages such as Handel-C and SystemC provide common code for developing the hardware and software portions of the system: Handel-C is based on ANSI C, while SystemC is a class library for C++.

Figure 3. Adding Hardware Capabilities

Any process that compiles custom hardware from software descriptions must deal with the following issues in either the language or the tool: concurrency, data types, timing, communication, and resource usage. Software is written sequentially, but efficient hardware must translate that code into parallel constructs, potentially using multiple hardware clocks and mapping the code onto the appropriate hardware resources. Both Handel-C and SystemC add simple constructs to the language that give expert users control over this process while maintaining a high level of abstraction for representing algorithmic designs. Software developers who understand the concepts of parallelism in hardware typically find these languages familiar and are able to design accelerated systems with the corresponding tools in a matter of weeks.

What acceleration can be achieved?

A few examples of the acceleration achieved are shown in Figure 4. Consider a CT reconstruction application based on filtered back projection. It takes in sets of 512 samples (a tomographic projection) and filters the data by applying an FFT, multiplying by the filter function in the frequency domain, and applying an inverse FFT. The filtered samples are then used to compute their contribution to each pixel of the 512 x 512 reconstructed image (the back projection step). This process is repeated, with pixel values accumulated from each of the 360 projections. The fully reconstructed image is displayed on screen and stored to disk. Running on a 3.0 GHz Pentium 4, the reconstruction takes approximately 3.6 seconds. For the purposes of this project, the target for the accelerated reconstruction time is 100 ms.

Figure 4. Augmenting Processor Performance

Profiling this application showed that 93% of the CPU runtime is spent in the back projection function, making it the main target for acceleration. Analyzing the back projection code showed the bottleneck clearly: for every pixel location in the final image, two filtered samples are multiplied by coefficients, summed, and accumulated into the pixel value. This function is invoked 360 times, so the inner loop executes around 73 million times, each time requiring 3 reads and 1 write to memory, 3 multiplies, and several additions.

Fortunately, the back projection algorithm lends itself well to parallel processing. Although the code was written using floating point, the integer nature of the input and output data made a fixed-point implementation perfectly reasonable. The application, running on a standard PC, was to be accelerated using a PCI add-in card carrying a Xilinx Virtex-II FPGA and six banks of 512K x 32 SRAM. A sensible first-order partitioning of the application put the back projection execution in the PLD, including the accumulation of the reconstructed image pixels, leaving filtering and all other tasks running on the host processor.

A single instance of the fixed-point back projection algorithm was designed and validated using Handel-C and the DK tool suite, and an existing API simplified data transfer from the software application to the FPGA over the PCI bus. Fine-grained parallelism was exploited and some of the arithmetic was rearranged to maximize the performance of the algorithm in hardware. This core unit was capable of processing a single projection in approximately 3 ms at a 66 MHz clock rate.
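For reference, the inner loop that was moved into hardware has roughly the following shape in fixed-point C; this is a simplified sketch, not the actual implementation, and the names, Q16 scaling, and interpolation details are assumptions. It shows where the 3 memory reads, 1 write, and 3 multiplies per pixel come from.

#include <stdint.h>

/* Simplified sketch of one projection's back projection pass. For each
   pixel, a detector position is computed from the projection angle
   (cos_a and sin_a are Q16 fixed-point constants for that angle), two
   adjacent filtered samples are linearly interpolated, and the result is
   accumulated into the image. Centering offsets and bounds checks are
   omitted for brevity. */
void back_project(const int32_t filtered[512], int32_t image[512][512],
                  int32_t cos_a, int32_t sin_a)
{
    for (int y = 0; y < 512; y++) {
        for (int x = 0; x < 512; x++) {
            int32_t pos  = x * cos_a + y * sin_a;   /* Q16 detector position */
            int32_t idx  = pos >> 16;               /* integer sample index  */
            int32_t frac = pos & 0xFFFF;            /* fractional part       */
            int32_t s0 = filtered[idx];             /* memory read 1         */
            int32_t s1 = filtered[idx + 1];         /* memory read 2         */
            int32_t contrib =
                s0 + (int32_t)(((int64_t)(s1 - s0) * frac) >> 16);
            image[y][x] += contrib;                 /* read 3 and the write  */
        }
    }
}

Pipelining this body so that each pixel contribution completes in one cycle, and replicating it across multiple cores, is what the hardware architecture described next accomplishes.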
The fact that each projection could be computed independently meant that multiple back projection cores could run concurrently, and it was determined that 18 cores would provide sufficient parallelism to reach the performance goal comfortably. The software would pass 18 sets of 512 samples to the FPGA, which stored them in 18 individual dual-port RAMs. Both ports of each RAM were interfaced to its back projection core, allowing the core to read the two samples it required in a single cycle. The cores were synchronized, and their 18 contributions were fed into a pipelined adder tree, with the final sum accumulated with the previous pixel value stored in one of the external SRAM banks. The accumulated value was written into the other SRAM bank, and the two banks were read in ping-pong fashion on subsequent passes. Within this architecture, sample sets were transferred from software to hardware as they became ready, and the reconstructed image was passed back once, when complete. The software was also modified so that while one set of data was being processed in the FPGA, the next set was being filtered by the host processor, further reducing the total processing time.

In the end, leveraging the features of PLDs and modern ESL tools, and taking advantage of the flexibility provided by a software-based design flow, the design achieved a 45x acceleration of a real-world application.

References:

Gene Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” AFIPS Conference Proceedings, Vol. 30, pp. 483-485, 1967.

Benner, R.E., Gustafson, J.L., and Montry, G.R., “Development and Analysis of Scientific Application Programs on a 1024-Processor Hypercube,” SAND 88-0317, Sandia National Laboratories, Feb. 1988.

About the Authors

Jeff Jussel is VP of Marketing and Americas General Manager at Celoxica. Jeff can be reached at jeff.jussel@celoxica.com.

James Hrica is a System Architect at Celoxica.