Accelerating algorithms in hardware
When you're trying to get the best performance out of your algorithm and you're out of software tricks, try acceleration through hardware/software repartitioning. FPGAs provide everything you need to speed up your algorithms. Lara Simsic shows you how.

When you're trying to get the best performance out of your code, what do you do? Optimize your algorithms, use look-up tables instead of computation, turn everything into native-word sizes, use register variables, unroll loops, maybe even code in assembly. If none of this works, you can move to a faster processor, use a different processor architecture, or split the code across two processors to run in parallel. Yikes!

But what if there were a way to take the time-critical sections of your code and turn them into a function call that runs five to 100 times faster? And what if this method were a standard tool available for software development? This dream is reality today, using programmable logic as a fabric for hardware acceleration.

The low-cost programmable logic that's making its way into more embedded systems gives systems designers another option for achieving higher performance without the drastic step of changing processors or architectures. With programmable logic, you can convert compute-intensive functions into hardware-accelerated functions. From the software's perspective, it's simply a function call into a custom block of hardware that can run many times faster than the same code optimized in tight assembly language or converted into look-up tables.

In my article on Verilog for C programmers, I explained the basics of writing Verilog HDL code using a pulse-width modulator as an example. I'll build on that foundation in this article and show how to increase system performance by turning key software algorithms into hardware.

Hardware acceleration

The improvement in execution time can reach 100 times (10,000%), depending on the algorithm. Hardware is much faster at executing operations such as complex mathematical functions, transferring data from one place to another, and performing the same operation many times over. Later in this article we'll look at some operations commonly done in software that show large performance gains once hardware-accelerated.

If you use field-programmable gate arrays (FPGAs) in your system, you can add custom hardware at any time during the design cycle. You can start writing software right away and run it on portions of the hardware before the hardware is finalized. Furthermore, you can take an incremental approach in deciding which parts of the code to implement in hardware versus software. Tools from FPGA vendors make switching between hardware and software a fairly seamless process: they generate the HDL code for the bus logic and interrupt logic, and they customize the software libraries and include files to match the system configuration.

RISC with a little CISC

Wouldn't it be great if you could combine the simplicity and speed of RISC with the processing power of CISC instructions tailored to your specific application? In a nutshell, that's what hardware acceleration is. When you add a hardware-accelerated block tailored to your application, you increase processing power while decreasing code complexity and density, because the block takes the place of a software module. Basically, you're trading hardware for speed and simplicity. We'll discuss the trade-offs in more detail later on.
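To see what that trade looks like from the software side, here's a minimal sketch of swapping a software module for a hardware-accelerated one behind a single macro. The names are hypothetical: USE_HW_CRC is an imagined project flag, sw_crc_update() an imagined software routine, and nm_crc() the vendor-generated custom-instruction macro that appears later in this article.

/* Minimal sketch: one compile-time switch chooses between the software
 * implementation and the hardware-accelerated custom instruction.
 * USE_HW_CRC and sw_crc_update() are hypothetical; nm_crc() is the
 * vendor-generated macro discussed later in the article. */
#ifdef USE_HW_CRC
#define CRC_UPDATE(word, first)   nm_crc((word), (first))
#else
#define CRC_UPDATE(word, first)   sw_crc_update((word), (first))
#endif

/* The calling code is identical either way:
 *     result = CRC_UPDATE(data_word, 0);
 */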
Custom instructions

One way to add an accelerator is as a custom instruction: a block of logic attached directly to the processor's arithmetic logic unit. To the software, the block looks like one more machine instruction, called through a macro that the vendor's system builder tool generates. If the custom instruction takes several clock cycles to complete and you want to call it successively, you can implement it as a pipelined custom instruction, which produces a result every clock cycle after some initial latency.

Hardware peripherals

The other approach is to implement the accelerator as a hardware peripheral that sits on the system bus. The processor controls it through memory-mapped registers, and the peripheral can move data itself, for example through a DMA channel, without involving the processor.

Which to use

One difference is that a custom instruction takes a limited number of operands and returns one result; the number of operands varies with the instruction-set architecture of the processor. For some operations this can be cumbersome. Also, if the hardware needs to read and write memory or some other peripheral, you'll have to implement it as a hardware peripheral, since a custom instruction doesn't have access to the bus.

Choosing code

Even after you find the bottleneck, figuring out how to optimize it is a challenge. Some options are using native-word-sized variables, using lookup tables of precomputed values, and general software algorithm optimization. These techniques can improve execution speed by an order of a few hundred percent. Another way to optimize a C algorithm is to write it in assembly. In the past this brought big improvements, but today's compilers are already so proficient at optimizing C code that the gain is limited. If you need a substantial improvement in performance, traditional techniques for optimizing software algorithms may not be enough.

It's not uncommon, however, for an algorithm implemented in hardware to run 100 times faster than its software implementation. So how do you determine which pieces of code to turn into hardware? You don't necessarily want to convert an entire software module into hardware. Instead, look for operations that by nature execute extremely fast in hardware: data copies from one location to another, intensive mathematical operations, and loops that run many times. If a task is made up of several such mathematical operations, you might consider accelerating the whole task in hardware. Otherwise, accelerating just a single operation in the task may be enough to meet the performance requirements.

Example: CRC algorithm

As an example, let's accelerate a cyclic redundancy check (CRC) algorithm. First we'll optimize it using traditional software techniques, and then we'll accelerate it by turning it into a custom instruction. Along the way I'll discuss the performance benchmarks and trade-offs of the different implementations.

A CRC algorithm is used to check whether data have been corrupted during transmission. CRC algorithms are popular because they combine a high error-detection rate with a relatively modest decrease in data throughput (due to the CRC bits added to the message). They require more calculation than simple checksum algorithms, but the increase in the error-detection rate is worth it.

In general terms, the sender performs the CRC algorithm on the message to be sent and includes the CRC result with the message. The receiver then performs the same CRC operation on the message, including the CRC result from the sender. If the receiver comes up with a different result than the sender did, the data have been corrupted. (Some details have been left out of this explanation for brevity's sake. For more detailed information on CRC algorithms, refer to Michael Barr's article "Slow and Steady Never Lost the Race.")
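As a concrete sketch of that exchange, the fragment below appends a 16-bit CRC to an outgoing message and checks it on the receiving end. It uses the crcSlow() routine from Listing 1 below; the two-byte framing and the zero-remainder check are illustrative assumptions rather than part of any particular standard (variants with a nonzero initial value leave a fixed, known residue instead of zero).

/* Sender: compute the CRC over the message and append it, high byte first.
 * The framing is a hypothetical example. */
int send_message(unsigned char buf[], int nBytes)
{
    crc result = crcSlow(buf, nBytes);

    buf[nBytes]     = (unsigned char)(result >> 8);
    buf[nBytes + 1] = (unsigned char)(result & 0xFF);
    return nBytes + 2;    /* length of message plus CRC */
}

/* Receiver: run the same CRC over the message plus the appended CRC.
 * With a zero initial remainder, an intact message divides evenly,
 * so a nonzero remainder means the data were corrupted in transit. */
int message_ok(unsigned char buf[], int nBytesWithCrc)
{
    return crcSlow(buf, nBytesWithCrc) == 0;
}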
The CRC algorithm itself is an intense mathematical operation: a modulo-2 division of the data message by a 16- or 32-bit polynomial (depending on the CRC standard used). This kind of operation is normally implemented as an iterative process of XORs and shifts that can take close to a couple hundred instructions per byte of data with a 16-bit polynomial. When sending hundreds of bytes, the calculation adds up to tens of thousands of instructions, so any optimization can improve throughput considerably.

The CRC function shown in Listing 1 takes two arguments, a pointer to the message and the number of bytes in the message, and returns the computed CRC value (the remainder). Even though the arguments to the function are bytes, the calculation is performed bit by bit. This algorithm is inefficient because of all the operations (AND, shift, XOR, loop control) that must be carried out for each bit.

Listing 1: C code for bit-by-bit CRC algorithm

typedef unsigned short crc;     /* 16 bits wide, per the CRC-CCITT standard */

#define POLYNOMIAL  0x1021      /* CRC-CCITT generator polynomial */
#define WIDTH       (8 * sizeof(crc))
#define TOPBIT      (1 << (WIDTH - 1))

crc crcSlow(unsigned char const message[], int nBytes)
{
    crc remainder = 0;

    /*
     * Perform modulo-2 division, a byte at a time.
     */
    for (int byte = 0; byte < nBytes; ++byte)
    {
        /*
         * Bring the next byte into the remainder.
         */
        remainder ^= (message[byte] << (WIDTH - 8));

        /*
         * Perform modulo-2 division, a bit at a time.
         */
        for (unsigned char bit = 8; bit > 0; --bit)
        {
            /*
             * Try to divide the current data bit.
             */
            if (remainder & TOPBIT)
            {
                remainder = (remainder << 1) ^ POLYNOMIAL;
            }
            else
            {
                remainder = (remainder << 1);
            }
        }
    }

    /*
     * The final remainder is the CRC result.
     */
    return (remainder);
}

Traditional software optimization

A common way to optimize a CRC algorithm is to precompute the CRC remainders and look them up instead of calculating them. When using this approach, you store the precomputed values in memory; you can use either ROM or RAM that's initialized before the CRC calculations start. The table has 256 entries, one for each possible 8-bit message value, regardless of the size of the polynomial; each entry is as wide as the CRC itself, so a 16-bit CRC needs 512 bytes of table storage. Listing 2 shows the C code for the lookup table method, including crcInit(), the function that generates the values in the table.

Listing 2: C code for CRC algorithm using a lookup table

crc crcTable[256];    /* filled in once by crcInit() */

void crcInit(void)
{
    crc remainder;

    /*
     * Compute the remainder of each possible dividend.
     */
    for (int dividend = 0; dividend < 256; ++dividend)
    {
        /*
         * Start with the dividend followed by zeros.
         */
        remainder = dividend << (WIDTH - 8);

        /*
         * Perform modulo-2 division, a bit at a time.
         */
        for (unsigned char bit = 8; bit > 0; --bit)
        {
            /*
             * Try to divide the current data bit.
             */
            if (remainder & TOPBIT)
            {
                remainder = (remainder << 1) ^ POLYNOMIAL;
            }
            else
            {
                remainder = (remainder << 1);
            }
        }

        /*
         * Store the result into the table.
         */
        crcTable[dividend] = remainder;
    }
}   /* crcInit() */

crc crcFast(unsigned char const message[], int nBytes)
{
    unsigned char data;
    crc remainder = 0;

    /*
     * Divide the message by the polynomial, a byte at a time.
     */
    for (int byte = 0; byte < nBytes; ++byte)
    {
        data = message[byte] ^ (remainder >> (WIDTH - 8));
        remainder = crcTable[data] ^ (remainder << 8);
    }

    /*
     * The final remainder is the CRC.
     */
    return (remainder);
}   /* crcFast() */

The entire calculation has been reduced to one loop with two XORs, two shift operations, and two load instructions per byte (not per bit). Basically, we're trading the memory space of the lookup table for speed.
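Before benchmarking, it's worth confirming that the two implementations agree. Here's a minimal, hypothetical test harness, assuming Listings 1 and 2 are compiled into the same program:

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char message[] = "123456789";   /* conventional CRC check string */
    int n = (int)strlen((char *)message);

    crcInit();   /* crcTable[] must be filled before the first crcFast() call */

    crc slow = crcSlow(message, n);
    crc fast = crcFast(message, n);

    /* The table-driven version must match the bit-by-bit version exactly. */
    printf("crcSlow = 0x%04X, crcFast = 0x%04X (%s)\n",
           (unsigned)slow, (unsigned)fast,
           (slow == fast) ? "match" : "MISMATCH");
    return 0;
}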
This method is about 9.9 times faster than the bit-by-bit method, a significant improvement that may be enough for some applications. You could try to squeeze out a bit more performance by writing assembly code or increasing the size of the lookup table. But what if you need a 20-, 50-, or even 500-times improvement over the bit-by-bit method? Then it's time to investigate implementing the algorithm in hardware.

Accelerating CRC in hardware using custom instructions

Implemented in hardware, the algorithm should perform the calculation 16 or 32 bits at a time, according to the CRC standard you're using. With the CRC-CCITT standard (16-bit polynomial), it's best to perform the calculation 16 bits at a time. (On an 8-bit microprocessor this won't be as efficient, because of the extra cycles needed to load the operand values and return the CRC value.) Figure 2 shows the core of the hardware implementation of a 16-bit CRC algorithm. The signal msg(15..0) is the message, which is shifted into the XOR/shift hardware one bit at a time.

Listing 3 shows sample C code for calculating the CRC of a 64KB block of data. This example targets the Nios embedded processor.

Listing 3: C code for CRC calculation using a custom instruction

unsigned short crcCompute(unsigned short *data_block, unsigned int nWords)
{
    unsigned short *pointer;
    unsigned short word;

    /*
     * initialize crc reg to 0xFFFF
     */
    word = nm_crc(0xFFFF, 1);   /* nm_crc() is the CRC custom instruction */

    /*
     * calculate CRC on block of data
     */
    for (pointer = data_block; pointer < (data_block + nWords); pointer++)
        word = nm_crc(*pointer, 0);

    return (word);
}

int main(void)
{
    #define data_block_begin  (na_onchip_memory)
    #define data_block_end    (na_onchip_memory + 0xffff)

    unsigned short crc_result;
    unsigned int data_block_length =
        (unsigned short *)data_block_end - (unsigned short *)data_block_begin + 1;

    crc_result = crcCompute((unsigned short *)data_block_begin,
                            data_block_length);
    return 0;
}

When you use a custom instruction, the code for calculating the CRC value is a function call, or macro. When you implement a custom instruction for the Nios processor, the system builder tool generates a macro, nm_crc() in this case, which you use to call the custom instruction. Before the CRC calculation starts, the CRC register inside the custom instruction needs to be initialized. Loading the initial value is part of the CRC standard and differs from one standard to the next. The loop then calls the CRC custom instruction for every 16 bits of data in the data block. This custom-instruction implementation is about 27 times faster than the bit-by-bit implementation.

The peripheral method

Using a custom peripheral with a DMA gives more than a 500-times improvement over the bit-by-bit, software-only algorithm when the data block is 64KB. Keep in mind that performance with a DMA increases as the size of the data block increases: there's a little overhead in setting up the DMA, after which the DMA runs extremely fast because it can transfer data every cycle. It's therefore not efficient to use the DMA for only a few bytes of data.
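From the software side, driving such a peripheral might look like the sketch below. Every register name, address, and DMA helper routine here is hypothetical; in a real system, the system builder tool generates the actual register map and driver headers.

/* Hypothetical register map for a memory-mapped CRC peripheral. */
#define CRC_BASE    0x80001000u
#define CRC_INIT    (*(volatile unsigned short *)(CRC_BASE + 0x0))
#define CRC_RESULT  (*(volatile unsigned short *)(CRC_BASE + 0x4))

/* Hypothetical DMA driver routines: start a word-wide transfer from
 * memory into the peripheral, then wait for it to finish. */
extern void dma_transfer(const unsigned short *src, unsigned int nWords);
extern void dma_wait_done(void);

unsigned short crc_via_peripheral(const unsigned short *block,
                                  unsigned int nWords)
{
    CRC_INIT = 0xFFFF;             /* load the CRC-CCITT initial value       */
    dma_transfer(block, nWords);   /* DMA streams the block, a word per clock */
    dma_wait_done();               /* block until the transfer completes     */
    return CRC_RESULT;             /* remainder accumulated in hardware      */
}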
The trade-offs

Table 1 shows the benchmark results for the CRC implementations at various data block sizes.

Table 1: Benchmark results for CRC algorithms for various data block sizes

As you can see, the more hardware you use for the algorithm, the faster the algorithm runs. You're trading hardware resources for speed.

The FPGA advantage

Using an FPGA with a configurable processor makes for a flexible design. You have the option of implementing each module as software code, as a custom instruction, or as a hardware peripheral. Also, because you can add customized hardware, you can achieve performance well beyond that of some off-the-shelf microprocessors. Keep in mind, too, that an FPGA has an abundance of routing resources that a configurable processor system can take advantage of.

Evolving embedded systems

FPGAs make it possible to interchange software modules with hardware modules without changing processors or making board changes. You can choose how you want to trade off speed, hardware logic, memory, code size, and cost. With FPGAs, you can create custom embedded systems that evolve to incorporate new features and optimizations.

Lara Simsic is a senior embedded applications engineer at Altera. She has developed embedded hardware and software for five years and has an EE degree from the University of Dayton. Contact her at lsimsic@altera.com.

Copyright 2005 © CMP Media LLC