Accelerating algorithms in hardware
When you're trying to get the best performance out of your algorithm and you're out of software tricks, try acceleration through hardware/software repartitioning. FPGAs provide everything you need to speed up your algorithms. Lara Simsic shows you how.

When you're trying to get the best performance out of your code, what do you do? Optimize your algorithms, use look-up tables instead of computation, turn everything into native-word sizes, use register variables, unroll loops, maybe even code in assembly. If none of this works, you can move to a faster processor, use a different processor architecture, or split the code across two processors to run in parallel. Yikes!

But what if there were a way to take the time-critical sections of your code and turn them into a function call that runs five to 100 times faster? And what if this method were a standard tool available for software development? This dream is reality today, using programmable logic as a fabric for hardware acceleration.

The low-cost programmable logic that's making its way into more embedded systems gives systems designers another option for achieving higher performance without the drastic step of changing processors or architectures. With programmable logic, you can convert compute-intensive functions into hardware-accelerated functions. From the software's perspective, it's simply a function call into a custom block of hardware that can run many times faster than the same code optimized in tight assembly language or converted into look-up tables.

In my article on Verilog for C programmers, I explained the basics of writing Verilog HDL code using a pulse-width modulator as an example. I'll build on that foundation in this article and show how to increase system performance by turning key software algorithms into hardware.

Hardware acceleration

The improvement in execution time can reach 100 times (10,000%), depending on the algorithm. Hardware is much faster at executing operations such as complex mathematical functions, transferring data from one place to another, and performing the same operation many times over. Later in this article we'll look at some operations commonly done in software that show large performance gains once hardware-accelerated.

If you use field-programmable gate arrays (FPGAs) in your system, you can add custom hardware at any time during the design cycle. You can start writing software right away and run it on portions of the hardware before the hardware is finalized. Furthermore, you can take an incremental approach in deciding which parts of the code to implement in hardware versus software. Tools from FPGA vendors make switching between hardware and software a fairly seamless process: they generate the HDL code for the bus logic and interrupt logic, and they customize the software libraries and include files to match the system configuration.

RISC with a little CISC

Wouldn't it be great if you could combine the simplicity and speed of RISC with the processing power of CISC instructions tailored to your specific application? In a nutshell, that's what hardware acceleration is. When you add a hardware-accelerated block tailored to your application, you increase processing power while decreasing code complexity and density, because the block takes the place of a software module. Basically, you're trading hardware for speed and simplicity. We'll discuss the trade-offs in more detail later on.
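To see what that trade looks like from the software side, here's a minimal sketch of swapping a software module for a hardware-accelerated one behind a single macro. The names are hypothetical: USE_HW_CRC is an imagined project flag, sw_crc_update() an imagined software routine, and nm_crc() the vendor-generated custom-instruction macro that appears later in this article.

/* Minimal sketch: one compile-time switch chooses between the software
 * implementation and the hardware-accelerated custom instruction.
 * USE_HW_CRC and sw_crc_update() are hypothetical; nm_crc() is the
 * vendor-generated macro discussed later in the article. */
#ifdef USE_HW_CRC
#define CRC_UPDATE(word, first)   nm_crc((word), (first))
#else
#define CRC_UPDATE(word, first)   sw_crc_update((word), (first))
#endif

/* The calling code is identical either way:
 *     result = CRC_UPDATE(data_word, 0);
 */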
Custom instructions

One way to add an accelerator is as a custom instruction: a block of logic attached directly to the processor's arithmetic logic unit. To the software, the block looks like one more machine instruction, called through a macro that the vendor's system builder tool generates. If the custom instruction takes several clock cycles to complete and you want to call it successively, you can implement it as a pipelined custom instruction, which produces a result every clock cycle after some initial latency.

Hardware peripherals

The other approach is to implement the accelerator as a hardware peripheral that sits on the system bus. The processor controls it through memory-mapped registers, and the peripheral can move data itself, for example through a DMA channel, without involving the processor.

Which to use

One difference is that a custom instruction takes a limited number of operands and returns one result; the number of operands varies with the instruction-set architecture of the processor. For some operations this can be cumbersome. Also, if the hardware needs to read and write memory or some other peripheral, you'll have to implement it as a hardware peripheral, since a custom instruction doesn't have access to the bus.

Choosing code

Even after you find the bottleneck, figuring out how to optimize it is a challenge. Some options are using native-word-sized variables, using lookup tables of precomputed values, and general software algorithm optimization. These techniques can improve execution speed by an order of a few hundred percent. Another way to optimize a C algorithm is to write it in assembly. In the past this brought big improvements, but today's compilers are already so proficient at optimizing C code that the gain is limited. If you need a substantial improvement in performance, traditional techniques for optimizing software algorithms may not be enough.

It's not uncommon, however, for an algorithm implemented in hardware to run 100 times faster than its software implementation. So how do you determine which pieces of code to turn into hardware? You don't necessarily want to convert an entire software module into hardware. Instead, look for operations that by nature execute extremely fast in hardware: data copies from one location to another, intensive mathematical operations, and loops that run many times. If a task is made up of several such mathematical operations, you might consider accelerating the whole task in hardware. Otherwise, accelerating just a single operation in the task may be enough to meet the performance requirements.

Example: CRC algorithm

As an example, let's accelerate a cyclic redundancy check (CRC) algorithm. First we'll optimize it using traditional software techniques, and then we'll accelerate it by turning it into a custom instruction. Along the way I'll discuss the performance benchmarks and trade-offs of the different implementations.

A CRC algorithm is used to check whether data have been corrupted during transmission. CRC algorithms are popular because they combine a high error-detection rate with a relatively modest decrease in data throughput (due to the CRC bits added to the message). They require more calculation than simple checksum algorithms, but the increase in the error-detection rate is worth it.

In general terms, the sender performs the CRC algorithm on the message to be sent and includes the CRC result with the message. The receiver then performs the same CRC operation on the message, including the CRC result from the sender. If the receiver comes up with a different result than the sender did, the data have been corrupted. (Some details have been left out of this explanation for brevity's sake. For more detailed information on CRC algorithms, refer to Michael Barr's article "Slow and Steady Never Lost the Race.")
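As a concrete sketch of that exchange, the fragment below appends a 16-bit CRC to an outgoing message and checks it on the receiving end. It uses the crcSlow() routine from Listing 1 below; the two-byte framing and the zero-remainder check are illustrative assumptions rather than part of any particular standard (variants with a nonzero initial value leave a fixed, known residue instead of zero).

/* Sender: compute the CRC over the message and append it, high byte first.
 * The framing is a hypothetical example. */
int send_message(unsigned char buf[], int nBytes)
{
    crc result = crcSlow(buf, nBytes);

    buf[nBytes]     = (unsigned char)(result >> 8);
    buf[nBytes + 1] = (unsigned char)(result & 0xFF);
    return nBytes + 2;    /* length of message plus CRC */
}

/* Receiver: run the same CRC over the message plus the appended CRC.
 * With a zero initial remainder, an intact message divides evenly,
 * so a nonzero remainder means the data were corrupted in transit. */
int message_ok(unsigned char buf[], int nBytesWithCrc)
{
    return crcSlow(buf, nBytesWithCrc) == 0;
}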
The CRC algorithm itself is an intense mathematical operation: a modulo-2 division of the data message by a 16- or 32-bit polynomial (depending on the CRC standard used). This kind of operation is normally implemented as an iterative process of XORs and shifts that can take close to a couple hundred instructions per byte of data with a 16-bit polynomial. When sending hundreds of bytes, the calculation adds up to tens of thousands of instructions, so any optimization can improve throughput considerably.

The CRC function shown in Listing 1 takes two arguments, a pointer to the message and the number of bytes in the message, and returns the computed CRC value (the remainder). Even though the arguments to the function are bytes, the calculation is performed bit by bit. This algorithm is inefficient because of all the operations (AND, shift, XOR, loop control) that must be carried out for each bit.

Listing 1: C code for bit-by-bit CRC algorithm

typedef unsigned short crc;     /* 16 bits wide, per the CRC-CCITT standard */

#define POLYNOMIAL  0x1021      /* CRC-CCITT generator polynomial */
#define WIDTH       (8 * sizeof(crc))
#define TOPBIT      (1 << (WIDTH - 1))

crc crcSlow(unsigned char const message[], int nBytes)
{
    crc remainder = 0;

    /*
     * Perform modulo-2 division, a byte at a time.
     */
    for (int byte = 0; byte < nBytes; ++byte)
    {
        /*
         * Bring the next byte into the remainder.
         */
        remainder ^= (message[byte] << (WIDTH - 8));

        /*
         * Perform modulo-2 division, a bit at a time.
         */
        for (unsigned char bit = 8; bit > 0; --bit)
        {
            /*
             * Try to divide the current data bit.
             */
            if (remainder & TOPBIT)
            {
                remainder = (remainder << 1) ^ POLYNOMIAL;
            }
            else
            {
                remainder = (remainder << 1);
            }
        }
    }

    /*
     * The final remainder is the CRC result.
     */
    return (remainder);
}

Traditional software optimization

A common way to optimize a CRC algorithm is to precompute the CRC remainders and look them up instead of calculating them. When using this approach, you store the precomputed values in memory; you can use either ROM or RAM that's initialized before the CRC calculations start. The table has 256 entries, one for each possible 8-bit message value, regardless of the size of the polynomial; each entry is as wide as the CRC itself, so a 16-bit CRC needs 512 bytes of table storage. Listing 2 shows the C code for the lookup table method, including crcInit(), the function that generates the values in the table.

Listing 2: C code for CRC algorithm using a lookup table

crc crcTable[256];    /* filled in once by crcInit() */

void crcInit(void)
{
    crc remainder;

    /*
     * Compute the remainder of each possible dividend.
     */
    for (int dividend = 0; dividend < 256; ++dividend)
    {
        /*
         * Start with the dividend followed by zeros.
         */
        remainder = dividend << (WIDTH - 8);

        /*
         * Perform modulo-2 division, a bit at a time.
         */
        for (unsigned char bit = 8; bit > 0; --bit)
        {
            /*
             * Try to divide the current data bit.
             */
            if (remainder & TOPBIT)
            {
                remainder = (remainder << 1) ^ POLYNOMIAL;
            }
            else
            {
                remainder = (remainder << 1);
            }
        }

        /*
         * Store the result into the table.
         */
        crcTable[dividend] = remainder;
    }
}   /* crcInit() */

crc crcFast(unsigned char const message[], int nBytes)
{
    unsigned char data;
    crc remainder = 0;

    /*
     * Divide the message by the polynomial, a byte at a time.
     */
    for (int byte = 0; byte < nBytes; ++byte)
    {
        data = message[byte] ^ (remainder >> (WIDTH - 8));
        remainder = crcTable[data] ^ (remainder << 8);
    }

    /*
     * The final remainder is the CRC.
     */
    return (remainder);
}   /* crcFast() */

The entire calculation has been reduced to one loop with two XORs, two shift operations, and two load instructions per byte (not per bit). Basically, we're trading the memory space of the lookup table for speed.
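Before benchmarking, it's worth confirming that the two implementations agree. Here's a minimal, hypothetical test harness, assuming Listings 1 and 2 are compiled into the same program:

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char message[] = "123456789";   /* conventional CRC check string */
    int n = (int)strlen((char *)message);

    crcInit();   /* crcTable[] must be filled before the first crcFast() call */

    crc slow = crcSlow(message, n);
    crc fast = crcFast(message, n);

    /* The table-driven version must match the bit-by-bit version exactly. */
    printf("crcSlow = 0x%04X, crcFast = 0x%04X (%s)\n",
           (unsigned)slow, (unsigned)fast,
           (slow == fast) ? "match" : "MISMATCH");
    return 0;
}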
This method is about 9.9 times faster than the bit-by-bit method, a significant improvement that may be enough for some applications. You could try to squeeze out a bit more performance by writing assembly code or increasing the size of the lookup table. But what if you need a 20-, 50-, or even 500-times improvement over the bit-by-bit method? Then it's time to investigate implementing the algorithm in hardware.

Accelerating CRC in hardware using custom instructions

Implemented in hardware, the algorithm should perform the calculation 16 or 32 bits at a time, according to the CRC standard you're using. With the CRC-CCITT standard (16-bit polynomial), it's best to perform the calculation 16 bits at a time. (On an 8-bit microprocessor this won't be as efficient, because of the extra cycles needed to load the operand values and return the CRC value.) Figure 2 shows the core of the hardware implementation of a 16-bit CRC algorithm. The signal msg(15..0) is the message, which is shifted into the XOR/shift hardware one bit at a time.

Listing 3 shows sample C code for calculating the CRC of a 64KB block of data. This example targets the Nios embedded processor.

Listing 3: C code for CRC calculation using a custom instruction

unsigned short crcCompute(unsigned short *data_block, unsigned int nWords)
{
    unsigned short *pointer;
    unsigned short word;

    /*
     * initialize crc reg to 0xFFFF
     */
    word = nm_crc(0xFFFF, 1);   /* nm_crc() is the CRC custom instruction */

    /*
     * calculate CRC on block of data
     */
    for (pointer = data_block; pointer < (data_block + nWords); pointer++)
        word = nm_crc(*pointer, 0);

    return (word);
}

int main(void)
{
    #define data_block_begin  (na_onchip_memory)
    #define data_block_end    (na_onchip_memory + 0xffff)

    unsigned short crc_result;
    unsigned int data_block_length =
        (unsigned short *)data_block_end - (unsigned short *)data_block_begin + 1;

    crc_result = crcCompute((unsigned short *)data_block_begin,
                            data_block_length);
    return 0;
}

When you use a custom instruction, the code for calculating the CRC value is a function call, or macro. When you implement a custom instruction for the Nios processor, the system builder tool generates a macro, nm_crc() in this case, which you use to call the custom instruction. Before the CRC calculation starts, the CRC register inside the custom instruction needs to be initialized. Loading the initial value is part of the CRC standard and differs from one standard to the next. The loop then calls the CRC custom instruction for every 16 bits of data in the data block. This custom-instruction implementation is about 27 times faster than the bit-by-bit implementation.

The peripheral method

Using a custom peripheral with a DMA gives more than a 500-times improvement over the bit-by-bit, software-only algorithm when the data block is 64KB. Keep in mind that performance with a DMA increases as the size of the data block increases: there's a little overhead in setting up the DMA, after which the DMA runs extremely fast because it can transfer data every cycle. It's therefore not efficient to use the DMA for only a few bytes of data.
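From the software side, driving such a peripheral might look like the sketch below. Every register name, address, and DMA helper routine here is hypothetical; in a real system, the system builder tool generates the actual register map and driver headers.

/* Hypothetical register map for a memory-mapped CRC peripheral. */
#define CRC_BASE    0x80001000u
#define CRC_INIT    (*(volatile unsigned short *)(CRC_BASE + 0x0))
#define CRC_RESULT  (*(volatile unsigned short *)(CRC_BASE + 0x4))

/* Hypothetical DMA driver routines: start a word-wide transfer from
 * memory into the peripheral, then wait for it to finish. */
extern void dma_transfer(const unsigned short *src, unsigned int nWords);
extern void dma_wait_done(void);

unsigned short crc_via_peripheral(const unsigned short *block,
                                  unsigned int nWords)
{
    CRC_INIT = 0xFFFF;             /* load the CRC-CCITT initial value       */
    dma_transfer(block, nWords);   /* DMA streams the block, a word per clock */
    dma_wait_done();               /* block until the transfer completes     */
    return CRC_RESULT;             /* remainder accumulated in hardware      */
}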
The trade-offs

Table 1 shows the benchmark results for the CRC implementations at various data block sizes.

Table 1: Benchmark results for CRC algorithms for various data block sizes

As you can see, the more hardware you use for the algorithm, the faster the algorithm runs. You're trading hardware resources for speed.

The FPGA advantage

Using an FPGA with a configurable processor makes for a flexible design. You have the option of implementing each module as software code, as a custom instruction, or as a hardware peripheral. Also, because you can add customized hardware, you can achieve performance well beyond that of some off-the-shelf microprocessors. Keep in mind, too, that an FPGA has an abundance of routing resources that a configurable processor system can take advantage of.

Evolving embedded systems

FPGAs make it possible to interchange software modules with hardware modules without changing processors or making board changes. You can choose how you want to trade off speed, hardware logic, memory, code size, and cost. With FPGAs, you can create custom embedded systems that evolve to incorporate new features and optimizations.

Lara Simsic is a senior embedded applications engineer at Altera. She has developed embedded hardware and software for five years and has an EE degree from the University of Dayton. Contact her at lsimsic@altera.com.

Copyright 2005 © CMP Media LLC