FPGA algorithm tunes gray, color images

By Sherif M. Saif, Hazem M. Abbas and Salwa M. Nassar, EE Times
November 24, 2003 (3:52 p.m. EST)
URL: http://www.eetimes.com/story/OEG20031121S0038

We have implemented block truncation coding (BTC) in an FPGA for both gray-scale and color image applications. BTC is a lossy image-compression algorithm with proven value in applications that don't require exact reconstruction of the original image.

BTC divides an image with gray-level pixels (8 bits/pixel) into small rectangular blocks of pixels, normally 4 x 4. Each block is quantized into two gray levels. The compression is achieved by sending a bit map for the quantized block (16 bits) and two quantization levels, 8 bits each. Thus, the algorithm achieves a constant bit rate of 2 bits/pixel, resulting in a compression ratio of four.
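The bit-rate arithmetic above can be checked with a short sketch. The function names and default parameters are illustrative and not part of the original design; the defaults simply restate the 4 x 4 block, 8-bit levels and 8-bit pixels given in the text.

```python
def btc_block_bits(block_width=4, block_height=4, level_bits=8):
    """Bits sent per block: one bit-map bit per pixel plus two quantization levels."""
    return block_width * block_height + 2 * level_bits


def btc_bit_rate(block_width=4, block_height=4, level_bits=8):
    """Average number of transmitted bits per pixel."""
    pixels = block_width * block_height
    return btc_block_bits(block_width, block_height, level_bits) / pixels


def btc_compression_ratio(pixel_bits=8, block_width=4, block_height=4, level_bits=8):
    """Ratio of original bits per pixel to transmitted bits per pixel."""
    return pixel_bits / btc_bit_rate(block_width, block_height, level_bits)


bits = btc_block_bits()          # 16 + 2 * 8 = 32 bits per block
rate = btc_bit_rate()            # 32 bits / 16 pixels = 2.0 bits/pixel
ratio = btc_compression_ratio()  # 8 / 2.0 = 4.0
```

Because every block costs the same 32 bits regardless of content, the bit rate is constant, which is what the text means by a constant 2 bits/pixel.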

The algorithm proceeds by calculating the mean value for each block. A two-level quantization is then performed for the entire block: a zero is stored for each pixel whose value is smaller than the mean, and a one is stored for each of the remaining pixels. The quantization value (a) represents the average gray level of the pixels whose gray level is less than the block mean, while the value (b) represents the average gray level of the pixels whose gray level is greater than or equal to the mean. The image is reconstructed at the receiving side by assigning the value (a) to the zero-coded pixels and the value (b) to the one-coded pixels. Implementing this computationally intensive algorithm in special-purpose hardware offers superior performance compared with general-purpose microprocessors.
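A minimal software sketch of the encode and decode steps just described may help make them concrete. It is not the authors' VHDL design. The function names, the truncating integer division and the handling of a perfectly flat block (where no pixel falls below the mean) are assumptions made for illustration.

```python
def btc_encode_block(pixels):
    """Encode one block given as a flat list of gray levels (16 values for 4 x 4).

    Returns (bitmap, a, b): the bit map plus the two quantization levels.
    """
    n = len(pixels)
    total = sum(pixels)
    # "pixel < mean" is tested as "pixel * n < total" so the mean is never rounded.
    bitmap = [0 if p * n < total else 1 for p in pixels]
    low = [p for p, bit in zip(pixels, bitmap) if bit == 0]
    high = [p for p, bit in zip(pixels, bitmap) if bit == 1]
    # high is never empty: at least one pixel is >= the block mean.
    # Assumption: averages are truncated by integer division.
    b = sum(high) // len(high)
    # Assumption: in a flat block nothing is below the mean, so reuse b for a.
    a = sum(low) // len(low) if low else b
    return bitmap, a, b


def btc_decode_block(bitmap, a, b):
    """Rebuild the block: zero-coded pixels get a, one-coded pixels get b."""
    return [b if bit else a for bit in bitmap]


block = [10, 12, 200, 204,
         12, 10, 204, 200,
         10, 12, 200, 204,
         12, 10, 204, 200]
bitmap, a, b = btc_encode_block(block)
restored = btc_decode_block(bitmap, a, b)
```

For this sample block the mean is 106.5, so the two dark columns are coded as zeros and the two bright columns as ones. The reconstruction replaces every dark pixel with 11 and every bright pixel with 202, which illustrates why the scheme is lossy.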

The hardware solution

FPGA implementation of the BTC algorithm was divided into three main modules:

the input module, in which the input pixels are received;
the quantizer module, in which the pixels are classified as greater or smaller than average; and
the divider module, in which divider circuits compute the two quantization values.

We used VHDL for design entry. The components required for the design were compiled into one library to be used by the main circuit flow, and ModelSim was used for compilation and simulation. Placement and routing were done with the Xilinx Foundation Series software, which generates several types of output. The VHDL output was simulated to ensure that the post-placement results were correct. Finally, another output, the configuration bitstream, was used to program the chip. The chip was chosen from the Xilinx Virtex-E family because of its storage capability and because it has one full adder per logic cell.

The design has been successfully fitted into a Virtex-E (XCV200e-pq240-8) chip. The minimum clock period of the design on this chip is equal to 25.132 nanoseconds (maximum frequency: 39.790 MHz).

The functions of the three modules implemented on the FPGA chip are performed in three operational phases: (1) loading the 16 input pixels (for a gray image); (2) comparison, addition and shift operations to produce the bit planes; (3) division operations to obtain the quantization values.

In operation, Phase 3 of one block can overlap with Phase 1 of the next. Once the bit plane of a block is obtained at the end of Phase 2, the division operations can start in parallel with the loading of the new block after allowing a small time increment, delta, to make sure that the bit plane of the last block has become stable. The chip processes the input data block by block.

The rate at which the circuit processes data depends on both the clock frequency of the chip and the frequency at which the data is entered. For example, with a clock period of 26 ns and data entered at 66 MHz, the period between input pixels is 15.15 ns, and the input data is processed at a rate of 7.8 megapixels/second, or 23.40 megabytes/s at 3 bytes per pixel. This figure breaks down as follows:

The time for setup and load, T1, is 1,440 ns; the time for the second phase, T2, is 550 ns; and the time for the last phase, T3, is 440 ns, for a total of 2,430 ns. Hence, if the load of block n starts at t0, the load of block (n + 1) starts at t0 + T1 + T2 + delta, which is t0 + 2,050 ns + delta. The overhead of t0 + delta is incurred only for the first block; for subsequent blocks it is absorbed by the timing overlap. Therefore, the effective time for each block is 2,050 ns, which yields the above pixel rate.
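The pixel-rate figure can be reproduced with a few lines of arithmetic. The sketch below takes the stated effective block time of 2,050 ns and the stated three-phase total of 2,430 ns as given, rather than deriving them from T1, T2 and delta. The constant and function names are illustrative only.

```python
PIXELS_PER_BLOCK = 16          # one 4 x 4 gray-scale block
OVERLAPPED_BLOCK_NS = 2050     # effective time per block with Phase 3 / Phase 1 overlap
SEQUENTIAL_BLOCK_NS = 2430     # T1 + T2 + T3 with no overlap
INPUT_FREQUENCY_MHZ = 66       # rate at which pixels are entered


def pixel_rate_mpixels(block_time_ns, pixels_per_block=PIXELS_PER_BLOCK):
    """Throughput in megapixels/second for a given time per block in ns."""
    # pixels per nanosecond * 1e3 = megapixels per second
    return pixels_per_block / block_time_ns * 1e3


input_period_ns = 1e3 / INPUT_FREQUENCY_MHZ             # about 15.15 ns
overlapped = pixel_rate_mpixels(OVERLAPPED_BLOCK_NS)    # about 7.80 megapixels/s
sequential = pixel_rate_mpixels(SEQUENTIAL_BLOCK_NS)    # about 6.58 megapixels/s

print(f"input period: {input_period_ns:.2f} ns")
print(f"with overlap: {overlapped:.2f} Mpixels/s")
print(f"no overlap:   {sequential:.2f} Mpixels/s")
```

Comparing the two rates shows what the overlap buys: hiding the division phase behind the next load raises throughput from roughly 6.6 to 7.8 megapixels/second.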

To assess these results, we ran the same BTC algorithm on a 550-MHz Pentium III processor, where encoding a 372 x 372 image took 70 seconds. This amounts to a speedup ratio of about 3,400 for the FPGA implementation at an input frequency of 66 MHz.

Sherif M. Saif is with the Electronic Research Institute (Cairo, Egypt); Hazem M. Abbas and Salwa M. Nassar are with Mentor Graphics (Cairo).

See related chart