IP Cores for accelerating JPEG2000
Louvain-la-Neuve Belgium
Abstract:
This paper presents BARCO SILEX’s IP solutions for accelerating picture compression and decompression with the recent JPEG2000 algorithm.
The algorithm is briefly explained and the structure of the IP cores is detailed. Finally, implementation and performance results are presented for various FPGA and ASIC technologies.
INTRODUCTION
JPEG2000 [1] is the latest algorithm from the JPEG standardization committee for still picture compression. It is based on wavelet technology and is very different from its predecessor. It features a large set of capabilities that will allow it to be adopted in a wide spectrum of applications, even extending to video encoding.
In return, this compression scheme requires much more computational power than its classical JPEG predecessor, which makes software implementations poor candidates for applications requiring very short encoding times.
To address high-performance applications, BARCO SILEX has developed two JPEG2000 accelerator IP cores: the BA112JPEG2000E encoder and the BA111JPEG2000D decoder. These perform all computationally intensive tasks of the JPEG2000 algorithm. Coupled to a host CPU, these cores allow building a complete JPEG2000 encoding or decoding solution.
This paper describes the structure of the BARCO SILEX JPEG2000 IP cores. The cores have been developed for sustained high speed processing and feature a state-of-the-art pipelined parallel architecture. Several entropy encoders are implemented in parallel in order to increase the overall pixel throughput.
The first section gives an overview of the JPEG2000 algorithm. The second section details the architecture of the IP cores. Finally, the third section gives implementation and performance results on FPGA and ASIC technologies.
THE JPEG2000 ALGORITHM
JPEG2000 is based on an algorithm offering a wide range of tools for compressing and representing images. These are suitable for a large spectrum of applications such as Internet streaming, medical imaging and digital cameras.
This algorithm encompasses various capabilities:
- lossy and lossless compressions with excellent performance,
- precise compression ratio control with single-pass processing,
- bitstream progressivity, which allows image previews to be obtained by decoding only part of the bitstream,
- region-of-interest capability,
- error resilience.
For supporting this rich set of features, the JPEG2000 algorithm implements two consecutive processing stages that are explained in the following sections.
First Stage: Wavelet-Based Compression
To apply JPEG2000 compression, the image can be divided into rectangular tiles of any size, each of which undergoes a 2-D wavelet transform [2] as illustrated in Figure 1. The wavelet transform is an iterative decorrelating operation that decomposes a tile into a series of smaller pictures (subbands). Each subband contains tile information limited to a given frequency range (including low-pass).
One level of wavelet decomposition allows building four subbands from the low-pass subband derived during the previous decomposition steps.
Figure 1: Illustration of wavelet decomposition of a square tile (“L” means result of low-pass filtering in a given direction – horizontal or vertical -; “H” means result of high-pass filtering in a given direction; combining both letters yields the four 2-D filtering combinations)
From Figure 1 subbands 1HL, 1LH, 1HH, 1LL are the result of the wavelet decomposition applied on the complete tile; subbands 2HL, 2LH, 2HH, 2LL are the result of the wavelet decomposition applied on subband 1LL; etc.
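The decomposition pattern of Figure 1 can be sketched in a few lines of Python. This is an illustration only: the function name `subband_layout` is hypothetical, "HL" is taken to mean high-pass horizontally and low-pass vertically, and the split of odd dimensions assumes a tile whose origin lies at an even coordinate.

```python
def subband_layout(width, height, levels):
    """List (name, width, height) for each subband of a dyadic decomposition."""
    bands = []
    w, h = width, height
    for level in range(1, levels + 1):
        low_w, high_w = (w + 1) // 2, w // 2   # low-pass / high-pass widths
        low_h, high_h = (h + 1) // 2, h // 2   # low-pass / high-pass heights
        bands.append((f"{level}HL", high_w, low_h))
        bands.append((f"{level}LH", low_w, high_h))
        bands.append((f"{level}HH", high_w, high_h))
        # only the low-pass (LL) subband is decomposed again
        w, h = low_w, low_h
    bands.append((f"{levels}LL", w, h))
    return bands


if __name__ == "__main__":
    for name, w, h in subband_layout(128, 128, 2):
        print(name, w, h)
```

The subband areas always add up to the tile area, which is why the transform itself does not change the number of samples to be stored.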
This process groups information from the same frequency range together, so that the quantization of these data can be weighted selectively. Each subband can undergo separate quantization by a programmable factor for lossy compression. Bypassing the quantization yields lossless operation.
The resultant quantized subbands are further divided into smaller rectangular blocks (code blocks) which are separately entropy encoded. This process is achieved by a Modeler and an MQ-coder, which is an adaptive Arithmetic Encoder.
The Modeler examines all bit planes of the current code block, starting from the most significant non-zero bit plane. It scans the current bit plane in a zigzag order with three passes per plane. During each pass, it computes a context for the current bit. The context reflects the predominant value of the neighboring bits.
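The bit-plane traversal described above can be sketched as follows. This sketch assumes the stripe-oriented scan defined in JPEG2000 Part 1 (four rows at a time, column by column within each stripe), which is presumably what the text calls a zigzag order. The three coding passes and the context formation are deliberately left out, and all function names are hypothetical.

```python
def top_bit_plane(magnitudes):
    """Index of the most significant non-zero bit plane (-1 if all zero)."""
    return max(magnitudes).bit_length() - 1


def stripe_scan(width, height, stripe=4):
    """Yield (row, col) positions stripe by stripe, column by column."""
    for top in range(0, height, stripe):
        for col in range(width):
            for row in range(top, min(top + stripe, height)):
                yield row, col


def bit_plane(block, plane):
    """Bits of one plane of a block (list of rows of magnitudes), in scan order."""
    height, width = len(block), len(block[0])
    return [(block[r][c] >> plane) & 1 for r, c in stripe_scan(width, height)]


if __name__ == "__main__":
    block = [[5, 0], [1, 2]]
    flat = [m for row in block for m in row]
    # walk the planes from the most significant non-zero one down to plane 0
    for plane in range(top_bit_plane(flat), -1, -1):
        print(plane, bit_plane(block, plane))
```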
The adaptive Arithmetic Encoder finally encodes each scanned bit using a probability value derived from the associated context. The Arithmetic Encoder updates its probability tables after each bit encoding.
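To illustrate why context-dependent adaptive probabilities help, the sketch below keeps one probability estimate per context and reports the ideal code length of a bit sequence. It is not the MQ-coder: the real coder uses a table-driven state machine and produces an actual codestream, whereas this hypothetical `AdaptiveBitModel` uses plain occurrence counts and only measures the cost in bits.

```python
import math


class AdaptiveBitModel:
    """Per-context adaptive estimate of bit probabilities, using simple counts."""

    def __init__(self, num_contexts):
        # every context starts with one observed 0 and one observed 1
        self.counts = [[1, 1] for _ in range(num_contexts)]

    def cost_and_update(self, context, bit):
        """Return the ideal cost of `bit` in bits, then adapt the estimate."""
        counts = self.counts[context]
        probability = counts[bit] / (counts[0] + counts[1])
        counts[bit] += 1
        return -math.log2(probability)


def ideal_code_length(symbols, num_contexts):
    """Total ideal code length, in bits, of a list of (context, bit) pairs."""
    model = AdaptiveBitModel(num_contexts)
    return sum(model.cost_and_update(ctx, bit) for ctx, bit in symbols)


if __name__ == "__main__":
    skewed = [(0, 0)] * 64                      # one context, always 0
    mixed = [(0, i % 2) for i in range(64)]     # one context, alternating bits
    print(ideal_code_length(skewed, 1), ideal_code_length(mixed, 1))
```

A context whose bits are highly predictable costs far less than one bit per symbol once the estimate has adapted, which is the effect the Modeler's contexts are meant to exploit.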
The Modeler also computes compression metrics reflecting the image distortion incurred by reconstructing the code block from only its currently encoded portion. This information is used by the second stage as described below.
Second Stage (Tier-2): Packet Selection And Reordering
The codestream generated by the arithmetic encoder, together with the distortion metrics, allows the JPEG2000 post-processing stage to selectively build the final bitstream for a given compression ratio and progression order. This stage will organize the resultant packets to minimize the overall distortion while trying to attain the specified compression ratio. This allows a precise control of the generated compressed file size while maintaining a good image quality.
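One common way to perform this selection is to keep, for every code block, the leading coding passes whose distortion reduction per byte exceeds a global threshold, and to search for the threshold that just meets the byte budget. The sketch below follows that idea under stated assumptions: each code block is a hypothetical list of `(size_in_bytes, distortion_reduction)` pairs with positive sizes, the slopes decrease within a block, and packet headers are ignored. It is not taken from the BARCO SILEX software.

```python
def truncate(passes, threshold):
    """Number of leading passes whose distortion/byte slope meets the threshold."""
    kept = 0
    for size, gain in passes:
        if gain / size < threshold:
            break
        kept += 1
    return kept


def total_bytes(blocks, threshold):
    """Bytes used when every block is truncated at the given threshold."""
    return sum(size
               for passes in blocks
               for size, _ in passes[:truncate(passes, threshold)])


def select_passes(blocks, budget, iterations=40):
    """Pick a truncation point per code block so that total bytes <= budget."""
    low = 0.0
    # above the steepest slope nothing is kept, which always fits the budget
    high = max(gain / size for passes in blocks for size, gain in passes) + 1.0
    for _ in range(iterations):
        mid = (low + high) / 2
        if total_bytes(blocks, mid) <= budget:
            high = mid      # feasible: try to keep more passes
        else:
            low = mid       # too large: raise the threshold
    return [truncate(passes, high) for passes in blocks]


if __name__ == "__main__":
    blocks = [[(10, 100.0), (10, 20.0)], [(10, 50.0), (10, 5.0)]]
    print(select_passes(blocks, budget=30))
```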
Moreover JPEG2000 standardizes various orders of packet inclusion in the bitstream. This allows many bitstream progressivities (e.g. by resolution or by quality) that enable fast preview of a picture with a first portion of the bitstream and further image refinements by decoding subsequent parts of the compressed file.
DESCRIPTION OF THE PROPOSED JPEG2000 IP CORES
Due to its powerful capabilities, JPEG2000 requires more computational resources than classical JPEG to achieve similar encoding and decoding speeds. Hardware solutions are therefore required for high-speed applications.
BARCO SILEX has developed two IP cores able to accelerate the JPEG2000 encoding and decoding operations. These cores perform all computationally intensive tasks of the JPEG2000 algorithm by integrating the following operations: wavelet transform, quantization and arithmetic coding.
The cores are designed as accelerating coprocessors in a complete JPEG2000 encoding or decoding system. Indeed the Tier-2 part of the JPEG2000 algorithm is more suitably executed by a software routine running on a host processor.
Figure 2 shows a block diagram of the BA112JPEG2000E IP core designed by BARCO SILEX. This illustrates the main functional modules and a simplified view of the interfaces. Pixel data are input through the Pixel Interface and compressed streams are made available at the Compressed Interfaces together with distortion metrics. The core features a simple generic CPU Interface suited for interfacing it as a bus peripheral to various processors.
The next sub-sections describe the modules constituting the BA112JPEG2000E core as depicted in Figure 2.
Figure 2: Block diagram of the BA112JPEG2000E IP core
Bidimensional DWT
The first module of the core is the wavelet transform engine. This module accepts tiles of pixels of any size up to 128 by 128 and performs bidimensional discrete wavelet decomposition on the incoming data with a programmable number of decomposition levels up to 5. The wavelet transform can be programmed to be lossy (i.e. 9/7 filter), lossless (i.e. 5/3 filter) or bypassed.
The DWT module accepts incoming pixels of any size up to 12 bits for lossless and 10 bits for lossy operation. It stores its results in the on-chip Tile Buffer, ready for quantization and code block decomposition. The on-chip Tile Buffer reduces the IP pin count and improves the overall performance.
The wavelet structure is illustrated in Figure 3. It is based on a pipelined decomposition architecture where the incoming tile first undergoes filtering along the columns and then filtering along the rows. Each 1-D filtering delivers high-pass and low-pass results, which allow building the four HL, LH, HH and LL subbands. Each decomposition result is stored in the Tile Buffer and the low-pass subband (i.e. LL) is looped back to undergo the next decomposition. This architecture lowers the hardware complexity while maintaining high throughput, as summarized by the following table (throughput expressed in pixels per clock cycle for various numbers of decomposition levels):
Number of decompositions | 0 | 1 | 2 | 3 | 4 | 5 |
Overall throughput | 0.96 | 0.96 | 0.76 | 0.72 | 0.71 | 0.70 |
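As a worked example of how these figures translate into cycle counts, the sketch below divides the tile size by the throughput of the table above. The helper names are hypothetical, and the clock value used in the usage lines is only an example input.

```python
import math

# pixels per clock cycle, indexed by number of decomposition levels (table above)
DWT_THROUGHPUT = {0: 0.96, 1: 0.96, 2: 0.76, 3: 0.72, 4: 0.71, 5: 0.70}


def dwt_cycles(tile_width, tile_height, levels):
    """Clock cycles needed to transform one tile, rounded up."""
    return math.ceil(tile_width * tile_height / DWT_THROUGHPUT[levels])


def dwt_rate_msamples(clock_mhz, levels):
    """Sustained DWT rate in Msamples/s for a given clock frequency."""
    return clock_mhz * DWT_THROUGHPUT[levels]


if __name__ == "__main__":
    print(dwt_cycles(128, 128, 5))        # full-size tile, five levels
    print(dwt_rate_msamples(100.0, 5))    # example 100 MHz clock
```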
The 1-D decomposition filters are based on a state-of-the-art lifting scheme [3], reducing the gate count.
These engines also implement symmetric border extensions as specified by JPEG2000 in order to reduce border artifacts.
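For reference, one level of the reversible 5/3 filter can be written as two lifting steps (predict, then update) with whole-sample symmetric extension at the borders. The sketch below is a plain software model of that scheme on a 1-D integer signal; it says nothing about how the hardware engines are actually structured, and the function names are hypothetical.

```python
def _mirror(i, n):
    """Whole-sample symmetric extension of index i into the range [0, n)."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i = abs(i) % period
    return i if i < n else period - i


def forward_53(x):
    """One level of reversible 5/3 lifting; returns (low, high) integer lists."""
    n = len(x)

    def ext(i):
        return x[_mirror(i, n)]

    def detail(k):
        # predict step: odd sample minus the floor-average of its even neighbors
        return ext(2 * k + 1) - ((ext(2 * k) + ext(2 * k + 2)) >> 1)

    high = [detail(k) for k in range(n // 2)]
    # update step: even sample plus a rounded quarter of the neighboring details
    low = [x[2 * k] + ((detail(k - 1) + detail(k) + 2) >> 2)
           for k in range((n + 1) // 2)]
    return low, high


def inverse_53(low, high):
    """Undo forward_53 exactly (lossless reconstruction)."""
    n = len(low) + len(high)

    def detail(k):
        # the symmetric extension of x makes border details repeat
        return high[min(max(k, 0), len(high) - 1)] if high else 0

    x = [0] * n
    for k in range(len(low)):
        x[2 * k] = low[k] - ((detail(k - 1) + detail(k) + 2) >> 2)
    for k in range(len(high)):
        right = x[_mirror(2 * k + 2, n)]
        x[2 * k + 1] = high[k] + ((x[2 * k] + right) >> 1)
    return x


if __name__ == "__main__":
    signal = [10, 12, 15, 9, 7, 7, 3]
    low, high = forward_53(signal)
    print(low, high, inverse_53(low, high) == signal)
```

Because each lifting step only adds a function of the other half of the samples, the inverse simply subtracts the same quantity, which is what makes the integer transform exactly reversible.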
Quantizer
The quantizer fetches the subbands available from the Tile Buffer and applies a programmable quantization step. A different quantization step can be programmed for each subband, so that lower frequency subbands can be weighted differently from higher frequency ones. The quantizer can be bypassed for lossless mode.
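The per-subband quantization can be sketched as a dead-zone scalar quantizer, the type used by JPEG2000 Part 1 for lossy coding. The step values, the subband dictionary layout and the mid-point reconstruction bias of 0.5 in the sketch below are illustrative assumptions, not parameters of the core.

```python
import math


def quantize(coeff, step):
    """Dead-zone scalar quantizer: sign(c) * floor(|c| / step)."""
    return int(math.copysign(math.floor(abs(coeff) / step), coeff))


def dequantize(index, step, bias=0.5):
    """Reconstruct a coefficient from its index (bias 0.5 = interval mid-point)."""
    if index == 0:
        return 0.0
    return math.copysign((abs(index) + bias) * step, index)


def quantize_subbands(subbands, steps):
    """Apply a separate step to each subband ({name: [coefficients]})."""
    return {name: [quantize(c, steps[name]) for c in coeffs]
            for name, coeffs in subbands.items()}


if __name__ == "__main__":
    subbands = {"1HH": [7.9, -7.9, 1.9], "2LL": [7.9, -7.9, 1.9]}
    steps = {"1HH": 4.0, "2LL": 0.5}    # coarser step for the high frequencies
    print(quantize_subbands(subbands, steps))
```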
Tile Splitter
This unit further divides the quantized subbands into rectangular code blocks of programmable size up to 32 by 32, ready for entropy encoding.
The BA111JPEG2000D and BA112JPEG2000E cores feature a configurable number of entropy processors placed in parallel in order to sustain high coding throughputs. Because it is bit-serial, entropy coding is the slowest processing stage of the JPEG2000 algorithm. Several entropy chains must therefore be implemented in parallel to balance the IP performance and to allow the entropy processing to keep up with the wavelet throughput.
The number of implemented chains is selected during the IP synthesis process. Each entropy chain processes a code block independently from neighboring chains.
The Tile Splitter module is responsible for arbitrating between the available chains, dispatching the various code blocks to be encoded. The Tile Splitter stores the code blocks into local Code Block Buffers.
The Code Block Buffers represent an additional pipeline stage in the IP at code block level. Moreover they uncouple the pixel-based processing part (i.e. DWT, Quantizer, Tile Splitter) and the bit-serial processing part (i.e. Modeler and Arithmetic Encoder).
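The splitting and dispatching role of the Tile Splitter can be modelled as follows. The arbitration policy of the actual core is not described here, so the sketch assumes a simple "first chain to become free" rule and a cost of one time unit per sample; both are assumptions made for illustration, as are the function names.

```python
def code_blocks(width, height, size=32):
    """Yield (x, y, w, h) rectangles covering a subband with code blocks."""
    for y in range(0, height, size):
        for x in range(0, width, size):
            yield x, y, min(size, width - x), min(size, height - y)


def dispatch(blocks, num_chains):
    """Assign each code block to the entropy chain that becomes free first."""
    busy_until = [0] * num_chains
    assignment = []
    for block in blocks:
        chain = min(range(num_chains), key=busy_until.__getitem__)
        _, _, w, h = block
        busy_until[chain] += w * h      # assumed cost: one unit per sample
        assignment.append((block, chain))
    return assignment, max(busy_until)


if __name__ == "__main__":
    blocks = list(code_blocks(64, 64))
    for chains in (1, 2, 4):
        _, finish = dispatch(blocks, chains)
        print(chains, "chain(s): finished after", finish, "units")
```

In this model the completion time drops roughly in proportion to the number of chains, as long as there are enough code blocks to keep every chain busy.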
Modeler and Arithmetic Encoder
Each Modeler fetches the code block available from its Code Block Buffer. It performs the first part of the entropy encoding by scanning the bit planes of the code block and providing bits to the Arithmetic Encoder, together with context information. It also computes the distortion metrics that are made available at the Compressed Interface for use by the Tier-2 part of the JPEG2000 algorithm. These are based on MSE criteria.
The Arithmetic Encoder processes the bits and contexts and generates the stream available at the Compressed Interface.
Due to their bit-serial structure, the Modeler and the Arithmetic Encoder are the slowest processing stages of the JPEG2000 algorithm. Consequently, both modules were heavily optimized. This resulted in a significant improvement of their throughput, allowing their cost-efficient integration in the JPEG2000 IP cores.
Host Interface Module
This module allows interfacing the core to a CPU: it contains configuration registers for the various modules and gives status information about the encoding progress.
The Host Interface module also features a separate Command and Control Interface that allows fast control of the IP core with minimal or no CPU intervention. Together with the simplicity of the CPU Interface, the Command and Control Interface offers the opportunity to build a system where the BA111JPEG2000D and BA112JPEG2000E cores are not directly connected to a host CPU and are driven by a small amount of logic. This increases the integration flexibility of the IP cores.
PERFORMANCE RESULTS
The IP cores feature two asynchronous clock domains for easier integration: the frequency of the pixel interface can differ from the frequency of the compressed data interface. A common clock (Clk1) drives the wavelet transform, Quantizer and Tile Splitter while another clock (Clk2) drives the Modelers and Arithmetic Encoders. This results in a better balance of the IP with respect to the overall performance. Indeed, the pixel-based stages (i.e. DWT, Quantization and Tile Splitting) can be optimized separately from the bit-serial stages (i.e. Modelling and Arithmetic Encoding). Together with the configuration of the number of entropy channels, this allows maximizing the IP throughput.
Table 1 gives implementation results for the BA112JPEG2000E and BA111JPEG2000D IP cores. Two configurations are reported: a low-cost one-entropy channel version and a high-end eight-entropy channel version for both FPGA and ASIC technologies, able to target from VGA 640x480x60Hz (RGB 4:4:4) to HDTV 1280x720x50Hz (YUV 4:2:2).
Table 1: Performance Summary
Device | Logic | Clk1 (MHz) | Clk2 (MHz) | Needed Resources | Throughput (Msamples/s) 2) |
BA112JPEG2000E, 8-channel configuration |
UMC 0.18µm 1) | 190 kgates | 143 | 107 | 418 kbits + 512 kbits (tile buffer) | 92.4 |
Altera EP1S30FC780C5 | 23397 LE | 116 | 79 | 2 MRAM, 124 M4K, 22 DSP Mult | 68.2 |
Xilinx XC2V3000-6 | 12781 Slices | 110 | 88 | 66 RAMB16, 9 MULT18x18 | 76.0 |
BA112JPEG2000E, 1-channel configuration |
UMC 0.18µm 1) | 88 kgates | 152 | 111 | 150 kbits + 512 kbits (tile buffer) | 12.0 |
Altera EP1S20FC484C5 | 9397 LE | 113 | 80 | 2 MRAM, 40 M4K, 15 DSP Mult | 8.6 |
Xilinx XC2V1000-6 | 6300 Slices | 110 | 88 | 45 RAMB16, 2 MULT18x18 | 9.5 |
BA111JPEG2000D, 8-channel configuration |
UMC 0.18µm 1) | 153 kgates | 125 | 91 | 571 kbits + 512 kbits (tile buffer) | 82.8 |
Altera EP1S30FC780C5 | 23871 LE | 118 | 59 | 2 MRAM, 156 M4K, 13 DSP Mult | 54.3 |
Xilinx XC2V3000-6 | 13271 Slices | 110 | 73 | 74 RAMB16, 1 MULT18x18 | 67.2 |
BA111JPEG2000D, 1-channel configuration |
UMC 0.18µm 1) | 77 kgates | 143 | 91 | 170 kbits + 512 kbits (tile buffer) | 10.3 |
Altera EP1S20FC484C5 | 8599 LE | 113 | 60 | 2 MRAM, 44 M4K, 13 DSP Mult | 6.9 |
Xilinx XC2V1500-6 | 6300 Slices | 110 | 73 | 46 RAMB16, 1 MULT18x18 | 8.4 |
1) Worst case condition, foundry library (fsa0a_a), Tj=125°C, pre-layout timing
2) Results for typical lossy compression: 8-bit pixels, compression ratio of 4 after DWT+Q+arithmetic, and 5 non-zero least significant bit planes per code block before arithmetic encoding
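To relate the Msamples/s figures of Table 1 to the video formats mentioned above, the sample rate of a format can be computed from its resolution, frame rate and average number of samples per pixel. The sketch below is simple arithmetic under the assumption that only active pixels are counted (no blanking); the function name is hypothetical.

```python
def sample_rate_msamples(width, height, fps, samples_per_pixel):
    """Active-video sample rate in Msamples/s."""
    return width * height * fps * samples_per_pixel / 1e6


# RGB 4:4:4 carries 3 samples per pixel; YUV 4:2:2 carries 2 on average
VGA_RGB = sample_rate_msamples(640, 480, 60, 3)
HDTV_YUV422 = sample_rate_msamples(1280, 720, 50, 2)

if __name__ == "__main__":
    print(f"VGA 640x480x60Hz RGB 4:4:4:    {VGA_RGB:.3f} Msamples/s")
    print(f"HDTV 1280x720x50Hz YUV 4:2:2: {HDTV_YUV422:.3f} Msamples/s")
    # compare with the 8-channel BA112JPEG2000E figure on UMC 0.18µm (Table 1)
    print("fits in 92.4 Msamples/s:", HDTV_YUV422 <= 92.4)
```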
CONCLUSION
BARCO SILEX introduces its BA112JPEG2000E and BA111JPEG2000D IP cores targeted at high-speed JPEG2000 encoding and decoding. These cores give access to the broad capabilities of the JPEG2000 Standard.
Indeed, this Standard defines an algorithm able to offer a wide spectrum of features such as a progressive bitstream, precise rate control, region of interest, and high-quality lossless and lossy compression. This rich set of advantages makes JPEG2000 an important player in the compression world.
However due to its computational complexity, hardware platforms are required to reach high-speed applications characterized by timings compatible with real-time video encoding. BARCO SILEX takes up this challenge with its BA111JPEG2000D and BA112JPEG2000E IP cores. These are highly optimized solutions for ASIC and FPGA technologies, acting as efficient accelerators in a complete JPEG2000 compression system.
For more information about BARCO SILEX IP cores visit www.barco-silex.com.
REFERENCES
[1] ISO/IEC 15444-1, Information Technology: JPEG 2000 image coding system, Part 1: Core coding system.
[2] S. Mallat, A Theory for Multiresolution Signal Decomposition: The Wavelet Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 674-693, July 1989.
[3] W. Sweldens, Wavelets and the Lifting Scheme: A 5 Minute Tour, Z. Angew. Math. Mech., vol. 76, no. 2, pp. 41-44, 1996.