The Power and Bandwidth Advantage of an H.264 IP Core with 8-16:1 Compressed Reference Frame Store
Building a new class of H.264 devices without external DRAM
Power is an increasingly important consideration for the majority of system designers. This is particularly true in the case of small handheld consumer devices such as cameras, camcorders and mobile phones. In such devices, video compression technology is used that relies on power hungry DRAMs to store the reference frames during the encoding and decoding process.
Reference Frame Compression – A Brief Overview
Figure 1 shows the dataflow for a typical video encoder algorithm. Without loss of generality, the discussion is applicable to many other video compression algorithms, as they are based on very similar concepts. The letters T and Q represent, respectively, the transform and quantization stages.
Regardless of the particular video compression algorithm, after a frame has been encoded, a reference frame is reconstructed and stored. Given its size, it is generally stored in external memory.
Figure 1: Video Compressor Dataflow.
An obvious solution to the problem presented by the reference frame storage size and bandwidth is to use a lossy image compression/decompression stage between the memory used to store the reference frame, and the compression engine as shown in Figure 2.
Figure 2: Video Compressor with Lossily Compressed Framestore.
Indeed, a variety of such schemes have been proposed with various trade-offs. For example, lossy reference frame compression schemes have been proposed that are well suited to the macroblocks of pixels used by most video compression algorithms. Such algorithms often have other desirable properties, such as random access to the pixel data and a fixed compression rate.
However, the main disadvantage is clear: when a lossy compression stage is present between the encoder and the reference frame storage, the extra distortion it introduces is not matched by the decoder.
This is because a general third party decoder will not have a matching lossy compression scheme between itself and the external storage.
This results in a divergence or drift between the encoder and decoder reference frames that is cumulative with each frame predicted from the previous one, making things progressively worse.
Even a single lossy compression and decompression pass can introduce unpleasant noise at the decoder output. In general, the higher the compression of the reference frame storage, the higher the drift or error.
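The accumulation of drift can be illustrated with a toy model. The sketch below is not an H.264 encoder: frames are short lists of random pixel values, `quantize` stands in for the T and Q stages, and `lossy_store` is a hypothetical frame store that simply truncates each stored pixel to a multiple of 8. All names and parameters are assumptions chosen for illustration. The encoder predicts from the lossily stored reference, while the third party decoder predicts from an exact one.

```python
import random

def lossy_store(pixels, step=8):
    """Hypothetical lossy frame store: truncate each pixel to a multiple of `step`."""
    return [(p // step) * step for p in pixels]

def quantize(residual, q=4):
    """Toy residual quantizer standing in for the T and Q stages."""
    return [int(round(r / q)) * q for r in residual]

def simulate_drift(num_frames=10, width=64, seed=0):
    """Return the mean absolute encoder/decoder reference mismatch after each frame."""
    rng = random.Random(seed)
    enc_ref = [0] * width  # encoder reference, read back through the lossy store
    dec_ref = [0] * width  # decoder reference, stored exactly
    drift = []
    for _ in range(num_frames):
        frame = [rng.randrange(256) for _ in range(width)]
        enc_pred = lossy_store(enc_ref)                # encoder sees a degraded reference
        coded = quantize([f - p for f, p in zip(frame, enc_pred)])
        enc_ref = [p + c for p, c in zip(enc_pred, coded)]
        dec_ref = [p + c for p, c in zip(dec_ref, coded)]  # decoder has no lossy store
        drift.append(sum(abs(d - e) for d, e in zip(dec_ref, enc_ref)) / width)
    return drift

for frame_index, mismatch in enumerate(simulate_drift()):
    print(frame_index, mismatch)
```

In this model the mismatch after each frame is the previous mismatch plus the error added by the lossy store, so it never shrinks. Truncation is a deliberately biased choice that makes the growth easy to see; a real scheme would behave differently in detail.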
There are proposals to include reference compression schemes but they require a change in the existing video compression standards.
A possible solution to this problem is to substitute the lossy pixel compression and decompression modules with lossless ones. While this can certainly work, lossless compression at the pixel level achieves quite low compression ratios, generally less than 2:1. It is also quite unpredictable, and such a small and unpredictable reduction in storage size is not likely to result in the elimination of the external DRAM memory.
So, it would appear that the choice is between highly compressed reference frame storage with a large error/drift, or poor, unpredictable compression with no error/drift.
New Compressed Frame Store Technology
Ocean Logic has introduced an H.264 Baseline encoder and limited decoder based on a new Compressed Frame Store (CFS) technology whose main features are:
- High compression of 8-16:1, depending on the desired quality.
- The bitstream is fully compatible with existing decoders, with no error/drift.
- The technology is not restricted to the H.264 standard and could potentially be applied to other video compression algorithms.
This technology makes a highly compressed reference frame store practical, which in turn allows the CFS memory to be integrated in a SoC and the external DRAM to be eliminated.
Benefit Analysis
A CFS technology that allows both high compression and compatibility with existing decoders brings a variety of benefits.
For example, in the case of H.264, a compressed reference frame with an average size of ~250 bits per macroblock will give fairly good visual quality. The same macroblock, uncompressed, as 4:2:0 video is 384 bytes.
This is a compression ratio of ~12:1 with no drift between encoder and decoder. Also consider that two full video frames are often stored: one as the current reference and the other as the future reference. In this case a compression of over 24:1 is achieved.
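The arithmetic behind these ratios can be written out as a short sketch. It assumes 8-bit samples, and it reads the 24:1 figure as two uncompressed frames being replaced by a single compressed one; the `frames_replaced` parameter is a name introduced here for that reading.

```python
# Samples in one 16x16 macroblock of 8-bit 4:2:0 video.
LUMA_SAMPLES = 16 * 16          # one Y sample per pixel
CHROMA_SAMPLES = 2 * (8 * 8)    # Cb and Cr, each subsampled 2x2
UNCOMPRESSED_BYTES = LUMA_SAMPLES + CHROMA_SAMPLES  # one byte per sample
UNCOMPRESSED_BITS = UNCOMPRESSED_BYTES * 8

def compression_ratio(compressed_bits_per_mb, frames_replaced=1):
    """Ratio of uncompressed storage to compressed storage per macroblock."""
    return frames_replaced * UNCOMPRESSED_BITS / compressed_bits_per_mb

print(UNCOMPRESSED_BYTES)                         # bytes per uncompressed macroblock
print(compression_ratio(250))                     # single reference frame
print(compression_ratio(250, frames_replaced=2))  # two uncompressed frames vs one CFS
```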
The table below details the approximate uncompressed reference frame storage requirements for various video resolutions, compared to that required for a compressed reference frame at ~250 bits per macroblock.
Resolution | VGA 640x480 | D1 720x480 | 720p 1280x720 | 1080p 1920x1080 |
Uncompressed size | 450 | 506.25 | 1350 | 3060 |
Compressed size | ~38.4 | ~43.2 | ~115.5 | ~261.2 |
Table 1: Uncompressed and compressed reference frame sizes in Kbytes (1 Kbyte = 1024 bytes).
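The uncompressed row of the table can be reproduced from the macroblock count, as in the sketch below. It assumes frame dimensions are rounded up to whole 16x16 macroblocks, so 1080 lines become 68 macroblock rows. The compressed estimate uses exactly 250 bits per macroblock, which comes out a few percent below the approximate figures in the table (about 36.6 Kbytes for VGA, for instance); the source does not say how its compressed figures were rounded.

```python
import math

MB_BYTES_UNCOMPRESSED = 384  # 16x16 macroblock, 8-bit 4:2:0
MB_BITS_COMPRESSED = 250     # average figure quoted in the text

def macroblocks(width, height):
    """Number of 16x16 macroblocks, rounding partial blocks up."""
    return math.ceil(width / 16) * math.ceil(height / 16)

def uncompressed_kbytes(width, height):
    return macroblocks(width, height) * MB_BYTES_UNCOMPRESSED / 1024

def compressed_kbytes(width, height, bits_per_mb=MB_BITS_COMPRESSED):
    return macroblocks(width, height) * bits_per_mb / 8 / 1024

RESOLUTIONS = {
    "VGA": (640, 480),
    "D1": (720, 480),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
}

for name, (w, h) in RESOLUTIONS.items():
    print(name, uncompressed_kbytes(w, h), round(compressed_kbytes(w, h), 1))
```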
Higher compression ratios are possible with lower but still perfectly acceptable quality in the output bitstream. The size of a compressed reference frame can be kept approximately constant using a bitrate control mechanism. The high compression of the reference frame means that it is now possible to integrate the required memory directly on the same chip as the encoder core. Even for full HD (1080p), reducing the reference frame storage from over 3 Mbytes (6 Mbytes in the case of two frames) to around 2 Mbits makes it possible to eliminate an external memory chip (and its associated costs) and integrate the storage on the SoC.
The elimination of an external memory chip (normally fast DDR/DDR2/DDR3 DRAM) comes with clear benefits. Apart from the obvious elimination of the cost of the DRAM chip itself and the associated circuit board area, there is a large reduction in power. The power savings come from the absence of the power hungry, high speed interface to the DRAM, as well as from no longer powering the unused portions of a DRAM chip. In fact, even an uncompressed frame store for 1080p does not use more than ~50 Mbits (out of a 256 Mbit DRAM chip).
An additional less obvious benefit comes from the reduced memory bandwidth. While encoding a 1080p@30 video stream, an Ocean Logic H.264 core is estimated to require an average peak bandwidth of ~750 Mbytes/s vs. only about 50 Mbytes/s when using a compressed frame store in the situation described above. Such low bandwidth means that an internal memory can be easily shared with other processes.
Possible Applications
A low power video compression device with a small memory footprint is extremely useful, especially in consumer handheld devices but also in challenging environments.
This section briefly outlines a series of potential applications of the technology in video surveillance, remote sensing, consumer cameras and automotive systems, and in encoding a large number of simultaneous video channels.
The first example outlined here is of a single device that compresses the video input from a CMOS sensor and either makes it available through a variety of standard interfaces, such as Ethernet and USB, or stores it to an SD card.
Figure 3: H.264 encoder using compressed reference frame store on chip.
The camera processor processes the incoming raw Bayer data by performing Bayer interpolation, white balance, RGB to YUV conversion, normalization, gamma correction and sharpening. This produces a 4:2:0 YUV video stream suitable for the H.264 encoder using the compressed reference frame store. Optional elements can include encryption and a microcontroller sharing the compressed frame store memory with the H.264 core. Note the absence of external DRAM. Ocean Logic can provide the camera processor as well as the encryption module (AES, DES).
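As an illustration of the last step of that pipeline, the sketch below converts RGB pixels to 4:2:0 YCbCr. The article does not say which conversion matrix or subsampling filter the camera processor uses; the full-range BT.601 coefficients and the simple 2x2 chroma average here are assumptions, and the function names are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 conversion (assumed; the article does not specify one)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return tuple(min(255, max(0, round(v))) for v in (y, cb, cr))

def to_yuv420(rgb):
    """rgb: list of rows of (r, g, b) tuples; width and height must be even.

    Returns (Y, Cb, Cr) planes, with each chroma plane averaged over 2x2 blocks.
    """
    height, width = len(rgb), len(rgb[0])
    ycc = [[rgb_to_ycbcr(*px) for px in row] for row in rgb]
    y_plane = [[px[0] for px in row] for row in ycc]

    def subsample(channel):
        return [
            [
                round(sum(ycc[y + dy][x + dx][channel]
                          for dy in (0, 1) for dx in (0, 1)) / 4)
                for x in range(0, width, 2)
            ]
            for y in range(0, height, 2)
        ]

    return y_plane, subsample(1), subsample(2)

# One 16x16 macroblock of white pixels.
white_block = [[(255, 255, 255)] * 16 for _ in range(16)]
y_plane, cb_plane, cr_plane = to_yuv420(white_block)
total_samples = sum(len(row) for plane in (y_plane, cb_plane, cr_plane) for row in plane)
print(total_samples)
```

The sample count for one macroblock matches the 384 bytes quoted earlier for uncompressed 4:2:0 video.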
Applications for such a device include:
- Video surveillance. Here the bitstream can be sent encrypted through Ethernet or WiFi.
- Low cost HD H.264 video camera/webcam with no DRAM.
- Medical. Very low power, very small and non-invasive imaging devices, such as endoscopic pills, that require no DRAM and transmit video wirelessly.
- Automotive. A very low power device that compresses video sources and delivers them throughout the car over a serial link.
The availability of a CFS H.264 decoder, even if limited to decoding the bitstream of the existing encoder, greatly extends the array of possible applications to include all those where an encoder/decoder pair, in a closed system, is advantageous.
A limited H.264 decoder that uses CFS technology also complements perfectly the applications listed above. For example, in the case of automotive applications, the compressed video sources originating from various devices (such as rear vision or infrared cameras) can be transported through a network throughout the car to a small, low power decoding device (with no DRAM) to be played to the driver. In video surveillance, a limited decoder with no DRAM can be part of a DVR system where video streams are captured and played by the user.
The figure below shows an example where the encoder/decoder pair is used for a low power HD video recording and playing device.
Figure 4: H.264 Encoder/Decoder pair in a camcorder system.
Normally, the user of a consumer camcorder would require recording of the video and then playing it back to see the result. This means that the encoder and decoder are never required to function simultaneously. This also means that both encoder and decoder can share the compressed frame storage memory.
Because the user of such a camcorder only needs to play movies that have been encoded by their camera, the aforementioned limited H.264 decoder is perfectly adequate. Its smaller size compared to a fully compliant decoder actually saves gates and power in the design.
Applications of such a "camcorder SoC" would include:
- 720p/1080p camcorder function for mobile phones
- Consumer camcorder
- Consumer still camera with HD movie capabilities
All these applications would greatly benefit from the absence of the DRAM and the much lower power consumption. Unlike still picture capture, the power requirements for video capture are sustained, and using CFS technology would mean not only longer lasting batteries but also smaller and ultimately cheaper ones.
So far this article has focused on integrating the small CFS memory on chip. Another possibility is to use the CFS technology to allow many H.264 encoder (or decoder) cores to share the same DRAM chip through a small bus.
A possible example would be 16 H.264 encoder cores sharing a single DDR/DDR2/DDR3 DRAM chip. Clocked at ~250 MHz, each core will be able to comfortably encode 1080p@30 video. Since the required bandwidth with the compressed frame store for each core is only ~50 Mbytes/s, they could all theoretically share a single DDR/DDR2/DDR3 DRAM chip through a 16 bit data bus. Even if a 32 bit data bus turns out to be required in practice, the result is still impressive: a single chip containing 16 H.264 encoders capable of encoding 16 1080p@30 video channels simultaneously.
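A back-of-the-envelope check of this claim is sketched below. The article does not name a DRAM speed grade, so the 800 mega-transfers per second used here (a DDR2-800 class part) is purely an assumption, and the calculation ignores refresh, bank conflicts and other overheads that reduce usable bandwidth below the theoretical peak.

```python
def bus_peak_mbytes_per_s(bus_bits, mega_transfers_per_s):
    """Theoretical peak bandwidth of a data bus in Mbytes/s."""
    return bus_bits / 8 * mega_transfers_per_s

def bus_utilisation(cores, mbytes_per_core, bus_bits, mega_transfers_per_s):
    """Fraction of the theoretical peak consumed by all cores together."""
    demand = cores * mbytes_per_core
    return demand / bus_peak_mbytes_per_s(bus_bits, mega_transfers_per_s)

ASSUMED_MT_PER_S = 800  # assumption: DDR2-800 class device, not from the article

for bus_bits in (16, 32):
    with_cfs = bus_utilisation(16, 50, bus_bits, ASSUMED_MT_PER_S)
    without_cfs = bus_utilisation(16, 750, bus_bits, ASSUMED_MT_PER_S)
    print(bus_bits, with_cfs, without_cfs)
```

Under these assumptions the 16 cores with CFS ask for half of the 16 bit bus's theoretical peak, which is why a 32 bit bus may be the safer practical choice, while the same cores without CFS would exceed the peak of either bus several times over.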
As we have already demonstrated for our existing H.264 IP core, a multi-channel encoder with CFS should also be possible. Each such core running at 250 MHz would be capable of encoding 6 D1@30 (720x480 @ 30 fps) channels, for a total capability of up to 96 D1@30 simultaneously encoded video channels. This of course does not take into account the challenge of transporting and/or buffering the 16 1080p@30 or the 96 D1@30 video streams to the device, but such problems are solvable, depending on the nature of the uncompressed video input.
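The channel counts can be cross-checked against macroblock throughput. The sketch below is only a rough consistency check, not Ocean Logic's sizing method: it assumes that a core's capacity scales with the number of macroblocks it processes per second and that a core sized for 1080p@30 can spend the same budget on several smaller streams.

```python
import math

def macroblocks_per_second(width, height, fps):
    """Macroblock rate of one stream, rounding partial 16x16 blocks up."""
    return math.ceil(width / 16) * math.ceil(height / 16) * fps

CORE_CLOCK_HZ = 250_000_000  # clock figure quoted in the text
CORES = 16

hd_rate = macroblocks_per_second(1920, 1080, 30)
d1_rate = macroblocks_per_second(720, 480, 30)

d1_channels_per_core = hd_rate // d1_rate
cycles_per_macroblock = CORE_CLOCK_HZ / hd_rate

print(hd_rate, d1_rate)
print(d1_channels_per_core, CORES * d1_channels_per_core)
print(round(cycles_per_macroblock))
```

One 1080p@30 stream carries almost exactly the macroblock load of six D1@30 streams, which is consistent with the 6 channels per core and 96 channels in total quoted above.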
Conclusions
The power and bandwidth advantages of the CFS H.264 IP cores that allow perfect reconstruction without accumulated error or drift with third party decoders have been analyzed, and possible applications described.
This means that a new class of consumer products which require video compression are now possible with a much smaller physical and power footprint. The author wishes to thank Jonah Probell for reviewing this article.
Ocean Logic Pty Ltd