Encoding H.264 without External DRAM : Power and Quality Comparison
By Vincenzo Liguori and Kevin Wong - Ocean Logic Pty Ltd
This article compares the power consumption and the quality of the generated bitstream of two Ocean Logic H.264 encoder cores : OL_H264E, which keeps its reference frame store in external DRAM, and OL_H264E_CFS, which uses a Compressed Frame Store (CFS) technology that does not need external DRAM.
Power Estimation
The power consumption comparison is made with both cores encoding 1080p video at 30 fps in real time. The clock frequency is approximately 250 MHz for both cores. The power estimation tool we used was InCyte, a freely downloadable tool. The free version of this tool has a variety of limitations. However, it is still useful for a first-order evaluation and, more importantly, for a relative comparison.
Both OL_H264E and OL_H264ECFS designs are fully synchronous with no transparent latches or gated clocks. Only the rising edge of the clock is used. For OL_H264E there is the option of running the external memory interface at a different frequency from the main core. Both cores have very low latency (~16 video lines) and accurate CBR.
OL_H264ECFS is actually HRD compliant. Neither core requires any CPU assistance during encoding.
OL_H264E is a well-established core, available since late 2005 and fully proven in both ASIC and FPGA. Because of its small size and the fact that it only requires a single DRAM chip with a 16-bit data bus, it is already a very low power core. Its small footprint and low power make it ideal for embedded applications and handheld devices.
The table below shows the resources required by the two cores. Please note that one gate is equivalent to a two input NAND gate. The internal memories are a mixture of single and dual port memories. OL_H264E can support DDR, DDR2 and DDR3 memories as well as old SDRAM and SRAM (through a 32 bit bus).
Core Name | Gate Count | Internal Memories | Reference Frame Memory |
OL_H264E | 195 K gates | 133 Kbits | Single DRAM chip, min 64 Mbits, 16-bit DDR data bus |
OL_H264ECFS | 280 K gates | 217 Kbits | 2-2.5 Mbits internal SRAM for a good-quality 1080p frame |
Table 1: OL_H264E and OL_H264ECFS resource requirements.
OL_H264ECFS is a newer core, released in 2011 and it is fully proven in FPGA. It is based on a new Compressed Frame Store (CFS) technology whose main advantages are :
- High compression 8-16:1, depending on the desired quality
- The bitstream is fully compatible with existing decoders with no error/drift. This means that the reconstructed reference frame is bit-for-bit identical in the encoder and in a standard, unmodified third-party decoder
- The technology is not restricted to the H.264 standard and it could be potentially applied to other video compression algorithms
The technology is patent pending. A favourable report has been received by the Korean Search Authority and this has allowed us to petition the US patent office to make the application special.
Unlike, for example, the Texas Instruments proposal (1) for H.265, which compresses the reference frame store only 2:1, our technology allows us to compress the reference frame 8-16:1 so that it can fit in an internal memory in the SoC. It also does not require modifications to existing standards and decoders.
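The effect of the compression ratio on the size of the frame store can be checked with simple arithmetic. The sketch below is only illustrative: it assumes 8 bits per sample, an unpadded 1920x1080 YUV 4:2:0 picture and 1 Mbit = 1,000,000 bits, none of which is stated in the article, and the function names are hypothetical.

```python
# Sketch: raw size of one YUV 4:2:0 frame and its size at several compression ratios.
# Assumptions (not stated in the article): 8 bits per sample, an unpadded
# 1920x1080 picture, and 1 Mbit = 1,000,000 bits.


def raw_frame_bits(width: int, height: int, bits_per_sample: int = 8) -> int:
    """Bits in one YUV 4:2:0 frame: one luma plane plus two quarter-size chroma planes."""
    luma_samples = width * height
    chroma_samples = 2 * (width // 2) * (height // 2)
    return (luma_samples + chroma_samples) * bits_per_sample


def compressed_mbits(raw_bits: int, ratio: float) -> float:
    """Frame store size in Mbit after compressing raw_bits by ratio:1."""
    return raw_bits / ratio / 1_000_000


RAW_1080P_BITS = raw_frame_bits(1920, 1080)

if __name__ == "__main__":
    print(f"raw 1080p frame: {RAW_1080P_BITS / 1_000_000:.2f} Mbit")
    for ratio in (2, 8, 16):
        print(f"{ratio:>2}:1 -> {compressed_mbits(RAW_1080P_BITS, ratio):.2f} Mbit")
```

Under these assumptions a raw frame is about 24.9 Mbit, so 2:1 compression still leaves roughly 12.4 Mbit, while 8:1 and 16:1 leave about 3.1 Mbit and 1.6 Mbit. The 2-2.5 Mbit figure quoted in Table 1 would then correspond to a ratio of roughly 10:1 to 12:1.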
For the comparison in InCyte, a 65 nm process was chosen. This is a generic process in the free version of the tool. One of the limitations was the lack of register files to map small memories to. Instead these were mapped to larger single or dual port memories. Internal memories seemed to contribute very little to the power consumed by the cores, at least compared to the logic. For OL_H264ECFS, a large 2 Mbit memory was also included for the reference frame store. Since the data in this memory is heavily compressed, it is rarely accessed. The very small level of activity of this memory was reflected in the InCyte settings.
Of all the I/Os, only the power used by the DRAM interface was estimated. This is for three reasons. The first is that the processor, compressed data output, and video input interfaces are normally embedded in a design and do not contribute to I/O power directly. The second is that these three interfaces are virtually identical in both cores, with very similar consumption that cancels out in a comparison. Finally, the I/O power contribution of the processor and compressed data output interfaces would be minimal because the former is only used to set up the core at the start and the latter outputs data very infrequently. These choices explain the 0 mW for the OL_H264ECFS I/Os.
The I/O power for OL_H264E was estimated assuming a 4 pF load for the data bus and a 0.25 pF load for the address and control lines. This information comes from Micron Technology's 256 Mbit DDR2 datasheet (2). The power consumed by the memory chip itself was estimated with a spreadsheet, also from Micron Technology (3), using realistic data access patterns.
The table below summarises the power consumption of both cores :
Core Name | Core | I/O | DRAM | Total |
OL_H264E | 120.57 mW | 550.21 mW | 347 mW | 1017.78 mW |
OL_H264ECFS | 171.04 mW | 0 mW | 0 mW | 171.04 mW |
Table 2: OL_H264E + DRAM chip and OL_H264ECFS Power Estimation.
The Core power consumption includes an estimate for all the internal memories as well. For OL_H264ECFS it also includes 2 Mbits of compressed reference frame store RAM.
Both cores have very low power consumption. However, the lack of the external DRAM chip reduces the total power by a factor of almost 6 in the case of OL_H264ECFS (171.04 mW versus 1017.78 mW). This is believed to be one of the lowest power figures in existence for full HD encoding in real time, and it opens many new possibilities in the field of handheld and battery-powered video encoding.
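The totals and the ratio between them can be reproduced directly from Table 2. The short sketch below only restates the table's figures; the dictionary layout and function names are hypothetical.

```python
# Sketch: recompute the Table 2 totals and the resulting power ratios.
# All figures are in mW and are copied from Table 2.

POWER_MW = {
    "OL_H264E": {"core": 120.57, "io": 550.21, "dram": 347.0},
    "OL_H264ECFS": {"core": 171.04, "io": 0.0, "dram": 0.0},
}


def total_mw(core_name: str) -> float:
    """Sum of core, I/O and DRAM power for one encoder core."""
    return sum(POWER_MW[core_name].values())


def reduction_factor(baseline: str, candidate: str) -> float:
    """How many times lower the candidate's total power is than the baseline's."""
    return total_mw(baseline) / total_mw(candidate)


if __name__ == "__main__":
    for name in POWER_MW:
        print(f"{name}: {total_mw(name):.2f} mW")
    print(f"total power ratio: {reduction_factor('OL_H264E', 'OL_H264ECFS'):.2f}x")
    core_ratio = POWER_MW["OL_H264ECFS"]["core"] / POWER_MW["OL_H264E"]["core"]
    print(f"core-only power ratio: {core_ratio:.2f}x")
```

The ratio of the totals is about 5.95, even though the OL_H264ECFS core on its own consumes roughly 42% more than the OL_H264E core. The saving comes entirely from removing the DRAM chip and its interface.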
Quality comparison
This section presents a simple quality comparison between OL_H264E and OL_H264ECFS. Two YUV 4:2:0 video sequences are used : “Jets”, a rather static 720p sequence of 100 frames, and “Pedestrian Area”, a 1080p sequence of 375 frames that contains substantially more motion, with a number of people continuously crossing the screen from different directions.
The results from these two sequences are reasonably representative of larger tests that have been performed.
Illustration 2: First frame of the sequence "Pedestrian Area".
Both sequences were encoded using the bit accurate C models available for each core with 9 P frames between I frames. The sequence “Jets” was encoded at 5, 6, 7, 8 and 9 Mbps whereas “Pedestrian Area” was encoded at 10, 12, 14, 16 and 18 Mbps. Decoding was done with the reference JM software decoder that also calculated the average PSNR.
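For readers unfamiliar with the metric, the following is a minimal sketch of how PSNR is conventionally computed for one 8-bit plane (Y, U or V). It is not the JM code. It assumes 8-bit samples (peak value 255) and that the sequence average is the arithmetic mean of the per-frame values; the function names are hypothetical.

```python
# Sketch: PSNR of one 8-bit plane and the mean PSNR over a sequence.
# Assumptions: 8-bit samples (peak value 255); the sequence average is the
# arithmetic mean of per-frame PSNR values. This is not the JM implementation.
import math
from typing import Sequence


def psnr_8bit(reference: Sequence[int], decoded: Sequence[int]) -> float:
    """PSNR in dB between two equally sized planes of 8-bit samples."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("planes must be non-empty and of equal size")
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical planes
    return 10.0 * math.log10(255.0 ** 2 / mse)


def average_psnr(frame_pairs: Sequence[tuple]) -> float:
    """Mean of per-frame PSNR over (reference, decoded) plane pairs."""
    values = [psnr_8bit(ref, dec) for ref, dec in frame_pairs]
    return sum(values) / len(values)


if __name__ == "__main__":
    original = [16, 32, 64, 128]
    decoded = [17, 31, 65, 127]  # every sample off by one, so MSE = 1
    print(f"{psnr_8bit(original, decoded):.2f} dB")
```

A plane in which every sample differs by one level has an MSE of 1 and therefore a PSNR of about 48.13 dB. Each doubling of the MSE lowers the PSNR by about 3 dB.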
The tables below show resulting average PSNR as a function of the bitrate for the Y, U and V components. The average Y PSNR is also plotted as function of the bitrate.
From these tables it is easy to see that both cores follow the target bitrate very accurately. This can have a negative effect on the PSNR, which can often be improved if accurate bitrate control is not a priority.
Table 3: Average PSNR for "Jets" encoded with OL_H264E and OL_H264ECFS.
Illustration 3: Y Average PSNR for "Jets" encoded with OL_H264E and OL_H264ECFS.
Table 4: Average PSNR for "Pedestrian Area" encoded with OL_H264E and OL_H264ECFS.
Illustration 4: Y Average PSNR for "Pedestrian Area" encoded with OL_H264E and OL_H264ECFS.
The first thing that can be noticed is that, for low motion scenes, the OL_H264ECFS performance is quite close to OL_H264E. For higher motion scenes, a penalty of 20-30% on the bitrate can be expected, depending on the particular sequence. In general, at higher bitrates, the difference also tends to diminish.
Conclusion
The extremely low power consumption of OL_H264ECFS allows a great reduction in weight and cost in any handheld or portable device because a much smaller battery can be used. Even considering the larger file size for a given quality, the cost of a larger flash memory is small compared to the savings of having a smaller battery.
The absence of a DRAM chip is a potential reduction in cost (also because of the simpler PCB layout), but it has to be balanced against the higher cost of a larger chip that contains the internal CFS memory. However, the lack of the external DRAM chip will also result in higher reliability, since the design no longer needs physical connections to the external memory.
Lower power and a more compact, more reliable design are an ideal combination for mass-produced electronic devices.
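To give a feel for the storage side of this trade-off, the sketch below estimates the size of a recording with and without the 20-30% bitrate penalty quoted for high-motion scenes. The one-hour duration and the choice of 10 Mbps as the baseline are illustrative assumptions, container overhead and audio are ignored, and 1 GB is taken as 10^9 bytes.

```python
# Sketch: recording size with and without a bitrate penalty.
# Assumptions (illustrative only): one hour of video, a 10 Mbps baseline,
# no container or audio overhead, 1 GB = 1e9 bytes.


def recording_size_gb(bitrate_mbps: float, seconds: float, penalty: float = 0.0) -> float:
    """Size in GB of a constant-bitrate recording; penalty is a fraction (0.2 = +20%)."""
    total_bits = bitrate_mbps * 1e6 * (1.0 + penalty) * seconds
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    one_hour = 3600
    for penalty in (0.0, 0.2, 0.3):
        size = recording_size_gb(10, one_hour, penalty)
        print(f"penalty {penalty:.0%}: {size:.2f} GB")
```

With these assumptions, one hour at 10 Mbps takes 4.5 GB, growing to 5.4 GB and 5.85 GB at a 20% and 30% penalty respectively.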
References
1. See page 14 of the presentation on http://wftp3.itu.int/av-arch/jctvcsite/2010_04_A_Dresden/JCTVC-A101.pdf
2. Micron Technology DDR2 256 Mbit DRAM : http://download.micron.com/pdf/datasheets/dram/ddr2/256MbDDR2.pdf
3. Micron Technology DDR2 DRAM power estimation spreadsheet can be downloaded from : http://www.micron.com/support/dram/power-calc