Revving video encoding on C64x/DM64x DSPs
EE Times
Cheng Peng (06/20/2005 9:00 AM EDT)
URL: http://www.eetimes.com/showArticle.jhtml?articleID=164900190
How does one improve video encoding performance on the Texas Instruments TMS320C64x/TMS320DM64x digital signal processor generation?

The conventional implementation of a video encoder (MPEG-2, MPEG-4, H.263) is based on macroblock-level processing. The video encoder fetches a new macroblock (MB) only after the current MB goes through all the processing steps. But this intuitive approach comes with two drawbacks:
- The overall code size of a video encoder is usually bigger than the Level 1 program cache (L1P) on a C64x/DM64x DSP. The code needs to be swapped between L1P and the Level 2 program cache (L2P) during every MB fetching period, causing a significant cache-miss penalty.
- It is not efficient for the enhanced DMA (EDMA) controller to transfer a small chunk of data such as a single MB from an external video frame memory to internal memory.

To avoid the huge cache-miss penalty and CPU stalling, the algorithm can be broken into three loops, each of them a separate module that fits into L1P. Instead of processing a single MB at a time in each loop, the module processes n macroblocks, called an MB strip. The size of a strip is restricted only by the size of the available Level 1 data cache (L1D). The bigger n is, the better EDMA performance we can expect for data throughput.

The three loops are:
- the MB encoding loop,
- the motion estimation loop, and
- the MB reconstruction loop.

As emphasized above, n MBs are fetched and go through one of the three processing loops together. For example, in the MB encoding loop, when n MBs are fetched into internal memory, they are put through a discrete cosine transform (DCT), quantized and entropy-coded. This set of macroblocks is not flushed out of L1D until the MB encoding loop has been completed. Corresponding programs, including DCT, quantization and variable-length coding kernels, are also kept in L1P until all n MBs are processed completely in this loop.
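The difference between the two loop orders can be illustrated with a toy model in C. This is only a sketch: it counts how often the running module changes, which is a rough stand-in for code being reloaded into L1P, and it does not simulate a real cache. The names `tracker`, `run_module`, `per_mb_order` and `strip_order` are hypothetical, and the order in which the three modules run is assumed for illustration.

```c
#include <assert.h>

/* The three modules named in the article; each is assumed to fit in L1P.
   The order used here (estimate, encode, reconstruct) is an assumption. */
enum module { MOD_MOTION_EST, MOD_ENCODE, MOD_RECON, MOD_COUNT };

/* Hypothetical tracker: counts how often the running module changes,
   a rough proxy for program code being brought into L1P again. */
struct tracker {
    int  current;   /* module most recently run, -1 if none yet */
    long switches;  /* number of module changes so far */
};

static void tracker_init(struct tracker *t)
{
    t->current = -1;
    t->switches = 0;
}

/* Stand-in for running one module's kernels on one macroblock. */
static void run_module(struct tracker *t, enum module m)
{
    if (t->current != (int)m) {
        t->switches++;
        t->current = (int)m;
    }
}

/* Conventional order: each MB passes through every module
   before the next MB is fetched. */
long per_mb_order(int total_mbs)
{
    struct tracker t;
    int mb, m;

    assert(total_mbs >= 0);
    tracker_init(&t);
    for (mb = 0; mb < total_mbs; mb++)
        for (m = 0; m < MOD_COUNT; m++)
            run_module(&t, (enum module)m);
    return t.switches;
}

/* Strip order: each module runs over a whole strip of n MBs
   before the next module starts. */
long strip_order(int total_mbs, int n)
{
    struct tracker t;
    int first, m, mb;

    assert(total_mbs >= 0 && n > 0);
    tracker_init(&t);
    for (first = 0; first < total_mbs; first += n) {
        int last = (total_mbs - first < n) ? total_mbs : first + n;
        for (m = 0; m < MOD_COUNT; m++)
            for (mb = first; mb < last; mb++)
                run_module(&t, (enum module)m);
    }
    return t.switches;
}
```

In this model, a row of 45 MBs (a 720-pixel-wide picture at 16 pixels per MB) gives `per_mb_order(45) == 135` module changes, while `strip_order(45, 9)` gives 15. The count falls in proportion to the strip length n, which is the effect the strip-based restructuring is after.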
A ping-pong memory buffering scheme driven by the EDMA helps reduce the initial setup time needed to perform these loops for a strip of MBs. Ping-pong buffering also ensures minimal CPU stalling cycles because the transfers are overlapped with processing.

Cheng Peng (c-peng2@ti.com) is a DSP video application engineer for Texas Instruments Inc. (Dallas).
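The ping-pong buffering scheme mentioned in the article can be sketched in C as follows. This is a minimal illustration under stated assumptions, not TI's EDMA API: `dma_start` and `dma_wait` are hypothetical stand-ins (here a plain `memcpy` that completes immediately), `process_strip` is a placeholder for one of the processing loops, and the strip length and luma-only MB size are chosen only for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MB_BYTES    256                     /* 16x16 luma samples only (assumption) */
#define STRIP_MBS   4                       /* n: hypothetical strip length */
#define STRIP_BYTES (MB_BYTES * STRIP_MBS)

/* Hypothetical stand-ins for an asynchronous DMA driver. Here the copy
   completes immediately; on real hardware it would run in the background
   while the CPU keeps processing. */
static void dma_start(unsigned char *dst, const unsigned char *src, size_t len)
{
    memcpy(dst, src, len);
}

static void dma_wait(void)
{
    /* Would block until the outstanding transfer has finished. */
}

/* Placeholder for one processing loop over a strip: sums the samples. */
static unsigned long process_strip(const unsigned char *strip, size_t len)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i < len; i++)
        sum += strip[i];
    return sum;
}

/* Processes num_strips strips from frame, fetching strip s+1 into one
   buffer while strip s is processed out of the other. */
unsigned long process_frame_pingpong(const unsigned char *frame, int num_strips)
{
    static unsigned char buf[2][STRIP_BYTES];   /* ping and pong buffers */
    unsigned long total = 0;
    int cur = 0, s;

    assert(frame != NULL && num_strips >= 0);
    if (num_strips == 0)
        return 0;

    /* Prime the pipeline: only the first strip is fetched without overlap. */
    dma_start(buf[cur], frame, STRIP_BYTES);
    dma_wait();

    for (s = 0; s < num_strips; s++) {
        int next = cur ^ 1;
        int more = (s + 1 < num_strips);

        if (more)   /* start fetching the next strip into the idle buffer */
            dma_start(buf[next],
                      frame + (size_t)(s + 1) * STRIP_BYTES, STRIP_BYTES);

        total += process_strip(buf[cur], STRIP_BYTES);  /* overlaps the fetch */

        if (more)
            dma_wait();     /* next strip must be complete before the swap */
        cur = next;
    }
    return total;
}
```

Only the first fetch is exposed as setup time in this structure. Every later transfer is issued before the processing call and waited on after it, so with a truly asynchronous DMA the CPU would stall only when a transfer takes longer than processing one strip.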