Revving video encoding on C64x/DM64x DSPs
Cheng Peng (06/20/2005 9:00 AM EDT) URL: http://www.eetimes.com/showArticle.jhtml?articleID=164900190
How does one improve video encoding performance on the Texas Instruments TMS320C64x/TMS320DM64x digital signal processor generation?

The conventional implementation of a video encoder (MPEG-2, MPEG-4, H.263) is based on macroblock-level processing: the encoder fetches a new macroblock (MB) only after the current MB has gone through all the processing steps. This intuitive approach comes with two drawbacks:

- The overall code size of a video encoder is usually larger than the Level 1 program cache (L1P) on a C64x/DM64x DSP, so code must be swapped between L1P and the Level 2 (L2) memory during every MB fetch period, causing a significant cache-miss penalty.
- It is inefficient for the enhanced DMA (EDMA) controller to transfer a small chunk of data, such as a single MB, from external video frame memory to internal memory.

To avoid the heavy cache-miss penalty and CPU stalling, the algorithm can be broken into three loops, each of them a separate module that fits into L1P. Instead of processing a single MB at a time, each loop processes n macroblocks, an MB strip. The size of a strip is limited only by the size of the available Level 1 data cache (L1D); the larger n is, the better the EDMA data throughput we can expect. The three loops are:

- the MB encoding loop,
- the motion estimation loop, and
- the MB reconstruction loop.

As emphasized above, n MBs are fetched and go through one of the three processing loops together. For example, in the MB encoding loop, once n MBs have been fetched into internal memory, they are put through a discrete cosine transform (DCT), quantized and entropy-coded. This set of macroblocks is not flushed out of L1D until the MB encoding loop has completed. The corresponding programs, including the DCT, quantization and variable-length coding kernels, are likewise kept in L1P until all n MBs have been processed in this loop. (The first code sketch below outlines this loop structure.)

A ping-pong memory buffering scheme driven by the EDMA reduces the initial setup time needed to run these loops on a strip of MBs. Ping-pong buffering also keeps CPU stall cycles to a minimum because the transfers are overlapped with processing. (The second sketch below illustrates that overlap.)

Cheng Peng (c-peng2@ti.com) is a DSP video application engineer for Texas Instruments Inc. (Dallas).
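The strip-based loop split described in the article might be organized roughly as in the following minimal sketch. Every function and type name here (MacroBlock, fetch_strip, encode_strip and so on), the strip size N_MB, and the ordering of the three passes are illustrative assumptions rather than TI library calls or a prescribed schedule; the point is only that each pass is a small, separate module whose kernels can stay resident in L1P while a whole strip of n MBs sits in L1D.

/* Minimal sketch of the strip-based encoder structure.  All names are
 * illustrative placeholders, not TI library calls.  N_MB is the strip
 * size n, chosen so that n macroblocks plus working buffers fit in L1D. */
#define N_MB  8                       /* strip size n: an assumed value */

typedef struct {
    unsigned char y[16*16];           /* luma samples of one macroblock   */
    unsigned char cb[8*8], cr[8*8];   /* chroma samples (4:2:0 assumed)   */
} MacroBlock;

/* Each loop body is a separate module small enough to stay resident in L1P. */
void motion_estimate_strip(MacroBlock *strip, int n);   /* ME kernels          */
void encode_strip(MacroBlock *strip, int n);            /* DCT, quant, VLC     */
void reconstruct_strip(MacroBlock *strip, int n);       /* dequant, IDCT, add  */

/* EDMA-driven moves between external frame memory and internal RAM. */
void fetch_strip(MacroBlock *dst, int first_mb, int n);
void store_strip(const MacroBlock *src, int first_mb, int n);

/* One frame, processed as three passes over strips of n MBs.  The pass order
 * shown here (ME, then encoding, then reconstruction) is a typical one; the
 * article does not spell out the scheduling.  total_mbs is assumed to be a
 * multiple of N_MB for brevity. */
void encode_frame(int total_mbs)
{
    static MacroBlock strip[N_MB];    /* strip buffer in internal memory */
    int first;

    for (first = 0; first < total_mbs; first += N_MB) {   /* motion estimation */
        fetch_strip(strip, first, N_MB);
        motion_estimate_strip(strip, N_MB);
        store_strip(strip, first, N_MB);
    }
    for (first = 0; first < total_mbs; first += N_MB) {   /* MB encoding       */
        fetch_strip(strip, first, N_MB);
        encode_strip(strip, N_MB);
        store_strip(strip, first, N_MB);
    }
    for (first = 0; first < total_mbs; first += N_MB) {   /* MB reconstruction */
        fetch_strip(strip, first, N_MB);
        reconstruct_strip(strip, N_MB);
        store_strip(strip, first, N_MB);
    }
}

Because each pass touches only a handful of kernels, the code for one pass fits in L1P and is not evicted while a strip is being processed, and the strip itself stays in L1D for the duration of the pass.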
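The second sketch adds the ping-pong buffering to the MB encoding loop. Here edma_submit_copy() and edma_wait() are hypothetical wrappers around whatever DMA service the system provides; they are not actual TI API names. Writing the coded results back to external memory is omitted for brevity, and the sketch reuses MacroBlock, N_MB and encode_strip() from the first sketch.

/* Ping-pong (double) buffering: while the CPU encodes the strip in one
 * buffer, the EDMA fills the other buffer with the next strip, so the
 * transfer time is hidden behind processing. */
extern int   edma_submit_copy(const void *src, void *dst, int bytes); /* hypothetical */
extern void  edma_wait(int xfer_id);                                  /* hypothetical */
extern void *frame_addr(int first_mb);   /* address of an MB in external frame memory */

static MacroBlock ping[N_MB], pong[N_MB];   /* two strip buffers in internal RAM */

void encode_frame_pingpong(int total_mbs)
{
    MacroBlock *cur = ping, *next = pong;
    int first, xfer;

    /* Prime the pipeline: bring in the first strip before the loop starts. */
    xfer = edma_submit_copy(frame_addr(0), cur, (int)(N_MB * sizeof(MacroBlock)));
    edma_wait(xfer);

    for (first = 0; first < total_mbs; first += N_MB) {
        int next_first = first + N_MB;

        /* Kick off the background fetch of the next strip into the idle buffer... */
        if (next_first < total_mbs)
            xfer = edma_submit_copy(frame_addr(next_first), next,
                                    (int)(N_MB * sizeof(MacroBlock)));

        /* ...while the CPU encodes the strip that is already resident
         * (DCT, quantization, variable-length coding). */
        encode_strip(cur, N_MB);

        /* Wait for the background transfer to finish, then swap buffers. */
        if (next_first < total_mbs) {
            MacroBlock *tmp;
            edma_wait(xfer);
            tmp = cur; cur = next; next = tmp;
        }
    }
}

With two buffers resident, the only unavoidable stall is the initial fetch of the first strip; every later fetch runs in the background while the CPU is busy, which is the overlap the article relies on to keep CPU stall cycles to a minimum.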