How does one improve video encoding performance on the Texas Instruments TMS320C64x/TMS320DM64x digital signal processor generation?

The conventional implementation of a video encoder (MPEG-2, MPEG-4, H.263) is based on macroblock-level processing. The video encoder fetches a new macroblock (MB) only after the current MB goes through all the processing steps. But this intuitive approach comes with two drawbacks:

- The overall code size of a video encoder is usually bigger than the Level 1 program cache (L1P) on a C64x/DM64x DSP. The code needs to be swapped between L1P and the Level 2 program cache (L2P) during every MB fetching period, causing a significant cache-miss penalty.
- It is not efficient for the enhanced DMA (EDMA) controller to transfer a small chunk of data such as a single MB from an external video frame memory to internal memory.

To avoid the huge cache-miss penalty and CPU stalling, the algorithm can be broken into three loops, each of them a separate module that fits into L1P. Instead of processing a single MB at a time in each loop, the module processes n macroblocks (an MB strip). The size of a strip is restricted only by the size of the available Level 1 data cache (L1D). The bigger n is, the better EDMA performance we can expect for data throughput.

The three loops are:

- the MB encoding loop,
- the motion estimation loop, and
- the MB reconstruction loop.

As emphasized above, n MBs are fetched and go through one of the three processing loops together. For example, in the MB encoding loop, when n MBs are fetched into internal memory, they are put through a discrete cosine transform (DCT), quantized and entropy-coded. This set of macroblocks is not flushed out of L1D until the MB encoding loop has been completed. Corresponding programs, including DCT, quantization and variable-length coding kernels, are also kept in L1P until all n MBs are processed completely in this loop.
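
The difference between the two processing orders can be sketched in portable C. This is only an illustration under stated assumptions, not TI code. The three `*_stub` functions are hypothetical placeholders for the real modules, `MB_COEFFS` assumes a 4:2:0 macroblock, the order in which the loops run (motion estimation first) is an assumption, and the "module switch" counter is just a crude proxy for how often a different module would have to be brought into L1P.

```c
#include <assert.h>
#include <stddef.h>

#define MB_COEFFS 384  /* assumed 4:2:0 macroblock: 6 blocks x 64 samples */

typedef void (*Module)(short *mb);

/* Hypothetical placeholders for the three modules; real ones would do
   motion estimation, DCT/quantization/VLC, and reconstruction. */
static void motion_estimate_stub(short *mb) { mb[0] += 1; }
static void encode_stub(short *mb)          { mb[0] *= 2; }
static void reconstruct_stub(short *mb)     { mb[0] -= 3; }

static Module last_module;        /* module that ran most recently */
static unsigned module_switches;  /* proxy for L1P reloads */

static void run(Module m, short *mb)
{
    if (m != last_module) {       /* a different module must be loaded */
        module_switches++;
        last_module = m;
    }
    m(mb);
}

/* Conventional order: each MB passes through every module before the
   next MB is fetched. */
unsigned encode_per_mb(short (*mbs)[MB_COEFFS], size_t count)
{
    last_module = NULL;
    module_switches = 0;
    for (size_t i = 0; i < count; i++) {
        run(motion_estimate_stub, mbs[i]);
        run(encode_stub, mbs[i]);
        run(reconstruct_stub, mbs[i]);
    }
    return module_switches;
}

/* Strip order: each module sweeps a strip of n MBs before the next
   module starts, so it is loaded once per strip instead of once per MB. */
unsigned encode_per_strip(short (*mbs)[MB_COEFFS], size_t count, size_t n)
{
    assert(n > 0);
    last_module = NULL;
    module_switches = 0;
    for (size_t start = 0; start < count; start += n) {
        size_t end = (start + n < count) ? start + n : count;
        for (size_t i = start; i < end; i++) run(motion_estimate_stub, mbs[i]);
        for (size_t i = start; i < end; i++) run(encode_stub, mbs[i]);
        for (size_t i = start; i < end; i++) run(reconstruct_stub, mbs[i]);
    }
    return module_switches;
}
```

With 8 macroblocks and n = 4, the per-MB order reports 24 module switches and the strip order reports 6, while both leave the data in the same state. On real hardware the gain depends on the actual module sizes and cache behavior, which this toy counter does not model.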
A ping-pong memory buffering scheme driven by the EDMA helps reduce the initial setup time needed to perform these loops for a strip of MBs. Ping-pong buffering also ensures minimal CPU stalling cycles because the transfers are overlapped with processing.

Cheng Peng (c-peng2@ti.com), DSP video application engineer for Texas Instruments Inc. (Dallas)
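
The ping-pong scheme mentioned above can be sketched as follows. This is a minimal, portable illustration rather than an EDMA driver. `dma_fetch_stub` is a hypothetical stand-in that copies synchronously with `memcpy`, whereas on the DSP the transfer would be submitted to the EDMA and run in the background while the CPU works on the other buffer. `STRIP_BYTES` and `process_strip_stub` are likewise assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STRIP_BYTES 16  /* toy size; a real strip holds n macroblocks */

/* Hypothetical stand-in for an EDMA transfer from external frame memory
   to internal memory. Here it simply copies synchronously. */
static void dma_fetch_stub(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, STRIP_BYTES);
}

/* Placeholder for one processing loop over a strip: sums its bytes. */
static unsigned long process_strip_stub(const unsigned char *strip)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < STRIP_BYTES; i++)
        sum += strip[i];
    return sum;
}

/* Process a frame strip by strip using two alternating buffers. */
unsigned long process_frame_pingpong(const unsigned char *frame, size_t strips)
{
    static unsigned char buf[2][STRIP_BYTES];  /* ping and pong buffers */
    unsigned long total = 0;
    int cur = 0;

    if (strips == 0)
        return 0;
    assert(frame != NULL);

    dma_fetch_stub(buf[cur], frame);  /* prime the first buffer */
    for (size_t s = 0; s < strips; s++) {
        int next = cur ^ 1;
        /* Start fetching the next strip into the idle buffer. With a real
           DMA engine this would proceed in parallel with the processing
           below, followed by a wait for completion before the swap. */
        if (s + 1 < strips)
            dma_fetch_stub(buf[next], frame + (s + 1) * STRIP_BYTES);
        total += process_strip_stub(buf[cur]);
        cur = next;  /* swap roles of the two buffers */
    }
    return total;
}
```

Only the first fetch is exposed as setup time in this arrangement. Every later fetch is issued before the current strip is processed, which is where the overlap would come from on hardware with an asynchronous transfer engine.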