VLIW core teams with dedicated hardware to crunch audio-video streams
By Martin Bolton, Senior Systems Architect, STMicroelectronics, DVD Division, Bristol, U.K., EE Times
June 17, 2002 (11:17 a.m. EST)
URL: http://www.eetimes.com/story/OEG20020614S0112
A carefully planned balance between functions done in hardware and those executed in software is the trick to achieving a flexible and cost-effective audio-video codec cell well-suited for consumer applications. Targeting system-on-chip projects, a new cell architecture now encodes and decodes MPEG-2, MPEG-4, and related algorithms for consumer products such as DVD and personal video recorders. The new architecture balances flexibility and cost (silicon area) by assigning processes that vary across different standards to a VLIW processor while performing fixed operations in a hardware accelerator. Elements include a VLIW processor built for multimedia applications, a hardware-based inner loop accelerator (ILA), and an innovative motion estimation algorithm that yields high-quality pictures without draining the available computational resources. Additional elements include a video preprocessor, motion estimation engine, system control processor, and memory.

Although a purely hardware approach to building the codec promises the smallest chip area, and therefore the lowest component cost, it also limits flexibility, restricting the chip's utility to a narrow range of standards. Moreover, starting a chip-level hardware design from scratch tends to incur long design and verification cycles. The opposite extreme, pursuing a purely software strategy to realize the codec functions, yields the greatest flexibility for handling a broad spectrum of standards, but it would require a costly high-performance processor ill-suited for the price-sensitive consumer market.

Instead, a hybrid of hardware and software yields a design that makes efficient use of silicon area without overly compromising the flexibility needed to execute a wide range of compression standards. Specifically, this design philosophy adds a hardware accelerator, in this case the ILA, to the VLIW core processor. The approach frees the core processor from numerous special instructions and operators that are now assigned to the accelerator. In this way, the codec cell serves as a platform that can be adapted to different standards without having to completely redesign the system.

For this example, the VLIW core processor, an ST220, belongs to a scalable, customizable architectural family of embedded cores designated Lx. The ST220 is based on technology jointly developed by Hewlett-Packard Laboratories and STMicroelectronics. The Lx architecture is scalable in its ability to change the number of resources it contains, as well as the mix of operations it can execute at once. The architecture can also be tailored to a specific application area through the addition of specific instructions, hence the customizable aspect. In the case of the ST220, that area is multimedia applications.

Operating at a clock frequency of up to 400 MHz, the ST220 is a four-issue RISC processor, meaning that four separate 32-bit instructions, called a bundle, can be active ("issued") in one clock cycle. The processor contains 64 general-purpose 32-bit registers and eight branch registers, and the compiler speeds execution by unrolling software loops and precomputing branch instructions. The processor also supports predication and speculative loading of data, adding to its overall efficiency. To maximize the flow of I/O data, the processor employs two independent I/O mechanisms: a conventional data cache and a unique streaming data interface (SDI), which carry data to and from system memory and peripherals, respectively.
Specifically, the SDI performs synchronized exchanges of I/O data directly between the processor's load-and-store unit and the FIFO (first-in, first-out) register of a peripheral.
Data exchanges of this type bypass both the processor's cache and the on-chip bus ports, resulting in a very low-latency path that is well-suited for exchanging data with coprocessors or streaming-data I/O devices. In contrast, architectures that require the data flowing between operators to pass systematically through main memory, although simpler because they decouple processing elements and avoid interblock buffers, can suffer large bandwidth penalties that limit both their performance and their range of features.
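As a rough illustration of what such a path means for software, the following C sketch models a blocking FIFO exchange through memory-mapped registers. The register addresses and function names are hypothetical, invented for this example; it is a sketch of the idea, not the actual ST220 SDI programming model.

#include <stdint.h>

/* Hypothetical memory-mapped FIFO registers of a coprocessor. */
#define SDI_TX_FIFO ((volatile uint32_t *)0x4000A000u) /* invented address */
#define SDI_RX_FIFO ((volatile uint32_t *)0x4000A004u) /* invented address */

/* Push one word straight from a register to the peripheral's FIFO,
 * bypassing the data cache and the on-chip bus ports. */
static inline void sdi_send(uint32_t word)
{
    *SDI_TX_FIFO = word; /* one plausible scheme: the store stalls while the FIFO is full */
}

/* Pull one word from the peripheral's FIFO into a register. */
static inline uint32_t sdi_receive(void)
{
    return *SDI_RX_FIFO; /* likewise, the load stalls until a word is available */
}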
Other significant performance elements of the processor include a memory subsystem equipped with an explicit data-prefetch cache, four combining write buffers, and cache-replacement policies matched to the usage patterns of MPEG video data.
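To give a sense of what an explicit data-prefetch facility buys, here is a generic C sketch using GCC's __builtin_prefetch; it stands in for, and is not, the ST220's actual prefetch mechanism. Touching the next row of pixels while the current one is consumed hides main-memory latency behind useful work.

#include <stdint.h>

void process_rows(const uint8_t *pixels, int rows, int stride)
{
    for (int r = 0; r < rows; r++) {
        /* Request the next row ahead of use; __builtin_prefetch is
         * defined to be harmless even past the end of the array. */
        __builtin_prefetch(pixels + (r + 1) * stride);
        /* ... consume the current row at pixels[r * stride] ... */
    }
}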
During the encoding process, the flow of data among the codec functional blocks consists mainly of predictor blocks going to the motion estimation engine, prediction errors to the discrete cosine transform (DCT) stage in the ILA, and DCT-transformed prediction errors to the processor's SDI. Data flow also includes inversely quantized coefficients from the processor's SDI to the ILA's inverse DCT (IDCT) stage and reconstructed pictures to system memory.
Of the function blocks themselves, the motion estimation engine is a special-purpose processor that fetches and interpolates sets of predictors from reference pictures stored in the system memory. It makes iterative updates to motion vectors and selects the best prediction mode and motion vector for every macroblock (a 16-by-16-pixel subdivision of a picture and the basic unit on which motion estimation is performed for MPEG encoding). Within the motion estimation engine, a predictor cache minimizes the required data-transfer bandwidth from system memory when performing the motion estimation algorithm.
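The interpolation the engine performs can be pictured with the half-pel bilinear filter that MPEG-style prediction uses. The C sketch below is a simplified software stand-in for the hardware interpolator, showing only the case where the predictor lies halfway between pixels in both directions.

#include <stdint.h>

/* One half-pel sample sits between four full-pel neighbors; MPEG-style
 * prediction forms it as their rounded average. */
static uint8_t half_pel(const uint8_t *ref, int stride, int x, int y)
{
    const uint8_t *p = ref + y * stride + x;
    return (uint8_t)((p[0] + p[1] + p[stride] + p[stride + 1] + 2) >> 2);
}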
Another functional block, the video preprocessor, accepts the video input and converts it to the correct picture format. It also performs ancillary algorithms that detect scene changes and telecine formats, determines whether pictures have progressive or interlaced frames, and reduces noise. The cell's system control requirements could be met by a RISC microprocessor core external to the codec cell, or by the VLIW processor itself.
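As one illustration of the preprocessor's scene-change detection, a common approach (not necessarily the one used in this cell) flags a cut when the mean absolute difference between co-located pixels of successive frames spikes. The threshold below is an arbitrary illustrative value.

#include <stdint.h>
#include <stdlib.h>

/* Return nonzero if the average per-pixel difference between two
 * successive luma frames exceeds a (hypothetical) cut threshold. */
int scene_changed(const uint8_t *prev, const uint8_t *cur, int n_pixels)
{
    long sum = 0;
    for (int i = 0; i < n_pixels; i++)
        sum += abs(cur[i] - prev[i]);
    return (sum / n_pixels) > 30; /* illustrative threshold */
}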
The cell's video encoding loop comprises the VLIW processor and the ILA. Of these, the processor performs quantization, rate control, and Huffman variable-length coding (HVLC). Importantly, it is these functions that differ most among video compression standards. Thus, by selecting these particular functions for execution in software, the design makes it possible for the codec to accommodate different video standards.
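A minimal sketch of the quantization step, one of the software tasks named above, might look as follows in C. It divides each DCT coefficient by a weight from the quantizer matrix scaled by the rate-control factor; MPEG-2's exact rounding, clipping, and intra-DC rules are simplified away.

#include <stdint.h>

void quantize_block(const int16_t dct[64], int16_t out[64],
                    const uint8_t matrix[64], int q_scale)
{
    for (int i = 0; i < 64; i++) {
        int div = matrix[i] * q_scale;
        /* Scale by 16 (MPEG-2's fixed-point convention, roughly) and
         * round to nearest, preserving sign; clipping omitted. */
        out[i] = (int16_t)((dct[i] * 16 +
                           (dct[i] >= 0 ? div / 2 : -(div / 2))) / div);
    }
}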
For its part, the ILA serves as a hardware accelerator for performing DCT and IDCT operations. Moreover, it issues elementary video streams by packing the code-size pairs generated by the HVLC operation. Because these tasks require no flexibility, they can be offloaded from the VLIW processor to the ILA. The ILA occupies less than 5 percent of the total codec cell area while relieving the processor of work that, for MPEG-2 encoding, would require 120 MHz of processing capacity. In doing this, the ILA gives designers a choice of lowering the processor's maximum clock frequency or executing additional software tasks.
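What "packing code-size pairs" amounts to can be shown with a small C sketch: the HVLC step emits (code, length) pairs, and a packer concatenates them most-significant-bit first into the byte stream. In the codec cell this job is done by the ILA hardware; the structure and names below are illustrative only.

#include <stdint.h>

typedef struct {
    uint8_t *buf;   /* output elementary stream */
    uint32_t acc;   /* bit accumulator */
    int      nbits; /* bits currently held in acc */
    int      pos;   /* next byte to write */
} packer_t;

/* Append `size` bits of `code` (assumes size <= 24 so the 32-bit
 * accumulator never overflows); a final flush of any leftover bits
 * is omitted for brevity. */
void pack_bits(packer_t *p, uint32_t code, int size)
{
    p->acc = (p->acc << size) | (code & ((1u << size) - 1));
    p->nbits += size;
    while (p->nbits >= 8) {
        p->nbits -= 8;
        p->buf[p->pos++] = (uint8_t)(p->acc >> p->nbits);
    }
}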
Of the remaining processor tasks, direct and inverse quantization adds about 125 MHz of loading, while HVLC and rate control take 30 MHz and 3 MHz, respectively. In addition, the main program, which manages encoding and intercommunication with the ILA and motion estimation engine, requires about 37 MHz. In all, therefore (125 + 30 + 3 + 37 = 195 MHz), the MPEG-2 encoding loop requires about 200 MHz of the processor's 400-MHz clock frequency. These figures were derived by simulating one second of MPEG-2 encoding at a target bit rate of 15 Mbits/sec.
Given the remaining 200 MHz of processing capacity, it is possible to add audio encoding on top of the codec's video tasks. Indeed, experiments have shown that only negligible overhead is incurred for context switching and audio-video multithreading. Toward this end, a simple, royalty-free operating system, along with task-aware debugging tools, has been created to simplify resource sharing and application development. And, while specifically conceived for audio and video encoding, this same architecture is suitable for performing efficient decoding as well. For decoding video, the processor is harnessed to perform bit stream parsing, inverse quantization, and variable-length decoding. It also performs all audio decoding. As for the ILA, it performs IDCT and reconstruction operations, while the motion estimation engine does motion compensation.
The MPEG-2 video standard is based on a video compression method that exploits the high degree of spatial and temporal correlation that occurs in natural video sequences. Specifically, during encoding, a coding loop removes temporal redundancy using interframe motion estimation. Residual error images are further processed using a discrete cosine transform, a process that reduces spatial redundancy by decorrelating the pixels within a block and concentrating the energy of the block itself into a few low-order coefficients. Finally, scalar quantization and variable length coding produce a bit stream having a good statistical compression efficiency.
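For reference, the 8-by-8 DCT at the heart of that loop can be written directly from its defining formula, as in the unoptimized C sketch below. Real encoders use fast fixed-point factorizations, but even the direct form shows how the transform concentrates a block's energy into a few low-order coefficients.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Direct 2-D DCT-II of one 8x8 block (reference form, not optimized). */
void dct_8x8(const double in[8][8], double out[8][8])
{
    for (int u = 0; u < 8; u++) {
        for (int v = 0; v < 8; v++) {
            double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double sum = 0.0;
            for (int x = 0; x < 8; x++)
                for (int y = 0; y < 8; y++)
                    sum += in[x][y]
                         * cos((2 * x + 1) * u * M_PI / 16.0)
                         * cos((2 * y + 1) * v * M_PI / 16.0);
            out[u][v] = 0.25 * cu * cv * sum;
        }
    }
}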
To remove temporal redundancy, the MPEG-2 standard relies on motion estimation, a process that computes the similarities among successive pictures so that only the differences between them need be transmitted. Motion estimation is arguably the single most computationally intensive task of the MPEG-2 encoding process, with implementation costs typically measured in terms of internal memory size, required bandwidth to external memory, and millions of sums of absolute differences per second. For portable applications, power consumption is another important factor.
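The unit of measure mentioned above, the sum of absolute differences (SAD), is the cost function at the core of block matching. A straightforward C version for one 16-by-16 macroblock against one candidate predictor looks like this:

#include <stdint.h>
#include <stdlib.h>

/* Accumulate |cur - ref| over a 16x16 block; both pointers address the
 * block's top-left pixel in frames of the given stride. */
unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}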
Consequently, an innovative approach to motion estimation that yields high quality without overly taxing computational resources was developed for the audio-video codec cell. The algorithm is based on a two-step recursive-predictive method.
In the first step, information from other macroblock estimations is used to predict the motion of the 16-by-16-pixel macroblock under estimation. The underlying idea is that motion in natural pictures varies slowly, passing from one macroblock to its neighbors. In other words, the motion vectors of neighboring macroblocks are closely correlated. An intuitive proof of this can be observed in the fact that objects in a picture normally span more than one macroblock, parts of the same object move with the same motion, and the motion direction and speed vary slowly from one picture to the next.
This "inheritance" of information to successive macroblocks is done by applying the selected motion vectors (the "winners") of previous macroblocks to the current macroblock under estimation. The winners can come from macroblocks of the same frame or from those previously estimated frames. The candidates to choose from are dynamically selected according to the type of ma croblock and the motion content.
The second step is an update, which is applied to allow evolution of the motion vector field by means of a small fixed number of new positions. As before, these positions are adaptively determined by the motion content.
Thus the predictive method, unlike techniques that start from scratch, reuses the estimates of other blocks as the starting point for the current macroblock, conserving computational resources and memory bandwidth. Moreover, because it reapplies the motion vectors of previous macroblocks, it favors alignment of the motion vector field; that is, motion vectors change slowly when moving from one macroblock to a neighbor. The result is better tracking of actual motion than a "full search" algorithm, a computationally complex approach that tests all possible candidates.
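Putting the two steps together, the C sketch below shows the shape of such a recursive-predictive search: inherited candidate vectors are tried first, then the winner is refined with a small fixed set of update positions. The candidate and update sets shown are illustrative; the cell's actual adaptive selection rules are not reproduced here. It reuses the sad_16x16() kernel from the earlier sketch.

#include <stdint.h>

/* Kernel from the earlier SAD sketch. */
unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride);

typedef struct { int x, y; } mv_t;

/* cur points at the macroblock under estimation in the current frame;
 * (mb_x, mb_y) is its pixel position in the reference frame. Picture
 * bounds checking is omitted for brevity. */
mv_t predictive_search(const uint8_t *cur, const uint8_t *ref, int stride,
                       int mb_x, int mb_y,
                       const mv_t *candidates, int n_cand)
{
    mv_t best = {0, 0};
    unsigned best_cost = ~0u;

    /* Step 1: inherit -- test the winners of neighboring macroblocks. */
    for (int i = 0; i < n_cand; i++) {
        const uint8_t *pred = ref + (mb_y + candidates[i].y) * stride
                                  + (mb_x + candidates[i].x);
        unsigned cost = sad_16x16(cur, pred, stride);
        if (cost < best_cost) { best_cost = cost; best = candidates[i]; }
    }

    /* Step 2: update -- refine with a small fixed number of new positions. */
    static const mv_t upd[4] = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
    for (int i = 0; i < 4; i++) {
        mv_t mv = { best.x + upd[i].x, best.y + upd[i].y };
        const uint8_t *pred = ref + (mb_y + mv.y) * stride + (mb_x + mv.x);
        unsigned cost = sad_16x16(cur, pred, stride);
        if (cost < best_cost) { best_cost = cost; best = mv; }
    }
    return best;
}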
This article will be presented at ICCE in a paper titled "Hardware-Software Balance Yields Flexible Audio-Video Codec Cell for Consumer Products."