Video codecs in SoCs using OCP-based programmable accelerator design
April 27, 2007 -- networksystemsdesignline.com
OCP standardizes the communication infrastructure in SoC designs and thereby ensures interoperability between IP blocks. Using the Open Core Protocol, System-on-Chip (SoC) designers can analyze and evaluate various processor, interconnect, memory and peripheral IP alternatives during sub-system or platform architecture exploration. While processors are mainly obtained from third-party IP providers, SoC designers differentiate their designs through the overall SoC architecture, the algorithms and the implementation of specific blocks. Those specific blocks are, to a large extent, hardware accelerators, where the data-path and its implementation have been the key differentiators. With the increasing data-rates, functionality and complexity of today's video and wireless standards, the design efficiency and flexibility of those blocks are becoming the most important differentiators.
Flexibility is becoming crucial for efficient design re-use in SoCs and their derivatives, where features and functionality are added over time. Furthermore, flexibility is increasingly necessary when supporting multiple standards (and modes), such as the VC-1 and H.264 video codecs, within a single SoC. This flexibility can be achieved by replacing the hardwired state machines in those blocks with programmable ones. As a bonus, programmability allows late changes to be made in software, mitigating the risk of design errors.
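To make the contrast concrete, here is a minimal C sketch of a table-driven, programmable sequencer. The micro-operations and the per-standard sequences are hypothetical and not standard-accurate; the point is only that supporting another standard or mode means loading another table rather than redesigning a hardwired state machine.

```c
#include <stdio.h>

/* Hypothetical micro-operations a deblocking-filter sequencer might issue. */
typedef enum { OP_LOAD_EDGE, OP_FILTER_LUMA, OP_FILTER_CHROMA, OP_STORE_EDGE, OP_DONE } op_t;

/* With a programmable state machine the control flow is data: one table per
 * standard or mode. Adding a mode means loading another table, not new logic.
 * The sequences below are illustrative, not actual H.264/VC-1 filter flows. */
static const op_t h264_mode[] = { OP_LOAD_EDGE, OP_FILTER_LUMA, OP_FILTER_CHROMA, OP_STORE_EDGE, OP_DONE };
static const op_t vc1_mode[]  = { OP_LOAD_EDGE, OP_FILTER_LUMA, OP_STORE_EDGE, OP_DONE };

static void run_sequencer(const op_t *program)
{
    for (int pc = 0; program[pc] != OP_DONE; ++pc) {
        switch (program[pc]) {
        case OP_LOAD_EDGE:     puts("load edge pixels");         break;
        case OP_FILTER_LUMA:   puts("drive luma filter unit");   break;
        case OP_FILTER_CHROMA: puts("drive chroma filter unit"); break;
        case OP_STORE_EDGE:    puts("write filtered pixels");    break;
        default: break;
        }
    }
}

int main(void)
{
    run_sequencer(h264_mode);  /* same sequencer hardware ...        */
    run_sequencer(vc1_mode);   /* ... reprogrammed for another mode  */
    return 0;
}
```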
Next-generation designs, especially for new video and wireless standards, have to handle enormous data-rates, which translate into tremendous throughput and computing requirements. Adding programmability must not come at the cost of the performance or energy-efficiency losses seen when standard processors are compared with hardwired logic. Therefore, more and more designers are adopting a design paradigm that combines the advantages of processors and hardwired logic into so-called Programmable Accelerators.
A CoWare customer has presented a programmable accelerator for a video deblocking filter unit that handles standard-definition resolution in set-top boxes at 160 MHz. In addition, CoWare has shown design examples of a programmable accelerator for a video deblocking filter unit that operates at 200 MHz, supports full high-definition resolution and frame-rate, and at the same time can be re-used for both the VC-1 and the H.264 video codecs. This performance is achieved through the application-specific deployment of computer-architecture features: the data-path of a Programmable Accelerator is typically massively parallel and highly specialized for certain tasks. As in a hardwired implementation, functional units can execute in parallel. However, the control of the functional units is not fixed, as it is in a hardwired implementation, but is taken over by an instruction decoder and a program, as in a processor. This makes the function reusable across different applications and variations of an algorithm. Advanced programming schemes such as software pipelining, combined with highly specialized parallel data-paths, allow for optimal utilization of the functional units, achieving the highest throughput at the lowest clock frequency. The functional units can communicate via dedicated registers and buffers and are not limited, for instance, by the bit-width and size of a general-purpose register file, as is the case for processor instruction-set extensions.
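A minimal sketch of that control scheme, using hypothetical instruction fields and stub functions in place of real hardware control lines: each instruction word carries one slot per functional unit, so a single decoded word drives all units in the same cycle, while the program, not fixed logic, decides what they do.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical VLIW-style instruction word for a Programmable Accelerator:
 * one slot per functional unit, so one decoded word drives several
 * specialized units in the same cycle. */
typedef struct {
    uint8_t load_op;    /* operation for the load/store unit             */
    uint8_t filter_op;  /* operation for the pixel-filter unit           */
    uint8_t alu_op;     /* operation for the scalar ALU                  */
    uint8_t dst_buf;    /* dedicated destination buffer, not a GPR index */
} insn_word_t;

/* Stubs standing in for the control lines of the functional units. */
static void issue_load(uint8_t op)     { printf("load   unit: op %u\n", op); }
static void issue_filter(uint8_t op)   { printf("filter unit: op %u\n", op); }
static void issue_alu(uint8_t op)      { printf("alu    unit: op %u\n", op); }
static void select_buffer(uint8_t buf) { printf("write to buffer %u\n", buf); }

/* The instruction decoder takes over the role of the hardwired FSM: each
 * program step raises the control signals of every unit at once, so the
 * units work in parallel as they would in fixed logic. */
static void decode_and_issue(const insn_word_t *program, int length)
{
    for (int pc = 0; pc < length; ++pc) {
        issue_load(program[pc].load_op);
        issue_filter(program[pc].filter_op);
        issue_alu(program[pc].alu_op);
        select_buffer(program[pc].dst_buf);
    }
}

int main(void)
{
    const insn_word_t program[] = {
        { .load_op = 1, .filter_op = 0, .alu_op = 2, .dst_buf = 0 },
        { .load_op = 0, .filter_op = 3, .alu_op = 1, .dst_buf = 1 },
    };
    decode_and_issue(program, 2);
    return 0;
}
```

Software pipelining would then overlap the load, filter and store slots of successive blocks so that no unit sits idle, which is how the highest throughput is reached at the lowest clock frequency.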
Video codecs in particular offer a huge amount of data parallelism due to their block-based structure: most computations are performed not on a single pixel but on a block of pixels. Thus, dedicated acceleration units with a wide data-path (e.g. 16 x 16 x 8 bit = 2048 bit) can speed up the codec by up to three orders of magnitude compared to a pure software solution. However, designers cannot afford to pay for the performance of hardware accelerators with limited flexibility. Especially on the encoder side, flexibility and programmability are key, because heuristics such as those used for motion estimation are the critical factor for better compression, better quality and thus better products. In a Programmable Accelerator the designer can break the accelerator data-path into small, highly re-usable and programmable units. This is possible through the tight link between acceleration units and control software in a Programmable Accelerator. In contrast, traditional hardwired accelerators have to be controlled by a separate controller in the SoC, where synchronization and scheduling via interrupts and memory-mapped register interfaces cost expensive cycles. Therefore, the task implemented on a hardwired accelerator has to be large enough to justify the control and synchronization overhead, at the cost of giving up flexibility.
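To make the width arithmetic concrete: a 16 x 16 block of 8-bit pixels is 16 * 16 * 8 = 2048 bits, so a data-path of that width consumes a whole block per operation where a scalar core loops 256 times. The plain-C sketch below only models that counting argument; it is not an actual accelerator interface.

```c
#include <stdint.h>
#include <stdio.h>

#define BLK 16  /* 16x16 block of 8-bit pixels = 16*16*8 = 2048 bits */

/* Scalar reference: one pixel per operation -> 256 operations per block. */
static long clip_block_scalar(uint8_t blk[BLK][BLK])
{
    long ops = 0;
    for (int y = 0; y < BLK; ++y)
        for (int x = 0; x < BLK; ++x) {
            blk[y][x] = blk[y][x] > 235 ? 235 : blk[y][x]; /* trivial per-pixel work */
            ++ops;
        }
    return ops;
}

/* Model of a 2048-bit wide data-path: the whole block is one operand,
 * so the same work counts as a single wide operation. */
static long clip_block_wide(uint8_t blk[BLK][BLK])
{
    for (int y = 0; y < BLK; ++y)          /* hardware would evaluate these */
        for (int x = 0; x < BLK; ++x)      /* lanes in parallel             */
            blk[y][x] = blk[y][x] > 235 ? 235 : blk[y][x];
    return 1;                              /* one wide operation per block  */
}

int main(void)
{
    uint8_t a[BLK][BLK] = {{0}}, b[BLK][BLK] = {{0}};
    printf("scalar ops per block: %ld\n", clip_block_scalar(a)); /* 256 */
    printf("wide   ops per block: %ld\n", clip_block_wide(b));   /* 1   */
    return 0;
}
```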
In a Programmable Accelerator, by contrast, these acceleration units are controlled by software running on the accelerator itself, without additional overhead. In a video encoder, for example, this enables designers to keep improving encoder quality by tweaking software-programmable heuristics instead of hardwired ones, even after the hardware architecture is fixed.
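As a rough illustration of that tight coupling, with hypothetical names rather than any vendor's API: the search heuristic below is ordinary C that would run on the accelerator's own sequencer, while block_sad stands in for a wide SAD acceleration unit it can invoke directly, with no interrupt or memory-mapped-register handshake per call. The heuristic can keep being re-tuned in software after the silicon is frozen.

```c
#include <stdint.h>
#include <stdlib.h>

#define BLK    16
#define WIDTH  64   /* small hypothetical reference frame, for illustration */
#define HEIGHT 64

/* Stand-in for a wide acceleration unit: sum of absolute differences between
 * the current 16x16 block and a 16x16 window of the reference frame, which a
 * 2048-bit data-path would evaluate in a handful of wide operations. */
static unsigned block_sad(const uint8_t cur[BLK][BLK],
                          const uint8_t *ref, int rx, int ry)
{
    unsigned sad = 0;
    for (int y = 0; y < BLK; ++y)
        for (int x = 0; x < BLK; ++x)
            sad += (unsigned)abs((int)cur[y][x] - (int)ref[(ry + y) * WIDTH + (rx + x)]);
    return sad;
}

/* The heuristic is ordinary control code: a small search around a predicted
 * position. Running on the accelerator's own sequencer, it calls the SAD
 * unit directly, so each candidate costs no synchronization overhead. */
typedef struct { int x, y; } mv_t;

static mv_t small_search(const uint8_t cur[BLK][BLK], const uint8_t *ref, mv_t start)
{
    static const mv_t step[] = { {0,0}, {1,0}, {-1,0}, {0,1}, {0,-1} };
    mv_t best = start;
    unsigned best_sad = ~0u;

    for (size_t i = 0; i < sizeof step / sizeof step[0]; ++i) {
        int rx = start.x + step[i].x, ry = start.y + step[i].y;
        if (rx < 0 || ry < 0 || rx + BLK > WIDTH || ry + BLK > HEIGHT)
            continue;                       /* stay inside the reference frame */
        unsigned sad = block_sad(cur, ref, rx, ry);
        if (sad < best_sad) { best_sad = sad; best = (mv_t){ rx, ry }; }
    }
    return best;
}

int main(void)
{
    static uint8_t cur[BLK][BLK];
    static uint8_t ref[WIDTH * HEIGHT];
    mv_t mv = small_search(cur, ref, (mv_t){ 24, 24 });
    return mv.x % 2;   /* keep the result observable */
}
```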