Efficient SIMD and Algorithmic Optimization Techniques for H264 Decoder on Cortex A9
Vinith Kumar N, Supriya P (PathPartner)
Abstract
In this paper, we propose and explain two efficient optimization approaches for an H.264 decoder.
1. The first method exploits efficient utilization of SIMD for the deblocking filter.
2. The second method uses algorithmic-level optimization in the core decoding path, based on a thorough understanding of the H.264 video standard.
The tests were made on TI’s AM437x board with a single core running at 600 MHz and 1 GHz operating frequencies. Using these two techniques, the average decoding rate in frames per second (FPS) improved by 42.5% and 37.8% at the 600 MHz and 1 GHz operating frequencies respectively.
1. Introduction
The H.264 standard has dominated, and continues to dominate, the video codec market for more than a decade. H.264 was developed by the ITU-T Video Coding Experts Group (VCEG) together with the ISO/IEC JTC1 Moving Picture Experts Group (MPEG). The project partnership effort is known as the Joint Video Team (JVT). The ITU-T H.264 standard and the ISO/IEC MPEG-4 AVC standard (formally, ISO/IEC 14496-10 – MPEG-4 Part 10, Advanced Video Coding) are jointly maintained so that they have identical technical content. The final drafting work on the first version of the standard was completed in May 2003, and various extensions of its capabilities have been added in subsequent editions.
This paper explains two novel optimization techniques applied to an H.264 decoder (Baseline profile) on the Cortex-A9 platform to obtain the best performance. It gives a brief overview of the H.264 decoder and the ARM NEON architecture before explaining the optimization techniques we executed on the decoder.
2. Overview of H.264 Decoder and ARM NEON:
2.1 H.264 Decoder
Figure 1: H.264 Decoder Block diagram
The input to the decoder is an encoded stream (.264) and the output is raw data (YUV or RGB), depending on the test application. The decoder receives a compressed bitstream from the NAL and entropy decodes the data elements to produce a set of quantized coefficients X, which are scaled and inverse transformed to give D’n. Using the header information from the bitstream, the decoder creates a prediction block PRED, identical to the original prediction (P). P is added to D’n to produce uF’n, which is filtered to create each decoded block F’n. The entire process happens macroblock (MB) by macroblock until all MBs of a frame are completed. More information related to the H.264 decoder can be found in [1] and [2].
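The reconstruction step above (adding the inverse-transformed residual D’n to the prediction P and clipping back to the 8-bit pixel range) can be summarized with a small C sketch. This is only an illustrative fragment; the function names, the 4x4 block granularity and the stride handling are simplifying assumptions, not the structure of any particular decoder.

    #include <stdint.h>

    /* Clip a value to the 8-bit pixel range [0, 255]. */
    static inline uint8_t clip_pixel(int v)
    {
        return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }

    /* Add the inverse-transformed residual D'n to the prediction P for one
       4x4 block to form uF'n.  Illustrative sketch only: block size and
       stride handling are simplified compared to a real decoder. */
    static void reconstruct_4x4(uint8_t *recon, const uint8_t *pred,
                                const int16_t *residual, int stride)
    {
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                recon[y * stride + x] =
                    clip_pixel(pred[y * stride + x] + residual[y * 4 + x]);
    }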
2.2 ARM NEON
ARM's NEON technology is a 64/128-bit hybrid SIMD architecture designed to accelerate the performance of multimedia and signal processing applications, including video encoding and decoding, audio encoding and decoding, 3D graphics, and speech and image processing [3]. NEON supports 8-, 16-, 32- and 64-bit integer and single-precision (32-bit) floating-point data, with SIMD operations covering audio, video, graphics and gaming workloads. A single NEON SIMD instruction can perform up to 16 operations at the same time.
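The snippet below illustrates this degree of parallelism using NEON intrinsics: one 128-bit operation acts on 16 bytes at once. It is a minimal, self-contained sketch (the function name and the choice of a saturating add are arbitrary for the example), not code from the decoder discussed in this paper.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Saturating add of two rows of 16 pixels each, done with one 128-bit
       NEON operation per step: one load per source, one add covering all
       16 lanes, one store. */
    void add_rows_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b)
    {
        uint8x16_t va = vld1q_u8(a);        /* load 16 unsigned bytes     */
        uint8x16_t vb = vld1q_u8(b);
        uint8x16_t vr = vqaddq_u8(va, vb);  /* 16 saturating adds at once */
        vst1q_u8(dst, vr);                  /* store 16 results           */
    }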
3. Optimization techniques proposed:
Two major aspects were considered for achieving the targeted performance of the decoder on Cortex-A9. The first was efficient use of SIMD, processing the maximum number of pixels in parallel in the deblocking filter module (Section 3.1). The second was efficient handling of special cases in the core decoding path (Section 3.2).
3.1 Deblocking filter processing ensuring maximum SIMD exploitation:
The deblocking filter is applied to each macroblock within a decoded frame to reduce blocking artifacts (i.e. it smoothens block edges) caused by the block-based transform and quantisation stages. The filter is applied to the vertical and horizontal edges of the 4x4 blocks within a 16x16 macroblock (except for edges on slice boundaries). The filter operation has two main phases: boundary strength calculation and filter decision.
3.1.1 Boundary Strength Calculation
The extent to which a particular block edge needs to be filtered is determined in the boundary strength calculation phase. The boundary strength parameter BS is decided according to the rules listed in Table 1.
Boundary Modes and Conditions                                 | Boundary Strength (BS)
One of the blocks is intra and the edge is a macroblock edge | 4
One of the blocks is intra                                    | 3
One of the blocks has coded residuals                         | 2
Difference of block motion >= 1 luma sample distance          | 1
Motion compensation from different reference frames           | 1
Else                                                          | 0
Table 1: Boundary Strength Calculation
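A scalar sketch of these rules in C is shown below. The BlockInfo fields (intra flag, coded-residual flag, reference index and motion vector in quarter-pel units) are illustrative assumptions about the data a decoder has available at this point, not the structures of a specific implementation.

    #include <stdbool.h>
    #include <stdlib.h>

    typedef struct {
        bool is_intra;
        bool has_residual;   /* coded non-zero coefficients in this block */
        int  ref_idx;        /* reference frame index                     */
        int  mv_x, mv_y;     /* motion vector in quarter-pel units        */
    } BlockInfo;

    /* Boundary strength selection following the rules in Table 1. */
    static int boundary_strength(const BlockInfo *p, const BlockInfo *q,
                                 bool is_mb_edge)
    {
        if ((p->is_intra || q->is_intra) && is_mb_edge)
            return 4;
        if (p->is_intra || q->is_intra)
            return 3;
        if (p->has_residual || q->has_residual)
            return 2;
        if (p->ref_idx != q->ref_idx)             /* different reference frames */
            return 1;
        if (abs(p->mv_x - q->mv_x) >= 4 ||        /* >= 1 luma sample distance  */
            abs(p->mv_y - q->mv_y) >= 4)
            return 1;
        return 0;
    }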
3.1.2 Filter Decision
A normal implementation filters a single 4x4 block at a time. This implementation was changed in such a way that it unconditionally filters two 4x4 blocks simultaneously [8].
Algorithmic Flow:
The flowchart shown below provides the algorithmic flow for filtering luma vertical edges. The 8 pixel lines, i.e. pqSample_0 to pqSample_7 (as shown in figure 2), each containing eight pixels, are loaded into the corresponding registers (D0 to D7) using the NEON instruction VLD. Each of these registers contains 8 pixel values from index -4 to 3, i.e. pqSample[-4] to pqSample[3]. For ease of computation, a transpose operation is applied (VTRN) so that each register holds the pixel values of a particular index only (for example, D0 holds the pixel values at index -4, i.e. pqSample_0[-4] to pqSample_7[-4]). Alpha, beta and boundary strength (BS) values are required for making the filter decision.
A particular edge is filtered based on the three conditions listed below.
- The absolute difference of the pixel values at index 0 and -1 (as shown in figure 3) is less than alpha
- The absolute differences of the pixel values at index -2 and -1, and of the pixel values at index 1 and 0 (as shown in figure 2), are less than beta
- The boundary strength is non-zero
Figure 2: Assembly level flow chart
Figure 3: 16 x 16 Macroblock
These conditions are implemented using NEON instructions [4] such as VABD (absolute difference) and VCLT (compare less than). If all the conditions are satisfied, the edge goes to the filtering stage, which involves delta calculation and clipping. NEON instructions used at this stage include VNEG (vector negate), VRSHR (shift right and round), and VMAX and VMIN (for clipping). Where the conditional check is true, the filtered pixel values are selected; otherwise the unfiltered pixel values are kept. This per-pixel selection is made possible through the NEON instruction VBSL (bitwise select), and the result is stored using VST. The luma sample pointer is then incremented so that it points to the next two 4x4 blocks. The transpose needs to be applied only for horizontal filtering (i.e. filtering vertical edges); for vertical filtering, the transpose is not required.
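The following NEON intrinsics sketch shows the decision-and-select pattern described above for eight edge positions processed in parallel, using the intrinsic counterparts of VABD, VCLT, VRSHR and VBSL. It assumes the post-transpose layout in which each input pointer addresses eight consecutive pixels of one index (p1 = index -2, p0 = index -1, q0 = index 0, q1 = index 1). The delta computation follows the H.264 normal-filter formula for p0/q0, but tC clipping, the BS test and the BS = 4 strong filter are omitted, so this is an illustrative sketch rather than a complete, standard-exact filter.

    #include <arm_neon.h>
    #include <stdint.h>

    void filter_edge_8px(uint8_t *p1, uint8_t *p0, uint8_t *q0, uint8_t *q1,
                         uint8_t alpha, uint8_t beta)
    {
        uint8x8_t vp1 = vld1_u8(p1), vp0 = vld1_u8(p0);
        uint8x8_t vq0 = vld1_u8(q0), vq1 = vld1_u8(q1);

        /* Filter decision: |p0-q0| < alpha, |p1-p0| < beta, |q1-q0| < beta. */
        uint8x8_t valpha = vdup_n_u8(alpha), vbeta = vdup_n_u8(beta);
        uint8x8_t mask = vclt_u8(vabd_u8(vp0, vq0), valpha);
        mask = vand_u8(mask, vclt_u8(vabd_u8(vp1, vp0), vbeta));
        mask = vand_u8(mask, vclt_u8(vabd_u8(vq1, vq0), vbeta));

        /* delta = ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, computed in 16 bits;
           the rounding (+4 >> 3) comes from the rounding shift (VRSHR). */
        int16x8_t dq0p0 = vreinterpretq_s16_u16(vsubl_u8(vq0, vp0));
        int16x8_t dp1q1 = vreinterpretq_s16_u16(vsubl_u8(vp1, vq1));
        int16x8_t delta =
            vrshrq_n_s16(vaddq_s16(vshlq_n_s16(dq0p0, 2), dp1q1), 3);

        /* Unconditionally filter p0 and q0, saturating back to 8 bits. */
        int16x8_t p0w = vreinterpretq_s16_u16(vmovl_u8(vp0));
        int16x8_t q0w = vreinterpretq_s16_u16(vmovl_u8(vq0));
        uint8x8_t p0f = vqmovun_s16(vaddq_s16(p0w, delta));
        uint8x8_t q0f = vqmovun_s16(vsubq_s16(q0w, delta));

        /* VBSL-style select: keep the original pixels where the decision
           mask is false, take the filtered pixels where it is true. */
        vst1_u8(p0, vbsl_u8(mask, p0f, vp0));
        vst1_u8(q0, vbsl_u8(mask, q0f, vq0));
    }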
3.2 Algorithmic level optimization:
3.2.1 Skip Macroblocks Processing:
When the current macroblock to be processed is a skip MB, no coefficients are transmitted in the bit stream. Also, according to the standard, when the MB is a skip MB, the interpolated (motion-compensated) values are the reconstructed values, so the interpolated values need not be added to coefficients and stored again in the frame buffer. In addition, the deblocking filter need not be processed for the inner 4x4 blocks of the macroblock, as shown in the figure below, which reduces the total filter computation for one MB by 56%.
Figure 4: Blue 4x4 blocks no deblocking
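A minimal C sketch of this shortcut is given below: for a skip MB the prediction is copied straight into the frame buffer and the residual-add loop is bypassed. The buffer layout, the 16x16 luma-only scope and the function signature are simplifying assumptions made for illustration only.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    void reconstruct_inter_mb(uint8_t *frame_mb, int frame_stride,
                              const uint8_t *pred, const int16_t *residual,
                              bool is_skip_mb)
    {
        if (is_skip_mb) {
            /* Skip MB: the motion-compensated prediction is already the
               reconstruction, so just copy 16 rows of 16 luma pixels.
               Deblocking of the inner 4x4 edges is also skipped for this MB. */
            for (int y = 0; y < 16; y++)
                memcpy(frame_mb + y * frame_stride, pred + y * 16, 16);
            return;
        }
        /* Non-skip path: add the residual, clip to [0, 255] and store. */
        for (int y = 0; y < 16; y++)
            for (int x = 0; x < 16; x++) {
                int v = pred[y * 16 + x] + residual[y * 16 + x];
                frame_mb[y * frame_stride + x] =
                    (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
            }
    }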
3.2.2 Line Buffer storage on the fly for Intraprediction:
After decoding every macroblock, it is important to preserve the boundary pixels of the macroblock (the bottom 16 and the right 16 pixels) as shown in the figure below. These provide the spatial neighbour information needed when the next MB, or the MBs below, to be decoded are intra coded.
Figure 5: MB pixels for Intrapred line buffer (Red blocks)
Figure 6: Line buffer computation on the fly
In the optimized decoder, the line buffers are not updated for every MB; instead they are derived from the frame buffer on the fly. Only when the current MB is an intra MB are the neighbouring pixels copied from the frame buffer to the line buffer for spatial prediction, and only when those neighbours are available. This avoids 37 loads/stores for every MB, which would otherwise be a costly affair. This is illustrated in figure 6 above.
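The sketch below captures this idea: neighbour pixels for intra prediction are fetched from the frame buffer only when the current MB is intra coded and the neighbours exist. The buffer sizes (16 top pixels plus a corner and 16 left pixels), the padding value used when a neighbour is unavailable and the function name are assumptions made for the example, not the exact layout used in the decoder described here.

    #include <stdbool.h>
    #include <stdint.h>

    void fill_intra_neighbours(const uint8_t *frame, int stride,
                               int mb_x, int mb_y, bool is_intra_mb,
                               uint8_t top[17], uint8_t left[16])
    {
        if (!is_intra_mb)
            return;                     /* no copy needed for inter/skip MBs */

        const uint8_t *mb = frame + (mb_y * 16) * stride + mb_x * 16;

        if (mb_y > 0) {                 /* top neighbour row plus corner     */
            const uint8_t *above = mb - stride;
            for (int x = 0; x < 16; x++)
                top[x + 1] = above[x];
            top[0] = (mb_x > 0) ? above[-1] : 128;  /* assumed default value */
        }
        if (mb_x > 0) {                 /* left neighbour column             */
            for (int y = 0; y < 16; y++)
                left[y] = mb[y * stride - 1];
        }
    }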
3.2.3 Handling of MBs with CBP zero:
CBP (coded block pattern) is a syntax element signalled for every non-skip MB in an H.264 encoded bit stream. It indicates the presence of coded non-zero coefficients in the MB; if it is zero, no coefficients are present in the MB and none are signalled in the bit stream. For the deblocking filter, it follows from the standard that when an MB has CBP zero and is coded with 8x16 or 16x8 partitions, the filter need not be applied to all the vertical and horizontal edges; only the alternate edges marked in red need to be filtered. This is shown in figure 7 below.
Figure 7: Deblocking filter for CBP zero
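This edge-selection rule can be expressed as a small predicate, sketched below. Edge indices 0..3 count the four vertical (or horizontal) luma edge positions in the MB, with index 0 being the MB boundary and index 2 the partition boundary of a two-partition MB; the enum and the fall-back behaviour for other cases are illustrative assumptions, not the decoder's actual interface.

    #include <stdbool.h>

    typedef enum { PART_16x16, PART_16x8, PART_8x16, PART_8x8 } PartType;

    /* Returns true if the given edge still needs filtering.  The shortcut
       applies only to CBP-zero MBs coded with 16x8 or 8x16 partitions; all
       other cases fall back to the normal per-edge decision. */
    static bool edge_needs_filter(int edge_idx, bool vertical_edge,
                                  bool cbp_zero, PartType part)
    {
        bool two_part = (part == PART_16x8 || part == PART_8x16);
        if (!cbp_zero || !two_part)
            return true;                 /* normal filtering path */
        if (edge_idx == 0)
            return true;                 /* MB boundary edge      */
        /* Of the internal edges, only the partition-boundary edge remains. */
        bool boundary_is_vertical = (part == PART_8x16);
        return (edge_idx == 2) && (vertical_edge == boundary_is_vertical);
    }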
4. Conclusion
Two efficient optimization schemes have been proposed and implemented in the decoder, which has been ported to the A9 platform. First, we exploited SIMD for the deblocking filter so that 8 pixels are operated on in parallel, unconditionally. Second, we proposed algorithmic optimizations based on a thorough exploration of the H.264 standard and incorporated them in the decoder. Together these schemes gave very encouraging results: the performance of the decoder in frames per second improved by 42.5% and 37.8% at 600 MHz and 1 GHz respectively on the Cortex-A9 platform.
5. References:
[1] H.264 overview and explanation of the concepts: https://en.wikipedia.org/wiki/H.264/MPEG-4_AVC and http://lib.mdp.ac.id/ebook/Karya%20Umum/Video-Compression-Video-Coding-for-Next-generation-Multimedia.pdf
[2] H.264 decoder syntax and standard: https://www.itu.int/rec/T-REC-H.264-201402-S/en
[3] ARM Connected Community blog, "Coding for NEON, Part 1: Load and Stores": https://community.arm.com/groups/processors/blog/2010/03/17/coding-for-neon--part-1-load-and-stores
[4] NEON development guide, "Introducing NEON": http://infocenter.arm.com/help/topic/com.arm.doc.dht0002a/DHT0002A_introducing_neon.pdf
[5] ARM Architecture Reference Manual, ARMv7-A: http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf
[6] H.264/AVC reference software: http://iphome.hhi.de/suehring/tml/
[7] TI reference manual for AM437x hardware: http://www.ti.com/product/AM4379
[8] "H.264 Video Decoder Optimization on ARM Cortex-A8 with NEON", 2009 Annual IEEE India Conference
6. PathPartner’s Expertise
PathPartner carries a grand legacy in the field of video codec development. From video encoders to decoders, PathPartner has worked on optimizing such codecs on a variety of platforms, be it multicore DSPs, multicore ARM, GPGPUs, FPGAs or ASICs. PathPartner has expertise in tuning every piece of code to fit into the required performance budget with no compromise on the quality of the codec. PathPartner, equipped with a video lab and a strong group of visual analytics experts, has evaluated video sequences for a variety of encoders such as MPEG-2, MPEG-4, WMV9/VC-1, JPEG and H.264.
In the space of IPs, PathPartner has also developed an optimized HEVC decoder on multicore ARM, multicore DSPs, FPGAs and GPGPUs. PathPartner also works with various semiconductor companies, OEMs and ODMs to port and optimize codecs on their platforms or ASICs.
About PathPartner
PathPartner, based out of California, USA and Bangalore, India, is a leading provider of consulting, services and solutions for the digital media centric devices market. Our services range from product engineering/R&D and system integration to middleware, applications and system solutions.
With an expert management team that has rich experience in technology, engineering and business practices, PathPartner has a sales and marketing presence in the USA, Europe, Korea and Taiwan. The company specializes in addressing challenges faced by leading OEMs, silicon and OS providers in their product development.
www.pathpartnertech.com
sales@pathpartnertech.com