Using the ARM Cortex-R4 for DSP, part 2: Software optimization

November 26, 2007 -- dspdesignline.com

BDTI explains how to work with the Cortex-R4's pipeline, instruction set, and SIMD capabilities to optimize its performance.

Part 1 explains the DSP features of the Cortex-R4 and shows how it stacks up against competing processors.

Applications that involve real-time signal processing often have fairly stringent performance targets in terms of speed, energy efficiency, or memory use. As a result, engineers developing signal processing software often must carefully optimize their code to meet these constraints. Appropriate optimization strategies depend on the metric being optimized (e.g., speed, energy, memory), the target processor architecture, and the specifics of the algorithm.

In our last feature article, we presented signal processing benchmark results for the ARM Cortex-R4. These results were achieved by careful hand-optimization of the Cortex-R4 benchmark assembly code. In this article, we'll share some of the tips and tricks we used to develop our benchmark implementations for the Cortex-R4, starting with algorithm-level optimizations and working our way down to assembly-level optimizations.

Cortex-R4 Instruction Set
As we discussed in the previous article, the Cortex-R4 core implements the ARMv7 instruction set architecture. It uses an eight-stage pipeline and can execute up to two instructions per cycle. The core supports the Thumb-2 compressed instruction set, though most of BDTI's signal processing benchmark code is implemented using standard ARM instructions because of their greater computational power and flexibility. (Signal processing algorithms are typically optimized for maximum speed rather than minimum memory use, though memory use is often a secondary optimization goal.)

The Cortex-R4 instruction set is fairly simple and straightforward, and most of it will be familiar to engineers who have worked with other ARM cores, particularly the ARM11. Compared to the earlier ARM9E core, however, the Cortex-R4 is noticeably more complex to program because of its superscalar architecture and deeper pipeline (eight stages vs. five). Unlike the ARM9E, the Cortex-R4 also supports a range of SIMD (single-instruction, multiple-data) instructions. These improve its signal processing performance, but they often require different algorithms, different data organization, and different optimization strategies from those that worked well on earlier ARM cores.
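To make the point about data organization concrete, here is a minimal portable C sketch of a dot product built around a dual 16-bit multiply-accumulate, the kind of operation that ARM's SIMD instructions (for example, SMLAD) perform on two packed 16-bit values at once. This is not BDTI's benchmark code. The function names (pack_pairs, dual_mac16, dot_packed, dot_scalar) are hypothetical, and the C code only models the arithmetic; it assumes a compiler where narrowing conversions wrap in the usual two's-complement way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack pairs of signed 16-bit samples into 32-bit words:
   even-indexed sample in bits 15:0, odd-indexed sample in bits 31:16.
   This is the "different data organization" a packed SIMD kernel expects. */
static void pack_pairs(const int16_t *in, uint32_t *out, size_t n)
{
    assert(n % 2 == 0); /* sketch assumption: even sample count */
    for (size_t i = 0; i < n; i += 2) {
        out[i / 2] = (uint32_t)(uint16_t)in[i] |
                     ((uint32_t)(uint16_t)in[i + 1] << 16);
    }
}

/* Model of a dual 16x16 multiply-accumulate:
   acc + lo(a)*lo(b) + hi(a)*hi(b), with a wrapping 32-bit accumulator. */
static int32_t dual_mac16(uint32_t a, uint32_t b, int32_t acc)
{
    int16_t a_lo = (int16_t)(a & 0xFFFFu);
    int16_t a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16);
    /* Sum in unsigned arithmetic so that overflow wraps instead of being
       undefined behavior. */
    uint32_t sum = (uint32_t)acc
                 + (uint32_t)((int32_t)a_lo * b_lo)
                 + (uint32_t)((int32_t)a_hi * b_hi);
    return (int32_t)sum;
}

/* Packed dot product: one dual MAC consumes two samples from each input. */
static int32_t dot_packed(const uint32_t *x, const uint32_t *h, size_t n_words)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n_words; ++i) {
        acc = dual_mac16(x[i], h[i], acc);
    }
    return acc;
}

/* Scalar reference: one multiply-accumulate per sample. */
static int32_t dot_scalar(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += (int32_t)x[i] * h[i];
    }
    return acc;
}
```

The packed loop runs half as many iterations as the scalar one, but only because the inputs were first arranged as pairs in 32-bit words. On real hardware, the body of dual_mac16 would typically correspond to a single SIMD instruction rather than the shifts and masks shown here.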

The Cortex-R4 is in some ways similar to the ARM11, which supports a similar range of SIMD operations and also has an eight-stage pipeline. One difference between the two cores is that the Cortex-R4 is a dual-issue superscalar machine while the ARM11 is a single-issue machine. In some cases, this means that different optimization strategies are needed to ensure that instructions dual-issue as often as possible. But in many tight inner loops, the two cores may end up using very similar code. This is because of a key limitation on the Cortex-R4's dual-issue capabilities: it cannot execute a multiply-accumulate (MAC) operation in parallel with a load, and it cannot use its maximum load bandwidth (64 bits per cycle) in parallel with any other operation. As a result, in signal processing inner loops that require maximum MAC throughput or maximum memory bandwidth, the Cortex-R4 is often limited to executing a single instruction per cycle.
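The effect of this limitation on an inner loop can be illustrated with a deliberately simple issue-slot count. The sketch below is a toy model, not a cycle-accurate description of the Cortex-R4: it ignores instruction latencies, interlocks, and loop overhead, and both function names are hypothetical. It only contrasts a loop in which no MAC can pair with a load against an idealized machine in which every MAC could.

```c
/* Toy model: if a MAC can never issue alongside a load, each MAC and each
   load occupies its own issue cycle. */
static unsigned cycles_no_pairing(unsigned macs, unsigned loads)
{
    return macs + loads;
}

/* Idealized comparison: if every MAC could pair with a load, the longer of
   the two instruction streams would set the cycle count. */
static unsigned cycles_ideal_pairing(unsigned macs, unsigned loads)
{
    return macs > loads ? macs : loads;
}
```

For a loop body with eight MACs and eight loads, the model gives 16 issue cycles without pairing and 8 with ideal pairing. A loop dominated by MACs and loads therefore gains little from dual issue, and is scheduled much as it would be on a single-issue core.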
