Architecture Oriented C Optimizations
By Eran Belaish, CEVA, Inc., Herzeliya, Israel
Abstract:
Know your hardware! That's what it's all about. Using programming guidelines derived from the processor's architecture can dramatically improve performance of C applications. In some cases, it can even make the difference between having the application implemented in C and having it implemented in assembly. Well written C code and an advanced compiler that utilizes various architectural features often reach performance results similar to those of hand written assembly code. A quick survey of assembly coding drawbacks should make it fairly clear why real-time programmers need architecture oriented programming guidelines in their toolkit.
Loop mechanisms
Loops are responsible for most of the cycles in an average application. It is therefore essential to know the hardware loop mechanism and how to comply with its needs when writing critical loops. Trivial loop mechanisms consist of compare and branch sequences. The compare instruction evaluates the loop condition and the branch instruction jumps according to the comparison result. This method has two major drawbacks:
- Branch instructions break the pipeline. This introduces an overhead, which further increases when branch prediction is wrong or when branch delay slots are not filled to completion. Having this overhead in a loop means that it is multiplied by the number of iterations.
- All "do-while" loops execute at least once. Therefore, the compiler can safely evaluate the loop condition at the bottom of the loop. In contrast, a "while" loop will execute zero times if the condition is false at the top of the loop. For these loops, the compiler must add a compare and branch sequence, also known as a guarding-if, before the loop. This extra sequence adds significant overhead, particularly in nested loops where the overhead is multiplied by the total number of iterations of all outer loops.
Zero overhead loop mechanisms
In zero overhead loop mechanisms, the number of iterations is calculated prior to entering the loop body. If the iteration count is too low (normally negative), the loop is skipped with little overhead. This mechanism has one major requirement: the number of iterations has to be known in advance, before entering the loop. Therefore, the loop condition has to be simple enough to be pre-calculated. Consequently, the following elements are not recommended as loop delimiters (the sketch after the list shows how to hoist such a term out of the condition):
- Function calls
- Variables that may change inside the loop, including volatile ones
- Global variables that may be changed by functions called in the loop
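For example, a function call like strlen() in the loop condition blocks the pre-calculation, while hoisting it out restores a simple trip count. A minimal sketch, with illustrative function names:

#include <ctype.h>
#include <string.h>

/* Not recommended: the strlen() call in the condition prevents
   the compiler from pre-calculating the iteration count. */
void to_upper_slow(char *s)
{
    for (unsigned i = 0; i < strlen(s); i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}

/* Recommended: hoist the call so the trip count is a simple,
   loop-invariant expression the hardware loop mechanism can use. */
void to_upper_fast(char *s)
{
    unsigned n = strlen(s);
    for (unsigned i = 0; i < n; i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}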
Branch and decrement loop mechanisms
In branch and decrement loop mechanisms, the number of iterations is stored in a loop counter. The loop counter is automatically decremented when jumping to the loop beginning, and when it reaches zero the loop breaks. This mechanism also has zero overhead potential, but unlike the zero overhead loop mechanism, it cannot automatically skip the loop if it does not iterate. Therefore, in some cases, dedicated assembly code is required to skip the loop. This extra code obviously involves an overhead, and the following guidelines, illustrated in the sketch after the list, can be used to eliminate it:
- Do-while loops iterate at least once and make skip code unnecessary. Use them whenever possible.
- Use dedicated pragmas to specify that a loop iterates at least once. These are normally referred to as "trip count" or "must iterate" pragmas and they too make skip code unnecessary.
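A minimal sketch of both guidelines. The pragma spelling here is hypothetical; check your compiler's manual for its exact "must iterate" syntax:

/* A do-while loop states directly that the body runs at least
   once, so no skip code is needed. The pragma below makes the
   same promise explicitly; its exact name is compiler-specific. */
long sum_samples(const short *x, int n)   /* caller guarantees n >= 1 */
{
    long s = 0;
    int i = 0;
#pragma must_iterate(1)                   /* hypothetical spelling */
    do {
        s += x[i];
        i++;
    } while (i < n);
    return s;
}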
DSP fundamentals
DSP fundamentals include architectural features that are found in most DSP processors and are extremely important for the performance of various DSP applications.
Multipliers
Multiplication is one of the most common operations in DSP code. It is often combined with accumulation to form a multiply-accumulate (MAC) operation which, for example, is massively used in filter implementation. A very important aspect of multipliers is their natural input width. Multiplying C operands of the same width as the multiplier's natural input width is essential for having an efficient single-cycle multiplication. For example, multipliers with 16-bit inputs should be triggered with 16-bit operands in C code multiplications. Narrower operands are also fine as they are automatically extended.
Problems start when C multiplication operands are too wide for the multiplier. This often triggers the compiler to use expensive sequences with several multiplications to compensate for the narrow multiplier inputs. In some cases the algorithm calls for wide multiplications and there is no alternative. However, C variables wider than necessary are often used inefficiently, mostly when porting code from architectures with wider multipliers or from PC.
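A short illustration, assuming a target DSP with 16-bit multiplier inputs (function names are illustrative):

#include <stdint.h>

/* Matches the multiplier: two 16-bit operands, one 32-bit product.
   This typically maps to a single multiply instruction. */
int32_t mul16(int16_t a, int16_t b)
{
    return (int32_t)a * b;
}

/* Wider than the multiplier: the compiler must synthesize the
   32x32 product from several 16-bit multiplications. */
int32_t mul32_low(int32_t a, int32_t b)
{
    return a * b;   /* may expand to a multi-instruction sequence */
}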
Some multipliers yield outputs with no equivalent C variables. The two multipliers of the CEVA-TeakLite-III DSP Core, for example, yield a 72-bit output when multiplying two 36-bit inputs, as shown in Figure 1. The output is stored in two 36-bit accumulators and there is no native ANSI C variable that can handle them. To solve this problem, the CEVA-TeakLite-III Compiler provides assembly macros that enable direct access to 72-bit operations from C code. The two 36-bit outputs are stored in two variables of type 'acc_t' (CEVA's C language extension for accumulator type).
Figure 1. CEVA-TeakLite-III computation and bit manipulation units
Hardware saturation mechanisms
Saturation prevents overflow in DSP code and is especially essential in vocoders. Without hardware saturation support, programmers have to implement saturation handling mechanisms in software, which often leads to significant performance loss. This inefficiency has motivated architects to support saturation in hardware. On the software side, using hardware saturation simply requires turning the saturation mode bit on.
Advanced compilers take advantage of hardware saturation support by offering compiler intrinsics. These intrinsics take the form of basic DSP operations compliant with the ETSI/ITU standard. Using this compiler feature instead of implementing saturation in software boosts performance dramatically. For example, working in this mode with the CEVA-X1641 Compiler boosts the speed of certain vocoders, like the AMR-NB (Adaptive Multi-Rate Narrow Band) vocoder, by a factor of 10.
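For reference, this is roughly what the ETSI 16-bit saturating addition looks like when implemented in software; with hardware saturation, an advanced compiler collapses the entire body into a single instruction through its intrinsic. Word16 and Word32 are the ETSI basic-operation types:

typedef short Word16;   /* 16-bit type from the ETSI basic operations */
typedef long  Word32;   /* 32-bit type from the ETSI basic operations */

/* Software reference of the ETSI "add" basic op: saturating
   16-bit addition. Every call pays for the range checks. */
Word16 add(Word16 a, Word16 b)
{
    Word32 sum = (Word32)a + (Word32)b;
    if (sum > 32767)  sum = 32767;     /* saturate upward   */
    if (sum < -32768) sum = -32768;    /* saturate downward */
    return (Word16)sum;
}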
Parallel architectures
There are two main architectural approaches for supporting parallel execution of instructions: Very Long Instruction Word (VLIW) and superscalar. VLIW architectures offer excellent instruction-level parallelism (ILP) without the expensive hardware required for parallelizing sequential code in superscalar architectures. VLIW parallelization is done by the compiler and the assembly programmer, both of which see the whole program, and it mostly outperforms superscalar scheduling, which works with only a limited window of the instruction stream and is therefore less common in advanced DSP applications. This means C programmers working with VLIW architectures need to be more aware of ILP considerations while writing their C code, although the guidelines below fit superscalar architectures as well. It also means that when working with VLIW architectures, like the CEVA-X architecture family, a state of the art optimizing compiler is mandatory.
Although ANSI C does not support parallel computation, C code can still be written in a way that eases parallelism, especially for a VLIW compiler. Advanced compilers handle parallelism well in serial computation and even in conditional execution, assuming the architecture supports predication. However, for complex program flows, ILP will probably decrease. When implementing complex program flows, it is highly recommended to test the ILP achieved by the compiler by manually inspecting the assembly it generates, especially in critical functions and loops.
Pseudo parallel computation at the C level
Pseudo parallel computation is all about untying dependencies. It can be used in serial code but it often involves manual loop unrolling. In most cases, VLIW compilers unroll loops quite well and use unrolling to untie dependencies and increase ILP. However, when the flow of the loop body is too complex, they might fail. Manual unrolling combined with knowledge of the algorithm unties dependencies and makes parallelism easier for the compiler.
For example, consider the binary search algorithm. This algorithm traverses a sorted data structure in a loop where each step determines the flow for the next one. This dependency between iterations makes it very hard for compilers to parallelize efficiently. However, ILP increases dramatically when writing several iterations one after the other and keeping the decision making part at the end of the loop body. ILP increases as the manually unrolled iterations are now independent and can be executed in parallel to one another.
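One possible rendering of this idea in C, sketched for a sorted array of exactly 16 elements (the function name and fixed size are illustrative, not the article's reference code). The four halving steps are written out one after the other as small compare-and-add steps that a predicating compiler can schedule without branches, and the single found/not-found decision is kept for the end:

/* Binary search over a sorted 16-element array with the four
   halving steps written out in sequence. Each step is a
   branch-free compare-and-add the compiler can if-convert. */
int find16(const short *a, short key)
{
    int pos = 0;
    if (a[pos + 8] <= key) pos += 8;
    if (a[pos + 4] <= key) pos += 4;
    if (a[pos + 2] <= key) pos += 2;
    if (a[pos + 1] <= key) pos += 1;
    return (a[pos] == key) ? pos : -1;   /* decision at the end */
}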
Dependency breaking and utilization of arithmetic characteristics
Another way to increase ILP is to break dependencies with a slight modification of the algorithm. This can be especially effective when linear calculations are involved. A linear calculation can often be computed using superposition: the calculation is decomposed into several independent elements, which are summed once they are all calculated. When the computation is performed in a loop, the decomposed elements' independence makes it easy to calculate them in parallel, while the summation is performed outside the loop. This method is a particular case of pseudo parallel computation in C code where arithmetic characteristics are used in manual loop unrolling to untie dependencies and increase ILP, as sketched below.
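A classic instance is a dot product split into independent partial sums. A minimal sketch; for brevity, n is assumed even in the second version:

/* One accumulator: each MAC depends on the previous one,
   serializing the loop. */
long dot(const short *x, const short *y, int n)
{
    long acc = 0;
    for (int i = 0; i < n; i++)
        acc += (long)x[i] * y[i];
    return acc;
}

/* Superposition: two independent partial sums the compiler can
   compute in parallel, combined once after the loop. */
long dot2(const short *x, const short *y, int n)   /* n assumed even */
{
    long acc0 = 0, acc1 = 0;
    for (int i = 0; i < n; i += 2) {
        acc0 += (long)x[i]     * y[i];
        acc1 += (long)x[i + 1] * y[i + 1];
    }
    return acc0 + acc1;
}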
Computational units
When working with a VLIW processor it is important to take into consideration the available computational units, especially when writing assembly code. The number and type of these units impose the number and type of instructions that can fit into a parallel instruction packet. For instance, the CEVA-X1641 DSP Core has four computation and bit manipulation units, two load/store units, one scalar unit and one program control unit. This leads to a maximum parallel execution of eight instructions per cycle.
Figure 2. CEVA-X1641 block diagram
Deducing optimal function and loop throughputs
Deducing optimal throughput is highly important for optimizing critical functions and loops in C. It enables the programmer to determine whether or not there is more room for improvement, and it also provides indications of possible optimizations. Writing an optimal assembly draft of the code often leads to an optimal implementation in C, assuming the assembly draft is not tricky and the compiler is good at optimizing C code. In order to write such a draft, it is essential to know the available computational units. This knowledge allows decomposition of functionality into smaller, hopefully independent stages, which will eventually result in a more optimized implementation in C and higher ILP.
C implementations can be modified to better fit the instruction set architecture (ISA). This improves the instruction selection done by the compiler and brings the resulting assembly closer to the theoretical optimum. For example, the CEVA-TeakLite-III ISA supports efficient extraction of fields from a register through special extraction instructions. This is useful when testing certain bits in a variable, as demonstrated in purple in Figure 3 below. To ensure that the compiler uses these extraction instructions, the function should be written in a way that clearly expresses bit extraction, and the generated assembly should then be checked by trial and error.
Figure 3. Efficient bit extraction generated by the CEVA-TeakLite-III Compiler
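As a hypothetical illustration of "expressing bit extraction" (not the exact code behind Figure 3), a test of a multi-bit field is best written as a single shift-and-mask expression:

/* Testing bits 7..4 of a control word. A single shift-and-mask
   expresses a field extraction, which an ISA with extraction
   instructions can implement in one operation. */
unsigned get_mode(unsigned ctrl)
{
    return (ctrl >> 4) & 0xFu;   /* extract the 4-bit field at offset 4 */
}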
Another important consideration when deducing optimal throughput is memory bandwidth. In functions and loops that massively access data memory, it is essential to make sure that the entire memory bandwidth is used and that memory access is not a bottleneck. This might require some user intervention (e.g. specifying pointer alignment as explained below) and it is recommended to check the generated assembly code to ensure maximum use of memory bandwidth. For example, consider the copy loop in Figure 4 compiled by the CEVA-TeakLite-III Compiler. The memory bandwidth of the CEVA-TeakLite-III is 64 bits per cycle, and indeed the purple colored loop uses it entirely in every cycle.
Figure 4. An optimal copy loop generated by the CEVA-TeakLite-III Compiler
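A hypothetical sketch of the same principle in C (not the code behind Figure 4): copying through a 64-bit type lets each load/store pair consume the full 64-bit bus, assuming both buffers are aligned and non-overlapping.

#include <stdint.h>

/* n64 counts 64-bit words; dst and src are assumed 64-bit aligned
   and non-overlapping, hence the restrict qualifiers. */
void copy64(uint64_t *restrict dst, const uint64_t *restrict src, int n64)
{
    for (int i = 0; i < n64; i++)
        dst[i] = src[i];   /* one full-width load and store per iteration */
}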
Avoiding expensive operations
Knowing the available computational units makes it easier to avoid operations that are not naturally supported by the hardware. DSP processors, for example, normally have neither dividers nor floating point units (FPUs). This means that most division and floating point operations in C programs are translated into expensive function calls by the compiler. In functional terms, such programs are correct, but in terms of efficiency they often perform poorly and are unsuitable for real time purposes.
The best advice is simply to avoid unnatural operations. Yet in some applications there is no choice. This has led scientists, DSP engineers and compiler developers to come up with various methods that compensate for the missing computational units. For example, efficient fixed point arithmetic is commonly used in DSP applications to emulate expensive floating point operations. Division by a constant power of two can be accomplished with a shift operation, and division by other constants with a multiplication by a scaled reciprocal followed by a shift. Such divisions are often optimized automatically at compile time, so it is recommended to check the generated assembly code before modifying any C code.
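Two small examples of these substitutions. The constant 52429 is exact for unsigned 16-bit division by 10, since 52429/2^19 differs from 1/10 by less than 4e-7:

#include <stdint.h>

/* Division by a power of two: a single shift. */
uint16_t div_by_8(uint16_t x)
{
    return x >> 3;                                     /* x / 8 */
}

/* Division by a non-power-of-two: multiply by a scaled reciprocal,
   then shift. Exact for every 16-bit unsigned x. */
uint16_t div_by_10(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 52429u) >> 19);   /* x / 10 */
}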
Memory related guidelines
Alignment considerations
Architectures may allow or disallow unaligned memory access. No special guidelines are required when unaligned access is allowed, but when it is disallowed the programmer must be careful. Ignoring alignment considerations causes severe performance issues and even malfunctions. To avoid malfunctions, all memory accesses need to be executed with the proper alignment. To improve performance, the compiler needs to be aware of the alignment of pointers and arrays in the program. Optimizing compilers normally track pointer arithmetic to identify alignment at each stage of the code, in order to apply SIMD (Single Instruction Multiple Data) memory accesses while maintaining correctness. In some cases the compiler can tell that a pointer's alignment allows memory access optimization (for example, when a pointer to a 16-bit variable is aligned to 32 bits) and SIMD memory operations are emitted. In other cases the pointers are simply not aligned, and the only option is to align them, either by copying the data to aligned buffers or by placing it with the linker.
In most cases, however, the compiler simply cannot tell the alignment. It therefore assumes the worst case and avoids memory access optimization as a consequence. To overcome this lack of information, advanced compilers offer a user interface for specifying the alignment of a given pointer, and use this information when considering memory access optimization for that pointer. For loops dominated by memory accesses (such as copy loops), this feature can yield a two-fold and even four-fold acceleration.
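The exact interface is compiler-specific. As one concrete example, GCC-style toolchains expose __builtin_assume_aligned; CEVA toolchains provide their own pragmas or intrinsics for the same purpose, so treat the spelling below as an assumption:

#include <stdint.h>

/* Promise the compiler that all three pointers are 8-byte aligned,
   enabling SIMD loads and stores of several samples per access. */
void add_vectors(int16_t *dst, const int16_t *a, const int16_t *b, int n)
{
    int16_t *d        = __builtin_assume_aligned(dst, 8);
    const int16_t *pa = __builtin_assume_aligned(a, 8);
    const int16_t *pb = __builtin_assume_aligned(b, 8);

    for (int i = 0; i < n; i++)
        d[i] = pa[i] + pb[i];
}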
Memory subsystem considerations and memory arrangement
In general, memory is divided into code memory and data memory. In small memory models, where there is very little room for code and data, various paging techniques are applied. Most modern architectures however, are designed for large memories due to the complex applications they must run. Caches are common for acceleration of memory fetching and DMA (Direct Memory Access) units are used for transferring large quantities of code or data.
The main concern when organizing code in memory is minimizing cache misses. This is a complex task involving many variables such as the function call tree, frequency of function execution, function size and various test vectors. The process can also be counter intuitive at times, as optimizing a function might worsen the overall performance due to changes in the overall memory map.
Data memory organization involves frequency and size considerations with regard to data caches. It can also be affected by the memory architecture, which might cause conflicts during transactions. As architectures use numerous configurations of memories, caches, DMAs and memory subsystems, each one requires thorough research to find the optimal memory organization. Advanced profilers, like the CEVA-X and CEVA-TeakLite-III profilers, offer memory usage diagnostics, which are highly important for this process.
Endianness considerations
Endianness determines the order of bytes or words in memory. It comes in two opposite versions, little endian and big endian. Some architectures, like the CEVA-TeakLite-III, support both little and big endian and can be configured dynamically at run-time.
The main concern about endianness arises when a little endian machine needs to communicate with a big endian machine, or when a machine of one type needs to process a binary stream saved in the other type. In such cases, one side has to convert the data to the endianness of the other. Another concern arises when the same machine has code memory of one kind and data memory of the other. This causes problems when storing code in data memory and vice versa. This too calls for a conversion from one type to the other.
Predication
Predication enables the conditioning of instructions on the evaluation of certain computations, usually compare operations. A conditional instruction executes if and only if the predicate it depends upon is turned on, yet the cycles required for the instruction are consumed anyway. Modern architectures use dedicated registers to store predicates, which enables compilers to treat predicates just like any other resource. Compilers use predication to convert traditional if-else implementations based on compare and branch instructions into efficient parallel sequences without the overhead of pipeline breaks. This compiler optimization is often referred to as if-else-conversion, or simply if-conversion.
A few simple guidelines can help the compiler in performing if-conversion. First, it is not always beneficial. If the conditional code blocks are too large, the compiler would normally prefer traditional compare and branch sequences, as they prevent wasting cycles on code that does nothing. This means that if-else blocks should be rather compact. Second, conditional instructions depend on compare operations to load their predicate registers. This dependency causes latency between condition evaluation and conditional code execution. To reduce this latency, all code that can be executed unconditionally should be taken out of the if-else structure. Breaking the dependency achieves both a higher ILP and a better chance for if-conversion.
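A minimal sketch of the second guideline; the function and its arithmetic are illustrative:

/* Harder to if-convert: the common work is duplicated inside
   both branches. */
short process_v1(short x, short c)
{
    if (c > 0)
        return (short)((x * 3 + 1) >> 1);
    else
        return (short)((x * 3 + 1) >> 2);
}

/* Friendlier: the unconditional part is hoisted out, leaving a
   small select that maps naturally onto predicated instructions. */
short process_v2(short x, short c)
{
    short t = (short)(x * 3 + 1);            /* unconditional work */
    return (short)((c > 0) ? (t >> 1) : (t >> 2));
}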
16-bit vs. 32-bit architectures
Most modern architectures for embedded and DSP processors are 32-bit. 16-bit architectures are more common in older or more lightweight processors. Conventionally, the size of int variables used by the compiler is 16 bits in 16-bit architectures, and 32 bits in 32-bit ones. The sizes of short and long variables are 16 bits and 32 bits respectively, regardless of the architecture type.
The main effect of bit width on C programming concerns memory access. To begin with, 16-bit architectures usually have a 16-bit address space, which is quite small in modern terms. This means applications must be quite compact, although there are mechanisms that bypass this limitation.
Another relevant consideration is the size of loop iterators. Loop iterators are used to count the number of loop iterations, but are also used to access arrays and pointers. As part of a compiler optimization called strength reduction, compilers consider iterators as induction variables and try to replace them with cheaper pointer arithmetic. There is a clear advantage for iterators as wide as the machine's address registers, since no conversion is required during pointer arithmetic. Since pointer width is usually the same as that of an 'int', it is recommended to use int typed iterators. This provides both better performance and better portability. On the other hand, using 'int' for variables that require a specific width, like voice samples, can cause problems when porting from 16-bit to 32-bit or vice versa. Therefore, int variables should be used carefully.
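A short sketch of the recommended split, with an 'int' iterator for addressing and an explicit 16-bit type for the samples themselves (illustrative function name):

#include <stdint.h>

/* 'int' matches the machine's natural pointer width, so the
   iterator needs no conversions during pointer arithmetic; the
   samples keep an explicit 16-bit type so the code ports cleanly
   between 16-bit and 32-bit targets. */
int32_t frame_energy(const int16_t *x, int n)
{
    int32_t e = 0;
    for (int i = 0; i < n; i++)       /* 'int' iterator */
        e += (int32_t)x[i] * x[i];
    return e;
}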
Byte addressable vs. word addressable architectures
Some architectures offer memory and register access at the byte level (typically 8 bits) while others offer access at the word level only (typically 16 bits). The main concern here arises when working with variables of type 'char', which are 8 bits wide on byte addressable machines and 16 bits wide on word addressable ones. This has two main implications when 8-bit chars are required on a word addressable machine. For one, 'char' buffers will be twice as large. For another, 8-bit 'char' handling requires packing and unpacking each of the two bytes that comprise a word, mostly when external streams are involved in read or write transactions.
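A hypothetical sketch of the unpacking step on a word-addressable machine, assuming the low byte of each 16-bit word comes first in the external stream:

/* Unpack an external stream stored two-bytes-per-word into one
   8-bit value per (16-bit) char, which is why the output buffer
   is twice as large as the input. */
void unpack_bytes(const unsigned short *words, char *bytes, int nwords)
{
    for (int i = 0; i < nwords; i++) {
        bytes[2 * i]     = (char)(words[i] & 0xFF);        /* low byte  */
        bytes[2 * i + 1] = (char)((words[i] >> 8) & 0xFF); /* high byte */
    }
}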
In practice, most voice and audio applications are indifferent to byte vs. word resolution as they use 16- and 32-bit variables. On the other hand, video applications involve a lot of byte level calculations, so word addressable machines are not recommended for that purpose. Moreover, video applications require unique byte manipulation instructions that are only available in byte addressable machines such as the CEVA-X.
Utilizing unique architectural features and instructions
DSP architectures often include unique features and instructions that are tailored for specific algorithms. These instructions are seldom used by the compiler as they are too specific and the compiler has to stay general. However, advanced compilers offer assembly intrinsics that allow accessing these instructions directly from C code. For more information about assembly intrinsics, please refer to "Combining C and assembly in DSP applications". For accessing more complex hardware modules such as DMAs, assembly code can be embedded into C code using the inline assembler.