|
|||||
Tips for efficient embedded code generation Tips for efficient embedded code generation Compiler optimizations allow for the development of comparatively less expensive products with more features. An understanding of the practical compiler optimizations allows the developer to write the most efficient code, and to know where to focus his time in improving the his product’s performance. Following are a number of coding examples and tips for efficient code generation on compilers. Choice of Data Type Sizes: In practice, using a smaller type for a local variable is computationally inefficient. Consider the following snippet of code: int m; In PowerPC assembly code, this will look like: addi r4, r4, 1 ; add 1 to c On a CISC chip like the 80386, it will look like: add $1, %al ; add 1 to c On a RISC chip, one pays the price after writing to a sub-integral variable – the value must be sign or zero extended in register to make sure that the high bits of the register are consistently set. On a CISC chip, one pays the price when mixing the short variable with a larger integer variable. And unless you’re using an 8 or 16-bit microprocessor, the microprocessor doesn’t save on the shorter add instruction, so there is nothing to make up for the cost of the sign or zero extension after the increment. Variables in memory are sometimes different. Since memory is scarce, it often makes sense to conserve by using a smaller data type. Most microprocessors can load and extend the short value in a single instruction, which means that the extension is free. Also, the truncation that is necessary when storing the value can often be avoided since sub-integral store instructions generally ignore the high bits. Larger data types should also be used sparingly. For example, C99 adds the long long type, and this type has been available as an extension to C89 in most compilers for years. On a 32-bit microprocessor, these variables take up two registers and require multiple operations to perform arithmetic on them. An add or a subtract can usually be done inline in a few instructions, but an innocuous looking shift can turn into quite a few instructions. And you can forget about division! Of course, these variables may be necessary or convenient when coding, but don’t use them unless you need them and understand the corresponding code size and speed impact. Variable signedness: Characters deserve special mention here. Many architectures, such as the ARM, and PowerPC, are more efficient at loading unsigned characters than they are at loading signed characters. Other architectures, such as the V800 and SH, handle signed characters more efficiently. Most architectures’ ABIs define the plain “char” type as whichever type is most efficient. So, unless a character needs to be signed or unsigned, use the plain char type. Use of access types: This code can be optimized by the compiler in several ways because the functions were declared static. First, the compiler can inline the call to allocate_new_unique_number() and delete the out-of-line copy, since it is not called from any other place. Second, if the compiler has basic information on the external allocate_other_fields() function, it can sometimes tell that the function will not call back into this module. This allows eliminating the second load of unique_number embedded in the in-lined allocate_new_unique_number() function. On the other hand, “static” data should be avoided whenever possible for function-local data. The data's value needs to be preserved between calls to the function, making it very expensive to access the data, and requiring permanent RAM storage. The static keyword should only be used on function-local data when the value must be preserved across calls or for large data allocations (such as an array) in which the programmer prefers to tradeoff consumption of stack space for permanent RAM. Global Variable Definition For example, if the global variable “glob” is used the most in file.c, it should be defined there. Use a definition such as: int glob; or: int glob = 1; in addition to declaring it (with extern int glob;) in a header file so that other modules can reference it. Some programmers get confused about uninitialized variable definitions, so it is worth clarifying how they are often implemented. Most C compilers support a feature, called common variables, where uninitialized global variable definitions are combined and resolved at link-time. For example, a user could place the definition, as follows: int glob; in a header file and include this header file in every file in the project. While a strict reading of the C language implies that this will result in link-time errors, under a common variable model, each module will output a special common reference, which is different from a traditional definition. The linker will combine all of the common references for a variable, and if the variable was not already defined elsewhere, allocate space for the variable in an uninitialized data section (such as .bss) and point all of the common references to this newly allocated space. On the other hand, if the user employs a definition such as: int glob = 1; in one module, all other uninitialized, common references would resolve to this definition. It is best to write code without defining the same uninitialized global variable in multiple modules. Then, if you turn off the common variable feature in the compiler, the compiler is able to perform more aggressive optimizations since it knows that uninitialized global variable definitions are true definitions. Volatile Variables 1) The variable is a memory mapped hardware device where the value can change asynchronously and/or where it is critical that all writes to the variable are respected. 2) The variable is used in a multi-threading environment where another thread may modify the variable at any time. The alternative to careful use of the volatile keyword is to disable so-called "memory optimizations". Effectively, all variables will be treated as volatile when this option is chosen. Since this disables many optimizations, developers are encouraged to choose a less conservative approach. Separate threads often perform separate functions. As a result, they may have many variables and data structures that are not accessed by other tasks. Good software engineering practices may be able minimize the overlap to a few shared header files or. If such practices have reduced the scope of the code to a few files, it is then more feasible to find the variables and data structures that are shared between threads. Const Restrict: Also, consider the inner loop for a PowerPC target: Better code could be generated by pulling the first load out of the loop, since in[i] does not change within the inner loop. However, the compiler can't tell that in[i] doesn't change within the loop. The compiler is unable to determine this because the in and out arrays could overlap. The restrict keyword helps in cases like these. If the function declaration is changed to: void fir_and_copy(int *in, const int *coeff, int *restrict out) the compiler now knows that writes through *out cannot change the values in * in. The restrict keyword is a bit confusing since it applies to the pointer itself rather than the data the pointer points to (contrast const int *x and >i> int *restrict x). This is because some interesting conceptual problems developed with the first attempt to add this feature to the C language - the noalias keyword that was considered for the C 1989 specification. Dennis Ritchie, one of the progenitors of the C language, pointed out some of these problems, and noalias died in committee. It's reincarnation in the C99 restrict keyword, while conceived independently of Ritchie, has received his blessings. Pointers and the & operator Taking the address of a variable forces the compiler to allocate the variable on the stack, all but assuring less efficient code generation. Passing an argument as a C++ reference parameter has the same effect. Declaration scope of variables The compiler could potentially perform a tail call for the call to func2(). This is an optimization where the frame for func(), if any, is popped, and the last call instruction to func2() is replaced by a branch to func2(). This saves the need to return to func1(), which would immediately return to its caller. Instead, func2() returns directly to func1()’s caller. Unfortunately, the compiler cannot employ this optimization since it cannot determine that loc will not be used in the second call to func2() (which is possible if its address was saved in the first call). It would be better to write the code as follows: In this case, the compiler knows that the lifetime of loc ends before the final call – and the tail call, at least in principle, can happen. Another benefit of using inner scopes is that variables from non-overlapping scopes can share the same space on the stack. This helps to minimize stack use and can result in smaller code size on some architectures. However, it is usually not worthwhile to create artificially small scopes simply to bound the lifetimes of variables. Floating Point Arithmetic Unless doubles are implemented in hardware, it will be more efficient to do arithmetic with the single precision type. If this is the case and if single precision arithmetic is sufficient, follow these rules: 1) Write single precision floating point constants with the F suffix. For example, write 3.0F instead of 3.0. The constant 3.0 is a double precision value, which forces the surrounding arithmetic expressions to be promoted to double precision. float foo(f) instead do: float foo(float f); This is probably only a concern for old code bases. Variable Length Arrays Low Level Assembly Instructions Inlined assembly code uses non-portable syntax and compilers generally make over-conservative assumptions when encountering inlined assembly, thus affecting code performance. Intrinsics can be #define’d into other names if switching from one compiler to another. This can be done once in a header file rather than going through the code to see every place where inlined assembly was used. For example, instead of writing the following code to disable interrupts on the PowerPC: asm(“mfmsr r3”); One can write: __SETSR(__GETSR() & ~(1 << (31-16))); Manual Loop Tricks is sometimes manually transformed into: Such transformations are usually only effective under fairly specific architectures and compilers. For example, the 32-bit ARM architecture supports the post-increment addressing mode used above, but the PowerPC architecture only includes the pre-increment addressing mode. So, for the PowerPC, this loop could better be written as: As a general rule, only do manual transformations for time critical sections of your code where your compiler of choice has not been able to perform adequate optimizations on its own, even after adjusting compilation options. Write simple code for most cases and let the compiler do the optimization work for you. Greg Davis is Senior Engineer / Engineering Manager, Compiler Development at Green Hills Software, Inc.. This article is based on material developed for Class 411 on compiler optimization presented at the Fall 2005 Embedded Systems Conference. |
Home | Feedback | Register | Site Map |
All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved. |