Compilers Enable Embedded Comm Application Optimization
By Lerie Kane, Intel
March 15, 2005
As processor architectures evolve and the complexity of instructions increases, the role of the compiler in application development continually gains importance. The applications developed in the embedded industry are becoming progressively more intricate, placing even more importance on software tools. An optimal compiler and good optimization methods can not only increase the performance of an application, but also decrease development cost and engineering cycle time, thereby accelerating time to market.
Performance Methodology
Unfortunately, application optimization is not an exact science because not every compiler optimization behaves the same for every application. This makes it important to fully understand the steps required to optimize an application.
The ideal time to start thinking about optimization is during the application design phase, and time should be budgeted specifically for optimization activities. However, having time available to concentrate on optimizations is only part of the process. Ideally, code should be architected in "optimization-friendly" ways, meaning it is readable, reliable, portable, and maintainable. It is a common misconception that performance-intensive sections of an application require hand-written assembly code. In fact, as processor architectures have increased in complexity, so have compiler optimization technologies.
There are many benefits to letting the compiler do the optimizing. For example, as new processors are introduced, the computing industry is inundated with information on how the newest processor instructions will benefit existing programs. At this point, a developer must assess the available options. The developer can study the new instructions, learn how they work and where they would be most beneficial in the application, and then re-code; or the developer can simply recompile with the latest compiler optimization flag.
Performance Cycle
Once a program is stable and in correct working order, it is time to begin the Performance Cycle. Also known as the Closed Loop Cycle, this process consists of five steps: gathering performance data, analyzing data and identifying issues, generating alternatives to resolve issues, implementing enhancements, and verifying results (Figure 1).
Figure 1: After a program is stable, the Performance Cycle allows iterative analysis for optimization.
As illustrated in Figure 1, this is meant to be an iterative cycle. There is likely to be more than one performance-intensive area in an application, and it is best to concentrate on one area at a time. Focusing on too large a portion of the code will only confuse the data and make it more difficult to see how the applied optimizations directly affect the program.
The first step is to gather performance data, which requires a method to measure the performance of the application. A wall clock or a stopwatch is a highly inaccurate way to time an application. Timing functions inserted into the code are a more effective way to gather performance data. Although they add overhead, such timers provide accurate results. One of the easiest ways to gather timing data is to purchase a good performance analyzer software package. Most performance analyzers will show the time spent in each function of the program and provide an analysis based on this data.
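As a minimal sketch of the timing-function approach, the region of interest can be bracketed with standard C clock() samples; do_work here is just a placeholder for the code under measurement:

    #include <stdio.h>
    #include <time.h>

    static volatile long sink;    /* keeps the work loop from being optimized away */

    static void do_work(void)     /* placeholder for the code being measured */
    {
        long i;
        for (i = 0; i < 10000000L; i++)
            sink += i;
    }

    int main(void)
    {
        clock_t start = clock();  /* sample before the region */
        do_work();
        clock_t end = clock();    /* sample after the region  */
        printf("elapsed: %.3f s\n",
               (double)(end - start) / CLOCKS_PER_SEC);
        return 0;
    }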
Once performance data has been collected, find the routines that take the majority of the application's time and decide whether the numbers make sense. Is there a loop that appears to be taking more time than it should? These areas are known as "hot spots." Identifying them is not always straightforward, and a performance analysis tool that graphically displays the data will often help. Some tools will actually show the path through the code that is executed most frequently. In the absence of such tools, however, strong knowledge of the application, plus time, should yield the same result: identifying the sections of code that take more time than they should to process.
The next two steps require time and knowledge of optimization techniques. Enhancement modifications may be as simple as applying a compiler flag to targeted source files, or they may require redesigning sections of code. Perform a code inspection to diagnose why the application is slow and to see if any changes will increase the speed. Subsequent sections of this article describe various compiler optimizations and where they will be of most help. Success here delivers a lightning-fast application built on the combination of good code and a good compiler.
After the compiler optimization has been applied or the most efficient algorithm has been implemented, it is important to verify the results. Does the program still run correctly? If so, did it actually achieve a performance improvement? Both of these are very important questions. If either of the answers is "No," the head of the cycle has been reached and another iteration is in order.
Compiler Overview
One of the best ways to ensure good, fast optimization is to choose the best compiler. If the target operating system has only one compiler available, then this step is easy. However, as the number of compilers available for an operating system rises, the competition for the highest-performing compiler also increases. This competition motivates compiler vendors to continue developing new optimization technologies, which can greatly benefit the developer.
Once a compiler is selected, read the documentation. While this may seem like a trivial step, typical compiler documentation outlines the technologies implemented in the compiler and provides examples of how to use those features. It is important to show restraint when using compiler optimizations. It is tempting to find the most powerful optimization options and throw all of them at an application. However, no single optimization technique works the same for every application. It is best to apply one optimization at a time, verify the results, and measure any performance improvement before moving on. A recommended order is general optimizations first, then processor-specific optimizations, interprocedural optimizations, and finally profile-guided optimizations, as described below (Figure 2). The methods covered first are the easiest to use, while the later ones have the potential for greater performance improvements.
Figure 2: It is best to apply one optimization at a time, verify the results and measure any performance improvements before moving on.
General Optimizations
Most compilers have a series of options known as general optimizations, ones that bundle together a variety of safe and easy optimization techniques. They will most often have a positive effect on the majority of programs. For example, the "optimize for speed" option is generally named "-O2". This option incorporates optimizations like inlining of intrinsic functions as well as simple loop optimizations. If an application runs correctly in debug mode (i.e. with optimizations turned off, -O0), but not at -O2, it is likely that more complex optimizations will not work either.
Using aggressive optimizations will often reveal errors in code that might otherwise have gone undetected. Another common problem associated with higher optimization levels is precision control. When the compiler is given free rein to fully optimize a program, it may sacrifice precision for speed. In applications that rely heavily on floating point consistency, it is important to tell the compiler to omit optimizations that affect these values. (Most compilers have precision control options.)
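A small sketch illustrates the danger: floating point addition is not associative, so an optimizer that reorders mathematically equivalent expressions can change results.

    #include <stdio.h>

    int main(void)
    {
        float big = 1.0e8f, small = 1.0f;
        /* Mathematically both expressions equal 1.0, but float addition
           is not associative: small is absorbed when added to big first. */
        float left  = (big + small) - big;   /* evaluates to 0.0f */
        float right = small + (big - big);   /* evaluates to 1.0f */
        printf("%f vs %f\n", left, right);
        return 0;
    }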
Processor-Specific Optimizations
Some compilers have processor-specific optimizations that provide hints about the processor the application will run on. For example, if an application targets a processor capable of running special instructions, the compiler needs to know it is free to generate those instructions where they will be of most use. The compiler is also able to vectorize operations in the program that can be done in parallel. Vectorization occurs when one operation is applied to multiple data elements, which are then processed in parallel. The vectorizer automatically uses the necessary instructions to convert sequential code into code that processes 2, 4, 8, or 16 elements in one operation.
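As a minimal sketch (the routine name and the use of C99 restrict are illustrative), a loop like the following is a typical vectorization candidate, since the same operation applies independently to every element:

    /* Each iteration is independent, so the compiler can process
       several elements per instruction (e.g., 4 floats at a time). */
    void scale_add(float * restrict y, const float * restrict x,
                   float a, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }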
Processors are more efficient when instructions are optimally scheduled for the architecture. Most often there is a separate optimization flag for instruction scheduling. This option takes into account specific processor instruction latencies as well as cache line sizes. For example, given the following code:
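(A minimal sketch; the function name mul32 is illustrative, and the essential feature is the multiply by 32 discussed below.)

    int mul32(int x)
    {
        return x * 32;   /* a multiply the compiler can strength-reduce */
    }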
Depending on the target processor specified, the compiler could produce the following instruction sequence:
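(Rendered here in C-like form rather than target assembly; the temporary t is illustrative.)

    /* body of mul32 after strength reduction */
    int t = x + x;   /* x * 2  */
    t = t + t;       /* x * 4  */
    t = t + t;       /* x * 8  */
    t = t + t;       /* x * 16 */
    t = t + t;       /* x * 32 */
    return t;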
Here the compiler changed the multiply by 32 into a series of adds. The resulting code will execute faster on the targeted processor.
With a technology called automatic processor dispatch, it is even possible to target more than one processor. With this option, the compiler inserts processor ID checks and creates multiple code paths in the executable. At run time, the program checks the processor's ID and executes the code path associated with that processor. This results in slightly larger code because of the multiple code paths, but it is a good option if the end user plans to run the application on more than one target system.
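Conceptually, the dispatch code the compiler generates resembles the following hand-written sketch (this illustration assumes GCC's __builtin_cpu_supports on an x86 target; the function names are hypothetical):

    #include <stdio.h>

    static void transform_sse2(void)    { puts("SSE2 code path"); }
    static void transform_generic(void) { puts("generic code path"); }

    int main(void)
    {
        /* With automatic processor dispatch, the compiler inserts an
           equivalent runtime check itself; this hand-rolled version
           only illustrates the idea. */
        if (__builtin_cpu_supports("sse2"))
            transform_sse2();
        else
            transform_generic();
        return 0;
    }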
Interprocedural Optimizations
Programs can often benefit from optimizations occurring across multiple functions or even across the entire program; this technology is known as Interprocedural Optimization (IPO). This option allows the compiler to look at the entire program as though it were one file. In this way, actions and optimizations taken in one routine can affect those in another routine (Figure 3).
Figure 3: In the Interprocedural Optimization process, the compiler can coordinate actions and optimizations taken in one routine with those in another routine.
To use this technique, the source files first must be compiled with the IPO option enabled. The compiler creates object files containing the intermediate language (IL) used by the compiler. Upon linking, the compiler combines all of the IL information and analyzes it for optimization opportunities. Since the compiler has the whole program to analyze in this step, the linking phase will take much longer than normal.
One of the main optimization techniques enabled by IPO is inline function expansion, which occurs when the compiler finds a function call that is frequently executed. Using internal heuristics, the compiler can decide to replace the function call with the actual code of the function. By minimizing jumps through the code, this creates a higher-performing application.
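A minimal sketch, with illustrative names, shows the kind of call site that is a candidate for inline expansion:

    static int sq(int v)
    {
        return v * v;
    }

    int sum_of_squares(const int *p, int n)
    {
        int total = 0;
        int i;
        for (i = 0; i < n; i++)
            total += sq(p[i]);   /* frequently executed call site */
        return total;
    }

    /* After inline expansion, the compiler effectively emits
       total += p[i] * p[i]; eliminating the call and return. */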
Another example of how IPO can benefit a program is through pointer disambiguation. The compiler's main goal is to ensure a correctly running program. If the compiler cannot predict the full impact of an optimization, it will take the safe route and not perform it. For instance, in the following function:
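(A minimal sketch; the routine and its loop body are illustrative, with the essential feature being the two pointer parameters a and b.)

    void accumulate(int *a, int *b, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] += *b;   /* *b cannot be hoisted out of the loop if
                             b might point into the array a */
    }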
In this situation, the compiler will not optimize the routine without the help of IPO because it does not know what the variables a and b are pointing to. To ensure a correctly running program, the compiler must assume these pointers are aliased. Aliasing occurs when two pointers refer to the same location in memory. If, in fact, the pointers are not aliased, IPO will feed this information to the compiler, which will then optimize the routine.
Profile-Guided Optimizations
The actual execution of a program can behave in ways that are not intuitive from simply analyzing the source code. Perhaps one of the most advanced optimization techniques is known as profile-guided optimization (PGO). This technique allows the compiler to use data collected during program execution to aid in its optimization analysis. Knowing which areas of code are executed most frequently, the compiler can be more selective about the optimizations it performs and can also make better decisions about how to perform them (Figure 4).
Figure 4: Profile-Guided Optimization is a three-step process in which the compiler uses data collected during program execution to aid in optimization.
First, the compiler creates an instrumented executable that contains additional statements to write execution information into a file. Next, the instrumented application is run. This executable will perform slowly due to the added overhead of writing the dynamic information, such as basic block numbers with associated execution counts, to a file. Finally, the application is recompiled; the compiler reads in the dynamic information and optimizes the code paths recorded in the file.
When using PGO, ensure that the instrumented executable runs on a dataset that is representative of the typical workload processed by the application. If the dataset used only hits the corner cases, then the compiler spends its time optimizing corner cases that the end user will likely never encounter.
PGO can affect a program in many ways. Since the compiler is aware of the order in which the code is executed, it can make more intelligent decisions about basic block ordering, register allocation, and vectorization. It can also perform enhanced switch statement optimizations. For example, given the following code:
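(A minimal sketch; the function, the case bodies, and the percentages in the comments are illustrative stand-ins for real profile data.)

    int dispatch(int code)
    {
        switch (code) {
        case 3:            /* executed  ~5% of the time */
            return -1;
        case 10:           /* executed ~90% of the time */
            return 1;
        default:           /* executed  ~5% of the time */
            return 0;
        }
    }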
PGO will gather data on the percentage of times each case statement is executed (presented in comments above). To enable the highest performance, the compiler will order the code with the common case occurring first. In this example, case 10 is executed most frequently and having that case evaluated first will eliminate the overhead associated with evaluating case 3.
Profile-guided optimization is not suited to every application. It benefits most when an application has a consistent execution path and performs similarly across multiple datasets. If the application has a relatively flat profile, with code execution evenly distributed, then PGO will likely not provide a performance improvement. Before embarking on a PGO build, it is also important to consider that it requires changing the build structure; adding the three-step process to some build systems can be difficult. This technique should generally be used at the end of the optimization process.
Increasing Demands
Compiler optimization technologies are continually advancing. As the demands placed on embedded applications increase and the required time to market decreases, the compiler plays an integral part in all application development. Armed with a good compiler and some knowledge of optimization techniques, every developer can create high-performance embedded applications.
About the Author
Lerie Kane is a software engineer at Intel Corp. in the Digital Enterprise Group. Lerie has a B.S. in Computer Science from Portland State University and is completing an MBA at Arizona State University. Lerie can be reached at lerie.kane@intel.com.
Copyright © 2005 CMP Media LLC