Performance Optimization of Embedded Software for ARM Processors and AMBA Methodology-based Systems

by Bart Weuts, Sr. Manager, IP and Methodology, CoWare, Inc.

Synopsis:

Designing software for ARM technology-based systems requires verifying that it functions correctly with the hardware, but also that the performance is optimized for both the processor core and the memory architecture subsystem in which it is running. Ideally, all of this should be done very early in the design cycle before hardware prototypes have been created. ARM processors, such as the ARM926EJ-Sâ„¢ and ARM1136J(F)-Sâ„¢ cores, have features such as the Tightly Coupled Memories (TCMs) and varied cache sizes, so understanding the performance of real-time software and how the TCMs and the cache memories can be tuned could have a significant affect on the overall performance, cost, and power consumption of the final system.

If you would like to learn more about optimizing your ARM-based designs, click here to download a paper presentation (PDF, 3,2Mb) entitled "Performance Optimization of Embedded Software for ARM-Based Designs".
Bart Weuts, Sr. Manager of IP & Methodology at CoWare, presented this at the ARM Developer's Conference, October 2004, held in Santa Clara, CA.

In order to speed up embedded software running in a system, one must first understand exactly where it is spending its time and why. With complexities such as caches, TCMs and system interrupts, it takes a combination of understanding what is happening in the ARM processor along with details of how the core and the software are using the memory subsystem. This information can then be used to detect where the problems are and to test the software. A tool is needed that can give this information to the embedded software designer. This article discusses an example JPEG application where detailed analysis of the system and software performance can be validated and then increased by over 6X.

One of the tools that can be used to do this type of analysis is the software analysis views available with ConvergenSC--the SystemC-based ESL design toolset from CoWare. An example system is used to describe some techniques for how this could be done. This example will use the platform shown in Figure 1.

The platform consists of two master devices on a multi-layer AHB bus, namely an ARM926EJ-S processor with instruction and data buses and a dual-port DMA controller. The memory in the system consists of internal ROM and RAM and external memory with an Static Memory Interface (SMI). There are also peripherals on the APB bus that are used to display the results and get input, but do not affect the performance. The model of the AMBAâ„¢ bus (AHB+APB with bridges) is a cycle accurate transaction-level model in SystemC and the ARM926EJ-S processor is also a cycle accurate model working at the transaction level with interfaces in SystemC. The ARM926EJ-S core model functionality and accuracy is modeled and verified by ARM and both pieces of IP are available in the CoWare Model Library for AMBA and ARM designs. Using transaction-level modeling with cycle accuracy enables detailed performance analysis of the system while keeping simulation performance extremely high; well over 1000x of the RTL simulation. The example application running on this platform is a JPEG application.

Figure 1: Example ARM AHB system.

Figure 2: In the top screen, you see the function loading of the application while the bottom window shows the ARM core-to-slave access counts.

Within the ConvergenSC tool set there are many different types of views that can help the user analyze various aspects of the software execution. In this example, the views show master to slave accesses, memory accesses, function tracing and loading, etc., all presented over a time range. Any of these types of views can be attached together so that they can be compared along a consistent time axis that shows â€œreal-timeâ€ simulation time. The top portion of Figure 2 shows a function load view of the JPEG application. Based on this analysis view, this simple application is divided into 3 sections. The first section consists of about 75 percent SW interrupts and getting JPEG data, the second section is about 50 percent Huffman decoding and 50 percent5 SW interrupts and getting JPEG data and the third section is mostly comprised of IDCT. The bottom part of the view shows the master to slave access view. The accesses to instruction and data memory have been filtered out to see other peripherals. This view, combined with the function trace view, could show exactly which functions were responsible for a cache miss when the ARM caches are enabled. In the first simulation, the caches are disabled to see all of the accesses and to get a software performance baseline time. In the simulation, the execution of the JPEG algorithm with I/D caches disabled and no optimization compiler switches is 15.96 ms.

Not all of the views have been described in this article. By using the full suite of SW analysis views, as will be described in the ARM Developersâ€™ Conference presentation, it is possible to draw the following conclusions from the analysis of the JPEG application:

It takes close to 16ms to process one JPEG picture
The idct2d function is consuming most of the time
There is enormous overhead to protect critical sections of software using software interrupts
There is enormous overhead in checking whether the DMA controller should start with the next JPEG data block in the poll_jpeg_frame function
The locality of the software is relatively high and the location and size of the stack and heap memory can be optimized.

The next step is simply to turn the I/D caches on in the ARM simulation model and compile the application with optimization flags. After doing this we can observe how the software behaves in the memory subsystem. The cache sizes can also be changed to understand the effects on the system, but often software developers do not have this option, as the hardware platform is fixed. This kind of information, however, could be very useful for a new design as a recommendation to the hardware team.

After the I/D caches are enabled and the application recompiled, the memory accesses are greatly reduced and the time to process one JPEG image drops from 16ms to 4.85ms.

Figure 3: In this screen capture, the ARM core-to-ROM accesses are displayed.

Figure 4: In this window, the ARM core memory access is compared to the application software function trace.

In Figure 3, the master to slave access view can show the exact location of cache misses. An expansion of the area of cache misses, grouped with the function trace at this exact time, shows exactly which functions are causing the misses and why, as demonstrated in the function trace view and master/ slave view grouping of Figure 4.

With this information it can be determined that in order to increase the performance there are several options:

Increase the size of the cache
Inline small functions that are used infrequently but are causing the misses
Put small functions that are used infrequently in a non-cacheable region so that they do not â€œpolluteâ€ the cache

These options could easily be tried and the simulation re-run to determine their exact effect on performance. The turnaround time is quick to re-compile, reload the software, and run the simulation.

As a summary of the example analysis, if the software interrupts, polling for the next DMA transfer, and the cache misses were addressed by modifying the software, the performance of the JPEG application could be increased. The presentation given during the ARM Developersâ€™ Conference will go through each step of these optimizations in detail and describe how the analysis shows exactly how effective each change has been on the softwareâ€™s performance in the system.

RELATED

Performing these optimizations, results in a â€œreal-timeâ€ simulation time of 2.5ms, which is another 2X improvement in speed. The analysis will also show that the optimized software fits nicely in the current cache sizes. The total optimizations of the caches, compiler switches, and these techniques resulted in a proven 6X reduction in simulation time for processing a single JPEG image fitting in the same hardware platform while still verifying that no timing or functional errors were introduced. These kinds of techniques can be used to bring new levels of performance to existing ARM technology-based designs and help optimally design new ones to avoid â€œover designingâ€ the hardware because the impact of the software could not be determined at the time of the design.

Industry Articles

Performance Optimization of Embedded Software for ARM Processors and AMBA Methodology-based Systems