Is your chip fast enough? Is it too fast? Systems engineers might be paying for more chip than they need, or they may be dangerously close to over-taxing their current processor. Take the guesswork out of measuring processor utilization. Many theories and guidelines dictate how burdened a processor should be at its most loaded state, but which guideline is best for you? This article presents several ways to discern how much CPU throughput an embedded application is really consuming. You can use this information to verify the system's software design against a maximum processor-load target.
Sizing a project
Selecting a processor is one of the most critical decisions you make when designing an embedded system. Your selection is based on the features required to satisfy the control functionality of the final product and the raw computing power needed to fulfill those system requirements. Computing power can be formally specified with benchmarks such as MIPS, FLOPS, Whetstones, Dhrystones, EEMBC marks, and locally contrived benchmarks. Many times, however, you won't know precisely how much raw throughput is needed when you select the processor. Instead you'll have only experience and empirical data to work with (from the microprocessor vendor or the systems engineer).
In any case, once the system development has progressed, it's in the team's best interest to examine the CPU utilization so you can make changes if the system is likely to run out of capacity. If a system is undersized, several options are available: upgrade the processor (if possible), reduce available functionality, or optimize, optimize, optimize.
This article doesn't focus on any of those solutions but illustrates some tools and techniques I've used to track actual CPU utilization. You can use these methods to determine how close to the "edge" a specific project is performing.
Equations 1 through 4:

Equation 1: $U = 100\% - \%\,\text{idle}$

Equation 2: $\%\,\text{idle} = \dfrac{T_{\text{unloaded}}}{T_{\text{loaded}}} \times 100\%$

Equation 3: $U = 100\% - \left(\dfrac{\Delta_{\text{count}}}{N_{\text{max}}} \times 100\%\right)$

Equation 4: $\%\,\text{CPU} = \dfrac{\text{CPU\_util\_pct} \times 100}{255}$

Here $T_{\text{unloaded}}$ and $T_{\text{loaded}}$ are the average background-loop periods in the unloaded and loaded system, $\Delta_{\text{count}}$ is the number of background loops counted in one measurement window, $N_{\text{max}}$ is the maximum possible loop count for that window, and CPU_util_pct is the 8-bit scaled utilization used later in Listings 4 and 7.
Defining CPU utilization
For our purposes, I define CPU utilization, U, as the amount of time not in the idle task, as shown in Equation 1.
The idle task is the task with the absolute lowest priority in a multitasking system. This task is also sometimes called the background task or background loop, shown in Listing 1. This logic traditionally has a while(1) type of loop. In other words, an infinite loop spins the CPU waiting for an indication that critical work needs to be done.
Listing 1: Simple example of a background loop
int main( void )
{
SetupInterrupts();
InitializeModules();
EnableInterrupts();
while(1) /* endless loop - spin in the background */
{
CheckCRC();
MonitorStack();
... do other non-time critical logic here.
}
}
This depiction is actually an oversimplification, as some "real" work is often done in the background task. However, the logic coded for execution during the idle task must have no hard real-time requirements because there's no guarantee when this logic will complete. In fact, one technique you can use in an overloaded system is to move some of the logic with less strict timing requirements out of the hard real-time tasks and into the idle task.
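As a minimal sketch of that technique, a periodic task can simply raise a request flag and let the background loop do the work whenever it gets CPU time. The flag and UpdateDiagnosticLog() below are hypothetical names chosen for illustration:

volatile INT8U log_update_due = 0;   /* set by the periodic task */

void INT_100ms_tasks( void )
{
    log_update_due = 1;              /* request the work; don't do it here */
    /* ... other hard real-time 100ms work ... */
}

/* inside the while(1) background loop: */
if ( log_update_due )
{
    log_update_due = 0;
    UpdateDiagnosticLog();           /* hypothetical slow, non-critical job */
}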
Using a logic state analyzer
Of the several ways to measure the time spent in the background task, some don't require any additional code. Let's look at three techniques.
The first is an external technique and requires a logic state analyzer (LSA). The LSA watches the address and data buses and captures data, which you can interpret. In this test, you should configure the LSA to trigger on an instruction fetch from a specific address and measure the time between successive occurrences of that address.
The address to watch for could be any address within the while(1) loop from Listing 1. The task of identifying an appropriate address is tricky but not inordinately difficult. You can use the map file output by the linker to get close to a good address. Peruse the map file for the address of the main function, and then set up the LSA to look for the occurrence of any address within a limited range beyond the entry to main. This range is justified because, unless there's a large amount of logic between the entry to main and the start of the while(1) loop, the beginning of the loop should be easy to spot with a little iteration and some intelligent tweaking of the address range to inspect.
If your LSA can correlate the disassembled machine code back to C source, this step is even more straightforward because you only have to capture the addresses within the range known to hold the main function (again, see the map file output from the linker) and then watch for the while(1) instruction. If the while(1) loop is moved to its own function, perhaps something like Background(), then the location is much easier to find via the linker map file.
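Here's a minimal sketch of that restructuring, reusing the functions from Listing 1:

void Background( void )      /* entry address appears in the map file */
{
    while(1)                 /* trigger the LSA on any fetch in this loop */
    {
        CheckCRC();
        MonitorStack();
        /* ... other non-time-critical logic ... */
    }
}

int main( void )
{
    SetupInterrupts();
    InitializeModules();
    EnableInterrupts();
    Background();            /* never returns */
}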
If the previous approach isn't appealing, you have other options. By inspecting Listing 1, you'll notice that the CheckCRC function is called every time through the background loop. If you could ensure that this is the only place where CheckCRC is called, you could use the entry to this function as the marker for taking time measurements.
Finally, you could set a dummy variable to a value every time through the background loop. The LSA could trigger on writing to this "special" variable as shown in Listing 2. Of course, I'm supposed to be showing you how using the LSA means you don't have to modify code. However, the code change in Listing 2 is so minor that it should have a negligible effect on the system.
Listing 2: Background loop with an "observation" variable
extern INT8U ping;
while(1) /* endless loop - spin in the background */
{
ping = 42; /* look for any write to ping */
CheckCRC();
MonitorStack();
... do other non-time critical logic here.
}
Regardless of the method you use to trigger the LSA, the next step is to collect the time measured from instance to instance. Obviously, the LSA must be able to time stamp each datum collected. Some of the more sophisticated modern logic-analysis tools can also carry out software performance analysis on the collected data. A function that mathematically averages the instance-to-instance timing variation would help here. Even more helpful is a histogram distribution of the variation, since this shows the extent to which the background-loop execution time varies.
If the LSA doesn't perform any kind of data analysis, you have to export the data and manipulate it using more labor-intensive tools, such as a spreadsheet. The spreadsheet is a good alternative to an LSA-based performance analysis tool as most spreadsheet applications have many statistical tools built in.
To measure CPU utilization accurately, you must also measure the average background-task execution time as accurately as possible. To get an accurate measurement using the LSA method, you must ensure that the background task gets interrupted as little as possible (no interruptions at all is ideal, of course). Essentially two classes of interrupts can disrupt the background loop: event-based triggers and time-based triggers. Event-based triggers are usually instigated by devices, modules, and signals external to the microprocessor. When measuring the average background time, you should take all possible steps to remove the chance that these items can cause an interrupt that would artificially elongate the time attributed to the background task.
It may be possible to disable the timing interrupt using configuration options. If it's possible, the background measurement should be extremely accurate and the load test can proceed. However, if it's impossible to disable the time-based interrupts, you'll need to conduct a statistical analysis of the timing data. Specifically, the histogram analysis of the time variation can be used to help the tester discern which data represent the measured background period that has executed uninterrupted and those that have been artificially extended through context switching.
Figure 1: Sample Histogram
Figure 1 shows a histogram of an example data set. This data set contains a time variation of the measured idle-task period. Analysis of the idle-task period histogram data requires that you know how background loops become interrupted. This knowledge can help you isolate which histogram data to discard and which to keep. Looking at the sample histogram, you might estimate that any data above the threshold of 280μs represents instances where the background task was interrupted. Using this threshold, you would discard all data above 280μs for the purpose of calculating an average idle-task period. For the sake of this example, let's assume that the average of the histogram data below the threshold of 280μs is 180μs. Therefore, in all of the subsequent calculations, we'll use a value of 180μs to represent the average execution time for one cycle through the background loop in an "unloaded" system.
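If your analysis tool can't do this thresholding for you, it's easy to apply offline. Below is a minimal post-processing sketch, assuming the LSA export is an array of instance-to-instance periods in microseconds and using the 280μs cutoff read off Figure 1:

#define THRESHOLD_US  280u

/* Discard samples above the interruption threshold; average the rest. */
double AverageUninterruptedPeriod( const unsigned *samples, int count )
{
    unsigned long sum = 0;
    int kept = 0;

    for ( int i = 0; i < count; i++ )
    {
        if ( samples[i] <= THRESHOLD_US )   /* uninterrupted loop */
        {
            sum  += samples[i];
            kept += 1;
        }
    }
    return ( kept > 0 ) ? ( (double)sum / kept ) : 0.0;
}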
Once you know the average background-task execution time, you can measure the CPU utilization while the system is under various states of loading. Obviously there's no way (yet) to measure CPU utilization directly. You'll have to derive the CPU utilization from measured changes in the period of the background loop. You should measure the average background-loop period under various system loads and graph the CPU utilization.
For example, if you're measuring the CPU utilization of an engine management system under different system loads, you might plot engine speed (revolutions per minute, or RPM) versus CPU utilization. Assume the average background-loop period is measured at each load point, giving the data in Table 1. Note that the background-loop data should only be collected after the system has been allowed to stabilize at each new load point.
Table 1: System load (RPM) vs. average background loop period (T)
RPM | T (μs) |
500 | 249 |
1,000 | 272 |
1,500 | 310 |
2,000 | 350 |
2,500 | 372 |
3,000 | 451 |
3,500 | 703 |
4,000 | 854 |
4,500 | 1,008 |
5,000 | 1,206 |
5,500 | 1,359 |
6,000 | 1,501 |
Now you've collected all the information you'll need to calculate CPU utilization under specific system loading. Recall from Equation 1 that the CPU utilization is defined as the time not spent executing the idle task. The amount of time spent executing the idle task can be represented as a ratio of the period of the idle task in an unloaded CPU to the period of the idle task under some known load, as shown in Equations 1 and 2.
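As a small sketch of those two equations in code (the 180μs constant is the unloaded average derived earlier; a floating-point implementation is assumed for clarity):

/* Equations 1 and 2: utilization from the ratio of the unloaded
   average idle-task period to the loaded average idle-task period. */
#define UNLOADED_PERIOD_US  180.0f

float CpuUtilizationPct( float loaded_period_us )
{
    float idle_pct = 100.0f * ( UNLOADED_PERIOD_US / loaded_period_us );
    return 100.0f - idle_pct;
}
/* CpuUtilizationPct( 249.0f ) returns ~27.7, the 500-RPM row of Table 2 */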
Table 2 shows the results of applying Equations 1 and 2 to the data in Table 1. Figure 2 shows the salient data in graphical form. Of course you'll want to reduce the amount of manual work to be done in this process. With a little up-front work instrumenting the code, you can significantly reduce the labor necessary to derive CPU utilization.
Figure 2: CPU utilization vs. system load (RPM)
Table 2: System load data and calculated utilization
RPM | T (μs) | % Idle | % CPU |
500 | 249 | 72.3 | 27.7 |
1,000 | 272 | 66.2 | 33.8 |
1,500 | 310 | 58.1 | 41.9 |
2,000 | 350 | 51.4 | 48.6 |
2,500 | 372 | 48.4 | 51.6 |
3,000 | 451 | 39.9 | 60.1 |
3,500 | 703 | 25.6 | 74.4 |
4,000 | 854 | 21.1 | 78.9 |
4,500 | 1008 | 17.9 | 82.1 |
5,000 | 1206 | 14.9 | 85.1 |
5,500 | 1359 | 13.2 | 86.8 |
6,000 | 1501 | 12.0 | 88.0 |
Counting background loops
The next method is actually a simple advance on the use of the LSA and histogram. The concept is that, under ideal nonloaded situations, the idle task would execute a known and constant number of times during any set time period (one second, for instance). Most systems provide a time-based interrupt that you can use to compare a free-running background-loop counter to this known constant. Let's say we use a 25ms period task to monitor the CPU utilization. We enhance the while(1) loop of Listing 2 so that a free-running counter is incremented every time through the loop as shown in Listing 3. A free-running counter uses a variable that, when incremented, is allowed to overflow. No math protection is needed (or desired) because the math that will look for counter changes can comprehend an overflow situation. Math protection would just add unnecessary overhead.
Listing 3: Background loop with a loop counter
INT16U bg_loop_cnt = 0;   /* defined here; declared extern where used */
int main( void )
{
SetupInterrupts();
InitializeModules();
EnableInterrupts();
while(1) /* endless loop - spin in the background */
{
bg_loop_cnt++;
CheckCRC();
MonitorStack();
... do other non-time critical logic here.
}
}
We still know the average nonloaded background-loop period from the LSA measurements we collected and postprocessed. Recall that in the earlier example, the average idle-task period was calculated as 180μs. Therefore, in a 25ms time frame, the idle task would execute 138 times if it were never interrupted.
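The arithmetic behind that count:

$$N_{\max} = \left\lfloor \frac{25\,000\ \mu\text{s}}{180\ \mu\text{s}} \right\rfloor = \lfloor 138.9 \rfloor = 138$$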
We must modify the 25ms task as shown in Listing 4 to use this count to calculate the CPU utilization, and we have to retain the previous loop count so that a delta can be calculated. The delta indicates how many times the background loop executed during the immediately previous 25ms timeframe. Comparing this value to the maximum loop count indicates how much time was spent in the idle task versus doing other processing.
Listing 4: Additional logic added to a period task for CPU use calculation
/* predetermined average Idle Task Period */
#define IdleTaskPeriod ( 180 ) /* microseconds */
/* Unloaded 'max' bg loops per 25ms task */
#define BG_LOOPS_PER_TASK ( 25000 / IdleTaskPeriod ) /* 138 */
INT8U CPU_util_pct; /* 0 = 0% , 255 = 100% */
void INT_25ms_tasks( void )
{
static INT16U prev_bg_loop_cnt = 0;
static INT16U delta_cnt;
INT8U idle_pct;
delta_cnt = bg_loop_cnt - prev_bg_loop_cnt;
prev_bg_loop_cnt = bg_loop_cnt;
if ( delta_cnt > BG_LOOPS_PER_TASK )
delta_cnt = BG_LOOPS_PER_TASK;
idle_pct = (INT8U)( (255 * delta_cnt) / BG_LOOPS_PER_TASK );
CPU_util_pct = 255 - idle_pct;
..... do other 25 millisecond tasks .....
return;
}
One thing you'll notice when comparing the actual C code to Equation 3 is that the delta loop counter is multiplied by 255, not 100% as indicated in the equation. This is a common scaling trick used to maximize the resolution of a variable. Since the minimum variable size that can hold the CPU utilization number is an 8-bit value, 100% is scaled to be the full range value of this variable space. To convert back to real percentage, use Equation 4.
The conversion from computer units back into engineering units can be done after you've collected the data. This allows the end result to retain as much resolution as possible. Of course, if you're using floating-point math, you can do the conversion in the actual C code.
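For example, a scaled reading of 71 converts back through Equation 4 as

$$\%\,\text{CPU} = \frac{71 \times 100}{255} \approx 27.8\%$$

which matches the 500-RPM row of Table 3 to within the quantization error of the 8-bit scaling.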
Once these changes have been implemented, you must be able to retrieve the CPU-utilization value from the processor. You can do this through various instrumentation ports or through communications protocols such as UART, J1850, CAN, and so forth. You then pull the data into a spreadsheet and manipulate it to create the graph shown previously in Figure 2. Table 3 shows how the data would look along with some of the intermediate calculations. Some instrumentation solutions allow the scaled value to be converted from computer units to engineering units automatically.
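As one hypothetical retrieval sketch, a slow periodic task could stream the scaled value to a host for logging; UART_SendByte() stands in for whatever transmit primitive your platform actually provides:

extern INT8U CPU_util_pct;            /* from Listing 4 */

void INT_1000ms_tasks( void )
{
    UART_SendByte( CPU_util_pct );    /* host converts per Equation 4 */
    /* ... other 1-second tasks ... */
}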
Table 3: Scaling the output for human consumption
RPM | CPU Util (Scaled) | % CPU |
500 | 71 | 27.7 |
1,000 | 86 | 33.8 |
1,500 | 107 | 41.9 |
2,000 | 124 | 48.6 |
2,500 | 132 | 51.6 |
3,000 | 153 | 60.1 |
3,500 | 190 | 74.4 |
4,000 | 201 | 78.9 |
4,500 | 209 | 82.1 |
5,000 | 217 | 85.1 |
5,500 | 221 | 86.8 |
6,000 | 224 | 88.0 |
Automating the system
Although counting background loops is more convenient than collecting all of the data on an LSA, it still requires a fair amount of human preparation and verification. Every time the software set is changed, a human tester must verify that the background loop hasn't changed in some way that would cause its average period to change. If the loop has changed, a human must reconnect the LSA, collect some data, statistically analyze it to pull out elongated idle loops (loops interrupted by time and event tasks), and then convert this data back into a constant that must get injected back into the code. The good news is you can employ the CPU to do all of this work itself, and more accurately to boot.
I'll call this the automated method. The automated method calculates, in real time, the average time spent in the background loop. There are two main advantages to having the software calculate the average time for the background loop to complete, unloaded:
- You can accurately detect preemption (rather than making a guess from histogram data). Detecting preemption enables you to discard average data that's been skewed by interrupt processing.
- You aren't required to recharacterize the system after each software release, saving lots of time and avoiding errors.
For this method to work, the system must have access to a real-time clock. Most microprocessors can create a clock tick at some period (a fraction of the smallest time interrupt). For this method to have any usable accuracy, the period of the real-time clock should be less than 1/20th of the average background-task execution period. Therefore, in this example, we need a real-time clock with a resolution of 180μs/20, or 9μs. The code in Listings 5 through 7 assumes a 5μs real-time clock tick.

Also, each interrupt service routine, exception handler, and preemption mechanism must indicate that a context switch has happened. A preemption indicator flag facilitates this notice. The background loop can use the flag to discern that the time measured has been elongated by another task. See Listing 5 for an example of how a preemption indicator can be used. Notice that the PreemptionFlag variable is more than a Boolean value; you can use it to indicate which actual event executed since the last time the flag was cleared. I won't explore this flag any further here, but it may spark some ideas for expanding the method to measure the time spent in each individual task and not just in the background.
Listing 5: A task harnessed with a preemption indicator
extern INT32U PreemptionFlag;
void SomeEventISR( void )
{
PreemptionFlag |= 0x0080;
..... do Event logic .....
return;
}
Once a flag exists that can indicate preemption, the background-loop logic can be enhanced to measure its own execution period. Listing 6 shows a completely modified background loop with the logic necessary to measure and calculate the average, uninterrupted, idle-task period.
Listing 6: Idle task period measurement with preemption detection
INT16U bg_loop_cnt = 0;      /* defined here; declared extern where used */
INT32U PreemptionFlag = 0;   /* defined here; declared extern where used */
extern INT16U FiltIdlePeriod;
int main( void )
{
SetupInterrupts();
InitializeModules();
EnableInterrupts();
while(1) /* endless loop - spin in the background */
{
MonitorIdlePeriod();
CheckCRC();
MonitorStack();
... do other non-time critical logic here.
}
}
void MonitorIdlePeriod( void )
{
static INT16U RT_Clock, prevRT_Clock;
INT16U IdlePeriod;
bool interrupted = TRUE;
bg_loop_cnt++;
prevRT_Clock = RT_Clock;
DisableInterrupts(); /* start atomic section */
RT_Clock = GetRTClock();
if ( PreemptionFlag == 0 )
interrupted = FALSE;
PreemptionFlag = 0;
EnableInterrupts(); /* end atomic section */
IdlePeriod = RT_Clock - prevRT_Clock;
if ( !interrupted )
FiltIdlePeriod = Filter( FiltIdlePeriod, IdlePeriod );
}
Notice that the idle-period measurement, IdlePeriod, is filtered into FiltIdlePeriod in the source code shown in Listing 6. The definition of the filter is beyond the scope of this article; it could be as simple as a first-order lag filter or as complex as a ring buffer implementing a running average.
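For illustration only, here's a minimal sketch of a first-order lag version of Filter(), assuming a 1/8 filter gain and a companion signed typedef INT16S:

/* First-order lag: move the output one-eighth of the way toward each
   new sample. The 1/8 gain is an assumption; a larger divisor smooths
   more but responds more slowly. */
INT16U Filter( INT16U filtered, INT16U raw )
{
    INT16S diff = (INT16S)( raw - filtered );      /* signed error term */
    return (INT16U)( (INT16S)filtered + diff / 8 );
}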
The CPU-utilization calculation logic found in the 25ms logic must also be modified to exploit these changes. Listing 7 shows how you can modify this piece of code to use a filtered idle period (scaled in real-time clock counts).
Listing 7: Refined CPU measurement
/* How many RT clocks (5 us) happen each 25ms */
#define RT_CLOCKS_PER_TASK ( 25000 / 5 )
INT8U CPU_util_pct, FiltCPU_Pct; /* 0 = 0% , 255 = 100% */
void INT_25ms_tasks( void )
{
static INT16U prev_bg_loop_cnt = 0;
static INT16U delta_cnt;
INT8U idle_pct;
INT32U idle_time;
PreemptionFlag |= 0x0004; /* indicate preemption by 25ms task */
delta_cnt = bg_loop_cnt - prev_bg_loop_cnt;
prev_bg_loop_cnt = bg_loop_cnt;
idle_time = (INT32U)delta_cnt * FiltIdlePeriod; /* cast avoids 16-bit overflow */
if ( idle_time > RT_CLOCKS_PER_TASK )
idle_time = RT_CLOCKS_PER_TASK;
idle_pct = (INT8U)( (255 * idle_time) / RT_CLOCKS_PER_TASK );
CPU_util_pct = 255 - idle_pct;
FiltCPU_Pct = Filter( FiltCPU_Pct, CPU_util_pct );
..... do other 25 millisecond tasks .....
return;
}
This logic now uses the filtered idle period instead of a constant to calculate the amount of time spent in the background loop. Of course, the logic must still know how much total time exists between measurements, but now the time constant is relative to the resolution of the real-time clock instead of a hard-coded average idle period. Note that a filtered CPU utilization value has also been added to assist you if the raw CPU-usage value contains noise.
How much is enough?
There are several schools of thought regarding processor loading. Although the academic field of processor loading is constantly evolving, the seminal work regarding schedulability of tasks was published in 1973 and covers a topic called rate monotonic scheduling [1]. Rate monotonic analysis (RMA) provides a set of constraints and equations that help a system architect determine whether a set of tasks will be schedulable. Liu and Layland show that, as the number of tasks increases, the maximum cumulative CPU utilization available for guaranteed task scheduling approaches a ceiling of 69% [1].
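For n periodic tasks scheduled with fixed priorities, the Liu and Layland utilization bound is

$$U_{\max}(n) = n\left(2^{1/n} - 1\right), \qquad \lim_{n\to\infty} U_{\max}(n) = \ln 2 \approx 0.693$$

For two tasks the bound is about 82.8%; it falls toward the 69% ceiling as tasks are added.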
A less scientific (and perhaps more heuristic) limit is the 70 to 80% range. This range makes sense in the context of RMA once you recognize that RMA is a fairly restrictive theory: it assumes a fixed priority for every task, whereas many real-time implementations allow tasks to raise their priority to accomplish critical functions.
However, opinions abound. A study I found on the Internet indicates that system designers should strive to keep CPU use below 50%:
"A CPU with really high utilization will lead to unpredictable real-time behavior. Also, it is possible that the high priority tasks in the system will starve the low priority tasks of any CPU time. This can cause the low priority tasks to misbehave. . . . [K]eep the peak CPU utilization below 50 %."2
Refining your tools
In this article, I've demonstrated three ways to measure CPU utilization. All three methods have been used successfully to develop and verify an automotive powertrain control system. Although these methods demonstrate a simple evolution of the CPU-utilization technique, you can refine them even further for your needs. As I stated previously, you could use the preemption flag to a greater degree to measure all sorts of CPU utilization. You can also modify the system to measure more than just total CPU utilization: for instance, measuring the time in each service routine or event handler. This knowledge might help illuminate where the majority of time is being spent in the system, letting you decompose and optimize the sections of code that monopolize the processor.
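As a starting point, here's a hedged sketch of per-handler timing that reuses GetRTClock() from Listing 6; the accumulator name is made up for illustration:

extern INT32U EventISR_ticks;     /* RT-clock ticks spent in this handler */

void SomeEventISR( void )
{
    INT16U entry = GetRTClock();

    PreemptionFlag |= 0x0080;     /* as in Listing 5 */
    /* ..... do Event logic ..... */

    EventISR_ticks += (INT16U)( GetRTClock() - entry );
}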
Although I've mentioned that some logic-analysis equipment contains software-performance tools, I didn't explain how to exploit these tools. Every software-performance tool is a little different, but if your project team has such a tool available, it's in your best interest to discover whether the tool can help you understand your system loading. Profiling tools can also help you understand where the system is spending a majority of its time. You should use these tools if they're available to you.
Michael Trader is a senior applied specialist with EDS' Engineering and Manufacturing Services business unit. A software engineer since 1989, he currently develops embedded powertrain control firmware in the automotive industry. He has a BSEE from the Milwaukee School of Engineering and an MS-CSE from Oakland University in Rochester, Michigan.
Endnotes
1. Liu, C., and J. Layland, "Scheduling Algorithms for Multiprogramming in a Hard Real Time Environment," Journal of the Association for Computing Machinery, January 1973.
2. EventHelix.com, "Issues In Realtime System Design," 2000-2001, www.eventhelix.com/RealtimeMantra/IssuesInRealtimeSystemDesign.htm
Further reading
Labrosse, Jean J., MicroC/OS-II: The Real Time Kernel, CMP Books, 2002.
Oshana, Robert, "Rate-monotonic Analysis Keeps Real-time Systems on Schedule," EDN Access Design Feature, September 1997, www.reed-electronics.com/ednmag/article/CA81193
Krishna, C. M., and Kang G. Shin, Real-Time Systems, WCB/McGraw-Hill, 1997.