Developing memory-efficient DSP apps without assembly code
By Davor Magdic, Senior Applications Engineer, Texas Instruments Inc., Dallas, Texas, EE Times
March 8, 2002 (10:51 a.m. EST)
URL: http://www.eetimes.com/story/OEG20020307S0080
Digital signal processors are significantly more powerful than they were a few years ago: they now sport hundreds of MIPS, tens or hundreds of kilobytes of internal RAM and various built-in peripherals. Yet the applications are far more demanding as well: multichannel audio-video streaming, speech recognition, high-end telephony; rarely are there memory or MIPS to spare. These applications cannot afford the comfort of workstation application environments, particularly when it comes to using available memory. But their size and complexity no longer allow for a hand-optimized assembly approach, except for critical operations, lest developing and maintaining such applications become a nightmare.
Fortunately, there are ways to develop memory-efficient DSP applications without resorting to assembly code. We have found an approach based on software interrupt threading that is flexible enough for most medium- and high-complexity embedded DSP designs. It rests on two goals: efficient use of the available memory, and taking advantage of the inherent nature of a wide range of DSP applications. The two are not incompatible; by respecting the constraints and rules naturally imposed by what the application should do, we can design it to be both modular and frugal with the memory resources it uses.
Separate contexts of execution need separate stack spaces. Focusing on the stack-allocation aspect of our memory-management policy, we want to design the application so that the stack is as small as possible; that way, we can place it in fast internal on-chip memory. This is essential because stack data is accessed very frequently. If we allow the application stack area to become too large, we may be forced to move it to slow external off-chip memory (provided we have one) and be punished with a significant performance decline.
The key concern for memory-efficient multithreading is use of the stack. Each thread needs space for its automatic (local) variables and function-call parameters, and often needs temporary storage for intermediate results between calculations or data transformations. Also, a thread's context must be placed on its stack upon preemption, so that the interrupted thread can later resume from the point where it was interrupted.
The software-interrupt-threads approach is based on the idea that threads have no context that must be preserved while they are not running, except when preempted. This cannot work with threads organized in the usual fashion, with a main loop that runs forever, block-waits on resources and performs some processing before starting the loop over and block-waiting again. Instead, threads must be organized so that they start execution only when all of their resources are ready, perform their processing without any block-waiting and exit when done, only to be called again when their resources become ready the next time around.
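To make the contrast concrete, here is a minimal sketch of the two thread structures; semaphoreWait(), inputReady and processFrame() are hypothetical names standing in for a kernel primitive and the actual DSP work:

```c
extern void semaphoreWait(void *sem);   /* hypothetical kernel call */
extern void processFrame(void);         /* the actual DSP work      */
static int  inputReady;                 /* placeholder semaphore    */

/* Conventional blocking thread: needs a private stack that stays
 * live across block-waits. */
void taskBody(void)
{
    for (;;) {
        semaphoreWait(&inputReady);     /* block until data is ready */
        processFrame();                 /* then do the work          */
    }
}

/* Run-to-completion software interrupt thread: the scheduler calls
 * it only when its resources are already available, so it never
 * blocks and keeps no context between activations. */
void swiBody(void)
{
    processFrame();                     /* inputs are known to be ready */
}                                       /* exit; nothing to preserve    */
```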
The approach is conceptually similar to that of hardware interrupt threads, where such a thread is merely the interrupt service routine (ISR) called upon the interrupt. When an interrupt occurs, the currently executing context (that is, the CPU registers) is stored on the stack and the ISR for that interrupt is called. The ISR function knows it has been called precisely because the event it cares about has happened (say, a data sample has arrived through the serial port); it performs whatever data movements it has to perform and exits. No context has been stored for this ISR by the system, because it doesn't need one; the next time around, the ISR will again run from the beginning. (The procedure is free to keep state variables, for example a counter of the number of samples received, in global memory; but the space needed for that is relatively small and under strict control of the procedure.) That's why ISRs, also called hardware threads, do not have their own private stack, but use the system stack instead.
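For instance, an ISR that collects serial-port samples might keep its only persistent state, a sample counter, in global memory; the names and the device-read call in this sketch are hypothetical:

```c
#define FRAME_SIZE 256

extern short readSerialPort(void);       /* hypothetical device read */

static short    buffer[FRAME_SIZE];      /* global, not on any stack */
static unsigned sampleCount = 0;         /* the ISR's only state     */

void isrReceive(void)                    /* runs from the beginning  */
{                                        /* on every interrupt       */
    buffer[sampleCount % FRAME_SIZE] = readSerialPort();
    sampleCount++;                       /* no saved context needed  */
}
```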
ISRs are not allowed to use semaphores or to block-wait in any other way. Also, interrupts may have different priorities, so if a higher-priority interrupt occurs while the ISR for a low-priority interrupt executes, the CPU registers (the context of the lower-priority hardware thread) are stored on the stack, the higher-priority interrupt thread runs (that is, its ISR is called) and, when it completes, execution of the interrupted ISR resumes.
These hardware interrupt threads have one important restriction: They have to be short, both to incur minimal interrupt latency and to avoid missing other hardware events. So instead of performing the bulk of the work themselves, they normally just raise a flag (for example, when the DMA transfer of a block of data has completed) or collect the data from the hardware, and schedule a procedure to run later and process the data, once the ISR completes and hardware interrupts are re-enabled. That procedure is the body of a software interrupt thread, and the context in which the procedure executes is that of the software interrupt thread. Relieved of a hardware thread's extremely tight real-time deadlines (on the order of just a couple of hundred cycles in which the ISR has to complete so that another interrupt is not missed), the software thread may have entire milliseconds to process the block of data before the next block arrives. And a software thread is free to schedule another software thread (or itself) for execution, as the logic of data processing dictates.
In the example in Fig. 5, the two hardware threads are very short: The receive hardware interrupt thread merely takes the new sample, stores it in the buffer and schedules the swiProcess software interrupt thread to run if a full frame of input data has been assembled. It's the same for the transmit hardware interrupt thread: When the output hardware is ready to accept a new sample, the transmit interrupt occurs, hwiTransmit runs, sends the next sample from the buffer to the output hardware and returns (it also maintains its own counter).
The software interrupt thread, on the other hand, may be a long, time-consuming filtering function, which gets interrupted every time a new sample arrives and resumes when the hardware interrupt thread's ISR (hwiReceive() or hwiTransmit()) completes. So when swiProcess() is running and a receive interrupt occurs, the CPU registers (the swiProcess thread's context) are stored on the stack, hwiReceive() executes, the registers are restored and swiProcess resumes from where it was interrupted; likewise for the transmit interrupt. The constantly interrupted software thread must only complete processing of the current frame before the next input frame is assembled.
When processing of one frame is complete, the swiProcess procedure exits and the software thread completes execution for that cycle. There is no need for the system to store the thread's context anywhere, since the next time the thread runs it will start executing its procedure from the beginning. The common system stack is used for all three threads, and there are never any unused gaps in the stack area.
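The figure's three threads might look roughly like the following C sketch, using a double buffer on each side. This is illustrative rather than the article's actual code: swiPost(), readPort(), writePort() and filterFrame() are hypothetical stand-ins for the kernel's scheduling call and the device/algorithm interfaces.

```c
#define FRAME 256

extern short readPort(void);                   /* hypothetical input read    */
extern void  writePort(short sample);          /* hypothetical output write  */
extern void  filterFrame(const short *in, short *out, int n);
extern void  swiPost(void (*swi)(void));       /* schedule a software thread */

static short    inBuf[2][FRAME], outBuf[2][FRAME];  /* double buffers        */
static unsigned rxCount = 0, txCount = 0;           /* running counters      */
static unsigned readyIdx = 0;                       /* frame ready for SWI   */

static void swiProcess(void);

void hwiReceive(void)                          /* very short: one sample in  */
{
    unsigned i = (rxCount / FRAME) & 1;
    inBuf[i][rxCount % FRAME] = readPort();
    if (++rxCount % FRAME == 0) {
        readyIdx = i;                          /* a full input frame exists  */
        swiPost(swiProcess);                   /* schedule the long job      */
    }
}

void hwiTransmit(void)                         /* very short: one sample out */
{
    writePort(outBuf[(txCount / FRAME) & 1][txCount % FRAME]);
    txCount++;
}

static void swiProcess(void)                   /* long filtering job; runs   */
{                                              /* to completion, no blocking */
    /* A real system must also confirm that outBuf[readyIdx] has been
     * fully drained by hwiTransmit before overwriting it. */
    filterFrame(inBuf[readyIdx], outBuf[readyIdx], FRAME);
}
```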
The transmit hardware thread would seem to have no say in scheduling the software thread, though strictly speaking that is not correct: the software thread must be sure not only that its input frame is ready, but also that an empty frame is available on the output; that is, that the software thread will not write into the same buffer from which hwiTransmit is still consuming data.
Because there is a single system stack for the hardware and software threads' procedures, we do not have any gaps in the stack: A thread's context is stored on the system stack only if the thread is preempted by a higher-priority thread, and then the new thread's procedure frame (also called an "activation frame" or "stack frame," as opposed to a DSP data frame) is placed right below the preempted thread's context. When the new thread completes, its procedure frame is removed from the stack, the preempted thread's context is restored and the latter resumes execution.
Further steps can be undertaken to reduce the size of the stack. Again, we want to reduce it to make sure the stack can fit in the fast internal memory.
One way is to have procedures avoid allocating large temporary buffers on the stack. Many digital data processing algorithms require some sort of scratch memory: temporary buffers used during a calculation and discarded afterward. In a software-interrupt-based approach we could allocate it on the stack without suffering the penalties we would in the non-software-interrupt case. However, that may not always be best.
Instead, it is a good policy to have all algorithms explicitly allocate their entire scratch space off the stack, for example by asking a memory manager for a block of a certain size and specifying whether it has to be fast internal memory or can be slower external memory. The manager can enforce a simple memory-sharing scheme in which all threads that run at the same priority use, for scratch space, a buffer whose size is the maximum of the scratch needs across all threads within that priority level. Because one software-interrupt thread cannot preempt another running at the same priority, each thread is guaranteed exclusive use of the scratch buffer with respect to its peers.
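A minimal sketch of such a scheme follows; the two-level priority count, the buffer size and the function name are assumptions made for illustration:

```c
#define NUM_PRIORITIES 2
#define SCRATCH_WORDS  1024   /* maximum scratch need across all threads at
                                 any one priority level, sized at build time */

/* One shared scratch buffer per software-interrupt priority level.
 * Threads at the same priority cannot preempt one another, so each
 * has exclusive use of its level's buffer while it runs. */
static int scratchPool[NUM_PRIORITIES][SCRATCH_WORDS];

int *scratchAcquire(unsigned priority, unsigned words)
{
    if (priority >= NUM_PRIORITIES || words > SCRATCH_WORDS)
        return 0;                     /* request exceeds the budget   */
    return scratchPool[priority];     /* no locking needed: same-     */
}                                     /* priority peers can't preempt */
```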
In fact, we can benefit from having the algorithms ask the memory manager for all of their buffer needs, both the scratch space and the permanent space where they store state/history data. Such a memory-allocation scheme makes it easier to move algorithms from application to application.
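One way an algorithm could declare all of its memory needs up front is with a table of request records, in the spirit of (though not identical to) TI's XDAIS memory descriptors; the structure and names here are illustrative:

```c
typedef enum { MEM_INTERNAL, MEM_EXTERNAL } MemSpace;
typedef enum { MEM_PERSISTENT, MEM_SCRATCH } MemKind;

typedef struct {
    unsigned size;     /* bytes requested                 */
    MemSpace space;    /* fast on-chip or slower off-chip */
    MemKind  kind;     /* state/history vs. temporary     */
    void    *base;     /* filled in by the memory manager */
} MemRequest;

/* A filter algorithm declares its needs once; the application's memory
 * manager fills in 'base' for each record before initialization, which
 * keeps the algorithm portable across applications. */
static MemRequest filterMem[] = {
    { 512 * sizeof(short), MEM_INTERNAL, MEM_SCRATCH,    0 },  /* work area */
    { 128 * sizeof(short), MEM_INTERNAL, MEM_PERSISTENT, 0 },  /* history   */
};
```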
This article is based on excerpts taken from ESC class #551, "Efficient memory management for multithreading DSP applications."