General purpose processors (GPPs) are attractive for the digital signal processing requirements of software defined radio. For platforms where the power budget allows the use of GPPs, development cost and time to market can be significantly reduced compared to dedicated DSPs. Parts count can also be reduced, since most target platforms include a GPP in any case. By some metrics, the highest raw computing power available is found in high-clock-rate GPPs rather than in DSPs.

However, GPPs have complex microarchitectures, cache subsystems, and I/O interfaces. These features can cause significant performance problems if not managed well. This article discusses some of the design approaches required to exploit GPPs as high speed signal processing engines. Vanu, Inc. has built a number of software radio systems based on GPPs and found that current generation processors on commodity boards running Linux are fast enough for a wide range of SDR waveforms, including the GSM/GPRS and IS-95 CDMA cellular telephone standards.

Time stamp-based processing

For many signal processing tasks, the timing jitter on a GPP board is larger than the required timing accuracy. For example, a GSM basestation requires 3 microsecond transmission timing accuracy for certain handoff operations. This is much tighter than the jitter of the x86/Linux platform used in the Vanu, Inc. GSM basestation implementation.

Time stamp-based signal processing is an effective design technique to overcome this problem. Incoming data from the analog-to-digital converter is time stamped. The GPP computes transmission times and labels outgoing data to the digital-to-analog converter with the time stamp at which it should be clocked out. As a result, processing in the GPP can be decoupled from wall-clock time without affecting transmission timing accuracy. The engineering challenge is to develop time-stamping schemes with minimal hardware and software overhead.
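The time stamp idea can be sketched in a few lines of C. This is an illustration under stated assumptions, not the scheme used in any Vanu product: the `block_header_t` layout and the `usec_to_ticks` and `schedule_tx` names are hypothetical, and the time stamp is assumed to count ticks of the converter sample clock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical block header: the time stamp counts ticks of the
 * converter sample clock, not wall-clock time on the GPP. */
typedef struct {
    uint64_t timestamp;    /* sample-clock tick of the first sample in the block */
    uint32_t num_samples;  /* number of samples carried in the block */
} block_header_t;

/* Convert a duration in microseconds to sample-clock ticks, rounded to
 * the nearest tick. Assumes usec * sample_rate_hz fits in 64 bits. */
uint64_t usec_to_ticks(uint64_t usec, uint64_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return (usec * sample_rate_hz + 500000u) / 1000000u;
}

/* Label an outgoing block so the hardware clocks it out a fixed delay
 * after the first sample of the received block. The GPP may finish this
 * computation at any time before the deadline; the transmit instant is
 * set by the label, not by when the code happened to run. */
block_header_t schedule_tx(block_header_t rx, uint64_t delay_usec,
                           uint64_t sample_rate_hz, uint32_t tx_samples)
{
    block_header_t tx;
    tx.timestamp = rx.timestamp + usec_to_ticks(delay_usec, sample_rate_hz);
    tx.num_samples = tx_samples;
    return tx;
}
```

In a sketch like this, operating system jitter only matters if it makes the GPP miss the deadline entirely. The accuracy of the transmit instant depends on the sample clock in the converter hardware.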
Memory system issues

The memory access costs measured on the 2.8 GHz Intel Pentium III based desktop PC used for an SDR system are as follows:

Level 1 cache access latency = 2.1 cycles
Level 2 cache access latency = 18.6 cycles
Main memory access latency = 400.7 cycles

On this processor the floating point unit can sustain one multiply-accumulate (MAC) every cycle. In the time it takes to access main memory once, this processor can perform four hundred MACs. On a GPP board, therefore, memory access efficiency in signal processing algorithms pays off heavily. An algorithm that appears to be more than two orders of magnitude worse in terms of MACs can easily be the faster algorithm if it has a better cache hit rate. Algorithms must therefore be designed and benchmarked differently on a GPP than on DSPs, where MAC count is the dominant performance effect.

Processor and compiler

Choosing different programming styles to express the same mathematical transform can lead to dramatically different performance. The following examples refer to Intel Pentium III code compiled with the GNU C compiler (GCC) version 2.95.2 at optimization level 2.

Conditionals: Branches are surprisingly expensive in sophisticated modern CPUs. The execution pipelines of these processors rely on instruction prefetching. When a conditional branch occurs, the processor's branch predictor must guess which direction the condition will go. An incorrect prediction causes all instructions issued since the branch to be aborted and their results dropped. Branches that are poorly predicted end up costing many times more than an arithmetic operation.

A simple function can compile, under GCC, to assembly language output containing a conditional branch, while a logically identical conditional expression compiles to assembly code with no conditional branch. This difference in assembly code can have a substantial performance impact, depending on the data values that occur at run time.
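As an illustration of the two styles, consider a hard-decision slicer that maps each soft symbol to a bit. This is a hypothetical example, not a listing from the Vanu code base, and the function names are invented for the sketch. The first version uses an if statement in the inner loop; the second computes the same result with arithmetic only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branching form: one data-dependent decision per sample. On noisy
 * input the sign is close to random, so the branch predictor has
 * little to work with. */
void hard_decide_branchy(const int32_t *soft, uint8_t *bits, size_t n)
{
    assert((soft != NULL && bits != NULL) || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (soft[i] < 0) {
            bits[i] = 1;
        } else {
            bits[i] = 0;
        }
    }
}

/* Branch-free form: reinterpret the value as unsigned and shift the
 * sign bit down to bit 0. Negative inputs give 1, all others give 0. */
void hard_decide_branchless(const int32_t *soft, uint8_t *bits, size_t n)
{
    assert((soft != NULL && bits != NULL) || n == 0);
    for (size_t i = 0; i < n; i++) {
        bits[i] = (uint8_t)((uint32_t)soft[i] >> 31);
    }
}
```

Whether the two versions actually produce different machine code depends on the compiler, its version, and the target. Some compilers will turn the first form into branch-free code on their own, so it is worth inspecting the generated assembly (for example with `gcc -S`) and benchmarking on representative data before committing to either style.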
When doing high-speed signal processing, it is valuable to examine inner loops closely and see whether changing the mathematics can reduce the number of if statements. This is usually worthwhile even if the modified version requires more arithmetic.

Data types: Each modern processor is designed to operate most efficiently on a particular data type, represented in C as an int. Operations on this 32-bit or 64-bit data type are usually faster than operations on shorter data types such as 16-bit (short) or 8-bit (char). The simple operation of a conditional branch based on the sign of a value is two to three times more expensive on a Pentium III for a short than for an int.

The data-type choice is not just about different integer widths. On two prominent high-end CPU families, Intel and PowerPC, the floating-point execution unit has a pipelined single-cycle multiply which is, surprisingly, much faster than the integer multiply. For this reason, it can pay to use floating-point code in the middle of integer pipelines. However, conversions back and forth between integer and floating point must be carefully minimized. Conversion is a slow operation which can easily become a major computational cost.

Another area that requires care is the use of unsigned values. They are not any more expensive in themselves than signed values. However, expressions which operate on combinations of signed and unsigned values can turn out to be surprisingly expensive.

In the end, a variety of issues must be considered when doing high speed signal processing on a GPP board. In addition to those discussed here, there are further issues related to I/O, memory bandwidth and instruction scheduling. It is worthwhile to invest in automatic code generation tools for the most performance-critical components that get rebuilt for each waveform or port, such as finite impulse response filters and fast Fourier transforms.
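A very small code generator of the kind mentioned above might look like the following sketch. It is illustrative only: the function name `generate_fir_source` and the shape of the emitted code are assumptions, not a description of any existing tool. It writes the C source of a fully unrolled FIR filter whose taps appear as literal constants, so that the compiler can schedule the multiplies without a loop or a coefficient table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Emit C source for a fully unrolled FIR filter into `out`.
 * The generated function computes sum(taps[i] * x[i]) for one output.
 * Returns the number of characters written (excluding the terminating
 * NUL), or -1 if the buffer is too small. */
int generate_fir_source(const float *taps, size_t ntaps, const char *name,
                        char *out, size_t out_size)
{
    assert(taps != NULL && ntaps > 0 && name != NULL && out != NULL);

    size_t used = 0;
    int n = snprintf(out, out_size,
                     "float %s(const float *x)\n{\n    return ", name);
    if (n < 0 || (size_t)n >= out_size) return -1;
    used += (size_t)n;

    for (size_t i = 0; i < ntaps; i++) {
        /* One term per tap; %.9e keeps enough digits to round-trip a
         * float, and the trailing 'f' makes it a float literal. */
        n = snprintf(out + used, out_size - used, "%s%.9ef * x[%zu]",
                     i ? "\n         + " : "", (double)taps[i], i);
        if (n < 0 || (size_t)n >= out_size - used) return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, out_size - used, ";\n}\n");
    if (n < 0 || (size_t)n >= out_size - used) return -1;
    used += (size_t)n;
    return (int)used;
}
```

A build step could call this once per waveform, write the result to a source file, and compile it with the rest of the system. A production tool would go further, for example by choosing unroll factors and data layouts from measurements on the target processor.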
Overall, exploiting general-purpose platforms for SDR requires new algorithm designs, system designs, and programming styles.

John Chapin (jchapin@vanu.com) is chief technical officer at Vanu Inc. (Cambridge, Mass.).