Embedding FPGAs in DSP-driven Software Defined Radio applications

By Rodger Hosking and Richard Kuenzler, Embedded.com
Jun 13 2005 (23:31 PM)
URL: http://www.embedded.com/showArticle.jhtml?articleID=164302833

With the advent of software defined radio (SDR) platforms in military aerospace, and more recently in some consumer radio and electronics segments, field programmable gate arrays (FPGAs) are taking on increased importance as reprogrammable digital signal processing (DSP) engines.

Field programmable logic has been the circuitry of choice for connecting high-speed peripherals like wideband A/D and D/A converters, digital receivers and communication links to programmable processors in embedded real-time systems.

FPGAs (field programmable gate arrays) are especially well suited to handle the clocking, synchronization, and the other diverse timing circuitry needed to tame these specialized devices. In addition, FPGAs are excellent for data formatting tasks like serial-to-parallel conversion, data packing, time stamping, multiplexing, and packet formation.

But DSP has become one of the most significant capabilities of FPGAs, as evidenced by sharp increases in engineering and marketing investments in this technology on the part of FPGA vendors over the last few years.

Digital Signal Processing Tasks
In traditional software radio receiver systems, the translated and filtered baseband signal is sent into the DSP as a stream of complex samples of a time domain waveform. The DSP must handle all the demodulation tasks as well as higher-level decisions based on the analysis of the signals being received.

Signal intelligence receivers typically classify a signal by first performing a spectral analysis of the signal to estimate what type of modulation was used, and then apply demodulation algorithms to determine if useful information is extracted, such as intelligible speech or meaningful data.

Other significant tasks for the DSP include decryption, data storage, channel switching, signal routing to other systems, logging activity, and sending audio or digital data to an operator for listening or display.

In a cell phone base station, the number of digital signal processing tasks grows with each new communications standard. The proliferation of sophisticated digital voice and data protocols requires decoding, convolution, framing, error correction, and vocoding.

Compounding the processing load for these additional tasks is the steady increase in sampling rate requirements. To support new applications such as wideband CDMA, DSPs are being moved closer and closer to the antenna.

To meet these needs, DSP clock rates have increased to over 200 MHz and many of the new devices feature two or more hardware multipliers. Nevertheless, because the DSP is one of the most expensive and power-hungry resources in the system, minimizing its substantial workload can be quite important.

The role of FPGAs in SDR
During the last five years, FPGAs have made dramatic gains in several critical areas in order to accommodate DSP functions. The gate density of these devices has nicely followed Moore's law, doubling approximately every year and a half. Some recently announced devices are boasting of 10 million gates! Gate arrays are typically structured as logic cells equipped with memory and capable of performing math functions. These high-density logic cells are now available in a wide range of basic "cores" to support fast multipliers, block memory to handle FFT processing and distributed memory for FIR filters.

FPGA synthesis tools now support "parameterizable" cores that accept bit width definitions and automatically generate core structures to match the signal processing accuracy requirements without wasting gates.

A wide range of front end design tools are now available to suit the various input preferences of both hardware and software systems engineers. These include block diagram system generators, schematic processors, and high-level input language compilers for Verilog and VHDL. The speed, accuracy and ease-of-use of new simulators simplify the testing of new designs and minimize the time spent debugging applications.

Third party vendors are now offering high-level IP cores to complement the standard cores supplied by the FPGA vendors. These range from complete DSP processors to application specific blocks like high-speed internet modems. With these new commercial "off-the-shelf" functions, FPGAs are now able to penetrate both the general-purpose ASIC market and the DSP market.

Of even greater significance, the digital signal processing capabilities of FPGAs can often outperform general purpose DSPs. For example, if a wideband FIR digital filter requires 32 MACs (multiply/accumulate operations) within a single clock cycle, the general purpose DSP with only two multipliers will fall far short of the mark. On the other hand, FPGAs can easily incorporate 32 MAC cores to handle the task.
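The arithmetic behind this comparison can be sketched in a few lines. This is only a rough model: it assumes each multiplier completes one multiply/accumulate per clock cycle and ignores memory bandwidth and pipeline effects. The function name cycles_per_output is hypothetical and used only for illustration.

```python
import math

def cycles_per_output(taps: int, multipliers: int) -> int:
    """Clock cycles needed per FIR output sample, assuming every
    multiplier finishes one multiply/accumulate (MAC) per cycle."""
    if taps < 1 or multipliers < 1:
        raise ValueError("taps and multipliers must be positive")
    return math.ceil(taps / multipliers)

# A 32-tap filter on a DSP with two hardware multipliers
dsp_cycles = cycles_per_output(taps=32, multipliers=2)

# The same filter on an FPGA with 32 parallel MAC cores
fpga_cycles = cycles_per_output(taps=32, multipliers=32)

print("DSP cycles per output: ", dsp_cycles)
print("FPGA cycles per output:", fpga_cycles)
```

Under these assumptions the two-multiplier DSP needs 16 clock cycles for each output sample, while the FPGA produces one output on every clock.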

Flexible and Reusable
A COTS-based software radio system is an ideal platform for implementing a wide range of applications. By using the new FPGA design tools and IP libraries for these highly-configurable FPGA-based COTS board-level products, system designers can eliminate the need for custom boards. Since FPGA "hardware" can be radically reconfigured with no new board design, the same products used in the current project can be easily retooled for future applications.

As new software radio algorithms are developed, they can first be tested on the DSP, taking advantage of a wider range of code generation, simulation and optimization tools. When complete, the algorithm can be ported to the FPGA for better real-time operation or to support the processing burden of many parallel channels. Finally, for transition to high-volume production, most FPGA designs can be easily converted into mask tooling for custom ASICs.

While reprogramming the FPGAs to handle new functions can be somewhat more complicated than writing new algorithms for a DSP, this level-of-effort gap appears to be closing. No longer the exclusive domain of the hardware designer, FPGA design tools are now being used more and more extensively by software engineers, ensuring that this major technology shift will represent the mainstream paradigm for future embedded system design.

Software Radio Module Application
One illustrative example of the power and flexibility of a DSP-driven FPGA SDR platform (shown in Figure 1 below) is a dual channel digital receiver daughter card module that attaches to a quad DSP processor VME board. It contains two 12-bit A/D converters capable of operating at sampling rates up to 100 MHz and two digital down converters that translate and filter selected portions of the wideband digitized input.

An on-board FPGA accepts real outputs from both A/D converters as well as complex base band outputs from both of the digital down converters. The FPGA implements the VIM (Velocity Interface Mezzanine) interface to deliver data directly into each DSP or PowerPC on the processor board, where FIFO buffers support DMA block data transfers at rates up to 400 MB/sec.

With an eye towards adding DSP capability, a natural choice for the FPGA in this kind of platform is the Xilinx Virtex-II family. With 96 dedicated 18x18 multiplier blocks and over 200 kBytes of block RAM, the XC2V3000 offers a generous mix of signal processing resources, even for some of the more substantial applications.

In the basic factory configuration of the module, the FPGA still performs the traditional tasks of timing, formatting, and glue logic for the various devices on board. Because these functions are relatively simple, they consume only 6% of the programmable logic. This leaves 94% of the logic blocks, all 96 multipliers and virtually the entire block RAM available for adding DSP algorithms.

To help demonstrate the power of these untapped resources, an engineering project was launched to implement a high-performance FFT engine. Since communications, radar, and signal intelligence systems all utilize FFTs for tracking, tuning and image processing operations, the FFT remains one of the most popular algorithms for benchmarking processor performance.

In a nutshell, the FFT accepts a block of input time-domain samples and converts them into a block of output frequency-domain samples. Because the calculation is rather complex, it consumes a significant share of DSP processing resources and becomes a prime candidate for FPGA implementation.

Constructing the FFT
One of the most efficient methods of performing the FFT calculation is an iteration of the radix-4 "butterfly" algorithm. Inside each butterfly, four input data points are multiplied by coefficients from a sine table and then combined to produce four output points. This butterfly operation is repeated until all input points are processed, four at a time, representing a single "stage". To implement a 4,096 point FFT, six stages of butterflies are required.
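The radix-4 decomposition can be illustrated in software. The following is a minimal recursive sketch in Python, not the pipelined FPGA design described in this article: it uses floating point and computes twiddle factors on the fly, where hardware would read them from a sine table. The function names (fft_radix4, num_radix4_stages, dft_direct) are hypothetical.

```python
import cmath

def num_radix4_stages(n: int) -> int:
    """Number of radix-4 butterfly stages for an n-point FFT (n a power of 4)."""
    if n < 1:
        raise ValueError("length must be a power of 4")
    stages = 0
    while n > 1:
        if n % 4 != 0:
            raise ValueError("length must be a power of 4")
        n //= 4
        stages += 1
    return stages

def fft_radix4(x):
    """Recursive decimation-in-time radix-4 FFT of a power-of-4 length list."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n == 0 or n % 4 != 0:
        raise ValueError("length must be a power of 4")
    quarter = n // 4
    # Four interleaved sub-sequences, each transformed independently
    subs = [fft_radix4(x[r::4]) for r in range(4)]
    out = [0j] * n
    for k in range(quarter):
        # Three twiddle multiplies per butterfly; the first input needs none
        t0 = subs[0][k]
        t1 = subs[1][k] * cmath.exp(-2j * cmath.pi * k / n)
        t2 = subs[2][k] * cmath.exp(-2j * cmath.pi * 2 * k / n)
        t3 = subs[3][k] * cmath.exp(-2j * cmath.pi * 3 * k / n)
        # Combining by +1, -1, +j, -j needs only adds and swaps, no multipliers
        out[k] = t0 + t1 + t2 + t3
        out[k + quarter] = t0 - 1j * t1 - t2 + 1j * t3
        out[k + 2 * quarter] = t0 - t1 + t2 - t3
        out[k + 3 * quarter] = t0 + 1j * t1 - t2 - 1j * t3
    return out

def dft_direct(x):
    """Direct O(n^2) DFT, used here only as a reference for checking."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

samples = [complex(m % 7, -(m % 3)) for m in range(64)]
max_error = max(abs(a - b) for a, b in zip(fft_radix4(samples), dft_direct(samples)))
print("stages for 4,096 points:", num_radix4_stages(4096))
print("max error vs direct DFT:", max_error)
```

Each level of recursion corresponds to one stage of butterflies, which is why a 4,096 point transform (4 raised to the sixth power) needs six stages.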

One of the benefits of using an FPGA over a conventional programmable processor for computing FFTs is the large number of multipliers available for simultaneous calculation.

In the 4,096-point example above, a total of 60 multipliers are needed to implement all six FFT butterfly stages in parallel. Since the XC2V3000 has 96 multipliers available, it becomes obvious why FPGAs can often dramatically outperform a standard DSP processor having only two or four hardware multipliers, especially for algorithms like the FFT.
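The article does not break down how the figure of 60 multipliers is reached. One accounting that happens to reproduce it is sketched below, and every parameter in it is an assumption rather than a description of the actual design: three twiddle multiplies per radix-4 butterfly, four real multipliers per complex multiply, and one stage whose twiddle factors are all trivial and therefore need no multipliers.

```python
STAGES = 6                      # from the text: six stages for 4,096 points
TRIVIAL_STAGES = 1              # assumption: one stage has twiddles equal to 1
TWIDDLES_PER_BUTTERFLY = 3      # assumption: first butterfly input is unscaled
REAL_MULTS_PER_COMPLEX = 4      # assumption: (a+jb)(c+jd) done with 4 multipliers

multiplying_stages = STAGES - TRIVIAL_STAGES
fft_multipliers = (multiplying_stages
                   * TWIDDLES_PER_BUTTERFLY
                   * REAL_MULTS_PER_COMPLEX)

print("multipliers under these assumptions:", fft_multipliers)
```

Other allocations could arrive at the same total, so this should be read as a plausibility check rather than as the engine's real structure.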

Since the FFT is inherently a block-oriented algorithm, the FFT operates most efficiently when a freely addressable RAM supports quick access to all input and output samples. However, this ideal model of random data availability is contrary to the sequential input data samples streaming from the A/D converter.

Fortunately, the configurable block RAM resources of the FPGA can be retooled to form a memory structure that feeds the appropriate samples into four input data memory ports of the butterfly engines in parallel, thus solving the data availability problem. This proprietary memory architecture allows subsequent input blocks to be processed in a continuous, systolic manner so that all of the multipliers in all six stages can be productively engaged all the time.

For every FPGA clock cycle, each radix-4 butterfly processes four input samples. Therefore, when the FPGA processing clock is equal to the A/D clock, the architecture above is capable of running four times faster than real-time. With suitable hardware multiplexing schemes, this same FFT engine can be used to handle four streams of input data instead of just one.

In this example, with two A/D converters and the FPGA all clocking at 100 MHz, the FPGA is only working at half capacity. But with a little extra effort, the engine can be set to handle 50% input overlap processing of both channels to fully utilize the hardware. In this case, the pipelined execution time is an amazing 10.24 microseconds for each FFT! This is four times faster than the time it takes to collect the 4,096 input points at a 100 MHz sampling rate, consistent with performing four FFTs in real time.
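The timing figures above follow from simple arithmetic, sketched here using the values given in the text. The variable names are hypothetical, and the model assumes the engine consumes exactly four samples per clock with no stalls.

```python
FFT_POINTS = 4096
SAMPLE_CLOCK_HZ = 100e6     # A/D sampling rate
FPGA_CLOCK_HZ = 100e6       # FPGA processing clock
SAMPLES_PER_CLOCK = 4       # one radix-4 butterfly input set per clock
CHANNELS = 2

# Time to collect one block of input samples from a single A/D converter
collect_us = FFT_POINTS / SAMPLE_CLOCK_HZ * 1e6

# Pipelined time for the engine to process one block
fft_us = FFT_POINTS / SAMPLES_PER_CLOCK / FPGA_CLOCK_HZ * 1e6

def utilization(overlap: float) -> float:
    """Fraction of engine capacity used by CHANNELS streams at a given overlap."""
    new_samples_per_fft = FFT_POINTS * (1.0 - overlap)
    ffts_needed_per_s = CHANNELS * SAMPLE_CLOCK_HZ / new_samples_per_fft
    ffts_available_per_s = 1e6 / fft_us
    return ffts_needed_per_s / ffts_available_per_s

print("collection time (us):", collect_us)
print("FFT time (us):       ", fft_us)
print("utilization, no overlap: ", utilization(0.0))
print("utilization, 50% overlap:", utilization(0.5))
```

With no overlap the two channels keep the engine half busy. At 50% overlap each channel starts a new FFT every 2,048 samples, which doubles the workload and fills the remaining capacity.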

FFT Enhancements
Since only 60 of the 96 multipliers were used for the FFT algorithm, additional features were incorporated. At each of the four complex input streams, an optional Hanning window can be applied, requiring eight extra multipliers. Since coefficients for the FFT and for the Hanning window utilize separate FPGA table memories, alternate input windowing functions can be substituted for the Hanning window.
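A software sketch of the windowing step shows where the eight multipliers go: each complex sample needs two real multiplies (one for the real part, one for the imaginary part), and there are four parallel streams. The coefficient formula below is the common periodic form of the Hanning window; the exact table contents used in the FPGA are not given in the article, and the function names are hypothetical.

```python
import math

def hann_coefficients(n: int) -> list:
    """Periodic Hanning (Hann) window coefficients for an n-point block."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def apply_window(samples, coefficients):
    """Scale each complex sample by a real coefficient: two real multiplies."""
    if len(samples) != len(coefficients):
        raise ValueError("block and window lengths must match")
    return [complex(s.real * w, s.imag * w)
            for s, w in zip(samples, coefficients)]

window = hann_coefficients(8)
block = [complex(1.0, -1.0)] * 8
windowed = apply_window(block, window)

PARALLEL_STREAMS = 4
window_multipliers = PARALLEL_STREAMS * 2   # real and imaginary part per stream
print("window multipliers:", window_multipliers)
```

Because the coefficients are simply a table, substituting a different window means loading different values, with no change to the multipliers.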

Eight more multipliers are used to perform an optional power calculation at the FFT output, in which the real and imaginary components of each of the four outputs are squared and then added together. Finally, an averager stage adds the two outputs of the 50% input overlap FFTs to improve signal-to-noise characteristics.
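The power and averaging steps can be sketched as follows. The power calculation follows the description above directly. The averager is shown as a plain mean of two power spectra, which is an assumption, since the text says only that the two overlapped outputs are added. The function names are hypothetical.

```python
def power_spectrum(bins):
    """Squared magnitude of each complex FFT output: two real multiplies each."""
    return [b.real * b.real + b.imag * b.imag for b in bins]

def average_spectra(first, second):
    """Element-wise mean of two power spectra from overlapped input blocks."""
    if len(first) != len(second):
        raise ValueError("spectra must have the same length")
    return [(a + b) / 2.0 for a, b in zip(first, second)]

# Two small example spectra standing in for consecutive overlapped FFT outputs
spectrum_a = power_spectrum([complex(3, 4), complex(0, 1)])
spectrum_b = power_spectrum([complex(4, 3), complex(0, 3)])
averaged = average_spectra(spectrum_a, spectrum_b)

print("power A: ", spectrum_a)
print("power B: ", spectrum_b)
print("averaged:", averaged)
```

Squaring the real and imaginary parts of four parallel outputs accounts for the eight multipliers mentioned above, while the final addition and the averager need only adders.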

At the output of the FPGA, a multiplexer allows the results of each signal processing stage to be directed to the processor interface. Figure 2 below shows all of the basic function blocks inside the FPGA of the daughter card module shown in Figure 1.

Conclusion
At an execution speed of 10.24 microseconds for a 4,096 point complex FFT, this FPGA engine outperforms benchmarks for an optimized FFT algorithm running on a 400 MHz G4 PowerPC by a factor of ten!

In order to achieve a calculation dynamic range of better than 90 dB, several techniques were employed to reduce the rounding and truncation errors inherent in FPGA integer arithmetic. After optimization for execution speed by deploying the available FPGA resources, the entire design utilized 76 of the 96 multipliers, 99% of the logic slices, and 97% of the block RAM of the XC2V3000 device.
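The multiplier total can be cross-checked against the figures given in the earlier sections. The short tally below uses only numbers stated in this article; the variable names are hypothetical.

```python
AVAILABLE_MULTIPLIERS = 96   # XC2V3000 dedicated 18x18 multiplier blocks

fft_multipliers = 60         # six radix-4 butterfly stages
window_multipliers = 8       # optional Hanning window on four complex streams
power_multipliers = 8        # optional power calculation on four outputs

used = fft_multipliers + window_multipliers + power_multipliers
unused = AVAILABLE_MULTIPLIERS - used

print("multipliers used:  ", used)
print("multipliers unused:", unused)
```

The sum agrees with the 76 multipliers reported for the complete design, leaving 20 of the 96 unused.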

Although this particular FPGA component is still expensive because of its recent introduction, two concentric subsets of the ball grid array footprint pattern accommodate two smaller devices in the same family, to save costs for less demanding applications.

Rodger Hosking is vice president, and Richard Kuenzler is Senior Design Engineer at Pentek, Inc.
