FPGAs cranked for software radio
By Chris Dick, Senior System Engineer, Manager, Signal Processing Team, Xilinx Inc., San Jose, Calif., EE Times
April 11, 2000 (4:38 p.m. EST)
URL: http://www.eetimes.com/story/OEG20000411S0045
Software-defined radios are highly configurable hardware platforms that provide the technology for realizing the rapidly expanding third-generation digital wireless communication infrastructure. A software-defined radio performs many sophisticated signal-processing tasks, including advanced compression algorithms, power control, channel estimation, equalization, forward error control (Viterbi, Reed-Solomon and turbo coding/decoding) and protocol management.

Digital filters are employed in a number of ways in DSP-based transmitters and receivers. Polyphase interpolators are used in the transmitter for upsampling a baseband signal to the digital intermediate frequency (IF), to ensure compliance with the appropriate regulatory bodies' spectral requirements, and to match the signal's bandwidth to that of the channel. In the receiver section of the system, multi-stage polyphase filters are frequently used in a digital downconverter to perform channelization and decimation. Complex filters are also required for estimating channel statistics and performing channel equalization to compensate for multipath effects, and to correct for phase and amplitude distortion introduced during transmission. Resampling filters are an integral component in all-digital symbol synchronization loops. Finite impulse response (FIR) differentiators are also commonly used during the demodulation process, for example, in systems that use frequency-modulated waveforms for channel access.

While a plethora of silicon alternatives is available for implementing the various functions in a software-defined radio, field-programmable gate arrays are an attractive option for many of these tasks for reasons of performance, power consumption and configurability.

Semiconductor vendors provide a rich range of FPGAs, and the architectures are as diverse as the manufacturers. But some generalizations can be made. Most of the devices are organized as an array of logic elements and programmable routing resources that provide the connectivity among logic elements, FPGA I/O pins and other resources such as on-chip memory. The structure and complexity of the logic elements, as well as the organization and functionality supported by the interconnection hierarchy, distinguish the devices from one another. Other device features such as block memory and delay-locked-loop technology are also significant factors that influence the complexity and performance of an algorithm implemented using FPGAs.

A logic element usually consists of one or more RAM-based n-input lookup tables, where n is between 3 and 6. There may also be additional hardware support in each element to enable high-speed arithmetic operations. Application-specific circuitry is supported by downloading a bit stream into SRAM-based configuration memory. This personalization database defines the functionality of the logic elements, as well as the internal routing. Different applications are supported on the same FPGA hardware platform by configuring the FPGAs with appropriate bit streams.

As a specific example, consider the Xilinx Virtex series of FPGAs. The logic elements, called slices, essentially consist of a pair of four-input lookup tables (LUTs), several multiplexers and some additional silicon support that allows the efficient implementation of carry chains for building high-speed adders, subtracters and shift registers. Two slices form a configurable logic block, which is the basic tile used to build the logic matrix. Some FPGAs also supply on-chip block RAM.
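To make the bit-stream-configured logic element described above concrete, the short Python sketch below models an SRAM-based four-input LUT as a 16-entry truth table whose contents come from the configuration memory. The class name, the example function (a 4-input XOR) and the bit values are illustrative assumptions, not details taken from the article; it is a minimal sketch of the concept, not a description of the Virtex hardware.

# Minimal model of an SRAM-based 4-input lookup table (LUT).
# The "bit stream" simply fills the 16-entry truth table; the example
# programs the LUT as a 4-input XOR. Names and values are illustrative.

class Lut4:
    def __init__(self, config_bits):
        assert len(config_bits) == 16          # one bit per input combination
        self.table = list(config_bits)

    def evaluate(self, a, b, c, d):
        # The four inputs form an address into the configuration memory
        address = (a << 3) | (b << 2) | (c << 1) | d
        return self.table[address]

if __name__ == "__main__":
    # Configuration for a 4-input XOR: entry is 1 when the address has odd parity
    xor_config = [bin(addr).count("1") % 2 for addr in range(16)]
    lut = Lut4(xor_config)
    print(lut.evaluate(1, 0, 1, 1))            # -> 1 (odd number of ones)

Reloading the table with a different bit pattern changes the function the element computes, which is how the same silicon supports different applications.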
Current-generation Virtex silicon provides a family of devices offering 768 to 32,448 logic slices, and from eight to 208 variable-form-factor block memories. Xilinx XC4000 and Virtex devices also allow the designer to use the logic element LUTs as memory, either ROM or RAM. Constructing memory with this distributed-memory approach can yield access bandwidths in the range of many tens of gigabytes per second. Typical clock frequencies for current-generation devices are in the 100- to 200-MHz range.

The objective of the FPGA-DSP architect is to formulate algorithmic solutions for applications that best utilize FPGA resources to achieve the required functionality. This is a three-dimensional optimization problem in power, complexity and bandwidth.

The FIR filter is one of the basic building blocks common to nearly all digital signal processing systems. The output samples are formed by convolving the input sample stream with the filter coefficients. (This operation is also referred to as an inner-product or vector dot-product calculation.) The filter coefficients define the frequency response of the network. In demanding applications that require a large filter order, a high sample rate or a combination of both, the arithmetic workload can be quite substantial. N multiplications and N-1 additions are required to compute a single output value, and high-performance hardware platforms that can provide this level of arithmetic performance are of great interest to the general signal-processing community.

Digital filters are used in a variety of ways in digital receivers and transmitters. More often than not, multirate filters, both decimators and interpolators, are at the heart of many of the key functions in a digital communication system. For example, in a digital exciter, a polyphase interpolator may be part of the signal-processing chain that shapes and translates a signal from baseband up to the digital IF before it is converted to an analog signal by a wideband D/A converter. The signal is subsequently processed by the RF back end. In a digital receiver, multirate filters are used in the digital downconverter, in the channel equalizer and for the digital resampling of a signal in the timing recovery and acquisition loop.

A common option for implementing real-time filters is a software-programmable signal-processing chip. A higher-performance alternative is an ASIC solution. A more recent design option is to exploit the parallelism that an FPGA-based hardware system can provide.

Allocating resources

There are numerous options for implementing FIR filters in an FPGA. The most obvious approach, though not always the most optimal, is to model the technique used in an ASIC or instruction-set-based DSP (ISDSP). This employs a scheduled multiply-accumulate (MAC) unit. An inner-product computation may be partitioned over one or several MAC units, a common approach used by current-generation signal processors. Clearly, this same method can be used in an FPGA implementation. But in the FPGA environment the designer has virtually complete control of the silicon and can decide how much of this resource is allocated to the inner-product engine.

There are several metrics that designers consider when evaluating a particular filter implementation technology. These include performance (sample rate), power, size and cost.
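As a point of reference, the inner-product computation at the heart of the scheduled-MAC approach can be expressed in a few lines of software. The Python sketch below (the coefficients and input values are hypothetical placeholders, not figures from the article) makes the N-multiply, N-1-add workload per output sample explicit; a MAC-based filter, whether in an ISDSP or an FPGA, time-multiplexes this loop onto one or more hardware multiply-accumulate units.

# Minimal direct-form FIR filter sketch showing the N-multiply,
# (N-1)-add inner product per output sample. Values are placeholders.

def fir_filter(x, coeffs):
    """Convolve the input stream x with the filter coefficients."""
    n = len(coeffs)
    history = [0.0] * n           # input-sample time-history buffer
    y = []
    for sample in x:
        history = [sample] + history[:-1]
        # Inner product: N multiplications, N-1 additions per output
        acc = 0.0
        for c, s in zip(coeffs, history):
            acc += c * s
        y.append(acc)
    return y

if __name__ == "__main__":
    coeffs = [0.25, 0.5, 0.25]                    # toy 3-tap smoothing filter
    print(fir_filter([1.0, 2.0, 3.0, 4.0], coeffs))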
In the interest of providing a frame of reference for the FPGA FIR filter implementation, we will compare a MAC-based FPGA FIR filter with several of the many ASIC and ISDSP approaches. First consider the ASIC solution provided by DSP Architectures' DSP-24 signal processor. This device can perform several functions, including filtering. Using a clock frequency of 100 MHz and 24-bit data and coefficients, the DSP-24 can compute a real filter tap in 5 nanoseconds and a complex tap in 10 ns.

At the heart of most MAC engines is a multiplier. To support 24-bit input samples and coefficients, a 24-bit precision multiplier is required. Using Xilinx Virtex FPGA technology, this multiplier is implemented using 348 logic slices. This takes about 11 percent of the circuitry of a low-density device like the XCV300. To realize a complete MAC unit, an accumulator must be cascaded with the multiplier. Using FPGAs, the system architect is free to choose the precision of this component. For this example, we will specify a 56-bit-wide accumulator. The accumulator requires 28 logic slices. Excluding the small amount of control logic needed for address generation and scheduling the arithmetic unit, the MAC engine can be implemented using approximately 348 + 28 = 376 logic slices.

In addition to the arithmetic engine, a complete FIR filter requires storage for both the filter coefficients and the input-sample time-history buffer. Memory resources to choose from in a Virtex FPGA include block RAM or distributed RAM. For large filters, the block memory provides efficient storage for the input samples and/or the filter coefficients. Alternatively, the input samples and coefficients may be stored in distributed memory. Distributed memory is implemented using the elements that form the logic fabric, that is, the 16 x 1 LUTs. As an example, a 16-tap filter using 24-bit input samples and 24-bit coefficients requires a total of 24 slices to realize both the filter memory and coefficient storage. A clock frequency of 100 MHz results in a multiply-accumulate rate of 10 ns/tap. In practice, higher clock speeds would be possible depending on the speed grade of the part.

While a single MAC unit provides 100 mega-MACs of performance, the configurable nature of FPGAs permits the designer to exploit the high degrees of parallelism present in many DSP algorithms to boost performance beyond that. In this case, two MACs could be used in a single filter to provide a throughput of 200 mega-MACs, equivalent to computing a MAC every 5 ns. This dual-MAC design requires approximately 3 percent of an XCV1000 FPGA. Of course, additional MAC units can be used to further reduce the effective MAC cycle time; a simple throughput model at the end of this section makes this scaling explicit. This use of parallelism is an obvious and effective method for increasing the performance of many numerically intensive applications. Accessing concurrency is the key to successful FPGA implementations of virtually all signal-processing functions. However, unlike ISDSPs, where the level of and access to concurrency have been decided by the chip designer, the system designer using FPGA signal-processing hardware is free to allocate silicon resources to exploit a suitable level of concurrency and satisfy the system performance requirements.

The continuing evolution of communication standards and competitive pressure in the marketplace dictate that communication system architects must start the engineering design and development cycle while standards are still fluid.
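The MAC-count scaling discussed above can be captured in a back-of-the-envelope throughput model; this is a rough sketch that ignores control overhead and memory scheduling, and the 100-MHz clock and 16-tap filter simply reuse the example figures in the text. With M MAC units clocked at f_clk, an N-tap filter sustains roughly M x f_clk / N output samples per second.

# Approximate sustained sample rate for a scheduled-MAC FIR filter.
# Relationship: sample rate ~= MAC units * clock / taps (control overhead ignored).

def max_sample_rate(clock_hz, num_taps, num_macs=1):
    """Each output needs num_taps multiply-accumulates, shared across num_macs units."""
    return num_macs * clock_hz / num_taps

if __name__ == "__main__":
    clock = 100e6                                  # 100-MHz FPGA clock
    for macs in (1, 2, 4):
        rate = max_sample_rate(clock, num_taps=16, num_macs=macs)
        print(f"{macs} MAC unit(s): ~{rate / 1e6:.2f} Msamples/s")

For the 16-tap example, one MAC unit at 100 MHz supports about 6.25 Msamples/s and the dual-MAC design about 12.5 Msamples/s, consistent with the 100- and 200-mega-MAC figures above.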
Third- and future-generation communication infrastructure must support multiple modulation formats and air-interface standards. FPGAs provide the ability to achieve this goal with high levels of performance. The software-defined radio implementation of traditionally analog and digital hardware functions opens up new levels of service quality, channel access and cost efficiency. The software in such a radio defines the system's personality, but currently the implementation is often a mix of analog hardware, ASICs, FPGAs and DSP software.
This article is excerpted from a paper prepared for a class presented at DSP World Spring 2000, titled "FPGAs for Digital Communications."