Array Processors Enable Flexibility in FFT Designs

Patrick Fuller, picoChip

Although it has found many uses in signal processing, the Fast Fourier Transform (FFT) has taken on even more importance as a fundamental part of the algorithms used for communications protocols based on orthogonal frequency division multiplexing (OFDM). In the wireless space, OFDM is used in the newer forms of IEEE 802.11 wireless LAN (WLAN) designs and in the IEEE 802.16-2004 (WiMAX) specification for metropolitan area networking. It has also been proposed as the basis for successors to 3G cellular communications systems (or even as an option for increased data rates in a future version of UMTS, which would create the odd terminological situation of "the OFDM mode of WCDMA"). In the wired arena, OFDM is referred to as discrete multi-tone (DMT) and is the basis for the ADSL standard.

The common strand in all these systems is a demand for high-speed FFT processing. Time-to-market pressure often drives vendors to release products that comply with early versions of a standard, thus locking down an FFT architecture. The problem is that as standards change, FFT architectures can change with them. Designers therefore need to implement a flexible FFT architecture in order to accommodate potential specification changes. For example, 802.16e has recently shifted from a fixed FFT size (256 for OFDM, 2048 for OFDMA) to a scalable physical layer (PHY), in which the FFT size varies with the channel bandwidth and a maximum of either 1024 or 2048 is still being debated. This demands a solution that can be implemented on programmable silicon.

Many algorithms and architectures exist for implementing the FFT. Very often, though, a software-programmable platform mandates an algorithm that is optimized for software processing, even though it may be slower or less power-efficient than a hardware-oriented design. Changing the FFT size in a traditional, block-based implementation also poses significant problems for overall system performance because of scheduling considerations. Fortunately, designers now have an answer to this challenge. By implementing the FFT on a reconfigurable processor, designers can efficiently realize a hardware-style FFT that can be adapted over time to meet changing performance requirements. Let's see how.

Understanding FFTs
Figure 1: FFT with (a) decimation in frequency and (b) decimation in time.

Two approaches exist for reducing the DFT to a series of simpler calculations: decimation in frequency and decimation in time. Both require the same number of complex multiplications and additions. The key difference is that decimation in time takes bit-reversed inputs and produces outputs in normal order, whereas decimation in frequency takes inputs in normal order and produces bit-reversed outputs. The manipulation of inputs and outputs is carried out by so-called butterfly stages. Each butterfly stage involves multiplying an input by a complex twiddle factor, e^(-j2πn/N).

A software implementation of the FFT has typically involved block-based processing because the processor must be shared with other tasks, so any change to the implementation may adversely affect the whole system. A hardware implementation typically involves dedicated logic in the form of a pipeline, with buffering in each of the butterfly stages.

Pipeline Approach

Pipeline FFTs use decimation in frequency to avoid reordering the inputs; as a result, the outputs appear in bit-reversed order. The FFT algorithm involves the temporal separation of data, a task performed by the butterfly stage. Because samples are taken from different points in the input stream, pipeline FFTs need a way of buffering and reordering data, and various architectures exist for this. The pipeline FFT processor described here is based on the Radix-2 Single-path Delay Feedback (R2SDF) architecture.

In this approach, the delay of each butterfly stage in the pipeline is dictated by the amount of buffering required for its inputs. The largest delay is in the first stage, where N/2 samples must be buffered before outputs can be generated; the smallest is in the last stage, where only one sample needs to be buffered.

An optimization of the FFT algorithm can be made with respect to the twiddle factors, cutting the number of multipliers required in half. The FFT retains the Radix-2 butterfly structure but has the multiplicative complexity of a Radix-4 algorithm, resulting in two Radix-2 butterflies for each complex multiplier. The resulting architecture is termed R2²SDF. Figure 2 shows the two Radix-2 butterfly stages between each complex multiplier, along with the memory requirements of each butterfly stage.
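To make the butterfly and twiddle-factor structure concrete, here is a minimal radix-2 decimation-in-frequency FFT sketched in C. It is illustrative reference code only (not picoChip firmware): each pass of the outer loop corresponds to one butterfly stage of the pipeline described above, and, as with any decimation-in-frequency formulation, the results are left in bit-reversed order.

```c
/* Minimal radix-2 decimation-in-frequency FFT (illustrative reference code,
 * not picoChip firmware). N must be a power of two; outputs are left in
 * bit-reversed order, as the text describes for decimation in frequency. */
#include <complex.h>
#include <math.h>
#include <stdio.h>

static void fft_dif(double complex *x, unsigned n)
{
    const double pi = acos(-1.0);

    /* Each pass of this loop is one butterfly stage:
     * span = N/2 in the first stage, halving down to 1 in the last. */
    for (unsigned span = n / 2; span >= 1; span /= 2) {
        for (unsigned start = 0; start < n; start += 2 * span) {
            for (unsigned k = 0; k < span; k++) {
                double complex a = x[start + k];
                double complex b = x[start + k + span];
                /* Twiddle factor e^(-j*2*pi*k / (2*span)). */
                double complex w = cexp(-I * 2.0 * pi * k / (2.0 * span));
                x[start + k]        = a + b;        /* "sum" output      */
                x[start + k + span] = (a - b) * w;  /* "difference" path */
            }
        }
    }
    /* Results are now in bit-reversed index order. */
}

int main(void)
{
    double complex x[8] = { 1, 1, 1, 1, 0, 0, 0, 0 };  /* simple test input */
    fft_dif(x, 8);
    for (unsigned i = 0; i < 8; i++)
        printf("X[%u] = %6.3f %+6.3fj (bit-reversed order)\n",
               i, creal(x[i]), cimag(x[i]));
    return 0;
}
```

In a pipeline FFT, the three nested loops above are effectively unrolled in space: each stage becomes a dedicated butterfly with its own delay-feedback buffer, and the twiddle multiplications become the complex multipliers placed between stages.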
Figure 2: Diagram showing an R2²SDF pipeline FFT.

Processor Details

A long instruction word of up to 64 bits allows up to three execution units in each processor to execute in a single cycle at 160 MHz. Each array element has a number of ports for communicating with other elements within the array over a switch fabric, and the processors are linked by this interconnect. As in an FPGA, the interconnect is allocated at compile time, so all communication and processing is predictable and deterministic, unlike a conventional DSP, where an RTOS, scheduling and arbitration make behavior only statistically predictable.
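Because the routing is fixed when the design is compiled, the cycle cost of each element's inner loop is known exactly, so pipeline throughput and latency can be calculated statically rather than estimated. The short C sketch below illustrates that reasoning; the element names and cycle counts are hypothetical and are not taken from any picoChip tool output.

```c
/* Illustrative model only (assumed names, not a picoChip API): with a
 * compile-time-allocated interconnect, per-element cycle budgets are fixed,
 * so throughput and latency follow directly from the budgets. */
#include <stdio.h>

struct element { const char *name; unsigned cycles_per_sample; };

int main(void)
{
    const double clock_hz = 160e6;  /* array clock quoted in the text */

    /* Hypothetical cycle budgets for a few chained array elements. */
    struct element pipe[] = {
        { "butterfly_1", 8 }, { "butterfly_2", 8 },
        { "multiplier",  8 }, { "round",       8 },
    };
    const unsigned n = sizeof pipe / sizeof pipe[0];

    unsigned worst = 0, latency = 0;
    for (unsigned i = 0; i < n; i++) {
        if (pipe[i].cycles_per_sample > worst)
            worst = pipe[i].cycles_per_sample;
        latency += pipe[i].cycles_per_sample;
    }

    /* Throughput is set by the slowest element; summing the per-element
     * costs gives a crude lower bound on end-to-end latency. */
    printf("throughput: %.1f MSample/s\n", clock_hz / worst / 1e6);
    printf("pipeline latency: at least %u cycles\n", latency);
    return 0;
}
```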
FFT Implementation

Figure 3: Implementation of a 256-point pipeline FFT on a reconfigurable processor.

In the pipelined FFT implementation, each array element takes its input from a bus, processes it, and passes its output to the next array element in the pipeline. Because overall throughput is limited by the slowest array element, the processing loop on each array element should ideally take the same number of cycles. For example, if each array element processes each sample in eight cycles, the maximum throughput is 20 MSample/s at 160 MHz.

The 256-point FFT shown in Figure 3 employs eight butterfly stages and three multipliers. The very first butterfly stage uses a memory array element, as this stage requires the most buffering. The memory requirement of the second butterfly stage is half that of the first and can be satisfied by the memory contained in one of the standard array elements. The twiddle factors needed for the multiplications also take up memory, but the amount required exceeds the capacity of a standard array element only for the first of the multipliers. The bit growth that results from each fixed-point multiply is removed by array elements dedicated to rounding.

Increasing FFT Length

For slower sample rates, the overall FFT can be made smaller by combining several functions into one array element. For more substantial throughput increases, further parallelism could be exploited in each butterfly, multiplier and rounding stage, with each function spread across multiple array elements. An alternative approach is to instantiate two FFTs and use an array element to multiplex between them; this multiplexing approach allows sixteen 256-point FFTs to run on a single reconfigurable processor with an overall throughput of 320 MSample/s.
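As a quick sanity check on the figures quoted above, the standalone helper below (written for this article, not part of any picoChip tool flow) lists the delay-feedback buffering needed per butterfly stage of a 256-point pipeline and reproduces the 20 MSample/s and 320 MSample/s throughput numbers.

```c
/* Back-of-the-envelope check of the quoted figures: per-stage delay-feedback
 * buffering for a 256-point R2SDF pipeline, and pipeline throughput at the
 * stated clock and cycle budget. */
#include <stdio.h>

int main(void)
{
    const unsigned n = 256;               /* FFT length                   */
    const double clock_hz = 160e6;        /* array clock from the text    */
    const unsigned cycles_per_sample = 8; /* slowest element's inner loop */

    /* Delay-feedback buffer per butterfly stage: N/2, N/4, ..., 1 samples. */
    unsigned total = 0;
    for (unsigned buf = n / 2; buf >= 1; buf /= 2) {
        printf("stage buffer: %3u samples\n", buf);
        total += buf;
    }
    printf("total buffering: %u samples (N-1)\n", total);

    /* 160 MHz / 8 cycles per sample = 20 MSample/s per FFT instance;
     * sixteen multiplexed instances give 320 MSample/s, as quoted. */
    printf("per-FFT throughput: %.0f MSample/s\n",
           clock_hz / cycles_per_sample / 1e6);
    printf("16 instances: %.0f MSample/s\n",
           16.0 * clock_hz / cycles_per_sample / 1e6);
    return 0;
}
```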