DSP processors have conventionally moved to higher levels of performance through a combination of the following techniques:
- Increasing clock cycle speeds
- Increasing the number of operations performed per clock cycle
- Adding optimized hardware co-processing functionality (such as a Viterbi decoder)
- Implementing more complex (VLIW) instruction sets
- Minimizing sequential loop cycle counts
- Adding high performance memory resources
- Implementing modifications including deeper pipelines and superscalar architectural elements
Each of these enhancements has contributed to improved DSP processor performance, and performance increases will continue. Ultimately, however, each of these design enhancements seeks to increase the parallel processing capability of an inherently serial process.

The traditional approach for achieving performance beyond the current level of DSP processor performance was to transition the design to an ASIC implementation. The disadvantages of ASIC implementations include relatively large Non-Recurring Engineering (NRE) costs, relatively high unit volume requirements, limited design modification options, and extended development schedules.

Higher performance implementations of specific DSP algorithms are increasingly available through implementation within FPGAs. Ongoing architectural enhancements, development tool flow advances, speed increases, and cost reductions are making FPGA implementation attractive for an increasing range of DSP-dependent applications. FPGA technology advances have increased clock speeds and available logic resources into and beyond the range required to implement many DSP algorithms effectively at an attractive price point. FPGA implementation provides the added benefits of reduced NRE costs along with design flexibility and future design modification options. If further performance improvements are required beyond the capabilities of current FPGA technology, a reduced-risk path is available to transition the design into an ASIC implementation. Ultimately, FPGAs provide a design platform which offers the flexibility of a general purpose DSP processor implementation with some of the performance increases available with ASIC technology.

When to use FPGAs for DSP

Potential advantages to implementing a DSP function within an FPGA include:
- Performance Improvements
- Design Implementation Flexibility
- System Level Integration
- Reduced Cost and Schedule
Algorithm performance improvement in an FPGA-based implementation over a conventional DSP processor usually results from a combination of factors. The most common are an increased data path width, an increased operational speed, or both. Another source of improvement is the ability to separate the data stream into multiple parallel blocks of data which have limited interdependence. Each data block can then be operated on independently and the results combined, giving higher relative performance.

Taking advantage of every architectural opportunity to maximize the number or speed of operations is essential to maximizing the performance achievable within an FPGA. The critical architectural transformation is translating every serial operation or group of operations into the most parallel implementation possible, up to the limits imposed by the resources available within the target FPGA device for implementing a specific function.

A further performance advantage can be gained if the FPGA can perform operations on multiple channels or streams of data. Example applications include Time Division Multiple Access (TDMA) multiplexing, multiple channel communication protocols, and I/Q math based algorithms. Since each channel can be processed in parallel, the performance advantage associated with each channel is multiplied by the number of channels implemented. Designs which require signal pre-processing can also benefit, since filtering and signal conditioning algorithms are generally straightforward to implement within an FPGA architecture.

When an algorithm is implemented in a structure which takes advantage of the flexibility of a target FPGA architecture, the benefits can be substantial. Algorithms can be customized to adjust to system requirements on the fly. Filter coefficients, implementations, and architectures can be updated to reflect changing system conditions and user requirements.

The implementation of an algorithm within an FPGA also provides a range of implementation options, and the design team must determine and prioritize its design objectives. It is possible to implement an algorithm as a maximally parallelized architecture, or as a highly serial architecture using a single structure which is fed sequential data elements, with the functionality of a loop counter implemented in a hardware counter. A hybrid architecture can also be implemented, which is either a parallel implementation of serial structures or a serial chain of parallel architectures. Each of these design options has its own set of characteristics, including the number of devices required to implement a function, resource requirements within a device, maximum speed, and cost of implementation. The design team has the flexibility to optimize for size, speed, cost, or a target combination of these factors.

An FPGA device also provides a platform for integrating multiple design functions into a single package or group of packages. Integration of functionality can result in higher performance, reduced real-estate requirements, and reduced power requirements. Resources integrated into the I/O circuitry of FPGAs can further improve system performance by allowing control of drive strength, signal slew rate, and implementation of on-board matching, resulting in fewer required system-level components. Further design integration can be achieved by incorporating hard or soft processor cores within an FPGA to implement required control and processing functionality. Pre-verified design functionality available as Intellectual Property (IP) can also be used to implement and incorporate common functionality. The ability to incorporate multiple system-level components and design functionality within a smaller quantity of components can potentially reduce risk, cost, and schedule.

Critical Design Considerations

Every technology has key design considerations which must be implemented with care to enable the highest possible levels of performance. The following sections address a few critical design elements which can significantly affect the reliability or maximum speed of an implemented wide signal path or high-speed algorithm, both of which often apply in DSP applications.

Clock Sourcing and Distribution

Many of the most critical design considerations associated with DSP algorithm implementation within FPGAs are directly or indirectly associated with the clocking architecture of the design. Important clock-related design issues which justify extra design effort include:
- Sufficient board-level device decoupling
- Clean low-jitter external clock sources (consider differential clock distribution for higher rate clocks)
- Careful clock source routing to the appropriate dedicated FPGA I/O pins
Some of the advanced FPGA design issues which must be carefully evaluated and implemented from the earliest design stages include:
- Synchronous Design Implementation
- Pipelining
- Clock Boundary Transitions
- Routing of Critical Clock and Control Signals
Synchronous Design

One of the foundations of successful, efficient, reliable FPGA design in general, and DSP algorithm implementation in particular, is the implementation of a synchronous design. This requires that no portion of the design depend on routing or functional delays intrinsic to the path between two register sets. Asynchronous design techniques should likewise be avoided.

Clock Boundary Transitions

Another critical design consideration is how transitions between clock domains are implemented. The passing of synchronous data across a clock domain boundary must be carefully implemented, with effective mitigation of flip-flop metastability effects. Extensive design checklists and guidelines are available from academic text sources and manufacturer literature.

Pipelining

The efficient implementation of DSP algorithms within FPGAs is based on dividing functional operations into their most primitive arithmetic operations and then separating the operations with registers. This results in a much deeper pipeline through the design, with the advantage of the highest speed performance achievable within the FPGA fabric. Pipelining is an essential element of implementing DSP algorithms within FPGAs. This is a result of the register-rich architecture of conventional FPGA fabrics and the need to register data between operations to maximize speed and performance. FPGA device fabrics have been developed to support this tradeoff between heavy utilization of flip-flop resources to implement registers and the ability to handle wide data streams at increasingly high data rates.

Critical Internal Signal Routing

The routing resources within an FPGA do not all have the same characteristics. Nearly all FPGA architectures have a small subset of "global" signal nets which provide superior performance. These global nets are generally reserved for clock signals and critical control signals. The FPGA design tools will conventionally assign these nets to the signals they perceive to be most critical, i.e. clock and control signals with large fan-outs connected to high-speed circuitry. The design team can also identify signals with high performance requirements by attaching attributes to them. Critical DSP algorithm clock and control signals should be checked to determine whether they need to be assigned to these low-skew, higher performance nets.

Numerical Representation

There are multiple numeric representation styles which can be used to represent data as it passes through an algorithm within an FPGA. The two largest classes of numeric representation are fixed-point and floating-point. Informed selection of the numeric representation style can maximize the utilization of available FPGA resources.

Traditionally, fixed-point implementations are considered first for DSP algorithm implementation within FPGAs. This is due to their perceived ability to operate at higher data throughput rates and their more efficient utilization of FPGA resources. For more complex algorithms, floating-point may also be considered. Higher dynamic range and the elimination of data path scaling are the primary advantages of a floating-point implementation. Floating-point representation can come with the penalties of higher resource utilization and less intuitive design implementations.

Fixed-point numbers can be represented as unsigned integers or as signed values. The two most popular signed representations are two's complement and one's complement. Two's complement is the more popular of the two formats due to its simplified implementation when considering arithmetic overflow conditions. One's complement forms a negative number by inverting every bit of the corresponding positive value. A third format, sign-magnitude, has the characteristic that negative and positive numbers have identical bit patterns with the exception of the leading sign bit.
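To make the signed fixed-point formats discussed above concrete, the following is a minimal Python sketch that prints the bit pattern of an integer in each format. Python is used here purely as a modeling language, and the function names (`twos_complement`, `ones_complement`, `sign_magnitude`, `wrap_signed`) are hypothetical names chosen for this illustration; they do not come from any FPGA tool or library.

```python
def twos_complement(value: int, bits: int) -> str:
    """Return the two's complement bit pattern of value (illustrative helper)."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the requested width")
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def ones_complement(value: int, bits: int) -> str:
    """Return the one's complement bit pattern: negatives invert every bit."""
    limit = (1 << (bits - 1)) - 1
    if abs(value) > limit:
        raise ValueError("value does not fit in the requested width")
    mask = (1 << bits) - 1
    pattern = value if value >= 0 else ~(-value) & mask
    return format(pattern, f"0{bits}b")


def sign_magnitude(value: int, bits: int) -> str:
    """Return the sign-magnitude bit pattern: a sign bit plus the magnitude."""
    limit = (1 << (bits - 1)) - 1
    if abs(value) > limit:
        raise ValueError("value does not fit in the requested width")
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), f"0{bits - 1}b")


def wrap_signed(value: int, bits: int) -> int:
    """Keep the low `bits` bits and reinterpret them as a two's complement number."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


if __name__ == "__main__":
    for name, encode in (("two's complement", twos_complement),
                         ("one's complement", ones_complement),
                         ("sign-magnitude", sign_magnitude)):
        print(f"{name:17s} +5 = {encode(5, 8)}   -5 = {encode(-5, 8)}")

    # An intermediate sum may overflow an 8-bit register and still give the
    # right final answer, provided the final answer itself fits in 8 bits.
    partial = wrap_signed(100 + 100, 8)      # overflows and wraps negative
    print(wrap_signed(partial - 120, 8))     # 100 + 100 - 120
```

For an 8-bit width, -5 encodes as `11111011` in two's complement, `11111010` in one's complement, and `10000101` in sign-magnitude. The last two lines illustrate one reason two's complement is convenient in hardware: wraparound in an intermediate sum cancels out as long as the final result fits in the register.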
There are also more advanced numeric representations and modified implementations of existing standards. Where possible, it is desirable to implement designs based on existing defined standards. This simplifies design comprehension for engineers new to a project and eases future modifications or enhancements.

DSP-Oriented Architectural Features

The architecture of FPGA fabric is inherently suited to implementation of parallel structures. The capability to support very wide buses and implement multiple instantiations of complex structures is a key feature of FPGA technology. There are generally multiple options for implementing individual DSP-related operations within an FPGA. The structures which manufacturers continue to optimize for DSP performance include:
- Clock management and distribution
- Distributed and block memory within the FPGA
- Access to memory external to the FPGA
- Implementation of low-overhead shift registers
- Embedded wide hardware multiplier blocks
- Advanced hardware multiplier blocks with associated accumulator functionality
As an example, consider the implementation of the primary DSP algorithm structure: the Multiply Accumulate (MAC) function. A MAC structure can be implemented in one of several different configurations:
- Both the Multiplier and the Accumulator can be implemented within the logic fabric of the FPGA, taking advantage of dedicated structures such as dedicated high-speed carry chains
- The Multiplier can be implemented in an optimized multiplier block which does not require the use of FPGA logic and the Accumulator implemented within the logic fabric of the FPGA
- Both the Multiplier and Accumulator can be implemented within an advanced Multiplier block requiring the use of no FPGA logic
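As an illustration of the arithmetic that all of these configurations compute, the following Python sketch models a MAC behaviorally. It is a model under stated assumptions rather than a description of any vendor's block: the names `wrap_signed` and `mac` are hypothetical, and the default 48-bit accumulator width is chosen only for illustration.

```python
from typing import Sequence


def wrap_signed(value: int, bits: int) -> int:
    """Keep the low `bits` bits and reinterpret them as a two's complement number."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def mac(samples: Sequence[int], coeffs: Sequence[int], acc_bits: int = 48) -> int:
    """Behavioral model of a multiply-accumulate loop.

    acc_bits is an assumed accumulator register width; the running sum wraps
    the way a fixed-width two's complement register would.
    """
    if len(samples) != len(coeffs):
        raise ValueError("samples and coeffs must have the same length")
    acc = 0
    for x, c in zip(samples, coeffs):
        product = x * c                              # multiplier stage
        acc = wrap_signed(acc + product, acc_bits)   # accumulator stage
    return acc


if __name__ == "__main__":
    print(mac([1, 2, 3], [4, 5, 6]))              # dot product of the two lists
    print(mac([127] * 4, [127] * 4, acc_bits=16)) # accumulator too narrow: wraps
```

With a wide accumulator the first call returns the plain dot product, 32. The second call deliberately uses a 16-bit accumulator that is too narrow for the sum 64516, so the result wraps to -1020, which is why accumulator width has to be sized for the number of terms being summed. Any of the configurations listed above computes this same result; they differ only in which FPGA resources perform the multiply and the add.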
Each of these approaches has its own characteristics. Depending on the architecture, there may be a limit on the number of available optimized hard multiplier blocks. Either a logic-level or a hard multiplier block implementation can offer the speed advantage, depending on the device, and the two approaches use more or less of the available FPGA logic matrix, respectively.

DSP Intellectual Property (IP)

Intellectual Property (IP) provides access to pre-verified and often optimized DSP functionality. IP can be obtained from multiple sources including manufacturer, third party, open access, and university web sites. In general, IP which has been optimized for a particular FPGA architecture can be found on the manufacturer's web site. There are multiple categories of IP to take advantage of, divided into two groups: lower-level operational implementations and higher-level functional implementations. A few categories and examples are shown in Table 1.

Table 1: IP categories

Design Verification and Debug

Design verification is a critical aspect of embedded design, and numerous options are available to help a design team verify designs. Where possible, it is advantageous to divide a design into functional blocks and verify each lower-level functional block before integration into the whole. The FPGA design flow is shown in Figure 1, and simulation of the design via defined testbenches should be considered at each design stage.

Figure 1. FPGA Design Flow

Simulation at the earliest phases of the design, before significant effort has been expended on integrating design blocks, can help avoid extended design debug and testing later in the design cycle. Further schedule gains can be achieved by implementing modular testbench blocks which can be scaled with the design to verify each subsequent design phase.
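The idea of a modular testbench that scales with the design can be sketched in software as follows. This is only a behavioral analogy in Python, not an HDL testbench: `run_testbench`, `scale_three_quarters`, and the reference functions are hypothetical names invented for this example, and the shift-and-subtract scaler simply stands in for whatever block is being verified.

```python
from typing import Callable, Iterable, List, Tuple


def run_testbench(dut: Callable[[int], float],
                  reference: Callable[[int], float],
                  stimuli: Iterable[int],
                  tolerance: float = 0.0) -> List[Tuple[int, float, float]]:
    """Drive the same stimuli into a block under test and a reference model.

    Returns a list of (stimulus, dut_output, reference_output) mismatches.
    """
    failures = []
    for stim in stimuli:
        got, want = dut(stim), reference(stim)
        if abs(got - want) > tolerance:
            failures.append((stim, got, want))
    return failures


def scale_three_quarters(x: int) -> int:
    """Block under test: multiply by 3/4 using a subtract and a shift."""
    return x - (x >> 2)


def reference_scale(x: int) -> float:
    """Reference model of the same block in floating point."""
    return 0.75 * x


def two_stage(x: int) -> int:
    """Integrated design: two scaling blocks chained together."""
    return scale_three_quarters(scale_three_quarters(x))


def reference_two_stage(x: int) -> float:
    return 0.75 * 0.75 * x


if __name__ == "__main__":
    stimuli = range(-8, 9)
    # Verify the lower-level block first ...
    print(run_testbench(scale_three_quarters, reference_scale, stimuli, tolerance=1))
    # ... then reuse the same harness on the integrated chain.
    print(run_testbench(two_stage, reference_two_stage, stimuli, tolerance=2))
```

The same `run_testbench` harness is applied first to the single block and then to the chained design, with only the block, the reference, and the tolerance changing. Tightening the tolerance exposes the truncation error of the shift-based scaler, which is the kind of discrepancy that is cheaper to find at the block level than after integration.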
Conclusion

This article presents a number of topics related to implementing DSP functionality within FPGAs, including establishing when FPGA technology provides a viable alternative to general purpose DSP processors. A design team can benefit from using system-level design tools and from implementing a hierarchical design block simulation flow which verifies elements before they are integrated into higher levels of functionality. By developing an understanding of the overall design cycle and the available development tools, and by implementing trade studies for critical design decisions, a design team can avoid common design implementation missteps. By implementing a system which takes maximum advantage of the resources available within an FPGA, a higher level of performance can be achieved and aggressive schedules can be maintained. By taking advantage of the ability of FPGAs to implement an integrated, modular, high-performance design, a design team can achieve an adaptable, efficient design with lower system cost.

About the Authors

RC Cofer has more than 19 years of embedded design experience, including real-time DSP algorithm development, high-speed hardware, ASIC and FPGA design, and systems engineering. His technical focus is on rapid system development of high-speed DSP and FPGA-based designs. RC holds an MSEE from the University of Florida and a BSEE from Florida Tech. He has presented at conferences on DSP and FPGA topics and is the co-author of the book Rapid System Prototyping with FPGAs, published by Elsevier.

Ben Harding has 15+ years of hardware design experience, including high-speed design with DSPs, network processors, and programmable logic. He also has embedded software development experience in areas including voice and signal processing, algorithm development, and board support package development for numerous Real-Time Operating Systems. Ben has a BSEE from University of Alabama-Huntsville with post-graduate studies in Digital Signal Processing, parallel processing, and digital hardware design. Ben has presented on FPGA and processor design at several conferences and is the co-author of the book Rapid System Prototyping with FPGAs, published by Elsevier.