The four Rs of efficient system design
New design languages and new chips and systems mean a whole new set of design gotchas for today's developers. Once-simple tasks become difficult and, thankfully, once-difficult tasks become easy. This article for senior designers looks at newer high-level design techniques and how they can improve logic and system design.

New FPGA chips are approaching ASIC-like density and performance, with their inherent cost advantages and reprogrammability clearly in their favor. With embedded DSP and CPU cores now available in FPGAs, these programmable logic devices are a real alternative for many embedded systems designers. Many designers from the ASIC world are turning to FPGAs for new designs. According to research firm Gartner Dataquest in its Market Trends report "ASIC and FPGA Suppliers Answer the Call," more than 74,000 "design starts" used FPGAs in 2004, but only around 4,000 used ASICs.

It's not easy to switch from an ASIC to an FPGA design flow, however. True, complex FPGA design shares some features with ASIC design, but under the hood many of the steps are fundamentally different. The prebuilt nature of FPGAs encourages a "use it or lose it" mentality regarding features and capabilities. FPGA design, more often than ASIC design, must therefore match the functional requirements to the chip itself.

As high-end FPGAs encroach on ASIC performance, ASIC design techniques are being adapted for FPGA design. Two such techniques, physical synthesis for high-performance timing closure and C++ synthesis for C-based design, illustrate the subtleties involved. The algorithms for C++ synthesis are the same for ASICs and FPGAs. By leveraging "technology-aware" synthesis (through technology-specific libraries), the same design can be implemented in either or both types of silicon fabric. The fact that C++ specifications aren't tied to specific hardware is a primary advantage. To make full use of physical synthesis, however, you need a tool that understands the FPGA's internal hardware structure.

We'll first introduce today's conventional hardware-design flow and examine its associated problems. We'll then explain alternative approaches to hardware design using C/C++, comparing the pros and cons of timed design languages with those of untimed, or algorithmic, methods. Toward the end, we'll explain why you must consider interconnect delay and physical effects in the design process to achieve optimal performance.

Algorithmic C synthesis
A way around the bottlenecks of the traditional flow described in the next section is to design, simulate, and synthesize C representations. By using pure untimed C++ to describe functional intent, engineers can move up an abstraction level for designing hardware, reducing design time, creating a more repeatable design flow, and preserving the option of implementing the design in either an ASIC or an FPGA. An added benefit is that by exploring multiple microarchitectural solutions, engineers can often produce better designs than those created through traditional RTL methods.

Traditional hardware design
MATLAB works well for validating and proving the initial algorithm, although many design teams also develop C/C++ models to verify that the whole system meets functional and performance specifications. In the discussion that follows, we'll use the term "untimed algorithm" for algorithms written either in MATLAB or in pure ANSI C/C++. Based on project requirements, system architects then partition the design into blocks of hardware or software.
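To make the term concrete, here is a minimal sketch of the kind of pure, untimed, floating-point C++ an algorithm team might hand off; the filter, its tap count, and its coefficients are invented for illustration and aren't taken from any particular design.

    // Untimed, floating-point FIR filter: pure functional intent.
    // No clocks, no interfaces, no implementation decisions.
    const int TAPS = 8;
    const double coeff[TAPS] = {0.02, 0.08, 0.15, 0.25,
                                0.25, 0.15, 0.08, 0.02};

    double fir(double sample, double delay_line[TAPS])
    {
        // Shift the delay line and insert the new sample.
        for (int i = TAPS - 1; i > 0; --i)
            delay_line[i] = delay_line[i - 1];
        delay_line[0] = sample;

        // Multiply-accumulate across all taps.
        double acc = 0.0;
        for (int i = 0; i < TAPS; ++i)
            acc += coeff[i] * delay_line[i];
        return acc;
    }

Nothing in this function says anything about how the hardware will eventually be built, which is exactly what makes it a useful starting point.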
Each hardware block's function is represented by a floating-point algorithm. At this point, either the system designer or the hardware designer quantizes the floating-point algorithm into an integer or fixed-point representation. These fixed-point algorithms are represented in MATLAB using Simulink or in untimed C++ using bit-accurate data types. After validating the fixed-point algorithm, the hardware designer starts the manual process of creating Verilog or VHDL at the RTL abstraction, a process that can be subdivided into three distinct phases.

C flow: the next generation
The ideal flow should be based on industry-standard ANSI C/C++, which has been the language of choice for software and system-level modeling for many years. The pure, untimed C/C++ written by system designers is an excellent source for creating hardware because it's devoid of implementation details. After verification, the hardware engineer uses a C synthesis tool to automatically generate optimized RTL from the C/C++ representation. The RTL output is then used to drive existing RTL-synthesis tools, as shown in Figure 1. With this flow, you can synthesize the untimed C/C++ directly into a gate-level netlist. This maximizes flexibility and provides a source that is "malleable," that is, capable of targeting ASICs, FPGAs, highly compact solutions, and highly parallel, fast solutions.

The translation from MATLAB to C/C++ is still manual, but because these domains are conceptually very close, the translation is relatively quick and easy. Untimed C/C++ adds a lot of value by simulating much faster than the MATLAB Simulink environment, and it is therefore ideally suited for system-level validation. Moreover, generating the intermediate RTL provides a timed "comfort zone" for existing flows by allowing you to validate the implementation decisions made by the C synthesis tool. RTL is also a useful point at which to stitch the various functional blocks together: large portions of today's designs exist as IP blocks delivered in RTL, so RTL is a natural place in the design flow to integrate and verify the entire hardware system. Design teams can take full advantage of existing RTL-design tools for test insertion or power analysis, for example. The ideal flow based on algorithmic synthesis of pure, untimed C/C++ addresses all of the traditional bottlenecks.

SystemC
Designers use SystemC for system-level verification, but the complexity of the language creates barriers for system and hardware designers alike. Using SystemC at the RTL provides little (if any) value over VHDL or Verilog, as shown in Figure 2. The value comes in at higher levels of abstraction, where SystemC is useful for system-level verification, but as yet there is no consensus on what should and should not be synthesizable in the SystemC language. Coding interface definitions in SystemC removes the ability to easily make interface tradeoffs, since such tradeoffs require complex changes to the C++ source (for example, a dual-port memory interface is substantially different from a CPU interface). Editing SystemC models is not an effective way to explore architectural alternatives. SystemC synthesis can, however, be accomplished by using SystemC data types with pure, untimed C++. This "algorithmic SystemC" source is the highest abstraction of SystemC and provides the greatest value to the end user: it is technology independent, interface independent, and microarchitecture independent.
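As a rough illustration of this style, the floating-point filter sketched earlier could be quantized simply by swapping its double variables for SystemC bit-accurate types while leaving the algorithm itself untimed; the widths below are assumptions chosen for illustration, not recommendations.

    #define SC_INCLUDE_FX            // enable SystemC fixed-point types
    #include <systemc.h>

    const int TAPS = 8;
    typedef sc_fixed<12, 3> sample_t;   // assumed 12-bit samples, 3 integer bits
    typedef sc_fixed<10, 1> coeff_t;    // assumed 10-bit coefficients
    typedef sc_fixed<26, 7> acc_t;      // accumulator sized to sum eight products

    acc_t fir_fixed(sample_t sample, sample_t delay_line[TAPS],
                    const coeff_t coeff[TAPS])
    {
        // Same untimed algorithm as before, now bit-accurate.
        for (int i = TAPS - 1; i > 0; --i)
            delay_line[i] = delay_line[i - 1];
        delay_line[0] = sample;

        acc_t acc = 0;
        for (int i = 0; i < TAPS; ++i)
            acc += coeff[i] * delay_line[i];   // bit-accurate multiply-accumulate
        return acc;
    }

Only the data types change; there is still no notion of clocks, resets, or interfaces, which is what keeps the source technology, interface, and microarchitecture independent.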
Adding the ability to generate a cycle-accurate SystemC model enables an algorithmic C synthesis tool to benefit from the SystemC verification environment, yet avoid the problem of hard-coding technology intent into SystemC descriptions.

Handel-C
Using a proprietary language means users cannot use alternative simulation or synthesis tools. As a result, many engineers prefer standards-based alternatives. Theoretically, the manual translation of MATLAB to Handel-C should be relatively painless because the Handel-C representation is close to pure C. In practice, coercing Handel-C into a form the synthesis engine can adequately handle requires intensive work by an expert user. Here again, the pseudo-timing constructs required for the synthesis and simulation of Handel-C representations are foreign to both system-level and hardware designers. All of the implementation "intelligence" associated with the design has to be hard-coded into the Handel-C, which therefore becomes implementation-specific. Furthermore, users have minimal control over the Handel-C synthesis engine, which is something of a "black box" and doesn't take advantage of the target technology (the engine takes no account of elements such as the multipliers and RAM blocks in an FPGA, for example). This forces some nonintuitive manipulation of the C code to meet speed and size requirements. In short, design teams may end up spending as much time creating adequate Handel-C as they would hand-coding the RTL, nullifying the advantage of a C-based design flow.

Higher synthesis abstraction
Instead of adding intelligence to the source code (thereby locking it into a target implementation), all of the intelligence should be provided by controlling the synthesis engine itself with user-defined constraints. New tools are available that use C++ source code augmented with SystemC data types, which allow specific bit widths to be associated with variables and constants. An advantage of this approach is that many companies already create an untimed C/C++ representation of their designs for algorithmic validation, because a pure C representation is easy and compact to write and simulates 100 to 10,000 times faster than an equivalent RTL representation. The only modification typically required by newer C-based design tools is a single pragma added to the source code to indicate the top of the functional portion of the design; anything conceptually above this point is considered part of the test bench. Once the tool has read the source code, designers can immediately perform microarchitecture tradeoffs and evaluate their effects on size and speed. Ideally, all of these evaluations can be done within a few seconds or minutes, depending on design size, with total size/area reported along with latency in terms of clock cycles or input-to-output delay (or, for pipelined designs, throughput cycles). A C synthesis tool should also be able to name, save, and reuse any of these "what-if" scenarios. Conventional, iterative, hand-coded RTL flows make it almost impossible to perform such tradeoffs in a timely manner. More importantly, because the C source code isn't required to contain any implementation "intelligence" (all such intelligence is supplied as constraints to the synthesis engine), design teams can easily retarget the same source code to alternative microarchitectures and different implementation technologies.
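The exact pragma spelling varies from tool to tool, so the one below is only a hypothetical placeholder, but the scale of the change is representative: one line marks the top of the hardware, and everything above it is treated as test bench. The kernel itself stays untimed; decisions such as loop unrolling, pipelining, and memory mapping are applied later as synthesis constraints rather than source edits.

    // Everything above this pragma (file readers, golden models, checkers)
    // is considered test bench; everything below is hardware to synthesize.
    #pragma hls_design top   // hypothetical pragma; each tool has its own syntax
    void mat_vec_mult(const short mat[16][16], const short vec[16], int out[16])
    {
        // Untimed functional intent: the unroll factor, pipeline initiation
        // interval, and RAM-versus-register mapping are chosen as tool
        // constraints, not encoded here.
        for (int row = 0; row < 16; ++row) {
            int acc = 0;
            for (int col = 0; col < 16; ++col)
                acc += mat[row][col] * vec[col];
            out[row] = acc;
        }
    }

From this single source, a constraint asking for a fully unrolled inner loop would yield a large, fast, parallel implementation, while a constraint leaving it rolled would yield a small serial one; the "what-if" comparison happens in the tool, not in the code.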
The fundamental difference between the various C-based design flows is the level of synthesis abstraction they support, as shown in Figure 4.

Physical synthesis for FPGAs
Synthesis routines and decisions driven by knowledge of the physical layout of the target device tend to achieve much better results than those that perform only logical synthesis. To reduce design iterations and improve accuracy, the design team must consider interconnect delay and physical effects up front. The following sections outline some of the ASIC-strength algorithms that are effective in optimizing complex FPGA designs.

Beyond logic synthesis and floorplanning
Cell delay (propagation delay) was the dominant delay factor in older FPGAs, and the traditional approach to logic synthesis was good enough to meet the timing requirements of FPGA designs: reducing the number of logic levels or reducing a function's area reduced cell delay, and timing was met. Unfortunately, this formula doesn't extend to newer FPGAs. As Moore's Law continues to hold and process technology shrinks accordingly, transistors get smaller and more of them fit on a single chip. The timing bottleneck in an FPGA design has shifted from cell delays alone to include interconnect delays; in the newest generations of FPGA designs, net delays regularly exceed 70% of the total delay.

Synthesis routines whose only goal is to produce the most efficient logic don't guarantee better performance after the design has been placed and routed, because traditional synthesis methods estimate route delay using wire-load models (WLMs). A WLM is a number typically calculated from a net's statistically estimated delay, using factors such as parasitics and fanout. Optimization decisions are based on identifying the critical (usually longest) logic path, but a wire-load estimate will in many cases identify a different critical path from the one that really exists after technology mapping, placement, and routing. This means a significant amount of performance is left on the table.

Floorplanning is a proven and successful approach in ASIC design, but it has limitations when applied to FPGAs. In ASICs, routing is not predetermined, so designers can do intelligent preplacement floorplanning to reduce wire lengths and minimize delay. In FPGAs, the prebuilt routing fabric imposes specific structural limitations. Fanout-based delay estimates in FPGAs don't model even a simplified version of this physical reality, so calling them timing "estimates" is optimistic.

As an example, consider a highly utilized FPGA with a large number of paths that miss timing, where those paths span several blocks of the design. The traditional way to solve such timing problems was pre-place-and-route floorplanning. Without any timing-analysis capability, this process is painfully iterative: since each place-and-route run takes a few hours, the quickest turnaround after a floorplanning change is most of a day, and the process can end up taking weeks or months to meet timing goals. Even then, the physical constraints created by the floorplanning process can't be transferred to later revisions of the same design.

Physical synthesis tools are commonly mistaken for floorplanners. Floorplanning is a process aimed exclusively at the efficient handling of large, multimillion-gate designs. While each tool has its own identity, you can achieve clear benefits by using their capabilities in tandem.
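To see why the statistical wire-load estimates described above leave performance on the table, consider a toy, fanout-only delay estimator; the constants are invented purely for illustration and don't describe any real device or library.

    // Toy wire-load-model-style estimate: net delay is guessed from fanout
    // alone, before placement is known. The constants are invented for
    // illustration; real wire-load models use statistical tables.
    double estimated_net_delay_ns(int fanout)
    {
        const double base_delay_ns = 0.20;   // assumed fixed routing overhead
        const double per_load_ns   = 0.15;   // assumed incremental delay per load
        return base_delay_ns + per_load_ns * fanout;
    }

After placement and routing, two nets with identical fanout can have very different delays depending on where their loads ended up, which is why the pre-layout "critical path" is often not the real one. Closing that gap is what floorplanning and physical synthesis, used together, aim to do.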
A user could be allowed to floorplan sections of a design and run physical synthesis on the individual modules until the desired performance is achieved. This "divide and conquer" approach, coupled with predictable performance, saves a lot of time compared with the traditional FPGA flow.

The four Rs
Many physical-synthesis techniques used for ASICs (resizing drivers and buffering signals, for example) don't work well for FPGAs. FPGA optimizations must take advantage of the device's internal structure. Physical-synthesis algorithms fall into four categories: retiming, replication, re-placement, and resynthesis. Let's look at each of them.

Register retiming
Even in a circuit with very critical timing, many paths easily meet their timing goals. Excess time available for data propagation, or slack, is unevenly distributed: some circuit paths have negative slack while others have positive slack. Register retiming finds situations where the slack on one side of a register is positive while the slack on the other side is negative, as shown in Figure 5. Under the right conditions it's possible to move the register, effectively moving some of the delay through it, without affecting the functionality of the design at the primary output ports, as illustrated in Figure 6. Ideally, the result is positive slack on both sides of the register.

Retiming can only occur if the transformation can be performed without modifying the function of the circuit at the boundary pins. An important consideration is maintaining the initial (reset) state of the register. Additionally, the design's latency must not change: the same number of register stages must exist before and after retiming. For retiming to work, all of these conditions must hold. Moving registers forward during retiming is better than moving them backward; backward retiming is more expensive since it nearly always means adding extra registers to the design, but some timing problems can only be attacked with backward retiming.

Pipeline-stage insertion is similar to register retiming, with some differences. Inserting pipeline stages, or registers, into a design is a common way to manually fix timing problems through RTL code changes. Here, the designer no longer needs to restructure the logic to make it suitable for pipelining: the register-retiming algorithm lets the tool user infer extra registers at the start or end of the path and automatically distributes them through the logic to maximize performance.

To distribute pipeline stages, the retiming algorithm must work differently. Figure 8 shows a typical example. The optimum circuit would result from inserting the two registers on the right at point A and point B. But once the first register is inserted at point B, an ordinary retiming move is impossible, because moving the register in either direction would temporarily make the worst-case slack worse. Pipeline retiming must move the register the maximum distance, even if slack temporarily worsens along the way. Physical synthesis automatically switches to pipeline-retiming rules when it finds serial registers in the design.

Register replication

Re-placement

Resynthesis
Path compression is a typical resynthesis technique. When applying this algorithm, you use nodes on the critical path as the basis for expanding a logic cone with a prescribed number of inputs. The goal is to isolate the critical path in a cone in which all the other inputs are noncritical.
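In C++ terms, the rewiring described next looks roughly like the sketch below; the signal names are invented, and the real transformation operates on LUT netlists rather than source code.

    // Before: the late-arriving (critical) signal enters the first LUT
    // and must ripple through all three levels of logic.
    bool cone_before(bool critical, bool a, bool b, bool c)
    {
        bool lut1 = (critical != a);      // XOR
        bool lut2 = lut1 && b;
        return lut2 || c;
    }

    // After path compression: both possible results are precomputed from
    // the noncritical inputs, and the critical signal selects between them
    // in the final LUT only.
    bool cone_after(bool critical, bool a, bool b, bool c)
    {
        bool if_crit_0 = (a && b) || c;    // result when critical == 0
        bool if_crit_1 = (!a && b) || c;   // result when critical == 1
        return critical ? if_crit_1 : if_crit_0;
    }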
Once this cone is found, the subcircuit is rewired (and the LUTs recoded) so that the critical signal is connected to the last LUT in the subcircuit. This process is then repeated for other critical nodes. The algorithm may not always improve timing, since it relies on finding related noncritical nodes, but some cases show dramatic improvement.

Another valuable resynthesis technique is block-RAM conversion. Block RAMs are dedicated resources in the FPGA that let designers efficiently create large memories. You can also create memories as distributed RAM by reconfiguring LUTs as small memory blocks. Block RAMs are much more area-efficient, but when critical paths interact with block RAMs, it's often impossible to place related cells close enough to meet performance goals. Block-RAM conversion remaps critical portions of a block RAM into distributed RAM to allow better placement.

We've simplified these optimization algorithms for this discussion; in practice they don't act in isolation but interact heavily. For example, register replication often allows retiming or path compression to occur on a circuit where it would not otherwise be possible.

From iterative to coherent
Taking full advantage of embedded processors and other SoC features in high-end FPGAs requires techniques that take the design from creation all the way to final integration on the system board. Adopting a broader design-flow process instead of the traditional point-tool approach lets professional designers extract higher capacity and faster performance from today's high-end FPGAs. In summary, the emergence of multimillion-gate, 1,000-pin FPGAs that incorporate embedded processors and innovative memory architectures creates the need for an advanced electronic system-level design approach to manage their increasing complexity.

Juergen Jaeger is director of marketing, Design Creation and Synthesis Division, Mentor Graphics. He has over 20 years of experience in hardware and software design and verification in the ASIC and FPGA space. He holds a BSEE degree from the Fachhochschule of Kaiserslautern, Germany, and a master's degree in computer science from the University of Hagen, Germany. Contact him at juergen_jaeger@mentor.com.

Shawn McCloud is high-level synthesis product manager at Mentor Graphics. He has been involved with advanced hardware design and EDA software development for 19 years and holds a BSEE degree from Case Western Reserve University. Contact him at shawn_mccloud@mentor.com.

Copyright 2005 © CMP Media LLC