The four Rs of efficient system design
New design languages and new chips and systems mean a whole new set of design gotchas for today's developers. Once-simple tasks become difficult and, thankfully, once-difficult tasks become easy. This article for senior designers looks at newer high-level design techniques and how they can improve logic and system design.

New FPGA chips are approaching ASIC-like density and performance, with their inherent cost advantages and reprogrammability clearly in their favor. With embedded DSP and CPU cores now available for use in FPGAs, these programmable logic devices are a real alternative for many embedded systems designers. Many designers from the ASIC world are turning to FPGAs for new designs. According to research firm Gartner Dataquest in its Market Trends report "ASIC and FPGA Suppliers Answer the Call," more than 74,000 "design starts" used FPGAs in 2004, but only around 4,000 used ASICs.

It's not easy to switch from an ASIC to an FPGA design flow, however. True, complex FPGA design shares some features with ASIC design, but under the hood, many of the steps are fundamentally different. The prebuilt nature of FPGAs encourages a "use it or lose it" mentality regarding features and capabilities. FPGA design, more often than ASIC design, must therefore match the functional requirements to the chip itself.

As high-end FPGAs encroach on ASIC performance, ASIC design techniques are being adapted for FPGA design. Two such techniques, physical synthesis for high-performance timing closure and C++ synthesis for C-based design, illustrate the subtleties involved. The algorithms for C++ synthesis are the same for ASICs and FPGAs. By leveraging "technology-aware" synthesis (through technology-specific libraries), the same design can be implemented in either or both types of silicon fabric. The fact that C++ specifications aren't tied to specific hardware is considered a primary advantage. To make full use of physical synthesis, however, you need a tool that understands the FPGA's internal hardware structure.

We'll first introduce today's conventional hardware-design flow and examine its associated problems. We'll then explain alternative approaches to hardware design using C/C++, comparing the pros and cons of timed design languages with those of untimed, or algorithmic, methods. Toward the end, we'll explain why you must consider interconnect delay and physical effects in the design process to achieve optimal performance.

Algorithmic C synthesis

A way around the bottlenecks of hand-coded RTL is to design, simulate, and synthesize C representations. By using pure untimed C++ to describe functional intent, engineers can move up an abstraction level for designing hardware, reducing design time, creating a more repeatable design flow, and preserving the option of implementing the design in either an ASIC or an FPGA. An added benefit is that by exploring multiple microarchitectural solutions, engineers can often produce better designs than those created through traditional RTL methods.

Traditional hardware design

MATLAB works well for validating and proving the initial algorithm, although many design teams also develop C/C++ models to verify that the whole system meets functional and performance specifications. For the subsequent discussion, we'll use the term untimed algorithm for algorithms written either in MATLAB or in pure ANSI C/C++. Based on project requirements, system architects then partition the design into blocks of hardware or software.
Each hardware block's function is represented by a floating-point algorithm. Either the system designer or the hardware designer then quantizes the floating-point algorithm into an integer or fixed-point representation. These fixed-point algorithms are represented in MATLAB using Simulink or in untimed C++ using bit-accurate types. After validating the fixed-point algorithm, the hardware designer starts the manual process of creating Verilog or VHDL at the RTL abstraction. We can subdivide this process into three distinct phases.

C flow—the next generation

The ideal flow should be based on industry-standard ANSI C/C++, which has been the language of choice for software and system-level modeling for many years. The pure, untimed C/C++ written by system designers is an excellent source for creating hardware because it's devoid of implementation details. After verification, the hardware engineer uses a C synthesis tool to automatically generate optimized RTL from the C/C++ representation. The RTL output is then used to drive existing RTL-synthesis tools, as shown in Figure 1. With this flow, you can synthesize the untimed C/C++ all the way down to a gate-level netlist. This maximizes flexibility and provides a source that is "malleable," that is, capable of targeting ASICs, FPGAs, highly compact (small) solutions, and highly parallel (fast) solutions.

The translation from MATLAB to C/C++ is still manual, but because these domains are conceptually very close, the translation is relatively quick and easy. Untimed C/C++ adds a lot of value by simulating much faster than the MATLAB Simulink environment, and is therefore ideally suited for system-level validation. Moreover, generating intermediate RTL provides a timed "comfort zone" for existing flows by allowing you to validate the implementation decisions made by the C synthesis tool. Furthermore, RTL is a useful point at which to stitch the various functional blocks together. Large portions of today's designs exist in the form of IP blocks delivered as RTL, so RTL is a useful point in the design flow for integrating and verifying the entire hardware system. Design teams can also take full advantage of existing RTL-design tools for test insertion or power analysis, for example. The ideal flow based on algorithmic synthesis of pure, untimed C/C++ addresses all of the traditional bottlenecks.

Designers use SystemC for system-level verification, but the complexity of the language creates barriers for system and hardware designers alike. Using SystemC at the register transfer level provides little (if any) value over VHDL or Verilog, as shown in Figure 2. The value comes in at higher levels of abstraction and is useful for system-level verification, but as yet there is no consensus on what should and should not be synthesizable in the SystemC language. Coding interface definitions in SystemC removes the ability to easily make interface tradeoffs, since changing an interface requires complex changes to the C++ source (for example, a dual-port memory interface is substantially different from a CPU interface). Editing SystemC models is not an effective way to explore architectural alternatives.

SystemC synthesis can be accomplished by using SystemC data types with pure, untimed C++. This "algorithmic SystemC" source is the highest abstraction of SystemC and provides the greatest value to the end user: it is technology independent, interface independent, and microarchitecture independent.
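To make that style concrete, here is a minimal sketch of our own (not an excerpt from any tool flow): the same functional intent written first as a floating-point untimed algorithm and then quantized with SystemC's bit-accurate fixed-point types. The smoothing filter, the bit widths, and the function names are all assumptions chosen purely for illustration.

// Minimal sketch: floating-point intent vs. a bit-accurate fixed-point version.
#define SC_INCLUDE_FX              // enables SystemC's sc_fixed types
#include <systemc.h>
#include <iostream>

// Floating-point "untimed algorithm," as a system designer might first write it.
double smooth_float(double sample, double state, double alpha)
{
    return state + alpha * (sample - state);
}

// Quantized version: 16 bits total, 4 integer bits, round and saturate.
typedef sc_dt::sc_fixed<16, 4, sc_dt::SC_RND, sc_dt::SC_SAT> fix_t;

fix_t smooth_fixed(fix_t sample, fix_t state, fix_t alpha)
{
    return state + alpha * (sample - state);   // same intent, now bit-accurate
}

int sc_main(int, char*[])
{
    double ref   = smooth_float(0.30, 0.10, 0.25);
    fix_t  quant = smooth_fixed(fix_t(0.30), fix_t(0.10), fix_t(0.25));
    std::cout << "float: " << ref << "   fixed: " << quant.to_double() << std::endl;
    return 0;
}

Nothing in the fixed-point function says anything about clocks, interfaces, or target technology; it simply states what is computed and at what precision.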
Adding the ability to generate a cycle-accurate SystemC model enables an algorithmic C synthesis tool to benefit from the SystemC verification environment, yet avoid the issues of hard-coding technology intent into SystemC descriptions.

Handel-C

Using a proprietary language means users cannot use alternative simulation or synthesis tools. As a result, many engineers prefer standards-based alternatives. In theory, the manual translation of MATLAB to Handel-C should be relatively painless because the Handel-C representation is close to pure C. In practice, coercing Handel-C to adequately capture the design in a form suitable for the synthesis engine requires intensive work by an expert user. Here again, the pseudo-timing constructs required for the synthesis and simulation of Handel-C representations are foreign to both system-level and hardware designers. All of the implementation "intelligence" associated with the design has to be hard-coded into the Handel-C, which therefore becomes implementation-specific. Furthermore, users have minimal control over the Handel-C synthesis engine, which is something of a "black box" and which doesn't take advantage of the target technology (for example, the engine takes no account of elements like multipliers and RAM blocks in an FPGA). This forces some nonintuitive manipulation of the C code to achieve speed and size requirements. In short, design teams may end up spending as much time creating adequate Handel-C as they would hand-coding the RTL, thereby nullifying the advantage of C-based design flows.

Higher synthesis abstraction

Instead of adding intelligence to the source code (thereby locking it into a target implementation), all of the intelligence should be provided by controlling the synthesis engine itself with user-defined constraints. New tools are available that use C++ source code augmented with SystemC data types, which allow specific bit widths to be associated with variables and constants. An advantage is that many companies already create an untimed C/C++ representation of their designs for algorithmic validation. They do this because a pure C representation is easy and compact to write and simulates 100 to 10,000 times faster than an equivalent RTL representation.

The only modification typically required by newer C-based design tools is to add a single pragma to the source code to indicate the top of the functional portion of the design—anything conceptually above this point is considered part of the test bench. Once the tool has read the source code, designers can immediately perform microarchitecture tradeoffs and evaluate their effects in terms of size and speed. Ideally, all of these evaluations should take only a few seconds or minutes, depending on design size. Total size/area should be reported along with latency in terms of clock cycles or input-to-output delays (or, in the case of pipelined designs, throughput time/cycles). Ideally, a C synthesis tool should also be able to name, save, and reuse any of these "what-if" scenarios. Conventional, iterative, hand-coded RTL flows make it almost impossible to perform these tradeoffs in a timely manner. More importantly, because the C source code isn't required to contain any implementation "intelligence"—all such intelligence is supplied by constraints to the synthesis engine itself—design teams can easily retarget the same source code to alternative microarchitectures and different implementation technologies.
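Here is a rough sketch of what such a source might look like. The pragma name design_top is hypothetical (each tool defines its own directive for marking the top of the hardware), and the filter itself is an arbitrary example we chose for illustration.

// Minimal sketch of pure, untimed C++ as a C-synthesis tool might read it.
// "design_top" is a hypothetical pragma name, not a real tool directive.
#include <cstdio>

const int TAPS = 4;
const int coeff[TAPS] = {1, 3, 3, 1};

#pragma design_top   // hypothetical marker: the hardware function starts here
int fir(const int sample[TAPS])
{
    int acc = 0;
    for (int i = 0; i < TAPS; ++i)      // whether this loop is unrolled or
        acc += coeff[i] * sample[i];    // pipelined is decided later by
    return acc;                         // synthesis constraints, not the source
}

// main() sits conceptually above the marked function, so it is treated as
// test bench rather than hardware.
int main()
{
    const int window[TAPS] = {2, 4, 6, 8};
    std::printf("fir = %d\n", fir(window));
    return 0;
}

Because the loop carries no unrolling or pipelining directives, the same file can be pushed toward a small serial implementation or a fast parallel one simply by changing the constraints given to the synthesis engine.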
The fundamental difference between the various C-based design flows is the level of synthesis abstraction they support, as shown in Figure 4.

Physical synthesis for FPGAs

Synthesis routines and decisions driven by knowledge of the physical layout of the target device tend to achieve much better results than those that perform only logical synthesis. To reduce design iterations and improve accuracy, the design team must consider interconnect delay and physical effects up front. The following sections outline some of the ASIC-strength algorithms that are effectively used in optimizing complex FPGA designs.

Beyond logic synthesis and floorplanning

Cell delay (propagation delay) was the dominant delay factor in older FPGAs, and the traditional approach to logic synthesis was good enough to meet the timing requirements of FPGA designs. Reducing the number of logic levels or reducing a function's area reduced cell delay, and so timing was met. Unfortunately, this formula doesn't extend to the new FPGAs. As Moore's Law continues to hold and process technology shrinks accordingly, transistors get smaller and more of them fit onto a single chip. The timing bottleneck in an FPGA design has shifted from cell delays alone to include interconnect delays. In fact, in the newer generations of FPGA designs, net delays regularly exceed 70% of the total delay.

Synthesis routines whose goal is to produce the most efficient logic don't guarantee better performance after the design has been placed and routed. This is because traditional synthesis methods estimate route delay using wire-load models (WLMs). A WLM is a number, typically calculated from a statistical estimate of a net's delay using factors such as parasitics and fanout. Optimization decisions are based on identifying the critical (usually longest) logic path. A wire-load estimate will in many cases identify a different critical path than the one that really exists after the design has been mapped, placed, and routed. This means a significant amount of performance is still left on the table.

Floorplanning is a proven and successful approach for ASIC design, but it has its limitations when used for FPGAs. In ASICs, routing is not predetermined, so designers can do intelligent preplacement floorplanning to reduce wire lengths and minimize delay. In FPGAs, the prebuilt routing fabric creates specific structural limitations. Fanout-based delay estimates in FPGAs don't model even a simplified version of this physical reality, so calling them timing "estimates" is optimistic.

For example, consider a highly utilized FPGA with a large number of paths that miss timing, where those paths involve several blocks of the design. The traditional way to solve these timing problems was pre-place-and-route floorplanning. Without any timing-analysis capability, this process proves painfully iterative. Since each place-and-route run takes a few hours, the quickest turnaround to see the results of a floorplanning change is most of a day. As such, this process can end up taking weeks or months to meet timing goals. Even then, the physical constraints created by the floorplanning process cannot be transferred to later revisions of the same design.

Physical synthesis tools are commonly mistaken for floorplanners. Floorplanning is a process aimed exclusively at efficient handling of large multimillion-gate designs. While each tool has a distinct identity, you can achieve clear benefits by using their capabilities in tandem.
A user could be allowed to floorplan sections of a design and run physical synthesis on the individual modules until the desired performance is achieved. This "divide and conquer" approach, coupled with the predictability of performance, saves a lot of time over the traditional FPGA flow.

The Four Rs

Many physical-synthesis techniques used for ASICs (resizing drivers and buffering signals, for example) don't work well for FPGAs. FPGA optimizations must take advantage of the device's internal structure. Physical-synthesis algorithms fall into four categories: retiming, replication, re-placement, and resynthesis. Let's look at each of them.

Register retiming

Even in a circuit with very critical timing, there are many paths that easily meet their timing goals. The excess time available for data propagation, or slack, is unevenly distributed, with some circuit paths having negative slack while others have positive slack. Register retiming finds situations where the slack on one side of a register is positive while the slack on the other side is negative, as shown in Figure 5. Under the right conditions, it's possible to move the register, effectively moving some of the delay through it, without affecting the functionality of the design at the primary output ports, as illustrated in Figure 6. Ideally, the result is positive slack on both sides of the register.

Retiming can occur only if it's possible to perform the transformation without modifying the function of the circuit at the boundary pins. An important consideration here is maintaining the initial state (reset state) of the register. Additionally, the design latency must not change—the same number of register stages must exist before and after retiming. For retiming to work, all of these conditions must hold.

Moving registers forward during retiming is better than moving them backward. Backward retiming is more expensive since it nearly always means adding extra registers to the design, but some timing problems can only be attacked with backward retiming.

Pipeline stage insertion is similar to register retiming, with some differences. Inserting pipeline stages or registers into a design is a common way to manually fix timing problems through RTL code changes. Here, the designer no longer needs to restructure the logic to make it suitable for pipelining; the register retiming algorithm lets the tool user infer extra registers at the start or end of the path and automatically distributes them throughout the logic to maximize performance. To distribute pipeline stages, the retiming algorithm must work differently. Figure 8 shows a typical example. The optimum circuit would result from placing the two registers on the right at Point A and Point B. But once the first register is inserted at Point B, a conventional retiming move is blocked, because moving the register in either direction would temporarily make the worst-case slack more negative. Pipeline retiming must move the register the maximum distance, even if slack temporarily worsens. Physical synthesis automatically switches to pipeline retiming rules when it finds serial registers in the design.

Register replication

Re-placement

Resynthesis

Path compression is a typical resynthesis technique. When applying this algorithm, you use nodes on the critical path as the basis for expanding a logic cone with a prescribed number of inputs. The goal is to isolate the critical path in a cone where all other inputs are noncritical.
Once this cone is found, the subcircuit is rewired (and the LUTs recoded) so that the critical signal connects to the last LUT in the subcircuit. This process is then repeated for other critical nodes. The algorithm may not always improve timing, since it relies on finding related noncritical nodes; some cases, however, show dramatic improvement.

Another valuable resynthesis technique is block RAM conversion. Block RAMs are dedicated resources in the FPGA that let designers efficiently create large memories. You can also create memories as distributed RAMs by reconfiguring LUTs as small memory blocks. Block RAMs are much more area efficient, but when critical paths interact with block RAMs, it's often impossible to place related cells close enough to meet performance goals. Block RAM conversion remaps critical portions of the block RAM into distributed RAM to allow better placement.

We've simplified the optimization algorithms for this discussion; in reality they don't act in isolation, but interact heavily. For example, register replication often allows retiming or path compression to occur on a circuit where this would not otherwise be possible.

From iterative to coherent

Taking advantage of the embedded processors and other SoC features in high-end FPGAs requires techniques that take the design from creation all the way to final integration on the system board. Adopting a broader design-flow process instead of the traditional point-tool approach enables professional designers to extract higher capacity and faster performance from today's high-end FPGAs. In summary, the emergence of multimillion-gate, 1,000-pin FPGAs that incorporate embedded processors and innovative memory architectures creates the need for an advanced electronic system-level design approach to manage their increasing complexity.

Juergen Jaeger is director of marketing, Design Creation and Synthesis Division, Mentor Graphics. He has over 20 years of experience in hardware and software design and verification in the ASIC and FPGA space. He has a BSEE degree from the Fachhochschule of Kaiserslautern, Germany, and a master's degree in computer science from the University of Hagen, Germany. Contact him at juergen_jaeger@mentor.com.

Shawn McCloud is high-level synthesis product manager at Mentor Graphics. He has been involved with advanced hardware design and EDA software development for 19 years and has a BSEE degree from Case Western Reserve University. Contact him at shawn_mccloud@mentor.com.

Copyright © 2005 CMP Media LLC
By Juergen Jaeger and Shawn McCloud, Courtesy of Embedded Systems Programming
Mar 1, 2005 (14:39)
URL: http://www.embedded.com/showArticle.jhtml?articleID=60404381
The conventional flow for high-end electronic designs involves handcrafting Verilog or VHDL representations. These manual methods were effective in the past but many of today's new designs are so complex that traditional design practices are now inadequate. Creating register transfer level (RTL) implementations for high-end FPGAs has become as time-consuming as ASIC design.
Many high-end designs in the communications or video/image processing industries rely on extremely complex algorithms. The first step in a conventional design flow involves modeling and proving the design functions at the algorithmic level of abstraction, using tools such as MATLAB or plain C/C++ modeling.
The hardware engineers manually translate the floating-point untimed algorithm into bit-accurate RTL, either Verilog or VHDL. This RTL is subsequently synthesized into a gate-level netlist using traditional RTL-synthesis tools. The main problems associated with this traditional flow are:
The most important challenge facing the design team is that all of the implementation "intelligence" associated with the design is hard-coded into the RTL, which therefore becomes rigid and implementation-specific.
As we've seen, the shortcomings of the typical RTL design flow (shown in Figure 1) are the inability to explore the design space and the time it takes to write, verify, and synthesize the RTL.
Figure 1: The ideal design flow depicted on the right is based on algorithmic synthesis of pure, untimed C/C++, which addresses the problems associated with the traditional flow (shown on the left) where the untimed algorithm is hand-translated into RTL
SystemC
The SystemC language provides a comprehensive verification environment that allows C++ designs to be simulated at mixed levels of abstraction. It uses C++ class libraries to model hardware structures such as modules, ports, interfaces, and concurrency.
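To give a feel for that class-library style, here is a minimal sketch of our own (not code from the article): a trivial clocked accumulator described as a SystemC module with ports and a process, plus the small test harness SystemC requires. Even this tiny block needs close to RTL-level detail, which is essentially the point Figure 2 makes.

// Minimal SystemC sketch: module, ports, clocked process, and test harness.
#include <systemc.h>

SC_MODULE(accumulator) {
    sc_in<bool>            clk;
    sc_in<bool>            rst;
    sc_in< sc_int<8> >     din;
    sc_out< sc_int<16> >   sum;

    void tick() {
        if (rst.read())
            sum.write(0);                          // synchronous reset
        else
            sum.write(sum.read() + din.read());    // accumulate each clock
    }

    SC_CTOR(accumulator) {
        SC_METHOD(tick);
        sensitive << clk.pos();                    // triggered on rising edge
    }
};

int sc_main(int, char*[]) {
    sc_clock                clk("clk", 10, SC_NS);
    sc_signal<bool>         rst;
    sc_signal< sc_int<8> >  din;
    sc_signal< sc_int<16> > sum;

    accumulator acc("acc");
    acc.clk(clk); acc.rst(rst); acc.din(din); acc.sum(sum);

    rst = true;  din = 0; sc_start(20, SC_NS);     // hold reset briefly
    rst = false; din = 5; sc_start(30, SC_NS);     // then accumulate samples
    std::cout << "sum = " << sum.read() << std::endl;
    return 0;
}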
Figure 2: To make a behavioral or RTL SystemC representation suitable for RTL generation or direct C synthesis, you would need to write it at nearly the same level of abstraction as hand-translated RTL
Handel-C is typical of the home-grown C-based simulation and synthesis languages developed by universities and EDA companies. It preserves traditional C syntax and control structures, making it easy for C programmers and hardware designers to understand. In addition to hardware-centric datatypes, Handel-C also includes special keywords/extensions that facilitate dataflow representations and support parallel programming. This flow involves manually translating the untimed algorithm into Handel-C. Following verification via simulation (which requires Celoxica's compiler), the Handel-C representation is directly synthesized into a gate-level netlist as shown in Figure 3.
Figure 3: You may end up taking as much time creating adequate Handel-C as you would hand-creating the RTL, thereby nullifying the advantage of C-based design flows
As we noted previously, the most significant problem with existing C-based design flows is that the implementation "intelligence" associated with the design has to be hard-coded into the C representation, which then becomes implementation-specific. Ideally, the C code should be virtually identical to what a system designer would write to model functional behavior without any preconceived hardware-implementation or target-device architecture in mind.
Figure 4: C-based synthesis design flows that support a higher level of synthesis abstraction accelerate implementation time and increase design flexibility when compared with other C-based flows
Achieving timing closure in the shortest number of design cycles is a huge FPGA design challenge. Timing closure solutions using standalone logical synthesis and place-and-route (P&R) are iterative and nondeterministic by nature. Many alternatives have been proposed. Physical synthesis is one technique that helps designers quickly close on timing compared with other methods such as floorplanning, random modifications to constraints, or repeated place-and-route iterations. Without physical synthesis, designers typically write and rewrite RTL code, provide guidance to the P&R tools by grouping cells, and possibly attempt some floorplanning. An alternative is simply to make numerous P&R runs. Usually, the RTL code/constraints are modified with only some heuristic notion that these changes will improve timing. Designers must iterate through P&R—the most time-consuming step in FPGA design—before learning whether the changes were a step in the right direction or only served to worsen the problem. This unpredictability reduces the cost benefits and time-to-market advantages of using programmable logic in the first place.
In the ASIC or FPGA world, the word "synthesis" instinctively means RTL logic synthesis. This is the approach of a traditional FPGA synthesis tool: first synthesize, then perform technology mapping.
Tying the RTL-synthesis tool to its physical-synthesis counterpart can produce a sizable jump in both performance and productivity. Unlike with a standalone physical-synthesis tool, the designer's visibility doesn't stop at the post-synthesis technology level. Instead, the designer is able to cross-probe all the way up to the RTL source code. For instance, designers challenged with larger, high-end FPGA designs today should be able to perform timing analysis on a complex design and extend this functionality by effectively cross-probing between the timing report and the physical view of the design. This way, a design bottleneck can be solved either at the RTL or the physical level (by recoding if necessary). This gives the user a significant amount of control when analyzing a design, be it at the RTL, constraint, technology, or physical level.
Register retiming is one of the strongest algorithms for improving timing—when it can be done. Reductions of up to 15% in overall critical path timing are not uncommon. Though retiming is also done during logic synthesis, the retiming performed during physical optimization is more effective because it uses more accurate timing information.
Figure 5: A simple circuit before retiming
Figure 6: The circuit in Figure 5 after retiming
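To make the slack arithmetic behind Figures 5 and 6 concrete, here is a tiny numerical sketch. All delay values are assumed for illustration only and are not taken from the figures; slack is simply the clock period minus the path delay.

// Toy slack arithmetic for the retiming idea (all numbers invented).
#include <cstdio>

int main()
{
    const double clock_ns = 5.0;

    // Before retiming: 8 ns of logic in front of the register, 2 ns behind it.
    double slack_front_before = clock_ns - 8.0;   // -3 ns: this path misses timing
    double slack_back_before  = clock_ns - 2.0;   // +3 ns: this path has time to spare

    // After retiming, about 3 ns of logic has moved through the register,
    // leaving roughly 5 ns of logic on each side.
    double slack_front_after  = clock_ns - 5.0;   // 0 ns: timing is just met
    double slack_back_after   = clock_ns - 5.0;   // 0 ns: timing is just met

    std::printf("before: %+.1f ns / %+.1f ns\n", slack_front_before, slack_back_before);
    std::printf("after : %+.1f ns / %+.1f ns\n", slack_front_after,  slack_back_after);
    return 0;
}

The register move doesn't make the total logic any faster; it simply redistributes the delay so that neither clock period is violated.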
While the control behavior must stay consistent for retiming to be allowed, the control signals themselves may change based on the function of the combinational logic. Figure 7 illustrates an example of retiming a register through inverting logic. Note that the tool must implement the original reset signal as set logic to maintain the same initial state at the outputs.
Figure 7: Register retiming changes the reset connection to preset to preserve initial states
Figure 8: Pipeline retiming must move the register the maximum distance, even if slack temporarily worsens
In the world of register-rich programmable logic it's common to hear designers say, "registers are free in FPGAs." For that reason register replication is a common technique used by all FPGA optimization tools. Logic-synthesis tools have typically used register replication to control signal fanout. But with the ability to control placement, register replication becomes a powerful timing optimization algorithm, too. Physical synthesis can replicate registers and then place them close to related logic clusters as shown in Figure 9.
Figure 9: By leveraging the ability to control placement in physical synthesis, register replication becomes a powerful timing optimization algorithm
Register re-placement, also known as placement optimization, is an important technique in physical synthesis. Cells along the critical paths can be clustered close together to maximize performance. Accurate timing calculations are critical for this operation, since any movement of cells can easily cause other paths to become critical. The physical-synthesis tool solves this in two ways. First, delays are annotated directly from the first P&R run, so timing is completely accurate as the process begins. Then, as cells are moved, new interconnect delays are estimated using both the new placement information and detailed knowledge of the routing resources available on the device. Using placement in wire-delay estimates is more accurate than the typical interconnect estimates used by other synthesis tools, where route tables assign the same delay to every net with the same fanout.
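As a toy illustration of that difference (not a description of how any particular tool computes delays; the route table, coefficients, and coordinates below are all invented), compare a fanout-only lookup with a placement-aware estimate:

// Toy contrast between a fanout-only route table and a placement-aware
// estimate based on the Manhattan distance between cell locations.
#include <cmath>
#include <cstdio>

struct Cell { double x, y; };   // placement coordinates in arbitrary grid units

// Fanout-based lookup: every net with the same fanout gets the same delay.
double fanout_delay_ns(int fanout)
{
    const double table[] = {0.5, 0.8, 1.1, 1.4, 1.7};   // invented route table
    return fanout < 5 ? table[fanout] : 2.0;
}

// Placement-aware estimate: delay grows with the wire length actually needed.
double placement_delay_ns(const Cell& driver, const Cell& load)
{
    double dist = std::fabs(driver.x - load.x) + std::fabs(driver.y - load.y);
    return 0.3 + 0.05 * dist;                            // invented coefficients
}

int main()
{
    Cell driver{2.0, 2.0}, near_load{3.0, 2.0}, far_load{40.0, 35.0};

    std::printf("fanout-1 table delay:  %.2f ns (same for both loads)\n",
                fanout_delay_ns(1));
    std::printf("placement-aware, near: %.2f ns\n",
                placement_delay_ns(driver, near_load));
    std::printf("placement-aware, far:  %.2f ns\n",
                placement_delay_ns(driver, far_load));
    return 0;
}

The fanout-only model reports the same delay for a neighbor and for a cell on the far side of the die, which is exactly the blindness that placement-aware estimation removes.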
Resynthesis restructures logic in the design to improve performance. This is the same kind of restructuring that occurs when performing timing optimization during logic synthesis. The difference is that since the tool has a much more accurate picture of the timing during physical synthesis, these transformations are even more effective.
EDA tool companies will likely extend and improve their tools at both high and low levels of abstraction. The next-generation challenge faced by mainstream EDA vendors is to leverage point tool expertise and thus meld two apparently contradictory trends—higher levels of abstraction on the one hand and greater dependence on specific physical characteristics of FPGAs on the other—into a coherent design process. Designers must take advantage of EDA tools that now address both chip- and system-level challenges of complex FPGAs, and thereby realize the potential of these devices as ASIC replacements in new SoC designs.