A primer on processor-based emulation
EE Times: Latest News
Ray Turner (10/21/2004 4:24 PM EDT)
URL: http://www.eetimes.com/showArticle.jhtml?articleID=51000078
FPGA-based emulation is more widely understood by engineers because engineers are used to designing with FPGAs. Much less well understood are processor-based emulators, and examples of misinformation abound. This article attempts to remove the mystery by explaining how processor-based emulation works and how design constructs such as tri-state busses, complex memories, and asynchronous clocking are mapped into it.
Early days of processor-based emulation
In the early 1990s IBM pioneered processor-based emulation technology, an offshoot of earlier work the company had done on hardware-based simulation engines. The hardware technology consisted of a massive array of Boolean processors able to share data with one another, running at very high speed. The software technology consisted of partitioning a design among the many processors and scheduling individual Boolean operations in the correct time sequence and in an optimal way. Initially, performance could not match FPGA-based emulators, but compile times of less than an hour, and the elimination of the timing problems that plagued FPGA-based emulators, made the new technology appealing for many use models, especially simulation acceleration.
Later generations of this technology eventually surpassed FPGA systems in emulation speed while retaining the huge advantage in compilation times, and without a farm of a hundred PCs for compilation. Advances in software technology extended the application of processor-based emulators to handle asynchronous designs with any number of clocks. Other extensions supported 100% visibility of all signals in the design, visibility of all signals at any time from the beginning of the emulation run, and dynamic setting of logic analyzer trigger events without recompilation.
At the same time that the emulation speed of FPGA-based systems has been decreasing, new generations of processor-based systems have not only increased emulation speed at a prodigious rate, but have also proven scalable in capacity to hundreds of millions of gates.
Processor-based emulator architecture
To understand how a processor-based emulator works, it is useful to briefly review how a logic simulator works. Recall that a computer's ALU (Arithmetic-Logic Unit) can perform basic Boolean operations on variables, such as AND, OR, NOT, and that a language construct such as "always @ (posedge Clock) Q = D" forms the basis of a flip-flop. In the case of gates (and transparent latches), simulation order is important. Signals race through a gate chain schematically "left-to-right" so to speak, or "top to bottom" in RTL source code. Flip-flops (registers) break up the gate chain for ordering purposes.
Figure 1 Logic simulation; CPU does Boolean math on signals, registers
One type of simulator, a levelized compiled logic simulator, evaluates the Boolean equations one at a time in the correct order. (Time delays are not relevant for functional logic simulation.) If two ALUs were available, you could imagine breaking the design up into two independent logic chains and assigning each chain to an ALU, thus parallelizing the process and reducing the time required, perhaps to one half. A processor-based emulator has from tens of thousands to hundreds of thousands of ALUs, which are efficiently scheduled to perform all the Boolean equations in the design in the correct sequence. The following series of drawings illustrates this process. For this example we assume a 4-input Boolean primitive in the emulator.
Figure 2 Step 1: Reduce Boolean logic to four-input functions
The set of Boolean equations after reducing the logic to four-input functions is:
S = C & B & E & F
M = NOT (G + H + J + S)
P = NOT (N + NOT (M & K & L) )
IF (Clock is rising) R = P
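The four equations can be sketched as one levelized pass in Python. This is only an illustration of evaluating the equations in dependency order, not the emulator's actual implementation. The function name `evaluate_cycle`, the dictionary of input signals, and the reading of `+` as OR and `&` as AND are assumptions made for this example.

```python
def evaluate_cycle(inputs, clock_prev, clock_now, r_prev):
    """Evaluate the four-input equations once, in dependency order.

    inputs     -- dict of 0/1 values for the design inputs B, C, E..L, N
    clock_prev -- clock value before this pass (0 or 1)
    clock_now  -- clock value during this pass (0 or 1)
    r_prev     -- value held by register R before this pass
    """
    # S must be computed first because M reads it.
    s = inputs["C"] & inputs["B"] & inputs["E"] & inputs["F"]
    # M must be computed before P for the same reason.
    m = 1 - (inputs["G"] | inputs["H"] | inputs["J"] | s)
    # P is a single function of the four signals N, M, K and L.
    p = 1 - (inputs["N"] | (1 - (m & inputs["K"] & inputs["L"])))
    # The register captures P only on a rising clock edge.
    rising = clock_prev == 0 and clock_now == 1
    r = p if rising else r_prev
    return {"S": s, "M": m, "P": p, "R": r}


example_inputs = {"B": 1, "C": 0, "E": 1, "F": 1,
                  "G": 0, "H": 0, "J": 0, "K": 1, "L": 1, "N": 0}
result = evaluate_cycle(example_inputs, clock_prev=0, clock_now=1, r_prev=0)
```

Each assignment reads only design inputs or values computed on an earlier line, which is the ordering property a levelized simulator relies on.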
One possible scheduling is shown in Figure 3. This is done for illustrative purposes and may not be the most efficient scheduling.
Figure 3 Result of scheduling logic
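To make the scheduling idea concrete, here is a small Python sketch that assigns each four-input operation to a time step and a processor. It is a simple greedy list scheduler and is not claimed to be the algorithm a commercial compiler uses. The names `OPERATIONS` and `schedule`, the rule that a result becomes readable one step after it is computed, and the treatment of the register capture `R` as one more operation are all assumptions for this example.

```python
from collections import defaultdict

# Each operation lists the signals it reads. Names that are not keys of this
# dictionary (C, B, E, ...) are design inputs, available before step 1.
OPERATIONS = {
    "S": ["C", "B", "E", "F"],
    "M": ["G", "H", "J", "S"],
    "P": ["N", "M", "K", "L"],
    "R": ["P"],  # register capture, modeled here as an operation (assumption)
}


def schedule(operations, num_processors):
    """Return {operation: (time_step, processor_index)}."""
    placed = {}
    used = defaultdict(int)  # time step -> processors already busy
    pending = dict(operations)
    while pending:
        # An operation is ready once none of its sources is still unplaced.
        ready = [name for name, sources in pending.items()
                 if all(src not in pending for src in sources)]
        if not ready:
            raise ValueError("combinational loop: no operation is ready")
        for name in sorted(ready):
            # Earliest legal step is one after the latest producing step.
            earliest = 1 + max((placed[src][0] for src in operations[name]
                                if src in placed), default=0)
            step = earliest
            while used[step] >= num_processors:
                step += 1  # all processors busy in this step, try the next
            placed[name] = (step, used[step])
            used[step] += 1
            del pending[name]
    return placed


plan = schedule(OPERATIONS, num_processors=2)
total_steps = max(step for step, _ in plan.values())
```

Because S, M, P and R form a single dependency chain, this example needs four steps regardless of how many processors are available. Designs with independent logic chains are where additional processors shorten the schedule.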
Figure 4 Processor-based emulator architecture
An emulation cycle consists of running all the processor steps for a complete modeling of the design. Large designs typically schedule in from 125 to 320 steps. If the design can use 1x clocking (described later), emulation speed for a processor-based emulator would then be between 600 kHz and 1.5 MHz. During each time step, each processor is capable of performing any 4-input logic function using as inputs the results of any prior calculation of any of the processors and any design inputs or memory contents. Processors are physically implemented in clusters with rapid communication within clusters. The compiler optimizes processor scheduling to maximize speed and capacity.
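The relationship between step count and emulation speed can be checked with a little arithmetic. The sketch below assumes that emulation speed is simply the processor step rate divided by the number of steps per emulation cycle. The step rate itself is not stated in the article and is inferred here from the quoted figures. The function names and the `cycles_per_clock_period` parameter are hypothetical.

```python
def implied_step_rate_hz(steps_per_cycle, emulation_cycle_hz):
    """Step rate needed to run `steps_per_cycle` steps at the given cycle rate."""
    return steps_per_cycle * emulation_cycle_hz


def design_clock_hz(step_rate_hz, steps_per_cycle, cycles_per_clock_period=1):
    """Fastest design clock frequency for a given schedule length.

    cycles_per_clock_period is 1 for 1x clocking and 2 for 2x clocking.
    """
    return step_rate_hz / (steps_per_cycle * cycles_per_clock_period)


# Both ends of the quoted range imply a step rate of roughly 190 MHz.
short_schedule = implied_step_rate_hz(125, 1_500_000)
long_schedule = implied_step_rate_hz(320, 600_000)

# With an assumed 190 MHz step rate, the quoted speeds are reproduced closely.
fast_design = design_clock_hz(190_000_000, 125)
slow_design = design_clock_hz(190_000_000, 320)
```

Under these assumptions, 125 steps at 1.5 MHz and 320 steps at 600 kHz both correspond to a step rate near 190 MHz, which suggests the quoted speed range follows directly from the range of schedule lengths.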
Figure 5 Processor array architecture example
Design compilation
2) Synthesize memories.
3) Flatten the hierarchy of the design.
4) Reduce Boolean logic (gates) into 4-input functions.
5) Break asynchronous loops in the design by inserting a register at an optimal place.
6) Assign external connections for the target system and any hard IP.
7) Set-up any instrumentation logic required (such as logic analyzer "visioning").
8) Assign all design inputs and outputs to processors in a uniform way.
9) Assign each cell in the design to a processor. Priority is given to assigning cells with common inputs and/or outputs to the same processor or cluster and to assigning an equal number of cells to each processor.
10) Schedule each processor's activity into sequential time steps. The goal is to minimize the maximum number of time steps.
But the compiler doesn't have to cope with the highly variable FPGA internal timing in FPGA emulators. For this reason processor-based emulation compiles much faster and with fewer resources. The compiler maintains all originally designated RT-level net names for use in debugging in spite of the Boolean optimization it performs. This allows users to debug with the signal names with which they are familiar.
Tri-state bus modeling
Asynchronous loop breaking
However, by allowing the user to specify where loop breaks should occur, performance may be enhanced, since the performance of a processor-based emulator is related to the length of long combinatorial paths. By inserting delay elements to break false paths or multi-clock-cycle paths, performance can be improved if these paths are the critical paths of the design.
Long combinatorial paths
Occasionally a design may have a very long combinatorial path in it which cannot be scheduled into the available number of sequential steps. Note that this does not necessarily mean that the logic path has more "gate-levels," as scheduling must take many time sequence constraints into consideration.
In such a case, the scheduler will complete the path by scheduling the remaining Boolean operations in a second pass using the unused processor time steps. This can also be caused by trying to squeeze too many gates into an emulator, but, as a benefit, it provides a user trade-off of emulation speed vs. capacity.
Clock handling in processor-based emulators
As described earlier, clocking was one of the prime sources of unreliability in FPGA-based emulators. Processor-based emulators completely avoid this problem since they can generate all the clocks necessary for a design or accept externally generated clocks. It is much more convenient to have the emulator generate all the design clocks needed, and emulation runs faster that way. You can either specify the frequency of each clock, or let the compiler extract this information from the testbench (if there is one). Some processor-based emulators allow you to supply an external clock from the target system to the emulator. In such a case, the emulator will synchronize its internal clock to the external clock.
To provide the maximum possible emulation speed while retaining the asynchronous accuracy required, some processor-based emulators provide two methods of handling asynchronous design clocks: aligned edge and independent edge.
Figure 6 Clocking example: three asynchronous clocks
Independent edge clocking
With independent edge clocking, a processor-based emulator schedules an emulation cycle for every edge of every clock unless an edge is naturally coincident with an already scheduled edge. This is very similar to an event-driven simulator.
Figure 7 Independent edge clocking of two asynchronous clocks
Starting with two clocks, the 133 MHz clock transitions from low to high at times (in ns) of 0, 7.5, 15.0, etc. and from high to low at times 3.75, 11.25, etc. The 100 MHz clock transitions from low to high at times 0, 10, 20, etc. and from high to low at times 5, 15, 25, etc. Note that the two are arbitrarily synchronized at time = 0.
Emulation cycle #1: 133 MHz clock rises, 100 MHz clock rises (both occur at same time).
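Given the edge times quoted for the two clocks, an independent edge schedule can be derived mechanically. The Python sketch below is one way to do so, assuming each clock starts with a rising edge at time 0 and that edges at the same instant share one emulation cycle. Times are in picoseconds to keep the arithmetic exact, and the function name `independent_edge_schedule` is hypothetical.

```python
from collections import defaultdict


def independent_edge_schedule(half_periods_ps, end_ps):
    """Return a time-ordered list of (time_ps, [edge descriptions]).

    half_periods_ps -- dict mapping a clock name to its half period in ps
    end_ps          -- edges at or after this time are not scheduled
    """
    events = defaultdict(list)
    for name, half_period in half_periods_ps.items():
        time_ps = 0
        rising = True  # assumption: every clock rises at time 0
        while time_ps < end_ps:
            events[time_ps].append(f"{name} {'rises' if rising else 'falls'}")
            time_ps += half_period
            rising = not rising
    # One emulation cycle per distinct edge time; coincident edges are merged.
    return [(time_ps, events[time_ps]) for time_ps in sorted(events)]


# 7.5 ns and 10 ns periods, as in the text, over the first 30 ns.
clocks = {"133 MHz": 3750, "100 MHz": 5000}
cycles = independent_edge_schedule(clocks, end_ps=30_000)
for number, (time_ps, edges) in enumerate(cycles, start=1):
    print(f"Emulation cycle #{number} at {time_ps / 1000} ns: {', '.join(edges)}")
```

In the first 30 ns the two clocks produce 14 edges in total. Two pairs coincide (at 0 ns and at 15 ns), so this sketch schedules 12 emulation cycles.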
Figure 8 Adding a third asynchronous clock with independent edge clocking
So the schedule would be as follows:
Emulation cycle #1: 133 MHz clock rises, 100 MHz and 80 MHz clocks rise (occur at same time).
Aligned edge clocking
In aligned edge clocking, we first schedule emulation cycles for each edge of the fastest clock in the design.
Figure 9 Aligned edge clocking starts by assigning an emulation cycle for each edge of the fastest clock
Then, all other clocks are scheduled relative to this clock, with the slower clock edges "aligned" to the next scheduled emulation cycle. Note that no additional emulation cycles have been added for the second (slower) clock. Thus emulation speed is maintained. Also note that while edges are moved to the following edge of the fastest clock, frequency relationships are maintained, which could be essential for proper circuit operation.
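The alignment rule can be sketched in a few lines of Python. This example assumes that a slower-clock edge which already coincides with an edge of the fastest clock stays where it is, and that any other edge moves to the next fastest-clock edge. The function name `aligned_edge_schedule` and the use of picosecond half periods (3750 ps and 5000 ps, matching the 133 MHz and 100 MHz clocks used earlier) are choices made for this illustration.

```python
def aligned_edge_schedule(fast_half_period_ps, slow_half_period_ps, end_ps):
    """Return (cycle_times, aligned_edges) for one fast and one slow clock.

    cycle_times   -- emulation cycle times, one per edge of the fastest clock
    aligned_edges -- list of (original_time, aligned_time) for the slow clock
    """
    cycle_times = list(range(0, end_ps, fast_half_period_ps))
    aligned_edges = []
    for edge_ps in range(0, end_ps, slow_half_period_ps):
        # Ceiling division: index of the first fast edge at or after edge_ps.
        index = -(-edge_ps // fast_half_period_ps)
        aligned_ps = index * fast_half_period_ps
        if aligned_ps < end_ps:  # ignore edges pushed past the window
            aligned_edges.append((edge_ps, aligned_ps))
    return cycle_times, aligned_edges


cycle_times, aligned_edges = aligned_edge_schedule(3750, 5000, end_ps=30_000)
```

For this 30 ns window the schedule has eight emulation cycles, all defined by the fastest clock. The six edges of the slower clock all land on existing cycles, so the cycle count does not grow and the slower clock still toggles six times in the window.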
Figure 10 Add second clock aligning edges to fastest clock
Figure 11 Add additional clocks by aligning edges to fastest clock
Oversampling and undersampling in aligned edge clocking
When there is an emulation cycle at both edges of the fastest design clock, it is called "2x" clocking (two emulation cycles per fastest clock period).
1x clocking
If one edge is not dominant, this technique can still be applied but the increase in capacity required will be significantly larger. Note that only the circuitry using the fastest clock, and any clock greater than half the speed of the fastest clock, is relevant in this increase in required capacity.
4x clocking oversampling
Sometimes called "oversampling", 4x clocking delivers two emulation cycles (or more) for each active clock edge. This can also be helpful when there are complex asynchronous feedback loops that are broken. Latch-based designs may require oversampling because the compiler must insert a delay between the two ranks of the latch. If there are complex feedback paths, this delay may not always be put in the optimal place.
.5x clocking
Giving the user the flexibility to switch between aligned edge clocking or independent edge clocking provides both high asynchronous accuracy and the fastest possible emulation speed for a wide variety of design styles.
Memory modeling in a processor-based emulator
The compiler will recognize most memories written in synthesizable Verilog and VHDL RTL code and handle them automatically including such things as:
If unusual memories must be built by hand, the user will write a "wrapper" around the emulator's primitive memory cells to provide the required response.
In-Circuit emulation interface
Timing control on output
Since processor-based emulators schedule logic operations to occur in sequence, it is easy to add a constraint on the timing within the emulation cycle on individual (or groups of) output signals to control the timing to a very high resolution relative to other output signals. The compiler then schedules this output calculation at the appropriate point in the emulation cycle. This is not possible with FPGA-based emulators since they have no control over timing within a design clock.
Timing control on input
Summary
Hardware accelerators and emulators provide much higher verification performance than logic simulators, but require some additional effort to deploy. In-circuit emulation provides the highest performance, often 10,000 to 100,000 times faster than a simulator, but requires an emulation environment be built around it with speed buffering devices. Accelerators and emulators require the user to be aware of the differences between simulation and silicon (emulators and chips):
Both types of emulators have demonstrated that they are equally capable of handling a large number of asynchronous clocks in a design without suffering a performance impact. The ability of processor-based emulators to instantly probe new signals and change trigger conditions without requiring a slow FPGA compile greatly improves the interactivity of debugging. Since users spend most of their time debugging, processor-based emulators are able to deliver more design turns per day than FPGA-based emulators. This results in shorter time-to-market for new products and higher product quality.
Ray Turner is the senior product line manager for Cadence's Incisive Palladium accelerator and in-circuit emulation systems, part of the Incisive Functional Verification Platform. Before joining Cadence, he was the EDA marketing manager for P-CAD products for 7 years. Overall, Ray has 18 years of experience in product management for EDA products. He also has 14 years of experience in hardware, software, and IC design in the telecommunications, aerospace, ATE, and microprocessor industries.
All material on this site Copyright © 2005 CMP Media LLC. All rights reserved.