FPGA-based emulation is widely understood by engineers because engineers are accustomed to designing with FPGAs. Processor-based emulators are much less well understood, and misinformation about them abounds. This article attempts to remove the mystery by explaining how processor-based emulation works and how design constructs such as tri-state busses, complex memories, and asynchronous clocking are mapped into it.

Early days of processor-based emulation

In the early 1990s, IBM pioneered processor-based emulation technology as an offshoot of its earlier work on hardware-based simulation engines. The hardware consisted of a massive array of Boolean processors, running at very high speed, that could share data with one another. The software consisted of partitioning a design among the many processors and scheduling the individual Boolean operations in the correct time sequence and in an optimal way. Initially, performance could not match FPGA-based emulators, but compile times of less than an hour, along with the elimination of the timing problems that plagued FPGA-based emulators, made the new technology appealing for many use models, especially simulation acceleration. Later generations of this technology surpassed FPGA systems in emulation speed while retaining the huge advantage in compilation time, and without requiring a farm of a hundred PCs for compilation. Advances in software technology extended processor-based emulators to handle asynchronous designs with any number of clocks. Other extensions provided 100% visibility of all signals in the design, at any time from the beginning of the emulation run, and dynamic setting of logic analyzer trigger events without recompilation.
While the emulation speed of FPGA-based systems has been decreasing, new generations of processor-based systems have not only increased emulation speed at a prodigious rate but have also proven scalable in capacity to hundreds of millions of gates.

Processor-based emulator architecture

To understand how a processor-based emulator works, it is useful to briefly review how a logic simulator works. Recall that a computer's ALU (Arithmetic-Logic Unit) can perform basic Boolean operations on variables, such as AND, OR, and NOT, and that a language construct such as "always @ (posedge Clock) Q = D" forms the basis of a flip-flop. For gates (and transparent latches), simulation order is important: signals race through a gate chain "left-to-right" schematically, so to speak, or "top-to-bottom" in RTL source code. Flip-flops (registers) break up the gate chain for ordering purposes.

Figure 1 — Logic simulation; CPU does Boolean math on signals, registers

One type of simulator, a levelized compiled logic simulator, performs the Boolean equations one at a time in the correct order. (Time delays are not relevant for functional logic simulation.) If two ALUs were available, one could imagine breaking the design into two independent logic chains and assigning each chain to an ALU, parallelizing the process and perhaps halving the time required. A processor-based emulator has from tens of thousands to hundreds of thousands of ALUs, which are efficiently scheduled to perform all the Boolean equations in the design in the correct sequence. The following series of drawings illustrates this process. For this example, we assume a 4-input Boolean primitive in the emulator.
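The levelized evaluation just described can be sketched in a few lines of Python. This is a toy illustration, not the emulator's actual compiler: the netlist format and helper names are invented for the example.

```python
# Minimal sketch of a levelized compiled logic simulator.
# Each gate is (output, function, [inputs]); gates are sorted so that every
# gate's inputs are computed before the gate itself is evaluated, mirroring
# the "left-to-right" ordering described above.
from graphlib import TopologicalSorter

def levelize(gates):
    """Return gates in dependency order (inputs before outputs)."""
    producers = {out: (out, fn, ins) for out, fn, ins in gates}
    ts = TopologicalSorter()
    for out, _, ins in gates:
        ts.add(out, *(i for i in ins if i in producers))
    return [producers[name] for name in ts.static_order() if name in producers]

def simulate(gates, inputs):
    """One evaluation pass: Boolean math on signals, like a CPU's ALU."""
    signals = dict(inputs)
    for out, fn, ins in levelize(gates):
        signals[out] = fn(*(signals[i] for i in ins))
    return signals

# Example netlist: X = A AND B, Y = NOT X (order in the list is deliberately
# wrong; levelization fixes it).
netlist = [
    ("Y", lambda x: not x, ["X"]),
    ("X", lambda a, b: a and b, ["A", "B"]),
]
result = simulate(netlist, {"A": True, "B": True})
# result["X"] is True, result["Y"] is False
```

Splitting `levelize`'s ordered list between two workers is exactly the two-ALU parallelization described above; the emulator does the same across tens of thousands of processors.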
Figure 2 — Step 1: Reduce Boolean logic to four-input functions

After reducing the logic to four-input functions, the set of Boolean equations is:

IF (Clock is rising) C = A
S = C & B & E & F
M = NOT (G + H + J + S)
P = NOT (N + NOT (M & K & L))
IF (Clock is rising) R = P

Additionally, the following sequencing constraints apply:

- The flip-flops must be evaluated first
- S must be calculated before M
- M must be calculated before P
- Primary inputs B, E, and F must be sampled before S is calculated
- Primary inputs G, H, and J must be sampled before M is calculated
- Primary inputs K, L, and N must be sampled before P is calculated
- Note: primary input A can be sampled at any time after the flip-flops are evaluated
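As a toy illustration of how such constraints drive scheduling, the following Python sketch assigns the equations above to two processors, one time step at a time. The operation names and two-processor limit are invented for the example; the real compiler is far more sophisticated.

```python
# Each operation names its intra-cycle data dependencies. The flip-flop
# updates (C = A, R = P) use values from the *previous* emulation cycle, so
# they have no dependencies here and evaluate first, as the constraints
# require. Assumes the dependency graph is acyclic (registers break loops).
deps = {
    "FF_C": [],            # IF (Clock is rising) C = A
    "FF_R": [],            # IF (Clock is rising) R = P
    "S":    ["FF_C"],      # S = C & B & E & F
    "M":    ["S"],         # M = NOT (G + H + J + S)
    "P":    ["M"],         # P = NOT (N + NOT (M & K & L))
}

def schedule(deps, num_processors=2):
    done_at, steps = {}, []
    pending = dict(deps)
    while pending:
        ready = [op for op, d in pending.items() if all(x in done_at for x in d)]
        step = ready[:num_processors]          # fill this time step
        for op in step:
            done_at[op] = len(steps)
            del pending[op]
        steps.append(step)
    return steps

for t, ops in enumerate(schedule(deps)):
    print(f"step {t}: {ops}")
```

The flip-flops land in step 0; S, M, and P follow in dependency order, one per step, since each depends on the previous result.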
One possible scheduling is shown in Figure 3. It is for illustrative purposes and may not be the most efficient scheduling.

Figure 3 — Result of scheduling logic

Figure 4 — Processor-based emulator architecture

An emulation cycle consists of running all the processor steps for a complete modeling of the design. Large designs typically schedule into 125 to 320 steps. If the design can use 1x clocking (described later), emulation speed for a processor-based emulator would then be between 600 kHz and 1.5 MHz. During each time step, each processor can perform any 4-input logic function, using as inputs the results of any prior calculation by any of the processors, along with any design inputs or memory contents. Processors are physically implemented in clusters with rapid communication within clusters. The compiler optimizes processor scheduling to maximize speed and capacity.

Figure 5 — Processor array architecture example

Design compilation

Compilation of an RTL design is completely automated and proceeds in the following sequence:

1) Map RTL code into primitive cells such as gates and registers.
2) Synthesize memories.
3) Flatten the hierarchy of the design.
4) Reduce Boolean logic (gates) into 4-input functions.
5) Break asynchronous loops in the design by inserting a register at an optimal place.
6) Assign external connections for the target system and any hard IP.
7) Set up any instrumentation logic required (such as logic analyzer "visioning").
8) Assign all design inputs and outputs to processors in a uniform way.
9) Assign each cell in the design to a processor. Priority is given to assigning cells with common inputs and/or outputs to the same processor or cluster, and to assigning an equal number of cells to each processor.
10) Schedule each processor's activity into sequential time steps. The goal is to minimize the maximum number of time steps.
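The speed figures quoted above follow from simple arithmetic once a processor step rate is assumed. The 190 MHz step clock below is a hypothetical value chosen only because it reproduces the article's numbers; the actual hardware rate is not stated here.

```python
# Back-of-the-envelope check: emulation speed = step clock / steps per cycle.
STEP_CLOCK_HZ = 190e6  # assumed, not from the article

def emulation_speed_hz(steps_per_cycle, cycles_per_design_clock=1):
    # With 1x clocking, one emulation cycle models one design clock period.
    return STEP_CLOCK_HZ / (steps_per_cycle * cycles_per_design_clock)

print(f"{emulation_speed_hz(320) / 1e3:.0f} kHz")   # deepest schedule, ~594 kHz
print(f"{emulation_speed_hz(125) / 1e6:.2f} MHz")   # shallowest schedule, ~1.52 MHz
```

Note how the trade-off described later falls out of the same formula: 2x clocking doubles `cycles_per_design_clock` and halves speed, while deeper schedules (more gates per processor) lower speed but raise capacity.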
The compiler must also take into account simulation acceleration connections, tri-state bus modeling, memory modeling, non-uniform processor connectivity, logic analyzer probing and triggering, and other factors. But it does not have to cope with the highly variable internal timing of the FPGAs in FPGA-based emulators; for this reason, processor-based emulation compiles much faster and with fewer resources. Despite the Boolean optimization it performs, the compiler maintains all of the originally designated RT-level net names for use in debugging, so users can debug with the signal names they are familiar with.

Tri-state bus modeling

Tri-state busses are modeled with combinatorial logic. For the case when none of the enables is on, the user can choose "pull-up," "pull-down," or "retain-state" behavior. In the retain-state case, a latch is inserted into the design to hold the state of the bus when no drivers are enabled. If multiple enables are on, logic 0 "wins" for pull-up and retain-state busses, and logic 1 wins for pull-down busses. (Note: this is a good place to use assertions.)

Asynchronous loop breaking

Since emulators do not model gate-level silicon timing, asynchronous loops are broken automatically during compilation by inserting a delay flip-flop, without user intervention. However, allowing the user to specify where loop breaks should occur can enhance performance, since the performance of a processor-based emulator is related to the length of its long combinatorial paths. Inserting delay elements to break false paths or multi-clock-cycle paths improves performance when those paths are the critical paths of the design.

Long combinatorial paths

The emulator's processors operate from an instruction "stack" of a specific depth, such as 160 words. These are the time steps into which the design calculations are sequenced.
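The tri-state resolution rules above can be captured in a few lines of combinational logic. This is an illustrative model, not emulator code; the driver representation and mode names are invented for the example.

```python
# Sketch of the combinational tri-state bus model described above.
# Each driver is a (enable, value) pair; `mode` selects the behavior when no
# driver is enabled, and also fixes which value "wins" on contention.
def resolve_bus(drivers, mode, retained=0):
    """mode: 'pull_up', 'pull_down', or 'retain_state'."""
    active = [value for enable, value in drivers if enable]
    if not active:
        # No driver enabled: float to the pull value, or hold the latched state.
        return {"pull_up": 1, "pull_down": 0, "retain_state": retained}[mode]
    if mode == "pull_down":
        return max(active)   # on contention, logic 1 wins
    return min(active)       # pull-up / retain-state: logic 0 wins

# No driver enabled on a pull-up bus: the bus reads 1.
assert resolve_bus([(0, 1), (0, 0)], "pull_up") == 1
# Two drivers fighting (0 vs. 1) on a pull-up bus: 0 wins.
assert resolve_bus([(1, 0), (1, 1)], "pull_up") == 0
```

In a retain-state bus, the `retained` argument stands in for the inserted latch: each cycle's resolved value would be fed back as the next cycle's `retained` input.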
Occasionally a design may have a very long combinatorial path that cannot be scheduled into the available number of sequential steps. Note that this does not necessarily mean the logic path has more "gate levels," since scheduling must take many time-sequence constraints into consideration. In such a case, the scheduler completes the path by scheduling the remaining Boolean operations in a second pass, using the unused processor time steps. The same situation can arise from trying to squeeze too many gates into an emulator; as a benefit, this provides a user trade-off of emulation speed vs. capacity.

Clock handling in processor-based emulators

As described earlier, clocking was one of the prime sources of unreliability in FPGA-based emulators. Processor-based emulators avoid this problem entirely, since they can generate all the clocks a design needs or accept externally generated clocks. Having the emulator generate all the design clocks is much more convenient, and it runs faster. You can either specify the frequency of each clock or let the compiler extract this information from the testbench (if there is one). Some processor-based emulators also allow you to supply an external clock from the target system; in that case the emulator synchronizes its internal clock to the external clock. To provide the maximum possible emulation speed while retaining the required asynchronous accuracy, some processor-based emulators provide two methods of handling asynchronous design clocks: aligned edge and independent edge.

Figure 6 — Clocking example — three asynchronous clocks

Independent edge clocking

Since emulators do not model design timing, but rather are functional equivalents, the exact timing between asynchronous clock edges is not relevant. It is only necessary that clock edges that are not simultaneous in "real life" be emulated independently.
With independent edge clocking, a processor-based emulator schedules an emulation cycle for every edge of every clock, unless an edge is naturally coincident with an already scheduled edge. This is very similar to an event-driven simulator.

Figure 7 — Independent edge clocking of two asynchronous clocks

Starting with two clocks, the 133 MHz clock transitions from low to high at times (in ns) of 0, 7.5, 15.0, etc., and from high to low at times 3.75, 11.25, etc. The 100 MHz clock transitions from low to high at times 0, 10, 20, etc., and from high to low at times 5, 15, 25, etc. Note that the two are arbitrarily synchronized at time = 0.

Emulation cycle #1: 133 MHz clock rises, 100 MHz clock rises (both occur at the same time).
Emulation cycle #2: 133 MHz falls (time = 3.75). 100 MHz does nothing.
Emulation cycle #3: 100 MHz falls (time = 5.00). 133 MHz does nothing.
Emulation cycle #4: 133 MHz rises (time = 7.5). 100 MHz does nothing.
Emulation cycle #5: 100 MHz rises (time = 10). 133 MHz does nothing.
Emulation cycle #6: 133 MHz falls (time = 11.25). 100 MHz does nothing.
Emulation cycle #7: 133 MHz rises, 100 MHz falls (time = 15; both transitions).

Figure 8 — Adding a third asynchronous clock with independent edge clocking

With the 80 MHz clock added, the schedule is as follows:

Emulation cycle #1: 133 MHz clock rises, 100 MHz and 80 MHz clocks rise (all occur at the same time).
Emulation cycle #2: 133 MHz falls (time = 3.75). 80 MHz, 100 MHz do nothing.
Emulation cycle #3: 100 MHz falls (time = 5.00). 80 MHz, 133 MHz do nothing.
Emulation cycle #4: 80 MHz falls (time = 6.25). 100 MHz, 133 MHz do nothing.
Emulation cycle #5: 133 MHz rises (time = 7.5). 80 MHz, 100 MHz do nothing.
Emulation cycle #6: 100 MHz rises (time = 10). 80 MHz, 133 MHz do nothing.
Emulation cycle #7: 133 MHz falls (time = 11.25). 80 MHz, 100 MHz do nothing.
Emulation cycle #8: 80 MHz rises (time = 12.5). 100 MHz, 133 MHz do nothing.
Emulation cycle #9: 133 MHz rises, 100 MHz falls (time = 15; both transitions).
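The merged schedule above can be reproduced by generating every edge of every clock and coalescing edges that fall at the same instant into one emulation cycle. A minimal sketch (clock periods in ns; function names are illustrative):

```python
# Sketch of independent edge scheduling: every edge of every clock gets its
# own emulation cycle unless it coincides with an already scheduled edge.
from collections import defaultdict

def edges(period_ns, horizon_ns):
    """Yield (time, polarity) for one clock: rise at 0, fall at period/2, ..."""
    t, rising = 0.0, True
    while t < horizon_ns:
        yield round(t, 4), "rises" if rising else "falls"
        t, rising = t + period_ns / 2, not rising

def independent_edge_schedule(clocks, horizon_ns):
    at = defaultdict(list)
    for name, period in clocks.items():
        for t, pol in edges(period, horizon_ns):
            at[t].append(f"{name} {pol}")          # coincident edges merge here
    return [(t, at[t]) for t in sorted(at)]

# The two-clock example above: 133 MHz (7.5 ns period) and 100 MHz (10 ns).
for t, events in independent_edge_schedule({"133MHz": 7.5, "100MHz": 10.0}, 15.1):
    print(f"t={t:5} ns: {', '.join(events)}")
```

Running this prints seven emulation cycles, matching cycles #1 through #7 above, with the coincident edges at t = 0 and t = 15 each merged into a single cycle.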
Aligned edge clocking

Aligned edge clocking is based on the fact that, although many clocks in a design happen to have non-coincident edges because of their frequencies, proper circuit operation does not depend on the edges being independent. In this case, while proper frequency relationships are maintained, clock edges are aligned to the highest-frequency clock, reducing the number of emulation cycles required and increasing emulation speed. In aligned edge clocking, we first schedule emulation cycles for each edge of the fastest clock in the design.

Figure 9 — Aligned edge clocking starts by assigning an emulation cycle for each edge of the fastest clock

Then all other clocks are scheduled relative to this clock, with the slower clocks' edges "aligned" to the next scheduled emulation cycle. Note that no additional emulation cycles are added for the second (slower) clock, so emulation speed is maintained. Also note that while edges are moved to the following fastest-clock edge, frequency relationships are maintained, which can be essential for proper circuit operation.

Figure 10 — Add second clock aligning edges to fastest clock

Figure 11 — Add additional clocks by aligning edges to fastest clock

Oversampling and undersampling in aligned edge clocking

When there is an emulation cycle at both edges of the fastest design clock, this is called "2x" clocking (two emulation cycles per fastest clock period).

1x clocking

Sometimes only one edge of the fastest design clock is active in a design, or one edge is dominant and the other edge clocks a minimal amount of circuitry. In such a case, called "1x clocking," a single emulation cycle for each design clock cycle doubles emulation speed, or nearly so. There may be a small increase in the capacity required. If one edge is not dominant, this technique can still be applied, but the increase in capacity required will be significantly larger.
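The alignment step described above, deferring each slower-clock edge to the next scheduled fast-clock cycle, can be sketched as follows. The function and its arguments are illustrative, not an emulator API.

```python
# Sketch of aligned edge clocking: emulation cycles exist only at edges of
# the fastest clock; each slower-clock edge is deferred ("aligned") to the
# next scheduled cycle, so no extra emulation cycles are added.
import bisect

def align(fast_period, slow_period, horizon):
    """Return the fast-clock edge times the slow clock's edges align to (ns)."""
    half = fast_period / 2
    fast_edges = [i * half for i in range(int(horizon / half) + 1)]
    aligned, t = [], 0.0
    while t <= horizon:
        # Defer this slow-clock edge to the next fast-clock edge at or after t.
        i = bisect.bisect_left(fast_edges, t)
        aligned.append(fast_edges[i])
        t += slow_period / 2
    return aligned

# 133 MHz fast clock (7.5 ns period), 100 MHz slow clock (10 ns period):
print(align(7.5, 10.0, 15.0))
# the 100 MHz edges at 0, 5, 10, 15 ns land on cycles at 0, 7.5, 11.25, 15 ns
```

Compare this with the independent edge schedule for the same pair of clocks: seven emulation cycles shrink to the four fast-clock edges, which is exactly where the speed gain comes from.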
Note that only circuitry using the fastest clock, or any clock faster than half the speed of the fastest clock, is relevant to this increase in required capacity.

4x clocking (oversampling)

Sometimes called "oversampling," 4x clocking delivers two emulation cycles (or more) for each active clock edge. This can provide quick bring-up of designs that contain complex asynchronous paths: designs with read-modify-write memories or back-to-back latches can be brought into emulation more quickly by using 4x clocking initially. It can also be helpful when complex asynchronous feedback loops are broken. Latch-based designs may require oversampling because the compiler must insert a delay between the two ranks of a latch, and if there are complex feedback paths, this delay may not always be placed optimally.

.5x clocking

This technique, available on some processor-based emulators, is mainly used when only a small amount of logic runs at the fastest design clock. In .5x clocking, two of the fastest design clock cycles are emulated in a single emulation cycle. This can further increase emulation speed over 1x clocking, but with certain restrictions: only one edge of the fastest design clock may be active, there cannot be complex gating of this clock, and there cannot be too much design logic running at this speed, since that would significantly increase the capacity required for emulating the design.

Giving the user the flexibility to switch between aligned edge and independent edge clocking provides both high asynchronous accuracy and the fastest possible emulation speed for a wide variety of design styles.

Memory modeling in a processor-based emulator

The compiler recognizes most memories written in synthesizable Verilog and VHDL RTL code and handles them automatically, including such things as:

- Multiple Read and Write ports
- Port sharing to minimize area/delay
- Varying Read/Write dependencies
- Different write enables (edge versus level sensitivity)
- Synchronous and asynchronous styles
Two examples of memories automatically modeled from the user's Verilog code follow.

Figure 12 — Example: Synchronous memory with edge-sensitive clock is automatically generated from user's Verilog code

Figure 13 — Example: Asynchronous memory with level-sensitive write-enable is automatically generated from user's Verilog code

If unusual memories must be built by hand, the user writes a "wrapper" around the emulator's primitive memory cells to provide the required response.

In-circuit emulation interface

Timing control on output

When interfacing to the real world, it is sometimes necessary to control the relative timing of output signals. A DRAM memory interface is one such example: all the address lines must be stable before Write-Enable is asserted. Since processor-based emulators schedule logic operations to occur in sequence, it is easy to add a constraint on the timing of individual (or groups of) output signals within the emulation cycle, controlling their timing relative to other output signals with very high resolution. The compiler then schedules the output calculation at the appropriate point in the emulation cycle. This is not possible with FPGA-based emulators, which have no control over timing within a design clock.

Figure 14 — Processor-based emulators can adjust output timing with high precision

Timing control on input

In a similar way, and for similar reasons, the timing at which input signals are sampled may be controllable by the user to meet specific situations. Again, processor-based emulators can simply schedule specific input pins to be sampled before or after others within the emulation cycle. With FPGA-based emulators, the user must "tweak" timing by adding delays to certain signals, including a large guard band, because FPGA-based emulators cannot control absolute timing along different logic paths. For FPGA-based emulators this is a hit-or-miss proposition that may vary from compile to compile.
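The intra-cycle output ordering described above amounts to a topological ordering of the output calculations. A minimal sketch of the DRAM example, with invented signal and constraint names (assumes the constraints are satisfiable):

```python
# Sketch of intra-cycle output scheduling: output calculations are ordered so
# that the address lines are driven in earlier time steps than Write-Enable.
def order_outputs(outputs, before):
    """Order outputs so that each (a, b) pair in `before` puts a before b."""
    ordered = []
    pending = list(outputs)
    while pending:
        for sig in pending:
            # Schedule a signal once everything required before it is placed.
            if all(a in ordered for a, b in before if b == sig):
                ordered.append(sig)
                pending.remove(sig)
                break
    return ordered

outs = ["write_enable", "addr0", "addr1"]
constraints = [("addr0", "write_enable"), ("addr1", "write_enable")]
print(order_outputs(outs, constraints))
# address lines come first; write_enable is scheduled last in the cycle
```

Input sampling works the same way in reverse: "sample pin A before pin B" becomes one more ordering constraint fed to the same scheduler.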
Summary

Hardware accelerators and emulators provide much higher verification performance than logic simulators, but require some additional effort to deploy. In-circuit emulation provides the highest performance, often 10,000 to 100,000 times faster than a simulator, but requires an emulation environment to be built around it with speed-buffering devices. Accelerators and emulators require the user to be aware of the differences between simulation and silicon (emulators and chips):

- Simulation models many signal states (such as unknown and high-impedance states in addition to 0 and 1); silicon has only two.
- Simulation generally executes RTL statements sequentially, silicon "executes" RTL concurrently.
- Simulation is highly interactive, silicon less so.
FPGA-based emulators use commercial FPGAs, are smaller, and consume less power, while processor-based emulators require custom silicon designs and consume more power. On the other hand, processor-based emulators compile designs ten times faster on one-tenth the number of workstations (minutes vs. hours), and their emulation speed is faster on nearly all designs, an average of 2x faster. Both types of emulators have demonstrated that they can handle a large number of asynchronous clocks in a design without suffering a performance impact. The ability of processor-based emulators to instantly probe new signals and change trigger conditions, without requiring a slow FPGA compile, greatly improves the interactivity of debugging. Since users spend most of their time debugging, processor-based emulators deliver more design turns per day than FPGA-based emulators. This results in shorter time-to-market for new products and higher product quality.

Ray Turner is the senior product line manager for Cadence's Incisive Palladium accelerator and in-circuit emulation systems, part of the Incisive Functional Verification Platform. Before joining Cadence, he was the EDA marketing manager for P-CAD products for 7 years. Overall, Ray has 18 years of experience in product management for EDA products. He also has 14 years of experience in hardware, software, and IC design in the telecommunications, aerospace, ATE, and microprocessor industries.