Clock domain modeling is essential in high density SoC design
By Jens Rennert, Senior Design Engineer, MobileSmarts Inc., Santa Clara, Calif., Dan Hafeman, Technical Consultant, Advanced Research, Mentor Graphics Corp., Wilsonville, Ore., EE Times
June 6, 2003 (4:10 p.m. EST)
URL: http://www.eetimes.com/story/OEG20030606S0036
Whether designing SoCs with traditional synchronous logic or with alternative locally clocked or self-clocked "asynchronous" blocks, verification has become more important, more difficult and more time consuming, particularly if there are multiple clock domains to consider. With the increasing number of functional blocks, and associated clock domains, factors such as proper system dimensioning and error-free signal path synchronization between clock domains are a growing concern for designers. In addition to using "best practices" techniques in their designs, it is also important in this much more complex SoC environment for designers to verify clock domain crossing and system dimensioning in system level verification.

When an active clock edge and a data transition occur very close together, a flip-flop or latch may go into an indeterminate condition known as a "metastable" state: a statistical event in which the output of the device is not defined for a small, unknown length of time. Luckily, metastability is well understood and can be characterized for each given process technology.

Unfortunately, synchronizers alone do not solve the functional problems that occur in multi-clock designs. Designers must identify where in the design synchronizers are needed and ensure the timing on these nets allows for proper settling. Applying synchronizers is further complicated by domain-crossing vectors, where all bits of the vector must cross the domain simultaneously. Even when each bit of a vector has a perfectly working synchronizer, it is possible for system failures to occur because not all changing bits are captured on the same edge of the receiving clock. In addition, frequency variation between domains often requires the insertion of buffers or small FIFOs to guarantee data integrity. Sizing and correct implementation of these FIFOs is critical to proper system behavior and performance.

No automated techniques can validate the correct placement of synchronizers, implementation of domain-crossing vectors, or buffering of data paths. That said, there are tools that will identify where signals cross domains and verification platforms that correctly model asynchronous behavior in multi-clock designs. The combination of these tools with diligent design and extensive system level verification can eliminate synchronization bugs in SoC designs.

Thorough system architecture exploration is fundamental to SoC design. Early in the design process, tradeoffs in the algorithm, partitioning and communication infrastructure must be evaluated. The result is a description that best fits the system requirements, as well as estimates of FIFO size, latency and bandwidth.

In shared architectures, multiple agents share a centralized FIFO and contain only small buffers to compensate for arbitration latency. While cost efficient, shared architectures work only if the system load is known and a fixed arbitration scheme can therefore be applied. They are thus not suitable for systems with random bandwidth peaks, large dynamic demands and many use cases.

In more distributed SoC architectures, each agent has its own full FIFO. This approach is usually less cost efficient than the shared scheme, and it tends to waste memory since the FIFOs have to be over-designed and memory cannot be allocated dynamically. Outweighing such drawbacks, this approach is very flexible, handles bandwidth peaks and large dynamic demands, and allows flexible arbitration.
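Returning to the metastability point above: the failure rate of a single synchronizing flip-flop is commonly estimated with a mean-time-between-failures formula of the form MTBF = e^(t/tau) / (T0 x f_clk x f_data), where t is the settling time allowed, tau and T0 are constants characterized for the process and cell, and f_clk and f_data are the sampling clock rate and the toggle rate of the asynchronous input. The short Python sketch below simply evaluates that formula; the constants plugged in are illustrative assumptions, not characterized values for any real process.

import math

def synchronizer_mtbf(t_settle_s, tau_s, t0_s, f_clk_hz, f_data_hz):
    """Estimate mean time between synchronization failures, in seconds.

    t_settle_s -- time allowed for the flip-flop to resolve (e.g. one clock
                  period minus the setup time of the next stage)
    tau_s      -- metastability resolution time constant (process dependent)
    t0_s       -- metastability window constant (process dependent)
    f_clk_hz   -- receiving clock frequency
    f_data_hz  -- average toggle rate of the asynchronous data input
    """
    return math.exp(t_settle_s / tau_s) / (t0_s * f_clk_hz * f_data_hz)

# Illustrative (assumed) numbers: 200 MHz receive clock, 10 MHz data toggle
# rate, tau = 50 ps, T0 = 100 ps, and a full clock period to settle.
if __name__ == "__main__":
    period = 1.0 / 200e6
    mtbf = synchronizer_mtbf(t_settle_s=period, tau_s=50e-12, t0_s=100e-12,
                             f_clk_hz=200e6, f_data_hz=10e6)
    print(f"Estimated MTBF: {mtbf:.3e} s ({mtbf / 3.15e7:.1e} years)")

Adding a second flip-flop in series roughly grants another full clock period of settling time before the signal is used, which is why the two-flop synchronizer is the usual default.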
If the SoC designer opts for the best of both worlds and uses a split bus architecture, it is important to avoid putting too many agents on one bus. It is also essential to keep high latency and slow responding agents off the low latency, high bandwidth memory bus. And if crossover between different busses is needed, implement bridges. A good general design rule in such designs is to group real-time and non-real-time agents, and to implement flexible arbitration to keep FIFO requirements for real-time agents small.

While up-front architectural consideration gives an estimate of system requirements, the possibility still exists of under-designing a system, with fatal consequences, due to unanticipated cross-boundary clocking issues. Likewise, a system may be over-designed, which wastes expensive silicon real estate. Design teams must understand how the implementation of their system behaves for all use cases to be sure of correct system dimensioning. Fortunately, there are a number of design techniques which can be used to validate the system implementation, including bus, FIFO and arbitration monitoring, as well as mindful reuse of tested elements to limit synchronization-related bugs.

Bus controllers should be equipped with monitors that keep a short transaction list of the last bus cycles exercised. These monitors are not only helpful for hardware verification purposes but also ease software debugging in the real silicon. Special bus conditions, like uncompleted transactions, time-outs or other protocol errors, should be recorded and should generate a system interrupt.

All FIFOs in the SoC should provide access to their respective read and write pointers. The FIFOs should furthermore contain logic that flags overflow and underflow conditions. Verification should monitor these pointers and events while running each use case to prove correct dimensioning for system latency and bandwidth. The FIFO pointer values can also be used to establish histograms of data traffic through the system to further optimize the FIFO depths.

In systems with pseudo-random bus access patterns, it is hard to predict the average bus loading and latency. CPUs can easily suffer performance loss due to stalls caused by bus latencies. Monitoring arbitration patterns and establishing arbitration histograms help fine tune the bus arbiters and may point out potential deficiencies in the arbitration scheme.

Standard interfaces
A useful technique is to require all IP modules to use a standardized, generic interface for MMIO and DMA traffic, where appropriate. An interface layer module then bridges the IP module's native protocol to the bus protocol used. All clock domain crossing logic should be contained in this bus adapter module. There are several advantages to this approach. First, adapting to a different bus protocol can be accomplished by simply replacing the bus adapter module with another one. Second, the bus adapter module only needs to be verified once and can then be applied to all IP modules system wide. Also, this approach reduces the chance of clock domain crossing errors and makes the IP more reusable for different projects.

Along with careful implementation, it is critical to verify all domain-crossing circuitry. There are no simple methods to statically verify that proper tradeoffs have been made or that synchronization errors don't exist in a design. These errors can result in reduced system performance, lost data and incorrect functionality, and they are typically not reliably reproduced in the field.
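To make the FIFO monitoring idea above more concrete, the following behavioral sketch wraps a simple FIFO model with overflow/underflow flags and an occupancy histogram of the kind that could be used to tune FIFO depths. The class, its interface and the traffic pattern are invented for this illustration and are not tied to any particular verification library.

from collections import deque, Counter

class MonitoredFifo:
    """Behavioral FIFO model that records occupancy statistics.

    Illustrative sketch only; in a real flow the same counters would live
    in (or alongside) the RTL and be sampled by the testbench.
    """
    def __init__(self, depth):
        self.depth = depth
        self.data = deque()
        self.overflow = False            # flagged on a push into a full FIFO
        self.underflow = False           # flagged on a pop from an empty FIFO
        self.occupancy_hist = Counter()  # occupancy level -> cycles observed

    def push(self, item):
        if len(self.data) >= self.depth:
            self.overflow = True         # data would be lost in hardware
            return False
        self.data.append(item)
        return True

    def pop(self):
        if not self.data:
            self.underflow = True        # stale or undefined data in hardware
            return None
        return self.data.popleft()

    def sample(self):
        """Call once per (scaled) clock cycle to build the histogram."""
        self.occupancy_hist[len(self.data)] += 1

# Example use case: a bursty producer against a steady consumer, to see how
# close the high-water mark comes to the chosen depth.
fifo = MonitoredFifo(depth=8)
for cycle in range(10_000):
    if cycle % 10 < 4:                   # producer writes a 4-beat burst every 10 cycles
        fifo.push(cycle)
    if cycle % 2 == 0 and fifo.data:     # consumer drains one entry every other cycle
        fifo.pop()
    fifo.sample()
print("overflow:", fifo.overflow, "underflow:", fifo.underflow)
print("high-water mark:", max(fifo.occupancy_hist), "of depth", fifo.depth)
print("occupancy histogram:", dict(sorted(fifo.occupancy_hist.items())))

Run against each use case, a high-water mark well below the chosen depth suggests an over-designed FIFO, while a set overflow flag points to under-dimensioning.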
Careful design review is a necessary step, but even this scrutiny can't guarantee that a state machine doesn't occasionally violate its gray code or that a FIFO doesn't overflow under certain conditions. A combination of verification tools that accurately handle asynchronous clock domains and a system level verification methodology is required to comprehensively verify synchronization and dimensioning.

As discussed, an important step is to identify all cross-domain signals. There are tools on the market that provide a complete analysis of the clocking system and create reports of inter-domain signals, perform clock tree analysis and identify domains driven from multiple clock sources. This information should be used in the design review process and checked regularly as part of regression testing.

There are several techniques for properly synchronizing signals that cross clock domains. For example, single signals that cross domains are easily synchronized by introducing a synchronization flip-flop; with no logic in the path after the first, potentially metastable flop, that flop has the entire clock period to settle before being sampled by the next stage. There are also ways to easily deal with skew conditions between the uncontrolled phasing of the source and destination clocks.

While unlikely, it is entirely possible that some of the bits in a vector will be registered at their new state while others remain at the old state. Take the example of a three-bit vector that changes from binary 011 to binary 100, where each bit of the bus is fed into an individual synchronizer when it crosses between domains. A failure can occur if only some of the bit changes are detected; possible resulting vectors are 111 or 101. Note that retaining the old value of 011 or completely capturing the new value of 100 is acceptable, but capturing only some of the bit changes is not.

Contents of vectors can, and should, be gray coded to avoid synchronization problems. In doing so, the designer may be able to guarantee that only one bit of the vector changes per clock cycle. This method can be applied whenever the change sequence of the vector is completely specified at design time.

Gray coding is particularly useful in FIFO designs where the pointers indicating the FIFO fill level have to be reported to the other domain. Since each FIFO address pointer only increments (or only decrements), it is possible to gray code the address in such a way that only one address bit changes on any FIFO access. Thus, it is safe to put an individual bit synchronizer on each bit of the address. Even when different clock domains drive the read and write sides of the FIFO, underflow or overflow conditions can always be avoided.

The same applies to state machines whose state vectors cross domains: state sequences should be selected such that only single-bit changes occur. As before, all sequences must be known at design time. If the designer forgets a particular transition, perhaps during initialization or error handling, the result may be an inadvertent multi-bit change. For example, in a simple state machine that sequences through four gray-coded states (00, 01, 11, 10), an overlooked error-recovery jump from state 11 back to 00 would flip both bits at once and could be captured incorrectly in the receiving domain.

Busses and vectors with unconstrained sequences must be handled in a different manner to ensure error-free domain crossing. Since it is impossible to code the data so that multi-bit changes do not occur, there are a number of situations in which control and status signals must be introduced to enhance the synchronization. For example, consider what happens when a status signal is coupled with receive latches.
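Before walking through that status-signal example, a minimal sketch makes the gray-coding property described above concrete: successive pointer values differ in exactly one bit, so each bit can safely be passed through its own synchronizer. The function names below are illustrative, not taken from any particular library.

def bin_to_gray(n: int) -> int:
    """Convert a binary count to its gray-code equivalent."""
    return n ^ (n >> 1)

def gray_to_bin(g: int) -> int:
    """Convert a gray-coded value back to binary."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Successive gray-coded pointer values differ in exactly one bit (including
# the wrap-around), so an individual two-flop synchronizer per bit is safe.
WIDTH = 4
for i in range(2 ** WIDTH):
    g_now = bin_to_gray(i)
    g_next = bin_to_gray((i + 1) % (2 ** WIDTH))
    assert bin(g_now ^ g_next).count("1") == 1
    assert gray_to_bin(g_now) == i
print("all", 2 ** WIDTH, "pointer transitions change exactly one bit")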
Since the synchronizer naturally introduces a one or two receive clock period delay, the data should be ready at the output of the data latch before "Data ready" becomes true. Since the data latch is clocked by the "Source Clock", it will not go into a metastable state. The synchronizer is reset when the latched data is read. This design assumes that the current data will always be read before new data is latched. Because the synchronizer imposes a two receive clock delay, and because at least one more cycle is required to actually capture the data on the receive side, three receive clock periods are required per data transfer, reducing the maximum bandwidth by a factor of three. While this is a simple method, the bandwidth constraint often makes it ineffective.

An alternative synchronization technique is a double-registered handshake. While very robust, the 4-phase handshaking process can take up to eight clock periods, four on the source side and four on the receive side, plus whatever overhead is introduced by the finite state machines.

Dual-port memories typically have the read and write sides of the memory operate in truly asynchronous clock domains. The memory's read and write enables must be generated in the conventional way, by comparing the read and write pointers in either clock domain. Here, gray coding can be applied, as described above, to assure safe domain crossing of the pointers and read/write signals. Adding a FIFO to the memory port has the advantage of not compromising bandwidth: data can be accessed on each port on every clock edge. Synchronizer overhead only becomes an issue when the FIFO becomes full, making it important that the FIFOs be appropriately sized.

Speed maintenance
In order to create accurate test scenarios, it is important that the clock speed ratios of the domains be maintained. Since verification systems do not run at real-time speeds, all the clocks must be scaled down by the same factor in order to maintain the proper ratios. This ensures that edge sequences and edge progression are modeled in a functionally correct way and that synchronization errors can be detected.

Randomization is an important test method in RTL verification. Synchronization problems are typically probability-driven events that are more likely to occur the more the system is subjected to stress. The combination of randomization and emulation is a powerful verification technique. For example, with emulation performance, full randomization of read/write access to the MMIO subsystem and memory subsystem becomes possible.

Directed stimulus can produce maximum effects in a minimum number of clock cycles. This can be useful for creating boundary conditions and corner cases. Combining randomization, directed testing and emulation performance with the monitors described previously, a measure of functional coverage is achieved.

In addition, emulation speeds enable all inter-domain paths to be thoroughly tested. With this level of throughput, empirical data can be gathered on SoC performance in general and, specific to this article, on all domain crossing data paths. FIFO utilization can be measured and bandwidth problems identified. With emulation, the probability of hitting synchronization errors is increased by a factor of 10,000 over that of simulation.

Potential synchronization problems in a design are best evaluated by running actual system behavior. Often, full-speed external hardware can be connected to the non-real-time emulation environment.
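The double-registered, 4-phase handshake mentioned above can be illustrated with a small behavioral model that counts how many clock periods a single transfer costs when both req and ack pass through two-flop synchronizers. For simplicity the sketch assumes both domains run at the same (scaled) rate and evaluates them on the same loop tick; the state names and structure are assumptions made for this illustration, not a description of any specific implementation.

class TwoFlopSync:
    """Two-flop synchronizer modeled as a two-stage delay line."""
    def __init__(self):
        self.flops = [0, 0]
    def tick(self, d):
        # q2 <= q1; q1 <= d; the output is q2 after the clock edge
        self.flops = [d, self.flops[0]]
        return self.flops[1]

def four_phase_transfer_cycles():
    """Count loop ticks for one complete 4-phase req/ack handshake."""
    req, ack = 0, 0
    req_sync = TwoFlopSync()   # synchronizes req into the receive domain
    ack_sync = TwoFlopSync()   # synchronizes ack back into the source domain
    src_state, dst_state = "ASSERT_REQ", "WAIT_REQ"
    cycles = 0
    while True:
        cycles += 1
        req_seen = req_sync.tick(req)   # receive domain samples req
        ack_seen = ack_sync.tick(ack)   # source domain samples ack

        # Source-side finite state machine
        if src_state == "ASSERT_REQ":
            req = 1
            if ack_seen:
                src_state = "DROP_REQ"
        elif src_state == "DROP_REQ":
            req = 0
            if not ack_seen:
                src_state = "DONE"

        # Receive-side finite state machine
        if dst_state == "WAIT_REQ":
            if req_seen:
                ack = 1                  # data would be captured here
                dst_state = "WAIT_REQ_DROP"
        elif dst_state == "WAIT_REQ_DROP":
            if not req_seen:
                ack = 0
                dst_state = "IDLE"

        if src_state == "DONE" and dst_state == "IDLE":
            return cycles

print("one 4-phase transfer took", four_phase_transfer_cycles(), "cycles")

In this toy model one complete transfer takes about ten cycles, consistent with the estimate above of roughly eight clock periods plus state machine overhead.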
The value of connecting, for example, real memory devices or a slowed-down PCI bus to an emulated design should not be underestimated. These devices produce and consume data across the domain crossings. Since the actual devices are being used, they represent the most accurate model of the real environment.

One good verification strategy is to simultaneously emulate all possible use cases intended for the device. The intention of this testing is to identify overloaded or over-designed data paths, providing an opportunity for cost and/or power reduction.

A proven verification strategy for an SoC design is first to prove that the SoC infrastructure is working properly. Infrastructure modules encompass all busses and their associated controllers and arbiters. In the case of designs with embedded CPUs, they also include the memory and interrupt controllers.

With a standardized bus interface, it is possible to replace the IP modules in the netlist with random transaction generators. These transaction generators randomly exercise the bus systems and memory controller by writing to and reading from all spots in the system aperture map. Bus monitors and transaction generators produce system events when erroneous transactions occur or when reads return wrong results for previously written data. These events can be used as trigger conditions to stop the verification process.

The memory subsystem can be verified in a similar way, by randomly reading and writing memory locations and exercising standard memory tests with the help of external interfaces, such as PCI, or programs running on embedded processors. Random regressions can be performed with varying bus and transactor clock speeds to stress the system even more.

Once the solidity of the infrastructure has been proven, IP modules can safely be deployed on the bus system, and the debug focus can shift to the functionality of the IP modules rather than the infrastructure. Key to this technique is performance and clock modeling accuracy.
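The random transaction generators described above reduce, at a behavioral level, to randomized writes and read-back checks against a scoreboard. The sketch below shows that skeleton; the bus model, address range and names are placeholders invented for this illustration rather than part of any real environment.

import random

class ScoreboardedGenerator:
    """Random write/read-back transactor with a software scoreboard.

    'bus' is assumed to expose write(addr, data) and read(addr) calls; in a
    real environment this would be a driver for the emulated bus.
    """
    def __init__(self, bus, addr_lo, addr_hi):
        self.bus = bus
        self.addr_range = (addr_lo, addr_hi)
        self.scoreboard = {}             # expected memory contents

    def step(self):
        do_read = self.scoreboard and random.random() < 0.5
        if do_read:
            # Read back a previously written location and compare.
            addr = random.choice(list(self.scoreboard))
            got = self.bus.read(addr)
            if got != self.scoreboard[addr]:
                raise AssertionError(
                    f"read mismatch @0x{addr:08x}: got 0x{got:08x}, "
                    f"expected 0x{self.scoreboard[addr]:08x}")
        else:
            addr = random.randrange(*self.addr_range)
            data = random.getrandbits(32)
            self.bus.write(addr, data)
            self.scoreboard[addr] = data

# Trivial in-memory "bus" standing in for the real system under emulation.
class FakeBus:
    def __init__(self):
        self.mem = {}
    def write(self, addr, data):
        self.mem[addr] = data
    def read(self, addr):
        return self.mem.get(addr, 0)

gen = ScoreboardedGenerator(FakeBus(), 0x0000_0000, 0x0000_1000)
for _ in range(10_000):
    gen.step()
print("random write/read-back regression passed")

In an emulation flow, the FakeBus stand-in would be replaced by a driver for the actual bus interface, and a detected mismatch would serve as the trigger condition that stops the run.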