Multiprocessor, or MP, verification revolves around memory accesses. To this end three main issues face the developer of a random test generation tool for multiple processors — what types of memory sharing to support, how to generate the test for each processor so that it abides by the sharing constraints, and how to control test generation so that an acceptable level of sharing is achieved. Memory sharing modes Memory utilization in multiprocessor systems falls into four categories: 1) non-sharing, 2) false sharing, 3) true sharing with semaphores (also called deterministic true sharing) and 4) non-deterministic true sharing. Each of these is discussed below. In many cases it is useful for a random test generator to support all sharing modes. Each mode is useful at a different stage in processor verification. Verification of complex tests and cryptic processor conflicts can then be delayed until after basic functionality is established using the simpler sharing models. 1) Non-sharing is the simplest memory sharing mode. Although multiple processors exist in the system, memory is divided in such a way that no two processors can access the same memory block. Blocks are defined so cache lines are isolated, preventing interactions between caches of the various processors, eliminating cache coherency issues from the verification problem space. Figure 1 — Non-shared memory 2) False sharing tests allow multiple processors to access memory in the same memory block or cache line, but not at the exact same address. This allows some checking of cache coherency without the possibility of data corruption or the need for semaphores or handling of non-deterministic behavior. These tests can be run quickly and debugged easily. Figure 2 — False sharing of memory 3) Deterministic true sharing is supported through the use of semaphores. Memory accesses by multiple processors to the same address are allowed, but are constrained so that processor state is deterministic at each semaphore boundary. The nature of the access, either a read or write, is key to this mechanism. This level of sharing allows testing of most MP issues including cache coherency. Cross-modifying code can be supported at this level of sharing. Figure 3 — Deterministic true sharing Figure 4 — Use of semaphores in deterministic true sharing Figure 5 — Shared memory accesses in deterministic true sharing 4) Non-deterministic true-sharing allows processors to read and write to identical memory locations potentially in the same cycle. This means that values read from memory and propagated into registers cannot be determined due to the stochastic behavior of real systems. Thus, results may vary with each execution. This level of verification is more difficult, but exercises the processors fully. Non-deterministic behavior must be limited or controlled in some fashion; otherwise the test becomes meaningless. Figure 6 — Non-deterministic true sharing Test generation Two approaches can be taken to test generation for any sharing mode — round-robin or sequential generation. In round-robin test generation, each processor generates a single instruction, then passes generation control to the next processor. After the instruction is generated for the last processor, the first processor will generate its next instruction and so on. At each instruction boundary it is possible for the generating processor to have full knowledge of all instructions generated by all processors up to that instruction number in the test. In sequential generation, the entire instruction sequence for each processor is generated in turn. Each processor has knowledge of the fully generated test for all previously generated processors. For example, in a four-CPU test, CPU1 has no knowledge of any other processors' accesses whereas CPU3 has knowledge of CPU1 and CPU2 but not of CPU4. These two approaches are approximately equivalent. In the round-robin mode, the higher level of communication during generation results in increased overhead for all processors for all instructions, and becomes increasingly more severe as the number of instructions increases. In sequential generation, CPU1 generates as easily as in a single CPU test. Each additional CPU encounters more restrictions on memory usage, and this results in a subsequent slowdown. Division of memory In MP test generation, the memory usage of each processor must be controlled in a way that meets the sharing requirements of the test. There are two possible approaches to this. In on-the-fly memory division all or most memory is assumed to be shared. In regional memory division, memory is divided into regions that are allocated to one or more processors. 1) On-the-fly memory division: On-the-fly memory division addresses two issues — the need for large blocks of data to support some data structures, and the need to mark critical data as non-sharable. In randomly generated tests there are data blocks, such as system tables, that exist as large, contiguous memory blocks. Examples include page tables, interrupt tables or task state segments. The more memory accesses per processor and the more processors, the fewer large, continuous blocks of unaccessed memory remain. This results in significant slow-down of processors generated late in the sequence as the generator searches for required memory. It may also limit the types of instructions and data structures available for later processors during sequential generation, or for instructions late in the test for round-robin generation. Data in page tables, interrupt tables, task state segments and other system tables cannot be allowed to become non-deterministic, or the behavior of the test will become widely divergent, depending on the exact sequence of accesses by the different processors to the critical memory region. Because of this unpredictable behavior, there must be a method of marking memory blocks as reserved for use by a given processor. Marking critical memory blocks as non-shared is not difficult. However, in this paradigm, each memory write by any processor needs to check sharing status for all accessed memory. 2) Regional memory division: In regional memory division, memory is allocated in a way that allows for the appropriate mode of sharing. Memory is divided into shared and non-shared regions. When a processor accesses memory in a non-shared region, the generator does not have to run sharing checks on the access. Some regions may be marked for sharing among a subset of the processors in a given test. For example, read/write access may be granted to processor 1, read access to processor 2, and no access to processors 3 and 4. This type of memory division can also aid in increasing the number of shared accesses. The test generator allows the user to configure the memory regions, the types of supported shared accesses, and the frequency of accesses to the region. These parameters can vary over the different processors in the test. Generation and shared memory accesses 1) Generation of non-sharing tests: For on-the-fly memory division generators, each memory access needs to check memory at adjacent addresses, depending on cache line size, to determine if another processor has made an access to this cache line. If no such access has been made, the generator can use the address. If another processor has accessed the same cache line, the address would be rejected. For regional memory division generators, non-sharing MP tests do not present unique problems for the generator. The single constraint imposed on the generation of processors in these tests is that they do not use any memory region that is marked as shared. This is a simple extension to the maps of usable memory that test generator must support. 2) Generation of false sharing tests: For on-the-fly memory division, false sharing is similar to non-sharing. A memory check is required on all accesses, but only the exact addresses accessed by the current processor are prohibited from having been accessed by previously generated processors. In regional memory division only accesses to shared memory regions need to perform the same check, as for on-the-fly memory division generators. 3) Generation of deterministic true sharing tests: In deterministic true sharing, code execution is divided into sections called zones. Each zone begins at a semaphore or at the beginning of the test, and ends at a semaphore or at the end of the test. Each processor executes the instruction stream within a zone and then enters the semaphore code sequence. In the semaphore code, the processor decrements the semaphore count using an atomic instruction, checks the value of the semaphore count, and loops until the count reaches zero. All memory accesses made in a zone by all processors are guaranteed to have taken place before the first instruction in the next zone is executed. This does not guarantee the value of shared memory locations. While code is executed in a linear fashion for each processor, the various processors may advance more or less quickly through their code sequences depending on memory wait states, processor speed and other variables. Rules need to be applied to memory accesses to prevent non-deterministic behavior. A summary of the rules needed for sequential generation is shown below. These rules vary slightly for round-robin generation. A processor may read an address if: - No previously generated processor has written to this address OR
- In the last zone where this address was written, only a single processor performed a write OR
- The current processor has written to this address in this zone and no other processor has performed a write to this address in this zone.
A processor may write an address if: - No previously generated processor performs a read to this address in this zone AND
- No previously generated processor performs a read to this address in a following zone, unless there is an intervening write from a previously generated processor.
4) Generation of non-deterministic true sharing tests: In non-deterministic true sharing, there is no knowledge of the order of execution of instructions in different processors. Any processor that reads a data address written by another processor cannot determine if the read is done before or after the write by the other processor. When non-deterministic data is read into a register, the register must be marked to indicate that the data contained in the register is unknown. This means that future instructions that use this register as a source will have unpredictable behavior. Non-deterministic behavior may also be propagated into the processor's status or flag registers. The generator may mark the entire flag register or may mark individual bits as non-deterministic. Since most instructions that operate off flags do so using a limited set of flag bits, tracking individual bits results in greater flexibility. a) Determining non-determinism: On-the-fly memory division generation always assumes that all non-reserved memory accesses are shared. In non-deterministic true sharing, all of these accesses must be assumed to be non-deterministic. For regional memory division, all reads from shared memory regions are assumed to be non-deterministic. Whether using sequential or round-robin mode generation, it is not possible to tell at the time an instruction is generated whether another processor will write to a particular memory address. In the round-robin case, even though the processor tests are generated using a stepping process, it cannot be assumed that execution of the processors will exhibit this same lock-step behavior. b) Restrictions on non-deterministic registers: Registers marked as containing non-deterministic data can propagate non-determinism through the processor. For example, an ADD instruction one non-deterministic source can result in the propagation of non-deterministic data to the destination register or memory location and to the processor flags. To allow reasonable processor execution and random behavior, such propagation must be allowed. Restrictions apply to the use of registers containing non-deterministic data under two circumstances: where the non-deterministic nature of the data may affect the instruction stream, and where the non-deterministic data would be used in the generation of a data address. The generator must restore registers to a deterministic state at some points or at some conditions during the test. This is done to allow deterministic access to system information. This can be done periodically (every n instructions) or based on the current processor condition (percentage of registers containing non-deterministic data). Final words When choosing a random test generator for a multiprocessor system, the verification engineer should examine the implementation and behavioral choices addressed above to ensure that verification requirements are met by the generator. In addition, there should be an easy method of determining the amount of sharing present in each generated test. The generator should provide a list of the exact addresses shared, the processor doing each access, the zone of each access for deterministic true sharing and the instruction doing the access. This information is essential to verification and debugging of multiprocessor tests. Melanie Typaldos has 17 years of experience in the microprocessor field ranging from systems to verification to design. She spent 11 years working at AMD in Austin, Texas, first on the Am29000 family of RISC processors and then on embedded x86 microcontrollers. She is currently employed by Obsidian Software and has participated in the development of the RAVEN random test generator for multiprocessor systems, specifically targeting x86 devices. Becky Cavanaugh is one of the founders of Obsidian Software and currently serves as Director of Software Engineering. Becky started her career in embedded systems programming at Datagraf. From there she moved to AMD where she participated in the development of a random test generator for the Athlon processor. Becky has been involved in all phases of the development of Obsidian's RAVEN random test generator. |