Script/simulation approach speeds SoC verification
By Michael Orr, Radlan Computer Communications, EEdesign
February 5, 2002 (2:49 p.m. EST)
URL: http://www.eetimes.com/story/OEG20020205S0033
Functional testing of system-on-chip (SoC) software is difficult, especially when mature hardware doesn't exist yet. In this article, Michael Orr, vice president of technology and product management at Radlan Computer Communications, describes a script/simulation-based approach to the problem. Based in Tel Aviv, Israel, and Santa Clara, Calif., Radlan is a fabless systems provider of enterprise and broadband access communications products.

In 1981, just minutes before launch, Space Shuttle Columbia's maiden voyage was delayed by 24 hours because its five on-board computers could not synchronize properly. The source of the problem was a software queue that stood a 1-in-67 chance of being occupied when it was expected to be empty; it was traced back to a diagnostic routine written years earlier that had passed innumerable rounds of testing undetected.

Apollo 13's No. 2 oxygen tank exploded 200,000 miles from Earth on April 13, 1970, putting the lives of three astronauts at great risk. A design modification, carried out three years earlier, raised the permissible voltage to the oxygen tank heaters, but was not matched by an upgrade of the underrated thermostatic switches. These two seemingly unconnected subsystems, both performing as designed and both thoroughly tested, combined in an unforeseen way to cause the accident.

In September of 1999, the Mars Climate Orbiter spacecraft was lost due to a failure to recognize and correct an error in information transfer between teams in Colorado and California. Apparently, one team used English units -- inches, feet and pounds -- while the other used metric units for a key spacecraft operation. This information was critical to the maneuvers required to place the spacecraft in proper Mars orbit. Again, thoroughly tested systems failed in actual practice.

The moral of these stories is not that Murphy's law is alive and well. In all cases extensive testing was performed, yet it still failed to ensure correct operation of the system as a whole. The problem in each case is that "correct" behavior of the system depends on complex interdependencies among multiple factors and subsystems. As a result, testing is inherently difficult.

A similar situation holds for SoC-based devices, whose behavior is a function of many factors interacting in complex ways. Testing each feature or subsystem separately is not enough to ensure correct operation, and testing all possible combinations of factors is infeasible. Testing SoC-based devices, with a focus on software functionality, is inherently difficult. Testing SoC-based devices still under construction only makes things worse, adding hardware and software immaturity issues, typically compounded by limited availability of test devices. The solution we've developed at Radlan is a script/interpreter/simulation-based approach that allows systematic functional testing of SoC-based devices with respect to these considerations.

Challenges of SoC-based testing

Before making the point that testing the correct behavior of SoC-based devices is a challenging task, it is worth noting that simply defining correct behavior can be hard. Consider a device serving as a switch/router with advanced quality of service (QoS) and filtering. Aside from implementing IEEE standards, IETF standards, and best current practices, there are many other considerations. Will the device be serving in a corporate setting, where users are allowed and probably encouraged to share information, or will it be deployed by an ISP to serve individual customers, whose privacy and security must be maintained at all cost?
Device behavior may be affected by such considerations as network topology, changes in user requirements, and external conditions such as connection rates. In assessing the difficulty of defining and testing "correct" behavior for a SoC-based device, it should be realized that a multitude of factors come into play, and the considerations listed here are by no means comprehensive. They do demonstrate, however, that testing each subsystem separately, while not trivial, is simply not enough. Combinations and system-level operation are of greater significance. In light of the above, we believe the claim that testing SoC-based device functionality is "inherently difficult" is easily justified.

While testing "finished" devices is a daunting task, testing a device under development is much harder. We not only face all the issues listed above, but must work with subsystems that have not yet been sufficiently tested. Hardware may be immature and subject to change, and when a problem is found, it may even be hard to tell whether hardware or software is to blame. Regression testing must also be considered: devices are being tested even as functions are being added and modified, requiring testers to constantly go back and check that previous functionality remains unaffected. To compound matters, first-batch units are typically limited in quantity, especially working units, so the number of tests that may be performed in parallel is limited, and engineers may have to work in shifts, or simply wait in line for a chance to test their subsystem. Time pressure also matters: if functional testing and debugging of software can only start after hardware is available, the process becomes sequential, and any hardware problem delays all software advancement.

A script/simulation based approach to functional testing

Our suggested approach to overcoming the issues described above involves a combination of elements. Before describing the suggested testing methodology itself, a description of the various components is in order.

1. RTG - the regression test generator

The cornerstone of our approach is a software interpreter governed by a test description script language. This language permits the user to build test scenarios by describing frames and packets to be sent or received over interpreter interfaces. As this system is geared toward testing data-centric SoC-based devices, the frames/packets involved are Ethernet frames carrying the multitude of protocols typical of IP/IPX environments. The test writer may specify the contents of each packet generated, or the desired/expected contents of received frames/packets. The script language supports such common elements as loops, conditional statements, and variable assignments, among others (a sketch of what such a scenario might look like appears after the description of SLANs below).

This system allows the test generator to produce any desired traffic content mix, including specific protocol emulation. For example, packets conforming to various routing protocols, such as RIP, OSPF and BGP4, may be sent to test device behavior resulting from layer 3 topology changes. For layer 2 testing, Spanning Tree messages (BPDUs - Bridge Protocol Data Units) may be generated to test device conformance to the IEEE 802.1D standard. Several independent RTG interpreters may be run simultaneously, each using a separate testing script, to emulate the effects of several independent systems generating traffic.

2. SLANs - "pipes" between the elements

RTG employs software-only interfaces called "SLANs" ("Simulated LANs"). These software objects function as an Ethernet shared segment, or more precisely, as an Ethernet hub. By design, RTG and RTG scripts have no way of "knowing" what is connected to these SLANs. In fact, SLANs may be used to connect RTG interpreters, Device Under Test (DUT) ports, and arbitrarily selected devices on the test lab's LAN. In short, SLANs really do function like Ethernet hubs, and the tester may "plug" any suitable Ethernet (or simulated) device into each SLAN.
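The article does not reproduce RTG's actual script syntax, so the following is only a rough Python sketch of the kind of scenario such a script expresses: assemble a frame, send it over a SLAN-attached interface, and check what comes back, using ordinary loops, variables, and conditionals. The helper names (ethernet_frame, run_scenario, slan_send, slan_recv) are hypothetical stand-ins for the script language's own primitives, not part of RTG.

    import struct

    def ethernet_frame(dst: bytes, src: bytes, ethertype: int, payload: bytes) -> bytes:
        """Assemble a raw Ethernet II frame (without the trailing FCS)."""
        return dst + src + struct.pack("!H", ethertype) + payload

    BROADCAST = bytes.fromhex("ffffffffffff")
    STATION_A = bytes.fromhex("00a0c9000001")
    ETH_EXPERIMENTAL = 0x88B5   # IEEE local experimental EtherType, handy for test payloads

    def run_scenario(slan_send, slan_recv):
        """Send a burst of broadcast frames and verify each one is flooded back.

        slan_send(frame) and slan_recv(timeout) stand in for the script
        language's 'send on interface' / 'expect on interface' primitives.
        """
        failures = 0
        for seq in range(10):                         # loop construct
            payload = b"RTG-TEST" + bytes([seq])      # variable assignment
            frame = ethernet_frame(BROADCAST, STATION_A, ETH_EXPERIMENTAL, payload)
            slan_send(frame)
            received = slan_recv(timeout=1.0)
            if received is None or payload not in received:   # conditional check
                failures += 1
        return failures == 0

A real scenario would, of course, emulate the protocols mentioned above (RIP, OSPF or BGP4 updates, or 802.1D BPDUs) rather than a private test EtherType.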
3. Operational software on simulated hardware

To carry out software development, debugging, and testing without the need for working hardware for each developer, heavy use is made of an "operational software on simulated hardware" approach. The target hardware, specifically the packet-processing "switch on a chip" VLSI, is simulated down to the register level. Operational software is then developed on this simulated hardware, the only difference being the use of native compilers/debuggers as opposed to the cross-development tools used for the target hardware. The result is a simulated target device, capable of sending and receiving frames/packets over SLANs. Software development in this environment requires that the software run on the simulated hardware with no modifications compared to the target hardware. Thus, the simulated device is a very close approximation of the eventual device.

Using carefully written testing scripts, frames/packets may be sent to selected "ports" on this simulated device, and the frames/packets generated by the simulated device may be examined for correctness. It is, of course, easy to simulate multiple similar or different devices, configure each one separately, and arbitrarily interconnect them by way of either SLANs or actual lab networks. Thus, network topologies and setups of any complexity may easily be constructed, allowing network-level behavior to be tested -- for example, how the effects of a link disconnection propagate through various devices.

Unique to this testing approach is the fact that no differences exist between the control software and test scripts used for simulated hardware and real hardware. Simulated hardware differs from its real-world counterpart only in the host component of its control software, which dictates how it interfaces with the underlying operating system. All functional aspects of a simulated device's control software are identical to their real-world counterparts, so that if a simulated device has passed its tests with flying colors, the chances that the real device will fail when completed are low to non-existent.

4. Close - But Not Close Enough?

At this point, the reader must be thinking: "But it's only a simulation! What if the simulation does not match real life?" Glad you asked. While the simulation is written according to the specification of the target hardware, experience and common sense show that differences will show up when moving to actual devices. This is not only normal, but also beneficial: when real-world and simulated results differ, the cause can be traced either to an error in the simulation or to a discrepancy between the hardware and its specification, and in either case something valuable has been learned.

5. "Repeater" - connecting to the real world

As noted above, the RTG sends and receives packets over SLANs, which are essentially emulated Ethernet hubs. This would seem to limit RTG to testing simulated devices, seemingly contradicting our desire to have as little difference as possible between simulation and reality. Therefore, a separate "repeater" software object transfers frames from SLANs to network adapters on the testing workstation and vice versa. Using the repeater, RTG frames can be sent over actual Ethernet links (UTP or fiber) to actual devices. As the RTG does not "know" whether a SLAN is connected to simulated or actual device ports, RTG test scenarios run unmodified when applied to actual devices. It is, in fact, common to have a network configuration composed of a mix of simulated and real devices interacting with a number of RTG instances over a mix of real and simulated LAN connections.
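As a highly simplified sketch of the "operational software on simulated hardware" idea (not Radlan's implementation), the fragment below assumes the control software touches the switch chip only through read_reg/write_reg calls, so the same driver code can be pointed either at a register-level model or at memory-mapped real hardware. The register names and offsets are invented for illustration.

    PORT_ENABLE_REG = 0x0100      # hypothetical per-port enable register
    PORT_STATUS_REG = 0x0104      # hypothetical per-port link-status register

    class SimulatedSwitchChip:
        """Register-level model of the packet-processing VLSI."""

        def __init__(self) -> None:
            self._regs = {PORT_ENABLE_REG: 0, PORT_STATUS_REG: 0}

        def read_reg(self, offset: int) -> int:
            return self._regs.get(offset, 0)

        def write_reg(self, offset: int, value: int) -> None:
            self._regs[offset] = value & 0xFFFFFFFF
            if offset == PORT_ENABLE_REG:
                # Model a side effect: enabled ports report link-up.
                self._regs[PORT_STATUS_REG] = value

    def enable_port(chip, port: int) -> None:
        """Driver code: identical whether 'chip' is simulated or real."""
        chip.write_reg(PORT_ENABLE_REG, chip.read_reg(PORT_ENABLE_REG) | (1 << port))

    chip = SimulatedSwitchChip()
    enable_port(chip, 3)
    assert chip.read_reg(PORT_STATUS_REG) & (1 << 3)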
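The SLAN and repeater descriptions above also lend themselves to a minimal sketch: a software-only hub, under the assumption (not stated in the article) that each attached port is simply a callback that consumes frames. The class and method names (SLan, attach, transmit) are invented for the example.

    from typing import Callable, List

    FrameHandler = Callable[[bytes], None]

    class SLan:
        """A software-only shared Ethernet segment: behaves like a hub."""

        def __init__(self, name: str) -> None:
            self.name = name
            self._ports: List[FrameHandler] = []

        def attach(self, handler: FrameHandler) -> int:
            """Attach anything that consumes frames: an RTG interpreter,
            a simulated device port, or a repeater bound to a real NIC."""
            self._ports.append(handler)
            return len(self._ports) - 1           # port number on this "hub"

        def transmit(self, from_port: int, frame: bytes) -> None:
            """Flood the frame to every port except the sender, hub-style."""
            for port, handler in enumerate(self._ports):
                if port != from_port:
                    handler(frame)

    # Two listeners on one segment; a "repeater" would be a third handler whose
    # callback writes frames to a real network adapter and injects captured
    # frames back via transmit(), so RTG never sees the difference.
    segment = SLan("slan0")
    received = []
    port_a = segment.attach(lambda f: received.append(("a", f)))
    port_b = segment.attach(lambda f: received.append(("b", f)))
    segment.transmit(port_a, b"\xff" * 6 + b"\x00" * 6 + b"\x88\xb5hello")
    assert received == [("b", b"\xff" * 6 + b"\x00" * 6 + b"\x88\xb5hello")]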
The testing process - all together now!

The suggested testing process actually begins during software development and proceeds in stages: control software is first developed, debugged, and exercised with RTG scripts against simulated devices, and the same software and the same scripts are then run, unmodified, against real hardware as it becomes available. Use of exactly the same software on the target device when it becomes available significantly increases the usefulness of functional testing on simulated hardware. Even so, one should not take flawless software performance in a simulated environment as any indication that the work is even halfway done.

Are We There Yet?

At this point it may be tempting to think that we are done -- but in fact we are not. This is when QA testing just begins. Up to this point the functionality of the system has been tested, but it has yet to be tested under stress. Now is the time for the system to undergo independent testing by professional testers, working separately from the developers. Experience shows that even systems seemingly functioning flawlessly under the rigorous testing process described above break under professional testing. The same experience, however, shows that they break less often, and that once a problem is found, the scenario that uncovered it is encapsulated in the form of an RTG script, which is then made part of the base testing set for any future development.

As shown in the first part of this article, testing SoC-based devices is hard. Experience shows that the elaborate system described here, whose very elaborateness illustrates that difficulty, simplifies matters significantly. This is mainly because testing can begin long before working hardware exists, because the same scripts and control software run unmodified against simulated and real devices, and because every problem found is captured as a regression script that guards all future development.

While the good Dr. Murphy may still be lurking out there, and will probably be sending the odd space probe off course from time to time, the author strongly believes that the approach to SoC-based device testing described in this article will at the very least help keep him on his toes.

Michael Orr has over fifteen years of experience in R&D, project and product management. His past experience includes military R&D project management, compiler writing and UNIX system programming. He has been actively involved in embedded real-time software design and development, as well as product management and marketing for a LAN equipment manufacturer.