Script/simulation approach speeds SoC verification
By Michael Orr, Radlan Computer Communications, EEdesign
February 5, 2002 (2:49 p.m. EST)
URL: http://www.eetimes.com/story/OEG20020205S0033
Functional testing of system-on-chip (SoC) software is difficult, especially when mature hardware doesn't exist yet. In this article, Michael Orr, vice president of technology and product management at Radlan Computer Communications, describes a script/simulation-based approach to the problem. Based in Tel Aviv, Israel, and Santa Clara, Calif., Radlan is a fabless systems provider of enterprise and broadband access communications products.

In 1981, just minutes before launch, Space Shuttle Columbia's maiden voyage was delayed by 24 hours because its five on-board computers could not synchronize properly. The source of the problem was a software queue that stood a 1-in-67 chance of being occupied when it was expected to be empty; it was traced back to a diagnostic routine written years earlier that had passed innumerable rounds of testing undetected.

Apollo 13's No. 2 oxygen tank exploded 200,000 miles from Earth on April 13, 1970, putting the lives of three astronauts at great risk. A design modification, carried out three years earlier, raised the permissible voltage to the oxygen tank heaters, but was not matched by an upgrade of the underrated thermostatic switches. These two seemingly unconnected subsystems, both performing as designed and both thoroughly tested, combined in an unforeseen way to cause the accident.

In September of 1999, the Mars Climate Orbiter spacecraft was lost due to a failure to recognize and correct an error in information transfer between teams in Colorado and California. Apparently, one team used English units -- inches, feet and pounds -- while the other used metric units for a key spacecraft operation. This information was critical to the maneuvers required to place the spacecraft in proper Mars orbit. Again, thoroughly tested systems failed in actual practice.

The moral of these stories is not that Murphy's law is alive and well. In all cases extensive testing was performed, yet it still failed to ensure correct operation of the system as a whole. The problem in each case is that "correct" behavior of the system depends on complex interdependencies among multiple factors and subsystems. As a result, testing is inherently difficult.

A similar situation holds for SoC-based devices, whose behavior is a function of many factors interacting in complex ways. Testing each feature or subsystem separately is not enough to ensure correct operation, and testing all possible combinations of factors is infeasible. Testing SoC-based devices, with a focus on software functionality, is inherently difficult. Testing SoC-based devices still under construction only makes things worse, adding hardware and software immaturity issues, typically compounded by limited availability of test devices. The solution we've developed at Radlan is a script/interpreter/simulation-based approach that allows systematic functional testing of SoC-based devices with respect to these considerations.

Challenges of SoC-based testing

Before making the point that testing the correct behavior of SoC-based devices is a challenging task, it is worth noting that simply defining correct behavior can be hard. Consider a device serving as a switch/router with advanced quality of service (QoS) and filtering. Aside from implementing IEEE standards, IETF standards, and best current practices, there are many other considerations. Will the device be serving in a corporate setting, where users are allowed and probably encouraged to share information, or will it be deployed by an ISP to serve individual customers, whose privacy and security must be maintained at all cost?
Device behavior may be affected by such considerations as network topology, changes in user requirements, and external conditions such as connection rates. In assessing the difficulty of defining and testing "correct" behavior for a SoC-based device, it should be realized that a multitude of factors come into play, and the considerations listed here are by no means comprehensive. They do demonstrate, however, that testing each subsystem separately, while not trivial, is simply not enough. Combinations and system-level operation are of greater significance. In light of the above, we believe the claim that testing SoC-based device functionality is "inherently difficult" is easily justified.

While testing "finished" devices is a daunting task, testing a device under development is much harder. We not only face all the issues listed above, but must work with subsystems that have not yet been sufficiently tested. Hardware may be immature and subject to change, and when a problem is found, it may even be hard to tell whether hardware or software is to blame. Regression testing must also be considered: devices are being tested even as functions are being added and modified, requiring testers to constantly go back and check that previous functionality remains unaffected. To compound matters, first-batch units are typically limited in quantity, especially working units, so the number of tests that may be performed in parallel is limited, and engineers may have to work in shifts, or simply wait in line for a chance to test their subsystem. Time pressure also matters: if functional testing and debugging of software can only start after hardware is available, the process becomes sequential, and any hardware problem delays all software advancement.

A script/simulation based approach to functional testing

Our suggested approach to overcoming the issues described above involves a combination of elements. Before describing the suggested testing methodology itself, a description of the various components is in order.

1. RTG - the regression test generator

The cornerstone of our approach is a software interpreter governed by a test description script language. This language permits the user to build test scenarios by describing frames and packets to be sent or received over interpreter interfaces. As this system is geared toward testing data-centric SoC-based devices, the frames/packets involved are Ethernet frames carrying the multitude of protocols typical of IP/IPX environments. The test writer may specify the contents of each packet generated, or the desired/expected contents of received frames/packets. The script language supports such common elements as loops, conditional statements, and variable assignments, among others (a sketch of what such a scenario might look like appears after the description of SLANs below).

This system allows the test generator to produce any desired traffic content mix, including specific protocol emulation. For example, packets conforming to various routing protocols, such as RIP, OSPF and BGP4, may be sent to test device behavior resulting from layer 3 topology changes. For layer 2 testing, Spanning Tree messages (BPDUs - Bridge Protocol Data Units) may be generated to test device conformance to the IEEE 802.1D standard. Several independent RTG interpreters may be run simultaneously, each using a separate testing script, to emulate the effects of several independent systems generating traffic.

2. SLANs - "pipes" between the elements

RTG employs software-only interfaces called "SLANs" ("Simulated LANs"). These software objects function as an Ethernet shared segment, or more precisely, as an Ethernet hub. By design, RTG and RTG scripts have no way of "knowing" what is connected to these SLANs. In fact, SLANs may be used to connect RTG interpreters, Device Under Test (DUT) ports, and arbitrarily selected devices on the test lab's LAN. In short, SLANs really do function like Ethernet hubs, and the tester may "plug" any suitable Ethernet (or simulated) device into each SLAN.
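The article does not reproduce RTG's actual script syntax, so the following is only a rough Python sketch of the kind of scenario such a script expresses: assemble a frame, send it over a SLAN-attached interface, and check what comes back, using ordinary loops, variables, and conditionals. The helper names (ethernet_frame, run_scenario, slan_send, slan_recv) are hypothetical stand-ins for the script language's own primitives, not part of RTG.

    import struct

    def ethernet_frame(dst: bytes, src: bytes, ethertype: int, payload: bytes) -> bytes:
        """Assemble a raw Ethernet II frame (without the trailing FCS)."""
        return dst + src + struct.pack("!H", ethertype) + payload

    BROADCAST = bytes.fromhex("ffffffffffff")
    STATION_A = bytes.fromhex("00a0c9000001")
    ETH_EXPERIMENTAL = 0x88B5   # IEEE local experimental EtherType, handy for test payloads

    def run_scenario(slan_send, slan_recv):
        """Send a burst of broadcast frames and verify each one is flooded back.

        slan_send(frame) and slan_recv(timeout) stand in for the script
        language's 'send on interface' / 'expect on interface' primitives.
        """
        failures = 0
        for seq in range(10):                         # loop construct
            payload = b"RTG-TEST" + bytes([seq])      # variable assignment
            frame = ethernet_frame(BROADCAST, STATION_A, ETH_EXPERIMENTAL, payload)
            slan_send(frame)
            received = slan_recv(timeout=1.0)
            if received is None or payload not in received:   # conditional check
                failures += 1
        return failures == 0

A real scenario would, of course, emulate the protocols mentioned above (RIP, OSPF or BGP4 updates, or 802.1D BPDUs) rather than a private test EtherType.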
3. Operational software on simulated hardware

To carry out software development, debugging, and testing without the need for working hardware for each developer, heavy use is made of an "operational software on simulated hardware" approach. The target hardware, specifically the packet-processing "switch on a chip" VLSI, is simulated down to the register level. Operational software is then developed on this simulated hardware, the only difference being the use of native compilers/debuggers as opposed to the cross-development tools used for the target hardware. The result is a simulated target device, capable of sending and receiving frames/packets over SLANs. Software development in this environment requires that the software run on the simulated hardware with no modifications compared to the target hardware. Thus, the simulated device is a very close approximation of the eventual device.

Using carefully written testing scripts, frames/packets may be sent to selected "ports" on this simulated device, and the frames/packets generated by the simulated device may be examined for correctness. It is, of course, easy to simulate multiple similar or different devices, configure each one separately, and arbitrarily interconnect them by way of either SLANs or actual lab networks. Thus, network topologies and setups of any complexity may easily be constructed, allowing network-level behavior to be tested -- for example, how the effects of a link disconnection propagate through various devices.

Unique to this testing approach is the fact that no differences exist between the control software and test scripts used for simulated hardware and real hardware. Simulated hardware differs from its real-world counterpart only in the host component of its control software, which dictates how it interfaces with the underlying operating system. All functional aspects of a simulated device's control software are identical to their real-world counterparts, so that if a simulated device has passed its tests with flying colors, the chances that the real device will fail when completed are low to non-existent.

4. Close - But Not Close Enough?

At this point, the reader must be thinking: "But it's only a simulation! What if the simulation does not match real life?" Glad you asked. While the simulation is written according to the specification of the target hardware, experience and common sense show that differences will show up when moving to actual devices. This is not only normal, but also beneficial: when real-world and simulated results differ, the cause can be traced either to an error in the simulation or to a discrepancy between the hardware and its specification, and in either case something valuable has been learned.

5. "Repeater" - connecting to the real world

As noted above, the RTG sends and receives packets over SLANs, which are essentially emulated Ethernet hubs. This would seem to limit RTG to testing simulated devices, seemingly contradicting our desire to have as little difference as possible between simulation and reality. Therefore, a separate "repeater" software object transfers frames from SLANs to network adapters on the testing workstation and vice versa. Using the repeater, RTG frames can be sent over actual Ethernet links (UTP or fiber) to actual devices. As the RTG does not "know" whether a SLAN is connected to simulated or actual device ports, RTG test scenarios run unmodified when applied to actual devices. It is, in fact, common to have a network configuration composed of a mix of simulated and real devices interacting with a number of RTG instances over a mix of real and simulated LAN connections.
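As a highly simplified sketch of the "operational software on simulated hardware" idea (not Radlan's implementation), the fragment below assumes the control software touches the switch chip only through read_reg/write_reg calls, so the same driver code can be pointed either at a register-level model or at memory-mapped real hardware. The register names and offsets are invented for illustration.

    PORT_ENABLE_REG = 0x0100      # hypothetical per-port enable register
    PORT_STATUS_REG = 0x0104      # hypothetical per-port link-status register

    class SimulatedSwitchChip:
        """Register-level model of the packet-processing VLSI."""

        def __init__(self) -> None:
            self._regs = {PORT_ENABLE_REG: 0, PORT_STATUS_REG: 0}

        def read_reg(self, offset: int) -> int:
            return self._regs.get(offset, 0)

        def write_reg(self, offset: int, value: int) -> None:
            self._regs[offset] = value & 0xFFFFFFFF
            if offset == PORT_ENABLE_REG:
                # Model a side effect: enabled ports report link-up.
                self._regs[PORT_STATUS_REG] = value

    def enable_port(chip, port: int) -> None:
        """Driver code: identical whether 'chip' is simulated or real."""
        chip.write_reg(PORT_ENABLE_REG, chip.read_reg(PORT_ENABLE_REG) | (1 << port))

    chip = SimulatedSwitchChip()
    enable_port(chip, 3)
    assert chip.read_reg(PORT_STATUS_REG) & (1 << 3)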
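The SLAN and repeater descriptions above also lend themselves to a minimal sketch: a software-only hub, under the assumption (not stated in the article) that each attached port is simply a callback that consumes frames. The class and method names (SLan, attach, transmit) are invented for the example.

    from typing import Callable, List

    FrameHandler = Callable[[bytes], None]

    class SLan:
        """A software-only shared Ethernet segment: behaves like a hub."""

        def __init__(self, name: str) -> None:
            self.name = name
            self._ports: List[FrameHandler] = []

        def attach(self, handler: FrameHandler) -> int:
            """Attach anything that consumes frames: an RTG interpreter,
            a simulated device port, or a repeater bound to a real NIC."""
            self._ports.append(handler)
            return len(self._ports) - 1           # port number on this "hub"

        def transmit(self, from_port: int, frame: bytes) -> None:
            """Flood the frame to every port except the sender, hub-style."""
            for port, handler in enumerate(self._ports):
                if port != from_port:
                    handler(frame)

    # Two listeners on one segment; a "repeater" would be a third handler whose
    # callback writes frames to a real network adapter and injects captured
    # frames back via transmit(), so RTG never sees the difference.
    segment = SLan("slan0")
    received = []
    port_a = segment.attach(lambda f: received.append(("a", f)))
    port_b = segment.attach(lambda f: received.append(("b", f)))
    segment.transmit(port_a, b"\xff" * 6 + b"\x00" * 6 + b"\x88\xb5hello")
    assert received == [("b", b"\xff" * 6 + b"\x00" * 6 + b"\x88\xb5hello")]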
The testing process - all together now!

The suggested testing process actually begins during software development and proceeds in stages: control software is first developed, debugged, and exercised with RTG scripts against simulated devices, and the same software and the same scripts are then run, unmodified, against real hardware as it becomes available. Use of exactly the same software on the target device when it becomes available significantly increases the usefulness of functional testing on simulated hardware. Even so, one should not take flawless software performance in a simulated environment as any indication that the work is even halfway done.

Are We There Yet?

At this point it may be tempting to think that we are done -- but in fact we are not. This is when QA testing just begins. Up to this point the functionality of the system has been tested, but it has yet to be tested under stress. Now is the time for the system to undergo independent testing by professional testers, working separately from the developers. Experience shows that even systems seemingly functioning flawlessly under the rigorous testing process described above break under professional testing. The same experience, however, shows that they break less often, and that once a problem is found, the scenario that uncovered it is encapsulated in the form of an RTG script, which is then made part of the base testing set for any future development.

As shown in the first part of this article, testing SoC-based devices is hard. Experience shows that the elaborate system described here, whose very elaborateness illustrates that difficulty, simplifies matters significantly. This is mainly because testing can begin long before working hardware exists, because the same scripts and control software run unmodified against simulated and real devices, and because every problem found is captured as a regression script that guards all future development.

While the good Dr. Murphy may still be lurking out there, and will probably be sending the odd space probe off course from time to time, the author strongly believes that the approach to SoC-based device testing described in this article will at the very least help keep him on his toes.

Michael Orr has over fifteen years of experience in R&D, project and product management. His past experience includes military R&D project management, compiler writing and UNIX system programming. He has been actively involved in embedded real-time software design and development, as well as product management and marketing for a LAN equipment manufacturer.