Small but Deadly - the Life Cycle of an I/O Bug
Some years ago, Duolog worked with a customer to develop a verification infrastructure for system-level validation of a large multimedia chip. Duolog developed a modular, programmable chip-level testbench, incorporating peripherals, memories, reset, clocks and control. The testbench was used for system validation and its main targets were RTL simulation, emulation and FPGA, meaning that the whole infrastructure needed to be synthesizable. While challenging to develop, the testbench was delivered on schedule to coincide with an RTL release of the chip database that was to be used by embedded software engineers ramping up for initial HW/SW integration. As we were providing a verification service, we needed to be highly responsive and wanted to ensure that any problems with the testbench were dealt with immediately. Therefore, while the system testbench was developed off-site, our engineers accompanied its delivery on-site to ensure there was nothing blocking this critical stage of the design flow.
The next three weeks were a revelation. On a daily basis, the Duolog engineers were summoned to some random cubicle at the customer’s site. There, we typically found two engineers hunched in front of a workstation. One was a hardware engineer responsible for IP design and implementation. The other was an embedded software engineer responsible for writing the device driver for that particular IP. The software engineer was trying to integrate and test his code within a co-simulation environment but was not getting the expected response from a simple test case. For example, in the case of a UART IP, the first test – a HW/SW interface test of the IP’s registers and bitfields – had passed with some difficulties. The second test was a loopback test, which involved getting the UART to transmit a byte out of the chip on the TX line and, using Duolog’s testbench, looping it back to the RX pin so that the same byte would be received in the UART. This simple integration test would give a good degree of confidence that the UART was where it was supposed to be in the memory map and was behaving as expected. The software integration could then progress from there. There was, however, a problem with the integration: the expected byte never came back. There was a bug in the flow!
BUG!!!
The following dialogue is representative of what happened next and was replayed on many occasions during those three weeks. Names have been changed to protect the innocent and swear words have been removed.
The software engineer, with the Duolog verification engineer and the IP designer in situ, replayed the problem by stepping through his software code:
From the UART’s perspective, the chip design and verification infrastructure was as follows:
Figure 1: UART loopback path
The UART IP was buried deep within the chip core, under several layers of hierarchy. The signals on the core were multiplexed onto several different pins and went through an I/O layer to the SoC boundary. This I/O layer contained the functional and test muxes, BSR cells, some power isolation logic and finally the I/O cells themselves.
The SoC boundary was where Duolog's testbench domain started. Duolog instantiated the SoC top-level in our testbench and did some de-multiplexing to get the correct pins to the correct testbench signals. The testbench provided functionality to loop back certain signals, including the ability to connect the UART TX port back to its RX pad. This allowed a byte to be transmitted serially over the TX pin and straight back through the RX pin where it would be routed through multiple levels of SoC hierarchy to the UART block. Within the UART, it would be received as a byte which the software could then view through a read access. This was the functionality that was not working for the software engineer.
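The loopback check described above can be sketched in a few lines of Python. This is only an illustrative model, not the customer's driver code or Duolog's testbench: the path is reduced to a list of hypothetical stages (core output, I/O output mux, TX pin, testbench loopback, RX pin, I/O input mux, core input), each stage is a function on a single bit, and start/stop bits and timing are ignored.

```python
from typing import Callable, List

# A stage takes the bit arriving from the previous stage and returns the bit
# it drives onward. All stage names and behaviours here are assumptions.
Stage = Callable[[int], int]


def passthrough(bit: int) -> int:
    """A correctly wired stage: the bit goes straight through."""
    return bit


def stuck_low(bit: int) -> int:
    """A faulty stage: the output is flat-lined regardless of the input."""
    return 0


def byte_to_bits(value: int) -> List[int]:
    """Serialize a byte LSB first, as a UART does (framing bits omitted)."""
    return [(value >> i) & 1 for i in range(8)]


def bits_to_byte(bits: List[int]) -> int:
    """Reassemble a byte from LSB-first bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def loopback_test(path: List[Stage], tx_byte: int) -> bool:
    """Send tx_byte through every stage and check it comes back unchanged."""
    rx_bits = []
    for bit in byte_to_bits(tx_byte):
        for stage in path:
            bit = stage(bit)
        rx_bits.append(bit)
    return bits_to_byte(rx_bits) == tx_byte


# Seven hypothetical stages between the UART transmitter and receiver.
HEALTHY_PATH = [passthrough] * 7
# Same path with a fault in the second stage (the I/O output mux in this model).
BROKEN_PATH = [passthrough, stuck_low] + [passthrough] * 5
```

With a test byte such as `0xA5`, `loopback_test(HEALTHY_PATH, 0xA5)` succeeds and `loopback_test(BROKEN_PATH, 0xA5)` fails. The failing result alone says nothing about which stage is at fault, which is exactly the situation the software engineer was in.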
The dialogue continued:
The IP designer immediately, and understandably, defended his design and assured us that it was indeed working. He pulled up a screen of waveforms and showed the following:
Figure 2: RX not looped back
IP DESIGNER (more assured): “Your testbench is not looping the signals back correctly - see the flat-line on the RX signal.”
SW ENGINEER (nodding in compliance): “Yep. The testbench is not working. It’s delaying my testing!”
TB DESIGNER: “What exactly are we looking at here?”
IP DESIGNER & SW ENGINEER (in unison): “The UART interface.”
TB DESIGNER: “But from where?”
IP DESIGNER (dismissively): “The UART, of course.”
The UART was transmitting down 6-7 levels of hierarchy and a lot could go wrong in those intermediate levels.
TB DESIGNER: “Which ones are the UART TX and RX?” After consulting various specifications, they figured out that, in that particular configuration mode, the TX and RX should be on pins 72 and 73 respectively. We homed in on pins 72 and 73 and saw the following waveforms:
Figure 3: Chip boundary
After selecting new trace signals and re-running the simulation, the UART signals on the SoC core boundary appeared as follows:
Figure 4: SoC core boundary
‘It’s an I/O BUG!’
The search had only just begun. The engineer responsible for the I/O layer integration was summoned to help diagnose the root cause.
Figure 5: Is it an I/O bug?
The symptoms of the problem and how the bug was isolated were explained. More information was needed and all of the signals in the I/O layer had to be traced. The huge RTL source file for the I/O layer was opened, along with several Excel sheets containing the I/O specification. After consulting the Excel sheets, it was confirmed that pins 72 and 73 should indeed carry the UART signals.
The integration engineer checked the RTL to make sure it was the latest version. Inside the RTL code, he went to the pin representing the UART TX. There were multiple concurrent statements and several instantiations. He flicked between the RTL code and the waveform viewer for the I/O layer. He grouped several signals together in the waveform viewer and analyzed them. He made the following brief summary:
Figure 6: Output of pin multiplexing
After spending several minutes analyzing the code and the values, the integration engineer picked up the phone and asked the DFT engineer to come and take a look at the RTL. He promptly arrived and they pored over the code until it all became clear. The functional multiplexing was working but was being overridden by the DFT multiplexing which was forcing the output to a test signal instead. The associated expression was using the wrong polarity for the test enable signals. The DFT engineer re-coded it on the spot and the co-simulation was re-run. A positive result came back:
Figure 5: Chip pin working correctly
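The polarity mistake described above is easy to picture with a small model of the output pad logic. The sketch below is an assumption-laden illustration rather than the actual RTL: it assumes a single active-high test enable and one test signal per pad, and the function names are hypothetical.

```python
def pad_output_buggy(func_sig: int, test_sig: int, test_enable: int) -> int:
    """DFT override with the test-enable polarity inverted (the bug).

    In functional mode (test_enable == 0) the pad is wrongly driven by the
    test signal, so the functional value never reaches the pin.
    """
    return test_sig if not test_enable else func_sig


def pad_output_fixed(func_sig: int, test_sig: int, test_enable: int) -> int:
    """DFT override with the intended polarity (assumed active high).

    The test signal takes over only when test_enable is asserted.
    """
    return test_sig if test_enable else func_sig


def pin_waveform(pad_logic, tx_bits, test_sig=0, test_enable=0):
    """Apply the pad logic to a sequence of UART TX bits in one mode."""
    return [pad_logic(bit, test_sig, test_enable) for bit in tx_bits]
```

For a toggling input such as `[1, 0, 1, 1, 0]` in functional mode, the buggy version yields a flat line of zeros at the pin while the fixed version reproduces the input. Note that the functional multiplexer upstream can be entirely correct and the pin still shows nothing, which is why the fault was hard to attribute.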
This meant that the testbench was working as expected, which was a great relief! Satisfied, the integration and DFT engineers left. However, the software test continued to fail. They followed the signalling to the SoC core and again encountered some bad news.
Figure 7: SoC core has no toggling
It seemed as if the input multiplexing path was not working. The integration engineer was called again. He referred to the I/O specification, which stated that the UART_RX could be sourced from several different pins, depending on a mode register set by software. He found the problem quickly – an incorrect mode decode had been coded into the input multiplexer – and within 30 minutes, they had the RX correctly toggling at the IP boundary.
Figure 8: Correct at last at the IP boundary!!
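An incorrect mode decode of this kind can be caught by comparing the implemented decode against the specification for every mode, rather than waiting for a software test to fail. The following sketch assumes the specification and the implementation can each be reduced to a table from mode value to source pin. Pin 73 in mode 0 follows the configuration described earlier; the other modes and pins are hypothetical.

```python
# Which pin feeds UART_RX in each mode, according to the I/O specification.
# Only the mode 0 -> pin 73 entry reflects the article; the rest is made up.
SPEC_RX_SOURCE = {0: 73, 1: 12, 2: 40}

# What the input multiplexer actually implements in this model.
# Mode 0 is decoded incorrectly, so UART_RX listens to the wrong pin.
RTL_RX_SOURCE = {0: 12, 1: 12, 2: 40}


def decode_mismatches(spec: dict, impl: dict) -> list:
    """Return the modes whose implemented source pin differs from the spec."""
    return [mode for mode in sorted(spec) if impl.get(mode) != spec[mode]]


def uart_rx(mode: int, pins: dict, table: dict) -> int:
    """Value seen on UART_RX for a given mode, pin state and decode table."""
    return pins[table[mode]]
```

Here `decode_mismatches(SPEC_RX_SOURCE, RTL_RX_SOURCE)` reports mode 0. With pin 73 driven high and the other pins low, `uart_rx` returns 1 using the specification table and 0 using the faulty one, matching the flat line seen at the SoC core.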
This had taken most of the day, but they had found and fixed two I/O bugs. However, this story was replayed on an almost daily basis, with a whole variety of bugs, over the next three weeks. I/O bugs were found hiding in the following habitats:
- Functional Hardware Environment
  - Incorrect signal or incorrect modes coded on output path
  - Incorrect signal or incorrect modes coded on input path
  - Software setup not configuring the device properly
  - Test logic interference
- Testbench Environment
  - Testbench muxing not correctly configured
  - Testbench control not set up correctly
- Specification Environment
  - Bugs re-appearing because of frequent changes to the I/O specification
We eventually published a set of slides showing where to look for signals toggling, and what to check for at various stages of the path through the chip and out to the testbench.
Figure 9: Debug slides
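The debugging procedure behind those slides amounts to walking the signal path in order and finding the first point where the signal stops toggling. A minimal sketch of that idea, assuming sampled traces are available per stage, is shown below. The stage names are hypothetical labels for the points inspected in the story, not the names used on the slides.

```python
from typing import List, Optional, Tuple


def toggles(samples: List[int]) -> bool:
    """A trace toggles if it takes more than one distinct value."""
    return len(set(samples)) > 1


def first_dead_stage(traces: List[Tuple[str, List[int]]]) -> Optional[str]:
    """Return the first stage, in path order, whose trace is flat.

    The fault lies between that stage and the one before it.
    Returns None if every stage toggles.
    """
    for name, samples in traces:
        if not toggles(samples):
            return name
    return None


# Example traces, ordered from the UART transmitter round to its receiver.
EXAMPLE_TRACES = [
    ("uart_tx_at_ip", [1, 0, 1, 1, 0, 1]),
    ("tx_at_soc_core", [1, 0, 1, 1, 0, 1]),
    ("tx_at_chip_pin", [0, 0, 0, 0, 0, 0]),  # flat: output path is suspect
    ("rx_at_chip_pin", [0, 0, 0, 0, 0, 0]),
    ("rx_at_soc_core", [0, 0, 0, 0, 0, 0]),
    ("uart_rx_at_ip", [0, 0, 0, 0, 0, 0]),
]
```

For the example data, `first_dead_stage(EXAMPLE_TRACES)` returns `"tx_at_chip_pin"`, pointing at the logic between the SoC core and the pin. Checking in path order matters: every stage downstream of a fault is flat too, so only the first flat stage is informative.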
Cost of an I/O bug
I/O bugs are typically ‘blocking’ bugs. For the project in question, the embedded software was on the critical path and an unstable I/O layer stood between the software engineers and their integration goals. The I/O layer also sits at the boundary between several disciplines, so determining the root cause of problems in it can be quite time-consuming. Every time something related to the chip-level environment didn’t work, Duolog was called in, and inevitably there were bugs in the I/O layer. Every bug had to be located, assigned, fixed, validated, closed and checked, adding delays of days, or even weeks, to the embedded software schedule. Not only were the software engineers working flat-out to solve the problems, so too were the IP team, the integration team, the architecture team and the testbench team.
The cost of all the I/O bugs could be measured in:
- Resource cost of handling bugs – analyze, find, fix (exterminate), validate, close – across all teams
- Delays in turnaround of I/O bugs resulting in delays to the HW/SW integration schedule, which ultimately resulted in delays to the product release
- Cost to re-spin a chip if an I/O bug makes it to silicon!
Exterminating I/O bugs
If one bug gets into the system, even a small one, then it becomes critical. How do you catch these bugs? Do you only uncover them when they have already done the damage? Do you hire exterminators to get rid of them temporarily, and repeat this every so often? Obviously, the most effective, and environmentally friendly, way of getting rid of bugs is not to let them in to begin with!
Figure 10: Stop the bugs from entering in first place
In order to keep I/O bugs out of your integration flow, a few fundamental characteristics are required:
- Use a single-source executable specification to capture all of your I/O data, and derive all outputs from this source. Ensure that all I/O users are within this scope, as bugs will quickly infiltrate a flow that has data originating in different places. Every additional source of I/O data is a gap in your bug barrier!
- Use a comprehensive and rigorous suite of design rule checks to ensure a high level of quality for the I/O specification. This will ensure that even the smallest bugs can’t get through the barrier.
- From your validated central specification, auto-generate all views of the I/O layer, including RTL, verification infrastructures, software configuration, die and package netlists, I/O register descriptions and documentation. Cover everything possible with this automated process as anything that is exposed can be infected by a bug.
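The three characteristics above can be illustrated with a toy flow in Python: one in-memory specification, a rule check over it, and two views generated from it. This is a sketch under stated assumptions and does not represent Spinner's data model, rule set or output formats. The `PinMapping` fields, the two rules and the generated text are all hypothetical. Only the UART pins 72 and 73 come from the story; the `spi_clk` entry is invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PinMapping:
    pin: int         # chip pin number
    mode: int        # configuration mode in which this mapping applies
    signal: str      # core signal routed to the pin
    direction: str   # "in" or "out"


# The single source of I/O data: every view below is derived from this list.
SPEC = [
    PinMapping(72, 0, "uart_tx", "out"),
    PinMapping(73, 0, "uart_rx", "in"),
    PinMapping(72, 1, "spi_clk", "out"),  # hypothetical alternate function
]


def check_spec(spec: List[PinMapping]) -> List[str]:
    """Run two simple design rule checks and return the violations found."""
    errors = []
    seen = {}
    for m in spec:
        if m.direction not in ("in", "out"):
            errors.append(f"{m.signal}: invalid direction '{m.direction}'")
        key = (m.pin, m.mode)
        if key in seen:
            errors.append(
                f"pin {m.pin} mode {m.mode}: {seen[key]} and {m.signal} both assigned"
            )
        else:
            seen[key] = m.signal
    return errors


def generate_pin_table(spec: List[PinMapping]) -> str:
    """Documentation view: one line per pin and mode."""
    rows = sorted(spec, key=lambda m: (m.pin, m.mode))
    return "\n".join(
        f"pin {m.pin} mode {m.mode}: {m.signal} ({m.direction})" for m in rows
    )


def generate_sw_defines(spec: List[PinMapping]) -> str:
    """Software view: C-style defines giving the mode that selects each signal."""
    rows = sorted(spec, key=lambda m: m.signal)
    return "\n".join(f"#define {m.signal.upper()}_PIN{m.pin}_MODE {m.mode}" for m in rows)
```

A flow built this way would refuse to generate anything while `check_spec(SPEC)` returns violations. Because the documentation and the software defines come from the same list, a change to the specification cannot reach one view without reaching the other.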
An I/O integration flow with these characteristics is the only effective way to eliminate bugs by not letting them into the system in the first place. As the old proverb says, ‘An ounce of prevention is worth a pound of cure’. Automate your I/O flow – keep the bugs out.
http://www.duolog.com/
David Murray is CTO of Duolog Technologies and was the original designer of Spinner, an award-winning tool that fully automates the I/O layer of a chip. With more than 18 years’ experience in the IC design industry across a wide range of disciplines, David has written and presented several papers on topics from functional coverage to algorithmic IP.
Sean Boylan is the Spinner product manager and has 14 years’ experience developing innovative software for the EDA and telecommunications industries. Prior to Duolog he worked for 3Com and Ericsson.