Root Cause Analysis (RCA) of Soft Digital IP to improve IP Quality & Reusability
By Vivek Singh, Ankur Krishna, and Amardeep Punhani, Digital-IP Group, Freescale Semiconductor India Pvt. Ltd., Noida, India
Abstract
Silicon defects are extremely expensive! The cost of quality is very high in an ASIC company, and finding and correcting defects is one of the most expensive parts of an ASIC development process. This cost increases exponentially if the errors are discovered at later stages of the life cycle.
A mature "in process" quality assurance returns substantial cost savings by preventing, or at least detecting and correcting, defects at early stages. Any mature process has to be a control loop that is constantly fed back with process defect misses. Root Cause Analysis (RCA) is a commonly used tool to identify process loopholes. Corrective steps can only be defined once "Why did the failure occur?" is answered. The objective remains to learn from the defects detected in past projects and to eliminate these pitfalls through process improvements and the establishment of new methodologies.
This paper shares our learnings while defining a repeatable RCA methodology for soft digital IP.
This paper will focus on the following areas of the process:
- Collecting samples
- Classifying the defects (Defect Analysis)
- Identification of RCA stakeholders
- Identification of corrective actions (Process Improvements)
The RCA methodology specified in this paper helps designers identify the root causes of any issue and recommend steps for correction. Any RCA resulting in vague recommendations like "Improve adherence to general design guidelines" is a failure. RCA recommendations should target specific issues in a design process and should preferably also define the recommended steps. The proposed RCA methodology allows systematic design process improvements to be developed and the impact of corrective steps to be assessed against the recommendations.
Keywords
Defect, Defect Analysis, Defect Prevention, Root Cause Analysis
Introduction
Consistent quality of soft IP of increasing complexity, while maintaining a lean and effective design process, is an expectation of any mature design group. To meet these goals it is almost mandatory to prevent recurring defects. The cost of quality increases significantly when a defect is missed at the early stages of design. It is therefore advisable to put in measures that prevent defects from being introduced into the product.
Analyzing defects at the IP, technology-domain, or SoC level yields good practices and immediate gap fixes. While these methods are useful for sharing good practices, Root Cause Analysis (RCA) is recommended for finding overall process improvements. Any good process should be based on the pattern observed rather than on a single incident. These preventive measures, after consent and commitment from team members, are embedded into the organization as a baseline for future projects. The methodology is aimed at providing the organization a long-term solution and the maturity to avoid repetitive defects.
In this study, we share our learnings from defining and implementing our RCA methodology while maintaining the goal of a lean and effective design process. The paper provides a comprehensive view of the challenges we faced and the results we achieved.
The paper is divided into sections, each describing a vital step of the RCA.
Collecting samples
Any quality process is only as good as the data it is based on: the more apt the data, the more relevant and accurate the analysis. Though the sampling logic will depend on what you are analyzing, the data selection should satisfy the following three simple conditions (a filtering sketch follows the list):
- Not too old: Older data gives us an understanding of the issues and their history, but data that is too old may lead us to focus on areas that were only problems while the design process or the team was maturing. "Too old" can be defined in terms of the design process life cycle. In our analysis we took the last 2.5 years of data, as our typical soft IP design life cycle was 8 months.
- Same maturity: The data analyzed should come from the same maturity group. Data from a wider group may lead to recommendations that are not very effective against the problem being worked on. The maturity group selection depends on the focus area of the analysis. In our study we were trying to minimize customer-reported issues, so we took the customer-reported errata and the post-alpha* defects of digital IP. Post-alpha maturity is defined as 100% verified soft IP, where 100% verification is a combination of the code coverage and functional coverage achieved by verification and validation.
- Same design process: To identify gaps in an existing design process, it is essential to sample only data that followed the same or similar processes. Data from different processes may lead to a comparative analysis of the two processes rather than the identification of gaps in a single process. It is also assumed that the outputs of similar processes had the same exit criteria and expectations.
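As an illustration only, these three conditions can be expressed as a filter over records exported from a bug-tracking system. The sketch below assumes Python; the field names (`date_of_origin`, `maturity`, `process_id`) and their values are hypothetical, not the schema of any particular tracker.

```python
from datetime import date, timedelta

# "Not too old": keep only the last 2.5 years of data (our window).
SAMPLE_WINDOW = timedelta(days=int(2.5 * 365))

def in_sample(defect: dict, today: date) -> bool:
    """Apply the three sampling conditions to one exported defect record."""
    return (
        today - defect["date_of_origin"] <= SAMPLE_WINDOW  # not too old
        and defect["maturity"] == "post-alpha"             # same maturity group
        and defect["process_id"] == "soft-ip-flow"         # same design process
    )

defects = [
    {"id": "D-101", "date_of_origin": date(2011, 3, 1),
     "maturity": "post-alpha", "process_id": "soft-ip-flow"},
    {"id": "D-042", "date_of_origin": date(2007, 6, 1),   # too old: excluded
     "maturity": "post-alpha", "process_id": "soft-ip-flow"},
]
sample = [d for d in defects if in_sample(d, date(2012, 1, 1))]
print([d["id"] for d in sample])  # -> ['D-101']
```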
From our study we found that having the following data is helpful in the analysis:
- Description: The detailed description of the issue, the scenario and the observed behavior.
- Ticket Dialog: The related discussion on the defect. This will help in initial classification of the defect.
- Date of Origin: The origination date of the issue. This will help to filter all the issues which were filed before the maturity stage achieved.
- Affected version: The design IP and its version on which the defects were filed. This will help to understand the complexity of the IP.
- Originator: Access to the originator – the person who filed the defect.
- Analyst: The design owner, verification owner and the validation owner of the design IP on which the defect was found.
- Defect ID: The unique identification of the defect/issue we are analyzing, for ease of communication.
All of the above data effectively mandates an already defined bug-tracking system that captures it online. Relying on the memory of the people involved to collect data at analysis time may lead to wrong analysis.
Because of the variety of soft IP developed in our organization, we also collected the following information from each analyst (a combined record sketch follows these questions):
- Was the IP verified at module level or at SOC level?
- Was the verification environment random or directed?
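Taken together, these fields suggest one record type per analyzed defect. The sketch below is one possible shape, assuming Python dataclasses; every field name is illustrative and should be mapped to the schema of the bug-tracking system actually in use.

```python
from dataclasses import dataclass

@dataclass
class DefectRecord:
    """One analyzed defect: bug-tracker fields plus analyst-supplied context."""
    defect_id: str         # unique ID, for ease of communication
    description: str       # scenario and observed behavior
    ticket_dialog: str     # related discussion, aids initial classification
    date_of_origin: str    # ISO date; filters out pre-maturity issues
    affected_version: str  # IP name and version, indicates complexity
    originator: str        # who filed the defect
    analyst: str           # design/verification/validation owner of the IP
    verified_at: str       # "module" or "soc" (analyst-supplied)
    env_type: str          # "random" or "directed" (analyst-supplied)

rec = DefectRecord("D-101", "FIFO overflows on back-to-back writes",
                   "see ticket thread", "2011-03-01", "ip_fifo v2.1",
                   "customer-A", "design-owner-X", "module", "directed")
print(rec.defect_id, rec.verified_at)  # -> D-101 module
```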
Classifying the defects
“Comparing two defects is like comparing apples to oranges”. Wrong!!
Each defect may have different, complex steps to reproduce, but the underlying issue will be very similar. Each defect is the result of a miss at some stage of the design process. The objective of our analysis is to find the first gate where it could have been caught. If an IP is defined as mature after post-silicon validation, then all defects that end up at the customer site can be attributed to a failure of the post-silicon process. Strengthening the post-silicon validation process to catch "only similar" types of issues may be a stop-gap measure; the long-term solution is to find the first process check at an existing gate where the defect should have been caught.
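A minimal sketch of this "first gate" principle, assuming an ordered list of process gates (the gate names below are examples, not a prescribed flow): the analyst marks every gate whose checks could have detected the defect, and the defect is attributed to the earliest of them.

```python
# Process gates in execution order; names are illustrative only.
GATES = ["spec_review", "rtl_review", "module_verification",
         "soc_verification", "post_si_validation"]

def first_missed_gate(could_have_caught: set) -> str:
    """Return the earliest gate whose checks should have caught the defect.

    `could_have_caught` holds the analyst's judgment of which gates had
    a check capable of detecting this defect."""
    for gate in GATES:
        if gate in could_have_caught:
            return gate
    return "customer"  # the defect escaped every in-process gate

# A customer-reported defect that was detectable at RTL review:
print(first_missed_gate({"rtl_review", "post_si_validation"}))  # rtl_review
```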
How defects are classified will vary from one process to another. Work such as IBM's Orthogonal Defect Classification [1] is a classic example of a generalized classification. In our analysis we wanted a classification that is easily understandable to the user and maps directly to actionable items.
In our analysis we went through all the defects once to understand each issue and classified the area in which we thought each defect lay. The principles we used:
- Who (functional domain role and not person) could have caught it?
- How could they have caught it?
- What may have led to them missing it?
Any good process is people-independent. It does not rely on the heroism of individuals but on process checks and guidelines. The process should be well equipped to guide any person with the minimum skill set needed for the job to produce a work product of the expected quality.
Defect analysis generally seeks to classify defects into categories and identify possible causes in order to direct process improvement efforts. A multi-layered classification approach helps to map the defects to pre-defined quality gates in the design process and to identify actionable steps to improve IP-design process quality. In this paper, we propose a three-layered approach to identify the gaps in the IP-design process and help identify the required actions.
Level: 01 defect classification
First-level defect classification segregates the defects at the functional-domain level and helps to map them to the corresponding IP-design process gates. This classification can also be described as the "source of defect" at the functional-domain level.
Level: 02 defect classification:
The goal of RCA is to identify the root cause of defects and to initiate actions that eliminate the source of the defects. The defect categories based on IP-design process events/gates define the 2nd-level classification and propagate the defects from the "source of defect" classification (functional domain) to the IP-design process event level.
Level: 03 defect classification:
The 3rd-level defect classification covers the sub-functional gaps at specific IP-design process gates and helps to prevent the recurrence of harmful defects by suggesting preventive actions. Root-causing down to 3rd-level events helps to identify specific corrective actions in the design process.
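Because the three levels form a tree, the classification can be encoded directly as nested data and checked programmatically. The fragment below is an illustrative Python sketch using a subset of the categories tabulated in the following sections, not a complete taxonomy.

```python
# Level 01 (source of defect) -> Level 02 (process gate) -> Level 03 categories.
TAXONOMY = {
    "Translation Shortfalls": {
        "Systems specification": [
            "IP feature requirements",
            "Use-case scenarios",
            "Custom interfaces",
        ],
        "Design Understanding": [
            "Register map & descriptions",
            "Operational sequences",
        ],
    },
    "Documentation Shortfall": {
        "IP Design documentation": [
            "Lack of clarity on functional description (no RTL bug)",
        ],
    },
}

def is_valid(level1: str, level2: str, level3: str) -> bool:
    """Check that a proposed three-level classification exists in the tree."""
    return level3 in TAXONOMY.get(level1, {}).get(level2, [])

print(is_valid("Translation Shortfalls", "Systems specification",
               "Use-case scenarios"))  # -> True
```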
Defect mapping from Level 01 to Level 03
Translation shortfalls:
Defects introduced into IP design requirements and design documents by translation shortfalls are more frequent than errors in the source code itself. Defects introduced during the requirements and design phases are not only more probable but also more severe and more difficult to remove. The table below shows the defects introduced by translation shortfalls during the IP design life cycle.
| Source of Defect (Level 01) | RCA Classification (Level 02) | Shortfall Categories (Level 03) |
|---|---|---|
| Translation Shortfalls | Systems specification | IP feature requirements |
| | | System environment/expectations |
| | | Use-case scenarios |
| | | Custom interfaces |
| | | Standard protocol specification reference & exception |
| | Design Understanding | IP functional description |
| | | Register map & descriptions |
| | | Operational sequences |
| | | Error flags/interrupt recovery mechanism |
| | | IP interface & protocol requirements |
Review/Testing shortfalls:
Team review is one of the most effective activities for uncovering defects that may later be discovered by the verification/validation teams or by a customer. A team review of the IP architecture, RTL code, and verification environment often helps reduce defects related to algorithm implementation, incorrect logic, or missing conditions. Reviewing the IP architecture and RTL, and tracing requirements through to the tail end of the design process, can bring out the gaps in the design process.
| Source of Defect (Level 01) | RCA Classification (Level 02) | Shortfall Categories (Level 03) |
|---|---|---|
| IP RTL Design and Verification Methodology Shortfalls | Lack of Reviews/Basic fault | Basic RTL design mistake |
| | | Lack of basic verification plan |
| | | Lack of basic verification test case |
| | Lack of Extensive Verification | Code coverage analysis |
| | | Corner case: intra-mode coverage |
| | | More simulation time |
| | | Parameter range coverage |
| | | Fault injection coverage |
Documentation Shortfalls:
This classification broadly covers defects related to documentation updates, clarity, and descriptions. The source of defects in this category is confined within the IP-design team, isolated from the translation issues propagated through cross-functional dependencies.
| Source of Defect (Level 01) | RCA Classification (Level 02) | Shortfall Categories (Level 03) |
|---|---|---|
| Documentation Shortfall | IP Design documentation | Lack of clarity on functional description (no RTL bug) |
| | | Documentation not updated after defect fixes/change requests |
| | | Lack of clarity on limitations (no RTL bug) |
IP Tools and Methodology Shortfalls:
This defect classification covers IP tools and methodology related aspects. It helps to map functional defects to methodology-specific gaps in the IP-design process.
| Source of Defect (Level 01) | RCA Classification (Level 02) | Shortfall Categories (Level 03) |
|---|---|---|
| IP Implementation Tools and Methodology Shortfalls | Tools & Methodology Shortfalls | Gate-level simulation shortfall |
| | | Clock domain crossing shortfall |
| | | Design-for-test shortfall |
| | RCA-Others | Errata entered although the behavior was well documented |
| | | Legacy driver code, script, or rule-set errors |
| | | IP under development (defect discovered late in the cycle) |
| | | Not a defect |
Identification of RCA stakeholders
It's teamwork!
RCA can be successful only with a committed team. Each team member should bring an area of expertise to the table. The stakeholders are not only those directly affected by the defect but also those who may be indirectly involved. Together, the group should be able to answer the following:
- What was the issue at hand? This can be best answered by the one who analyzed and debugged it.
- What checks were done around the functionality, and how did we miss it? This can be answered only by those who verified or validated it.
- What was the recommended process and did we follow it? The process expert is needed to answer this.
- What change is needed in the process to catch these? This needs a forum of process experts.
Identification of corrective actions (Process Improvements)
| RCA Classification (Level 02) | Shortfall Categories (Level 03) | Recommendations |
|---|---|---|
| Systems specification | Use-case scenarios | Application use-case scenarios to be part of the IP requirements. |
| | | IP requirements to be tagged to the verification plan. |
| | | The IP requirement document must be treated as the single requirement list for the design team. |
| | System environment/expectations | More collaboration with the Systems team is needed when the design team is new to a particular functional domain. |
| Design Understanding | IP interface & protocol requirements | Specific interface protocol requirements should be part of the IP requirement document. |
| | | Mutual agreement on IP interface requirements with the Systems team. |
| | Error flags/interrupt recovery mechanism | A separate section on the "error or interrupt recovery mechanism" to be specified in the design document. |
| | | Section to be tagged to the verification plan. |
| Lack of Reviews/Basic fault | | Strengthen the IP review process, including the design and verification teams. |
| | Lack of basic verification plan | Publish review and exception reports at a defined location. |
| | Lack of basic verification test case | Define the review expectations/guidelines and the report-log mechanism. |
| | Basic RTL design mistake | Randomization will cover these scenarios. |
| Lack of Extensive Verification | Parameter range coverage | The parameter/`define combinations used for testing should be documented in the integration guide. |
| | | Excessive use of `define macros should be avoided. |
| | Large simulation time required | Such IPs need pre-silicon checks; these defects should be caught by the verification flow. |
| | Fault injection | Negative scenarios should be checked, i.e., system behavior should remain intact even after fault injection. |
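Once every defect carries a Level 02/Level 03 label, one simple way to decide which of these corrective actions to take up first (our illustration, not a step prescribed by the tables above) is a Pareto count per bucket, so the largest buckets receive process fixes first:

```python
from collections import Counter

# (Level 02, Level 03) labels assigned during classification; data is made up.
classified = [
    ("Systems specification", "Use-case scenarios"),
    ("Systems specification", "Use-case scenarios"),
    ("Lack of Extensive Verification", "Parameter range coverage"),
    ("Lack of Reviews/Basic fault", "Basic RTL design mistake"),
    ("Systems specification", "Use-case scenarios"),
]

# Rank buckets by defect count; fund corrective actions for the top ones first.
for (level2, level3), count in Counter(classified).most_common():
    print(f"{count:2d}  {level2} / {level3}")
```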
Conclusion:
This paper provides a comprehensive view of the challenges we faced and the results we achieved. The RCA methodology specified in this paper helps designers identify the root causes of issues and recommend steps for correction. The multi-layered classification approach helps to map defects to pre-defined quality gates in the design process and to identify actionable steps to improve IP-design process quality. We have proposed a three-layered approach to identify the gaps in the IP-design process and help identify the required actions. The RCA recommendations target specific issues in a design process. The proposed methodology allows systematic design process improvements to be developed and the impact of corrective steps to be assessed against the recommendations. Implementing the proposed RCA methodology and corrective actions enhances product quality by preventing errors based on past mistakes. It also improves the design process and cycle time, which helps to reduce product cost while raising customer satisfaction.
References
[1] R. Chillarege, I. S. Bhandari, J. K. Chaar, M. J. Halliday, D. S. Moebus, B. K. Ray, and M.-Y. Wong, "Orthogonal Defect Classification - A Concept for In-Process Measurements," IEEE Transactions on Software Engineering, vol. 18, no. 11, pp. 943-956, November 1992.
[2] S. Kumaresh and R. Baskaran, "Defect Analysis and Prevention for Software Process Quality Improvement," International Journal of Computer Applications (0975-8887), vol. 8, no. 7, October 2010.