Holistic Methodologies: Odd Bedfellows (Six Sigma and PMLC/SDLC), but Harmonious Relatives
Failure Mode and Effects Analysis (FMEA)
When I first started using Six Sigma tools and was introduced to FMEA, I thought it would be as complex as the name implies. I was wrong. The concepts behind FMEA are really quite simple and logical. Looking at risk from a hazard perspective just takes a slightly different mindset. Using the FMEA mindset can benefit all phases of the risk management lifecycle.
Fundamentally, FMEA and Project Risk Management are one and the same, but FMEA uses a more systematic approach to failure modes analysis. While traditional risk management does cover failure modes, typically it's concerned with the inherent risk of project execution on a more holistic level. In contrast, FMEA focuses on the impact of product feature failure modes. This results in a different approach to the risk analysis; some questions are very similar, and some may feel very different from the traditional process:
- What are the failure points? (What can go wrong; the hazard)
- What is the mode of the failure? (How it can go wrong)
- What is the hazard's effect on the product or users?
- What is the impact on the stability of the product or process?
- What existing controls are in place within the process to prevent the hazard from occurring, and how is compliance with those controls measured?
- What is the hazard's response impact on the product or process? (What does success look like once the risk has occurred and the response action has been implemented?)
- How is success measured, and what does the risk profile look like after the initial risk has occurred and the team has responded? (How does the team validate that the response did fix the problem?)
- What other failure modes can this failure create?
Think of FMEA, often referred to as a Hazards Analysis, as an analysis tool that looks at the likelihood of failure. To use this approach, you need to add a few more columns to your project risk log, and ask a few more questions when you put it together.
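As a rough sketch, those extra risk-log columns might look like the record below, assuming Python; the field names and sample entry are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of one row in an FMEA-extended risk log.
# Field names and sample values are illustrative only.
@dataclass
class FmeaRiskEntry:
    failure_mode: str      # what can go wrong (the hazard)
    failure_effect: str    # impact on the product, process, or user
    potential_causes: str  # why/how it could happen
    current_controls: str  # what prevents or detects it today
    severity: int          # 1-10 per the team's operational definitions
    occurrence: int        # 1-10 likelihood of occurrence
    detectability: int     # 1-10 difficulty of detecting the hazard

entry = FmeaRiskEntry(
    failure_mode="Page load exceeds SLA under peak load",
    failure_effect="Customer-perceived slowness; SLA breach",
    potential_causes="Unindexed query on the session table",
    current_controls="Synthetic monitoring with response-time alerts",
    severity=6,
    occurrence=4,
    detectability=3,
)
```

A structure like this keeps the hazard description, controls, and scores together so the team can re-score the same entry after a response is enacted.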
Failure Mode
This is the description of the hazard. Make this description as SMART as possible to drive specificity, which will help with deriving the severity, likelihood of occurrence, and detectability. Vague failure mode descriptions can lead to ineffectual responses and weak success criteria. (Question to ask: What is the hazard?)
Failure Effect
What is the impact of the risk or hazard on the customer, the process, the efficacy of the product, or the credibility of the producer? Naturally, in areas where there are safety concerns, the failure effect is of paramount importance, as it could mean a life-or-death defect. Detailing these effects can lead to detecting additional failure modes as you brainstorm with the team. (What if it happens?)
Potential Causes
Depending on the overall priority of the risk, it may be worth asking up front about the potential causes of this risk or hazard. Quite often, the risk you are discussing may be only a symptom of another risk or hazard that you have not yet identified. This often happens when the Voice of the Customer describes a risk in subjective terms while the Voice of the Process reports metrics on how the process actually performs. (An example would be response time within an application: if the customer says it "takes too long" to load a web page, and the tools say the response time is within but close to the upper limit of the Service Level Agreement, exploring possible causes may surface previously unidentified failures.) Asking why a risk might occur is a very powerful question. (Why/how could it happen?)
Current Controls
Ask the team what current controls are in place to prevent the risk or hazard from occurring. Are those controls sufficient? Does your project or product have enough controls around it to eliminate the risk, or are the controls at least sufficient to detect the trigger when the hazard has occurred? This is called "proactive prevention." (What could be done or is being done now to stop it from happening?)
Detectability
Now we add Detectability to the Risk Prioritization Number formula. In typical risk quantification methods, risk priority is calculated using this formula:
Likelihood of Occurrence * Severity = Risk Priority
In FMEA, we add another variable to represent how hard it is to see that the hazard has occurred:
Likelihood of Occurrence * Severity * Difficulty of Detectability = Risk Priority
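The two formulas above can be sketched in a few lines of Python; the sample scores are illustrative, on the 1-10 scales used in this chapter.

```python
# Minimal sketch of the two prioritization formulas, assuming 1-10 scales.
def traditional_risk_priority(occurrence: int, severity: int) -> int:
    """Likelihood of Occurrence * Severity."""
    return occurrence * severity

def fmea_risk_priority(occurrence: int, severity: int, detectability: int) -> int:
    """FMEA adds Difficulty of Detectability as a third factor."""
    return occurrence * severity * detectability

# A subtle, hard-to-detect hazard outranks an obvious one of equal
# likelihood and severity once detectability enters the formula.
obvious = fmea_risk_priority(occurrence=5, severity=6, detectability=2)  # 60
subtle = fmea_risk_priority(occurrence=5, severity=6, detectability=9)   # 270
```

Note how two hazards that tie under the traditional formula (both 5 * 6 = 30) separate sharply once detectability is included.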
Difficulty of detectability gets a low score when the hazard is easy to detect once it has been triggered. When the trigger is more subtle and the hazard is harder to spot, difficulty of detectability gets a high score. Here are some examples to illustrate the difference:
- A defect in the code that shows up after the product is shipped but was not detected prior to shipping would have had a high score for Difficulty of Detectability (if it was detectable at all).
- If the defect was easily seen but was not considered severe enough to stop the product from shipping, it probably had a low score for Difficulty of Detectability, a low Severity score, and most likely a low Likelihood of Occurrence.
- A defect occurring after shipment that could cause additional indirect consequences might have a higher Severity score along with the highest possible Difficulty of Detectability score (for the associated failure modes created by the defect).
- Another example of poor detectability would be a defect that shows up only after excessive use or wear (something that relates to mean time between failures and acceptable tolerances).
The failure modes that show up after a product has been shipped should become direct input into the next release cycle, since they were unknown unknowns in this release cycle. The quality assurance test harness should then incorporate appropriate controls to catch those failure modes going forward.
In my experience using FMEA, I have found it useful to weight this score so that Difficulty of Detectability does not rank equally with Likelihood of Occurrence and Severity. Instead of scoring Detectability from 1-10, I have used 1-5 in 0.5 increments. That way, Detectability plays an important role in the risk prioritization calculation without being given equal weight to Severity or Likelihood of Occurrence (more important values, to my mind).
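A minimal sketch of this weighted scheme, assuming Python; the validation rules simply encode the 1-5 scale in 0.5 increments described above and are not a prescribed method.

```python
# Weighted FMEA priority: occurrence and severity on 1-10 scales,
# detectability down-weighted to a 1.0-5.0 scale in 0.5 steps.
def weighted_fmea_priority(occurrence: int, severity: int,
                           detectability: float) -> float:
    if not (1 <= occurrence <= 10 and 1 <= severity <= 10):
        raise ValueError("occurrence and severity must be 1-10")
    # detectability must be 1.0-5.0 and land on a 0.5 increment
    if not 1.0 <= detectability <= 5.0 or (detectability * 2) % 1 != 0:
        raise ValueError("detectability must be 1.0-5.0 in 0.5 steps")
    return occurrence * severity * detectability

print(weighted_fmea_priority(7, 8, 3.5))  # 196.0
```

With this weighting, even the maximum detectability score (5.0) can at most halve-the-scale what a full 1-10 factor would contribute, keeping severity and likelihood dominant in the ranking.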
Below are some example operational definitions of Severity, Likelihood of Occurrence, and Ability to Detect. Naturally, you can create your own definitions and scoring model. Even a Fibonacci series could work, as long as it is used consistently. (How will I know if and when the risk occurs?)
| RATING | DEGREE OF SEVERITY | LIKELIHOOD OF OCCURRENCE | DIFFICULTY OF DETECTABILITY |
| --- | --- | --- | --- |
| 1 | Adverse effect not noticeable by the customer, or insignificant | Likelihood of occurrence is remote | Certain that the potential failure will be found or prevented before reaching the next customer |
| 2 | Customer will probably experience slight irritation, or the effect will be cosmetic in nature | Low failure rate with supporting documentation | Almost certain that the potential failure will be found or prevented before reaching the next customer |
| 3 | Customer will experience irritation due to slight degradation of performance | Low failure rate without supporting documentation | Low likelihood that the potential failure will reach the next customer undetected |
| 4 | Customer dissatisfaction due to reduced performance | Occasional failures | Controls may detect or prevent the potential failure from reaching the next customer |
| 5 | Customer is made uncomfortable or their productivity is reduced by the continued degradation caused by the effect | Relatively moderate failure rate with supporting documentation | Moderate likelihood that the potential failure will reach the next customer |
| 6 | Warranty repair or significant manufacturing or assembly complaint | Moderate failure rate without supporting documentation | Controls are unlikely to detect or prevent the potential failure from reaching the next customer |
| 7 | High degree of customer dissatisfaction due to component failure without complete loss of function; productivity impacted by high scrap or rework levels | Relatively high failure rate with supporting documentation | Poor likelihood that the potential failure will be detected or prevented before reaching the next customer |
| 8 | Very high degree of dissatisfaction due to loss of function without a negative impact on safety or governmental regulations | High failure rate without supporting documentation | Very poor likelihood that the potential failure will be detected or prevented before reaching the next customer |
| 9 | Customer endangered due to the adverse effect on safe system performance, with warning before failure or violation of governmental regulations | Failure is almost certain based on warranty data or significant testing | Current controls probably will not even detect the potential failure |
| 10 | Customer endangered due to the adverse effect on safe system performance, without warning before failure or violation of governmental regulations | Assured of failure based on warranty data or significant testing | Absolute certainty that the current controls will not detect the potential failure |
When the risk response actions are taken, what impact will they have on the product or process? Do the actions taken create or pose any additional risks/hazards? Are you fixing something that might cause something else to break? This is why it is important to measure the environment again after a hazard response has been enacted to verify the process is back in control -- or, in the case of a project, to verify that the issue has been fully addressed rather than replaced with something else. (What should I do?)
Measure Twice, Cut Once
Thinking back to Ross Perot's memorable mantra in relation to true Six Sigma ideology: when a change has occurred in the process, you should always validate that the response has fixed the problem. Typically the team will re-score the risk once the change is made, capturing any reduction in impact, likelihood of occurrence, severity, or detectability. (Is it still broken?)
Measure of Success
When the response is enacted do the Voice of the Customer and the Voice of the Process approve of the outcome of the response? What does success look like in the eyes of the customer or process, and does your plan of action take that into consideration? Are there any tradeoffs that might be unacceptable? Is the treatment worse than the cure? (Is the customer satisfied?)
Usage of FMEA across the SDLC
Risk is a constant companion across all phases of the SDLC. The comments below are thoughts and ideas for you to consider in addition to what you are already doing with project risk management during the project lifecycle.
There is a philosophy that projects fail at the beginning, not at the end. Have you considered doing a pre-mortem to try to identify what hazards or failure modes you might run into before starting the project? Taking the time to do this analysis before starting the project will significantly improve your planning and estimation. Identifying and quantifying risks/hazards prior to or at initiation will help drive the scope selection. From there, you can deliver that scope with better insight into the "known unknowns," and validate the efficacy of the proposed delivery approach.
If your project is a transition from a Six Sigma project to an SDLC project, there may be many technology-related risks/hazards missing from the Six Sigma project FMEA. Don't overlook these by assuming they are covered in the original FMEA when building your project risk log.
Risks that will be accepted (included in the estimates for Schedule and Cost) and contingencies for mitigated risks will both be driven by the project risk management analysis. Focus on detectability and trigger points so you have a clear indication when a risk/hazard has actually occurred during planning or execution. The onset of a risk occurrence can be gradual or subtle. Keep in mind that the team may be in denial that trigger characteristics have been met and the risk/hazard has become an issue, especially when this realization has a significant impact on cost or schedule.
When developing requirements, look at the inherent risks associated with delivering features, such as a new technology or any feature that can have a profound impact on the product or business process. When used in conjunction with Quality Function Deployment (QFD) for requirements analysis, the additional FMEA columns in your risk matrix can be a very powerful tool to enhance your project risk management planning.
Monitor which risks have become issues, re-evaluate the prioritization of the already identified risks, and retire risks that are no longer relevant. Look at risks that are focused around transitioning to an operational state once the product has been delivered. With Agile projects, where the scope can be somewhat fluid, look at each new feature or delivery approach as a potential hazard to the project and consider it from an FMEA perspective before you accept it as part of the project scope.
Focus on which risks occurred during the earlier phases. Could the team have predicted any of the risks that weren't identified until they occurred? Look at why the current controls did not detect the risk/hazard. Which risks were predicted but did not occur, and were either accepted or had contingency set aside for mitigation? Consider why they were selected yet did not occur, and revisit the original planning assumptions that may have led to their selection.
All of these thoughts are input into the project retrospective. Also, look at new controls that need to be put in place given the new capability of the product produced by the project. Looking at those new controls will help you determine how to measure the product's success and ensure that the process remains in control after implementation.
Finally, if you document all the risks that became issues and add them to a risk knowledge base, that database can be used as a starting point for selecting candidate risks/hazards for future analogous projects.