An empirical study of introducing the Failure Mode and Effect Analysis technique to Norwegian business critical software developers
Department of Computer and Information Science,
Norwegian University of Science and Technology
Torgrim Lauritsen, Tor Stålhane
Abstract
This article describes an experiment with three Norwegian IT companies that develop business critical software. The goal of the experiment was to evaluate whether it is beneficial to use safety analysis techniques when developing business critical software. The participants in the experiment tried to identify possible failure modes from a class diagram. Half of the participants used the Failure Mode and Effect Analysis (FMEA) method, which is widely used in the development of safety critical systems, while the other participants used ad hoc brainstorming. The number of failure modes identified is used as an indicator of the effectiveness of each technique. Our experiment showed that the participants who used ad hoc brainstorming wanted a method that could help them to reveal more problems. The participants who used the FMEA method found the method useful because it was easy to understand and helped them to identify failure modes in a structured way.
In the current business climate, companies of every industry, large or small, must have some kind of data protection as part of their business continuity plan [1]. It is therefore important that software developers consider how they can reduce product risk in the software, so that their customers can avoid loss of assets, such as vital information, reputation and money.
The extensive use of computers and software has drastically improved the functionality and efficiency of many companies, but has also made software systems a significant risk factor for those companies [2]. Risk is defined as the product of an event’s consequence and its probability of occurrence or as its hazard level (severity and likelihood of an occurrence) combined with 1) the likelihood of the hazard leading to an accident and 2) hazard exposure or duration [3].
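The first definition above, risk as the product of consequence and probability, can be illustrated with a minimal sketch. The function name and the monetary figures are hypothetical and chosen only for illustration; they do not come from the experiment.

```python
def risk(consequence: float, probability: float) -> float:
    """Risk as the product of an event's consequence and its probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return consequence * probability


# Hypothetical events (assumed figures, not data from the study).
outage = risk(consequence=200_000, probability=0.01)     # rare but costly
billing_bug = risk(consequence=5_000, probability=0.25)  # frequent but cheap

# Rank the events so that the highest risk is handled first.
ranked = sorted(
    [("outage", outage), ("billing bug", billing_bug)],
    key=lambda item: item[1],
    reverse=True,
)
```

Under these assumed figures the rare outage carries the higher risk, even though the billing bug occurs far more often.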
Our starting point is to look at safety analysis techniques that are used to assess the risk associated with using the system, and to prevent accidents from happening in the system. The techniques analyse why accidents occur; that is, the mechanisms that drive the processes leading to unacceptable losses, and they determine the approaches we can take to prevent such accidents [4].
Just as for general safety, business-safety is not a characteristic of the system alone – it is a characteristic of the system’s interactions with its environment. Safety is freedom from unacceptable risk of physical injury or damage to the health of people, damage to property or to the environment [5]. Business critical software is safe when it does not fail in such a way that it causes a mishap [6] that results in losses such as damaged reputation or business interruption. The biggest threat to business is business interruption, while the second biggest is reputational risk [7].
We designed an experiment to compare the Failure Mode and Effect Analysis (FMEA) technique to ad hoc brainstorming. Our goal was to study what effect FMEA has on the process of developing business critical software. In the experiment we asked the participants to identify possible failure modes in a system based on a class diagram. The identified failure modes can be used in later development phases as additional safety requirements and as a basis for testing. They can also be used to mitigate or eliminate the failures by building in compensating measures such as redundancy, alarms or barriers that help to prevent the failures from arising.
We have to be aware of the fact that software may be highly reliable and correct [4] but still be unsafe if the software:
Unfortunately, meeting safety requirements is not as simple a matter as meeting a set of written specifications [8]. The design effort needed to make a system safe is one of a series of coordinated activities needed to assure that the final product will be safe. We believe that developers of business critical software must, in addition to satisfying the functional requirements, also add safety requirements to their solution [9, 10]; otherwise, the software will undermine the prospects for creating value and delivering profits to businesses [7].
The rest of this paper is organized as follows: First we give a short description of the FMEA technique. Thereafter we describe the experiment and the results from the experiment. Finally we conclude the paper and discuss some further work.
The Failure Mode and Effect Analysis (FMEA) is a method that is widely used for reliability analysis of systems, subsystems, and individual system components [11]. FMEA was introduced in 1954, and formalized in 1968. FMEA has been used with success for many years in safety-critical systems like avionics, trains, and nuclear plants and for the process industry. FMEA allows a systematic analysis of possible hazards and failures, and also allows us to assess the effects of these hazards and failures on the components of a system.
In object-oriented software development, the components to be analysed can, for instance, be classes and their methods [12]. A method is formally a part of the object structure, and as long as all methods of an object execute in accordance with their specification, the object has not failed. Conversely, when a method does not execute in accordance with its specification, the object has failed. The failure effect will depend on the conditions under which the method failed. For example, look at the class diagram shown in figure 1, where objects are uniquely characterized by their methods. Analysing and searching for failure modes in a class diagram using FMEA is done by filling out the FMEA table shown in table 1.
| Class / Method | Failure mode | Effects of failure | Action or barriers | Severity |
| Customer. | creditRating is too high | Customer places orders for more than he can pay for | 1) Manual check when setting or changing credit rating | High |
| | creditRating is too low | Customer is not allowed to buy as much as he wants and can pay for | | Medium |
| | No | The company can lose a lot of money selling goods to customers who will not be able to pay for them | | High |
Table 1. A FMEA table for creditRating()
In the FMEA table we start by identifying which class and which method we are going to analyse. Thereafter we try to identify the possible failure modes. In this example, for the creditRating() method, we found three failure modes: the credit rating is too high, the credit rating is too low and no credit rating is performed. In the next column we describe what effects these failure modes can have. In the column after that we try to identify possible actions and barriers (countermeasures) that prevent these failure modes from arising. Last, but not least, we need to prioritize the identified failure modes so that we know which one is the most critical and thus where to start.
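The table walk-through above can be sketched as a small data structure. This is a minimal illustration and not a tool used in the experiment: the names `Severity`, `FmeaRow` and `prioritize` are hypothetical, the `LOW` level is an assumption (table 1 only shows Medium and High), and the rows without a recorded action are left empty.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1      # assumed level; table 1 only shows Medium and High
    MEDIUM = 2
    HIGH = 3


@dataclass
class FmeaRow:
    element: str                    # class / method under analysis
    failure_mode: str
    effect: str
    severity: Severity
    actions: tuple[str, ...] = ()   # actions or barriers (countermeasures)


METHOD = "Customer.creditRating()"

rows = [
    FmeaRow(METHOD, "creditRating is too high",
            "Customer places orders for more than he can pay for",
            Severity.HIGH,
            ("Manual check when setting or changing credit rating",)),
    FmeaRow(METHOD, "creditRating is too low",
            "Customer is not allowed to buy as much as he wants and can pay for",
            Severity.MEDIUM),
    FmeaRow(METHOD, "No credit rating is performed",
            "The company can lose a lot of money selling goods to customers "
            "who will not be able to pay for them",
            Severity.HIGH),
]


def prioritize(table):
    """Most severe first; sorted() is stable, so ties keep their table order."""
    return sorted(table, key=lambda row: row.severity, reverse=True)


def without_barriers(table):
    """Rows for which no action or barrier has been recorded yet."""
    return [row for row in table if not row.actions]


ordered = prioritize(rows)
open_items = without_barriers(rows)
```

Sorting by severity gives the developers a place to start, and listing the rows without barriers shows which failure modes still lack a countermeasure.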
The FMEA method is easy to understand and easy to use. The developers will be able to identify and document possible failure modes, and to implement failure mitigation solutions based on the actions and barriers in the FMEA table, which will help to avoid asset losses and thus lead to more business-safe software. In design, FMEA serves two roles. Firstly, it helps us to identify possible hazards and failure modes associated with the system. Secondly, it helps to verify that all failure modes leading to hazardous events or mishaps are mitigated by the design modifications made to the system [7].
The most important part of the FMEA process is a systematic walk-through of components to identify possible failure modes such as “fails to operate on demand” or “calculates a wrong result”. Since each failure can produce a different effect, depending on the level at which it is detected, it is important to analyse each method in a class. Using FMEA will not make it cheaper to develop software, at least not in the short term. Applying FMEA to increase a product’s business-safety must be viewed as an investment. The return on investment will be software products of higher quality, which in turn will lead to more business from existing customers and new business from new customers. In addition, we will have less need for fire-fighting. The workload will be larger at the beginning of the project, but this extra work will reduce the rework needed later, since latent hazards are identified and the developers can use their new knowledge to limit, reduce or eliminate them.
We wanted to evaluate the effect FMEA could have in a business critical software development environment. Our experiment was designed as an exploratory and qualitative study. The goal of the experiment was to see if the participants would
The experiment was executed during June 2005. We executed the experiment in three Norwegian IT companies. Two of the companies are IT consultancies, and the third is a privately held company that has its own software development department. In each company we used four software developers that have worked in the IT industry for two to thirty years.
All of the participants are familiar with the Rational Unified Process (RUP), and most of them work in accordance with that methodology in their daily work. Only one of them was familiar with agile methods, and uses test driven development in his daily work.
We have the following research questions:
RQ1: Did the FMEA help the developers to find more failure modes?
RQ2: Did the developers find FMEA useful?
RQ3: Did the developers believe that they would profit from using FMEA?
RQ4: Did the developers want to involve the customers in the FMEA work?
RQ1 can give us an indication of how useful FMEA is for identifying possible hazards and problems compared to the techniques the developers use today. RQ2 gives us a subjective answer as to how effective the participants felt FMEA was, and we will compare these answers to possible issues the ad hoc brainstorming group missed in the experiment. We know that introducing a new technique like FMEA into software development will lead to an extra learning effort and more work. In RQ3 we want to see whether the FMEA participants would use FMEA despite the increased work effort. XP has had success with keeping the customer on-site the whole time; since the customers have the domain knowledge, RQ4 asks whether they could help by participating in the failure mode analysis.
We offered a short introduction into safety analysis, and a copy of this article as compensation to the companies involved. We emphasized that all answers and other information would be treated as strictly confidential.
In each company we started the experiment by dividing the four software developers into two groups - later called A and B - with two persons in each group. We gave group A an introduction to safety analysis of design diagrams, while group B – the FMEA group – filled in a background questionnaire. When group A had finished the introduction and group B had filled in their questionnaire, the groups switched tasks. Group B got an introduction to the FMEA technique and to the importance of considering safety issues during software development, while group A filled in the background questionnaire.
In both cases we showed the participants the class diagram in figure 1 and guided them through an example of the analysis based on the creditRating() method in the Customer class.
Group A received a list of possible failure modes and consequences for the creditRating() method:
In addition, we mentioned possible countermeasures such as manual checks and obtaining credit information from external sources. Group B got the FMEA table shown in table 1, together with a detailed walkthrough of the table.
After the introduction and the completion of the background questionnaire, we asked both groups to identify possible failure modes and consequences when customers purchase goods from a company based on the class diagram in figure 1.
The experiment lasted for approximately 1.5 hours at each company. Filling in the questionnaires took approximately 30 minutes, while the experiment itself took 30 minutes. The remaining 30 minutes were spent on introducing the method and discussing the participants’ experiences from the experiment.
We coded the results from the experiment, and the failure mode categories found by each group are shown in table 2. The failure modes can, if used as a basis for additional safety requirements, help to reduce the hazard of using the software.
| Category | FMEA – | Ad hoc – |
| Wrong order and customer information | 6 | 8 |
| The article does not exist or is temporarily sold out | 4 | 3 |
| Incorrect execution of an order | 1 | 2 |
| False order | 2 | 1 |
| Network connection does not operate normally | 1 | 2 |
| Generate an order only once | | 1 |
| Information on the wrong path | 1 | |
| No credit validation | | 1 |
| Price changed | | 1 |
Table 2. Failure modes
We used a paired t-test to determine whether the two techniques are likely to have the same mean number of identified failure modes. The paired t-test (p-value one tail = 0.27) shows no evidence that the FMEA group identified more failure modes than the ad hoc brainstorming group. Our conclusion for the first research question (RQ1) is therefore that we found no significant difference between the FMEA technique and the ad hoc method when it comes to the number of failure modes identified.
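The mechanics of a paired t-test can be sketched with the counts in table 2. This is a minimal illustration under two assumptions that the text does not confirm: the pairing is per failure mode category, and an empty cell means zero failure modes. The function name `paired_t` is hypothetical. The sketch computes only the t statistic and the degrees of freedom, and it is not claimed to reproduce the reported p-value.

```python
from math import sqrt
from statistics import mean, stdev

# Failure modes per category, in the row order of table 2.
# Assumption: an empty cell is counted as zero.
fmea = [6, 4, 1, 2, 1, 0, 1, 0, 0]
ad_hoc = [8, 3, 2, 1, 2, 1, 0, 1, 1]


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for two paired samples."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Mean difference divided by its standard error (sample stdev / sqrt(n)).
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


t_stat, dof = paired_t(fmea, ad_hoc)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

Under these assumptions the statistic is negative, because the ad hoc group listed 19 failure modes against 15 for the FMEA group. A p-value would follow from the t distribution with the given degrees of freedom.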
Eleven of the twelve participants considered their work to be business critical. In table 3 we see their answers to the question “How do you analyze and reduce failures when you develop software systems?”
| Firm 1 | Firm 2 | Firm 3 |
| Unit tests | Unit tests (NUnit / JUnit) (3) | Unit test (2) |
| Hire testers | Integration tests (2) | Integration test |
| System tests | System tests (2) | Acceptance test |
| Use cases - test cases | Regression tests | Communication with the customer |
| Describes tests | Design work with UML (2) | Create and update a test plan |
| Testing using Mercury | Testing | Testing, static and dynamic |
| A running system | Test planning – Test Director from Mercury | Write code that works immediately |
| Follow development techniques | Factory Acceptance Testing (FAT) | Continuous consideration of failure situations |
| Divide into packages | Site acceptance test (SAT) | Map all problems in the specification phase |
| Verification | Flow chart | Code reading |
| | Use cases - deviation | Software Metrics |
| | | Follow development techniques |
| | | Testing to reveal failures (2) |
| | | Pilot period |
| | | Analysis of failure situations |
Table 3. How do you analyze and reduce failures in your software systems?
We see that the participants do some preventive work, e.g. analysis and continuous consideration of failure situations, but their main focus is on testing: 19 of the 36 answers mention testing as a way to analyze and reduce failures in their software systems. It is, in our opinion, too late to consider safety when the implementation is finished. We asked the participants if they were familiar with any safety analysis techniques such as Hazard and Operability analysis (HazOp) or FMEA, but got only two positive answers. Only one of these participants said that he actively uses safety analysis techniques in his work, but he did not specify which ones.
After the experiment, we gave the two groups (A and B) tailored questionnaires. The answers from the ad hoc group are presented in section 4.3, while the results from the FMEA group are presented in section 4.4. In section 4.5 we discuss possible threats to the validity of our conclusions.
Five out of six participants wanted a deeper and broader description of the system they should analyze in the experiment (use cases, sequence diagrams and more background information of what technology is going to be used). One person said that it could have helped if he had been given a checklist related to the addressed issues. In their daily work, the participants use use cases instead of UML diagrams to identify possible problems.
Two of the groups were unsure if they had found all possible problems, while the last group were sure that they had found all problems within the problem addressed in the experiment. The rest of the results from the ad hoc group are summarized in table 4.
| | Firm 1 | Firm 2 | Firm 3 |
| How did you execute the experiment? | Created a use case. Connected failures to use case. | Used “list method” to identify failure situations and consequences. | Divided problem into phases. Then found failure sources and countermeasures. |
| How could you have done a better safety analysis? | Missed system and technology description. Missed sequence diagram, shows consequences of failures. | Missed use case with background information. | Failure situations must be identified before the class diagram is made. Class diagram too fuzzy |
| Do you use UML diagrams to identify possible problems? | Sometimes. Use cases are our basis, usually. | UML diagrams not our starting point maybe sequence diagrams. | Textual use cases. UML diagrams as a supplement. |
| Do you think you covered all possible problems? | Unsure, but maybe on this superior level. | Unsure, too little time for the analysis. | Yes, within the limited problem addressed. |
Table 4. What is your opinion of your approach to this problem?
As we see from the first question, the three groups did the analysis in three different ways. None of the groups was satisfied with doing the analysis from just a simple class diagram. They seldom use UML diagrams to analyse a problem; they use use cases instead. Because of this they were not sure if they had covered all possible problems in this experiment.
We asked the participants if they thought a systematic method would be useful when analyzing UML diagrams, and all participants answered yes. Their reasons are summed up in table 5:
| | Firm 1 | Firm 2 | Firm 3 |
| How can a systematic method be helpful and result in a better software system? | All or at least more problems should be revealed. | Numerous failure modes, countermeasures must be identified as early as possible. | Correct text and graphical models as basis. Ensure all problems are analyzed. |
| How can you use the results from a systematic method later in the development? | Be instructive. Should ensure a better software solution. | Input to design and test basis. | The form of results simple to use in further development. |
| How would you earn back the extra cost for using a systematic method? | Early exposing of problems. Reduced development time and management. Analysis continuous during development. | Early revealing of failures lead to cost savings. Cheaper to correct in design than in implementation and test phases. | After-work will be reduced, cheaper solution. |
Table 5. In what way do you think a systematic method would help you?
All participants felt that they needed a systematic method that could help them to cover all possible problems and failures, and also give them an easy way to identify and document countermeasures. The method should be able to handle both textual and graphical descriptions. The participants knew that exposing problems early is beneficial: it is cheaper to correct failures during analysis and design than during implementation and testing, and this will reduce development time.
When we asked the participants if they wanted to involve the customers in the analysis of the UML diagrams, they were positive, on the condition that the customers have the necessary skills to understand the design diagrams. In the developers’ opinion, the customers have the required know-how of the problem area and do in general have a better understanding of how failures could arise. They might also have ideas on how failures should be dealt with.
Three participants were sure that they had found more problems and failure modes with the FMEA method than without it. The other three participants, however, were not totally sure that they had found more problems and failure modes with the FMEA method. When we asked them if they found the FMEA technique helpful, we got the answers summed up in table 6.
All participants felt that the FMEA method helped them to structure the failure identification process. The FMEA method was easy to understand based on the problem we showed them during the experiment. The answer to our second research question (RQ2) is thus yes, the software developers found the FMEA method useful.
| | Firm 1 | Firm 2 | Firm 3 |
| What is your opinion of the FMEA method? | Easy to understand. Helps to identify failures and reduce failure effect. Improve our work process and final software product. | More failure modes will be handled. Document failure modes. Systematic. Improve design and testing. Easy to understand. | Early identification of failure modes and problems. Collaborative work during analysis. Simple structure. |
| Could FMEA be used in other phases? Do you see other range of use of the FMEA method? | Used on high level design and on design review of business critical areas like transaction handling. | In design work and during test and fault correction. On state charts and use cases. | In design phase, before the class diagram is made. During the whole development process. Other diagrams, such as database diagrams. |
| How would you use the results from the FMEA method further in the development? | Further used in design and as a basis and support for tests. | Verify that all failure modes are safeguarded against during code reading and testing. | Implement in the final software product. Use as a check list during pilot- and acceptance testing. |
| Will it be worth using the FMEA method even if it gives you more development effort and extra cost? | Yes, if the results are used actively in the implementation. | Earned back by taking out most of the future failures in the beginning of the development process. | Absolutely worth in bigger projects, especially if the person who does the modeling and the failure assessment is not the same person that does the implementation. |
Table 6. What is your opinion of the FMEA method?
The participants had a few improvement suggestions for the FMEA table. Firm 2 wanted to switch two columns, and put the “Action or barrier” column at the end of the table. Firm 3 thought it was easier to identify the failure effect, rather than describe the failure mode. This could either be because they are relatively new to the concept of failure mode, or that it is easier to identify and describe the effect of the failure rather than the cause.
When we asked the participants whether the FMEA method would help them to be more attentive to possible problems and failure modes, they all answered “Yes, definitely. It will be an important addition to our previous experience within similar areas”. They also believed that the FMEA method might reveal problems and failure modes that they otherwise would have discovered much later.
The participants clearly saw the advantages of using the FMEA technique to improve their work process and reduce project and product risk. In their opinion, FMEA will give them better control of the development process and a better basis for software testing. The FMEA method gave them a possibility to develop “safer software”, because the software will be able to handle more failure modes than before, based on the results from the analysis. They identified the more structured documentation of the failure mode results as the most important contribution.
Thus, the answer to our third research question (RQ3) is yes, the FMEA will help the developers to develop better and safer software systems. The analysis helps the developers to use their previous experience in a systematic way, and to identify problems and failure modes earlier than when using testing alone. It will also be a lot cheaper than discovering the hazard or failure during testing.
The participants want to use the FMEA method during design, implementation and fault correction, and want to update the FMEA tables continuously during the project. The basis for the analysis could be state charts, use cases, other (UML) diagrams and database diagrams. The results from the analysis should be used as additional requirements for further development and as a basis for testing, in addition to a verification tool during code reading, to verify that all failure modes are handled.
Compared to other risk and failure identification methods, the participants did not consider FMEA more labor intensive or expensive. The extra cost will be earned back by removing failures at an early stage, before they become problems. One participant found the FMEA especially efficient when one person does the modeling while another person does the implementation.
The participants’ opinion was that it would be smart to involve the customers, due to their domain knowledge and experience from similar software systems. This makes the customers important contributors to the failure modes and barrier identification process. Problems will be revealed in the requirements phase, and this will lead to better control over the development work. The FMEA analysis can also be a helpful supplement when detecting failure modes during demonstration of the GUI.
A possible problem is that some customers concentrate on irrelevant details. The FMEA method might be too hard for a typical customer to handle, and the class diagram might be too difficult for them to read - use cases may be better. The FMEA analysis might be performed on a level that the customers will not be comfortable with and this might cause a delay in the development process. Thus, the answer to our last research question (RQ4) is a conditional yes.
In this experiment we used twelve IT professionals, and even though their experience varied from two to thirty years, we think that our sample reflects the real IT community quite well. Compared to a similar experiment with students, most of whom have little practical experience, this gives a more reliable result.
The participants’ knowledge of the domain varied, but this does not seem to have influenced the results from the experiment noticeably. It would have been better to let the participants do an analysis on some of their own design documents. The grouping of the participants was done randomly. We tried to design the introduction, the experiment and the questionnaires as neutrally as we could, but we probably influenced the participants with our focus on safety.
When coding the experiment results and analysing the answers from the questionnaires, we might have misinterpreted some of the results and answers the participants wrote down, but we have done our best to interpret them.
We know that it is a hard task for the participants to do a brainstorming process on unfamiliar design documents, and also to use a technique that they have only just learned. This should, however, not favour one method over the other.
We think that even though this is a small sample, it is from a realistic environment and will thus give us a good indication that it is wise to use safety analysis techniques during the initial phase of a software development project.
Since there was no significant difference between the numbers of failure modes found by those who used the FMEA technique and those who used an ad hoc technique (RQ1), we must look at the other research questions to conclude. In section 4.3 we saw that the ad hoc group wanted a tool that would help them reveal more problems than their current method does. In section 4.4 we saw that the software developers who used the FMEA technique found the technique useful because it is easy to understand and it helps them to identify failure modes in a structured way early in the development process (RQ2).
The resulting software will be able to handle more failure modes than before, based on the results from the FMEA. This leads to an improved development process in addition to reduced project and product risk. Since the FMEA technique offers better control of the development process and gives a better basis for testing, the software developers will profit from using FMEA (RQ3).
All participants agreed that failure identification is a continuous process that should be carried out during the whole development process. FMEA, together with the customer’s domain knowledge, will help the developers to develop better and safer software systems. The customers have the know-how of the problem area and in general have a better understanding of how failures can arise and how best to deal with them. The failure mode analysis table can be updated continuously, and the customer should be closely involved all the way (RQ4). This is closely related to XP’s on-site customer principle.
FMEA helps software developers to develop software that will be more dependable and more resilient than software developed without this analysis. The customers will then be able to trust the software they are using as a supporting tool for their business.
In our next experiment we will give the participants more time to do the experiment and supply them with a use case in addition to a sequence diagram, see sections 4.3 and 4.4. We will also further explore the observation that the developers considered it easier to identify the failure effect than to describe the failure mode.
[1] J. Reuvid, editor, Managing business risk: a practical guide to protecting your business, Kogan Page, ISBN: 074944228X, 2005.
[2] http://www.dnv.com/consulting/systemsandsoftware/buscriticalss/index.asp
[3] N. Leveson, Safeware: System Safety and Computers, Addison-Wesley, ISBN: 0201119722, 1995.
[4] N. Leveson, System Safety Engineering, http://sunnyday.mit.edu/book2.pdf, 2006.
[5] International Electrotechnical Commission, Functional safety of Electrical / Electronic / Programmable Electronic Safety-Related Systems, 1st edition, International Standard IEC 61508, Parts 1-7, 1998-2000.
[6] Department of Defense, Standard practice for system safety, MIL-STD-882D
[7] A. Jolly, consultant editor, Managing business risk: a practical guide to protecting your business, Kogan Page, ISBN: 0749440813, 2003
[8] W. R. Dunn, Practical Design of Safety-Critical Computer Systems, Reliability Press, ISBN: 0971752702, 2002
[9] I. Sommerville, “Extreme Programming for Critical Systems?”, guest lecture at the 6th International Conference on eXtreme Programming and Agile Processes in Software Engineering, http://www.xp2005.org/speakmtrl/XPForCritSys.ppt, 2005
[10] J. A. Børretzen, et al., “Safety activities during early software project phases”, Norsk Informatikk konferanse, Stavanger Forum, Stavanger, 2004
[11] Y. Y. Haimes, Risk modelling, assessment and management, Wiley Series, ISBN: 0471480487, 2004
[12] M. Hecht and H. Hecht, “FMEA as a Validation Tool for Hardware and Software systems”, Proc. ISA Analysis Div 2002, February 2002
[13] M. Fowler, K. Scott, UML distilled, second edition, Addison-Wesley, ISBN: 020165783X, 2000
Source: http://www.idi.ntnu.no/grupper/su/publ/torgrim/~$uritsen-fmea-npp-nov06.doc