Applying Manufacturing Testing Techniques to Software Development (FMEA and HALT)

August 2020

Software engineering is a relatively young form of engineering (compared to manufacturing or construction), and it is still developing processes and methods to improve and validate quality.

I guess you could argue that software engineering IS a form of manufacturing, except the product is digital.

Have you ever wondered how new physical products are tested, especially complicated ones? Designing a physical product is hard - you have many moving parts that need to perform per spec, and the fallout from poor reliability is significant.

And once engineers address the defects and lock down the design, production run issues can cause unexpected failures.

The manufacturing process uses a variety of testing methods and techniques to assist with this complexity.

HALT (Highly Accelerated Life Testing)

HALT improves product design by locating the breaking point.

A product (or electronic component) is placed in a test chamber that applies increasingly severe temperature swings, vibration, humidity, etc. - until failure occurs. The cause of failure is then analyzed, and design changes are suggested.

If no failure occurs, the HALT run itself has failed, because the whole point of this discovery technique is to create conditions that cause complete loss of function.

HALT is very effective in reducing time to market and increasing profit potential. By discovering weak links early, it allows sufficient time to address design problems. Fixing those problems before users report early failures, which trigger returns and warranty service, improves the customer experience while keeping expenses manageable.

HALT analogy in Software Engineering

Your architecture will fail, given the right conditions. Do you know the parameters of this failure?

Would throwing more resources into your solution solve the problem - or would you have to redesign and rewrite everything?

The software analogues of HALT are stress testing and load testing.

Stress testing validates system recoverability - for example, would the application reconnect if the database is restarted or the network connection drops?
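A recoverability check like this can be sketched in a few lines of Python. Everything here is hypothetical: `FlakyDatabase` is a stand-in for a real database that is unavailable for a while, and `query_with_retry` is one possible reconnect strategy (retry with exponential backoff), not the only one.

```python
import time


class FlakyDatabase:
    """Hypothetical stand-in for a database that is down for the first few calls."""

    def __init__(self, failures_before_recovery):
        self.remaining_failures = failures_before_recovery

    def query(self, sql):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("database unavailable")
        return "ok"


def query_with_retry(db, sql, attempts=5, base_delay=0.01):
    """Retry a query with exponential backoff; re-raise once attempts run out."""
    for attempt in range(attempts):
        try:
            return db.query(sql)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


# Recoverability check: the simulated outage ends after three failed calls.
assert query_with_retry(FlakyDatabase(3), "SELECT 1") == "ok"
```

The interesting question is what happens when the outage lasts longer than the retry budget: in this sketch the error is re-raised, and the caller has to decide what to do with it.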

Load testing checks how the system performs under unusually high loads - flooding the API with hundreds of requests per second, trying to process a huge feed file, keeping the queue running for days.
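As a rough illustration, the sketch below fires concurrent calls at a handler and summarizes the latencies. The `handle_request` function is a made-up placeholder for a real API endpoint, and the request counts are arbitrary; a real load test would normally target a deployed system with a dedicated tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(payload):
    """Hypothetical request handler standing in for a real API endpoint."""
    time.sleep(0.001)  # simulate a little work
    return len(payload)


def run_load(handler, total_requests=200, concurrency=20):
    """Call the handler concurrently and collect per-request latency in seconds."""

    def timed_call(i):
        start = time.perf_counter()
        handler(f"request-{i}")
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(total_requests)))

    return {
        "requests": len(latencies),
        "median_s": latencies[len(latencies) // 2],
        "worst_s": latencies[-1],
    }


report = run_load(handle_request)
print(report)
```

Raising `total_requests` and `concurrency` until the worst-case latency degrades is the software version of turning up the dial in the test chamber.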

Both of these non-functional testing methods are excellent at discovering defects in already produced software - but at the very least, you need a functional prototype. We know that the most significant cost savings in the software project budget come from making the majority of changes in the design phase. Testing early and not waiting until the release week are excellent practices, but is it possible to find faults in a product during the design stage?

FMEA

HALT is a cost-saving method - but it is expensive by itself. HALT requires access to test chambers and results in destroyed components and products. A specialized skill set is necessary to develop, run, and interpret tests. More importantly, you need to have something already built in order to test it.

In contrast, FMEA (failure mode and effects analysis) takes a few days of your cross-functional team's time and can be scoped down to design only.

FMEA is a thought exercise, typically performed before the product exists in physical form. If weaknesses are discovered, it's not too late to go back and adjust.

FMEA is a guided process, so a moderator is required. In the beginning, all possible product functions are identified, along with their inputs and outputs. Then the team begins to brainstorm, writing down the reasons each of these functions can fail. The findings (which are called failure modes) are recorded and rated.

Not all faults can or should be fixed. Product development is an investment, and some failures may not be worth addressing. There is a simple formula that can be used for prioritization: rate each failure mode on the following three rankings, each on a scale from 1 to 10:

  • The severity of the failure, with ten meaning that catastrophic events may occur, resulting in loss of life or significant property damage.
  • Occurrence probability, with ten indicating that the failure will almost certainly occur.
  • Detection ranking, where a score of one shows that the failure mode can be detected or prevented by existing controls. For example, the new IoT sensor on the Overlook Hotel's boiler will alert of increasing pressure before the major blow-up.

Multiplying these values produces a risk priority number, used to rank the failure modes.
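The arithmetic is simple enough to sketch in Python. The `FailureMode` class and its field names below are illustrative assumptions, not part of any FMEA standard or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical FMEA worksheet; every ranking is on a 1-10 scale."""

    description: str
    severity: int    # 10 = catastrophic
    occurrence: int  # 10 = will almost certainly happen
    detection: int   # 1 = existing controls detect or prevent it

    def __post_init__(self):
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank(modes):
    """Return failure modes ordered from highest to lowest risk priority number."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)
```

With three rankings from 1 to 10, the risk priority number always falls between 1 and 1000. Note that a low detection score lowers the result: a failure that existing controls already catch drops down the list even when it is severe.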

An additional ranking may capture the importance of the function itself - if a feature can cause catastrophic failure and is hard to detect, but is not of high importance, it may make more sense to remove it.

The result of FMEA is a list of actions addressing design or processes that can do one or more of the following:

  • Reduce the probability of failure
  • Lower severity
  • Improve detection and prevention

Using FMEA in Software Development

There are two opposing views: either FMEA is terrible for software development and simply does not work, or FMEA can significantly improve software quality.

I am on the "FMEA is great and should be adopted more" team.

  • It helps to walk through your software design and validate non-functional requirements from the "how can this break?" perspective, as opposed to the "how should this work?" perspective. Looking back at the most dramatic production software failures that I've witnessed, it's clear that the code always worked as designed and was tested well. Until one day, some event from the list of "this will never happen" happened, causing significant fallout.
  • It improves communication between the teams and clears up any incorrect assumptions regarding functionality that is being developed.
  • It makes it painfully obvious when the proposed solution is terrible, because of the overabundance of failure modes with high risk priority numbers.
  • It produces better results than design reviews, which concentrate on specific modules. FMEA forces the team to take a high-level view of the design and architecture.
  • The robustness of the design can be significantly improved even if only a few top failure modes are addressed.

Example of FMEA in Software Engineering

How would this process look when applied to software engineering?

Let's imagine you have a logging library. Its functions are:

  1. Write log message at the specified level to disk
  2. Roll over when logs are larger than 2MB
  3. Format messages in easily parseable format

The first function can fail if (the lists here are kept short for simplicity; more reasons could be added):

  • Disk is full
  • Inadequate permissions
  • Given logging level not recognized

The second function can fail if:

  • Disk is full
  • Revoked permissions to create a new file
  • The current logging file is in use by a different thread
  • Desired log file name already exists

The third function can fail if:

  • The message passed to the logger contains unexpected characters

Notice that different functions can share the same failure modes.
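One way to spot shared failure modes is to record the worksheet as data and invert it. This is only a sketch: the dictionary layout and the shortened function names are my own, while the failure modes are the ones listed above.

```python
from collections import defaultdict

# Function -> failure modes, as brainstormed above (function names shortened).
failure_modes_by_function = {
    "Write log message to disk": [
        "Disk is full",
        "Inadequate permissions",
        "Given logging level not recognized",
    ],
    "Roll over at 2MB": [
        "Disk is full",
        "Revoked permissions to create a new file",
        "The current logging file is in use by a different thread",
        "Desired log file name already exists",
    ],
    "Format messages": [
        "The message passed to the logger contains unexpected characters",
    ],
}

# Invert the mapping: failure mode -> functions it affects.
functions_by_mode = defaultdict(list)
for function, modes in failure_modes_by_function.items():
    for mode in modes:
        functions_by_mode[mode].append(function)

# Keep only the failure modes that affect more than one function.
shared = {mode: funcs for mode, funcs in functions_by_mode.items() if len(funcs) > 1}
print(shared)
```

A failure mode that shows up under several functions is a good candidate for a single, shared control.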

While ranking can be subjective, let's assign the "Disk is full" failure mode a severity ranking of 9, an occurrence ranking of 7, and a detection ranking of 1, because we've already thought of this condition and added logic to delete old log files while showing a warning to the user. That gives a risk priority number of 9 × 7 × 1 = 63.

On the other hand, "Current logging file is in use" would have a detection ranking of 8, because none of the existing controls can catch this condition. Unless its severity and occurrence rankings are much lower, this failure mode ends up with the higher risk priority number and needs to be addressed first.

The result of FMEA is the list of actions, along with responsible parties and the date by which each action should be completed.
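To make the comparison concrete, here is a small sketch. The rankings for "Disk is full" and the detection ranking of 8 come from the example; the severity (5) and occurrence (3) for the second failure mode are values I assumed purely for illustration.

```python
# Rankings are (severity, occurrence, detection), each on a 1-10 scale.
failure_modes = {
    # All three rankings come from the example in the text.
    "Disk is full": (9, 7, 1),
    # Only detection=8 comes from the text; severity=5 and occurrence=3 are ASSUMED.
    "Current logging file is in use": (5, 3, 8),
}


def rpn(severity, occurrence, detection):
    """Risk priority number: the product of the three rankings."""
    return severity * occurrence * detection


# Highest risk priority number first.
ranked = sorted(failure_modes.items(), key=lambda item: rpn(*item[1]), reverse=True)

for name, rankings in ranked:
    print(f"{rpn(*rankings):4d}  {name}")
```

Even with these modest assumed values, the undetectable failure mode scores 120 against 63 for the full disk, so the existing control on disk space is what pushes that failure down the list.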

But wait, there is more!

FMEA is not just for Software Design.

By changing the scope, this early analysis can be applied to all phases of the SDLC. The best software products start at requirements - try applying FMEA principles to requirements analysis and see if hidden defects can be located. Perform a security FMEA to catch the most serious defects. Analyze your implementation plan and deployment architecture.

The process can be as compact or as complex as you need it to be and still deliver measurable quality improvements. Give it a try for your next software manufacturing project.