Silent Death – Protecting Against Undetected Failures

By Daniel Birket
Birket Engineering, Inc., December 2001

Synopsis

Undetected failures permit us to blindly trust safety systems that no longer work and can lead to situations that we thought couldn’t happen. This document explores strategies to identify and prevent Silent Death through automatic and manual validation.

What is Silent Death?

Silent Death is the undetected failure of a device that we rely upon to ensure safety. The device may be anything that protects us against some risk: a sensor, brake, motor, mechanical interlock, etc. If a device fails without anyone noticing, it has an undetected failure.

Why is Undetected Failure Important?

If the thing stops working and nobody can tell the difference – why do we need it?

Remember that the device was put there to guard against some significant risk. A significant risk will eventually become a very significant reality. Ask Murphy. But now the gadget that should have handled the problem didn’t work when we needed it… and we didn’t have a clue. We didn’t even think the resulting mess was possible. Unsuspected Failure might be a better description.

A Silent Death Scenario

Imagine a simple ride system that starts using an electric motor and stops using a pneumatic brake. Bad things could happen if we can’t stop the ride promptly, so we want to be sure that we have air pressure at the brake before we run the motor. Simple: put a pressure switch on the air supply line and interlock it with the motor. Now we can’t start the ride unless there is enough air pressure to stop it again. We can stop worrying about air pressure.

Not really. Now we have to worry about the pressure switch – and we still have to worry about air pressure.

Over a season or two, the pressure switch quietly rusts in place, freezing the switch contacts where they are – ON. Nothing visibly changes. Nobody notices. The ride works fine and the switch even agrees with the actual pressure in the air line. Everything is fine until one day when a backhoe hits the air line, or maybe the compressor’s circuit breaker trips.

Now the air pressure for our brakes is gone, but we don’t know it because the pressure switch rusted shut last winter. Even worse, we’re sure that we’re protected from that problem – so go ahead and press the [Start] button. We can always trust the system to keep us out of trouble… Hey! Why won’t this thing stop?
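The scenario can be sketched in a few lines of Python. This is only an illustration, not a real control program: the `PressureSwitch` class, the `stuck_at` attribute, and the 60 psi trip point are all made up for the example. It shows how an interlock that simply trusts the switch gives the same answer for "air pressure is fine" and "switch rusted ON with no air".

```python
class PressureSwitch:
    """Hypothetical model of a pressure switch that can seize in place."""

    def __init__(self, threshold_psi=60.0):
        self.threshold_psi = threshold_psi  # assumed trip point, for illustration
        self.stuck_at = None  # None = healthy; True/False = contacts seized there

    def read(self, actual_psi):
        if self.stuck_at is not None:
            return self.stuck_at  # seized contacts ignore the real pressure
        return actual_psi >= self.threshold_psi


def start_permitted(switch, actual_psi):
    """Naive interlock: believe whatever the switch reports."""
    return switch.read(actual_psi)


switch = PressureSwitch()
healthy_with_air = start_permitted(switch, 90.0)   # start allowed
healthy_without_air = start_permitted(switch, 0.0)  # start blocked, as designed

switch.stuck_at = True  # the switch rusts in the ON position over the winter
stuck_without_air = start_permitted(switch, 0.0)   # start allowed with no brakes
```

Nothing in `start_permitted` can tell the last case from the first, which is exactly what makes the failure silent.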

Expect the Unexpected

The first step in solving any problem is to realize that we have one. In order to eliminate possible undetected failures we must first identify where they may occur. This can be a big task because just about everything fails eventually. We can trim the job a bit if we only worry about things that might impact safety.

One accepted system for finding out what can go wrong is an F.M.E.A.: Failure Mode & Effects Analysis. (Not F.E.M.A.; they clean up after disasters. An F.M.E.A. prevents them.) Although this subject can get very technical, it is still useful in its basic form. Even a casual safety analysis is better than none – just don’t assume that you’ll find all the problems by yourself.

This kind of safety analysis starts with making a list of the components in our system and sitting down with a pencil to think about each piece.

The questions to ask about each component are:

  1. How might this thing fail? (Failure Mode) Our pressure switch above might fail stuck in the ON or OFF position.
  2. What will happen when it fails in each mode? (Effect) It’s very helpful to categorize these by severity: “possible injury”, “safety loss”, “downtime”, “expensive”, or “complaint”. Effects of “bad appearance” or “no effect” don’t need the same attention. Our stuck ON pressure switch above goes in the “safety loss” category because it no longer protects us against an air pressure failure. If it were stuck OFF, we’d use the “downtime” category because the ride wouldn’t start.
  3. How likely is it to fail? This can be tough to figure out, but categories will help again. We can count on “Human Error”. (Don’t forget the human “component” of the system.) “Factory Defect” and “Fatigue” will nail us regularly. Murphy defined the criterion: “Anything that can go wrong, will.” If we can think of a way it can break without an absurd concatenation of improbable events, we should consider it.
  4. How do we detect the failure? Here is our first chance to make a difference. If we choose a good way to detect the failure, we’ll know when we need to do something. See below.
  5. How do we correct the failure? The simple (and usual) way to keep our system safe is to “correct” the failure by shutting everything down: E-Stop. If we want to enhance reliability too, we must find a way to safely continue without the component until we can fix it.

This component-by-component analysis of our system may turn up a variety of interesting things that we had not realized before. But, since our subject is undetected failure, our interest now is any failure with an effect of “safety loss” and a detection type of “undetected”.
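The five questions map naturally onto one row of a worksheet. Here is a minimal sketch, assuming a simple in-memory list; the `FailureMode` fields and the `silent_deaths` helper are names invented for this example, and the likelihood, detection, and correction entries are illustrative guesses rather than results of a real analysis. Only the two pressure switch failure modes and their effect categories come from the discussion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str
    mode: str         # 1. how might it fail?
    effect: str       # 2. severity category
    likelihood: str   # 3. how likely / why (illustrative)
    detection: str    # 4. how we find out; "undetected" if we don't
    correction: str   # 5. what we do about it


worksheet = [
    FailureMode("pressure switch", "stuck ON", "safety loss",
                "corrosion", "undetected", "none"),
    FailureMode("pressure switch", "stuck OFF", "downtime",
                "corrosion", "ride will not start", "replace switch"),
]


def silent_deaths(rows):
    """Pick out the rows this article cares about:
    an effect of "safety loss" with a detection type of "undetected"."""
    return [r for r in rows
            if r.effect == "safety loss" and r.detection == "undetected"]


for row in silent_deaths(worksheet):
    print(f"{row.component}: {row.mode}")
```

A pencil and paper do the same job. The point is only that every row gets an honest answer in the detection column.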

Solve Silent Death

Now that we know where to expect undetected failures, what can we do about them? There are just two ways to eliminate undetected failures: detect the failure or prevent the failure.

Prevent the Failure

There is a hard way and an easy way to prevent the failure of a component:

  • The hard way is to make the component 100% reliable. This is really tough, and most designs settle for something like a one-in-a-billion chance of failure. Even if we use two components (redundancy), there is still the chance that a shared event will wipe out both: lightning, corrosion, power failure. Redundancy doesn’t solve everything and it always introduces new and more complex problems.
  • The easy way is to eliminate the component. Simplify the system. The biggest problem with undetected failure is that we think we’re protected when we’re not. If we can do without the gadget we won’t depend on something that will let us down. The guy in the rental car with the broken gas gauge will run out of gas. The guy on the motorcycle without a gas gauge will never let his tank go dry – he’ll unscrew the cap and look.

Detect the Undetected

There are two major ways to detect a failure: automatically, using a sensor (and maybe a control computer), or manually, using our hands and eyes. Automatic detection saves us from trying to watch everything at once, but brings complexity that is itself susceptible to failure. Detecting the failure of the sensor that we installed to detect a failure is the problem of validation. Our pressure switch above got us into trouble because we didn’t confirm that it was still valid – that is, still reporting meaningful information.

There are three major ways to automatically validate a sensor (or other component); manual validation makes four.

  1. Continuous Automatic Validation: This is the best form of validation and usually involves buying two sensors instead of one. By comparing the two sensors we will know instantly when one has failed. (But like a man with two watches, we generally won’t know which one is right.)
  2. Cyclic Automatic Validation: The next best way is to check our sensor every time (or cycle) that we use it. If we have a sensor to detect when the load gates are ajar, we should check that the sensor notices each time we open and close the gates. (Note: it can be important to detect a failure prior to each cycle of operation instead of during; otherwise we may be too late to correct it.)
  3. Periodic Automatic Validation: The remaining automatic means of validating a sensor is simply to check it on some regular schedule: daily, weekly, etc. We could have automatically checked our pressure switch above by cutting off the air each morning.
  4. Periodic Manual Validation (Inspection): Here’s the method so familiar to ride maintenance people – open it up and look – every day or month or season. This method is only as good as the people, time, and money we devote to it.
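The three automatic methods can each be sketched in a few lines of Python. Treat this as a rough illustration under stated assumptions, not a real ride controller: every name here (`continuous_check`, `CyclicGateCheck`, `periodic_switch_test`, `FakeRig`) is hypothetical, sensors are reduced to simple True/False readings, and a real test would also have to wait for the pressure to bleed off and recover.

```python
def continuous_check(sensor_a, sensor_b):
    """Continuous validation: two redundant sensors must agree.
    A mismatch says one has failed, but not which one."""
    return sensor_a == sensor_b


class CyclicGateCheck:
    """Cyclic validation: the gate sensor must be seen reading 'open'
    during this cycle before a 'closed' reading is believed."""

    def __init__(self):
        self.saw_open = False

    def dispatch_permitted(self, gate_closed):
        if not gate_closed:
            self.saw_open = True
            return False
        return self.saw_open  # a sensor stuck at 'closed' never gets here

    def start_new_cycle(self):
        self.saw_open = False


def periodic_switch_test(read_switch, set_air_supply):
    """Periodic validation: cut off the air, confirm the switch drops out,
    then restore the air and confirm the switch picks up again."""
    set_air_supply(False)
    dropped_out = not read_switch()
    set_air_supply(True)
    picked_up = read_switch()
    return dropped_out and picked_up


class FakeRig:
    """Stand-in for the air valve and pressure switch, for demonstration."""

    def __init__(self, stuck_on=False):
        self.air_on = True
        self.stuck_on = stuck_on

    def set_air_supply(self, on):
        self.air_on = on

    def read_switch(self):
        return True if self.stuck_on else self.air_on


healthy = FakeRig()
rusted = FakeRig(stuck_on=True)
healthy_passes = periodic_switch_test(healthy.read_switch, healthy.set_air_supply)
rusted_passes = periodic_switch_test(rusted.read_switch, rusted.set_air_supply)
```

With these stand-ins the healthy switch passes the morning test and the rusted one fails it, which is the warning we never got in the scenario above.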

If we have to resort to manual inspection to validate the pressure switch we installed to detect a failure of air pressure, we might be better off without the sensor. It may be no more difficult to manually check the air pressure directly than it is to check that the pressure switch is still working. At least we won’t have a false sense of safety.

Conclusion

Adding sensors and interlocks to a system may not make it any safer and will always add complexity and new problems. The solution is to design the system so that it naturally validates its own operation. A poorly designed system will at best require a lot of time-consuming inspection and at worst may give a false appearance of safety. A well-designed system will use the intrinsic characteristics of the equipment, crosschecks between components, and feedback to ensure that nothing fails undetected.

Notice

This document is presented as a service to the entertainment community for informational and promotional purposes only. It is not intended as engineering advice or opinion and is not guaranteed to be current, correct, or complete. Links to other web sites are not endorsements of those sites.

You are welcome to forward the link to this document to others interested in the safety of rides and shows or control systems in general. You’ll find similar safety articles in the Reading Room.

Discussion

Amusement-safety and Show-control are professional, spam-free discussion groups. They are hosted at http://groups.yahoo.com by and for industry professionals. Birket Engineering, Inc. does not operate either group.