Design Guidelines for Safety Critical Systems

Birket Engineering Standards, 1993

This document establishes a company standard for the design of systems that have a bearing on human safety. The material in this document is compiled from over twelve years of our design notes and discussions with our associates in other companies who also design the same type of systems. We have compiled this information due to the absence of any other written control system design standard that addresses the unique safety requirements of the theme park industry. This is not a comprehensive document; it is a growing document that collects generally accepted safety design philosophy for the theme park industry. It is intended to be used with our other standards for design, the design process, and assembly.

This document addresses control system safety issues only. Clearly most systems have significant mechanical and structural safety issues that are often more critical than the control system safety issues. We are not qualified to address mechanical and structural safety issues. We must avoid the appearance of responsibility in this area while not ignoring the probability of these failures and their effect on the design of the control system. In fact, it is typically the control system that provides a second line of defense against harm from mechanical or structural failure of the controlled system. This document is not an exhaustive treatment of this subject by any means. From time to time we address safety issues that are not covered by this document. We do so by applying the design process and fundamentals that are encouraged by this document.

The design of all safety systems must contain the following three elements:

  1. involvement of the owner in the operational and implementation details of the design,
  2. peer review of the design within our company, and
  3. a safety acceptance test procedure performed on the completed system with the owner’s participation.

This document begins by establishing a broad goal. A safety system design tool is discussed. A definition of a safe system is presented. Finally, some details of implementation are offered.

Safety Design Goal

As a broad goal, we believe that a person who rides an attraction vehicle, sits in the theater, or participates in an effect on stage should experience no more risk than when that same person rides a mass transit vehicle. Trained Client personnel charged with maintaining elements of the attraction should experience no more risk than when trained personnel maintain a mass transit vehicle. Client personnel charged with operating elements of the attraction or equipment should experience no more risk from the attraction or equipment than operators of a mass transit vehicle experience from the equipment that they operate. We make the comparison to mass transit systems because this is an area with a long-established, well-documented design history. There are design process standards that have been created for the mass transit industry that we draw from when designing for the theme park entertainment industry. We believe that this is desirable because there are no similar widely accepted design process guidelines in the theme park or entertainment industry.

Similarly, an actor in a live show such as a stunt show should not experience any more risk as a result of his/her interaction with the equipment than he/she would experience when properly trained to operate equipment typical of an industrial manufacturing workplace designed and operated in accordance with OSHA requirements.

This goal seems reasonable, but how can it be determined that a design meets the goal? Design tools and statistical methods exist that allow the probability of harm-causing events to be accurately determined. These methods include hazard analysis, risk analysis, failure mode and effects analysis, fault tree analysis and sneak circuit analysis. Each of these tools has its own goals and field of application. For example, some are oriented toward determining the probability of a failure of an existing design so that insurance rates may be established. Others are oriented toward predicting maintenance costs and schedules. One tool is particularly appropriate for discovering and avoiding harm-causing events of systems and subsystems during the design phase. This tool is called Fault Tree Analysis.

Fault Tree Analysis

Fault Tree Analysis is a design tool that structures relationships between the events in a system into a Boolean logic model that leads to accident causation. These events are structured so that they lead to a specified outcome. This approach to analysis is called deductive. A deductive approach assumes the failures and examines lower order events to determine all of the combinations that could cause the specified harm-causing event. In the design phase, this approach to determining the causation of a harm-causing event is superior to other approaches that use inductive logic. Failure mode and effects analysis and hazard analysis use inductive logic.

Deductive reasoning is a logical process in which a conclusion is drawn from a set of premises containing no more information than the premises taken collectively. For example: a relay can only fail open or closed; if the relay fails open the system is safe because…; if the relay fails closed the system is safe because…; therefore, the relay’s failure can only result in a safe situation.

Inductive reasoning, by contrast, is a logical process in which a conclusion is proposed that contains more information than the observation or experience on which it is based. For example: this type of relay has never been seen to fail; therefore, this type of relay will never fail. Note that the truth of the conclusion is verifiable only in terms of future experiences. Certainty is attainable only if all possible instances have been examined. There is no certainty that this relay will not fail tomorrow, though it would seem very unlikely.

Inductive logic must not be used in an analysis of a safety control system. Deductive logic shall be used in the safety analysis of the system. Fault tree analysis encourages a deductive approach.

Fault tree analysis is explained in several texts including System Safety Engineering and Management, 2nd ed., Harold E. Roland and Bryan Moriarty, Chapter 29. It is not a requirement that a complete quantitative fault tree analysis be performed for a system. To do so would require that the probability of failure of each single point in the system be known.

Fault tree analysis, however, should be understood and applied as a qualitative tool. This qualitative analysis is more commonly used because it does not require quantitative knowledge of the probability of failure for the system components. An understanding of qualitative fault tree analysis enhances the mental process when designing system logic to avoid harm-causing events. By understanding fault tree analysis and following the mental process that it encourages, the analyst will be forced to understand the system beyond the level of the normal system designer. Accordingly, the probability of harm-causing events will be reduced to an acceptable level. The definitions and design specifics presented in this paper were derived within the design process encouraged by Fault Tree Analysis.
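
The qualitative use of a fault tree described above can be sketched in a few lines of Python. This is a minimal illustration, not a substitute for the method as presented in the referenced text. The top event, the gate structure, and all of the basic event names are hypothetical. The sketch reduces a small Boolean tree to its minimal cut sets, that is, the smallest combinations of basic events that produce the harm-causing top event. A cut set containing only one event is a single point failure.

```python
from itertools import product

# Hypothetical fault tree for illustration only: the top event and the
# basic-event names below are assumptions, not taken from a real design.
# A gate is ("AND", [children]) or ("OR", [children]); a leaf is a string.
TOP_EVENT = ("OR", [
    "estop_relay_contacts_welded",
    ("AND", ["primary_limit_fails", "over_limit_switch_bypassed"]),
])


def cut_sets(node):
    """Return the cut sets (frozensets of basic events) that cause a node."""
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any one child's cut set is enough to cause the event.
        return [cs for sets in child_sets for cs in sets]
    if gate == "AND":
        # Every child must occur: combine one cut set from each child.
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unknown gate type: {gate}")


def minimal_cut_sets(node):
    """Drop any cut set that is a strict superset of another cut set."""
    sets = set(cut_sets(node))
    return sorted(
        (cs for cs in sets if not any(other < cs for other in sets)),
        key=lambda cs: (len(cs), sorted(cs)),
    )


if __name__ == "__main__":
    for cs in minimal_cut_sets(TOP_EVENT):
        kind = "single point" if len(cs) == 1 else "multiple point"
        print(f"{kind}: {sorted(cs)}")
```

No failure probabilities are needed for this reduction, which is why the qualitative form of the analysis remains practical when component failure rates are unknown.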

Safety Requirements and Definitions

This section presents our definition of a safe design and then explains each element of the definition in detail.

The systems that we design shall be failsafe. We define failsafe to mean that every single point failure and critical multiple point failure that may occur in a system results in a safe state.

A single point failure is simply any single thing that can fail in the system. It must be assumed that any single point that might fail will fail. Only points that have an extremely low probability of failure can be considered infallible. Single point failures include everything that can go wrong. Examples are: broken switches and wires, fatigued metal, software defects, processor errors, stuck output drivers, and non-deliberate operator error.

Extremely low probability of failure is a subjective characteristic. It means that it reasonably appears that the equipment would have to be operated for over one hundred times the design life of the equipment before a single failure would be expected. For example, an apparent mean time between failures of over one hundred million operations, where there are expected to be fewer than one million operations over the life of the equipment, would be the worst acceptable failure rate. Accordingly, if an installation has a ten year design life, the equipment would have to be operated for one thousand years for a single failure to be likely. This guideline, coupled with the fact that such failures are more probable at the end of a design life, ensures that the design will consider all failures that might occur during the life of the installation. Note that this discussion does not rely on a periodic inspection program. If a periodic inspection program can be trusted, allowances may be made for it that allow less reliable equipment to be considered infallible.
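
The arithmetic behind this guideline can be written out as a short check. The sketch below is illustrative only. The function names are hypothetical, and treating a margin of exactly one hundred as acceptable is an assumption about the boundary case. The judgment of what "reasonably appears" to be the failure rate remains subjective, as stated above.

```python
# Guideline from the text: equipment must appear able to run for at least
# 100 times its design life before a single failure would be expected.
REQUIRED_MARGIN = 100


def may_be_considered_infallible(mtbf_operations, expected_life_operations):
    """Hypothetical helper: True if the apparent MTBF meets the 100x margin.

    Accepting a margin of exactly 100 is an assumption about the boundary.
    """
    return mtbf_operations >= REQUIRED_MARGIN * expected_life_operations


def equivalent_operating_years(design_life_years):
    """Years of operation before a single failure becomes likely."""
    return REQUIRED_MARGIN * design_life_years


if __name__ == "__main__":
    # The worked example from the text: 100 million operations MTBF against
    # 1 million operations over the life of the equipment.
    print(may_be_considered_infallible(100_000_000, 1_000_000))
    print(may_be_considered_infallible(50_000_000, 1_000_000))
    print(equivalent_operating_years(10))
```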

Critical multiple point failures are combinations of more than one single point failure where the probability of concurrent failure of the combination of single point failures is equal to or greater than the probability of a normal single point failure. It is not sufficient to design a system to be failsafe for single point failures alone. It must also be designed to be failsafe for the occurrence of double or multiple point failures if the probability of the multiple failure is not extremely low.
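
The comparison in this definition can be sketched as follows. The sketch assumes that the individual failures are statistically independent, so that the probability of the concurrent failure is the product of the individual probabilities. That assumption is ours and does not hold for common-cause failures. The function name and the numeric values are hypothetical.

```python
from math import prod


def is_critical_multiple_point_failure(failure_probabilities,
                                       single_point_probability):
    """Hypothetical helper applying the definition above.

    failure_probabilities: probability of each single point failure in the
        combination. A point that may fail undetected and is never validated
        should be entered as 1.0, because it must be assumed already failed.
    single_point_probability: probability of a normal single point failure,
        used as the reference for comparison.

    Assumes independent failures, so the joint probability is the product.
    """
    joint = prod(failure_probabilities)
    return joint >= single_point_probability


if __name__ == "__main__":
    # Illustrative numbers only.
    print(is_critical_multiple_point_failure([1e-2, 1e-2], 1e-5))
    print(is_critical_multiple_point_failure([1e-4, 1e-4], 1e-5))
    print(is_critical_multiple_point_failure([1e-4, 1.0], 1e-5))
```

The last call shows why unvalidated backups matter: once one member of the combination is assumed to have already failed, the combination is no more reliable than the remaining single point.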

The meaning of a safe state is usually obvious. Systems with moving parts are generally in a safe state when all moving parts have become unpowered, detached from any sources of stored energy, and motionless. Power buses or other items with exposed electrical parts are safe when they are unpowered. Sometimes, stopping a device once it has begun to move may cause greater harm than allowing the motion to proceed to completion. In such a case, the selection of the safe state must be carefully considered. Unusual situations shall be identified and presented to the Client for consideration prior to electing a non-obvious definition of a safe state.

We shall attempt to prevent harm from operation of the system with malicious intent, and from operation of the system after alterations to the delivered design, but the safety of the system under these circumstances cannot be assured. Where possible, we shall design the system to be self checking for alterations such as jumpers installed to override safety circuits.

Validation Requirements

In the analysis of a system to determine if it is failsafe, any point which may fail undetected must be assumed to have already failed. Therefore, any point in the system that may have a bearing on safety and which may fail undetected, shall be validated every cycle as not failed if it is to be relied upon. This requires that systems that achieve safety through redundancy incorporate periodic validation of the redundant components. Failure of the validation of a redundant element as operational in one cycle shall prevent operation of the equipment in the next cycle.
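
The rule that a failed validation in one cycle prevents operation in the next can be sketched as a small piece of state. The class and method names are hypothetical, and the choice to refuse operation until the first successful validation is an assumption consistent with treating an unvalidated point as already failed.

```python
class RedundantChannelMonitor:
    """Hypothetical sketch of per-cycle validation of redundant elements."""

    def __init__(self):
        # Assumption: nothing is trusted until it has been validated once.
        self._validated_last_cycle = False

    def may_start_cycle(self):
        """Operation is permitted only if the previous validation passed."""
        return self._validated_last_cycle

    def record_validation(self, channel_results):
        """Store the result of this cycle's validation.

        channel_results maps a channel name to True (validated as not
        failed) or False. An empty mapping counts as a failed validation.
        """
        self._validated_last_cycle = (
            bool(channel_results) and all(channel_results.values())
        )


if __name__ == "__main__":
    monitor = RedundantChannelMonitor()
    print(monitor.may_start_cycle())
    monitor.record_validation({"channel_a": True, "channel_b": True})
    print(monitor.may_start_cycle())
    monitor.record_validation({"channel_a": True, "channel_b": False})
    print(monitor.may_start_cycle())
```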

Points in the system that may have a bearing on safety and which may fail undetected, but do not cycle periodically under normal conditions, must be forced to cycle periodically to ensure that they remain capable of producing the desired result. Examples are emergency stop buttons and circuits, over-pressure switches and over-travel limit switches. This is best enforced by system design, but in some cases the only practical way to enforce such cycling will be by written operational requirements.

Consider an over-limit switch. The system may operate for years before the primary limit fails, allowing the over-limit to come into play. Since the over-limit switch may have been disconnected, bypassed, or otherwise failed in the days or months prior to the failure of the primary limit, the over-limit switch cannot be relied upon to provide any additional system safety. Another means of achieving the backup shall be used if human safety is involved, and is recommended for equipment safety.

Another example of a device requiring validation because it may fail undetected is an enable switch. In this example, a potentially hazardous event will be permitted only if the momentary action enable switch is depressed. How can it be known that the enable switch has not failed in a way that makes it appear to always be depressed? What if the contacts have been shorted? To address this and many similar situations, the system must be designed so that the enable status is granted only when the transition of the switch from “not-enabled” to “enabled” is seen. Stated another way, system permissives are required to be “edge sensitive” rather than “state sensitive”.
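
An edge sensitive permissive of this kind can be sketched as follows. The class name and the sampling interface are hypothetical. The sketch grants the enable only after the switch has been observed in the "not-enabled" state and then seen to change to "enabled", so a switch that is shorted from power-up never grants the enable.

```python
class EdgeSensitiveEnable:
    """Hypothetical sketch: grant enable on a released-to-pressed transition."""

    def __init__(self):
        self._seen_released = False  # has "not-enabled" ever been observed?
        self._enabled = False

    def update(self, switch_pressed):
        """Process one sample of the switch and return the enable status."""
        if not switch_pressed:
            # Releasing the switch always removes the enable.
            self._seen_released = True
            self._enabled = False
        elif self._seen_released:
            # Pressed after having been seen released: a genuine transition.
            self._enabled = True
        # Pressed but never seen released: possibly shorted, stay disabled.
        return self._enabled


if __name__ == "__main__":
    shorted = EdgeSensitiveEnable()
    print([shorted.update(True) for _ in range(3)])

    healthy = EdgeSensitiveEnable()
    print([healthy.update(s) for s in (False, True, True, False)])
```

A state sensitive design would have reported the shorted switch as enabled on the first sample.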

Unless special circumstances dictate otherwise, a design should be based upon the assumption that there is a higher probability of failure due to open signals than to shorted signals. Stated another way, the design shall be failsafe for wires that become disconnected rather than wires that are shorted together unless this is inappropriate to the safety of a particular situation. If human safety is an issue, the system shall be failsafe in both cases.

Emergency Stop System Architecture

An emergency stop system shall respond to appropriate events (identified during the design process) by causing the system to enter a safe state as defined previously. The design of the emergency stop system shall be failsafe as defined previously. See the sections on Emergency Stop Circuit Design for additional details on the design of the emergency stop system.

Each emergency stop button shall incorporate a red light. This red light shall be illuminated when an emergency stop condition exists, and not illuminated under normal operating conditions. To accomplish this, these lamps are powered from an “emergency stop not” bus. Where possible, a means of recording the time and location of an emergency stop event shall be incorporated into the design.

All emergency stop systems shall be periodically validated to ensure that they remain capable of performing as designed. This periodic verification must be enforced (preferably by an automated process, but possibly by written operational policy) such that each button, contact, etc. that is able (by design) to drop the emergency stop bus is demonstrated to be presently capable of dropping the emergency stop bus. In a processor based architecture, this process shall be automated so that at each start-up of the system the processor(s) demonstrate the capability to drop the emergency stop bus.

This discussion assumes the more complex situation where there is a Master controller which is responsible for the control of one or more (typically many) Subsystem controllers. The Master manages the enable, trigger, monitor and emergency stop signals of all of the subsystems. In a system without a Master-Subsystem architecture, a subset of these notes will apply.

Any Subsystem with human safety issues shall include a failsafe emergency stop circuit. This emergency stop circuit shall control an emergency stop bus within the Subsystem controller. The emergency stop bus shall power all devices that may cause harm. The emergency stop circuit shall be designed to power the bus whenever the 24VDC emergency stop signal from the Master is powered and other Subsystem conditions are met.

The Subsystem emergency stop bus shall become unpowered any time that the emergency stop signal from the Master becomes unpowered. The Subsystem emergency stop bus must also become unpowered as a result of pressing the emergency stop button on the face of the Subsystem control enclosure, or at other locations remote from the Subsystem control enclosure. Other fault conditions within the Subsystem may also cause the Subsystem emergency stop bus to become unpowered.

The local error conditions that cause the Subsystem emergency stop bus to become unpowered shall not cause the Master emergency stop bus to become unpowered unless this is appropriate to the situation. However, two normally closed contacts of the emergency stop button on the face of the Subsystem shall be returned to the Master so that pressing this button does cause the Master emergency stop bus to become unpowered. This addresses the requirement that pressing “any red mushroom emergency stop button” associated with the attraction will cause the Master and all attraction Subsystems to become unpowered.

A possible exception to the global nature of the emergency stop system described above exists. This exception shall be considered on a case by case basis and used only after the Client’s approval. The exception addresses the desire to operate a Subsystem in the manual mode while the system wide emergency stop bus from the Master is down, thus preventing all Subsystems from operating. This exception will be granted only in the circumstances where the control panel of the Subsystem is within easy line of sight of all of the equipment controlled by the Subsystem. Further, a red warning placard will be required on the face of the Subsystem adjacent to the Subsystem’s red emergency stop button explaining the exception. The red warning shall read: “when this Subsystem is in local (manual) mode it will not respond to the attraction emergency stop bus”. Under this exception, the Subsystem’s emergency stop bus becomes powered when the Subsystem is in the local or manual mode and the emergency stop button on the face of the Subsystem is not depressed, without regard for the status of the emergency stop signal from the Master. This allows local maintenance to be performed on the equipment controlled by the Subsystem without regard for the status of the Master emergency stop bus.

The emergency stop button on the face of the Subsystem also contains a red light. This red light will be powered only by the Master. It will be on when the emergency stop bus is not powered.

If the button on the face of the Subsystem is not pressed, and there are no other error conditions within the Subsystem which are causing an emergency stop condition, the Subsystem’s emergency stop bus should become powered as soon as power is returned to the emergency stop signal from the Master. A determination must be made for each Subsystem as to the result of powering the emergency stop bus. In some systems, powering the emergency stop bus may immediately result in motion of equipment. In other systems it would be safer to design the system such that equipment will not move when the emergency stop bus is powered. In these systems a separate reset or start signal coupled with an enable signal from the Master will be required to initiate motion after recovery from an emergency stop.

The following presents the previous discussion in a tabular form:

Signal                              Normal  Master    Master  OCC     Subsystem  Subsystem  Subsystem
                                            Internal  EStop   EStop   Internal   Internal   EStop
                                            EStop     Button  Button  EStop      EStop      Button
                                                                      (Type A)   (Type B)
-----------------------------------------------------------------------------------------------------
Master’s EStop Bus                  24VDC   0         0       0       24VDC      0          0

Subsystem’s EStop Bus,              24VDC   0         0       0       0          0          0
  Subsystem in Auto

Subsystem’s EStop Bus,              24VDC   0         0       0       0          0          0
  Subsystem in Local or Manual

Subsystem’s EStop Bus,              24VDC   24VDC     24VDC   24VDC   0          0          0
  Subsystem in Local or Manual
  (Exception granted.)*
-----------------------------------------------------------------------------------------------------
Indicator
-----------------------------------------------------------------------------------------------------
Lamp on all EStop buttons           Off     n/a       On      On      n/a        n/a        On
  that did not initiate the EStop.

Lamp on the EStop button            Off     n/a       +Flash  +Flash  n/a        n/a        +Flash
  that did initiate the EStop.

EStop is an abbreviation for emergency stop. A “Type A” Subsystem Internal emergency stop is one that causes the Subsystem emergency stop bus to drop, but does not cause the Master emergency stop to drop. A “Type B” Subsystem Internal emergency stop is one that causes both the Subsystem and the Master emergency stop to drop. See the earlier text.

*The “Exception granted” note implies that it has been determined by the Client that it is acceptable for this Subsystem to bring its emergency stop bus up even when the global emergency stop bus is not up. See the earlier text describing this situation and associated requirements.

+Special cases where each emergency stop indicator is controlled independently by a PLC digital output.
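
The bus rows of the table can also be expressed as two small Boolean functions. This is a sketch for checking one's reading of the table, not a specification. The enumeration and function names are hypothetical, and the sketch assumes that only one emergency stop cause is active at a time, as in the table.

```python
from enum import Enum


class Cause(Enum):
    """The condition columns of the table (names are hypothetical)."""
    NORMAL = "Normal"
    MASTER_INTERNAL = "Master Internal EStop"
    MASTER_BUTTON = "Master EStop Button"
    OCC_BUTTON = "OCC EStop Button"
    SUBSYSTEM_INTERNAL_A = "Subsystem Internal EStop (Type A)"
    SUBSYSTEM_INTERNAL_B = "Subsystem Internal EStop (Type B)"
    SUBSYSTEM_BUTTON = "Subsystem EStop Button"


# Causes that arise inside the Subsystem and always drop its own bus.
SUBSYSTEM_CAUSES = {
    Cause.SUBSYSTEM_INTERNAL_A,
    Cause.SUBSYSTEM_INTERNAL_B,
    Cause.SUBSYSTEM_BUTTON,
}


def master_bus_powered(cause):
    """Master bus stays up only in normal operation or for a Type A fault."""
    return cause in (Cause.NORMAL, Cause.SUBSYSTEM_INTERNAL_A)


def subsystem_bus_powered(cause, local_mode=False, exception_granted=False):
    """Subsystem bus state for a single active cause."""
    if cause in SUBSYSTEM_CAUSES:
        return False
    if local_mode and exception_granted:
        # Client-approved exception: ignore the Master emergency stop signal.
        return True
    return master_bus_powered(cause)


if __name__ == "__main__":
    for cause in Cause:
        print(
            f"{cause.value:36}"
            f" master={master_bus_powered(cause)!s:5}"
            f" auto={subsystem_bus_powered(cause)!s:5}"
            f" exception={subsystem_bus_powered(cause, True, True)!s:5}"
        )
```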

Emergency Stop Circuits in Master Systems

The Master shall receive two normally closed contacts from each emergency stop button, regardless of the location of the button. One contact shall be wired in series with all the other buttons, creating a chain that powers the Master emergency stop relay. The other contact shall be monitored by a discrete input of the PLC in the Master. PLC logic shall both drop the emergency stop bus and cause the same result by logically preventing further operation, when any button is pressed.
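
The two independent paths described above can be modeled in a few lines. The sketch is a logical model only and says nothing about the wiring itself. The function names and button names are hypothetical. Each button contributes one contact to the hardwired series chain and one contact to a PLC input, and either path alone is sufficient to drop the bus.

```python
def estop_chain_closed(series_contacts):
    """Hardwired path: the relay is powered only if every contact is closed."""
    return all(series_contacts.values())


def plc_permits_operation(monitored_contacts):
    """PLC path: logic permits operation only if every input reads closed."""
    return all(monitored_contacts.values())


def master_estop_bus_powered(series_contacts, monitored_contacts):
    """The bus is up only when both independent paths agree it may be up."""
    return (estop_chain_closed(series_contacts)
            and plc_permits_operation(monitored_contacts))


if __name__ == "__main__":
    # Hypothetical case: the series contact of one pressed button has welded
    # closed, but its second contact still opens and is seen by the PLC.
    series = {"console": True, "platform": True}
    monitored = {"console": True, "platform": False}
    print(master_estop_bus_powered(series, monitored))
```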

In special cases where a Subsystem has different operating voltages from the voltage used for the emergency stop chain (normally 24VDC), a relay may be used whose coil is controlled directly by the remote emergency stop switch. A contact of this relay shall be connected in series with the emergency stop chain and a separate contact shall be directly monitored by a discrete input of the processor. As with normal emergency stop switches, this logic must be tested daily to ensure the normal functionality of both the remote switch and its interposing relay. The test must be monitored by the processor to ensure that the combination of hardware is capable of causing an emergency stop.

The emergency stop signal derived by the Master shall control the emergency stop relay in the Master. The 24VDC status bus power or a separate 24VDC power supply shall pass through normally-open contacts of this relay to create an emergency stop bus within the Master. The corresponding normally closed contacts of the form-C contact set of this emergency stop relay shall be used to short across the Master emergency stop bus when the emergency stop relay is not energized.

All Master system outputs that cause motion or are otherwise safety related shall be powered by this emergency stop bus. Systems controlled by the Master shall be designed and built failsafe such that loss of power on the emergency stop bus makes them safe.

The Master shall make its internal emergency stop bus available to each Subsystem so that the Subsystem can create a local Subsystem emergency stop bus that tracks the bus in the Master. The Subsystem emergency stop bus is specified in the Subsystem Controller General Specification.

The Master shall also make an “emergency stop not” signal available to each Subsystem that is used by the Subsystem to illuminate the emergency stop button when the emergency stop bus is not powered. In special cases a unique signal shall be generated by a PLC digital output for each and every destination button. The signal sent to the button that caused the emergency stop shall be a square wave with a one second period so that the offending button flashes. This facilitates finding which emergency stop button needs to be pulled out after an emergency stop has been manually initiated.
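
For the special case where each lamp has its own PLC output, the lamp logic might be sketched as follows. The function name and parameters are hypothetical, and the 50 percent duty cycle is an assumption. The text specifies only a square wave with a one second period.

```python
def lamp_on(estop_bus_powered, is_initiator, seconds_since_estop, period=1.0):
    """Hypothetical per-button lamp output.

    estop_bus_powered: True under normal operating conditions.
    is_initiator: True for the button that caused the emergency stop.
    seconds_since_estop: time elapsed since the stop was initiated.
    period: flash period in seconds (one second per the text).
    """
    if estop_bus_powered:
        return False  # lamps are dark in normal operation
    if not is_initiator:
        return True  # steady on for buttons that did not cause the stop
    # Assumed 50% duty cycle: on for the first half of each period.
    return (seconds_since_estop % period) < period / 2


if __name__ == "__main__":
    for t in (0.0, 0.25, 0.5, 0.75, 1.0, 1.25):
        print(t, lamp_on(False, True, t), lamp_on(False, False, t))
```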

Emergency Stop Circuits in Subsystems

The emergency stop signal from the Master shall be terminated at the coil of a 24VDC emergency stop relay in the Subsystem. The 24VDC status bus power or a separate 24VDC power supply shall pass through normally-open contacts of this relay to create an emergency stop bus within the Subsystem. The corresponding normally closed contacts of the form-C contact set of this emergency stop relay shall be used to short across the Subsystem emergency stop bus when the emergency stop relay is not energized.

All Subsystem outputs that cause motion or are otherwise safety related shall be powered by this emergency stop bus. Their safety related systems shall be designed and built failsafe such that loss of power on the emergency stop bus makes them safe.

Figures 1 and 2 below present an implementation of a power bus structure and emergency stop system in a subsystem controller that follows the design guidelines presented in this document.

Figure 1: Sample Bus Control Logic. (Part 1)

Figure 2: Sample Bus Control Logic. (Part 2)

Watchdog Timer

The purpose of a watchdog timer signal is to ensure that a processor is still processing and that it is processing the code that it is intended to be processing. Barring a deliberate attempt to circumvent this safety feature, only the intended code will create a watchdog timer signal of the appropriate frequency on the appropriate output.

In a one processor system the watchdog signal shall be monitored by a watchdog timer relay configured to maintain a contact closure output if the input signal is within a reasonable tolerance of the intended frequency. There shall be a means of periodically validating the proper function of the watchdog timer relay.
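
The behavior expected of such a relay can be modeled in software for discussion purposes. The sketch below is not a replacement for a hardware watchdog timer relay. The function name, the 20 percent tolerance, and the use of rising-edge timestamps are all assumptions.

```python
def watchdog_contact_closed(edge_times, now, nominal_period, tolerance=0.2):
    """Hypothetical model of a watchdog timer relay.

    edge_times: timestamps (seconds) of successive rising edges seen so far.
    now: the current time in seconds.
    nominal_period: the intended period of the watchdog signal.
    tolerance: assumed fractional tolerance on the period.

    The contact stays closed only if every observed period is within
    tolerance and the signal has not stalled since the last edge.
    """
    if len(edge_times) < 2:
        return False  # not enough evidence that the processor is running
    low = nominal_period * (1 - tolerance)
    high = nominal_period * (1 + tolerance)
    periods_ok = all(
        low <= later - earlier <= high
        for earlier, later in zip(edge_times, edge_times[1:])
    )
    not_stalled = (now - edge_times[-1]) <= high
    return periods_ok and not_stalled


if __name__ == "__main__":
    healthy = [0.0, 0.5, 1.0, 1.5]
    print(watchdog_contact_closed(healthy, now=1.6, nominal_period=0.5))
    print(watchdog_contact_closed(healthy, now=3.0, nominal_period=0.5))
```

Note that the model rejects a signal that is too fast as well as one that is too slow or absent, since a processor running the wrong code may toggle the output at the wrong rate.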

If the Master and the Subsystem both contain a processor, they shall exchange watchdog timer (WDT) signals. This is accomplished by generating the signal at the Master and echoing it back from the Subsystem after an inversion by the Subsystem. To accomplish this, the Subsystem shall receive the incoming WDT signal on an input point of its intelligent controller. The intelligent controller software shall invert this signal and apply it to an output of the intelligent controller for return to the Master for verification. This test may also be implemented as a part of the information conveyed via a serial link, if such may be agreed upon with the designer of the Master. The ability to receive, invert and return this signal will serve as a test of the intelligent controller’s valid operation. The inversion prevents hard wiring the signal back to the Master.
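
The value of the inversion can be seen in a short simulation. The function names are hypothetical and signal timing is ignored. The Master toggles the signal and expects the logical inverse in return, so a jumper that loops the signal straight back, or a return line stuck in either state, is rejected within two toggles.

```python
def subsystem_echo(received):
    """The Subsystem's intelligent controller inverts the incoming WDT."""
    return not received


def link_healthy(echo, cycles=4):
    """Hypothetical Master-side check over several toggles of the signal.

    echo: a callable modeling whatever is connected to the return line.
    """
    sent = False
    for _ in range(cycles):
        sent = not sent
        if echo(sent) != (not sent):
            return False  # returned signal is not the inverse of what was sent
    return True


if __name__ == "__main__":
    print(link_healthy(subsystem_echo))        # working Subsystem
    print(link_healthy(lambda sent: sent))     # signal hard wired back
    print(link_healthy(lambda sent: False))    # return line stuck low
```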

In a redundant controller architecture, watchdog signals shall be exchanged by the processors.

Placards

Procedures involving safety shall be summarized on black placards placed on the face of the Subsystem. System conditions and erroneous procedures which pose a threat to safety shall also be described on red warning placards placed on the face of the Subsystem. The text used for each of the placards shall also be contained in a special “placard” section of the system documentation. Such warning placards shall be in a language or languages suitable and appropriate to the installation site.

Power Cycling

The system shall be designed such that power may be applied or removed at any time without causing damage to the controlled equipment. No motion shall result in any of the controlled equipment when power to the system is applied or removed.