20

I have heard the following terms related to safe system design but I cannot really see a difference between fail-safe and fail-soft (graceful degradation).

To get a common understanding I will just write out the terms that I've heard. Please feel free to correct me, if I don't explain something properly:

  • Safe life: A system does not show any fault during its lifecycle
  • Fail-Safe: A system goes into a safe operating mode with reduced functionality after a failure
  • Fail-Soft: A system goes into a degraded mode after a failure
  • Fail-Operative: A system still has the full functionality after a failure

From those explanations, it seems to me that Fail-Safe and Fail-Soft are actually the same concept. I would be glad if someone could explain the difference between the two, using some examples from aviation.

Pondlife
MrYouMath
  • 8
    Fail safe does not require system to operate at all. – user3528438 Sep 18 '17 at 14:16
  • 6
    A good example of a fail-safe system outside the aviation context is railway brakes. In a passenger car, force is applied to engage the brakes, so when the hydraulic system fails, you cannot use the brakes to stop the car. In trains, however, the brakes are engaged in their neutral state and force is applied to hold them off. Any failure in power delivery therefore engages them and halts the train, pushing it into a safe state. That is what makes them fail-safe, even though they are not "operable" and have no functionality in the failed state. – el.pescado - нет войне Sep 19 '17 at 06:02
  • And safe-to-fail? – neverMind9 Jul 11 '18 at 10:22

1 Answer

31

Fail-safe does not necessarily imply that the system will continue operating after a failure. If the system stops operating but does not create a dangerous situation, it is still fail-safe. A non-essential service on board an aircraft, such as the entertainment system, can be fail-safe if it just stops operating because a fuse blows. If the fuse does not blow upon a failure and the system catches fire after a short circuit as a result, it is not fail-safe.

Fail-soft does indeed mean that essential services remain functional after a failure, although in an aviation context this is mostly referred to as graceful degradation. A fly-by-wire system such as the one on board the A320 is fail-soft:

  • If all flight computers are functioning, the system flies with normal law active, providing protections against getting into unsafe regions of the flight envelope.
  • Upon certain detected failures, the system switches to alternate law, still with some protections in place.
  • Upon further failures, the system switches to direct law: no more flight envelope protections, just surface deflection proportional to stick deflection.

So this flight computer system is both fail-soft and fail-safe. The electronic flight control system as a whole is fail-safe: if all flight computers are lost, the system has hydro-mechanical connections from pedals to the rudder and from the trim wheel to the stabiliser.
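The degradation steps described above can be pictured as a small state machine. This is only an illustrative sketch: the names `ControlLaw`, `degrade` and `has_envelope_protection` are made up for this example, and the real reconfiguration logic depends on which specific failures are detected rather than on a simple "one step down" rule.

```python
from enum import IntEnum


class ControlLaw(IntEnum):
    # Hypothetical labels for the levels described in the answer.
    NORMAL = 0             # full flight envelope protections
    ALTERNATE = 1          # some protections still in place
    DIRECT = 2             # surface deflection proportional to stick deflection
    MECHANICAL_BACKUP = 3  # all flight computers lost


def degrade(law: ControlLaw) -> ControlLaw:
    """Return the next, more degraded level; the last level is terminal."""
    return ControlLaw(min(law + 1, ControlLaw.MECHANICAL_BACKUP))


def has_envelope_protection(law: ControlLaw) -> bool:
    """Only normal and alternate law keep (some) protections."""
    return law in (ControlLaw.NORMAL, ControlLaw.ALTERNATE)


if __name__ == "__main__":
    law = ControlLaw.NORMAL
    for _ in range(3):
        law = degrade(law)
        print(law.name, has_envelope_protection(law))
```

Each step removes functionality but leaves the aircraft controllable, which is the point of graceful degradation: there is no single failure that takes the system straight from full function to none.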

Hydraulic systems require special attention regarding fail-safety, since a stuck servo valve can command a constant velocity: the actuator runs into one of its stops and has a hard-over failure. An elevator that is stuck in full up deflection is not safe, so the pitch control system needs to make provisions for fail-safety if this fault occurs. One way to do this is to let the left and right elevators normally work in unison, but to uncouple them once a hard-over failure is detected. The working elevator can then be commanded to the full opposite position, making the elevator system fail-safe but not fail-soft: pitch command must now be done with the stabiliser trim. One elevator stuck in the mid position would be fail-safe, and control with the other elevator provides graceful degradation.

The flight control system on board an F-16 is fail-operational: upon a system failure, full system function continues without degradation. It is triple redundant, meaning there are four systems on board that carry out the same function, while only one is required. The four systems function independently, and a monitoring system determines whether all four outputs are within a pre-determined range. If three are and one is not, the disagreeing system is switched off and the other three continue. This can happen once more, but once only two systems remain operational, the voting system cannot tell which of the two has failed.
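The monitoring idea can be sketched in a few lines. This is a simplified illustration and not the F-16's actual logic: the function name `monitor`, the use of the median as the reference value, and the `tolerance` parameter are all assumptions made for this example.

```python
from statistics import median


def monitor(outputs: dict[str, float], tolerance: float) -> dict[str, float]:
    """Drop the single channel that disagrees with the rest, if identifiable.

    `outputs` maps a (hypothetical) channel name to its current output value.
    Comparing against the median is an assumption of this sketch.
    """
    if len(outputs) < 3:
        # With only two channels a disagreement cannot be attributed
        # to either one, so nothing is switched off.
        return outputs
    reference = median(outputs.values())
    outliers = [name for name, value in outputs.items()
                if abs(value - reference) > tolerance]
    if len(outliers) == 1:
        return {n: v for n, v in outputs.items() if n != outliers[0]}
    return outputs


if __name__ == "__main__":
    channels = {"A": 1.00, "B": 1.01, "C": 0.99, "D": 5.00}
    channels = monitor(channels, tolerance=0.1)
    print(sorted(channels))  # channel D has been switched off
```

With four channels the first outlier can be removed, and with three the second. Once two remain, a mismatch only shows that something is wrong, not which channel is at fault.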

Koyovis
  • 3
    Good description and examples. In my discussions with the FAA, they've very seldom used the term fail-soft. They typically refer to the key behavior of a fail-soft system which they term "graceful degradation" to the fail-safe state. – Gerry Sep 18 '17 at 16:23
  • +1 Thank you for your clear and extensive answer. Finally, I can make sense out of these terminologies. – MrYouMath Sep 18 '17 at 16:25
  • @Gerry indeed, fail-soft does not sound good in a military or aviation environment. Have included in the answer. – Koyovis Sep 19 '17 at 02:37
  • 3
    A loss of all hydraulics is not a common failure mode. It is also definitely not fail-safe, because there is no cable connection on the A320: the "mechanical" connection means a hydraulic one. If you lose all hydraulics, you are in serious trouble. – Jan Hudec Sep 19 '17 at 08:57
  • @JanHudec have amended. – Koyovis Sep 19 '17 at 10:26
  • 1
    I strongly suspect that the behavior of a F-16 after failure of one or more flight control system(s) depends very strongly on the cause of the failure. Some of those failure causes may not apply to typical airliner operation. – user Sep 19 '17 at 11:40
  • 2
    Slightly OT I find it odd that the F-16 has 4 independent systems. What if 2 are in agreement on one value and 2 are in agreement on another value? Granted, that's unlikely but not impossible - if weather is coming in from the right side, I could see the possibility of pitot tubes on that side freezing over while those on the left side remain clear. /OT thoughts – FreeMan Sep 19 '17 at 11:48
  • 1
    @FreeMan The system is likely a sophisticated hybrid running two overlapping TMRs, or a single TMR with a spare, and/or various split breaking systems, either at the circuit elector level or within a controller.

    I'm not familiar with the system but I do understand quad N redundant systems, and they are usually anything as simple as an R(4,0),

    Examples of deadlock breaking can be rejecting the longest service hour or elector rejection counts to remove a circuit, basically looking at its historical performance when no other measure is available. There are many options.

    – jCisco Sep 19 '17 at 15:23
  • 1
    @Koyovis, is the F-16 really 4 similar systems with voting? That seems like a rather poor choice, because it does not guard against a software bug (systems running the same code and seeing the same input would likely all make the same error). The Airbus system actually never votes on anything: it has dissimilar (in both hardware and software) primary and check boards, and if they disagree, they report a fault and the next pair takes over. – Jan Hudec Sep 19 '17 at 17:55
  • 1
    @JanHudec The original analog FBW was. – Koyovis Sep 19 '17 at 18:26
  • 3
    @FreeMan: The F-16 has four so it can still hold a majority vote if one fails. This is actually triple-redundant (one system, three more for redundancy) but colloquially (and wrongly) called quadruple-redundant. – Peter Kämpf Sep 20 '17 at 16:15
  • 1
    In my understanding, graceful degradation means that you are still able to limp home. No more capability to fulfil the mission, but enough to not crash immediately. For example, a single loss of a control surface would degrade handling from level A to level B (in the old MIL-F-8785 terminology) with more control forces and less authority. Or take the F8F, where the wingtips would snap off at 7.5 g instead of the full wing. – Peter Kämpf Sep 20 '17 at 16:19
  • @JanHudec ('s first comment): A better example would be the 737, which does have manual reversion capability for the ailerons and elevators via ye olde mechanickale cables. – Vikki Dec 13 '19 at 22:18
  • @Sean, I am not sure of what it is supposed to be a better example—the point was that backup for complete hydraulic failure is not required. – Jan Hudec Dec 13 '19 at 23:38