High Availability

Mark McIlroy
29.11.2025

Some notes on systems that must have high availability, such as vehicles and data centres.

1. In 2017 there were over one million commercial passenger jet flights, and not a single person died in a crash. I believe that this is one of the greatest achievements of the human race. However, there have been fatal crashes since that time, so the system is not perfect.

2. Redundancy

A primary method of maintaining high availability is redundancy. In this design there are two or more components that can perform the same function, so that one can perform the full system function if the other fails. This is a very effective method; however, it has some limitations. Statistically, there is a non-zero chance that both units will fail at the same time. Redundancy also adds cost and, in the case of vehicles, weight.

There are two types of redundancy.

A. Backup units. These units are not normally in use and stand by to be started if needed. A backup should be tested and put into actual use at least once per month.

B. Continuous-use systems. Both units are continuously in use, for example the engines on a two-engine aircraft. This is a much better arrangement than type A, because backup units frequently fail just when they are needed.

3. Maintenance

There should be a regular maintenance schedule for the equipment, and it should be properly carried out.

4. End-of-life

One method used in extreme high-availability systems such as aircraft is to discard and replace parts before they reach the end of their useful life.

5. Simplicity

All else being equal, a simple system will be more reliable than a complex one. The more complex system has more parts that can fail, and there is also a greater chance of an error in its design.

6. Complacency

The most dangerous issue of all in system availability is complacency.
If things go as hoped, a long time will pass without a failure, and as time goes by people become complacent. Maintenance schedules may not be fully carried out, and warning signs may be ignored. It is important to guard against this attitude developing.

Some options to guard against complacency: regular practice drills and periodic training refreshers.

Point for discussion: staff turnover (not having one person in the same safety position for an extremely long period of time).

7. Warning signs

Sometimes there are warning signs that something is wrong. These should be investigated fully; do not wait for an actual incident to occur.

8. Near misses

In other cases an incident occurs that could have been serious, or even fatal, but nothing bad actually happened. These are very serious incidents, because if a situation occurred once it can occur again, and the next time might be fatal. There can be a tendency to give near misses a low priority when nothing bad actually happened. In reality a near miss should be investigated fully and given the same priority as if a serious accident had occurred.

9. Contained failures

Systems need to be designed so that if one part fails, it does not damage other parts, and the rest of the system can continue to operate normally.

10. Testing

Testing cannot correct a system that has been poorly designed. However, assuming the design is good, the more testing that is conducted on a system, the more reliable its final form will be.

11. Quality components and materials

Use quality components and materials, whether they are physical items, software components, or entire subsystems; otherwise your system will never work reliably.
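
A note on the arithmetic behind points 2 and 5. The sketch below is an illustration only. It assumes that units and parts fail independently of one another, which is a simplification, and the function names and the figures 0.001 and 0.999 are invented for the example. Under that assumption, the chance that all redundant units fail together is the single-unit failure chance multiplied by itself once per unit, and the chance that a system works when it needs every one of its parts is the single-part reliability multiplied by itself once per part.

```python
def parallel_failure_probability(p_unit: float, units: int) -> float:
    """Chance that every one of `units` redundant units fails.

    Assumes failures are independent (a simplification for illustration).
    """
    return p_unit ** units


def series_reliability(r_part: float, parts: int) -> float:
    """Chance that all `parts` parts work, when the system needs every one.

    Assumes failures are independent (a simplification for illustration).
    """
    return r_part ** parts


if __name__ == "__main__":
    p = 0.001  # hypothetical chance that a single unit fails in one period
    for n in (1, 2, 3):
        print(f"{n} unit(s): chance all fail = {parallel_failure_probability(p, n):.1e}")

    r = 0.999  # hypothetical chance that a single part works in one period
    for parts in (10, 100, 1000):
        print(f"{parts} parts: chance all work = {series_reliability(r, parts):.3f}")
```

With these made-up figures, a second unit reduces the chance of total failure from one in a thousand to about one in a million, but never to zero. In the other direction, a system that depends on 1000 parts, each 99.9% reliable, works only about 37% of the time. Units in a real system may share a cause of failure, such as a common power supply, in which case the benefit of redundancy is smaller than this calculation suggests.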