Before defining High Availability, I’d like to introduce the notion of availability. A system is considered unavailable when it becomes unusable, or when it is still usable but with too few of its functions.
There are many causes of unavailability. Some come from what is commonly called a Single Point of Failure (SPOF); others are most often due to operational mistakes, or to Murphy’s Law.
High Availability is a global concept whose goal is to keep the system available as much as possible, i.e., with as little downtime as possible. Before designing your system architecture, you must define your Service Level Agreement (SLA). The SLA is the document where you define the availability of your system, covering both planned and unplanned downtime, but also response times and the monitoring to be performed so that preventive actions can be taken.
Availability is often expressed as the percentage of time the system must be working. This percentage is calculated as a yearly average, then converted into a monthly figure giving the amount of time the system can be down each month.
For example, if you want your system to be up “five nines” (99.999%) of the time, it must be up 525594.744 minutes out of the 525600 minutes in a year. According to this calculation, you are allowed about 5.26 minutes of service disruption each year, i.e. roughly 26 seconds each month.
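To make the arithmetic easy to redo for other targets, here is a minimal sketch of the calculation. The function name `downtime_budget` is my own, and it assumes a 365-day year split into twelve equal months, which is a simplification.

```python
# Minutes in a non-leap year: 365 days * 24 hours * 60 minutes = 525,600.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget(availability_percent):
    """Return the allowed downtime as (minutes per year, seconds per month).

    Assumes a 365-day year and twelve equal months (a simplification).
    """
    unavailable_fraction = 1 - availability_percent / 100
    minutes_per_year = MINUTES_PER_YEAR * unavailable_fraction
    seconds_per_month = minutes_per_year * 60 / 12
    return minutes_per_year, seconds_per_month


if __name__ == "__main__":
    # Print the budget for a few common availability targets.
    for target in (99.0, 99.9, 99.99, 99.999):
        per_year, per_month = downtime_budget(target)
        print(f"{target}% -> {per_year:.2f} min/year, {per_month:.1f} s/month")
```

For 99.999% this gives about 5.26 minutes per year, or about 26 seconds per month.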
Once you have decided on your availability rate, you can design, or improve, your architecture. To do so, you should identify every SPOF and make it a non-SPOF component.
SPOFs can be, but are not limited to, database servers, network components (such as load balancers, switches, routers, firewalls, etc.), application servers, filers, disks, and authentication servers.
Having identified those SPOFs allows you to weigh the gain against the cost of making each of them redundant.
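One rough way to estimate that gain is the classic series/parallel availability model. The sketch below is only an approximation: it assumes components fail independently and that failover between redundant copies is instantaneous and never fails, which real systems do not guarantee. The function names and the sample figures are illustrative, not measurements.

```python
def serial_availability(availabilities):
    """Availability of a chain where every component is required (each one is a SPOF)."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def redundant_availability(availability, copies):
    """Availability of `copies` identical components where one working copy is enough.

    Assumes independent failures and perfect, instantaneous failover.
    """
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    # Hypothetical figures: web tier, application tier, single database server.
    web, app, db = 0.999, 0.999, 0.99

    before = serial_availability([web, app, db])
    after = serial_availability([web, app, redundant_availability(db, 2)])

    print(f"single database:    {before:.4%}")
    print(f"redundant database: {after:.4%}")
```

With these sample figures, doubling the database server raises the estimated availability of the whole chain from about 98.8% to about 99.8%. You can then compare that gain with the cost of the second server.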
Here is an example of what a Highly Available architecture could look like.
Management
High Availability is not only a concept for the architecture of your system; it also includes management best practices. This management part is as important as the architecture: you may have the best architecture design ever, but if you don’t continuously improve and monitor each element, you will never be able to take preventive action.
As I mentioned before, your SLA is there to help you define what has to be monitored, the triggers on critical and non-critical events, and the various Key Performance Indicators (KPIs) that show your system is healthy.
Of course, the SLA is not enough to ensure availability. You have to make sure that your Operation Manuals are well documented and always up to date, that your Disaster Recovery Plan exists, is tested at least once a year and is also well documented, and that backups are stored in a safe place, duplicated and always correct. Finally, have your operational teams very well trained and available 24×7.
Remember that making your systems redundant is not sufficient; also make sure you have a sound data-replication policy, all the more so when using an RDBMS, to avoid data corruption on all your database nodes.
Tools and Techniques
Some of the commonly used tools include AIX HA-CMP, Sun Cluster, Heartbeat (Open Source), Cisco load balancers, EMC2 filers, Network Appliance, and so on.
Some of the techniques used to make a system Highly Available are:
- Load Balancing
- Reverse Proxy
- Clustering
- Virtualization
- Failover
I will not go into the details of each technique and tool in this post, but stay tuned, as I’ll write about them in the near future.








