Sunday, April 15, 2007

failure probability and clusters

When running a high-availability cluster of two nodes it will generally be configured such that if one node fails then the other takes over running the services. Some common operation (such as accessing a shared storage device or pinging a router) will be used by the surviving node to determine that the other node is dead and that it's not merely a networking problem. Therefore if you lose one node then the system keeps operating until you lose another.

When you run a three-node cluster the general configuration is that a majority of nodes is required. So if the cluster is partitioned then one node on its own will shut down all services while two nodes that can talk to each other will continue operating as normal. This means that to lose the cluster you need to lose all inter-node communication or have two nodes fail.

If the probability of a node surviving for the time interval required to repair a node that's already died is N (where N is a number between 0 and 1: 1 means a 100% chance of surviving and 0 means certain failure) then for a two node cluster the probability of the remaining node surviving long enough for the dead node to be fixed is N. For a three node cluster the probability that both of the surviving nodes will survive is N^2. This is significantly less, so a three node cluster is more likely to experience a critical second failure than a two node cluster.

For a four node cluster you need three active nodes to have quorum. Therefore the probability that none of the three remaining nodes fails is N^3 - even worse again!
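
To put rough numbers on that, here is the same sort of bc calculation as the script below, with N=0.9 picked purely as an example:

N=0.9
echo "$N ; $N^2 ; $N^3" | bc -l   # 2-node: .9, 3-node: .81, 4-node: .729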

For a five node cluster you can lose two nodes without losing the cluster. If you have already lost a node, the probability that at least three of the remaining four nodes survive is N^4+(1-N)*N^3*4. As long as N is greater than 0.8 the probability of keeping at least three nodes out of the remaining four is greater than the probability of a single node not failing.

To see the probabilities of five and four node clusters surviving the repair period after one node has died, run the following shell script for different values of N (0.9 and 0.99 are reasonable values to try). You might hope that the probability of a second node remaining online while the first node is being repaired is significantly higher than 0.9. However, when you consider that the first node's failure might have been partially caused by the ambient temperature, power supply problems, vibration, or other factors that affect multiple nodes, I don't think it's impossible for the probability to be as low as 0.9.

N=0.9   # try different values of N here, e.g. 0.9 or 0.99
echo "$N^4+(1-$N)*$N^3*4 ; $N^3" | bc -l   # five node cluster first, then four node

So it seems that if reliability is your aim in having a cluster then your options are two nodes (if you can be certain of avoiding split-brain) or five nodes. Six nodes is not a good option: a six node cluster also loses quorum after a third failure, but it has more nodes that can fail, so the probability of losing three nodes out of six is greater than the probability of losing three nodes out of five. Seven and nine node clusters would also be reasonable options.
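
The same calculation generalizes to any cluster size as a binomial sum: after one node has died, the cluster survives if at least a quorum of the remaining nodes stays up. Here is a rough sketch of that calculation for the sizes mentioned above (the helper functions are my own, and it assumes GNU bc):

N=0.9   # assumed per-node survival probability, try 0.99 as well
bc -l <<EOF
define fact(x) {
    if (x <= 1) return (1)
    return (x * fact(x - 1))
}
define choose(m, k) {
    return (fact(m) / (fact(k) * fact(m - k)))
}
/* atleast(m, k, n): probability that at least k of m remaining nodes survive */
define atleast(m, k, n) {
    auto i, p
    p = 0
    for (i = k; i <= m; i = i + 1) p = p + choose(m, i) * n^i * (1 - n)^(m - i)
    return (p)
}
atleast(2, 2, $N)  /* 3-node cluster, one node already dead */
atleast(3, 3, $N)  /* 4-node cluster */
atleast(4, 3, $N)  /* 5-node cluster */
atleast(5, 4, $N)  /* 6-node cluster */
atleast(6, 4, $N)  /* 7-node cluster */
atleast(8, 5, $N)  /* 9-node cluster */
EOF

With N=0.9 it prints roughly .81, .73, .95, .92, .98 and .99.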

But it's not surprising that a Google search for "five node" cluster high-availability gives about 1/10 the number of results as a search for "four node" cluster high-availability. Most people in the computer industry like powers of two more than they like maths.

3 comments:

janfrode said...

I think there is some flaw in your logic comparing 2 nodes to 3 nodes, as they don't really have the same set of failure causes.

A 3-node cluster is a "real" cluster, and can avoid split brain simply by having 2 nodes agreeing on who's up or not.

A 2-node cluster must avoid split brain, and to do this you'll need a third "node" of some kind. This third "node" will be less a part of the cluster than in a real 3-node cluster, but I don't think you can rule it out of your calculation completely. It will be:

-- pinging the gateway (no guarantee against split brain??)

-- accessing a shared storage device?? I hope you meant telling the shared storage device to fence out the other node?

-- powering the other node off through a remote management adapter or power switch

So a 2-node cluster must really be a sort of 3-node cluster to guarantee no split brain.

Another point: shouldn't a 6 node cluster turn into a 5 node cluster once 1 node is dead?

i.e. in the 6 node cluster one node dies, and the other 5 agree that now they're a 5 node cluster. Then another node dies, and the 4 nodes agree that they're a 4 node cluster. Then another node dies, and the remaining nodes agree that they're a 3 node cluster.

And when the 3 dead nodes all come back at the same time, without seeing the other nodes, they should refuse to start as they know they used to be part of a 6, 5, or 4 node cluster.
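
Something along these lines, as a purely hypothetical sketch (not taken from any real clustering suite):

# quorum is a majority of the last size the surviving nodes agreed on,
# not of the original cluster size
have_quorum() {
    # $1 = last agreed cluster size, $2 = nodes currently reachable
    [ "$2" -gt $(( $1 / 2 )) ]
}

size=6
for reachable in 5 4 3; do
    if have_quorum "$size" "$reachable"; then
        echo "$reachable of $size reachable: still quorate, re-form as a $reachable node cluster"
        size=$reachable    # the survivors agree on the new, smaller size
    else
        echo "$reachable of $size reachable: no quorum, stop services"
    fi
done
# a node that comes back alone must refuse to start services: 1 is never a
# majority of whatever cluster size it last knew about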

It would be interesting to know if any clustering suites work this way. Or maybe I'm missing something?

Simon Waters said...

I like Jan's comment, but I think one of the things Russell is hinting at is that failures of nodes are not random, uncorrelated events, like a lot of simple failure models assume.

In reality computers fail because:
- the power went
- the power came back
- someone spilt something
- something rocked the computer/rack/room
- someone fiddled with the hardware
- someone fiddled with the software
- someone fiddled with the network/cable/...

My suspicion is that clustering for reliability is a lot less successful than people like to imagine. Sure I've worked with redundant hardware that did the job, but a lot of times the extra complexity can be almost more trouble than it is worth.

The most successful I ever saw was early Novell stuff, and the people concerned found a lot of issues in testing before they deployed the kit.

Alan Robertson said...

One of my favorite phrases is "complexity is the enemy of reliability". This is absolutely true, but not a complete picture, because you don't actually care much about reliability - you care about availability.

Complexity (which reduces MTBF) is only worth it if you can use it to drastically cut MTTR - which in turn raises availability significantly. If your MTTR was 0, then you wouldn't care if you ever had a failure. Of course, it's never zero.
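
As a rough illustration, using the usual availability = MTBF / (MTBF + MTTR) approximation (the numbers here are just examples):

MTBF=$((5 * 365 * 24 * 3600))   # an assumed MTBF of 5 years, in seconds
echo "$MTBF / ($MTBF + 4 * 3600) ; $MTBF / ($MTBF + 60)" | bc -l
# roughly .99991 with a 4 hour hardware repair, .9999996 with a 60 second failover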

But, with normal clustering software, you can significantly improve your availability AND your maintainability.

Your post makes some assumptions which are more than a little simplistic. To be fair, the real mathematics of this are pretty darn complicated.

First, I agree that there are FAR more 2-node clusters than larger clusters. But I think for a different reason. People understand 2-node clusters. I'm not saying this isn't important - it is important. But it's not related to reliability.

Second, you assume a particular model of quorum, and there are many. It is true that your model is the most common, but it's hardly the only one - not even for heartbeat (and there are others we want to implement).

Third, if you have redundant networking and multiple power sources, as you should, then system failures become much less correlated. The normal model which is used is one of completely uncorrelated failures.

This is obviously an oversimplification as well, but if you have redundant power supplies supplied from redundant power feeds, and redundant networking etc. it's not a bad approximation.

So, if you have an MTTR of 4 hours to repair broken hardware, what you care about is the probability of having additional failures during those four hours.

If your HA software can recover from an error in 60 seconds, then that's your effective MTTR as seen by (a subset of) users. Some won't see it at all. And, of course, that should also go into your computation. This depends on knowing a lot about what kind of protocol is involved, and what the probability is that failures of various lengths will be visible to various kinds of users. And, of course, no one really knows that either in practice.

If you have a hardware failure every 5 years approximately, and a hardware repair MTTR of 4 hours, then the probability of a second failure during that time is about .009%. The probability of two failures occurring during that time is about 8*10^-7% - which is a pretty small number.
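
To check that arithmetic with bc (the 4 hours and 5 years are the figures above):

echo "p = 4 / (5 * 365 * 24); p * 100; p^2 * 100" | bc -l
# prints about .00913 and .00000083, i.e. roughly .009% and 8*10^-7 %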

Probabilities for higher order failures are proportionately smaller.

But, of course, like any calculation, the probabilities of this are calculated using a number of simplifying assumptions.

It assumes, for example, that the probabilities of correlated failures are small. The probability of a flood taking out all the servers, or some other disaster, is ignored.

You can add complexity to solve those problems too ;-), but at some point the managerial difficulties (complexity) overwhelm you and you say (regardless of the numbers) that you don't want to go there.

Managerial complexity is minimized by uniformity in the configuration. So, if all your nodes can run any service, that's good. If they're asymmetric, and wildly so, that's bad.

I have to go now, I had a family emergency come up while I was writing this. Later...