Thursday, April 19, 2007

a Heartbeat developer comments on my blog post

Alan Robertson (a major contributor to the Heartbeat project) commented on my post failure probability and clusters. His comment deserves wider readership than a comment generally gets so I'm making a post out of it. Here it is:



One of my favorite phrases is "complexity is the enemy of reliability". This is absolutely true, but not a complete picture, because you don't actually care much about reliability, you care about availability.

Complexity (which reduces MTBF) is only worth it if you can use it to drastically cut MTTR - which in turn raises availability significantly. If your MTTR was 0, then you wouldn't care if you ever had a failure. Of course, it's never zero.
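A minimal sketch of that trade-off, using the usual steady-state formula availability = MTBF / (MTBF + MTTR). The figures are assumptions chosen for illustration (in particular, the halved MTBF for the clustered case is hypothetical), not measurements:

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical single server: one failure every 5 years, 4 hours to repair.
simple = availability(5 * HOURS_PER_YEAR, 4)

# Hypothetical cluster: the added complexity halves the MTBF, but failover
# cuts the user-visible MTTR to 60 seconds.
clustered = availability(2.5 * HOURS_PER_YEAR, 60 / 3600)

print(f"simple:    {simple:.6%}")
print(f"clustered: {clustered:.6%}")
```

Under these assumed numbers the clustered setup fails twice as often, yet it is still the more available of the two, because each outage is so much shorter.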

But, with normal clustering software, you can significantly improve your availability, AND your maintainability.

Your post makes some assumptions which are more than a little simplistic. To be fair, the real mathematics of this are pretty darn complicated.

First I agree that there are FAR more 2-node clusters than larger clusters. But, I think for a different reason. People understand 2-node clusters. I'm not saying this isn't important, it is important. But, it's not related to reliability.

Second, you assume a particular model of quorum, and there are many. It is true that your model is the most common, but it's hardly the only one - not even for heartbeat (and there are others we want to implement).

Third, if you have redundant networking, and multiple power sources, as you should, then system failures become much less correlated. The normal model which is used is completely uncorrelated failures.

This is obviously an oversimplification as well, but if you have redundant power supplies supplied from redundant power feeds, and redundant networking etc. it's not a bad approximation.

So, if you have an MTTR of 4 hours to repair broken hardware, what you care about is the probability of having additional failures during those four hours.

If your HA software can recover from an error in 60 seconds, then that's your effective MTTR as seen by (a subset) of users. Some won't see it at all. And, of course, that should also go into your computation. This depends on knowing a lot about what kind of protocol is involved, and what the probability of various lengths of failures is to be visible to various kinds of users. And, of course, no one really knows that either in practice.

If you have a hardware failure every 5 years approximately, and a hardware repair MTTR of 4 hours, then the probability of a second failure during that time is about .009%. The probability of two failures occurring during that time is about 8 x 10^-7% - which is a pretty small number.
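The arithmetic behind those two figures can be sketched as follows. It assumes a constant failure rate, a repair window much shorter than the MTBF, and independent failures; the variable names are illustrative:

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years

mtbf_hours = 5 * HOURS_PER_YEAR  # one failure per node every 5 years
repair_window_hours = 4          # hardware repair MTTR

# Chance that one given node fails inside a 4-hour window, assuming a
# constant failure rate and a window much shorter than the MTBF.
p_one = repair_window_hours / mtbf_hours

# Chance that two given nodes both fail inside the same window,
# assuming their failures are independent.
p_two = p_one ** 2

print(f"one failure:  {p_one * 100:.4f}%")
print(f"two failures: {p_two * 100:.1e}%")
```

This gives roughly 0.009% for a single failure in the window and roughly 8 x 10^-7% for two.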

Probabilities for higher order failures are proportionately smaller.

But, of course, like any calculation, the probabilities of this are calculated using a number of simplifying assumptions.

It assumes, for example, that the probabilities of correlated failures are small. For example, the probability of a flood taking out all the servers, or some other disaster is ignored.

You can add complexity to solve those problems too ;-), but at some point the managerial difficulties (complexity) overwhelm you and you say (regardless of the numbers) that you don't want to go there.

Managerial complexity is minimized by uniformity in the configuration. So, if all your nodes can run any service, that's good. If they're asymmetric, and very wildly so, that's bad.

I have to go now, I had a family emergency come up while I was writing this. Later...


End quote.

It's interesting to note that there are other models of quorum; I'll have to investigate that. Most places I have worked have had an MTTR that is significantly greater than four hours. But if you have hot-swap hard drives (so drive failure isn't a serious problem) then having machines average one failure per five years should be possible.

6 comments:

Alan Robertson said...

I got in a hurry on my math because of the emergency. So, there are even more assumptions (errors?) than I documented.

In particular, the probability model I gave was for a particular node to fail. So the probability of either of two failing would be double that, and either of three failing would be triple that.

Note that the probability of multiple simultaneous failures goes up as a power, but the probability of "either of" only goes up linearly.

I really need to sit down and do the math carefully - but the idea of the simultaneous failures going up as a power is true. And the "any of" probability goes up linearly. That's also true. This is why people can actually use larger HA clusters ;-).
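A rough sketch of the two scaling rules, assuming independent failures and reusing the per-node figure from the earlier estimate (4 hours out of 5 years). The function names are illustrative:

```python
def p_any(p: float, n: int) -> float:
    """Probability that at least one of n independent nodes fails."""
    return 1 - (1 - p) ** n


def p_all(p: float, k: int) -> float:
    """Probability that k given independent nodes all fail together."""
    return p ** k


p = 4 / (5 * 8760)  # per-node chance of failing in a 4-hour window

for n in (1, 2, 3, 8):
    print(f"n={n}: any-of={p_any(p, n):.3e}  all-of={p_all(p, n):.3e}")
```

For small p the "any of" figure is very close to n times p, which is the linear growth, while the "all of" figure is p to the n-th power and so shrinks very quickly as more simultaneous failures are required.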

The 5 years figure is the industry standard quoted figure for an average Intel-based server to fail.

The four hours to repair is a common high-quality of service response time from a hardware vendor. I admit that's not the same as actual repair time, but if some "repairs" are just reboots, then it's not a horrible number to start with - if your vendor has cached some spares nearby. I suppose I should sit down and do the math right, and make a spreadsheet of it. (I wonder if I remember that much math?)

I assume disk failures are taken care of by hot swap disks, RAID, etc. and so in effect they "never fail" (at least not totally) so that these failures don't have to be accounted for by the overall availability model.

Here's an intuitive way of thinking about it "from your gut"...

If I took your whole data center and made a cluster out of it, what's the chance that at least half of your servers would fail at once?

Pretty darn small, is the short answer ;-). If it's not pretty darn small, you need to buy better servers, and IBM has just the servers for you ;-). Or maybe they need to hire a better SysAdmin ;-)
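To put a number on "pretty darn small", here is a sketch that treats each server as independently down with the same probability p and sums the binomial tail. Both the independence and the value of p are assumptions carried over from the estimate above:

```python
from math import comb


def p_at_least_half_down(n: int, p: float) -> float:
    """Binomial tail: chance that at least ceil(n/2) of n nodes are down."""
    k_min = (n + 1) // 2
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


p = 4 / (5 * 8760)  # per-node chance of being down in a 4-hour window

for n in (2, 4, 10, 20):
    print(f"{n:2d} nodes: {p_at_least_half_down(n, p):.3e}")
```

The result drops steeply as the cluster grows, which matches the gut feeling, provided the failures really are uncorrelated.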

If you ask yourself "when is the last time at least half my machines in my data center couldn't communicate with the other half", then hopefully that's also a "pretty darn small" chance too. If not, there are well-known methods for making networks highly reliable too.

[I'm still ignoring "catastrophes" that you haven't accounted for in your HA architecture].

I'm not saying this is free, and it can be pricey. One of my other favorite sayings is "Paranoia is an expensive hobby". How much do you want to spend?

You tell me how much you want to spend, and you can figure out how to spend it.

I'll make a separate comment on quorum models later. It's getting late here.

Anonymous said...

Another question related to uptime is the cost of lost business during downtime and the cost of lost data.

I.e. if a stock exchange system crashes and loses the last five minutes of data then this is expensive too and most likely worse than five minutes downtime.

From memory I remember a story of an ATC system where the MS-SQL servers had to be rebooted once a month to prevent an integer counter overflow. This was done for both machines on the same day and by hand. Now as one technician was on holiday and the other sick, nobody rebooted, and the second machine failed half an hour after the first.

The problem with HA systems is to know whether the person who set them up is clever in spotting SPOFs or just a clueless lunatic. It takes years to find out...

--

From my experience really bad are memory corruption (on non-ECC machines) and erratic hard disk controller behaviour. Both are not easy to spot with heartbeats.

Alan Robertson said...

Data corruption, no doubt, is almost always much worse than loss of availability. And some kinds of data corruption are worse than others. For example, mounting a non-clustered shared disk filesystem twice simultaneously is usually much worse than updating two replicas of the data simultaneously. In the first case, you have to restore to your previous backups and lose all data since then. In the second case, you only lose updates that were made to one of the sides (with the possibility of recovering them by significant effort), and you instantly have a working copy of the data which is nearly always much newer than your last backup. Typically you would only lose a few minutes of updates at worst - and depending on the kind of networking failure, you might not lose anything.

Heartbeats certainly aren't enough. You need to monitor the health of your servers and the health of your applications. Heartbeat monitors applications and can easily be informed of and act on the health of your servers (with release 2 style Linux-HA Heartbeat configurations).

Alan Robertson said...

Because data corruption is so serious, cluster designers worry a great deal about split-brain, which is managed using the ideas of quorum and its sibling, fencing.

This is all about keeping bad things from happening.

This post is really about quorum, since Russ had expressed interest in it.

Quorum is the idea that you can uniquely choose a subcluster to represent the whole cluster in those cases where communication failure has caused the cluster to split into separate sub-clusters which cannot properly communicate with each other. In this way, only one of the subclusters continues on, and the others will sit on their hands and do nothing waiting for a person to fix things.

Some of the kinds of quorum mentioned below are better than others. But, most importantly, they can be used in combination as described later.

The most common kind of quorum is the one Russ mentioned in his earlier post - the majority quorum. In this method, for a cluster of n nodes, you grant quorum to a sub-cluster which has more than INT(n/2) members. This means that if you have a 3-node cluster, you have to have two nodes to continue. If you have 4 nodes, you have to have 3 nodes to continue. For 5 nodes, you have to have 3 nodes, and so on.

Other basic methods include disk reserve, so that you have to reserve a disk to have quorum. In this case, if only one node survives and it can reserve the disk, it continues to run. However, the disk becomes a single point of failure. This may not be a problem if this single disk is required to run any of the cluster services, since they would fail without it anyway. [Heartbeat does not support this method].

An analogous method is to implement a software resource which grants quorum to one subcluster in a fashion analogous to the disk reserve method. This has the advantage of not requiring disk reserves, or a shared disk, but it has the same SPOF disadvantage as the disk reserve method. Heartbeat does support this method using the quorum daemon. It's incredibly useful for those cases (like split-site clusters) where you cannot use fencing.
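The majority rule is small enough to write down directly. This is only an illustration of the "more than INT(n/2)" test with made-up function names, not Heartbeat's code:

```python
def has_majority_quorum(members: int, cluster_size: int) -> bool:
    """True if a sub-cluster holds more than INT(n/2) of the n nodes."""
    return members > cluster_size // 2


def min_members_for_quorum(cluster_size: int) -> int:
    """Smallest sub-cluster size that still has majority quorum."""
    return cluster_size // 2 + 1


for n in (2, 3, 4, 5):
    print(f"{n}-node cluster needs {min_members_for_quorum(n)} nodes")
```

Note that a 2-node cluster needs both nodes under this rule, so a lone survivor never has quorum.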

Another method is to grant quorum to any subcluster which can ping a certain set of nodes, and not grant it to any which can't access those nodes. This isn't a wonderful method, and has obvious disadvantages with respect to uniqueness, and single points of failure. (Heartbeat doesn't yet implement this one).

Another method is to grant quorum to any node which is a member of a 2-node cluster. This is better than losing quorum and stopping when one node stops, but obviously completely ignores the uniqueness requirement of quorum.

Another method is to ask a human being if you have quorum. This is hardly an ideal circumstance, but useful in some contexts as described below. (Heartbeat doesn't yet implement this one).

Perhaps you say, really the only one of these that's really good is the first one - the majority vote method.

And, I would generally agree with you. But, Heartbeat has the ability to use these in combination, which makes some of those methods that seem flaky much more reasonable.

Heartbeat has the ability to have multiple quorum modules declared, and they're used in this way: Any module can return HAVEQUORUM, NOQUORUM, or TIE. If they return HAVEQUORUM or NOQUORUM, then no further quorum modules are consulted. However, if they return TIE, then the next quorum module is consulted for its opinion. If the last quorum module returns TIE, it is treated the same as NOQUORUM.

This enables you to use one quorum module to break the tie declared by a previous quorum module.

You could then use the quorumd to break the tie created by a voting module. Or you could use the quorumd instead of the "two-node" module. Or you could use the "pingable" module instead of the "two-node" module. Or you could at the end always tack on a "human" module, in case all else returns TIE.
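The chaining rule can be sketched in a few lines. Everything here apart from the three verdict names is hypothetical: the module functions, their signatures, and the assumption that a voting module reports TIE on an exact even split are illustrations, not Heartbeat's actual plugin interface:

```python
from enum import Enum
from typing import Callable, Iterable


class Verdict(Enum):
    HAVEQUORUM = "HAVEQUORUM"
    NOQUORUM = "NOQUORUM"
    TIE = "TIE"


# Hypothetical module signature: (members, cluster_size) -> Verdict
QuorumModule = Callable[[int, int], Verdict]


def majority(members: int, cluster_size: int) -> Verdict:
    """Voting module; assumed here to report TIE on an exact even split."""
    if 2 * members > cluster_size:
        return Verdict.HAVEQUORUM
    if 2 * members == cluster_size:
        return Verdict.TIE
    return Verdict.NOQUORUM


def tiebreaker_says_yes(members: int, cluster_size: int) -> Verdict:
    """Stand-in for a tie-breaking module that sides with this sub-cluster."""
    return Verdict.HAVEQUORUM


def decide(modules: Iterable[QuorumModule], members: int, cluster_size: int) -> Verdict:
    """Consult modules in order; the first answer other than TIE wins."""
    for module in modules:
        verdict = module(members, cluster_size)
        if verdict is not Verdict.TIE:
            return verdict
    return Verdict.NOQUORUM  # a TIE from the last module counts as NOQUORUM


print(decide([majority], 2, 4).value)
print(decide([majority, tiebreaker_says_yes], 2, 4).value)
```

With only the voting module, an even 2-of-4 split ends in NOQUORUM; adding a second module lets that tie be broken, while a clear minority is still refused by the first module and never reaches the tie-breaker.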

This is kind of cool, actually. My favorites for next implementation are the pingable and consult human modules.

And, of course, if your cluster loses quorum due to real server failures, there are always ways to work around it, with a little human intervention. One method is to tell Heartbeat to ignore quorum. Another is to tell Heartbeat to remove certain nodes from the cluster, after you verify that they're really dead. And, I'm sure that in a pinch, some new methods will be invented. And some of them might actually work ;-).

Anonymous said...

If you ask yourself "when is the last time at least half my machines in my data center couldn't communicate with the other half", then hopefully that's also a "pretty darn small" chance too. If not, there are well-known methods for making networks highly reliable too. +1