I wrote up a blurb on Slashdot today about my perspective on creating highly reliable, available, or fault-tolerant systems, and how you really need to choose which of the three you are going for in designing your compute environment.
I’ve also adjusted my opinion since I posted this. There are a couple more factors, initial expense and maintenance cost, that need to be weighed as well. They are normally the domain of the bean counter, but it’s important that the admin/systems architect be aware of what tradeoffs he is willing to make to bring those costs down, and where the sacrifice of reliability, availability, or fault-tolerance needs to be made. He also needs to apprise the bean counter of the importance of whatever is being given up.
Proper machine administration is a balance of:

* reliability
* availability
* fault-tolerance
Reliability generally refers to how often a piece of hardware’s components break. In our Compaq DL360 and DL380 rackmount units, the power supplies are the critical weakness: they are simply crap. The fans die, the supplies blow up. It’s terrible.
Availability, on the other hand, is about whether the system is up when you need it. Even a system built from components that are not highly reliable can still be highly available. For instance, 1U chassis systems with dual power supplies tend to stay available even though the individual supplies aren’t very reliable.
Fault-tolerance is how well the system handles unusual circumstances. A system may be highly available and reliable, but if it does not handle faults gracefully, you may have a problem when something unexpected happens.
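A quick back-of-the-envelope way to see how reliability and availability come apart is the classic formula: availability = MTBF / (MTBF + MTTR). The numbers in the sketch below are made up purely for illustration, but they show why a box that fails more often can still be up a larger fraction of the time if it recovers quickly.

```python
# Back-of-the-envelope: availability = MTBF / (MTBF + MTTR).
# All of the numbers below are invented for illustration.

def availability(mtbf_hours, mttr_hours):
    """Fraction of time the box is up, given mean time between failures
    and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A "reliable" box: fails once a year, but parts take a day to arrive.
print(availability(8760, 24))       # ~0.99727 -- about two and a half nines

# A flakier box: fails monthly, but a hot-swap part has it back in ten minutes.
print(availability(730, 10 / 60))   # ~0.99977 -- roughly three and a half nines
```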
Now, I realize the distinctions seem rather vague. That is intentionally so! But separating the question into three parts gives admins a better look at where a system’s weak points are.
I generally prefer 1U or 2U units to have a single, reliable power supply rather than dual power supplies with lower reliability. Because the power supply is more reliable, the machine tends to be more available, too. But a single supply gives you very poor fault tolerance in the power area, so you’ll normally need redundant systems to build a fault-tolerant environment. And in that case, availability may suffer: on the rare occasion that the highly reliable power supply does die and the first machine goes down, you eat some cutover time to the secondary system.
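To put some toy numbers on that tradeoff: everything in the sketch below (failure rates, repair time, cutover time) is an assumption I’m inventing for illustration, and it treats the dual supplies as failing independently, which is generous given shared fans, heat, and power feeds. The point is only how the three configurations trade maintenance churn against downtime.

```python
# Toy comparison of the power-supply tradeoff. Every figure is an assumption
# made up for illustration; substitute your own vendor's track record.

P_GOOD_PSU = 0.02     # assumed chance a "reliable" supply dies in a given year
P_CHEAP_PSU = 0.10    # assumed chance one of the flakier dual supplies dies
MTTR_HOURS = 4.0      # assumed hours to get a downed box repaired and back up
CUTOVER_HOURS = 0.25  # assumed hours to cut over to a redundant machine

configs = {
    # name: (expected PSU swaps per year, expected power-related downtime, hours per year)
    "single reliable PSU":          (P_GOOD_PSU,      P_GOOD_PSU * MTTR_HOURS),
    "dual cheap PSUs":              (2 * P_CHEAP_PSU, P_CHEAP_PSU ** 2 * MTTR_HOURS),
    "reliable PSU + redundant box": (P_GOOD_PSU,      P_GOOD_PSU * CUTOVER_HOURS),
}

for name, (swaps, downtime) in configs.items():
    print(f"{name:30s} ~{swaps:.2f} PSU swaps/yr, ~{downtime * 60:.1f} min of power downtime/yr")
```

With those made-up numbers, the dual-supply box halves the power downtime but generates ten times the maintenance events, and the redundant pair does best on both counts at the cost of a second machine — which is exactly the kind of tradeoff the bean counter needs to hear about.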
Really, you have to evaluate what’s most important for you. In the bank where I work, reliability is critical; fault-tolerance, somewhat less so. Availability is not so much of an issue. We close at 5:30, and really don’t do any business after that, so from 5:30 PM through 5:30 AM, our availability can be nonexistent for certain types of maintenance, and we’re just fine.
Like I said, the designations there are pretty arbitrary. But if you come at your machines from at least these three angles and evaluate by those criteria, your overall uptime goals can be met in a way that suits your organization.
Me, I’m just sick of cheap hardware. I enjoy 1U rackmount systems, but I’d much rather have two highly reliable 2U boxes than four inherently less-reliable 1U units. Of course, vendors come into play there… Sun makes some killer 1Us that never seem to die, while Compaq’s and Dell’s 1Us (in my subjective experience) have horrible failure rates, particularly in power supplies. Of course, with a sample size of only a few dozen of each type, it’s possible I’m drawing too much of a generalization from too limited a set 🙂