Reliability Considerations
Hardware freaks know how to minimize losses




by Larry Loen

Production depends on two things: speed and reliability. TLCers demand both. For whatever reason (glamour, perhaps?), performance has gotten the most formal attention, while reliability tends to come up only in the negative (e.g. “Bezerkeley’s down again.”). Here’s an attempt to remedy that.

Some of this will seem extreme. The reason is simple and maybe surprising: it comes as an initial shock to some that the baseline reliability is usually very high. Even the simplest of schemes gets one to something like 95 per cent of twenty-four hours a day, seven days a week operation. That last five per cent matters, because five per cent of a year is about eighteen 24-hour days. Our typical scheme (seen below) is, or readily can be made to be, more like 99 per cent, minus problems with the local crunching machines (on which more later). But getting that level of uptime requires attention to detail. Depending on individual circumstances, a variety of schemes may be required.
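To make that arithmetic concrete, here is a minimal sketch (Python, purely illustrative; the function name is mine) of how uptime percentages translate into lost crunching days per year:

  # Rough uptime arithmetic: how many 24-hour days of crunching does a given
  # availability level cost over a year?
  HOURS_PER_YEAR = 365 * 24

  def days_lost_per_year(availability):
      """Days of lost production per year at a given availability (0.0 to 1.0)."""
      return (1.0 - availability) * HOURS_PER_YEAR / 24.0

  for uptime in (0.95, 0.97, 0.99):
      print(f"{uptime:.0%} uptime loses about {days_lost_per_year(uptime):.0f} days per year")
  # 95% -> ~18 days, 97% -> ~11 days, 99% -> ~4 days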

Important Note. This analysis doesn’t care whether SetiQueue or some other program is used for the queuing/caching function. In ordinary computer terminology, a stash of WU is called a “queue”, and nothing more should be read into the term here.


Links:

A Typical Environment.

A Simple Setup (for contrast or special cases).

More on Brownouts.

Top Tips on avoiding outages.

Some of the Author’s Actual Failures.

Local Queue Reliability.

Cascaded Queues.

UPS Units.

An Optimal Setup.

Typical TLC Environment

Let’s start off with a diagram of a typical TLCer’s setup:

Here, we see a small farm of five machines all accessing some centralized local queuing machine. This has been the norm for at least two years.

The main reason this setup is typical is that everyone abhors outages at Berkeley. But, what is the reliability of this setup?

Informally, the equation is:

  P(crunch) = (1 - P(localq down))* 
      P(localqueue has at least one WU from Berkeley)

Of course, this is further moderated by the probability that a given machine in question is itself crunching, but we’ll ignore that, because all this only applies when one of the machines (the five at the bottom of the diagram) actually needs a new work unit.

The probability that the Local Q has at least one work unit from Berkeley has been a function of the depth of the queue and the local uptime. That is, if you could get several days’ supply, then P(localqueue has at least one WU from Berkeley) could be treated essentially as “one”. Therefore, most TLCers have pragmatically concentrated on getting several days’ production in the queue and doing what they could locally to control their own queue machine’s reliability. However, at some point (more than a week’s worth) a deep queue means that you aren’t contributing to the science and just racking up personal stats (see Miscellaneous topics in the Basic SETI FAQ for more on the lifetime of a WU).
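As a minimal sketch of how those two factors combine (the probabilities below are hypothetical, chosen only for illustration):

  def p_crunch_simple(p_localq_down, p_queue_has_wu):
      """Chance a cruncher finds work: the local queue must be up AND hold a WU."""
      return (1.0 - p_localq_down) * p_queue_has_wu

  # With a deep cache, P(queue has at least one WU) is effectively 1.0, so the
  # local queue machine's own uptime dominates the result.
  print(p_crunch_simple(0.01, 1.00))   # ~0.99
  print(p_crunch_simple(0.05, 0.98))   # ~0.93 -- a flakier queue machine hurts more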

In a separate and recent problem, SETI@home has been hitting Berkeley-imposed bandwidth caps now and then. This means that even if Berkeley is up, it may not be delivering work units at an essentially infinite rate.

Therefore, it appears plausible that we may face, for greater or lesser periods, a probability that Berkeley isn’t delivering enough work units to meet demand. If this happens, for long enough (probably about a week, based on experience to date), no local cache will be able to withstand this intermittent dearth. At least some of the time, the local queue will be empty. How we’ll deal with this, when and if it becomes commonplace, is not entirely resolved. More on that later.

Reliability of the Local Queue Machine

For now, let’s concentrate on a neglected factor — the local queue. The author has seldom seen this discussed. Yet, in any serious discussion on reliability, it must be a major factor.

Some interesting facts:

  • For all the energy focussed on the problem of outages at SETI@home at Berkeley, Berkeley’s site has been remarkably robust and available overall. The author was forced by unusual circumstances not to cache at all for an entire year. Naturally, I kept track of my losses. Including an entire week where someone cut a key communications cable on campus, my losses were only about three per cent. Since the cable cut incident, Berkeley has been very reliable. I have not calculated it so precisely, but it is probably 98 or 99 per cent available. We can therefore take 97 per cent as a worst case, at least in terms of having something up and awaiting connections.
  • I don’t know the failure rate or mode of the SETI servers. They do seem to be commercial-grade Unix boxes. Some failures are, or at least were, related to the databases at Berkeley.
  • Virtually all intermediate local queues are personal computers with single (non-RAIDed) hardfiles. The probability that this “box” is down is probably greater than that for Berkeley’s boxes. The only reason the local box would have better reliability would be network or database factors at Berkeley’s end. But it is just as likely that the downtime of the local box, including AC power problems, is simply neglected by most of us.

Some of this is psychological. When Berkeley is down, it is often down for only a few hours (almost no one notices, including “drone” machines running the screensaver that don’t do caching) or it is down for days (and that kind of outage burns into the brain of the dedicated cruncher). In fact, our cherished practice of keeping many days’ supply on hand relates to only a handful of longer outages. As is normal in such things, there are a great many small failures for every large one. Since this is all about fun, and it is, so far, as easy to overcome the big outages as the small ones, this doesn’t really matter. What matters is enjoying this. That includes not having to constantly deal with a lot of manual work (e.g. restarting after a long outage at Berkeley). If a deep cache means your machines run pretty much unattended through larger or smaller outages, why not do the caching thing, even if it objectively only covers a few per cent of losses? It at least gives you a chance to act instead of waiting on Berkeley to fix something on their end.

But, it is probably also true that most TLCers have actually lost more work units due to their local queuing machine being down than Berkeley being down. Indeed, caching ensures this is so.

In the end, if the local queue is down (for any combination of reasons) more than three per cent of the time, it is actually not “paying” for itself and you’re better off going direct to Berkeley. Or getting a better machine. Conclusion: the reliability of the local queue machine is more important than its performance. Any old slow machine will do (the demands, even with many machines at its back, are not all that great), but it must be reliable, which can be a problem with older machines, especially if they are built around old, failing hardfiles. Still, even a machine with an aging hardfile can work, provided it can be readily reinstalled and provided a nearby computer store has an inexhaustible supply of cheap drives.
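A rough sketch of that break-even point, using the 97 per cent worst-case Berkeley figure quoted earlier (the other numbers are assumptions, not measurements):

  P_BERKELEY_UP = 0.97  # worst-case availability quoted earlier in this article

  def caching_pays(p_localq_down):
      """True if crunching through the local queue beats going direct to Berkeley.

      Assumes a deep enough cache that the queue hides Berkeley outages entirely,
      so the comparison reduces to local-queue uptime versus Berkeley uptime.
      """
      return (1.0 - p_localq_down) > P_BERKELEY_UP

  print(caching_pays(0.02))  # True  -- a queue down 2% of the time still pays
  print(caching_pays(0.05))  # False -- down 5% of the time: go direct instead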

At our tolerance levels, an old machine with a relatively new drive is not a bad idea. So is simply borrowing a few seconds here and there from your oldest cruncher machine, or even your newest, provided it has a rock-solid record for uptime.

A couple of important caveats:

  • The smaller one’s farm, the less likely all this is true. If one has but one machine for both the queue role and the cruncher role, for instance, downtime is simply downtime. Other than a UPS and keeping the machine in good repair, we simply must accept the downtime of the ultimate cruncher machine, as there’s no way of working around its problems.
  • The smaller the farm, the more realistic the idea of retargeting machines away from the broken local queue machine to an alternative in time to avoid loss. Big farms, though, won’t have this ability one way or another.

For those with more than a handful of machines, there is an obvious improvement, though few seem to bother with it:

Here, one implements two local queue machines. That way, if you can’t retarget your cruncher machines, you suffer only the average of the two queues’ outages, provided you balance your machines between them. Even better, you suffer only half of the loss per outage.
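A small sketch of that argument (the downtime figures are illustrative only):

  def loss_single_queue(p_down):
      """Fraction of farm-hours lost when the whole farm sits behind one queue."""
      return p_down

  def loss_two_queues(p_down_a, p_down_b):
      """Fraction of farm-hours lost with the farm split 50/50 across two queues."""
      return 0.5 * p_down_a + 0.5 * p_down_b

  print(loss_single_queue(0.03))        # 0.03 -- and every outage idles the whole farm
  print(loss_two_queues(0.03, 0.03))    # 0.03 overall, but any one outage idles only half
  print(loss_two_queues(0.03, 0.01))    # 0.02 -- a more reliable second machine helps further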

Cascaded Local Queues

Another popular scheme has been to “cascade” queues. In terms of reliability, this is not as straightforward an idea as it appears:
Cascaded Queues

The reliability equation is:

  P(crunch) = (1 - P(localq down))* 
      P(localqueue has at least one WU from the next machine up)

In this case the P(localqueue) term can no longer be treated as “one.” The fact is, the next queue up is going to have a relatively small and finite number of units. It is true that not very many will be needed. Still, it is a smaller total “stash” and, in the bargain, probably a less reliable machine. The equation looks more like this:

  P(crunch) = (1 - P(localq down)) *
      P(localqueue has at least one unit left locally) *
      (1 - P(intermediateq down)) *
      P(intermediate queue has at least one WU from Berkeley)

Now, this is not quite sound, mathematically. There is a bit of time domain in here that is very messy to calculate. This was true in the original case, but since we could treat it as one, we could ignore it.

All these P(local queue has at least one unit from the above queue) are really the probability of how often the local queue will fail, over a relevant interval, to obtain that one more unit. So, it is a question of time and even the collective demand from the underlying crunchers themselves. Obviously, this is not an absolute probability, but a calculation of the expected number of failures over time. Suppose, for vast simplification, all five crunchers crunched in 3 hours. In that case, the P(intermediate q) is something like P(intermediate queue fails to get five more units in three hours). Quickly, you begin to think about models rather than formal math. But, there seems little doubt that the probability of each added machine being down contributes negatively to the reliability of the total system and, therefore, to total production.
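Ignoring the time-domain messiness and treating each layer’s uptime as an independent factor (a simplification the paragraph above warns about), a quick sketch shows how each added machine in the chain subtracts from overall reliability:

  def p_crunch_through_chain(p_down_per_layer, p_wu_available=1.0):
      """Rough chance of getting work through a chain of queue machines.

      p_down_per_layer: downtime probability of each machine between the
      cruncher and Berkeley, nearest first. Assumes independence.
      """
      p = p_wu_available
      for p_down in p_down_per_layer:
          p *= (1.0 - p_down)
      return p

  print(p_crunch_through_chain([0.01]))              # one local queue:      ~0.990
  print(p_crunch_through_chain([0.01, 0.02]))        # local + team queue:   ~0.970
  print(p_crunch_through_chain([0.01, 0.02, 0.02]))  # one cascade too many: ~0.951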

Returning to the added queue above, what do you gain over the simpler case of a local buffer to make it worth a real deduction in overall reliability? The names chosen give the show away. It is best, for cascaded queues, that the next queue up be a “team” queue. That is, a queue that one TLCer maintains on behalf of him/herself and several teammates. The Team Queue should be a machine with an excellent (fast) network connection. This is a factor proven valuable in “brownouts”. When we get to brownout cases, where SETI@home has a cap on total bandwidth, it seems to be an advantage to cascade to a Team Queue. This is so provided that queue has a high-speed connection, in order to maximize its use of the scarce Berkeley communications resource (or, put another way, one that gets maximum WU per unit of time). While some TLCers get carried away and do more cascading still, the above diagram really is just about the limit in terms of what can be justified in theory. We have seen some practical differences in implementation (e.g. Team Queues that cascade to each other), but if the “brownout” becomes the norm, we’ll eventually discover it is best to limit cascading queues and implement Team Queues with excellent connections to Berkeley.

As Simple as Possible

Another interesting cross-check is a setup the author has used. Forced by circumstances (firewalls, corporate policy) to avoid the popular queuing programs, this system has lost roughly one or two per cent compared with conventional caching, and has therefore done a per cent or two better than no caching at all. A few in corporate America might need to do this, so it is worth showing despite its limitations:

Here, one sees no queues at all. The queuing, such as it is, is done simply by having two copies of SETI@home executing for each CPU. While this will obviously fail if the outage is longer than several hours, a surprisingly high percentage of outages are measured in minutes or hours. It is not yet clear whether this will fare well in “brownout” conditions or not. Since getting units will be hard if brownouts dominate, it may be that such a system will need to drop back to one program copy per CPU and extensive restart logic to attempt to keep all the available CPUs going. But that is not this setup. What it really highlights is that, at some point, a more elaborate caching system is just that: more elaborate. There is no point in introducing more layers of cascading without good cause. Therefore, extreme uptime from added cascades is needed to justify adding more layers, as, at some point, this simple scheme will out-perform an overly elaborate queuing scheme.
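As a rough sketch of why the second copy buys outage coverage: it leaves roughly one work unit’s worth of crunch time in reserve, on average. The per-WU crunch time below is an assumed figure, not a measurement.

  HOURS_PER_WU = 4.0  # assumed crunch time per work unit on a typical machine

  def outage_covered_by_second_copy(outage_hours):
      """True if the spare copy's remaining work outlasts the outage (about one WU of buffer)."""
      return outage_hours <= HOURS_PER_WU

  print(outage_covered_by_second_copy(0.5))   # True  -- the common short glitch costs nothing
  print(outage_covered_by_second_copy(12.0))  # False -- a long outage still idles the CPU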

Power and the UPS

You need to pay some attention, too, to “power envelopes”. Each color in the charts is intended to represent some meaningful change in AC power. Even if the machines are in the same town, if they are physically dispersed well enough (or, at a business, have their own independent UPS or even power subsystem), they may be largely or entirely independent of each other in terms of power failures. In the real world, especially at the high 90s of reliability we’re discussing, power failure is a leading cause of outage.

In fact, the author is running a small three machine home farm off of a 950 VA UPS (which has just enough of a rating to carry the day). I’ve probably saved several days’ crunch, minimum, and that’s just the cases I know of. And, that’s times three machines.

If you are at work, the least glitch of AC power back home would cost roughly half a shift per machine (on average). A UPS doesn’t cover extended outages, but like most failures, there may be ten little glitches for every extended failure. In fact, the biggest benefit is for those couple-of-second outages. The author has typed right through them, a pleasurable experience on the whole. I still lose out, but the loss is measured in hours per year, even in rural Minnesota.

Perhaps the added headaches you save (e.g. the equipment protection you get against spikes, outages, and power company brownouts) with such a unit are the greatest benefit of all. Not to mention time saved not reloading the OS, not recovering your personal data, etc. If it is a pure cruncher, you can decide whether a UPS is cost effective (for my setup, it is 50 dollars a machine, not an inconsiderable cost despite saving several days’ crunch a year). If the machine has your personal data on it as well, a UPS for a 24-by-7 machine comes highly recommended by me, at least. One never seems to back everything up, and you put that personal data at a lot more risk when you leave your system unattended for whole shifts at a time.
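A back-of-the-envelope sketch of that trade-off; every input here is an assumption for illustration except the 50-dollar figure quoted above:

  UPS_COST_PER_MACHINE = 50.0   # dollars, as quoted above
  GLITCHES_PER_YEAR = 10        # assumed: "ten little glitches for every extended failure"
  HOURS_LOST_PER_GLITCH = 4.0   # assumed: roughly half a shift if nobody is home to restart

  def cruncher_hours_saved(machines):
      """Rough estimate of cruncher-hours a UPS saves per year across a small farm."""
      return machines * GLITCHES_PER_YEAR * HOURS_LOST_PER_GLITCH

  print(cruncher_hours_saved(3))     # ~120 hours, i.e. several days' crunch
  print(UPS_COST_PER_MACHINE * 3)    # ~150 dollars to protect a three-machine farm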

Top Tips for Avoiding Outages

  • Implement the correct caching strategy for current conditions. That is “deep caches” when bandwidth is plentiful, Team Queues when bandwidth is sparse.
  • Get a UPS for your home units.
  • If you have enough machines, look at multiple local queues (parallel as in the second diagram, not cascaded). Perhaps these should hook up with multiple Team Queues to mitigate your exposure to any one machine. That first local queue machine is your weakest link and its downtime is magnified by the number of machines it “feeds”.
  • If you have everything under one roof, consider running your queuing function on one of your crunchers (use your slowest box). The actual time spent in uploading and downloading is not all that consequential. A very fast machine would do about 7 WU per day. If five solid minutes of CPU time were needed to upload/download, that is 35 minutes per day. In twenty four hours, this is two per cent. But, of course, it is nothing like solid time. So, getting rid of a machine increases reliability and long term production with immeasurable loss for running the queuing function. If a would-be dedicated local cache machine has no other use, collapsing the function to some cruncher saves you money in electricity (an overlooked cost).
  • Be sure to “register” with the Team Queues even if you aren’t using them now. Currently, they all use the SetiQueue program. As things have worked to this writing, Berkeley must be available the “first time” (that is, there is no actual queuing on first access). Thus, unless you use the queue at least once, and arrange with the administrator to stay authorized (SetiQueue tends to “auto-expire” registrations if you don’t use a given queue), you won’t be able to switch over during an outage.
  • Pay attention to “geography”. If you have machines in multiple buildings (multiple buildings at work, or some at home and some at work), make sure that you exploit any advantages, however slight, in the AC power supply game. Point the right machines at the right local queues. And think about the parallel game above. Remember that few power failures take out an entire town. In a corporate setting, watch building boundaries, as your site may do things like take down all the power for one building over a weekend with surprising frequency. Guarding against a three per cent loss means that even occasional “takedowns” of this sort are worth attention.
  • Don’t cascade local queue machines without good reason. Ideally, there should be one machine between you and Berkeley. The main exception would be in brownout conditions, getting a higher bandwidth or more reliable connection from a Team Queue. But, that is exactly one more in the cascade.
  • Discreetly inquire into the Team Queues you use. Are they cascading unnecessarily? Use the Team Queue closest to Berkeley.
  • Figure out (if you can) how to run “diskless”. This is a leading-edge idea amongst home crunchers. The hardfile is a leading cause of failure after electrical power. In truth, this probably won’t matter much in terms of overall reliability, but if we can master this, it will reduce power draw (making a UPS stretch farther) and improve the reliability of individual cruncher boxes just a bit. If this is too much hassle, have a clunky old hardfile on hot standby (Red Hat Linux is a workable choice for me). The idea is that Red Hat seems to boot almost anything, and you can get back on the air, crunching, while you figure out what else went wrong.

More on Brownouts.

As discussed above, a new factor is that SETI@home has a bandwidth cap. Thus, P(local queue has at least one WU from Berkeley) may not be one. There are several plausible ideas that won’t work if the dearth is permanent or even comes and goes for a substantial fraction of the total time.

One tempting idea is to have more than one local queue. This adds another intermediary between you and SETI@home. However, if the fundamental problem is that SETI@home isn’t delivering work units, this will really have no effect, other than to introduce further unreliability into your system.

Also ineffective would be trying to increase the depth of the queues you have. This seems sensible until you realize: if you can’t get WU, then trying to get more WU isn’t a promising answer. We have talked about someone getting more WU than another, but actual experience says SETI@home is so large that if total WUs are capped, everyone will suffer about equally.

There are more effective choices.

One is to bravely do nothing. Ultimately, if there are more machines than work units, we’re all going to face outages. Simply facing facts may be almost as effective as anything else. Poof has observed that if there is an overall cap on bandwidth (and, hence, on available work units), then someone is going to drop out until demand balances supply. My guess is that demand will still slightly exceed supply. Your outage, then, should be proportional to the ratio of the oversupply of participating machines to the actual WU supply. But TLCers won’t accept such an answer easily!
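A minimal sketch of that “do nothing” estimate, with made-up supply and demand figures:

  def expected_idle_fraction(wu_demanded_per_day, wu_available_per_day):
      """Fraction of time an average cruncher idles if the shortfall is shared evenly."""
      if wu_demanded_per_day <= wu_available_per_day:
          return 0.0
      return 1.0 - (wu_available_per_day / wu_demanded_per_day)

  print(expected_idle_fraction(100_000, 100_000))  # 0.0  -- supply meets demand
  print(expected_idle_fraction(110_000, 100_000))  # ~0.09 -- about 9% idle time for everyone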

A better approach is to note that bandwidth dearths may not be a permanent state of affairs. This has been the case at this writing. This is good news, because it means we don’t have to rethink things much. The right strategy in this environment is to point at a Team Queue of some kind (regardless of what method you use to cache) and feed off of that. As discussed before, you may want to aim a larger farm at more than one such queue, in parallel, to increase your overall reliability (or, at least, average out your outages better). You could even abandon your own local queue and point directly at the Team Queue. A Team Queue approach will allow us to concentrate on a common strategy and manage bandwidth better. For instance, a Team Queue will be set up to find those times of day where just those few extra WU will be available and snare them. Moreover, if the brownout lasts long enough, queues of all kinds will begin to dry up. Sharing queues means maximizing the number of active machines. As we approach a full-time dearth, crunching a unit becomes a better response than holding on to a unit. Neither your cruncher nor your neighbor’s should sit idle while someone else has WU sitting about. Under such constraints, a deep cache represents a production loss, not a gain, for the team. But, with Team Queues, we can oscillate between deep and shallow caches quite naturally if the dearth situation comes and goes (as it so far has).

Those running the Team Queues will bear the responsibility to maximize the probability there are work units in the queue. We will learn where “the breaks in the action are” (that is, slivers of available bandwidth) and stagger the attempts our public queues make to contact Berkeley to maximize our chances of getting work units.

The Team Queue owners will also need some idea of what WU cache their constituents require. This may take some e-mail.

This sketch of ideas is based on actual experience to date. When we first hit this “brownout” problem, second and third shift US Pacific time was supposed to remain uncapped. In practice, we merely found our odds of getting work elevated a bit “offshift”. Near the cap, other limits (e.g. total connection count in the server) seemed to be hit as well. But, if Berkeley gets its act together, then public queues are enough. We’ll figure out who has the highest-speed links to Berkeley, run our public queues through them, and so maximize our probability of getting a unit.

A final effective choice, at least if the brownout becomes permanent, is to run more than one Distributed Computing project on the same machine. This can be done (in many if not most cases) so that the time is divided equally between SETI@home and the other project, or with SETI@home dominating. Regardless, though this approach cuts down your SETI contribution some, it ensures your machine contributes to some project at all times. This more drastic solution will only make sense if and when it becomes clear that the available bandwidth can only deliver a fixed number of units, dictating a certain amount of downtime for everyone. At this writing, we haven’t gotten there yet.

While Team Queues adapt well, permanent brownouts would mean that all queues will dry up, at least sometimes, and we’ll need to figure out how to restart crunchers automatically as WU dribble in. See the self-caching FAQ for ideas.

Optimal Setup for the Present

Given all the factors (traditional outages, brownouts, ease of operation), here is what the author would consider an optimum setup for the current circumstances:

Here, one sees all the key factors in play:

  • UPS is used to keep local cruncher losses minimal.
  • A Team Queue is used, allowing dynamic adaptation to all conditions at Berkeley.
  • Differing AC power domains are exploited.
  • Excessive cascading is avoided, so that no added reliability losses are introduced.

This setup does rely on the Team Queue being a good, highly reliable machine. Its downtime (or its probability of drying up in brownouts) will determine, almost entirely, the production losses for this setup. Putting a local queue in between adds its own unreliability, but would give some “buffer” against the Team Queue machine’s problems and, perhaps, certain classes of network failure. However, if the Team Queue is reliable and the internet is normal, a local machine subtracts reliability, it doesn’t add to it. Simpler still (and likely as effective) would be to run two SETI clients per CPU on the cruncher machines, as hinted at in the text for Machine Room 2 above. This “two per local machine” buffer probably is enough to cover most Team Queue losses, given the likely reliability of a Team Queue.

In practice, adding a local queue over and above the Team Queue shown here is probably a question of how likely there will be DNS type problems with the local queue requiring more than a WU or two’s worth of local buffer. We have seen some problems with DNS resolution here and there that can take a day or two to solve. But, that should be solvable with absolute IP addresses, which again avoids the added reliability questions from an added local queue.

The remaining reason to add a local queue to the above is if one has an intermittent dialup access to the Internet. This especially applies to manual dialup, not autodial situations. Here, the loss of a percent or two, regrettable as that is in the abstract, overcomes the practical issue of local “outage” from not having continuous access to the ‘net.

Some Actual (Local) Failures

In the interests of making this a bit more real, the author will describe some of the actual failures that are readily remembered. This first set is entirely at home.

  • Failure of a Tyan Tiger dual motherboard. Approximate downtime: four months. Cause: an early revision of the board did not work with the Crucial memory. Losses: since I purchased a K7S5A to tide me over, the loss was one 1 GHz Morgan CPU for four months.
  • Power failures. The UPS has done very well, but I lose about eight to twelve hours per year, at home, due to power failures that exceed the UPS’ time scale. Without it, however, my losses would probably be ten times that, to say nothing of me probably being too nervous to run the machines while at work.
  • Setup errors in running the clients. Occasionally, the various forms of sneakernetting I do catch up with me. I do not have a good estimate for this, but it is probably several days per year. This is much less frequent nowadays simply because I tend, one way or another, to have two things available for each CPU. I most recently lost about 8 SETI hours on one of two dualie CPUs because of a trifling detail in how “nohup” works in Linux; only one of the two CPUs was crunching instead of both.
  • Hardware upgrades. This probably happens a half dozen times a year at a cost of four hours each, so that’s about a day per year. Software upgrades do not usually cause me much problem, as I seldom (if ever) upgrade the operating system except in conjunction with a hardware upgrade.
  • Rebooting Windows. While this happens with some regularity (at least weekly), it is on my slowest machine. It is caused by some ill-understood “leak” in the Genome@Home client (so I would not have this problem if the machine were running SETI). In any case, it is in the five-minutes-per-week range (if it were SETI) and so can be ignored.

At work, the leading cause of downtime is some sort of site shutdown (in whole or, more often, in part). For instance, I have occasional, rotating access to many machines in several buildings. When any one of these buildings is powered down for some repair (e.g. over Labor Day weekend), I lose production from all such machines for that time.