by Larry Loen
Production depends on two things: speed and reliability. TLCers demand both.
For whatever reason (glamour perhaps?) performance has gotten the most formal attention.
Reliability tends to come out negatively (e.g. “Bezerkeley’s down again.”). Here’s an
attempt to remedy that.
Some of this will seem extreme. However, the reason is
simple and maybe surprising. It comes as a shock to some
that the baseline reliability is usually very high.
Even the simplest of schemes gets one to something like 95 per cent
of twenty four hours a day, seven days a week operation. The last five per cent
matters because five per cent of a year is about eighteen 24 hour days. Our typical scheme (seen below) is,
or can readily be made, more like 99 per cent available, minus problems with the local
crunching machines (on which more later). But, getting that level of uptime requires attention
to detail. Depending on individual circumstances, a variety of schemes may be required.
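To put numbers on those percentages, here is a minimal Python sketch that converts an availability figure into days lost per year. The function name is made up for illustration and a 365 day year is assumed.

```python
DAYS_PER_YEAR = 365  # assumption: ignore leap years


def downtime_days_per_year(availability):
    """Days per year lost at a given availability (a fraction between 0 and 1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie between 0 and 1")
    return (1.0 - availability) * DAYS_PER_YEAR


# The figures discussed in the text: 95, 97 and 99 per cent.
for availability in (0.95, 0.97, 0.99):
    lost = downtime_days_per_year(availability)
    print(f"{availability:.0%} available: about {lost:.2f} days down per year")
```

The gap between 95 and 99 per cent is roughly two weeks of crunching per machine per year, which is why the last few per cent get so much attention here.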
Important Note. This analysis doesn’t care whether SetiQueue or some other
program is used for the queuing/caching function. Ordinary computer terminology will call
a stash of WU a “queue” and nothing should be implied from that.
Let’s start off with a diagram of a typical TLCer’s setup:
Here, we see a small farm of five machines all accessing some
centralized local queuing machine. This has been the norm for at least two years.
The main reason this setup is typical is that everyone abhors outages at Berkeley.
But, what is the reliability of this setup?
Informally, the equation is:
P(crunch) = (1 - P(localq down))*
P(localqueue has at least one WU from Berkeley)
Of course, this is further moderated by the probability a given machine in question is
itself crunching, but we’ll ignore that, because all this only applies when one of the
machines (the five at the bottom of the diagram), actually needs a new work unit.
The probability that the Local Q has at least one work unit from Berkeley has been a function
of the depth of the queue and the local uptime. That is, if you could get several days’ supply,
then P(localqueue has at least one WU from Berkeley) could be treated essentially as “one”.
Therefore, most TLCers have pragmatically concentrated on getting
several days’ production in the queue and doing what they could locally to control their own
queue machine’s reliability. However, at some point (more than a week’s worth) a deep queue means that you
aren’t contributing to the science and just racking up personal stats (see Miscellaneous
topics in the Basic SETI FAQ for more on the lifetime of a WU).
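The informal equation above can be sketched in a few lines of Python. This is only an illustration: the function and parameter names are invented here, the two terms are assumed to be independent, and the sample probabilities are made-up figures, not measurements.

```python
def p_crunch(p_localq_down, p_queue_has_wu=1.0):
    """P(crunch) = (1 - P(localq down)) * P(localqueue has at least one WU).

    p_queue_has_wu defaults to 1.0, the "deep cache" case where the
    second term can be treated as one.
    """
    for p in (p_localq_down, p_queue_has_wu):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    return (1.0 - p_localq_down) * p_queue_has_wu


# Deep cache: only the local queue machine's downtime matters.
print(p_crunch(p_localq_down=0.01))

# Shallow cache that occasionally runs dry (0.98 is an illustrative figure).
print(p_crunch(p_localq_down=0.01, p_queue_has_wu=0.98))
```

With a deep cache the whole result rides on the first term, which is why the local queue machine gets so much attention in what follows.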
In a separate and recent problem, SETI@home has been hitting Berkeley-imposed
bandwidth caps now and then. This means that even if Berkeley is up, it may not
be delivering work units at an essentially infinite rate.
Therefore, it appears plausible that we may face, for greater or lesser periods, a probability
that Berkeley isn’t delivering enough work units to meet demand. If this happens, for long
enough (probably about a week, based on experience to date), no local cache will be able to
withstand this intermittent dearth. At least some of
the time, the local queue will be empty. How we’ll deal with this, when and if it becomes
commonplace, is not entirely resolved. More on that later.
For now, let’s concentrate on a neglected factor — the local queue.
The author has seldom seen this discussed.
Yet, in any serious discussion on reliability, it must be a major factor.
Some interesting facts:
- For all the energy focussed on the problem of outages at SETI@home at Berkeley,
Berkeley’s site has been remarkably robust and
available overall. The author was forced by unusual circumstances not to cache at all for an entire
year. Naturally, I kept track of my losses. Including an entire week where someone cut a
key communications cable on campus, my losses were only about three per cent. Since the cable
cut incident, Berkeley has been very reliable. I have not calculated it so precisely, but it is
probably 98 or 99 per cent available. We can therefore take 97 per cent as a worst case,
at least in terms of having something up and awaiting connections.
- I don’t know the failure rate or mode of the SETI servers. They do seem to be commercial
grade Unix boxes. Some failures are or at least were related to the data bases at Berkeley.
- Virtually all intermediate local queues are personal computers with single (nonRAIDed) hardfiles.
The probability that this “box” may be down is probably greater than that for Berkeley’s boxes. The only
reason the local box would have a better reliability would be network or database
factors at Berkeley’s end. But, it is just
as likely that the downtime from the local box, including AC power problems,
is simply neglected by most of us.
Some of this is psychological. When Berkeley is down it is often down for only a few hours
(almost no one notices, including “drone” machines running the
screensaver that don’t do caching) or it is down for days (and it kind of burns into the brain of the dedicated
cruncher). In fact, our cherished practices of keeping many days’ supply on hand relates to
only a handful of longer outages. As is normal in such things, there are a great many small failures for
every large one. Since this is all about fun, and it is so far as easy to
overcome the big outages as the small, this doesn’t really matter.
What matters is enjoying this. That includes not having to constantly deal with a lot of
manual work (e.g. restarting after a long outage at Berkeley).
If a deep cache means your machines run pretty much unattended, for larger
or smaller outages, why not do the
caching thing, even if it objectively only covers a few per cent of losses? It at least gives
you a chance to act instead of waiting on Berkeley to fix something on their end.
But, it is probably also true that most TLCers have actually lost more work units due to their local
queuing machine being down than Berkeley being down. Indeed, caching ensures this is so.
In the end, if the local queue is down (for any combination of reasons) more than three per
cent of the time, it is actually not “paying” for itself and you’re better off going direct to
Berkeley. Or, getting a better machine. Conclusion: The reliability of the local queue machine
is more important than its performance. Any old slow machine will do (the demand, even with
many machines behind it, is not all that great), but it must be reliable, which can be
a problem with older machines, especially if they are implemented with old failing hardfiles.
Still, even a machine with an aging hardfile can work, provided it can be readily reinstalled
and provided a nearby computer store has an inexhaustible supply of cheap drives.
At our tolerance levels, an old machine with a relatively new drive is not a bad idea.
So is simply borrowing a few seconds here and there from your oldest cruncher machine, or
even your newest, provided it has a rock-solid record for uptime.
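The "paying for itself" test can be written down as a tiny Python check. It is a sketch under stated assumptions: the 97 per cent figure is the worst case for Berkeley quoted earlier, the deep cache is assumed never to run dry, and the names are invented for illustration.

```python
BERKELEY_WORST_CASE_AVAILABILITY = 0.97  # worst-case figure quoted in the text


def local_queue_pays_off(p_localq_down,
                         p_direct=BERKELEY_WORST_CASE_AVAILABILITY):
    """True if crunching through a local queue beats going direct to Berkeley.

    Assumes a deep cache that always holds work, so the cached setup is
    available exactly when the local queue machine is up.
    """
    return (1.0 - p_localq_down) > p_direct


# A queue machine that is down one per cent of the time is worth having;
# one that is down five per cent of the time is not.
print(local_queue_pays_off(0.01))
print(local_queue_pays_off(0.05))
```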
A couple of important caveats:
- The smaller one’s farm, the less likely all this is true. If one has but one machine for
both the queue role and the cruncher role, for instance, downtime is simply downtime. Other than
UPS and keeping it in good repair, we simply must accept the downtime of the ultimate
cruncher machine, as there’s no way of working around its problems.
- The smaller the farm, the more realistic the idea of retargeting machines away from the
broken local queue machine to an alternative in time to avoid loss. Big farms, though, won’t have this ability
one way or another.
For those with more than a handful of machines, there is an obvious improvement, though
few seem to bother with it:
Here, one implements two local queue machines. That way, if you can’t retarget your cruncher machines,
you suffer only the average of the two queues’ outages if you balance your machines
between them. Even better, you suffer only half of the loss, per outage.
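A small Python sketch shows both effects of splitting a farm across parallel queues. It assumes each cruncher depends on exactly one queue and cannot be retargeted; the downtime figures and function names are illustrative only.

```python
def expected_loss(queue_downtimes, machines_per_queue):
    """Fraction of total crunch time lost, weighting each queue's downtime
    by the number of machines that depend on it."""
    total = sum(machines_per_queue)
    lost = sum(d * n for d, n in zip(queue_downtimes, machines_per_queue))
    return lost / total


def worst_single_outage(machines_per_queue):
    """Largest share of the farm idled when one queue machine goes down."""
    return max(machines_per_queue) / sum(machines_per_queue)


# Ten crunchers behind one queue that is down two per cent of the time.
print(expected_loss([0.02], [10]), worst_single_outage([10]))

# The same ten crunchers balanced across two queues (two and four per cent down).
print(expected_loss([0.02, 0.04], [5, 5]), worst_single_outage([5, 5]))
```

The long-run loss is the average of the two queues' downtimes, while any single outage idles only half the farm.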
Another popular scheme has been to “cascade” queues. In terms of reliability,
this is not as straightforward an idea as it appears:
The reliability equation is:
P(crunch) = (1 - P(localq down))*
P(localqueue has at least one WU from the next machine up)
In this case the P(localqueue) term can no longer be treated as “one.” The fact is,
the next queue up is going to have a relatively small and finite number of units. It is true
that not very many will be needed. Still, it is a smaller total “stash” and, in the bargain,
probably a less reliable machine. The equation looks more like this:
P(crunch) = (1 - P(localq down))*
( P(localqueue has at least one unit left locally) +
(1 - P(intermediateq down))*
P(intermediate queue has at least one WU from Berkeley))
Now, this is not quite sound, mathematically. There is a bit of time domain in here that
is very messy to calculate. This was true in the original case, but since we could treat it
as one, we could ignore it.
All these P(local queue has at least one unit from the above queue) are really the
probability of how often the local queue will fail, over a relevant interval, to obtain that one more
unit. So, it is a question of time and even the collective demand from the underlying crunchers
themselves. Obviously, this is not an absolute probability, but a calculation of the
expected number of failures over time. Suppose, for vast
simplification, all five crunchers crunched in 3 hours. In that case, the P(intermediate q) is something
like P(intermediate queue fails to get five more units in three hours). Quickly, you begin
to think about models rather than formal math. But, there seems little doubt that the probability of each
added machine being down contributes negatively to the reliability of the total
system and, therefore, to total production.
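The closing point, that every added machine drags the total down, can be illustrated with a deliberately crude Python sketch. It assumes the queues fail independently and, pessimistically, that no queue holds any buffer at all, so it ignores exactly the time-domain effects described above. Treat it as a bound on the damage, not a model; the names and figures are invented.

```python
from math import prod


def chain_availability(downtimes):
    """Availability of queue machines in series.

    Assumes independent failures and no buffering, so work flows only
    when every machine in the chain is up at the same moment.
    """
    return prod(1.0 - d for d in downtimes)


# One local queue, down two per cent of the time.
print(chain_availability([0.02]))

# The same local queue behind an intermediate queue that is down three per cent.
print(chain_availability([0.02, 0.03]))
```

Real stashes soften this considerably, but the direction of the effect is the same: each layer can only subtract.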
Returning to the added queue above, what do you gain over the simpler case of a local buffer to
make it worth a real deduction in overall reliability? The names chosen give the show away.
It is best, for cascaded queues, that the next queue up be a “team” queue. That is, a queue
that one TLCer maintains on behalf of him/herself and several teammates. The Team Queue should
be a machine with an excellent (fast) network connection. This is a factor proven valuable in
“brownouts”. When we get to brownout cases, where SETI@home has a cap on total
bandwidth, then it seems to be an advantage to cascade to a Team Queue. This is so provided that Queue has
a high speed connection in order to maximize its use of the scarce Berkeley communications
resource (or, another way, one that gets maximum WU per unit of time). While some TLCers get carried away
and do more cascading still, the above diagram really is just about the limit in terms of what can be justified
in theory. We have seen some practical differences in implementation
(e.g. Team Queues that cascade to each other), but if the “brownout” becomes the norm, we’ll
eventually discover it is best to limit cascading queues and implement Team Queues with
excellent connections to Berkeley.
Another interesting cross-check is a setup the author has used. Forced by circumstances (firewalls,
corporate policy) to avoid the popular queuing programs, this system has had an approximate one
or two per cent loss over conventional caching and a per cent or two better, therefore, than no
caching at all. A few in corporate America
might need to do this, so it is worth showing despite its limitations:
Here, one sees no queues at all. The queuing, such as it is, is done simply by having
two copies of SETI@home executing for each CPU. While this will obviously fail if the outage is
longer than several hours, a surprisingly high percentage of outages are measured in minutes or
hours. It is not yet clear whether this will fare well in “brownout” conditions or not. Since getting units
will be hard if brownouts dominate, it may be that such a system will need to drop back to one program copy per CPU and
extensive restart logic to attempt to keep all the available CPUs going. But, that is not this
setup. What it really highlights is that, at some point, a more elaborate caching system is just. . .more elaborate. There
is no point in introducing more layers of cascading without good cause. Therefore, extreme uptime from added
cascades is needed to justify adding more layers as, at some point, this simple scheme will
out-perform an overly elaborate queuing scheme.
You need to pay some attention, too, to “power envelopes”. Each color in the charts is
intended to represent some meaningful change in AC power. Even if the machines are in the
same town, if they are physically dispersed well enough (or, at a business, have their own
independent UPS or even power subsystem), they may be largely or entirely independent of each
other in terms of power failures. In the real world, especially at the high 90s of reliability
we’re discussing, power failure is a leading cause of outage.
In fact, the author is running a small three machine home farm off of a 950 VA UPS (which has just
enough of a rating to carry the day). I’ve probably saved several days’ crunch, minimum, and that’s just the cases I know
of. And, that’s times three machines.
If you are at work, the least glitch of AC power back home would cost
roughly half a shift per machine (on average). It doesn’t cover extended outages, but like
most failures, there may be ten little glitches for every extended failure. In fact, the
biggest benefit is for those couple of second outages. The author has typed right through them,
a pleasurable experience on the whole. I still lose out, but the loss is measured in hours
per year, even in rural Minnesota.
Perhaps the added headaches you save (e.g. the equipment protection you get against
spikes, outages, and power company brownouts) in such a unit is the greatest benefit of all. Not to
mention time saved not reloading the OS, not recovering your personal data, etc. If it
is a pure cruncher, you can decide if a UPS is cost effective (for my setup, it is
50 dollars a machine, not an inconsiderable cost despite saving several days’ crunch a
year). If it has your personal data on it as well, the UPS for a 24 by 7 machine
comes highly recommended by me, at least. One never seems to back everything up. You
put that personal data at a lot more risk when you leave your system unattended for
whole shifts at a time.
- Implement the correct caching strategy for current conditions. That is “deep caches”
when bandwidth is plentiful, Team Queues when bandwidth is sparse.
- Get a UPS for your home units.
- If you have enough machines, look at multiple local queues (parallel as in the second
diagram, not cascaded). Perhaps these should hook up with multiple Team Queues to
mitigate your exposure to any one machine. That first local queue machine is your weakest link
and its downtime is magnified by the number of machines it “feeds”.
- If you have everything under one roof, consider running your queuing function on one
of your crunchers (use your slowest box). The actual time spent in uploading and downloading
is not all that consequential. A very fast machine would do about 7 WU per day. If five solid
minutes of CPU time were needed to upload/download each one, that is 35 minutes per day. In twenty four hours, this is
about two and a half per cent. But, of course, it is nothing like solid time. So, getting rid of a
machine increases reliability and long term production with immeasurable loss for running
the queuing function. If a would-be dedicated local cache machine has no
other use, collapsing the function to some cruncher saves you money
in electricity (an overlooked cost).
- Be sure to “register” with the Team Queues even if you aren’t using them now. Currently,
they all use the SetiQueue program. As it has worked to this writing, the “first time”,
Berkeley must be available (that is, no actual queueing at first access). Thus, unless
you use the queue at least once, and arrange with the administrator to stay authorized (SetiQueue
tends to “autoexpire” if you don’t use a given queue), you won’t be able to switch on an outage.
- Pay attention to “geography”. If you have machines in multiple buildings (multiple
buildings at work or some at home and some at work), make sure that you exploit any advantages,
however slight, in the AC power supply game. Point the right machines at the right
local queues. And, think about the parallel game above.
Remember that few power failures take out an
entire town. In a corporate setting, watch building boundaries as your site may do things like
take down all the power for one building over a weekend with surprising frequency. Guarding against
a three per cent loss means that even occasional “takedowns” of this sort are worth attention.
- Don’t cascade local queue machines without good reason. Ideally, there should be one
machine between you and Berkeley. The main exception would be in brownout conditions, getting a
higher bandwidth or more reliable connection from a Team Queue. But, that is exactly one more
in the cascade.
- Discreetly inquire into Team Queues you use. Are they cascading unnecessarily? Use the Team Queue
closest to Berkeley.
- Figure out (if you can) how to run “diskless”. This is a leading-edge idea amongst home
crunchers. The hardfile is a leading cause of failure after electrical power. In truth, this
probably won’t matter much in terms of overall reliability, but if we can master this, it will
reduce power (making UPS stretch farther) and improve reliability of individual cruncher boxes
just a bit. If this is too much hassle, have a clunky old hardfile on hot standby (Red Hat Linux
is a workable choice for me). The idea being Red Hat seems to boot almost anything and you can
get back on the air, crunching, while you figure out what else went wrong.
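The overhead arithmetic in the recommendation about running the queuing function on a cruncher is easy to check. A minimal Python sketch, using that bullet's own figures (7 WU per day, five minutes of transfer per WU) as a deliberately pessimistic input; the function name is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def queue_overhead_fraction(wu_per_day, cpu_minutes_per_wu):
    """Share of a cruncher's day spent on upload/download, if every
    transfer minute were solid CPU time (a worst-case assumption)."""
    return wu_per_day * cpu_minutes_per_wu / MINUTES_PER_DAY


overhead = queue_overhead_fraction(wu_per_day=7, cpu_minutes_per_wu=5)
print(f"worst-case overhead: {overhead:.1%} of the day")
```

Since transfers are mostly waiting on the network rather than burning CPU, the real cost of hosting the queue on a cruncher should be well below this ceiling.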
As discussed above, a new factor is that SETI@home has a bandwidth cap. Thus,
P(local queue has at least one WU from Berkeley) may not be one.
There are several plausible ideas that won’t work if the dearth is
permanent or even comes and goes for a substantial fraction of the total time.
One tempting idea is to have more than one local queue.
This adds another intermediary between you and SETI@home. However, if the fundamental problem
is that SETI@home isn’t delivering work units, this will really have no effect, other than to
introduce further unreliability into your system.
Also ineffective would be trying to increase the depth of the queues you have. This
seems sensible until you realize: if you can’t get WU, then trying to get more WU isn’t a promising answer.
We have talked about someone getting more WU than another, but actual experience says SETI@home
is so large that if total WUs are capped, everyone will suffer about equally.
There are more effective choices.
One is to bravely do nothing. Ultimately, if there are more machines than work units, we’re
all going to face outages. Simply facing facts may be almost as effective as anything else. Poof
has observed that if there is an overall cap on bandwidth (and, hence, on available work units),
someone is going to drop out until the demand balances supply. My guess is that demand will
still slightly exceed supply. Your outage, then, should come to roughly the ratio of the oversupply of
participating machines to the actual WU supply. But, TLCers won’t accept such an answer easily!
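A rough Python sketch of that "do nothing" outcome follows. It assumes the capped supply is shared evenly among all participants, as experience so far suggests. The text states the ratio against supply; this sketch divides by demand instead so the result reads directly as the share of machine time left idle. For a small excess the two are nearly the same. The names and numbers are illustrative only.

```python
def idle_fraction(demand_wu_per_day, supply_wu_per_day):
    """Share of total machine time left idle when demand exceeds a capped supply.

    Assumes every participant suffers equally from the shortfall.
    """
    if demand_wu_per_day <= 0:
        raise ValueError("demand must be positive")
    if demand_wu_per_day <= supply_wu_per_day:
        return 0.0  # supply covers demand, nobody waits
    return (demand_wu_per_day - supply_wu_per_day) / demand_wu_per_day


# Demand five per cent above a capped supply (made-up figures).
print(f"{idle_fraction(105, 100):.1%} of crunch time lost")
```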
A better approach is to note that bandwidth dearths may not be a permanent state of affairs. This has been the case at this
writing. This is good news, because it means we don’t have to rethink things much. The right
strategy in this environment is to point at a Team Queue of some kind (regardless of what method
you use to cache) and feed off of that. As discussed before, you may want to aim a larger farm
at more than one such queue, in parallel, to increase your overall reliability (or, at least, average out your
outages better). You could even abandon your own local queue and point directly at the Team Queue.
A Team Queue approach will allow us to concentrate on a common strategy and manage bandwidth better.
For instance, a Team Queue will be set up to find those times of day where just those few extra WU will
be available and snare them. Moreover, if the brownout lasts long enough, queues of all kinds will
begin to dry up. Sharing queues means maximizing the number of active machines. As we approach
a full time dearth, crunching a unit becomes a better response than holding on to a unit.
You or your neighbor’s cruncher should not sit idle while someone else has WU sitting about. In such
constraints, a deep cache represents a production loss, not a gain, for the team. But, with Team Queues,
we can oscillate between deep and shallow caches quite naturally if the dearth situation comes and goes
(as it so far has).
Those running the Team Queues will bear the responsibility to maximize the probability there are work
units in the queue. We will learn where “the breaks in the action are” (that is, slivers of
available bandwidth) and stagger the attempts
our public queues make to contact Berkeley to maximize our chances of getting work units.
The Team Queue owners will also need some idea of what WU cache their constituents require.
This may take some e-mail.
This sketch of ideas is based on actual experience to date. When we first hit this “brownout” problem,
second and third shift US Pacific time was supposed to remain
uncapped. In practice, we merely found our odds of getting work elevated a bit “offshift”.
Near the limit, other problems (e.g. total connection count in the server) seemed to be
reached as well. But, if Berkeley gets
its act together, then public queues are enough. We’ll figure out who has the highest speed links to
Berkeley, run our public queues through them, and so maximize our probability of getting a unit.
A final effective choice, at least if the brownout becomes permanent, is to run more than one Distributed Computing project on the same machine.
This can be done (in many if not most cases) so that the time is divided equally between SETI@home
and the other project, or with SETI@home dominating. Regardless, though this approach cuts down your SETI contribution
some, it ensures your machine contributes to some project at all times. This more drastic solution
will only make sense if and when it becomes clear that the available bandwidth can only deliver
a fixed number of units, dictating a certain amount of downtime for everyone. At this writing,
we haven’t gotten there yet.
While Team Queues adapt well, permanent brownouts would mean that all queues will dry up, at least
sometimes, and we’ll need to figure out how to restart crunchers automatically as WU dribble in.
See the self-caching FAQ for ideas.
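One possible shape for such restart logic is a retry loop that backs off between attempts. The Python sketch below is an assumption, not the method from the self-caching FAQ: `fetch_wu` is a hypothetical callable standing in for whatever actually asks a queue for work, and the delays are arbitrary.

```python
import time


def wait_for_work(fetch_wu, max_attempts=8, first_delay=60.0,
                  max_delay=3600.0, sleep=time.sleep):
    """Call fetch_wu() until it returns a work unit, doubling the pause
    after each failure up to max_delay.

    fetch_wu is a hypothetical caller-supplied function that returns a
    work unit, or None when the queue is dry. Returns None if every
    attempt fails.
    """
    delay = first_delay
    for attempt in range(max_attempts):
        work_unit = fetch_wu()
        if work_unit is not None:
            return work_unit
        if attempt < max_attempts - 1:
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return None


# Demonstration with a fake queue that is dry twice, then delivers.
# The pauses are recorded instead of slept so the demo runs instantly.
fake_results = iter([None, None, "work-unit-1"])
recorded_pauses = []
print(wait_for_work(lambda: next(fake_results), sleep=recorded_pauses.append))
print(recorded_pauses)
```

Backing off keeps an idle cruncher from hammering a queue that is already short of bandwidth, which matters most in exactly the brownout case.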
Given all the factors (traditional outages, brownouts, ease of operation), here is what the author
would consider an optimum setup for the current circumstances:
Here, one sees all the key factors in play:
- UPS is used to keep local cruncher losses minimal.
- A Team Queue is used, allowing dynamic adaption to all conditions at Berkeley.
- Differing AC power domains are exploited.
- Excessive cascading is avoided so that no added reliability losses are injected.
This setup does rely on the Team Queue to be a good, highly reliable machine. Its downtime
(or its probability of drying up in brownouts) will determine, almost entirely, the production
losses for this situation. Putting a local queue in between adds its own unreliability, but would
give some “buffer” against the Team Queue machine’s problems and, perhaps, certain classes of
network failure. However, if the Team Queue is reliable and the internet is normal, a local machine
subtracts reliability rather than adding to it. Simpler still (and likely as effective) would be to run two SETI clients per CPU on the
cruncher machines, hinted at in the text for Machine Room 2 above. This “two per local machine”
probably is enough to cover most Team Queue losses given the likely reliability of a Team Queue.
In practice, adding a local queue over and above the Team Queue shown here
is probably a question of how likely there will be DNS type
problems with the local queue requiring more than a WU or two’s worth of local buffer. We have
seen some problems with DNS resolution here and there that can take a day or two to solve.
But, that should be solvable with absolute IP addresses, which again avoids the
added reliability questions from an added local queue.
The remaining reason to add a local queue to the above is if one has an intermittent
dialup access to the Internet. This especially applies to manual dialup, not autodial
situations. Here, the loss of a percent or two, regrettable as that is in
the abstract, overcomes the practical issue of local “outage” from not having continuous
access to the ‘net.
In the interests of making this a bit more real, the author will describe some of the actual
failures that are readily remembered. This first set is entirely at home.
- Failure of Tyan Tiger Dual motherboard. Approximate down time: Four months. Cause: Early
revision of board mismatched Crucial memory. Losses: Since I purchased a K7S5A to tide me over,
1 GHz Morgan CPU for four months.
- Power Failures. The UPS has done very well, but I lose about eight to twelve hours per year,
at home, due to power failures that exceed the UPS’ time scale. However, without it, my losses would
probably be ten times that, to say nothing of me probably being too nervious to run the machines
while at work.
- Setup errors in running the clients. Occasionally, the various forms of sneakernetting I do
catch up to me. I do not have a good estimate for this, but it is probably several days per year.
This is much less frequent nowadays simply because I tend, one way or another, to have two things available
for each CPU. I most recently lost about 8 SETI hours on one of two dualie CPUs because of a trifling
detail in how “nohup” works in Linux. Only one CPU functioned instead of two.
- Hardware upgrades. This probably happens a half dozen times a year at a cost of four hours each,
so that’s about a day per year. Software upgrades do not usually cause me much problem as I
seldom/never upgrade the operating system, unless in conjunction with a hardware upgrade.
- Rebooting Windows. While this happens with some regularity (at least weekly), it is on my
slowest machine. This is caused by some ill-understood “leak” in the Genome@Home client (so I
would not have this running SETI). However, it is in the five minutes per week range (if it
were SETI) and so can be ignored.
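For a sense of scale, the recurring home losses listed above can be tallied in a few lines of Python. This is a loose sketch: "several days" is taken as three days purely as an assumption, the one-off motherboard failure is left out, and each figure is treated as hours lost by a single machine in a year.

```python
HOURS_PER_YEAR = 365 * 24

# Hours lost per year, taken from the list above where it gives numbers.
annual_losses_hours = {
    "power failures beyond the UPS": 10,   # midpoint of eight to twelve hours
    "client setup errors": 3 * 24,         # ASSUMPTION: "several days" read as three
    "hardware upgrades": 6 * 4,            # half a dozen upgrades at four hours each
    "Windows reboots": 52 * 5 / 60,        # five minutes per week
}


def loss_fraction(losses_hours):
    """Total yearly loss as a fraction of round-the-clock operation."""
    return sum(losses_hours.values()) / HOURS_PER_YEAR


for cause, hours in annual_losses_hours.items():
    print(f"{cause}: {hours:.1f} h")
print(f"total: {loss_fraction(annual_losses_hours):.2%} of the year")
```

Under these assumptions the routine losses land between one and two per cent, the same order as the few per cent a caching scheme is meant to protect.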
At work, the leading cause of downtime is some sort of site shutdown (in whole or, more
often, in part). For instance, I have occasional, rotating
access to many machines in several buildings. When any one of these buildings is
powered down for some repair (e.g. over Labor Day Weekend), I lose production from all such
machines for that time.