0.99999%

Chuck Yerkes chuck+baylisa at snew.com
Fri Feb 22 17:03:04 PST 2002


Quoting Bryan McDonald (bigmac at tellme.com):
> 5 9's always amuses me. People always ask for it, and most don't understand
> what they are asking for.  And you're right, trying for it is interesting.

Allow me to riff off of this:
I used to work as a system admin at a place that developed an
HA (high availability/failover management) program.  Development
was motivated and, in a large part, funded by some Wall St
companies down the street a bit.  I sort of kibitzed on interface
(when developers choose ways that an SA should work with a
product, an SA should be involved in restraining them).

The software worked but, frankly, meant that you were no longer
on Unix (SunOS or Solaris at the time), but on "Unix + HA" and
non-HA experienced SAs couldn't touch it.

Most of the tech support calls came from people who'd blithely edit
/etc/inetd.conf or services on one machine and, when failover
happened, have a problem.

The first part of the install was usually "teaching HA" - all
systems had to have some redundancy - dual paths to disks so
that a cable or SCSI card failure was minor.  The LAST thing
we wanted to do was an actual machine -> machine failover.

The (marketing) goal was "4 9's of uptime" (barring planned
downtime).  Since the target was trading systems, they were
generally in heavy use for 10 hours/day, during which failure
could cost tens of thousands per minute (and nobody ever was
stopped from making a bad deal during a failure - they always
"just lost $50,000").


Then we had people who grumped that having a spare 4-way machine
doing NOTHING BUT WAITING for the other to die was a waste of
money.

It came down to this:  In 1995 terms, you spent $100k on hardware
and it could be maintained only by your SENIOR system admins
($150k/year) and you would have failure/reboots that took less than
1-2 minutes to recover from.

When management demands 5 9s (< 1 second of downtime/day or 5
minutes/YEAR), that will cost a lot of money.   99.999% sounds
lovely.  Tandem made a lot of money on that.  Most of the time,
it's entirely moot.  You can have a system be down for 20 minutes.
You can also work around bumps for a short time.  You can also
find high resilience in a lot of customer side systems.
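
For anyone who wants to check the arithmetic behind "N 9's", here
is a rough sketch in Python.  The function name and the 365-day
year are my own assumptions for illustration; it just turns a count
of nines into a downtime budget, nothing more.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: non-leap year


def downtime_budget(nines):
    """Return (seconds per day, minutes per year) of allowed downtime.

    `nines` is the number of 9's in the availability figure,
    e.g. 5 for 99.999%.  Hypothetical helper, for illustration only.
    """
    unavailability = 10.0 ** -nines  # 5 nines -> 0.00001
    seconds_per_day = unavailability * SECONDS_PER_DAY
    minutes_per_year = unavailability * SECONDS_PER_YEAR / 60
    return seconds_per_day, minutes_per_year


if __name__ == "__main__":
    for n in range(2, 6):
        per_day, per_year = downtime_budget(n)
        availability = 100 * (1 - 10.0 ** -n)
        print(f"{n} nines ({availability:.3f}%): "
              f"{per_day:8.2f} s/day, {per_year:8.1f} min/year")
```

Five nines comes out to a bit under 1 second/day and a bit over 5
minutes/year, which matches the numbers above; four nines is
roughly ten times that.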

Local Directors (and equiv.) fronting web servers that talk to
an SQL database with apps that can fail gracefully when the
SQL server goes down.  eBay and Amazon REQUIRE HA for that,
it's their business.  Many businesses will have annoyances
if their web server apps go down, but a 30 minute outage
will not affect much - getting "online banking is temporarily
unavailable, please check back soon" won't drive Wells Fargo
out of business on the rare occasion.

5 9's usually costs several hundred thousand dollars and is
not often actually needed or wanted.  If anyone thinks they
can have 4+ 9s and do it on the cheap, then they are lying to
themselves and awaiting budget overruns.


So management:
Let's assume that "5 9's" is non-mathematical and that they
mean "spend a reasonable amount to keep our stuff working."

That might mean a couple web servers containing the external
static data.  A backup MX box to catch mail if (when?) your
T1 goes down.  A spare server box to use when one of your
servers goes down.  Hell, standardizing on a couple server
types so that you can HAVE spares.  Spending money on monitoring
and alerting can be slid into this budget item.


Educating management.  Presenting to management in acceptable
ways.  Not "No we can't do that" but "Sure we can do that, let
me put together a budget for that project and you take a look
at whether we want to do it."

"Strategic management" needs to plan money and other resources.
We need to help them do that.  Middle management's stereotypical
job is to stand between our reality and the reality that the
board needs to hear (the PHB factor).  Learning how to communicate
with and have the trust of the top folks is one of the most
valuable skills a Sr. SA can have.

chuck


