This blog post is from a member of Amazon’s S3 team. Though I haven’t yet made use of that service (I believe WordPress can be used to meet all my web publishing needs), I have been a long-time customer of Amazon, and can vouch for their reliability; as I write this I cannot recall any specific problem that surfaced trying to buy a book or record from them.
The post makes a specific guarantee about the reliability of S3, and backs up that commitment by offering to refund money if Amazon doesn’t deliver.
This promise can be found in the Amazon S3 Service Level Agreement, which says in part:
Service Credits are calculated as a percentage of the total charges paid by you for Amazon S3 for the billing cycle in which the error occurred in accordance with the schedule below.
Monthly Uptime Percentage: Service Credit Percentage
Equal to or greater than 99% but less than 99.9%: 10%
Less than 99%: 25%
For example, if the uptime for any month is less than 99% then Amazon will give you a credit of one quarter of your bill for that month.
This is very nice of the folks at Amazon, and I wouldn’t be surprised if many people wondered how Amazon could risk so much income by promising to offer such high rates of reliability. For example, a day has 86,400 seconds, just under 100,000, so to meet the 99.9% commitment they would need to keep their service downtime under 0.1% of a day. One percent of a day is 864 seconds, and one-tenth of that is about 86 seconds. So put in simple terms, Amazon is promising to be down no more than about a minute and a half each day.
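As a rough check on that arithmetic, here is a minimal Python sketch of the downtime budget implied by an uptime percentage. The function name `downtime_budget_seconds` and the 30-day month are my own assumptions for illustration; Amazon's agreement defines how uptime is actually measured, and it measures it per month rather than per day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def downtime_budget_seconds(uptime_percent, period_days=1):
    """Return the seconds of downtime allowed over period_days
    at the given uptime percentage (hypothetical helper)."""
    allowed_fraction = (100.0 - uptime_percent) / 100.0
    return allowed_fraction * SECONDS_PER_DAY * period_days


if __name__ == "__main__":
    # Assumed periods for illustration: one day, a 30-day month, a 365-day year.
    for days, label in [(1, "day"), (30, "30-day month"), (365, "year")]:
        seconds = downtime_budget_seconds(99.9, days)
        print(f"99.9% uptime over one {label}: "
              f"{seconds:,.1f} s ({seconds / 60:.1f} min)")
```

Spread over a 30-day month, the same 99.9% figure comes to roughly 43 minutes of allowed downtime, which need not be divided evenly among the days.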
Amazon offers this promise without limiting it to either hardware or software. After all, down is down, whether caused by a hardware failure or a software failure.
Let’s start with the software. I don’t know what operating system Amazon uses, but I wouldn’t be surprised to learn they are using Linux, because in the world of Linux, expecting that a software error will, on average, bring your machine down for about a minute a day is totally unacceptable.
For example, I’ve been using Linux for almost ten years now, and while I have encountered many problems, especially when debugging my own code, I have never had a machine crash due to a kernel failure, other than incidents when it turned out I was attempting to use a kernel that wasn’t built to support a particular hardware configuration, as I reported in some of my posts a couple of months ago about building and configuring hardware to run Ubuntu.
However, when Linux is properly configured, it is amazingly reliable, or should I say amazon-ingly reliable. I would guess that there are many Linux applications running in large enterprises that go down at most a few seconds a year.
So much for software reliability. How about hardware reliability?
A colleague who knew one of the founders of Yahoo told me some years back how Yahoo then configured its hardware. They would buy stock, commodity-priced, x86-based boxes, install the software on them, and then run them in a test environment 24×7 for two weeks. Those that didn’t fail were then put into production use, and were removed from service after six months, as Yahoo had found that six months was the “sweet spot” for hardware reliability.
He also said that Yahoo used one of the BSD variants of Unix. They had their own customized kernel, which they created by stripping out every line of code that wasn’t needed to meet their business needs. (This probably explains why they didn’t use Linux, as it would have been very difficult to keep up with the rapid pace of Linux development.)
I’ve read that Google used a similar approach in its early days, hand-building its own boxes with commodity parts. Given that they are known to have millions of processors in the world-wide collective that is the Google “search” box, I expect they use the same approach today, though I don’t think they limit the term of use but rely on fault-tolerant computing instead, using software to detect when a box has gone bad and then passing the work on to a box known to be in good shape.
There is another approach you can use to assemble hardware:
If you need hardware of extraordinary reliability then visit IBM.COM, or just look up the number of the nearest IBM office and give the folks there a call.
IBM has spent over four decades learning how to build and deploy that kind of hardware. IBM’s hardware is used to power most of the largest and most demanding applications running in corporations throughout the world today, as it has been for the decades since System/360 was announced in the 1960s.
IBM, like Amazon, is a company. Though I don’t know the details, I am confident IBM backs up its commitment to providing reliable hardware (and the underlying operating system, which was written entirely by IBM) by offering similar credit agreements.
In the world of the largest enterprises, acceptable downtime is measured in seconds, or fractions of seconds, per year.
I don’t know what hardware Amazon uses, but if they are not yet using IBM’s, I suggest a good way to improve the level of their service would be to pick up a phone and call the nearest IBM location, to inquire about running Linux on IBM mainframes, as this is as good as it gets when it comes to providing a rock-solid computing environment.
Linux, unlike IBM and Amazon, is not a corporation, but a vast collective of programmers, testers, writers, and users. Some members of the Linux community are paid by corporations to work on Linux, as is the case for scores of my colleagues in IBM’s Linux Technology Center, the LTC. Among them are some of the best programmers known to me: folks like Ted Ts’o, Gerrit Huizenga, Sean Dague, Rusty Russell, and Andrew Tridgell. By the way, I just looked up “tridge” in IBM’s Blue Pages internal directory to make sure I got the spelling of his name right, and noted he works as a member of the LTC ALRT team. ALRT stands for Advanced Linux Response Team, and consists of programmers who are on constant, round-the-clock alert status, ready to respond to a Linux problem encountered by one of IBM’s customers. It is also a round-the-world team; for example, Tridge and Rusty are based in Australia.
I know that many ALRT team members have gotten a phone call in the middle of the night, telling them the name of the customer and the location, so they can pack their bags and get to the nearest airport, as quickly as the ships of the Royal Navy once put to sea on their missions.
Indeed, I would expect getting such a call is a rite of passage to people who serve on the ALRT team, and it would be an interesting exercise to collect a list of some of the situations that have been brought to their attention, and how things turned out.
IBM offers this kind of support not just for Linux but to all its customers. Here are some examples that I can vouch for personally.
As I have written earlier, I had the good fortune to be present for a meeting of the IBM Academy of Technology held in Toronto in early October of 1998.
I believe that IBM’s major involvement in open-source, going beyond the collaboration with the Apache folks that had begun a few months earlier, can be directly traced to that meeting. Among the attendees was Larry Loucks. He was a co-author of a paper encouraging IBM to investigate open-source, and during the meeting he gave an example of the power of open access to source code. While working as an engineer at a customer site in North Dakota in the late 60s he had found a problem and was able to fix it himself after he looked at the source code for System/360 and found the cause of the problem.
When I left IBM Research to work for IBM’s Software Group a few years back, one of my new colleagues relayed an incident that had happened to him late in December a few years earlier. He had been woken up in the middle of the night to be told that one of IBM’s largest customers was having serious problems, problems so serious that the customer was threatening to toss IBM out the door unless those problems were fixed. He was told to get on a plane, fly to the customer’s site, learn the problems, and then do whatever it took, spending as much of IBM’s resources as needed, to keep that company as an IBM customer. It took him about six months, much of it flying to and fro. But IBM did keep the account, and hundreds of thousands of people use that company’s services every day.
One of IBM’s lesser-known teams is the Crisis Management Team (CRT). These folks go to sleep every night expecting to be awakened, as they are called soon after there is a major disaster in ANY part of the world. For IBM has customers in every country in the world, and so any disaster affects some of IBM’s customers. And since IBM is thus on the scene of every disaster as part of its commitment to its customers, it also pitches in however it can to help everyone else struck by a disaster. For example, I know a member of the team who was at Ground Zero of 9/11 within hours after that murderous attack; he worked on it or close to it for months.
I’ve been working with Rob Eggers, a colleague in the LTC, for well over a year providing guidance to IBM’s corporate philanthropic team on open-source issues, including encouraging IBMers with open-source skills to volunteer those skills to assist non-profit and educational institutions. (This blog, and related efforts such as The Chay Project and Fallen Soldiers, are my own modest efforts as an IBM volunteer, doing this work on my own time and my own dime.)
As I mentioned, IBM is always to be found at the scene of a major disaster, and IBM helped to put together the first version of Sahana, an open-source crisis/disaster management system, in the days just after the devastating Asian Tsunami of December, 2004.
I focus on education. For example, I am writing this while on a trip to Indianapolis to attend a conference on open-source and education, a trip I will pay for out of my own pocket.
Rob has been working for over a year as the leader of a group of skilled IBM programmers who have volunteered to help make Sahana better.
Rob was on vacation this past August, in the days just after Peru was struck by a devastating earthquake. While on vacation in Minneapolis visiting his family, he got a phone call directing him to get to Peru as soon as possible so he could help the IBM team on the ground. He learned that one of IBM’s most senior executives had directed that a team of IBMers from the U.S. be sent to Peru to help the IBM employees who were already engaged helping their countrymen deal with the aftermath of the earthquake.
He was in Peru within a day or so, after stopping by his home in Austin to pick up his passport and taking a multi-leg flight that consumed many hours. He was accompanied by a consultant who works with the CRT team, a man with years of experience dealing on the scene in the aftermath of disaster.
Rob spent about a week in Peru. When I spoke with him later he mentioned it had been one of the most rewarding experiences of his life, both personal and professional. He also mentioned that he had met the Prime Minister of Peru, as one of the IBMers in Peru is a personal friend of the Chief of Staff of the PM. 
I have had similar experiences while working as an IBM volunteer. In some cases my work has helped IBM’s business, and one of the services I offer as a volunteer is to serve as an intermediary to folks who wish to solicit IBM’s help in a particular effort, or possibly partner with IBM in a joint effort involving open-source, education, crisis management, and so forth.
I have mentioned some of the IBMers who are paid to support Linux and “make Linux better,” a phrase that has served as the motto of the LTC for many years. But they are but a small fraction of the larger community that labors worldwide every day to make Linux better.
They stand by the work. I use that phrase due to an incident many years ago when one of the crowns in my mouth went bad. While sitting in the chair of my then dentist, Ronald Maitland, I asked how much it would cost for the repair. He said, “Nothing. I stand by my work.”
I was so impressed that I took that phrase as my own motto, a reminder that I should stand by my work, too.
That phrase sums up the Linux Service Agreement:
We stand by our work.
If you have a problem with Linux just tell us about it and we will make every effort to fix it, at no cost to you.
Not only are they standing by, I stand in awe of what they have achieved.
So should you.
I have told you about several incidents where people have been woken in the middle of the night to be informed of a problem.
I am sure that Linus Torvalds, the CEO of Linux, and Sam Palmisano, IBM’s current CEO, have also received such calls.
But I doubt they get many such calls these days. They can go to sleep with such confidence in their product that they know they will sleep like a baby, for after decades of work they are no longer babes in the woods.
1. While stuck in O’Hare airport yesterday for several hours getting a first-hand education on the American Airlines Service Agreement, I noticed the man in front of me in a line had a Peruvian passport. It turns out he had come to Chicago to run in the marathon the previous day, an event that turned out to be a disaster due to the high heat and humidity. I mentioned that I was aware of the recent, devastating Peruvian earthquake, and that IBM had helped his countrymen cope in the aftermath,