Understanding 9s
by Max in February 2023
Everyone makes assumptions
Any time you encounter a client-server relationship in your life, there is usually a set of expectations and norms that guides it. Those expectations are usually informed by the context of the relationship.
One of the most common (and literal) client-server relationships you encounter is when eating at a restaurant. By occupying space in the restaurant you typically expect a server to do some stuff: talk about the specials, take your drink order, maybe bring water. You also typically expect it to happen in some time frame that is convenient for you. For sit-down restaurants that time frame may be relaxed; for a drive-through the expectation is both faster and more strict.
And all of this dance happens without either of you explicitly establishing the rules for the relationship.
SLAs: Assumptions in software
In software-engineering-land, we write these expectations explicitly as Service-Level Agreements, aka SLAs.
Most often you’ll establish SLAs around the availability, reliability, and latency of your software. Availability is whether the software is up and running. In the restaurant analogy, this is how often the restaurant is actually open during hours of operation (something that has become less predictable in the pandemic). Reliability is the likelihood that a software operation will do what the user intends, like how often the restaurant gets the customer’s order correct. Latency is how long an operation will take in the software. You typically measure both how long any individual operation takes and how long the end-to-end flow takes. For restaurants, you could measure the latency of the ordering process and also the latency from the customer sitting down to the customer paying their bill.
A lot of software is always-on and being interacted with thousands or millions of times a day, so SLAs tend to be expressed in terms of “9s”. For example, Amazon Web Services (AWS), which powers much of the publicly accessible internet, has a monthly uptime SLA of 99.99% (also known as 4 “9s”). That means AWS servers can be down for one ten-thousandth of the month (about 4 minutes) and still be within their SLA. Any more downtime and customers can complain about a breach of contract and seek restitution as outlined on their SLA page.
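To make the "about 4 minutes" figure concrete, here is a small sketch of the arithmetic. The function name `allowed_downtime_minutes` and the 30-day month are my own assumptions for illustration, not anything taken from a real SLA document.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime in the period that still fit inside the SLA.

    Assumes the SLA is measured over a period of `days` days (30 here,
    which is an assumption; real SLAs define their own billing period).
    """
    minutes_in_period = days * 24 * 60
    return minutes_in_period * (1 - availability_pct / 100)


# Print the downtime budget for a few common numbers of "9s".
for availability in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(availability)
    print(f"{availability}% -> {budget:.2f} minutes per 30-day month")
```

For 99.99% this works out to roughly 4.32 minutes in a 30-day month, which is where the "about 4 minutes" comes from.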
The tightest SLA I’ve interacted with was on a piece of RAM, which had a reliability SLA of 99.9999% (or 6 “9s”).
But is it broken?
From a user perspective it can be hard to tell whether your bad experience with software is a sign of a problem or totally expected and within SLA. So I like to keep in mind how these “9s” map to the probability of my bad experience. Here’s a little table to help you understand too!
Probability of at least 1 failure in a 30-day month
Availability guarantee | Daily use | Hourly use | Minutely use | Secondly use |
---|---|---|---|---|
99% | 26% | 99.9% | 100% | 100% |
99.9% | 2.9% | 51.3% | 100% | 100% |
99.99% | 0.3% | 6.9% | 98% | 100% |
99.999% | 0.03% | 0.7% | 35% | 100% |
99.9999% | 0.003% | 0.07% | 4.2% | 92.5% |
Anyway, none of this math is ground-breaking but hopefully it helps give you an idea as to whether what you’re experiencing with your software is something unusually bad or expectedly bad.
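If you want to check the table yourself, the numbers can be reproduced with a short sketch like the one below. It assumes a 30-day month and that every use fails independently with probability (1 - availability), which is a simplification; the names `USES_PER_MONTH` and `p_at_least_one_failure` are made up for this example.

```python
# Number of uses in a 30-day month for each usage pattern (assumed month length).
USES_PER_MONTH = {
    "daily": 30,
    "hourly": 30 * 24,
    "minutely": 30 * 24 * 60,
    "secondly": 30 * 24 * 60 * 60,
}


def p_at_least_one_failure(availability_pct: float, uses: int) -> float:
    """Probability of hitting at least one failure across `uses` attempts.

    Assumes each attempt succeeds independently with probability
    availability_pct / 100, so all attempts succeed with success ** uses.
    """
    success = availability_pct / 100
    return 1 - success ** uses


# Print one row per availability level, one column per usage pattern.
for availability in (99.0, 99.9, 99.99, 99.999, 99.9999):
    row = [f"{p_at_least_one_failure(availability, n):.3%}" for n in USES_PER_MONTH.values()]
    print(f"{availability}%", *row, sep=" | ")
```

The printed values are rounded, so expect small differences in the last digit compared to the table.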
How do we get around this?
A lot of software engineering is done to work around these probabilistic constraints. Typically the best thing to do is to add redundancy. For example, clients should retry failures, and hosts should run multiple servers and ensure enough are up to run the service.
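Here is a rough sketch of why redundancy helps, assuming failures are independent (in practice they often are not, since outages tend to be correlated). Both `p_all_attempts_fail` and `call_with_retries` are hypothetical helpers written for this post, not part of any library.

```python
def p_all_attempts_fail(availability_pct: float, attempts: int) -> float:
    """Probability that every one of `attempts` independent tries fails.

    The same formula covers N independent servers all being down at once.
    """
    return (1 - availability_pct / 100) ** attempts


def call_with_retries(operation, attempts: int = 3):
    """Call `operation` up to `attempts` times, returning the first success.

    Re-raises the last error if every attempt fails. This is a bare-bones
    illustration of a client-side retry, not a production implementation.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:  # any failure counts as retryable in this sketch
            last_error = error
    raise last_error


# How the failure probability shrinks as a 99% service is retried.
for attempts in (1, 2, 3):
    print(f"{attempts} attempt(s): {p_all_attempts_fail(99.0, attempts):.6%} chance all fail")
```

Under that independence assumption, three attempts against a 99% operation all fail about one time in a million, which is the same as turning 2 “9s” into 6.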