If it never breaks, you're doing it wrong

Monday, June 24, 2024

When the power goes out, most people are understanding. Yet some of the most livid I've seen people get is when web apps or computers they use have a bug or go down. But most of the time, it's a really bad sign if this never happens[1].

I was talking to my dad about this recently. For most of his career, he was a corporate accountant for a public utility company. Our professional interests overlap in risk, systems, internal controls, and business processes. These all play into software engineering, but risk in particular is why we should expect our computer systems to fail us.

The power goes out sometimes

As a motivating example, let's talk about the power company. When's the last time you had a power outage? If you're in the US, it's probably not that long ago. My family's last outage lasted about an hour last year, and my parents had their power go out for half a day a few weeks ago.

Both of these outages were preventable.

My family's power outage happened because a tree came down on an above-ground power line. Burying the cables would have prevented it; that takes quite a bit of digging, and it's common in a lot of new developments, but where we are everything is above ground for legacy reasons. Or maybe we could have removed more of the trees around the power lines! But that's probably not a great idea, because trees are important for a lot of reasons, including preventing erosion and mitigating floods.

My parents' power outage was from an animal climbing into some equipment (this makes me very sad, poor thing). This could have been prevented by protecting and sealing the equipment. Perhaps there was protection and it was broken, and an inspection could have found it. Or perhaps the equipment needed other forms of protection and sealing.

There are also power failures that come from a failure to recognize and acknowledge risk, or from a change in risk levels. In particular, I think about the recent failures of Texas's power grid. The grid was overloaded in a way that had been predicted, and the result was catastrophic. The risk of this happening grew as our climate changed, and utility infrastructure is difficult to update quickly to reflect that change in reality[2].

The thing is, all of these interventions are known. We can do all of these things, and they're discussed. Each of them comes with a cost. There are two aspects of this cost: there are the literal dollars we pay to make these interventions, and there is the opportunity cost of what we don't do instead. In a world of limited resources, we must consider both.

When you're deciding which changes to make, you have to weigh the cost of interventions against the cost of doing nothing. Your cost of doing nothing is roughly the probability of an event happening times the cost of such an event. You can calculate that, and you should! Your cost of doing an intervention, meanwhile, is the direct cost of the intervention plus any gains you lose from the things you opt not to do instead (this can be lost revenue, or it can be other failures you incur by choosing this intervention over another).
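
To make that comparison concrete, here's a minimal sketch in Python. The function and parameter names are my own, just for illustration; the point is only that both sides of the trade-off are small, computable quantities.

```python
def expected_cost_of_inaction(event_probability, event_cost, periods=1):
    """Rough expected loss if you do nothing: probability times cost, per period."""
    return event_probability * event_cost * periods

def cost_of_intervention(direct_cost, opportunity_cost=0.0):
    """What you pay to intervene, plus the gains you give up by not doing something else."""
    return direct_cost + opportunity_cost

def worth_intervening(event_probability, event_cost, direct_cost,
                      opportunity_cost=0.0, periods=1):
    """True if intervening is cheaper than the expected losses over the horizon you care about."""
    return (cost_of_intervention(direct_cost, opportunity_cost)
            < expected_cost_of_inaction(event_probability, event_cost, periods))
```

The hard part in practice isn't the arithmetic, it's choosing honest values for the probability, the cost of an event, and the opportunity cost.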

What does your downtime cost you?

This all comes back to software. Let's look at an example, using fake numbers for ease of calculation.

Let's say you have a web app that powers an online store. People spend $1 in your shop each minute, and you know you have a bug that gives you a 10% chance each month of going down for an hour. Should you fix it?

We want to say yes by default, because geez, one hour of downtime a month is a lot! But this is a decision we can put numbers behind. Off the bat, the expected cost of that outage risk is 0.1 * 60 * $1, or $6 a month. If your software developers cost you $3/hour and can fix this in 10 hours, the fix costs $30, and you'd expect it to pay for itself in five months.
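
Running the same arithmetic as a tiny script, with the fake numbers above (variable names are my own), gives the same answer:

```python
revenue_per_minute = 1.00       # dollars spent in the shop each minute
outage_minutes = 60             # the outage lasts an hour
outage_probability = 0.10       # 10% chance of that outage in a given month

# Expected loss per month from doing nothing: 0.1 * 60 * $1 = $6.00
expected_monthly_loss = outage_probability * outage_minutes * revenue_per_minute

developer_rate = 3.00           # dollars per hour
hours_to_fix = 10
fix_cost = developer_rate * hours_to_fix          # $30.00

months_to_break_even = fix_cost / expected_monthly_loss   # 5.0
print(f"The fix pays for itself after {months_to_break_even:.0f} months")
```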

But this also ignores some real-world aspects of the issue: How will downtime or uptime affect your reputation, and will people still be willing to buy from you? If you're down, do you lose the money or do people return later and spend it (are you an essential purchase)? Are purchases uniformly distributed across time as we used here for simplicity, or are there peak times when you lose more from being down? Is your probability of going down uniform or is it correlated to traffic levels (and thus probably to revenue lost)?

Quantifying the loss from going down is hard, but it's doable. You have to make your assumptions clear and well known.

What do you give up instead?

The other lens to look at this through is what you give up to ensure no downtime. Downtime is expensive, and so is each additional increment of uptime.

Going from 90% to 99% uptime is pretty cheap. Going from 99% to 99.9% uptime gets a little trickier. And going from 99.9% to 99.99% uptime is very expensive. Pushing further than that gets prohibitively expensive, not least because you will be seeking to be more reliable than the very components you depend on[3]! Being more reliable than the components you use requires a significant change in how you think and how you design things, and it comes with a cost.
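
To put rough numbers on each nine (my own back-of-the-envelope arithmetic, assuming a 365-day year), each additional nine cuts your yearly downtime budget by a factor of ten:

```python
minutes_per_year = 365 * 24 * 60  # 525,600 minutes in a non-leap year

for uptime in (0.90, 0.99, 0.999, 0.9999):
    downtime_minutes = (1 - uptime) * minutes_per_year
    print(f"{uptime:.2%} uptime allows about {downtime_minutes:,.0f} minutes of downtime per year")
```

At four nines you get less than an hour of downtime to spend in an entire year, which is part of why each additional nine costs so much more than the last.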

When you work to increase uptime, it's at the expense of something else. Maybe you have to cut a hot new feature out of the roadmap in order to get a little more stability. There goes a big contract from a customer that wanted that feature. Or maybe you have to reduce your time spent on resolving tech debt. There goes your dev velocity, right out the window.

This can even be a perverse loop. Pushing toward more stability can increase complexity in your system while robbing you of the time to resolve tech debt, and both complexity and tech debt increase the rate of bugs in your system. And this leads to more instability and more downtime!

Some team configurations and companies can set up engineering systems in a way that lets them push uptime to incredible levels; what the major cloud providers and CDNs do is remarkable. On the other hand, small teams have some inherent limits to what they can achieve here. With a handful of engineers, you're not going to be able to set up the in-house data centers and power supplies necessary to even have a possibility of pushing past a certain point of uptime. Each team has a limit to what it can do, and it gets exceedingly expensive the closer you push to that limit.

Why do people get upset?

An interesting question is why people get upset when software fails, especially when we're not similarly upset by other failures. I'm not entirely sure, since I'm generally understanding when systems fail (this has always been my nature, but it's been refined through my job and experience). But I have a few hypotheses.

  • It's hard to be patient when you have money on the line. If you have money on the line from a failure (commission for people selling the software, revenue for people using it in their business, etc.) then this is going to viscerally hurt, and it takes deliberate effort to see past that pain.
  • We don't see the fallible parts of software. We see power lines every day, and we can directly understand the failures: a tree fell on a line, it's out, makes sense. But with software, we mostly see a thin veneer over the top of the system, and none of its inner workings. This makes it a lot harder to understand why it might fail without being a trained professional.
  • Each failure seems unique. When the power goes out, we experience it the same way each time, so we get used to it. But when a piece of software fails, it may fail in a different way each time, and we don't have a general "all software fails at once" moment but rather many individual pieces of software failing independently. This means we never really get used to running into these issues, and they're a surprise each time.
  • We know who to be mad at. When the power goes out, we don't really know who we can be upset at. We shouldn't be upset at the line workers, because they're not deciding what to maintain; who, then? Whereas with software, we know who to be mad at: the software engineers of course! (Let's just ignore the fact that software engineers are not often making the business decision of what to focus development efforts on.)
  • We don't actually get more mad; I just see it more because I'm in software. This one is interesting: people might not actually be more upset when software fails than when the power goes out, and I might just be more aware of it. I'm not sure how to check this, but I'd be curious to hear from people in other fields about when things fail and how understanding folks are.

I'm sure there are more reasons! At any rate, it's a tricky problem. We can start to shift it by talking openly about the risk we take and the costs involved. Trade-offs are so fundamental to the engineering process.


Thank you to Erika Rowland for reviewing a draft of this post and providing very helpful feedback!


[1]

Exceptions apply in areas that are safety critical, where a failure can result in very real loss of life. Even in these situations, though, it's not crystal clear: Would you rather a hospital invest in shifting from 99.99% power uptime to 99.999%, or spend that same budget on interventions that apply more often? The former saves many lives in the case of an unlikely disaster, while the latter saves fewer lives but does so more certainly in more common situations. We always have limited resources available, and how we spend them reflects trade-offs.

[2]

This is not an excuse, though. We saw this coming. Our climate has been changing for quite a while, and people have been predicting changes in load on the grid. But plenty of people want to deny this reality, shift the blame onto other people, or hope for a miraculous solution. Or they simply like to watch the world burn, literally. Either way, now that we're where we are, it's going to be a slow process to fix it.

[3]

My friend Erika pointed me to this great short, approachable resource on how complex systems fail. She also has a great note going through four different ways that people use the word "resilience", which is very helpful.


If this post was enjoyable or useful for you, please share it! If you have comments, questions, or feedback, you can email my personal email. To get new posts and support my work, subscribe to the newsletter. There is also an RSS feed.
