A systems design perspective on why chess.com's servers have been melting

Monday, February 13, 2023

January 2023 was a rough month if you wanted to play chess on the most popular chess website, chess.com [1]. Their service has been experiencing an unprecedented amount of downtime because of a huge influx of users [2]. There have been days where it's all but unusable. It's frustrating as a user! It's also surely frustrating for the business behind the site.

Chess has reached an all-time peak in popularity. In January 2023, Google search traffic for chess exceeded even the boom from the release of The Queen's Gambit. There's a huge influx of new and returning players, and they flock to the site with the obvious domain. Chess.com's app has hit #1 among free games on the iOS App Store.

Part of doing good systems design is planning for capacity. A common rule of thumb is to design a system for roughly an order of magnitude of growth beyond its current load; past that point, the architectural requirements will be dramatically different anyway. Planning for capacity does not mean planning for infinite capacity, but for what may realistically happen.
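To make that concrete, here's the kind of back-of-envelope arithmetic capacity planning involves. Every number below is invented for illustration; none of these figures come from chess.com.

```python
# Back-of-envelope capacity planning. Every number here is made up;
# none of these figures come from chess.com.
current_peak_games = 200_000      # hypothetical concurrent games at peak
moves_per_game_per_min = 4        # hypothetical average move rate per game
writes_per_move = 3               # hypothetical: game state, clocks, history

# Design target: roughly an order of magnitude of growth, not infinity.
growth_factor = 10

current_writes = current_peak_games * moves_per_game_per_min * writes_per_move / 60
target_writes = current_writes * growth_factor

print(f"current peak: ~{current_writes:,.0f} writes/sec")
print(f"design target: ~{target_writes:,.0f} writes/sec")
```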

Why not plan for universal adoption from the very beginning? Why not create something which can scale infinitely? Because it's usually too expensive. Making something that's infinitely scalable means that you need to have (effectively) infinite capacity, and servers have to be paid for somehow.

Some things can easily and cheaply be scaled up to the max. Static sites are pretty easy on that front. You can put a CDN like Cloudflare or Fastly in front of them and you get a lot of scale for very little money, and they can absorb big spikes in traffic. But it's not free, and it's not as cheap as it seems.
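What makes a static site cheap to put behind a CDN is mostly standard HTTP caching. Here's a minimal sketch, using only Python's standard library, of an origin server emitting cache headers that a CDN like Cloudflare or Fastly can honor; the port and max-age values are arbitrary choices for illustration.

```python
# Minimal static-file origin that emits cache headers a CDN can honor.
# Standard library only; the max-age and port are arbitrary examples.
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class CacheFriendlyHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # "public" lets shared caches (like a CDN) store the response;
        # with a long max-age, the CDN serves repeat requests from its
        # edge and the origin only sees cache misses.
        self.send_header("Cache-Control", "public, max-age=86400")
        super().end_headers()

if __name__ == "__main__":
    ThreadingHTTPServer(("", 8000), CacheFriendlyHandler).serve_forever()
```

With headers like these, the CDN absorbs traffic spikes at its edge, and the origin's load barely changes.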

This blog is hosted on a small VPS without a CDN. So far, the traffic hasn't required a CDN to serve: the little VPS chugs along just fine. I could put a CDN in front of it, and it would be free or cheap to get gigantic capacity. I've held off on doing it, because complexity has a cost of its own: by adding another component like a CDN, I would have to worry about caching and propagation time, and about deployment and configuration.

There's value in simplicity. Scaling usually adds complexity.

Adding complexity early on leaves a lot on the table. Instead of adding features that users could benefit from, you have this intangible benefit: the ability to ✨scale✨. In an ideal world, users never even notice the work you put into scaling, because things work as they expect. Users really only notice when scaling isn't happening.

So if the current growth wasn't planned for already, why can't they just scale up now? We can't say for sure, because we don't know the details of their systems. But we can gather some information:

They have a blog post out about why their servers are struggling, and they explicitly mention that they have hardware shipments arriving soon with "the most powerful possible live chess and database servers", so presumably a lot of what powers their live play is still their on-prem hardware.

But they also say that they have other bottlenecks. This is the whack-a-mole aspect of scaling systems. You measure the system and find one bottleneck, and you generally cannot find the next bottleneck until that one is resolved.
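Here's a toy illustration of that whack-a-mole effect: a slow stage completely masks the one behind it, and the second bottleneck only shows up in your measurements after the first is fixed. The stages and delays below are invented.

```python
# Toy pipeline where a slow stage masks the one behind it.
# Stage names and delays are invented for illustration.
import time

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

def query_database():        # bottleneck #1
    time.sleep(1.0)

def render_response():       # bottleneck #2, hidden behind #1
    time.sleep(0.3)

# Before: the database dominates, so rendering looks "fine".
timed("database", query_database)
timed("render", render_response)

# After fixing the database (say, by adding an index), rendering is
# suddenly the biggest cost -- the next mole to whack.
def query_database_indexed():
    time.sleep(0.05)

timed("database (indexed)", query_database_indexed)
timed("render", render_response)
```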

They've identified a number of bottlenecks already, and one of their actions in particular gives a reasonable hint about what the database looked like before: they're splitting user data and gameplay data into separate databases. From their description, it sounds like all of the data for chess.com was in one big MySQL database. With beefy hardware, this can last a long time, but eventually it hits a breaking point. We found the breaking point!
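We don't know what chess.com's schema or code actually look like, but a split like this has a general shape. Here's a minimal sketch of it, with SQLite files standing in for separate MySQL servers; the tables and routing function are invented for illustration.

```python
# Sketch of splitting one database into two, with SQLite files standing
# in for separate MySQL servers. Tables and names are invented.
import sqlite3

# Before the split, everything lived on one connection; after, each
# domain gets its own server (here: its own file).
users_db = sqlite3.connect("users.db")
games_db = sqlite3.connect("games.db")

users_db.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
games_db.execute(
    "CREATE TABLE IF NOT EXISTS games "
    "(id INTEGER PRIMARY KEY, white INTEGER, black INTEGER, pgn TEXT)"
)

def db_for(table: str) -> sqlite3.Connection:
    """Route each query to the database that owns the table."""
    return users_db if table == "users" else games_db

user_id = db_for("users").execute(
    "INSERT INTO users (name) VALUES ('alice')"
).lastrowid

# Queries that used to be a single JOIN now span two servers, so the
# application has to stitch results together itself.
db_for("games").execute(
    "INSERT INTO games (white, black, pgn) VALUES (?, ?, '1. e4 e5')",
    (user_id, user_id),
)
users_db.commit()
games_db.commit()
```

The routing function itself is trivial; the ongoing cost is everything around it, like the queries that can no longer be a single JOIN.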

Why wasn't it addressed earlier? Two primary reasons, I think.

First, scaling ahead of demand has the opportunity cost described above: effort spent on capacity nobody needs yet is effort not spent on features users would actually notice.

Second, all the things that they're doing to respond to this influx of users are labor intensive and expensive, in terms of time now, real money, and perhaps most importantly in terms of future maintenance costs. It's going to be harder to maintain chess.com now that their database is sharded and tables are split out across separate databases. It's very easy to spin up a local stack for development when you have fewer things to spin up!

All that to get to the point: From a systems design perspective, a system is well-designed if it meets the requirements, but doesn't dramatically exceed them. One part is about doing what it's supposed to; the other part is about doing so efficiently. If they'd been able to handle this massive boom in users, well beyond what any reasonable model would have projected, then they would have produced a design that was in all likelihood very wasteful.

Major hugs to all the folks at chess.com who are dealing with these outages. I know you're doing your best. Hang in there.


[1] When people mention chess.com's server issues, there's often a chorus of "Well Lichess is better!" and "Lichess is handling it!". That's not what this post is about. I enjoy and use both sites, and I want both to continue successfully.

[2] There is a lot of speculation on why this boom has happened. Anyone's guess is as good as mine. There are a lot of things at play, such as a chess bot that went viral and the positive feedback loop of being the top downloaded game in the app store.

