A systems design perspective on why chess.com's servers have been melting

Monday, February 13, 2023

January 2023 was a rough month if you wanted to play chess on the most popular chess website, chess.com[1]. Their service has been experiencing an unprecedented amount of downtime because of a huge influx of users[2]. There have been days where it's all but unusable. It's frustrating as a user! It's also surely frustrating for the business behind the site.

Chess has reached an all-time peak in popularity. In January 2023, Google search traffic for chess exceeded even the boom from the release of The Queen's Gambit. There's a huge influx of new or returning players, and they flock to the site with the obvious domain. Chess.com's app has hit #1 most-downloaded free game on the iOS App Store.

Part of doing good systems design is planning for capacity. A general rule of thumb is to design a system for up to an order of magnitude of growth, say ten times the current load; beyond that point, the architectural requirements will be dramatically different anyway. Planning for capacity does not mean planning for infinite capacity: it means planning for what may realistically happen.
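To make that concrete, here's a back-of-envelope sketch of what that kind of planning looks like. Every number in it is hypothetical, not chess.com's:

    # Back-of-envelope capacity planning; all numbers are hypothetical.
    current_peak_rps = 2_000   # measured peak requests/second today
    growth_factor = 10         # the "design for ~10x" rule of thumb
    per_server_rps = 500       # measured throughput of one app server

    target_rps = current_peak_rps * growth_factor
    servers_needed = -(-target_rps // per_server_rps)  # ceiling division
    print(f"Plan for {target_rps} req/s -> ~{servers_needed} app servers")
    # Plan for 20000 req/s -> ~40 app servers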

Why not plan for universal adoption from the very beginning? Why not create something which can scale infinitely? Because it's usually too expensive. Making something that's infinitely scalable means that you need to have (effectively) infinite capacity, and servers have to be paid for somehow.

Some things can easily and cheaply be scaled up to the max. Static sites are pretty easy on that front. You can put a CDN like Cloudflare or Fastly in front of them and you get a lot of scale for very little money, and they can absorb big spikes in traffic. But it's not free, and it's not as cheap as it seems.
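The mechanism that makes this cheap is ordinary HTTP caching: the origin tells any CDN in front of it how long responses can be reused. Here's a minimal sketch using only Python's standard library; the one-hour max-age is just an example value:

    # Serve static files with a Cache-Control header so a CDN in front
    # of this origin can reuse responses instead of hitting it each time.
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    class CachedStaticHandler(SimpleHTTPRequestHandler):
        def end_headers(self):
            # Cacheable by shared caches (CDNs) for an hour; tune to taste.
            self.send_header("Cache-Control", "public, max-age=3600")
            super().end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8000), CachedStaticHandler).serve_forever()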

This blog is hosted on a small VPS without a CDN. So far, the traffic hasn't required a CDN to serve: the little VPS chugs along just fine. I could put a CDN in front of it, and it would be free or cheap to get gigantic capacity. I've held off on doing it, because there's a cost to complexity. By adding another component like a CDN, I would have to worry about caching and propagation time. I would have to worry about deployment and configuration.

There's value in simplicity. Scaling usually adds complexity.

Adding complexity early on leaves a lot on the table. Instead of adding features that users could benefit from, you get an intangible benefit: the ability to ✨scale✨. In an ideal world, users never even notice the work you put into scaling, because things work as they expect. Users really only notice when scaling isn't happening.

So if the current growth wasn't planned for already, why can't they just scale up now? We can't say for sure, because we don't know the details of their systems. But we can gather some information:

  • According to a Google Cloud showcase, chess.com uses GCP. So they use some cloud services.
  • They also use a lot of on-prem hardware, according to their SRE job description.
  • They use MySQL as their primary database, based on their job descriptions.
  • They also use a NoSQL store as a secondary database, per the same job descriptions.

They have a blog post out about why their servers are struggling, and they explicitly mention that they have hardware shipments arriving soon with "the most powerful possible live chess and database servers", so presumably a lot of what powers their live play is still their on-prem hardware.

But they also say that they have other bottlenecks. This is the whack-a-mole aspect of scaling systems. You measure the system and find one bottleneck, but you generally cannot find the next one until the first is resolved, because a bottleneck throttles the load that ever reaches the components behind it.

They've identified a number of bottlenecks already, and one of their actions in particular gives some reasonable information about what the database looked like before: they're working on splitting user data and gameplay data into separate databases. From their description, it sounds like all of the data for chess.com was in one big MySQL database. With beefy hardware, this can last a long time, but eventually it hits a breaking point. We found the breaking point!
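To illustrate what that split means for application code, here's a minimal sketch. The hosts, schema, and queries are all hypothetical (chess.com hasn't published theirs), and it assumes the PyMySQL driver:

    # After the split, the app holds one connection per database and
    # routes each query to the right one. All details are hypothetical.
    import pymysql

    users_db = pymysql.connect(host="users-db.internal", user="app",
                               password="...", database="users")
    games_db = pymysql.connect(host="games-db.internal", user="app",
                               password="...", database="games")

    def get_user(user_id):
        with users_db.cursor() as cur:
            cur.execute("SELECT username, rating FROM users WHERE id = %s",
                        (user_id,))
            return cur.fetchone()

    def get_recent_games(user_id):
        # Gameplay data lives elsewhere now, so this can no longer JOIN
        # against the users table; the app stitches results together.
        with games_db.cursor() as cur:
            cur.execute("SELECT id, result FROM games"
                        " WHERE white_id = %s OR black_id = %s"
                        " ORDER BY played_at DESC LIMIT 10",
                        (user_id, user_id))
            return cur.fetchall()

Note what's lost: queries that used to be a single JOIN across users and games now take two round trips and some application-level stitching. That's part of the maintenance cost discussed below.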

Why wasn't it addressed earlier? Two primary reasons, I think:

  • It was likely known that this would be an issue later but, again, scale is expensive. Choosing to break apart the database now would be a very expensive project, delaying major new features, when that scale doesn't seem likely! And on top of that, they're in the midst of integrating systems from their acquisition of the Play Magnus Group, so they're not exactly short of work to do.
  • Load testing is hard, so capacity planning is hard. It's tough to generate load that's a good facsimile of real production traffic, so a load test likely won't give you an exact understanding of the load you can handle. (That's why you aim for load test results that are better than you need by a wide margin.) So it's possible they didn't know exactly when they would hit the breaking point, or what would break when they did. A minimal sketch of such a test follows this list.
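For a flavor of what even a crude load test looks like, here's a standard-library sketch. Real tools (wrk, k6, Locust, and friends) do this far better, and the URL is a placeholder:

    # Hammer one endpoint with N concurrent workers, then report
    # throughput and tail latency. Minimal on purpose.
    import time
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    URL = "https://staging.example.com/health"  # placeholder endpoint
    WORKERS, REQUESTS = 50, 1_000

    def hit(_):
        start = time.monotonic()
        with urlopen(URL) as resp:
            resp.read()
        return time.monotonic() - start

    t0 = time.monotonic()
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        latencies = sorted(pool.map(hit, range(REQUESTS)))
    elapsed = time.monotonic() - t0

    print(f"{REQUESTS / elapsed:.0f} req/s,"
          f" p99 latency {latencies[int(0.99 * len(latencies))]:.3f}s")

The hard part isn't this script; it's making the synthetic traffic resemble production: realistic endpoints, realistic data distributions, realistic concurrency patterns.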

All the things that they're doing to respond to this influx of users are labor-intensive and expensive: in time now, in real money, and, perhaps most importantly, in future maintenance costs. It's going to be harder to maintain chess.com now that their database is sharded and tables are split out across separate databases. It's much easier to spin up a local stack for development when there are fewer things to spin up!

All that to get to the point: From a systems design perspective, a system is well-designed if it meets the requirements, but doesn't dramatically exceed them. One part is about doing what it's supposed to; the other part is about doing so efficiently. If they'd been able to handle this massive boom in users, well beyond what any reasonable model would have projected, then they would have produced a design that was in all likelihood very wasteful.

Major hugs to all the folks at chess.com who are dealing with these outages. I know you're doing your best. Hang in there.


[1] When people mention chess.com's server issues, there's often a chorus of "Well Lichess is better!" and "Lichess is handling it!". That's not what this post is about. I enjoy and use both sites, and I want both to continue successfully.

[2] There is a lot of speculation on why this boom has happened. Anyone's guess is as good as mine. There are a lot of things at play, such as a chess bot that went viral and the positive feedback loop of being the top downloaded game in the app store.


If this post was enjoyable or useful for you, please share it! If you have comments, questions, or feedback, you can email me at my personal address. To get new posts and support my work, subscribe to the newsletter. There is also an RSS feed.

Want to become a better programmer? Join the Recurse Center!