Load testing is hard, and the tools are... not great. But why?
Monday, January 4, 2021
If you're building an application that needs to scale—and we all tell ourselves that we are—then at some point you have to figure out if it does or not. This is where load testing comes in: if you want to see whether or not your application can handle scale, just generate scale and see if it can handle it! It sounds straightforward enough.
Then you try to actually generate load. This is straightforward if your application is dead simple, because you can use something like Apache JMeter to generate repeated requests. If you can do this, I envy you: every system I've worked on is more complicated and requires a more intricate testing plan.
Your application gets slightly more complicated, so you then turn to tools like Gatling. These let you simulate virtual users going through scenarios, which is a lot more helpful than just besieging one or a handful of URLs. Even this isn't sufficient if you're writing an application that uses both WebSockets and HTTP calls, over a long-lived session, and requires certain actions repeated on a timer. Unless I severely missed something in the documentation, I cannot see a way to, say, set up a heartbeat that runs every 30 seconds, do certain actions in response to a WebSocket message, and also do some other HTTP actions, all with the same HTTP session. I haven't found a way to do that in any load testing tool (which is why I wrote my own at work, which I hope to open source if I can make the time to clean it up and separate out the proprietary bits).
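To make that shape concrete, here is a rough sketch of one virtual user in Python's asyncio. It is not the tool I wrote at work and not how any existing load testing tool does it. FakeSession is a made-up in-memory stand-in for a shared HTTP and WebSocket session, and the paths and message contents are invented. The only point is the structure: a timer-driven task and a message-driven task running side by side against the same session.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for one user's shared HTTP + WebSocket session."""

    def __init__(self, user_id):
        self.user_id = user_id
        self.log = []                 # every action taken with this session
        self.inbox = asyncio.Queue()  # messages "pushed" over the WebSocket

    async def http(self, method, path):
        self.log.append(f"{method} {path}")

    async def ws_send(self, message):
        self.log.append(f"WS {message}")


async def heartbeat(session, interval):
    # Timer-driven action: repeats until the task is cancelled.
    while True:
        await session.ws_send("heartbeat")
        await asyncio.sleep(interval)


async def react_to_messages(session):
    # Message-driven action: each pushed message triggers an HTTP call.
    while True:
        message = await session.inbox.get()
        await session.http("GET", f"/items/{message}")


async def run_user(user_id, duration, heartbeat_interval=30.0):
    session = FakeSession(user_id)
    await session.http("POST", "/login")
    tasks = [
        asyncio.create_task(heartbeat(session, heartbeat_interval)),
        asyncio.create_task(react_to_messages(session)),
    ]
    # Pretend the server pushes one message halfway through the session.
    await asyncio.sleep(duration / 2)
    await session.inbox.put("42")
    await asyncio.sleep(duration / 2)
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return session.log
```

Running many users is then a matter of gathering many run_user coroutines. A real version would swap FakeSession for actual HTTP and WebSocket clients that share cookies or tokens, which is exactly the part I couldn't find off the shelf.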
But let's suppose you do have a tool that works, out of the box, like Gatling or Locust, and it fits your needs. Great! Now let's write that test. In my experience, this is the hardest bit yet, because you have to first figure out what realistic load looks like — welcome to a day or three of dredging through logs and taking notes while you peer at the network tools in your browser as you click around in your web application. And then after you know what realistic load looks like, you get to write what boils down to a subset of your application to pretend to be a user, hit the API, and do the things your user would do.
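The log-dredging part can be partly mechanized. As a minimal sketch, assuming an access log where each line looks like "timestamp method path status" (a format I made up for illustration; yours will differ), you can at least get the request mix out of it:

```python
from collections import Counter

# Hypothetical access-log lines: "<timestamp> <method> <path> <status>"
SAMPLE_LOG = """\
2021-01-04T10:00:00 GET /api/items 200
2021-01-04T10:00:01 GET /api/items 200
2021-01-04T10:00:02 POST /api/cart 201
2021-01-04T10:00:03 GET /api/items 200
"""


def request_mix(log_text):
    """Return {(method, path): fraction of all requests}."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that don't match the assumed format
        _, method, path, _ = parts
        counts[(method, path)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {endpoint: n / total for endpoint, n in counts.items()}
```

This only tells you how often each endpoint is hit. It says nothing about ordering, sessions, or timing between requests, which is where the day or three of note-taking still goes.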
And we're not done yet! Say we have our load test written and it's realistic. It's still a moving target, because updates keep going out. So now you have the maintenance problem, too: as your application changes, how do you keep your load test up to date? There isn't great tooling for this, and there is little out there to help you. You have to make this part of your process and hope you don't miss things. This is not a satisfying answer, and that's why this is also one of the hardest parts of load testing an application.
We'll just skip the whole "running it" part, because honestly, if you've gotten this far through a load test, then running it shouldn't be the hardest part.
Where the complexity lies
So basically, here's where we are:
- Most load testing tools support simplistic workloads, and even the complex ones don't let you do everything that's realistically needed to simulate real usage of a web application.
- Writing the test with a simulation of real usage is the hardest part, even if the tools do support what you need.
- Maintaining the test is the second hardest part, and the tooling here does not help you in the slightest.
Let's look at these in detail and see how much complexity we can pare away.
Simulating users. Do we have to?
I'm a "yes" here, although it might depend on your application. And for these purposes, we're talking about the user of a service; if you have a monolith, this is your users as a whole, but if you have microservices the "user" might be another one of your services! For the applications I've worked on, I have had minor success with targeted tests of specific endpoints. But these end up requiring such complicated setup that you aren't better off than you were with the load test itself! And while it may yield some results and improvements, it doesn't get to everything (you may have endpoints that interact) and you don't get a realistic workload.
"When do you not need to simulate users?" is probably a better question. Seems to me like this is when you know that your endpoints are all independent in performance, you don't have any stateful requests, and the ordering of requests does not impact performance. These are big things to assume and it's hard to have confidence in them without testing their independence, at which point, we're back to writing that whole dang test.
The best you can do here probably happens at API and system design time, not at test time. If you design a simpler API, you're going to have far less surface area to test. If you design a system with pieces that are more clearly independent (distinct databases per service, for example) then it's easier to test them in isolation than in a monolith. Doing this also lets you use a tool that is simpler, so you get two wins!
Writing the tests is hard. So is maintenance.
Creating a load test is hard because you have to do two things: you have to understand what the flow of usage through your API is, and you have to write a simulation of that usage. Understanding that flow means understanding systems other than the one under test, and since your system is presumably not the focus of their documentation, there is not going to be a super clear diagram of when and how it's called; this often looks like sifting through logs until you figure out what the representative usage is. And then writing that simulation is certainly not trivial, because you need to manage the state for a large number of actors representing users of your API!
Oh, and you get to write integration tests for this now, too.
There's some research out there on how to make some of these tasks easier. You can figure out what you need for the initial test, and detect regressions (missing new workloads) from automated analysis of the logs, for example. But as far as I can tell, there is no software on GitHub, let alone a product I can buy, that's going to do that for me. So it doesn't seem to have much traction in industry. It would be a big project to implement it on your own, which might be why it has languished (or is done at big companies, and is not spoken of).
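To be clear about the gap: the crudest version of "detect missing workloads" is just a set difference, and it is nowhere near what the research describes. As a sketch, assuming you can already extract (method, path) pairs from both your production logs and your load test's own request log (the function name and the 1% threshold are my own inventions):

```python
from collections import Counter


def uncovered_endpoints(production, load_test, min_share=0.01):
    """List endpoints seen in production that the load test never hits.

    Both arguments are iterables of (method, path) tuples. Endpoints below
    min_share of production traffic are ignored. The result is sorted with
    the largest share of production traffic first.
    """
    covered = set(load_test)
    counts = Counter(production)
    total = sum(counts.values())
    missing = [
        (endpoint, count / total)
        for endpoint, count in counts.items()
        if endpoint not in covered and count / total >= min_share
    ]
    return sorted(missing, key=lambda item: item[1], reverse=True)
```

Run on a schedule, something like this would at least flag when a new endpoint starts taking real traffic that your test ignores. It won't notice changed request sequences, payload sizes, or timing, which is the hard part.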
Maybe don't load test everything?
There's a lot of complexity in load tests, and there is not a lot of tooling to help you with it. So maybe the answer is: write fewer of these types of tests, and don't expect them to give you all the answers to how your system performs.
You have a few options for getting a great picture of how your system performs:
- Good old analysis. Sit down with a notebook, a pen, an understanding of your systems as a whole, and an afternoon to spare, and you can figure out with some napkin math what the general parameters and bounds of scaling on your system are. When you find the bottleneck, or you have some unknowns (how many transactions per second can our database support? how many do we generate?) then you can go test those specifically!
- Feature rollouts. If you can roll out features slowly across your users, then you don't necessarily have to do any load testing at all! You can measure performance experimentally and see if it's good enough. Good? Roll forward. Bad? Roll back.
- Traffic replay. This doesn't help at all with new features (see feature rollouts ten words ago for that) but it does help with understanding your system's breaking points for existing features without as much development. You can take the traffic you saw before and replay it (multiple times over, even, by combining multiple different periods' traffic) and see how the system performs! (Side note: I would love tooling to help with this, and with amplifying traffic when doing this, so if anyone has a recommendation... hit me up.)
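For the "combining multiple different periods' traffic" idea in the traffic replay option, here is a minimal sketch of the amplification step. It assumes each recorded period is a list of (offset_seconds, method, path) tuples, already sorted by offset from the start of that period. That record format is hypothetical, and the sketch also assumes requests can be replayed independently of each other, which is often not true for stateful sessions.

```python
import heapq


def combine_periods(*periods):
    """Overlay several recorded periods into one timeline ordered by offset.

    Each period is a list of (offset_seconds, method, path) tuples sorted by
    offset. Overlaying N periods gives roughly N times the request rate.
    """
    return list(heapq.merge(*periods, key=lambda request: request[0]))


# Two made-up recordings, each starting at offset 0.
morning = [(0.0, "GET", "/a"), (2.0, "POST", "/b")]
evening = [(1.0, "GET", "/c"), (2.0, "GET", "/a")]
timeline = combine_periods(morning, evening)
```

A replayer would then walk the timeline and send each request at its offset. That part, plus rewriting tokens and IDs so the replayed requests are still valid, is the tooling I'd love to find.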
If you have some silver bullet I've missed, or a fantastic research paper in this area you'd recommend reading, or a story of terrible times with scaling that you want to share with me, please email them to email@example.com.