Profiling Rust programs the easy way
Monday, December 4, 2023
Performance is one of the big reasons to use Rust. It's not a magic wand for performance; it just gives you the control to eke out whatever performance you need. So if your program is still slow, how do you fix that?
Profiling your program is one of the best options for figuring out why it's slow and where you need to focus your improvement. Without profiling, you're guessing blindly at where the problem may lie. With a profile, you can see where most of the time is spent and focus your efforts.
There are a few ways to profile Rust programs, but my favorite is flamegraph (also called cargo-flamegraph, after how you invoke it). It's a wonderful tool that wraps the standard profilers perf (on Linux) and dtrace (on macOS).
The basic usage of flamegraph is quite straightforward, and its docs cover it well, but the number of options can be daunting. At its most basic, after you install it and its dependencies[1], you can run it as a cargo command.
Here are a few of the invocations I use.
```shell
# Run your default target with no arguments
cargo flamegraph

# Run your default target with the arguments after --
cargo flamegraph -- arg1 arg2 arg3

# Run the specified bin target with arguments
cargo flamegraph -b mybin -- arg1 arg2 arg3

# Run your default target with arguments and save
# the results to a different filename
cargo flamegraph -o myoutput.svg -- arg1 arg2 arg3
```
You can mix and match these options to combine them.
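For instance, a combined invocation (reusing the hypothetical mybin target and myoutput.svg filename from above) might look like:

```shell
# Profile the `mybin` target, pass it arguments, and write
# the flamegraph to a custom filename, all in one invocation
cargo flamegraph -b mybin -o myoutput.svg -- arg1 arg2 arg3
```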
Running one of these commands will produce a file, named
flamegraph.svg unless you overrode the output filename.
After you generate that, you'll want to check that the profile captured enough data to be useful. The output tells you how much data it recorded and how many samples it took, something like this:
```
[ perf record: Woken up 59 times to write data ]
[ perf record: Captured and wrote 14.706 MB perf.data (925 samples) ]
```
In this example, we have 925 samples, which is probably enough to make progress on the big things. How many samples you need will vary depending on your program, but I've not found good results too far below 1,000, and going far above that seems to make profiling really slow. If you have big, sweeping inefficiencies, fewer samples will still catch them. If the gains are relatively subtle and small, you may need many more samples. It's an art to figure out how to tune the sample size.
To control how many samples you get, you have two options: you can change your program, or change the instrumentation.
Sampling is done at a particular frequency, so you can control the program duration and you can control the sampling frequency.
If you're getting very few samples, but you can make your program run for longer (larger input, multiple repetitions, etc.) then that can increase the samples you get.
The same applies in reverse for too many samples.
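One way to lengthen a short-running program is to wrap the hot path in a repetition loop. This is a minimal sketch; `run_workload` and the `PROFILE_REPS` environment variable are hypothetical names of my own invention, not part of flamegraph:

```rust
// Sketch: repeat the workload so a short-running program runs long
// enough for the profiler to collect sufficient samples.
fn run_workload(input: &[u64]) -> u64 {
    // Stand-in for your program's actual hot path.
    input.iter().map(|x| x.wrapping_mul(*x)).sum()
}

fn main() {
    let input: Vec<u64> = (0..1_000_000).collect();

    // Read the repetition count from an env var so you can tune the
    // profiled duration without recompiling. At roughly 1,000 samples
    // per second, an N-second run yields about N * 1,000 samples.
    let reps: usize = std::env::var("PROFILE_REPS")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(1);

    let mut acc: u64 = 0;
    for _ in 0..reps {
        acc = acc.wrapping_add(run_workload(&input));
    }
    // Print the result so the compiler can't optimize the work away.
    println!("checksum: {acc}");
}
```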
The other option is to change the instrumentation itself: you can use the -F argument to alter the sampling frequency.
```shell
# Sample at a rate of 1997 samples per second
cargo flamegraph -F 1997 -- arg1 arg2 arg3
```
From here, with a good sample in hand, the work is back to you, the programmer-analyst.
I like to open the SVG file in Firefox, which has a convenient viewer that allows you to zoom in and examine individual stacks of events.
But you can use any suitable SVG viewer.
You should be able to navigate around the
flamegraph to see visually where CPU time is being spent and use that to concentrate your efforts.
For a stronger introduction to how to read and use a flamegraph, see the flamegraph docs, which have a section dedicated to this.
A few gotchas
While doing various profiling of my Rust programs, I've hit a few gotchas that tripped me up. Here are the ones I remember. There are certainly more, so let me know if there's something I should add to this list!
- Missing system calls. When the system under test spends a lot of time in system calls, the flamegraph can be misleading if those calls aren't captured. Since system calls transfer control to the kernel, a standard user typically cannot measure them, and perf by default runs as you! To get around that, you can have it sample as root. In flamegraph you would add the --root flag, which uses sudo to get the privileges to sample everything, including during system calls. This is especially important when you're doing anything with a lot of disk or network activity; otherwise the code calling those system calls may be missing and you will be on a wild goose chase!
- Optimizations hiding information. As stated in the flamegraph docs, "Due to optimizations etc... sometimes the quality of the information presented in the flamegraph will suffer when profiling release builds." To address this, you can either set debug = true for your release profile, or use the environment variable CARGO_PROFILE_RELEASE_DEBUG=true when you run cargo flamegraph.
- Lockstep sampling. As Brendan Gregg points out, sampling frequencies should be set off from typical frequencies used by programs. If you sample at a frequency like 100 Hz, you may end up in sync with a repeating event in your program, sampling the same point over and over instead of sampling across the entire program. You can experiment with different frequencies and see if any of them produce notably better or worse results; if they're all about the same, then you're probably not in lockstep with your program.
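To make these gotchas concrete, the corresponding invocations look something like this (a sketch; the --root flag and the debug env var come from the flamegraph docs):

```shell
# Sample as root so time spent in system calls is captured
cargo flamegraph --root -- arg1 arg2 arg3

# Keep debug info in the release build so symbols survive optimization
CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph -- arg1 arg2 arg3

# Try a different (prime) sampling frequency to rule out lockstep sampling
cargo flamegraph -F 997 -- arg1 arg2 arg3
```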
Now go forth[2] and profile your programs!
[1] On Fedora, I had to install perf with sudo dnf install perf, and I had to downgrade it (sudo dnf downgrade perf) since the latest version has a regression which results in mangled names appearing in the generated results. If your results don't have the function names you expect, check for that.
[2] How many programming languages can you fit in a relatively normal-sounding English sentence? Also, is there a language called "now" yet?