technically a blog

home | blog / tags | newsletter | projects | books

Achieving awful compression with digits of pi

Thursday, March 14, 2024

Compression is a really hard problem, and it attracts a lot of interesting ideas. There are some numbers whose digits contain all sequences of digits¹. People have long hypothesized that pi is one such number; a proof remains elusive.

If we have a number which contains all sequences of digits, could we transmit a message using that? Instead of telling my friend Erika the message, I could send her the offset and length in some number where that message occurs, then she could reconstruct the message!

The problem is that you wind up with a much larger message than if you'd just sent what you wanted to in the first place. Let's take a look first at how you'd do such a ridiculous thing, then we'll see why it doesn't work and compute the compression ratio you might actually achieve.

Happy Pi Day!

Finding our message in pi

The first thing we need to do is figure out where our message is in pi.

The obvious approach here is to compute digits of pi, scanning through them and checking where our message is. We can do this with a spigot algorithm, which lets us compute digits sequentially from left to right. Traditional approximations would give us a converging number: 3.2, then 3.05, then 3.13, etc. In contrast, a spigot algorithm would give us 3, then 3.1, then 3.14, etc. Using this lets us scan through pi only until we find our message!

The other thing we would like here is the ability to generate individual digits without computing the preceding digits. If we can do this, it makes decoding a lot faster, because you can start calculating from exactly where the message is rather than all the digits that came before. It also means that we only have to store the current digits we're checking, leading to much lower memory consumption.

The algorithm we're going to use here is the Bailey-Borwein-Plouffe formula, which was discovered in 1995 and allows us to compute the hex digits of pi. Using base-16 digits means it's easier to encode our messages, which are natively in 8-bit byte arrays! Each byte corresponds to two nibbles, which are each hex digits. Perfect.

There weren't any libraries that I wanted to use for this in Rust, so the solution was to port some code from C! One of the authors of the algorithm we're using, David Bailey, has code listings in C and fortran for computing the algorithm. It's easier for me to port some code from C to Rust than from the math of a paper to Rust.

The main computation for digits of pi is this function. I don't understand the details, so don't ask me why; I just ported it.

pub fn pi_digits(id: usize) -> Vec<u8> {
    let s1 = series(1, id as i64);
    let s2 = series(4, id as i64);
    let s3 = series(5, id as i64);
    let s4 = series(6, id as i64);
    let pid = 4. * s1 - 2. * s2 - s3 - s4;
    let pid = pid - pid.floor() + 1.;

    to_hex(pid)
}

We also define the function series, which this uses, but won't go into the details there. The code is available if you want to see it.

Now given this function, how can we compress with it?

We could scan until we find our entire message, but that ends up taking almost literally forever, and longer the bigger your message is. Instead, we're going to limit how many digits of pi we'll scan, and then find the longest matches within there. Our compressed message then, instead of one (offset, length) pair, is a list of such pairs.

We do that with two functions and one struct.

The struct tells us where a match is. Our first function finds our longest partial match that's within our digit limit.

pub struct Location {
    pub offset: usize,
    pub length: usize,
}

pub fn find_pi_match(msg: &[u8], limit: usize) -> Option<Location> {
    let mut best_match: Option<Location> = None;
    let mut offset = 0;

    'outer: while offset < limit {
        let mut pi_digits = PiIterator::from(offset);

        while let Some(b) = pi_digits.next()
            && b != msg[0]
        {
            offset += 1;
            if offset >= limit {
                break 'outer;
            }
        }

        let length = pi_digits
            .zip(msg.iter().skip(1))
            .take_while(|(a, &b)| *a == b)
            .count() + 1;

        if length == msg.len() {
            return Some(Location { offset, length });
        } else if let Some(m) = &best_match {
            if length > m.length {
                best_match = Some(Location { offset, length });
            }
        } else {
            best_match = Some(Location { offset, length });
        }

        offset += 1;
    }

    best_match
}

And then we string together partial matches to cover our entire message. This is our compression function.

pub fn compress(msg: &[u8], limit: usize) -> Vec<Location> {
    let mut locs = vec![];

    let mut index = 0;
    while index < msg.len() {
        let m = find_pi_match(&msg[index..], limit).expect("should find some match");
        index += m.length;
        locs.push(m);
    }

    locs
}

It's not production ready—if it fails to find a match it just panics—but honestly, honestly, is that a worry?

Now we can run this. If we encode the message "hello" with a max offset of 4196², then we get the following compressed message:

[
  Location { offset: 2418, length: 3 },
  Location { offset: 936, length: 3 },
  Location { offset: 60, length: 2 },
  Location { offset: 522, length: 2 },
]

Neat! Our message is "compressed" using pi! But how well does it do?

Measuring our compression ratio

An important part of any compression scheme is the data compression ratio. This is computed as uncompressed-size / compressed-size, and you want as high of a number as possible. If your compression ratio is 4, that means that your original message is 4x larger than your compressed ratio, so you've saved a ton of storage space or transmission bandwidth!

How well does our compression do here? Let's take a look at our example above.

We encoded "hello" and got back an array of four locations. Those were defined with usize for convenience, but each could fit in smaller numbers. Let's be generous and say that we're packing each location into a 16-bit int.

That means that our compressed size is 4 * 16-bits = 4 * 2 bytes = 8 bytes! And our original message was... uh oh. Our original message was 5 bytes. Our compression ratio is 5/8 = 0.6125, a very bad compression ratio!

I ran an experiment for a few message lengths, and the compression ratio stays about the same across them.

chart showing a consistent compression ratio of around 0.6

The ultimate problem here is that, even if you can find your message, you're going to find it so far out that it won't be a reduction of what you have to send! Obviously we were limited here in how far we can compute, but computing further isn't going to solve this problem.

Using pi compression

Naturally, you might now ask, "But Nicole, this sounds great, how can I use it?" It's your lucky day, because you can go download it and use it. Just add it with cargo add pi-compression to get version 3.1.4.

But be careful to abide by the terms of the license. You can pick AGPL, or you can use the Gay Agenda License if you prefer.

Huge thanks to Erika for implementing the pi-based compression with me! It was a blast pairing with you on this. ❤️

These are called normal numbers!

Chosen so that it would terminate in a reasonable amount of time.

If this post was enjoyable or useful for you, please share it! If you have comments, questions, or feedback, you can email my personal email. To get new posts and support my work, subscribe to the newsletter. There is also an RSS feed.

Want to become a better programmer? Join the Recurse Center!
Want to hire great programmers? Hire via Recurse Center!