technically a blog

home | blog / tags | newsletter | projects | books

Writing a basic code formatter

Monday, August 14, 2023

I've been working on my programming language for a couple of months now, in fits and starts¹. In the original post, I laid out my plan for it, and after creating the parser the next step was writing a formatter. I thought this would be a nice intermediate step after writing the parser, something easy to exercise the code without being as complicated as the interpreter.

Well... It was hard, even with the shortcuts I took.

The author of Crafting Interpreters once wrote that a formatter was the hardest program he had written, and now I can see why. Mine is definitely not as sophisticated as his, and it was difficult to figure out on my own.

One of the big challenges I ran into was what interface to use for the formatter. I wound up settling on this trait, along with a companion struct.

/// Trait for types that can be formatted for pretty printing.
pub trait Format {
    fn fmt(&self, ctx: &mut FormatContext) -> String;
}

pub struct FormatContext {
    pub indent: usize,
}

impl FormatContext {
    pub fn indent_incr(&mut self) {
        self.indent += 4;
    }

    pub fn indent_decr(&mut self) {
        self.indent -= 4;
    }
}

In an early iteration I was passing through a Write object directly, and writing into it as I went. The issue with doing that was that I could only do one forward pass through the code, so if I ever wanted to limit line lengths, I couldn't! When I wrote things, I would only know the local content and not what came before or after it.

So, instead I went for returning a String and building it iteratively. This is perhaps not the most efficient choice, but optimization is for future-Nicole. She loves that shit, so that's kind of a gift to my future self.

I also had to make sure to include some context inside each format call. I originally passed in the indentation level directly, but quickly moved that into a mutable context variable. Placing it inside a context struct allows me to more easily add more variables than if I have to add them to each implementation of the trait. Right now the context only contains the indentation level, but could potentially contain more information, such as line lengths or format settings.

After that, it was a matter of just kind of chugging through and having it write out what each different piece of the tree corresponds to. Here's an abridged version of what formatting a Stmt looks like.

impl Format for Stmt {
    fn fmt(&self, ctx: &mut FmtCtx) -> String {
        let tab = " ".repeat(ctx.indent);

        match self {
            Stmt::Import(expr) => {
                format!("{}include {};", " ".repeat(ctx.indent), expr.fmt(ctx))
            }
            Stmt::Declaration(ident, expr) => {
                let expr = expr.fmt(ctx);
                format!("{}let {} = {};", tab, ident.0, expr,)
            }
            // ... SNIP! ...
        }
    }
}

One of the trickiest parts was handling blank lines. My syntax tree did not include these originally, and my parser stripped out all whitespace. What to do? I tried two approaches.

First I tried integrating blank lines into the grammar and parsing it out, so that I could just directly print them. This was a detour, and it was very messy and never worked right. Ultimately, I had to abandon this path because there was no clean way to get it working. The messy way would have involved updating every single part of my parser. No thank you.

Then I stumbled into the more correct (less wrong?) way of doing it. I used the line numbers provided while parsing! If these line numbers differed by more than 1, I knew² that there were extra blank lines between the two elements, so I emitted a BlankLine element in addition to whatever I was parsing.

This is a kludge in some ways, because there are edge cases (like the one in a footnote). I think that the right way to do this is actually to include the line number information on the tokens themselves, and have more information than just the starting line number. Where does a function start and end, for example? But it works for now, and it allows me to potentially use the same tree for both the interpreter and the formatter. This decision may not last forever, but it saves some time now.

There are a few things I skipped over for the sake of keeping the formatter simple. The main one is line length. Like the Go formatter, I just decided to let lines be as long as you want them to be, since that means I never need to deal with wrapping lines: one statement, one line. I also didn't ever collapse blocks, they always span two lines. And if you have a comment at the end of a line it gets shoved onto the next line. Oh, and there's no semblance of proper error handling...

A few of these I would like to fix later on (like the comment formatting and error handling), but others I don't really care about (line length). I'd also like to extend it to have more command-line options so it can format in-place instead of printing to standard out, but I'll probably work on that when I have the interpreter running since it won't matter until then.

And now it runs! We can pass in a messy program like this:

let year = 2023; let greeting =
    "Hurl was created in " + year + "!"
    ;

let p = (func(x) { print(x); });
p(
    greeting);

And get out a clean program like this!

let year = 2023;
let greeting = "Hurl was created in " + year + "!";

let p = (func(x) {
    print(x);
});
p(greeting);

As usual, the code is in the repo if you want to take a look.

This project was harder than I anticipated, and I also learned a lot more than I expected to. And now, like all serious languages, Hurl has a formatter. Next up is the interpreter and a standard library. After that... maybe a language server, and a package manager?

I work full-time, write as a hobby, and have two young kids at home. Free time is limited.

I think there are edge cases where this is not true, like if you have two functions which butt up against each other. They end up as siblings in the parse tree but they're more than 1 line apart, so my mechanism would detect blank lines here. This is okay for my formatter (I want blank lines there) but it's a happy accident, and I'm not happy about it.

If this post was enjoyable or useful for you, please share it! If you have comments, questions, or feedback, you can email my personal email. To get new posts and support my work, subscribe to the newsletter. There is also an RSS feed.

Want to become a better programmer? Join the Recurse Center!
Want to hire great programmers? Hire via Recurse Center!