Testing a WebSocket that could hang open for hours

Monday, July 1, 2024

I recently ran into a bug in some Go code that no one had touched in a few years. The code in question was not particularly complicated, and had been reviewed by multiple people. It included a timeout, and is straightforward: allow a Websocket connection to test that the client can open those successfully, and then close it.

The weird thing is that some of these connections were being held open for a long time. There was a timeout of one second, but sometimes these were still open after twelve hours. That's not good!

This bug ended up being instructive in both Go and in how WebSockets work. Let's dive in and see what was going on, then what it tells us!

Comic showing a caterpillar and a butterfly, representing the transformation of HTTP requests into WebSockets.

Identifying the bug

The preliminary investigation found that this was happening for users with a particular VPN. Weird, but not particularly helpful.

After the logs turned up little useful info, I turned to inspecting the code. It was pretty easy to see that the code itself had a bug, in a classic new-to-Go fashion. The trickier thing (for later) was how reproduce the bug and verify it in a test.

The bug was something like this:

for {
    select {
    case <-ctx.Done():
        // we timed out, so probably log it and quit!
        return

    default:
        _, _, err := conn.ReadMessage()

        if err != nil {
            // ...
        }
    }
}

There are two conspiring factors here: first, we're using a default case in the select, and second, that default case has no read deadline. The default case is run when no other case is ready, which is the case until we time out. The issue is that we won't interrupt this case when the other one becomes ready. And in that case, conn.ReadMessage() will wait until it receives something if no read deadline has been set.

The question then becomes, how do we actually run into this case?

How does this happen?

This is a weird case, because it requires the end client to misbehave. Right before the bugged for loop, the server sent a WebSocket close frame to the client. If you have such a connection open in your browser, then when it receives the close frame it will send one back. This is part of the closing handshake for WebSockets. So if we get nothing back, that means that something went wrong.

Taking a step back, let's refresh some details about WebSockets. WebSocket connections are bidirectional, much like TCP connections: the client and the server can send messages and these messages can interleave with each other. In contrast, a regular HTTP connection follows a request-response pattern where the client sends a request and then the server sends a single response1.

But the cool thing is that WebSockets start out life as a regular HTTP request. When you send a WebSocket request, the body starts as something like this2:

GET /websocket/ HTTP/1.1
Host: server.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==
Sec-WebSocket-Version: 13

After this request, the server ideally responds saying it'll switch protocols with something like this response:

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: HSmrc0sMlYUkAGmm5OPpG2HaGWk=

After that's done, then both ends switch to a different binary protocol that's not related to HTTP. Pretty neat that it starts life as a regular HTTP request!

Now that we have a WebSocket open, the server and client can each send messages. These are either data messages or control messages. Data messages are what we send and receive in our applications and are what you usually see and handle. Control messages are used to terminate the connection or do other operational things, and are usually hidden from the application.

When the connection ends, you're supposed to send a particular control message: a close frame. After receiving it, the other side is supposed to respond with a close frame. And then you can both close the underlying network connection and move on with your lives.

But it turns out that sometimes that doesn't happen! This could be that the client connecting to your server is doing something naughty and didn't send it to leave you hanging. Or maybe the network was cut and the message didn't get back to you, or maybe the other end of the connection vanished in a blaze of thermite.

Whatever the cause, when this happens, if you're waiting for that close frame you'll be waiting a long time. So now we have to reproduce it in a test.

Leaving the server hanging in a test

Reproducing the bug was a bit tricky since I couldn't use any normal ways of opening a WebSocket. Those implementations all assume you want a correct implementation but oh, no, I want a bad implementation. To do that, you have to roll up your sleeves and do the request by hand on top of TCP.

The test relies on opening a TCP connection, sending the upgrade request, and then just... not responding or sending anything. Then you periodically try to read from the connection. If you get back a particular error code on the read, you know the server has closed the TCP connection. If you don't, then it's still open!

This is what it looks like, roughly. Here I've omitted error checks and closing connections for brevity; this isn't production code, just an example. First, we open our raw TCP connection.

addr := server.Addr().String()
conn, err := net.Dial("tcp", addr)

Then we send our HTTP upgrade request. Go has a nice facility for doing this: we can form an HTTP request and put it onto our TCP connection3.

req, err := http.NewRequest("GET", url, nil)
req.Header.Add("Upgrade", "websocket")
req.Header.Add("Connection", "Upgrade")
req.Header.Add("Sec-WebSocket-Key", "9x3JJHMbDL1EzLkh9GBhXDw==")
req.Header.Add("Sec-WebSocket-Version", "13")

err = req.Write(conn)

We know the server is going to send us back an upgrade response, so let's snag that from the connection. Ideally we'd check that it is an upgrade response but you know, cutting corners for this.

buf := make([]byte, 1024)
_, err = conn.Read(buf)

And then we get to the good part. Here, what we have to do is we just wait and keep checking if the connection is open! The way we do that is we try to read from the connection with a read deadline. If we get io.EOF, then we know that the connection closed. But if we get nothing (or we read data) then we know it's still open.

You don't want your test to run forever, so we set a timeout4 and if we reach that, we say that the test failed: it was held open longer than we expected! But if we get io.EOF before then, then we know it was closed as we hoped. So we'll loop and select from two channels, one which ticks every 250 ms, and the other which finishes after 3 seconds.

ticker := time.NewTicker(250 * time.Millisecond)
timeout := time.After(3 * time.Second)

for {
    select {
        case <-ticker.C:
            conn.SetReadDeadline(time.Now().Add(10 * time.Millisecond))
            buf := make([]byte, 1)
            _, err = conn.Read(buf)

            if err == io.EOF {
                // connection is closed, huzzah! we can return, success
                return
            }

        case <-timeout:
            // if we get here, we know that the connection didn't close.
            // we have a bug, how sad!
            assert.Fail(t, "whoops, we timed out!")
            return
    }
}

Resolving the bug

To resolve the bug, you have two options: you can set a read deadline, or you can run the reads in a goroutine which sends a result back when you're done.

Setting a read deadline is straightforward, as seen above. You can use it and then you'll be happy, because the connection can't hang forever on a read! The problem is, in the library we were using, conn.SetReadDeadline sets it for the underlying network connection and if it fails, the whole WebSocket is corrupt and future reads will fail.

So instead, we do it as a concurrent task. This would look something like this:

waitClosed := make(chan error)
go func() {
    _, _, err := conn.ReadMessage()
    if err != nil {
        // ...
    }

    waitClosed <- err
}()

timeout := time.After(3 * time.Second)

for {
    select {
    case <-timeout:
        // we timed out, so close the conection and quit!
        conn.Close()
        return

    case <-waitClosed:
        // success! nothing needed here
        return
    }
}

It looks like it will leak resources, because won't that goroutine stay open even if the we hit the timeout? The key is that when we hit the timeout we close the underlying network connection. This will cause the read to finish (with an error) and then that goroutine will also terminate.


It turns out, there are a lot of places for bugs to hide in WebSockets code and other network code. And with existing code, a bug like this which isn't causing any obvious problems can lurk for years before someone stumbles across it. That's doubly true if the code was trying to do the right thing but had a bug that's easy to miss if you're not very familiar with Go.

Debugging things like this is a joy, and always leads to learning more about what's going on. Every bug is an opportunity to learn more.


Thanks to Erika Rowland and Dan Reich for providing feedback on a draft of this post.


1

There are other ways that HTTP requests can work, such as with server-sent events. And a single connection can send multiple resources. But the classic single-request single-response is a good mental model for HTTP most of the time.

2

This example is from the WebSockets article on Wikipedia.

3

I wanted to do this in Rust (my default choice) but found this part of it much easier in Go. I'd still like to write a tool that checks WebSockets for this behavior (and other naughty things), so I might dig in some more with Rust later.

4

The first time I wrote this test, I had the timeout inline in the case, which resulted in never timing out, because it was created fresh every loop.


If this post was enjoyable or useful for you, please share it! If you have comments, questions, or feedback, you can email my personal email. To get new posts and support my work, subscribe to the newsletter. There is also an RSS feed.

Want to become a better programmer? Join the Recurse Center!
Want to hire great programmers? Hire via Recurse Center!