Solving my fun, frustrating docker-machine error

Tuesday, December 8, 2020

Last Saturday, I ran into a problem doing a routine backup of a web app I maintain. In fact, this was the second time I ran into the exact same issue, so it's time to write it down. (Hopefully, the third time I run into this, I'll have the presence of mind to look up my own solution!)

My web app is deployed using docker-machine and docker-compose. This is not a great production setup, but it works for me and there are just a handful of users. Every week, I manually run a backup script that copies down the database and all the images from this web app. (I could set up a cron job, but I consciously chose to keep it manual so I would, every week, be able to see that the backups are working: this has paid off, since I saw when it did NOT work!)

When I ran the backups, I ran into a mysterious error message:

Error checking TLS connection: Error checking and/or regenerating the certs: There was an error validating certificates for host "xx.xx.xx.xx:2376": dial tcp xx.xx.xx.xx:2376: i/o timeout
You can attempt to regenerate them using 'docker-machine regenerate-certs [name]'.
Be advised that this will trigger a Docker daemon restart which might stop running containers.

Error response from daemon: Container 06916a79c735b152c287c8aaa57ff65958898f819c604ceee83fadf3f502922f is not running

First thought: Okay, well, that's weird that the certs are expired but let's just follow what it says. Let's regenerate those. So, I did, and then... the entire app was down, because it shut down the containers but could not start them! Now a routine backup has turned into an outage.

Aaand I can't see the machine:

$ docker-machine ls
picklejar            generic   Timeout

The strange thing? docker-machine ssh <host> worked.

So... I cannot see the machine, I cannot validate the certs, but ssh works.

If you're screaming at the monitor right now because the answer is obvious, I know. I missed it in the moment, but it was right there in front of me (sort of) in that first error message: dial tcp xx.xx.xx.xx:2376: i/o timeout. This means that we can't establish a TCP connection on that port, which could be... caused by the firewall. Let's not talk about how long it took me to realize this, and how many other things I tried before I had that head smack moment: doh!
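
For next time, here is a minimal sketch of checking a port directly instead of guessing. It is not something docker-machine provides; probe_tcp is a name made up for illustration, and it only uses Python's standard socket module. It tries to open a TCP connection and reports whether the port was open, refused, or timed out.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Try a TCP connection and classify the result.

    Returns "open", "refused", "timeout", or "error: ..." for anything else.
    (Hypothetical helper for illustration, not part of docker-machine.)
    """
    try:
        # create_connection performs the TCP handshake, nothing more.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer at all: typical of a firewall silently dropping packets.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and so on.
        return f"error: {exc}"
```

Calling probe_tcp(host, 22) and probe_tcp(host, 2376) against the same machine is the interesting comparison. If the first says "open" and the second says "timeout", a firewall rule is a more likely culprit than the certs.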

The problem was: I have the instance firewalled in a way that allows my home network to establish the TCP connections needed for docker-machine, but no external traffic. BUT I have ssh allowed from any IP address, so that I can get into the host while I'm on the go (or that was the idea, when travel was a thing). So when my ISP issued me a new IP address, suddenly I could do some things on the machine (ssh) but could not do others, leading to this confusing situation of docker-machine kinda sorta half working.
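
To make the half-working state concrete, here is a small sketch that models that kind of allowlist. The rules and addresses are assumptions for illustration (the IPs come from the ranges reserved for documentation, not from my actual firewall), and is_allowed is a made-up helper built on Python's standard ipaddress module.

```python
import ipaddress

# Hypothetical rules mirroring the setup described above:
# ssh (22) from anywhere, the Docker TLS port (2376) from one home address.
# 203.0.113.10 is a documentation-range placeholder, not a real address.
RULES = {
    22: ["0.0.0.0/0"],
    2376: ["203.0.113.10/32"],
}


def is_allowed(source_ip: str, port: int, rules=RULES) -> bool:
    """Return True if source_ip falls inside any network allowed for port."""
    addr = ipaddress.ip_address(source_ip)
    return any(
        addr in ipaddress.ip_network(cidr) for cidr in rules.get(port, [])
    )


old_home_ip = "203.0.113.10"  # placeholder: the address the firewall knows
new_home_ip = "198.51.100.7"  # placeholder: the address the ISP handed out

for ip in (old_home_ip, new_home_ip):
    print(ip, "ssh:", is_allowed(ip, 22), "docker:", is_allowed(ip, 2376))
```

From the old address both ports are allowed; from the new one only ssh is, which matches the symptoms: docker-machine ssh works while everything that talks to port 2376 times out.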

So if you get an error message from docker-machine about an error validating certificates, don't just assume (as I did) that its suggested fix is a good idea: verify that you don't have a network/firewall issue first.
