Solving my fun, frustrating docker-machine error
Tuesday, December 8, 2020
Last Saturday, I ran into a problem doing a routine backup of a web app I maintain. In fact, this was the second time I ran into the exact same issue, so it's time to write it down. (Hopefully, the third time I run into this, I'll have the presence of mind to look up my own solution!)
My web app is deployed using docker-machine and docker-compose. This is not a great production setup, but it works for me and there are just a handful of users. Every week, I manually run a backup script that copies down the database and all the images from this web app. (I could set up a cron job, but I consciously chose to keep it manual so I would, every week, be able to see that the backups are working: this has paid off, since I saw when it did NOT work!)
When I ran the backups, I ran into a mysterious error message:
$ backup.sh
Error checking TLS connection: Error checking and/or regenerating the certs: There was an error validating certificates for host "xx.xx.xx.xx:2376": dial tcp xx.xx.xx.xx:2376: i/o timeout
You can attempt to regenerate them using 'docker-machine regenerate-certs [name]'.
Be advised that this will trigger a Docker daemon restart which might stop running containers.
Error response from daemon: Container 06916a79c735b152c287c8aaa57ff65958898f819c604ceee83fadf3f502922f is not running
First thought: Okay, well, that's weird that the certs are expired but let's just follow what it says. Let's regenerate those. So, I did, and then... the entire app was down, because it shut down the containers but could not start them! Now a routine backup has turned into an outage.
Aaand I can't see the machine:
$ docker-machine ls
NAME        ACTIVE   DRIVER    STATE     URL   SWARM   DOCKER   ERRORS
picklejar            generic   Timeout
The strange thing? docker-machine ssh <host> worked.
So... I cannot see the machine, I cannot validate the certs, but ssh works.
If you're screaming at the monitor right now because the answer is obvious, I know. I missed it in the moment, but it was right there in front of me (sort of) in that first error message: dial tcp xx.xx.xx.xx:2376: i/o timeout. This means that we can't establish a TCP connection on that port, which could be... caused by the firewall. Let's not talk about how long it took me to realize this, and how many other things I tried before I had that head smack moment: doh!
The problem was: I have the instance firewalled in a way that allows my home network to establish the TCP connections needed for docker-machine, but no external traffic. BUT I have ssh allowed from any IP address, so that I can get into the host while I'm on the go (or that was the idea, when travel was a thing). So when my ISP issued me a new IP address, suddenly I could do some things on the machine (ssh) but could not do others, leading to this confusing situation of docker-machine kinda sorta half working.
So if you get an error message from docker-machine about an error validating certificates, don't just assume (as I did) that its suggested fix is a good idea: verify that you don't have a network/firewall issue first.
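For next time, here is a rough sketch of the check I wish I had run before touching the certs. It is not part of my backup script; the function names port_open and diagnose are made up for this post. It assumes bash (for the /dev/tcp trick) and the coreutils timeout command, and it assumes ssh is on port 22 and the Docker daemon is on 2376, as in the error message above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: try to open a TCP connection using bash's /dev/tcp.
# Returns 0 if the connection opens within 5 seconds, non-zero otherwise.
# Assumes the coreutils `timeout` command is installed.
port_open() {
  local host="$1" port="$2"
  timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Hypothetical helper: turn the two probe results into a hint.
# Both arguments are the strings "yes" or "no".
diagnose() {
  local ssh_ok="$1" docker_ok="$2"
  if [ "$docker_ok" = "yes" ]; then
    echo "docker port reachable: look at the certs or the daemon"
  elif [ "$ssh_ok" = "yes" ]; then
    echo "ssh reachable but docker port is not: suspect a firewall rule"
  else
    echo "host unreachable: check the network or whether the host is up"
  fi
}
```

To use it, probe both ports and feed the results in: run port_open <host> 22 and port_open <host> 2376, record "yes" or "no" for each, and call diagnose with those two values. In my case it would have been diagnose yes no, which points straight at the firewall instead of at regenerate-certs.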