Army of Gnomes
They say that your servers should be cattle, not pets, but I've never truly subscribed to that theory. Sure, nameless autoscaling behemoths are a great technical achievement. Probably safer and more resilient, too. But it would be tremendous overkill for my hobby projects.
One such project has a database server that's been chugging along happily for many years.
Tonight, on a whim, I tried SSHing to it to check something (I forget what - probably the uptime), and was surprised to have my connection rejected.
What could cause that, I wondered? And of course, I quietly panicked. Had it been compromised? Could it have been corrupted in some way?
I tried several more times, confirmed my SSH agent had all my pubkeys loaded, even tried a few different possible usernames, just in case. Nada.
At this point, my best guess was that the system time had drifted out of sync with atomic time. I've had similar issues in the past, usually related to HTTPS or signed web requests. I suspected that the math involved in establishing SSH connections might be failing due to the drift.
Before I started messing around with my local system time to try and guess how far the clock had drifted, I had the idea to check the backups. I wanted to look at the file timestamps. If there was a gap from when they were supposed to run and the time that they were written to the backup server, that might give me a hint of how far I'd have to adjust my clock. That is, if my atomic time-drift theory was correct.
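The timestamp comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the backup job fires at a known time according to the database server's own clock, the backup server stamps the file with a correct clock, and the job's usual run time is roughly known. The function name and all example times are hypothetical.

```python
from datetime import datetime, timedelta


def estimate_drift(scheduled: datetime, written: datetime,
                   expected_duration: timedelta = timedelta(0)) -> timedelta:
    """Estimate how far the database server's clock lags behind real time.

    scheduled: when the job should fire, per the (possibly drifted) server clock
    written: when the backup server, assumed accurate, recorded the file
    expected_duration: how long the backup normally takes to run and upload
    A positive result means the server clock is running behind.
    """
    return written - scheduled - expected_duration


# Hypothetical values, purely for illustration.
scheduled = datetime(2023, 9, 1, 3, 0, 0)
written = datetime(2023, 9, 1, 3, 27, 30)
drift = estimate_drift(scheduled, written, timedelta(minutes=5))
print(drift)  # 0:22:30
```

The estimate is only as good as the guess at the job's normal duration, so it gives a starting point for adjusting a clock by hand rather than an exact figure.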
Connecting to the backup server proved to be difficult. Signing into it went ... poorly. I used the wrong credentials at first, and then it started throwing captchas at me that were utterly unreadable by anyone's standard.
Eventually, I got the captcha sorted out, and I signed in.
I discovered that the backups hadn't run since SEPTEMBER. Over a month.
Anyway, that avenue of inquiry proved to be a bust. I spent a few minutes guessing at the time-drift theory by hand, and trying to connect from other servers to see if I'd set them up with access (I hadn't). No success.
Then I realized ... well, damn. I need to get a fresh backup. Kind of urgently.
I had another server that was still happily talking to this database, so I connected to that one - thankfully without any issues. I installed a CLI mysql client on it so I could run a mysqldump, and then I spent a few minutes futzing with parameters to get the style of output I wanted. The inter-server credentials are quite locked down, so I used the --single-transaction flag to bypass table locking. The client I was using was also much newer than the server it was talking to. The --column-statistics=0 flag disabled the unsupported column statistics feature. Once that was set, the backup proceeded smoothly. I gzipped it and got it off the intermediate server ASAP. Phew, potential crisis averted.
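For reference, an emergency dump along these lines could be scripted as below. This is a rough sketch, not the exact command that was run: the host, user, database name, and output path are hypothetical, and only the two flags mentioned above come from the post. It assumes a `mysqldump` binary on the PATH that recognizes `--column-statistics=0`.

```python
import gzip
import shutil
import subprocess


def build_dump_command(host: str, user: str, database: str) -> list:
    """Assemble the mysqldump argument list (nothing is executed here)."""
    return [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--password",             # no value given, so mysqldump prompts for it
        "--single-transaction",   # consistent snapshot without locking tables
        "--column-statistics=0",  # skip a feature the older server lacks
        database,
    ]


def dump_to_gzip(command: list, out_path: str) -> None:
    """Run the dump and stream its output straight into a gzip file."""
    with subprocess.Popen(command, stdout=subprocess.PIPE) as proc, \
            gzip.open(out_path, "wb") as out:
        shutil.copyfileobj(proc.stdout, out)
    # Leaving the with-block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, command)


# Hypothetical connection details, for illustration only.
command = build_dump_command("db.example.internal", "backup_user", "appdb")
```

Calling `dump_to_gzip(command, "appdb.sql.gz")` would then write the compressed dump in one pass, with no uncompressed copy left behind on the intermediate server.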
But I still needed to regain access to the database server. The inter-server credentials were for database access only, so they wouldn't help with direct access.
With the potential backup crisis mitigated, I started to speculate about what variable had changed. Which piece of the whole flow had changed most recently?
Then I remembered: I had upgraded macOS!
Some quick searching revealed that the default SSH algorithm lists in macOS Ventura and Sonoma had dropped support for the RSA algorithm (ssh-rsa) that this server was using. Oops. I mean, good - clearing out weak algorithms is the right thing to do. But in this case, it caused a bit of a hiccup. Thankfully, I found a link that had a clear explanation of how to bring RSA back into Sonoma's list of SSH algorithms.
I edited my ssh config for that host to add these two options:
HostkeyAlgorithms +ssh-rsa
PubkeyAcceptedAlgorithms +ssh-rsa
Once I reconfigured it to add that back into the mix, I was able to connect just fine. Like nothing had gone wrong at all.
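One way to confirm that the ssh-rsa options actually took effect for a host is to inspect the configuration the client resolves, which `ssh -G <host>` prints without connecting. The sketch below parses that output. It assumes a recent OpenSSH client that prints the keywords `hostkeyalgorithms` and `pubkeyacceptedalgorithms` (older releases name the second one `pubkeyacceptedkeytypes`). The helper names and the sample text are hypothetical.

```python
import subprocess


def effective_algorithms(ssh_g_output: str, keyword: str) -> list:
    """Return the comma-separated algorithm list for one config keyword."""
    for line in ssh_g_output.splitlines():
        name, _, value = line.strip().partition(" ")
        if name.lower() == keyword.lower():
            return value.split(",")
    return []


def allows_ssh_rsa(ssh_g_output: str) -> bool:
    """True if ssh-rsa is enabled for both host keys and public key auth."""
    keywords = ("hostkeyalgorithms", "pubkeyacceptedalgorithms")
    return all("ssh-rsa" in effective_algorithms(ssh_g_output, kw)
               for kw in keywords)


def read_effective_config(host: str) -> str:
    """Ask the local ssh client for its resolved config for a host."""
    result = subprocess.run(["ssh", "-G", host],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Shortened, made-up sample of what `ssh -G` output can look like.
sample = (
    "hostname db.example.internal\n"
    "hostkeyalgorithms ssh-ed25519,rsa-sha2-512,ssh-rsa\n"
    "pubkeyacceptedalgorithms ssh-ed25519,rsa-sha2-512,ssh-rsa\n"
)
print(allows_ssh_rsa(sample))
```

Running `allows_ssh_rsa(read_effective_config("some-host"))` before and after the config change would show whether the override is being picked up for that host.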
Then I tried invoking the backup script manually to see why it was failing.
After a few minutes of waiting, I got the result:
ERROR: S3 error: 403 (RequestTimeTooSkewed): The difference between the request time and the current time is too large.
Failed to upload backup to S3!
The backups were failing BECAUSE THE SYSTEM TIME HAD DRIFTED TOO MUCH FROM ATOMIC TIME. Vindication!
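A quick way to measure this kind of skew is to compare the local clock against the `Date` header that any well-behaved HTTP server returns. The sketch below does only the comparison. It assumes the header string has already been fetched somehow, and it assumes a 15 minute tolerance, which is my understanding of the limit S3 applies before returning RequestTimeTooSkewed. The function names and example values are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumption: S3 rejects requests whose timestamp is off by more than this.
MAX_SKEW = timedelta(minutes=15)


def clock_skew(local_now: datetime, date_header: str) -> timedelta:
    """Local time minus the remote server's time (negative: local is behind)."""
    remote = parsedate_to_datetime(date_header)
    return local_now - remote


def too_skewed(skew: timedelta, limit: timedelta = MAX_SKEW) -> bool:
    """True if the clocks disagree by more than the allowed tolerance."""
    return abs(skew) > limit


# Hypothetical values: a local clock running 22.5 minutes behind.
local_now = datetime(2023, 11, 1, 20, 0, 0, tzinfo=timezone.utc)
skew = clock_skew(local_now, "Wed, 01 Nov 2023 20:22:30 GMT")
print(too_skewed(skew))
```

On a live system, `local_now` would be `datetime.now(timezone.utc)` and the header would come from a real response, so the result would also include a little network latency.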
I sync'd up to atomic time, re-ran the backup script, and everything went smoothly.
2045 days, 12 hours, 59 minutes.
Even for a pet server, that might be a bit absurd.
Published: November 2, 2023