Disaster Recovery Playbook
Recovery procedures for the worst-case scenarios. Print this out and keep it in your ops runbook.
Scenario 1 — dockmesh server host is dead
Impact: UI unreachable. Containers keep running on every other host; Docker doesn’t need dockmesh to keep doing its job. Agents reconnect in a loop until the server is back. Target RTO: under 10 minutes with a recent backup in hand.
Prerequisites (set these up BEFORE disaster)
- Automated system backups to an off-host target (SFTP / SMB / S3 / WebDAV). Configure under Settings → Backups with a daily schedule and the `dockmesh-system` job type. Verify the backup regularly (see Scenario 4).
- Off-host DNS / LB entry for your dockmesh URL so agents find the new server without manual agent reconfig.
- Unencrypted system backups, OR encrypted with a key stored OUTSIDE the dockmesh data dir (the server’s own age key is inside the tarball; self-encrypted system backups can’t self-restore).
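During drills it is worth confirming which of those two states a given archive is actually in. A minimal sketch: the binary age format begins with the literal marker `age-encryption.org/v1`, so a header check tells you whether a backup needs a key at all.

```shell
# Sketch: detect whether a backup tarball was age-encrypted, so you
# never file a self-encrypted system backup as your recovery copy.
# The binary age format starts with "age-encryption.org/v1".
is_age_encrypted() {
  head -c 21 "$1" | grep -q 'age-encryption.org'
}

# is_age_encrypted dockmesh-system-20260419-0300.tar.gz && \
#   echo "encrypted: make sure the key lives OUTSIDE the data dir"
```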
Recovery steps (~8-10 minutes)
- Spin up a new host: any Linux with Docker 20.10+ and a few GB of disk. A fresh cloud VM is fine; you don’t need to replicate the old spec.
- Install the dockmesh binary (not the data, just the binary):

  ```shell
  curl -fsSL https://get.dockmesh.dev | bash
  # Stop the service so systemd doesn't fight us during restore
  systemctl stop dockmesh
  ```

- Copy your latest backup to the new host:

  ```shell
  # From your workstation, or scp/rsync from the backup target
  scp dockmesh-system-20260419-0300.tar.gz newhost:/tmp/
  ```

- Run the restore command:

  ```shell
  dockmesh restore --from /tmp/dockmesh-system-20260419-0300.tar.gz
  ```

  `restore` writes the DB, `/stacks/`, and the `data/` dir into the paths `dockmesh config show` lists. It refuses to overwrite a populated install without `--force`; on a fresh host this doesn’t trigger. It runs the post-restore sanity checks automatically and prints `sanity OK` when the state is bootable.

- Point DNS / your load balancer at the new host so agents find it.
- Start the service:

  ```shell
  systemctl start dockmesh
  journalctl -u dockmesh -f
  ```

- Verify:
  - Log in with your usual admin credentials (they came back with the DB).
  - Agents page should list every agent as `online` within 30-60s. They’re still authenticating against the CA cert that came back with the restored `data/` dir.
  - Stacks page should look as it did at backup time.
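The verification step can be scripted so it doesn’t rely on someone watching a browser at 3am. A small sketch, built around a generic retry helper; the URL in the usage line is a placeholder, and any authenticated request that returns success once the UI is back will do.

```shell
# Generic retry-with-delay helper for post-restore smoke tests.
retry() {                      # retry <attempts> <cmd...>
  local max=$1 n=0; shift
  until "$@"; do
    n=$((n + 1))
    [ "$n" -ge "$max" ] && return 1
    sleep 2
  done
}

# Usage after `systemctl start dockmesh` (URL is a placeholder):
# retry 30 curl -fsS -o /dev/null https://dockmesh.example.com/ \
#   && echo "UI is back"
```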
What if the sanity check warns about missing CA / genesis?
The restore warns loudly when:

- `agents-ca.crt` / `agents-ca.key` missing → every agent has to re-enroll. Annoying but not catastrophic. Delete each agent from the UI, then run the install-agent command on each host.
- `audit-genesis.sha256` missing → the chain-of-custody claim has a break across the restore boundary. The chain from the first row after the restore is still verifiable against each row’s stored hashes; you just can’t verify against the ORIGINAL genesis anymore. Note this in your audit trail if compliance matters.
- `secrets.env` (JWT secret) missing → every existing session is invalidated and users have to log in again. First boot after the restore generates a fresh JWT secret.

These all surface as `[warn]` rows from `dockmesh restore`, not `[FAIL]`.
Time to recover: 15-30 minutes if you have recent backups.
Time to recover without recent backups: days. Don’t be this person.
Scenario 2 — Clock skew breaks mTLS
Impact: Agents can’t connect, logs show TLS errors.

```shell
date -u                              # check server + each agent
systemctl restart systemd-timesyncd  # or: chronyd -q
```

Connections resume within 30s of time correction.
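Checking skew by eyeballing two `date -u` outputs is error-prone; comparing epoch timestamps is not. A sketch (the 60-second threshold and the `agent1` ssh target are illustrative, not dockmesh’s actual cert tolerance):

```shell
# Sketch: flag clock skew between two epoch timestamps.
skew_ok() {                    # skew_ok <epoch_a> <epoch_b> [max_secs]
  local d=$(( $1 - $2 ))
  [ "${d#-}" -le "${3:-60}" ]  # ${d#-} strips the sign = absolute value
}

# Compare server time against an agent host (agent1 is a placeholder):
# skew_ok "$(date -u +%s)" "$(ssh agent1 date -u +%s)" \
#   || echo "clock skew detected: fix NTP before chasing TLS errors"
```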
Scenario 3 — A stack is corrupted (application-level data loss)
Impact: One app is broken but containers are running.
- Stop the stack to prevent further corruption (Stack → Stop)
- Browse backups (Backups → Browse → filter by stack)
- Pick the last known-good run (check timestamps vs. when corruption started)
- Restore — dockmesh restores volumes, optionally the compose file too, and restarts
- Verify application-level consistency before letting traffic back in
If the restore itself causes issues, dockmesh keeps the pre-restore volume copy for 24 hours.
Scenario 4 — “Is our backup actually restorable?” (pre-disaster verify)
The single most valuable DR practice is knowing your backups work before you need them. Backups that have never been restored are Schrödinger’s backups: half the time they turn out to be corrupt / encrypted-with-a-lost-key / missing-the-CA when you discover them at 3am.
Option A — Upload-and-verify via UI (non-destructive)
- Settings → System → Verify a backup → file picker.
- Upload your latest `dockmesh-system-*.tar.gz`.
- The server extracts it to a temp dir, runs the same sanity checks `dockmesh restore` does, then deletes the temp dir. The live install is never touched.
- Result: a table of `[ ok ]` / `[warn]` / `[FAIL]` rows for each check. Pass = you can restore this backup successfully.
Same endpoint from the CLI:
```shell
curl -H "Authorization: Bearer $DM_TOKEN" \
  -F "file=@dockmesh-system-20260419-0300.tar.gz" \
  https://dockmesh.example.com/api/v1/restore/verify
```

Option B — Full dry-run via dockmesh restore

```shell
# Walks the tarball without writing anything. Checks entry structure +
# prints the would-restore target for each file + a running total.
dockmesh restore --from backup.tar.gz --dry-run
```

Doesn’t run the sanity checks (no DB to check), but catches corrupt tarballs, encrypted archives, and unexpected prefixes.
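You can approximate the structural part of the dry-run with plain `tar` on any machine that has the archive. A sketch, assuming the layout the restore section describes (`stacks/` and `data/` near the top of the archive); adjust the patterns to what your tarballs actually contain:

```shell
# Sketch: cheap structural check on an (unencrypted) system backup.
check_layout() {
  local entries
  entries=$(tar -tzf "$1") || return 1  # corrupt/encrypted tarballs fail here
  echo "$entries" | grep -q 'stacks/' &&
  echo "$entries" | grep -q 'data/'
}

# check_layout dockmesh-system-20260419-0300.tar.gz || echo "backup looks wrong"
```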
Automate it
Schedule a weekly verify of your latest system-backup run against Option A. If any week’s result isn’t `passed: true`, page the on-call.
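A cron-able sketch of that weekly check, reusing the Option A endpoint from the curl example above; `DM_TOKEN`, the backup path, and the `page-oncall` command are placeholders for your own environment:

```shell
# Sketch: weekly backup-verify job. Parses the verify response for
# "passed": true; anything else should page someone.
verify_passed() {              # reads the JSON verify result on stdin
  grep -q '"passed"[[:space:]]*:[[:space:]]*true'
}

# latest=$(ls -t /backups/dockmesh-system-*.tar.gz | head -n 1)
# curl -fsS -H "Authorization: Bearer $DM_TOKEN" \
#      -F "file=@$latest" \
#      https://dockmesh.example.com/api/v1/restore/verify |
#   verify_passed || page-oncall "dockmesh backup verify failed"
```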
Scenario 5 — A single host is dead (hardware failure, unrecoverable)
Impact: Stacks on that host are offline. Other hosts unaffected.
Use drain if possible (Hosts → host → Drain) to move stacks off before decommissioning. If the host is already unreachable, use per-stack volume backups to restore onto a replacement host.
Scenario 6 — Agent mTLS certs are compromised
```shell
dockmesh ca rotate --reissue-all-agents
```

This rotates the server CA. All agents re-enroll on next connect. Review the audit log for the pre-rotation window. Rotate any secrets (DB passwords, API keys) that could have leaked through compromised agents.
Scenario 7 — Database corruption (SQLite)
```shell
systemctl stop dockmesh
sqlite3 /opt/dockmesh/data/dockmesh.db "PRAGMA integrity_check;"
```

If it reports `ok` → run `VACUUM;`. If corrupt → dump + reimport, or (likelier) restore from backup per Scenario 1.
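The dump-and-reimport path as a sketch: rebuilding the database from its SQL dump often survives page-level corruption that blocks direct reads. Stop dockmesh first; the path follows the check above.

```shell
# Sketch: rebuild a SQLite DB from its SQL dump, then re-check integrity.
salvage_db() {                 # salvage_db <corrupt.db> <out.db>
  sqlite3 "$1" ".dump" | sqlite3 "$2" &&
  [ "$(sqlite3 "$2" 'PRAGMA integrity_check;')" = "ok" ]
}

# salvage_db /opt/dockmesh/data/dockmesh.db /tmp/recovered.db &&
#   mv /tmp/recovered.db /opt/dockmesh/data/dockmesh.db
```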
Scenario 8 — Ransomware / malicious delete
Isolate the host, don’t restart, and mount disks read-only on a clean host for forensics. Rebuild from scratch on new hosts and restore from OFF-SITE backups; if your backups are only on the same host as the ransomware’d dockmesh, they’re gone too. Off-site + air-gapped backups are the only defense.
Monthly fire drill
Once a month:
- Run Scenario 4 option A against your latest system backup. If it doesn’t pass clean, stop — your backups are not actually DR-ready until you fix that.
- Pick one random stack. Pretend its host is gone. Restore the stack’s latest volume backup to a different host.
- Verify the restored stack works end-to-end (hit its port, log in, read some data).
- Document any surprises in your internal runbook.
A disaster recovery plan you haven’t tested is a disaster recovery hope.
Backup retention for disaster recovery
Minimum recommended retention:
- Last 7 daily backups
- Last 4 weekly backups
- Last 12 monthly backups
- 1 annual backup (compliance / long-term)
Off-site (separate failure domain) copies of all of the above.
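If your backup target is a plain directory, one way to enforce the daily tier externally; the weekly and monthly tiers run the same way with their own globs and counts. (dockmesh’s built-in backup scheduler may already handle retention; this is an assumption-laden sketch for targets it doesn’t manage.)

```shell
# Sketch: keep only the newest N archives matching a glob in a directory.
prune_keep() {                 # prune_keep <dir> <glob> <keep>
  ls -t "$1"/$2 2>/dev/null | tail -n +"$(( $3 + 1 ))" | xargs -r rm -f
}

# prune_keep /backups 'dockmesh-system-*.tar.gz' 7   # daily tier
```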
See also
- Backup & Restore — setup + target types
- Hardening — prevent disasters
- Troubleshooting — for issues not yet at disaster level