Disaster Recovery Playbook
Recovery procedures for the worst-case scenarios. Print this out and keep it in your ops runbook.
Scenario 1 — dockmesh server host is dead
Impact: UI unreachable. Containers keep running on every other host; Docker doesn’t need dockmesh to keep doing its job. Agents reconnect in a loop until the server is back. Target RTO: under 10 minutes with a recent backup in hand.
Prerequisites (set these up BEFORE disaster)
- Automated system backups to an off-host target (SFTP / SMB / S3 / WebDAV). Configure under Settings → Backups with a daily schedule and the `dockmesh-system` job type. Verify the backup regularly (see Scenario 4).
- Off-host DNS / LB entry for your dockmesh URL so agents find the new server without manual agent reconfig.
- Unencrypted system backups, OR encrypted with a key stored OUTSIDE the dockmesh data dir (the server’s own age key is inside the tarball; self-encrypted system backups can’t self-restore).
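During drills it is worth confirming which of those two states a given archive is actually in. A minimal sketch: the binary age format begins with the literal marker `age-encryption.org/v1`, so a header check tells you whether a backup needs a key at all.

```shell
# Sketch: detect whether a backup tarball was age-encrypted, so you
# never file a self-encrypted system backup as your recovery copy.
# The binary age format starts with "age-encryption.org/v1".
is_age_encrypted() {
  head -c 21 "$1" | grep -q 'age-encryption.org'
}

# is_age_encrypted dockmesh-system-20260419-0300.tar.gz && \
#   echo "encrypted: make sure the key lives OUTSIDE the data dir"
```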
Recovery steps (~8-10 minutes)
- Spin up a new host: any Linux with Docker 20.10+ and a few GB of disk. A fresh cloud VM is fine; you don’t need to replicate the old spec.
- Install the dockmesh binary (not the data, just the binary):

  ```shell
  curl -fsSL https://get.dockmesh.dev | bash
  # Stop the service so systemd doesn't fight us during restore
  systemctl stop dockmesh
  ```

- Copy your latest backup to the new host:

  ```shell
  # From your workstation, or scp/rsync from the backup target
  scp dockmesh-system-20260419-0300.tar.gz newhost:/tmp/
  ```

- Run the restore command:

  ```shell
  dockmesh restore --from /tmp/dockmesh-system-20260419-0300.tar.gz
  ```

  `restore` writes the DB, `/stacks/`, and the `data/` dir into the paths `dockmesh config show` lists. It refuses to overwrite a populated install without `--force`; on a fresh host this doesn’t trigger. It runs the post-restore sanity checks automatically and prints `sanity OK` when the state is bootable.

- Point DNS / your load balancer at the new host so agents find it.
- Start the service:

  ```shell
  systemctl start dockmesh
  journalctl -u dockmesh -f
  ```

- Verify:
  - Log in with your usual admin credentials (they came back with the DB).
  - Agents page should list every agent as `online` within 30-60s. They’re still authenticating against the CA cert that came back with the restored `data/` dir.
  - Stacks page should look as it did at backup time.
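The verification step can be scripted so it doesn’t rely on someone watching a browser at 3am. A small sketch, built around a generic retry helper; the URL in the usage line is a placeholder, and any authenticated request that returns success once the UI is back will do.

```shell
# Generic retry-with-delay helper for post-restore smoke tests.
retry() {                      # retry <attempts> <cmd...>
  local max=$1 n=0; shift
  until "$@"; do
    n=$((n + 1))
    [ "$n" -ge "$max" ] && return 1
    sleep 2
  done
}

# Usage after `systemctl start dockmesh` (URL is a placeholder):
# retry 30 curl -fsS -o /dev/null https://dockmesh.example.com/ \
#   && echo "UI is back"
```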
What if the sanity check warns about missing CA / genesis?
The restore warns loudly when:

- `agents-ca.crt` / `agents-ca.key` missing → every agent has to re-enroll. Annoying but not catastrophic. Delete each agent from the UI, then run the install-agent command on each host.
- `audit-genesis.sha256` missing → the chain-of-custody claim has a break across the restore boundary. The chain from the first row after the restore is still verifiable against each row’s stored hashes; you just can’t verify against the ORIGINAL genesis anymore. Note this in your audit trail if compliance matters.
- `secrets.env` (JWT secret) missing → every existing session is invalidated and users have to log in again. First boot after the restore generates a fresh JWT secret.

These all surface as `[warn]` rows from `dockmesh restore`, not `[FAIL]`.
Time to recover: 15-30 minutes if you have recent backups.
Time to recover without recent backups: days. Don’t be this person.
Scenario 2 — Clock skew breaks mTLS
Impact: Agents can’t connect, logs show TLS errors.

```shell
date -u                              # check server + each agent
systemctl restart systemd-timesyncd  # or: chronyd -q
```

Connections resume within 30s of time correction.
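Checking skew by eyeballing two `date -u` outputs is error-prone; comparing epoch timestamps is not. A sketch (the 60-second threshold and the `agent1` ssh target are illustrative, not dockmesh’s actual cert tolerance):

```shell
# Sketch: flag clock skew between two epoch timestamps.
skew_ok() {                    # skew_ok <epoch_a> <epoch_b> [max_secs]
  local d=$(( $1 - $2 ))
  [ "${d#-}" -le "${3:-60}" ]  # ${d#-} strips the sign = absolute value
}

# Compare server time against an agent host (agent1 is a placeholder):
# skew_ok "$(date -u +%s)" "$(ssh agent1 date -u +%s)" \
#   || echo "clock skew detected: fix NTP before chasing TLS errors"
```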
Scenario 3 — A stack is corrupted (application-level data loss)
Impact: One app is broken but containers are running.
- Stop the stack to prevent further corruption (Stack → Stop)
- Browse backups (Backups → Browse → filter by stack)
- Pick the last known-good run (check timestamps vs. when corruption started)
- Restore — dockmesh restores volumes, optionally the compose file too, and restarts
- Verify application-level consistency before letting traffic back in
If the restore itself causes issues, dockmesh keeps the pre-restore volume copy for 24 hours.
Scenario 4 — “Is our backup actually restorable?” (pre-disaster verify)
The single most valuable DR practice is knowing your backups work before you need them. Backups that have never been restored are Schrödinger’s backups: half the time they turn out to be corrupt / encrypted-with-a-lost-key / missing-the-CA when you discover them at 3am.
Option A — Upload-and-verify via UI (non-destructive)
- Settings → System → Verify a backup → file picker.
- Upload your latest `dockmesh-system-*.tar.gz`.
- The server extracts it to a temp dir, runs the same sanity checks `dockmesh restore` does, then deletes the temp dir. The live install is never touched.
- Result: a table of `[ ok ]` / `[warn]` / `[FAIL]` rows for each check. Pass = you can restore this backup successfully.
Same endpoint from the CLI:
```shell
curl -H "Authorization: Bearer $DM_TOKEN" \
  -F "file=@dockmesh-system-20260419-0300.tar.gz" \
  https://dockmesh.example.com/api/v1/restore/verify
```

Option B — Full dry-run via dockmesh restore

```shell
# Walks the tarball without writing anything. Checks entry structure +
# prints the would-restore target for each file + a running total.
dockmesh restore --from backup.tar.gz --dry-run
```

Doesn’t run the sanity checks (no DB to check), but catches corrupt tarballs, encrypted archives, and unexpected prefixes.
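You can approximate the structural part of the dry-run with plain `tar` on any machine that has the archive. A sketch, assuming the layout the restore section describes (`stacks/` and `data/` near the top of the archive); adjust the patterns to what your tarballs actually contain:

```shell
# Sketch: cheap structural check on an (unencrypted) system backup.
check_layout() {
  local entries
  entries=$(tar -tzf "$1") || return 1  # corrupt/encrypted tarballs fail here
  echo "$entries" | grep -q 'stacks/' &&
  echo "$entries" | grep -q 'data/'
}

# check_layout dockmesh-system-20260419-0300.tar.gz || echo "backup looks wrong"
```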
Automate it
Schedule a weekly verify of your latest system-backup run against Option A. If any week’s result isn’t `passed: true`, page the on-call.
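A cron-able sketch of that weekly check, reusing the Option A endpoint from the curl example above; `DM_TOKEN`, the backup path, and the `page-oncall` command are placeholders for your own environment:

```shell
# Sketch: weekly backup-verify job. Parses the verify response for
# "passed": true; anything else should page someone.
verify_passed() {              # reads the JSON verify result on stdin
  grep -q '"passed"[[:space:]]*:[[:space:]]*true'
}

# latest=$(ls -t /backups/dockmesh-system-*.tar.gz | head -n 1)
# curl -fsS -H "Authorization: Bearer $DM_TOKEN" \
#      -F "file=@$latest" \
#      https://dockmesh.example.com/api/v1/restore/verify |
#   verify_passed || page-oncall "dockmesh backup verify failed"
```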
Scenario 5 — A single host is dead (hardware failure, unrecoverable)
Impact: Stacks on that host are offline. Other hosts unaffected.
Use drain if possible (Hosts → host → Drain) to move stacks off before decommissioning. If the host is already unreachable, use per-stack volume backups to restore onto a replacement host.
Scenario 6 — Agent mTLS certs are compromised
```shell
dockmesh ca rotate --reissue-all-agents
```

This rotates the server CA. All agents re-enroll on next connect. Review the audit log for the pre-rotation window. Rotate any secrets (DB passwords, API keys) that could have leaked through compromised agents.
Scenario 7 — Database corruption (SQLite)
```shell
systemctl stop dockmesh
sqlite3 /opt/dockmesh/data/dockmesh.db "PRAGMA integrity_check;"
```

If it reports `ok` → run `VACUUM;`. If corrupt → dump + reimport, or (likelier) restore from backup per Scenario 1.
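The dump-and-reimport path as a sketch: rebuilding the database from its SQL dump often survives page-level corruption that blocks direct reads. Stop dockmesh first; the path follows the check above.

```shell
# Sketch: rebuild a SQLite DB from its SQL dump, then re-check integrity.
salvage_db() {                 # salvage_db <corrupt.db> <out.db>
  sqlite3 "$1" ".dump" | sqlite3 "$2" &&
  [ "$(sqlite3 "$2" 'PRAGMA integrity_check;')" = "ok" ]
}

# salvage_db /opt/dockmesh/data/dockmesh.db /tmp/recovered.db &&
#   mv /tmp/recovered.db /opt/dockmesh/data/dockmesh.db
```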
Scenario 8 — Ransomware / malicious delete
Isolate the host, don’t restart, and mount disks read-only on a clean host for forensics. Rebuild from scratch on new hosts and restore from OFF-SITE backups; if your backups are only on the same host as the ransomware’d dockmesh, they’re gone too. Off-site + air-gapped backups are the only defense.
Monthly fire drill
Once a month:
- Run Scenario 4 option A against your latest system backup. If it doesn’t pass clean, stop — your backups are not actually DR-ready until you fix that.
- Pick one random stack. Pretend its host is gone. Restore the stack’s latest volume backup to a different host.
- Verify the restored stack works end-to-end (hit its port, log in, read some data).
- Document any surprises in your internal runbook.
A disaster recovery plan you haven’t tested is a disaster recovery hope.
Backup retention for disaster recovery
Minimum recommended retention:
- Last 7 daily backups
- Last 4 weekly backups
- Last 12 monthly backups
- 1 annual backup (compliance / long-term)
Off-site (separate failure domain) copies of all of the above.
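If your backup target is a plain directory, one way to enforce the daily tier externally; the weekly and monthly tiers run the same way with their own globs and counts. (dockmesh’s built-in backup scheduler may already handle retention; this is an assumption-laden sketch for targets it doesn’t manage.)

```shell
# Sketch: keep only the newest N archives matching a glob in a directory.
prune_keep() {                 # prune_keep <dir> <glob> <keep>
  ls -t "$1"/$2 2>/dev/null | tail -n +"$(( $3 + 1 ))" | xargs -r rm -f
}

# prune_keep /backups 'dockmesh-system-*.tar.gz' 7   # daily tier
```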
See also
- Backup & Restore — setup + target types
- Hardening — prevent disasters
- Troubleshooting — for issues not yet at disaster level