Disaster Recovery Playbook

Recovery procedures for the worst-case scenarios. Print this out and keep it in your ops runbook.

Scenario 1 — dockmesh server host is dead


Impact: UI unreachable. Containers keep running on every other host — Docker doesn’t need dockmesh to keep doing its job. Agents reconnect in a loop until the server is back. Target RTO: under 10 minutes with a recent backup in hand.

Prerequisites (set these up BEFORE disaster)

  • Automated system backups to an off-host target (SFTP / SMB / S3 / WebDAV). Configure under Settings → Backups with a daily schedule and the dockmesh-system job type. Verify the backup regularly (see Scenario 4).
  • Off-host DNS / LB entry for your dockmesh URL so agents find the new server without manual agent reconfig.
  • Unencrypted system backups, OR encrypted with a key stored OUTSIDE the dockmesh data dir (the server’s own age key is inside the tarball — self-encrypted system backups can’t self-restore).
  1. Spin up a new host — any Linux with Docker 20.10+ and a few GB of disk. A fresh cloud VM is fine; you don’t need to replicate the old spec.
  2. Install the dockmesh binary (not the data — just the binary):
    Terminal window
    curl -fsSL https://get.dockmesh.dev | bash
    # Stop the service so systemd doesn't fight us during restore
    systemctl stop dockmesh
  3. Copy your latest backup to the new host:
    Terminal window
    # From your workstation, or scp/rsync from the backup target
    scp dockmesh-system-20260419-0300.tar.gz newhost:/tmp/
  4. Run the restore command:
    Terminal window
    dockmesh restore --from /tmp/dockmesh-system-20260419-0300.tar.gz
    The restore writes the DB, /stacks/, and the data/ dir into the paths that dockmesh config show lists. It refuses to overwrite a populated install without --force; on a fresh host this doesn’t trigger. It runs the post-restore sanity checks automatically and prints sanity OK when the state is bootable.
  5. Point DNS / your load balancer at the new host so agents find it.
  6. Start the service:
    Terminal window
    systemctl start dockmesh
    journalctl -u dockmesh -f
  7. Verify:
    • Log in with your usual admin credentials (they came back with the DB).
    • Agents page should list every agent as online within 30-60s. They’re still authenticating against the CA cert that came back with the restored data/ dir.
    • Stacks page should look as it did at backup time.

What if the sanity check warns about missing CA / genesis?


The restore warns loudly when:

  • agents-ca.crt / agents-ca.key missing → every agent has to re-enroll. Annoying but not catastrophic. Delete each agent from the UI, then run the install-agent command on each host.
  • audit-genesis.sha256 missing → the chain-of-custody claim has a break across the restore boundary. The chain from first-row-after-restore is still verifiable against each row’s stored hashes; you just can’t verify against the ORIGINAL genesis anymore. Note this in your audit trail if compliance matters.
  • secrets.env (JWT secret) missing → every existing session is invalidated and users have to log in again. First-boot-after-restore generates a fresh JWT secret.

These all surface as [warn] rows from dockmesh restore, not [FAIL].
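Before restoring, you can check which of these files actually made it into a tarball. A minimal sketch: the filenames come from the warn list above, but their exact paths inside the archive are an assumption, so this only greps the entry list.

```shell
# check_backup_files: report which of the sensitive files the restore
# sanity checks care about are present in a system backup tarball.
# Filenames taken from the [warn] list above; paths inside are assumed.
check_backup_files() {
  local tarball=$1 f
  for f in agents-ca.crt agents-ca.key audit-genesis.sha256 secrets.env; do
    if tar -tzf "$tarball" | grep -qF "$f"; then
      echo "[ ok ] $f present"
    else
      echo "[warn] $f missing"
    fi
  done
}

# Usage: check_backup_files /tmp/dockmesh-system-20260419-0300.tar.gz
```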

Time to recover: 15-30 minutes if you have recent backups.

Time to recover without recent backups: days. Don’t be this person.

Scenario 2 — Clock skew between server and agents

Impact: Agents can’t connect; logs show TLS errors.

Terminal window
date -u # check server + each agent
systemctl restart systemd-timesyncd # or: chronyd -Q

Connections resume within 30s of time correction.
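The server-vs-agent comparison can be scripted. A hedged sketch: the helper below only compares two epoch timestamps you feed it (e.g. date -u +%s run locally and over ssh); the 30-second default threshold is an assumption, not a documented TLS tolerance.

```shell
# clock_drift_ok: compare two epoch timestamps (server vs. agent) and
# flag drift beyond a threshold (default 30s — an assumed tolerance).
clock_drift_ok() {
  local server=$1 agent=$2 max=${3:-30}
  local drift=$(( server - agent ))
  [ "$drift" -lt 0 ] && drift=$(( -drift ))
  if [ "$drift" -le "$max" ]; then
    echo "ok: ${drift}s drift"
  else
    echo "DRIFT: ${drift}s exceeds ${max}s"
  fi
}

# Usage:
#   clock_drift_ok "$(date -u +%s)" "$(ssh agent1 date -u +%s)"
```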

Scenario 3 — A stack is corrupted (application-level data loss)


Impact: One app is broken but containers are running.

  1. Stop the stack to prevent further corruption (Stack → Stop)
  2. Browse backups (Backups → Browse → filter by stack)
  3. Pick the last known-good run (check timestamps vs. when corruption started)
  4. Restore — dockmesh restores volumes, optionally the compose file too, and restarts
  5. Verify application-level consistency before letting traffic back in

If the restore itself causes issues, dockmesh keeps the pre-restore volume copy for 24 hours.

Scenario 4 — “Is our backup actually restorable?” (pre-disaster verify)


The single most valuable DR practice is knowing your backups work before you need them. Backups that have never been restored are Schrödinger’s backups — half the time they turn out to be corrupt / encrypted-with-a-lost-key / missing-the-CA when you discover them at 3am.

Option A — Upload-and-verify via UI (non-destructive)

  1. Settings → System → Verify a backup → file picker.
  2. Upload your latest dockmesh-system-*.tar.gz.
  3. Server extracts to a temp dir, runs the same sanity checks dockmesh restore does, deletes the temp dir. The live install is never touched.
  4. Result: a table of [ ok ] / [warn] / [FAIL] rows for each check. Pass = you can restore this backup successfully.

Same endpoint from the CLI:

Terminal window
curl -H "Authorization: Bearer $DM_TOKEN" \
-F "file=@dockmesh-system-20260419-0300.tar.gz" \
https://dockmesh.example.com/api/v1/restore/verify

Option B — Full dry-run via dockmesh restore

Terminal window
# Walks the tarball without writing anything. Checks entry structure +
# prints the would-restore target for each file + a running total.
dockmesh restore --from backup.tar.gz --dry-run

Doesn’t run the sanity checks (there’s no restored DB to check), but catches corrupt tarballs, encrypted archives, and unexpected path prefixes.

Schedule a weekly verify of your latest system-backup run against Option A. If any week’s result doesn’t come back passed: true, page the on-call.
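That weekly check can be scripted against the same verify endpoint. A sketch, assuming the JSON response exposes a top-level passed boolean (matching the passed: true result above), that DM_TOKEN holds an admin API token, and that backups land in /backups:

```shell
# Weekly verify sketch: upload the newest system backup and flag a
# non-passing result. The "passed" field name and /backups path are
# assumptions; hook your real pager into the failure branch.
latest=$(ls -1t /backups/dockmesh-system-*.tar.gz 2>/dev/null | head -n 1)
if curl -sf -H "Authorization: Bearer $DM_TOKEN" \
     -F "file=@$latest" \
     https://dockmesh.example.com/api/v1/restore/verify \
   | jq -e '.passed == true' > /dev/null; then
  echo "backup verify passed: $latest"
else
  echo "backup verify FAILED: $latest"   # page the on-call here
fi
```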

Scenario 5 — A single host is dead (hardware failure, unrecoverable)


Impact: Stacks on that host are offline. Other hosts unaffected.

Use drain if possible (Hosts → host → Drain) to move stacks off before decommissioning. If the host is already unreachable, use per-stack volume backups to restore onto a replacement host.

Scenario 6 — Agent mTLS certs are compromised

Terminal window
dockmesh ca rotate --reissue-all-agents

Rotates the server CA. All agents re-enroll on next connect. Review the audit log for the pre-rotation window, and rotate any secrets (DB passwords, API keys) that could have leaked through compromised agents.

Scenario 7 — Database corruption (SQLite)

Terminal window
systemctl stop dockmesh
sqlite3 /opt/dockmesh/data/dockmesh.db "PRAGMA integrity_check;"

If the check prints ok, run VACUUM; to compact and move on. If it reports corruption → dump + reimport, or (likelier) restore from backup per Scenario 1.
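For the dump-and-reimport path, modern sqlite3 shells can often salvage a corrupt file with the .recover dot-command before you reach for a backup. A sketch, assuming the default data path from this playbook; stop dockmesh first:

```shell
# recover_sqlite: salvage what SQLite's .recover can read from a corrupt
# database into a fresh file, then report the recovered copy's integrity.
# Run against the path dockmesh config show reports
# (default /opt/dockmesh/data/dockmesh.db).
recover_sqlite() {
  local src=$1 dst=$2
  sqlite3 "$src" ".recover" | sqlite3 "$dst"
  sqlite3 "$dst" "PRAGMA integrity_check;"
}

# Usage:
#   systemctl stop dockmesh
#   recover_sqlite /opt/dockmesh/data/dockmesh.db /opt/dockmesh/data/dockmesh-recovered.db
#   # prints "ok" when the recovered copy is clean; keep the corrupt
#   # original for forensics, swap the files, restart dockmesh
```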

Scenario 8 — Ransomware / malicious delete


Isolate the host, don’t restart, mount disks read-only on a clean host for forensics. Rebuild from scratch on new hosts, restore from OFF-SITE backups — if your backups are only on the same host as the ransomware’d dockmesh, they’re gone too. Off-site + air-gapped backups are the only defense.

Monthly DR drill

Once a month:

  1. Run Scenario 4 option A against your latest system backup. If it doesn’t pass clean, stop — your backups are not actually DR-ready until you fix that.
  2. Pick one random stack. Pretend its host is gone. Restore the stack’s latest volume backup to a different host.
  3. Verify the restored stack works end-to-end (hit its port, log in, read some data).
  4. Document any surprises in your internal runbook.

A disaster recovery plan you haven’t tested is a disaster recovery hope.

Minimum recommended retention:

  • Last 7 daily backups
  • Last 4 weekly backups
  • Last 12 monthly backups
  • 1 annual backup (compliance / long-term)

Off-site (separate failure domain) copies of all of the above.
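The daily tier of that policy can be sketched in shell. This only handles "keep the last 7 dailies"; the weekly, monthly, and annual tiers need your backup tool's own retention settings or real GFS tooling. The /backups path and filename pattern are assumptions.

```shell
# prune_daily: keep only the newest N backup files in a directory,
# deleting the rest. Minimal daily-tier sketch; assumes filenames
# without whitespace and mtimes that reflect backup time.
prune_daily() {
  local dir=$1 keep=${2:-7}
  ls -1t "$dir"/dockmesh-system-*.tar.gz 2>/dev/null \
    | tail -n +$((keep + 1)) \
    | xargs -r rm --
}

# Usage: prune_daily /backups 7
```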