Prometheus Metrics

dockmesh exposes its own metrics in Prometheus format at /metrics. Point a Prometheus scraper at it, alongside your other node-exporter / cAdvisor targets, for unified observability.

By default, /metrics is enabled but requires authentication. For internal Prometheus scraping, the simplest option is a dedicated API token that can only read metrics:

  1. Settings → API tokens → New token
  2. Name: prometheus-scraper
  3. Role: custom role with metrics.read permission only
  4. Copy the token

Add to your Prometheus scrape config:

scrape_configs:
  - job_name: dockmesh
    metrics_path: /metrics
    scheme: https
    static_configs:
      - targets: ['dockmesh.example.com']
    authorization:
      type: Bearer
      credentials: <your-api-token>
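
If you would rather not keep the token inline in the Prometheus config, Prometheus can read it from a file instead. The following is a minimal sketch of the same job, assuming the token has been saved to /etc/prometheus/secrets/dockmesh-token (a hypothetical path; use wherever your deployment mounts secrets):

```yaml
scrape_configs:
  - job_name: dockmesh
    metrics_path: /metrics
    scheme: https
    static_configs:
      - targets: ['dockmesh.example.com']
    authorization:
      type: Bearer
      # Hypothetical path; the file should contain only the token.
      # credentials_file cannot be combined with credentials.
      credentials_file: /etc/prometheus/secrets/dockmesh-token
```

This keeps the secret out of the main config file, so the config can be committed or shared while the token file stays restricted to the Prometheus user.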

For air-gapped or trusted internal network deployments, you can disable authentication on /metrics by setting DOCKMESH_METRICS_AUTH=false. The endpoint then serves metrics to anyone who can reach it, so do this only if the dockmesh server isn’t reachable from untrusted networks.

dockmesh-specific metrics (all prefixed dockmesh_):

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| dockmesh_hosts_total | gauge | status | Count of hosts by status (online/offline/degraded) |
| dockmesh_stacks_total | gauge | host, state | Stacks per host/state |
| dockmesh_containers_total | gauge | host, state | Containers per host/state |
| dockmesh_agent_last_seen_seconds | gauge | host | Time since last agent heartbeat |
| dockmesh_deploy_duration_seconds | histogram | stack, result | Deploy durations |
| dockmesh_backup_last_run_timestamp | gauge | job | Last successful run per backup job |
| dockmesh_backup_last_duration_seconds | gauge | job | Last backup duration |
| dockmesh_api_requests_total | counter | method, path, status | HTTP API request counts |
| dockmesh_audit_log_entries_total | counter | action | Audit events by action |
| dockmesh_alerts_fired_total | counter | severity, rule | Alert fires |
| dockmesh_agent_rtt_seconds | histogram | host | Agent protocol round-trip time |

Plus standard Go runtime metrics (go_*) and process metrics (process_*).

Example PromQL queries:

# Any host offline for > 2 minutes
dockmesh_agent_last_seen_seconds > 120

# Stacks in error state
dockmesh_stacks_total{state="error"}

# Backup jobs that haven't run in > 26 hours
time() - dockmesh_backup_last_run_timestamp > 26 * 3600

# 95th percentile deploy time over 5 minutes
histogram_quantile(0.95,
  rate(dockmesh_deploy_duration_seconds_bucket[5m]))

# Alert fire rate by severity
sum by (severity) (rate(dockmesh_alerts_fired_total[5m]))
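
The first three queries above translate fairly directly into Prometheus alerting rules. The sketch below shows one possible rule file. The file name, alert names, `for:` durations, severity values, and annotation text are all illustrative assumptions, not something dockmesh ships.

```yaml
# dockmesh-alerts.yml (hypothetical file name), listed under rule_files in prometheus.yml
groups:
  - name: dockmesh
    rules:
      - alert: DockmeshHostOffline
        expr: dockmesh_agent_last_seen_seconds > 120
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "No agent heartbeat from {{ $labels.host }} for over 2 minutes"

      - alert: DockmeshStackError
        expr: dockmesh_stacks_total{state="error"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} stack(s) in error state on {{ $labels.host }}"

      - alert: DockmeshBackupStale
        expr: time() - dockmesh_backup_last_run_timestamp > 26 * 3600
        labels:
          severity: critical
        annotations:
          # Assumes the scrape job uses Prometheus's default honor_labels: false.
          # In that case the metric's own "job" label collides with the scrape
          # job label and is stored as "exported_job".
          summary: "Backup job {{ $labels.exported_job }} has not succeeded in over 26 hours"
```

If your scrape job sets honor_labels: true, the backup job name stays in the job label, and the last annotation would reference {{ $labels.job }} instead.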

Import dashboard ID DXXXX (once we publish it) from grafana.com, or see the monitoring stack guide for a starter setup.

Community dashboards and template JSON are in the GitHub repo’s monitoring/ directory.

For multi-dockmesh setups (unusual, but possible, e.g. one dockmesh per region), have each region's Prometheus scrape its local dockmesh, then federate those series to a central Prometheus:

scrape_configs:
  - job_name: federate
    scrape_interval: 30s
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job="dockmesh"}'
    static_configs:
      - targets:
          - 'prometheus-eu.example.com'
          - 'prometheus-us.example.com'
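
With federation, the central server needs some way to tell regions apart. One common approach is to give each regional Prometheus an external label. This is sketched here as an assumption about your setup, not something dockmesh requires. Prometheus attaches external labels to the series it serves from /federate, and honor_labels: true on the central job preserves them. The label name region and the value eu are illustrative.

```yaml
# prometheus.yml on a regional server (here: prometheus-eu.example.com)
global:
  external_labels:
    region: eu   # illustrative value; set a different one on each regional server
```

On the central Prometheus you could then filter or group by that label, for example dockmesh_hosts_total{region="eu"}.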

See also:

  • Monitoring Stack: full Prometheus + Grafana + Loki setup
  • Alerts: dockmesh’s built-in alert system (can co-exist with Prometheus alerting)
  • API Overview: for programmatic access beyond metrics