Prometheus Metrics
dockmesh exposes its own metrics in Prometheus format at /metrics. Point a Prometheus scraper at it, alongside your other node-exporter / cAdvisor targets, for unified observability.
Enable the endpoint
By default, /metrics is enabled but requires authentication. For internal Prometheus scraping, the simplest option is a dedicated API token limited to reading metrics:
- Settings → API tokens → New token
- Name: prometheus-scraper
- Role: custom role with the metrics.read permission only
- Copy the token
Add to your Prometheus scrape config:

    scrape_configs:
      - job_name: dockmesh
        metrics_path: /metrics
        scheme: https
        static_configs:
          - targets: ['dockmesh.example.com']
        authorization:
          type: Bearer
          credentials: <your-api-token>

For air-gapped or trusted internal network deployments, you can disable auth on /metrics by setting DOCKMESH_METRICS_AUTH=false, which makes the endpoint public. Do this only if the dockmesh server isn’t reachable from untrusted networks.
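If you keep authentication enabled, the token does not have to sit inline in your Prometheus config. The following is a minimal sketch, assuming a Prometheus version whose authorization block supports credentials_file; the file path is a hypothetical location chosen for illustration, not something dockmesh prescribes:

```yaml
scrape_configs:
  - job_name: dockmesh
    metrics_path: /metrics
    scheme: https
    static_configs:
      - targets: ['dockmesh.example.com']
    authorization:
      type: Bearer
      # Hypothetical path: a file that contains only the API token
      # and is readable by the Prometheus process.
      credentials_file: /etc/prometheus/secrets/dockmesh-token
```

Keeping the token in its own file lets you restrict its file permissions and replace it without editing the main config.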
What’s exposed
dockmesh-specific metrics (all prefixed dockmesh_):
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| dockmesh_hosts_total | gauge | status | Count of hosts by status (online/offline/degraded) |
| dockmesh_stacks_total | gauge | host, state | Stacks per host/state |
| dockmesh_containers_total | gauge | host, state | Containers per host/state |
| dockmesh_agent_last_seen_seconds | gauge | host | Seconds since the last agent heartbeat |
| dockmesh_deploy_duration_seconds | histogram | stack, result | Deploy durations, in seconds |
| dockmesh_backup_last_run_timestamp | gauge | job | Unix timestamp of the last successful run per backup job |
| dockmesh_backup_last_duration_seconds | gauge | job | Duration of the last backup, in seconds |
| dockmesh_api_requests_total | counter | method, path, status | HTTP API request counts |
| dockmesh_audit_log_entries_total | counter | action | Audit events by action |
| dockmesh_alerts_fired_total | counter | severity, rule | Alert fires |
| dockmesh_agent_rtt_seconds | histogram | host | Agent protocol round-trip time, in seconds |
Plus standard Go runtime metrics (go_*) and process metrics (process_*).
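The staleness gauges in the table above (dockmesh_agent_last_seen_seconds and dockmesh_backup_last_run_timestamp) are natural candidates for Prometheus alerting rules. The following is a rough sketch of a rule file rather than a shipped configuration: the group name, alert names, thresholds, `for` duration, and severity labels are all illustrative assumptions.

```yaml
# Hypothetical rule file, to be listed under rule_files in prometheus.yml.
groups:
  - name: dockmesh                      # illustrative group name
    rules:
      - alert: DockmeshAgentSilent      # illustrative alert name
        # Fires when an agent has sent no heartbeat for more than 2 minutes.
        expr: dockmesh_agent_last_seen_seconds > 120
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "No agent heartbeat from {{ $labels.host }} for over 2 minutes"

      - alert: DockmeshBackupStale      # illustrative alert name
        # Fires when the last successful backup is more than 26 hours old.
        expr: time() - dockmesh_backup_last_run_timestamp > 26 * 3600
        labels:
          severity: warning
        annotations:
          # Assumes the default honor_labels: false, under which Prometheus
          # renames a scraped "job" label to "exported_job" so that it does
          # not clash with the scrape job name.
          summary: "Backup job {{ $labels.exported_job }} has not succeeded in 26 hours"
```

These rules can run alongside dockmesh's built-in alerts; tune the thresholds to your heartbeat interval and backup schedule.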
Useful PromQL queries

    # Any host offline for > 2 minutes
    dockmesh_agent_last_seen_seconds > 120

    # Stacks in error state
    dockmesh_stacks_total{state="error"}

    # Backup jobs that haven't run in > 26 hours
    time() - dockmesh_backup_last_run_timestamp > 26 * 3600

    # 95th percentile deploy time over 5 minutes
    histogram_quantile(0.95, rate(dockmesh_deploy_duration_seconds_bucket[5m]))

    # Alert fire rate by severity
    sum by (severity) (rate(dockmesh_alerts_fired_total[5m]))

Grafana dashboard
Import dashboard ID DXXXX (once we publish it) from grafana.com, or see the monitoring stack guide for a starter setup.
Community dashboards and template JSON are in the GitHub repo’s monitoring/ directory.
Federate
For multi-dockmesh setups (unusual, but possible, e.g. one dockmesh per region), federate the metrics to a central Prometheus:

    scrape_configs:
      - job_name: federate
        scrape_interval: 30s
        honor_labels: true
        metrics_path: /federate
        params:
          match[]:
            - '{job="dockmesh"}'
        static_configs:
          - targets:
              - 'prometheus-eu.example.com'
              - 'prometheus-us.example.com'

See also
- Monitoring Stack: full Prometheus + Grafana + Loki setup
- Alerts: dockmesh’s built-in alert system (can co-exist with Prometheus alerting)
- API Overview: for programmatic access beyond metrics