Prometheus Metrics

dockmesh exposes its own metrics in Prometheus format at /metrics. Add it as a scrape target alongside your existing node-exporter and cAdvisor targets, so that dockmesh, host, and container metrics all land in the same Prometheus.

Enable the endpoint

By default, /metrics is enabled but requires authentication. For internal Prometheus scraping, the simplest option is a dedicated API token that can only read metrics:

Settings → API tokens → New token
Name: prometheus-scraper
Role: custom role with metrics.read permission only
Copy the token

Add to your Prometheus scrape config:

scrape_configs:
  - job_name: dockmesh
    metrics_path: /metrics
    scheme: https
    static_configs:
      - targets: ['dockmesh.example.com']
    authorization:
      type: Bearer
      credentials: <your-api-token>
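
If you would rather not keep the token inline in the Prometheus config, the `authorization` block also accepts `credentials_file` in place of `credentials` (the two are mutually exclusive). A minimal sketch, assuming the token was saved to a file at a path of your choosing; the path below is hypothetical:

```yaml
scrape_configs:
  - job_name: dockmesh
    metrics_path: /metrics
    scheme: https
    static_configs:
      - targets: ['dockmesh.example.com']
    authorization:
      type: Bearer
      # Hypothetical path. The file should contain only the API token
      # and be readable by the user Prometheus runs as.
      credentials_file: /etc/prometheus/secrets/dockmesh-token
```

Prometheus reads the file when it scrapes, which makes rotating the token a matter of replacing the file.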

For air-gapped or trusted internal deployments, you can disable authentication on /metrics by setting DOCKMESH_METRICS_AUTH=false. The endpoint then serves metrics to anyone who can reach it, so do this only if the dockmesh server isn’t reachable from untrusted networks.

What’s exposed

dockmesh-specific metrics (all prefixed dockmesh_):

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| `dockmesh_hosts_total` | gauge | `status` | Number of hosts by status (online/offline/degraded) |
| `dockmesh_stacks_total` | gauge | `host`, `state` | Number of stacks per host and state |
| `dockmesh_containers_total` | gauge | `host`, `state` | Number of containers per host and state |
| `dockmesh_agent_last_seen_seconds` | gauge | `host` | Seconds since the last agent heartbeat |
| `dockmesh_deploy_duration_seconds` | histogram | `stack`, `result` | Deploy durations in seconds |
| `dockmesh_backup_last_run_timestamp` | gauge | `job` | Unix timestamp (seconds) of the last successful run per backup job |
| `dockmesh_backup_last_duration_seconds` | gauge | `job` | Duration of the last backup in seconds |
| `dockmesh_api_requests_total` | counter | `method`, `path`, `status` | HTTP API request counts |
| `dockmesh_audit_log_entries_total` | counter | `action` | Audit events by action |
| `dockmesh_alerts_fired_total` | counter | `severity`, `rule` | Number of alerts fired |
| `dockmesh_agent_rtt_seconds` | histogram | `host` | Agent protocol round-trip time in seconds |

Plus standard Go runtime metrics (go_*) and process metrics (process_*).

Useful PromQL queries

# Any host offline for > 2 minutes
dockmesh_agent_last_seen_seconds > 120

# Stacks in error state
dockmesh_stacks_total{state="error"}

# Backup jobs with no successful run in > 26 hours
time() - dockmesh_backup_last_run_timestamp > 26 * 3600

# 95th percentile deploy time over the last 5 minutes, per stack and result
histogram_quantile(0.95,
  rate(dockmesh_deploy_duration_seconds_bucket[5m]))

# Alert fire rate by severity
sum by (severity) (rate(dockmesh_alerts_fired_total[5m]))
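
Some of the queries above translate directly into Prometheus alerting rules. The following is a sketch rather than a recommended rule set: the alert names, `for` durations, and `severity` values are assumptions to adjust for your environment. One thing to watch: with Prometheus's default `honor_labels: false`, a scraped label named `job` collides with the target's own `job` label and is stored as `exported_job`, so the backup rule reads the backup job name from there.

```yaml
# dockmesh-alerts.yml (hypothetical file name); list it under
# rule_files in your Prometheus config.
groups:
  - name: dockmesh
    rules:
      - alert: DockmeshAgentSilent
        expr: dockmesh_agent_last_seen_seconds > 120
        for: 1m            # assumed grace period
        labels:
          severity: warning
        annotations:
          summary: "Agent on {{ $labels.host }} has sent no heartbeat for over 2 minutes"

      - alert: DockmeshStackError
        expr: dockmesh_stacks_total{state="error"} > 0
        for: 5m            # assumed grace period
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} stack(s) in error state on {{ $labels.host }}"

      - alert: DockmeshBackupStale
        expr: time() - dockmesh_backup_last_run_timestamp > 26 * 3600
        labels:
          severity: warning
        annotations:
          # exported_job holds the backup job name when honor_labels is
          # false; use $labels.job if you scrape with honor_labels: true.
          summary: "Backup job {{ $labels.exported_job }} has not succeeded in over 26 hours"
```

You can validate the file with `promtool check rules` before reloading Prometheus.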

Grafana dashboard

Import dashboard ID DXXXX (once we publish it) from grafana.com, or see the monitoring stack guide for a starter setup.

Community dashboards and template JSON are in the GitHub repo’s monitoring/ directory.

Federate

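For multi-dockmesh setups (unusual, but possible, e.g. one dockmesh per region), let each regional Prometheus scrape its own dockmesh, then have a central Prometheus pull those series from the regional servers through their /federate endpoints: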

scrape_configs:
  - job_name: federate
    scrape_interval: 30s
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job="dockmesh"}'
    static_configs:
      - targets:
          - 'prometheus-eu.example.com'
          - 'prometheus-us.example.com'
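
Since every regional Prometheus scrapes a job named dockmesh, it helps to mark where each federated series came from. One common approach is to set `external_labels` on each regional server; Prometheus attaches these labels to the series it serves over /federate. A minimal sketch, where the label name `region` and its values are assumptions:

```yaml
# prometheus.yml on the EU regional Prometheus.
# Repeat on the US server with region: us.
global:
  external_labels:
    region: eu   # hypothetical label name and value
```

On the central Prometheus you can then group by that label, for example `sum by (region) (dockmesh_hosts_total{status="offline"})`.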