Monitoring and alerts

Wire Prometheus + Grafana up to DeltaGlider Proxy and set up the alerts you actually want to be paged on.

The full metrics catalog lives at reference/metrics.md. This page is the operational task sheet: how to scrape, what to graph, what to alert on.

Scrape configuration

Prometheus

scrape_configs:
  - job_name: deltaglider
    metrics_path: /_/metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["dgp.example.com:9000"]

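For multiple instances behind a load balancer, scrape each instance directly rather than through the balancer, which would route every scrape to an arbitrary backend and mix their counters. Use service discovery or list each target: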

scrape_configs:
  - job_name: deltaglider
    metrics_path: /_/metrics
    scrape_interval: 15s
    static_configs:
      - targets:
          - "dgp-1:9000"
          - "dgp-2:9000"
          - "dgp-3:9000"

The /_/metrics endpoint is exempt from SigV4 auth, so Prometheus doesn't need credentials. Bare /metrics is part of the S3-compatible namespace and must not be used for Prometheus scraping.
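
For the service-discovery route mentioned above, file-based discovery is one low-effort option. A minimal sketch, assuming a hypothetical targets file at /etc/prometheus/targets/deltaglider.yml that your deployment tooling keeps up to date; the path and the example label are illustrative, not something DeltaGlider Proxy requires:

```yaml
# prometheus.yml (fragment). The targets path below is an assumption.
scrape_configs:
  - job_name: deltaglider
    metrics_path: /_/metrics
    scrape_interval: 15s
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/deltaglider.yml
        # Fallback re-read interval; file changes are also picked up via disk watches.
        refresh_interval: 1m

# Contents of /etc/prometheus/targets/deltaglider.yml would look like:
#   - targets: ["dgp-1:9000", "dgp-2:9000", "dgp-3:9000"]
#     labels:
#       env: prod   # hypothetical label
```

Adding or removing an instance then means editing the targets file only; Prometheus itself does not need a reload.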

Docker Compose starter (Prometheus + Grafana)

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:

Run docker compose up -d, open http://localhost:3000 (admin/admin), add Prometheus as a data source at http://prometheus:9090, and import the panels below.
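
If you would rather not add the data source by hand, Grafana can provision it at startup. A minimal sketch, assuming you create the file below on the host (the host path is an assumption) and mount its directory into the grafana service at /etc/grafana/provisioning/datasources:

```yaml
# ./grafana/provisioning/datasources/prometheus.yml (host path is illustrative)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                # Grafana's backend queries Prometheus, not the browser
    url: http://prometheus:9090  # compose service name from the starter above
    isDefault: true
```

The matching volume entry for the grafana service would be ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources.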

Dashboard panels (PromQL)

Request rate by operation

sum by (operation) (rate(deltaglider_http_requests_total[5m]))

Time series, stacked. Shows which S3 operations dominate.

Latency p50 / p95 / p99

histogram_quantile(0.50, sum by (le) (rate(deltaglider_http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(deltaglider_http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(deltaglider_http_request_duration_seconds_bucket[5m])))

Three queries on one panel. Unit: seconds.

Latency by operation (p95)

histogram_quantile(0.95, sum by (le, operation) (rate(deltaglider_http_request_duration_seconds_bucket[5m])))

Useful for spotting slow operations — GET (delta decode) vs HEAD (cache read) have very different profiles.

Error rate

sum(rate(deltaglider_http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(deltaglider_http_requests_total[5m]))

Stat panel. Unit: Percent (0.0-1.0), since the query returns a fraction rather than a percentage.

Delta compression effectiveness

# Bytes saved per second
rate(deltaglider_delta_bytes_saved_total[5m])

# Cumulative bytes saved
deltaglider_delta_bytes_saved_total

# p50 compression ratio (lower is better; 0.1 = 90% saved)
histogram_quantile(0.50, rate(deltaglider_delta_compression_ratio_bucket[1h]))

Storage decisions mix

sum by (decision) (rate(deltaglider_delta_decisions_total[5m]))

Pie chart. Shows delta vs passthrough vs reference split.

Cache hit ratio

rate(deltaglider_cache_hits_total[5m])
  /
(rate(deltaglider_cache_hits_total[5m]) + rate(deltaglider_cache_misses_total[5m]))

Gauge. Target: > 90%.
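
With several instances scraped, the expression above yields one ratio per instance. If a single fleet-wide number is more useful for the gauge, one option is to aggregate before dividing and store the result as a recording rule. A sketch only: the group and rule names are hypothetical, and it assumes the file is loaded through rule_files in prometheus.yml:

```yaml
groups:
  - name: deltaglider_cache_derived   # hypothetical group name
    rules:
      # Fleet-wide hit ratio: sum hits and misses across instances, then divide.
      - record: job:deltaglider_cache_hit:ratio_rate5m
        expr: >
          sum(rate(deltaglider_cache_hits_total[5m]))
          /
          (sum(rate(deltaglider_cache_hits_total[5m]))
            + sum(rate(deltaglider_cache_misses_total[5m])))
```

The per-instance form remains the better choice when you are hunting for one cold or undersized cache.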

Cache headroom

deltaglider_cache_size_bytes
deltaglider_cache_entries

Compare deltaglider_cache_size_bytes against DGP_CACHE_MB * 1048576 (the configured limit in bytes) to see utilisation.
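
As a worked example of that comparison, assume DGP_CACHE_MB=1024 (an illustrative value). The limit is then 1024 * 1048576 = 1073741824 bytes, and utilisation can be expressed as a ratio. A sketch as a recording rule, with hypothetical group and rule names:

```yaml
groups:
  - name: deltaglider_cache_utilisation   # hypothetical group name
    rules:
      # Assumes DGP_CACHE_MB=1024; replace 1024 with your configured value.
      # 0.0 = empty cache, 1.0 = cache at its configured limit.
      - record: instance:deltaglider_cache_utilisation:ratio
        expr: deltaglider_cache_size_bytes / (1024 * 1048576)
```

Because the limit is hard-coded in the expression, the rule has to be updated whenever DGP_CACHE_MB changes.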

Codec pressure

deltaglider_codec_semaphore_available

Gauge. When it drops to 0, all xdelta3 permits are in use and further encode/decode operations queue. If it stays there, increase DGP_CODEC_CONCURRENCY.

Encode + decode latency (p95)

histogram_quantile(0.95, rate(deltaglider_delta_encode_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(deltaglider_delta_decode_duration_seconds_bucket[5m]))

Auth failure rate

sum by (reason) (rate(deltaglider_auth_failures_total[5m]))

A spike in invalid_signature usually points to client misconfiguration; a spike in missing_header usually points to unauthenticated probes.

Uptime

time() - process_start_time_seconds

Alerting rules

Drop these into your Prometheus rules.yml. Tune thresholds to your SLO.

groups:
  - name: deltaglider
    rules:
      - alert: DeltaGliderHighErrorRate
        expr: >
          sum(rate(deltaglider_http_requests_total{status=~"5.."}[5m]))
          / sum(rate(deltaglider_http_requests_total[5m])) > 0.05
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "DeltaGlider error rate above 5%"

      - alert: DeltaGliderSlowRequests
        expr: >
          histogram_quantile(0.95,
            sum by (le) (rate(deltaglider_http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "DeltaGlider p95 latency above 2s"

      - alert: DeltaGliderLowCacheHitRatio
        expr: >
          rate(deltaglider_cache_hits_total[15m])
          / (rate(deltaglider_cache_hits_total[15m]) + rate(deltaglider_cache_misses_total[15m]))
          < 0.5
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "Reference cache hit ratio < 50% — consider raising DGP_CACHE_MB"

      - alert: DeltaGliderCodecSaturated
        expr: deltaglider_codec_semaphore_available == 0
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "All codec slots busy for 5+ minutes — consider raising DGP_CODEC_CONCURRENCY"

      - alert: DeltaGliderAuthFailureSpike
        expr: sum(rate(deltaglider_auth_failures_total[5m])) > 1
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Sustained auth failures (> 1/s for 5 min)"

      - alert: DeltaGliderDown
        expr: up{job="deltaglider"} == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "DeltaGlider instance unreachable"
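
Prometheus only evaluates these rules once the file is referenced from prometheus.yml, and alerts only reach a pager if an Alertmanager is configured. A minimal sketch of that wiring, assuming the rules are mounted at /etc/prometheus/rules.yml and that an Alertmanager is reachable at alertmanager:9093 (neither is part of the compose starter above):

```yaml
# prometheus.yml (fragment)
rule_files:
  - /etc/prometheus/rules.yml   # assumed mount path for the rules above

alerting:
  alertmanagers:
    - static_configs:
        # Assumption: an Alertmanager instance you run separately.
        - targets: ["alertmanager:9093"]
```

With the compose starter, that implies an extra volume entry such as ./rules.yml:/etc/prometheus/rules.yml on the prometheus service. Running promtool check rules rules.yml before reloading catches syntax errors in the rule file.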

Built-in admin dashboard

The admin UI ships a live monitoring page at /_/admin/diagnostics/dashboard — same metrics, auto-refreshed every 5s, with a storage-analytics tab that surfaces per-bucket savings and estimated cost. It's not a substitute for a proper Grafana setup in production (no historical retention, no alerting), but it's enough to answer "is the proxy healthy right now?" without leaving the UI.

Metrics reference — full catalog, labels, buckets.
Production deployment — cache sizing, codec concurrency, log levels.
Troubleshooting — symptom → metric mapping when something misbehaves.