How to run multiple instances (HA)

This guide shows you how to run more than one DeltaGlider Proxy instance against the same storage, coordinated through a shared S3 bucket.

The shared bucket does two jobs. It syncs the encrypted config DB (deltaglider_config.db, which holds IAM users, groups, and OAuth providers) between instances, and it hosts the replication leader leases that stop two instances from running the same rule, with automatic failover when a leader dies. Because leases depend on atomic conditional writes, the proxy validates the bucket's backend at boot and refuses to start on one that can't enforce them; see How to use non-CAS backends safely. Object data itself needs no coordination, because all instances route to the same backends. For the rationale behind this design, see Multi-backend architecture.

1. Point every instance at a sync bucket

Set the same sync bucket on every instance, in YAML or env:

advanced:
  config_sync_bucket: dgp-iam-sync

DGP_CONFIG_SYNC_BUCKET=dgp-iam-sync

After every IAM mutation, the mutating instance uploads the encrypted DB to the bucket. The other instances poll every 5 minutes and download when the ETag changes. All instances must share the same bootstrap password — it's the DB encryption key.
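The reader side of this loop can be sketched as follows. This is a minimal illustration of "poll, compare ETag, download on change", not the proxy's actual code: `ReaderState`, `poll_once`, and the two injected callables are hypothetical names, and only the 5-minute interval comes from this guide.

```python
from dataclasses import dataclass
from typing import Callable, Optional

POLL_INTERVAL_SECONDS = 5 * 60  # poll interval stated in this guide


@dataclass
class ReaderState:
    """What a reader remembers between poll cycles (hypothetical structure)."""
    last_etag: Optional[str] = None  # ETag of the DB copy downloaded most recently
    downloads: int = 0               # number of downloads performed so far


def poll_once(
    state: ReaderState,
    head_etag: Callable[[], str],   # assumed helper: returns the bucket object's current ETag
    download: Callable[[], bytes],  # assumed helper: fetches the encrypted DB
) -> bool:
    """Run one poll cycle; return True if a new DB copy was downloaded."""
    remote = head_etag()
    if remote == state.last_etag:
        return False  # unchanged since the last cycle, nothing to do
    download()
    state.last_etag = remote
    state.downloads += 1
    return True
```

A real loop would call `poll_once` and then sleep for `POLL_INTERVAL_SECONDS`; the sketch leaves decryption with the bootstrap password out entirely.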

2. Designate one writer

Sync is not multi-master. Run exactly one instance as the IAM administration surface (where operators use the admin GUI or admin API) and treat the others as readers. If two instances both mutate, whichever instance uploaded last overwrites the other's DB, so the "loudest" writer wins and the other's mutations are lost. The symptom is a continuous "[config-sync] ETag mismatch on DB download — retrying" message in the logs. An occasional mismatch is normal (two events inside one 5-minute poll window resolve on the next cycle); a continuous one means you have two writers.
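A toy model shows why a second writer loses data. It assumes, as the upload step above implies, that each writer uploads its entire local DB as one object. The dict standing in for the bucket and the user names are made up for the illustration.

```python
from typing import Dict, FrozenSet, Set

DB_KEY = "deltaglider_config.db"  # object name from this guide


def upload(bucket: Dict[str, FrozenSet[str]], local_users: Set[str]) -> None:
    """Replace the whole DB object with this writer's local snapshot."""
    bucket[DB_KEY] = frozenset(local_users)


bucket: Dict[str, FrozenSet[str]] = {}

# Both writers start from the same state.
writer_a = {"alice"}
writer_b = {"alice"}

writer_a.add("bob")       # mutation on writer A
upload(bucket, writer_a)

writer_b.add("carol")     # mutation on writer B, which never saw "bob"
upload(bucket, writer_b)  # overwrites A's snapshot

surviving = bucket[DB_KEY]  # contains "alice" and "carol"; "bob" is gone
```

Nothing merges the two snapshots, which is why a single designated writer is the safe arrangement.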

If you want no writer at all, switch to iam_mode: declarative and manage IAM via YAML + GitOps — see How to manage IAM as code.

3. Force a sync when you can't wait

After a known-good mutation on the writer, make a reader pull immediately instead of waiting out the poll interval:

curl -b cookies -X POST https://dgp-reader-1:9000/_/api/admin/config/sync-now

Use this during rollouts and incident response — e.g. you just disabled a leaked key on the writer and want every reader to enforce it now.
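To hit every reader in one go, the same request can be scripted. The sketch below uses only the Python standard library; the endpoint path is the one shown above, while the reader URLs and the way the admin session cookie is supplied (a ready-made `Cookie` header string) are assumptions to adapt to your setup.

```python
import urllib.request
from typing import Dict, Iterable, Union

SYNC_PATH = "/_/api/admin/config/sync-now"  # endpoint from this guide


def build_sync_request(base_url: str, cookie_header: str) -> urllib.request.Request:
    """Build the POST that asks one reader to pull the config DB now."""
    return urllib.request.Request(
        base_url.rstrip("/") + SYNC_PATH,
        method="POST",
        headers={"Cookie": cookie_header},  # assumed: admin session cookie as a header string
    )


def sync_all(
    readers: Iterable[str], cookie_header: str, timeout: float = 10.0
) -> Dict[str, Union[int, Exception]]:
    """POST sync-now to each reader; map reader URL to HTTP status or the error raised."""
    results: Dict[str, Union[int, Exception]] = {}
    for base_url in readers:
        request = build_sync_request(base_url, cookie_header)
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                results[base_url] = response.status
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            results[base_url] = exc
    return results
```

For example, `sync_all(["https://dgp-reader-1:9000", "https://dgp-reader-2:9000"], cookie_header)` returns one entry per reader, so a failed reader is visible rather than silently skipped. `dgp-reader-2` is a hypothetical second host.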

4. If you scale with Helm

replicaCount defaults to 1. Do not raise it until the sync bucket is configured.

With the sync bucket set, replication rules elect a single leader per rule through an S3 lease object in that bucket (conditional-write CAS). If the leader dies, its lease lapses (default lease_ttl: "300s") and a peer takes over automatically: no double-run, no shared DB required.

Lifecycle and maintenance jobs still use node-local database leases (heartbeat_interval: "60s" renewals), so under multiple replicas those can run on more than one pod. Their operations are idempotent, so this wastes work rather than corrupting data.

The sync bucket must pass the boot-time conditional-write validation; see How to use non-CAS backends safely.
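The general shape of a lease built on conditional writes can be sketched in memory. This is an illustration of the technique, not the proxy's lease format: `CasStore`, `Lease`, and `try_acquire` are hypothetical, the version counter stands in for whatever precondition the real backend checks, and only the 300-second TTL comes from this guide.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

LEASE_TTL_SECONDS = 300.0  # default lease_ttl quoted in this guide


@dataclass(frozen=True)
class Lease:
    holder: str        # instance that currently leads the rule
    expires_at: float  # time after which a peer may take over


class CasStore:
    """In-memory stand-in for a bucket that enforces conditional writes."""

    def __init__(self) -> None:
        self._objects: Dict[str, Tuple[int, Lease]] = {}

    def get(self, key: str) -> Optional[Tuple[int, Lease]]:
        return self._objects.get(key)

    def put_if_version(self, key: str, expected: Optional[int], lease: Lease) -> bool:
        """Write only if the stored version still equals `expected` (None means absent)."""
        current = self._objects.get(key)
        current_version = current[0] if current else None
        if current_version != expected:
            return False  # someone else wrote first; the caller lost the race
        self._objects[key] = ((current_version or 0) + 1, lease)
        return True


def try_acquire(store: CasStore, rule: str, instance: str, now: float) -> bool:
    """Acquire or renew the lease for `rule`; return True if `instance` is now leader."""
    current = store.get(rule)
    version: Optional[int] = None
    if current is not None:
        version, lease = current
        if lease.holder != instance and lease.expires_at > now:
            return False  # a live lease is held by a peer
    new_lease = Lease(instance, now + LEASE_TTL_SECONDS)
    return store.put_if_version(rule, version, new_lease)
```

The conditional write is what prevents a double-run: if two instances both see an expired lease, only the first write matches the expected version and the second is rejected. A backend that cannot enforce that check cannot give this guarantee.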

See How to deploy on Kubernetes with Helm for the chart specifics.

5. Route multipart uploads to one instance

The state of a multipart upload — the upload id and the parts received so far — lives only in the memory and on the local disk of the instance that answered the CreateMultipartUpload request. No other instance knows about that upload. Behind a round-robin load balancer, every UploadPart request that lands on a different instance is therefore rejected with a NoSuchUpload error. Cookie-based session affinity cannot fix this, because S3 clients do not carry cookies, and affinity by client IP address stops working when many clients share one address behind a NAT gateway.

The supported answer is consistent hashing by the directory of the URL path at the load balancer. Hash the request path with its last segment removed — in S3 terms, the bucket plus the key's prefix. Everything in one directory then reaches the same instance: every object key in that prefix, and every part of any multipart upload of those keys. Hashing the directory rather than the full path matters for a second reason: a delta prefix is shared state. All keys in one prefix update the same reference file, and the lock that serialises those updates lives inside a single process — so all writes into one prefix must land on one instance, not just all requests for one key.
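The routing key described here is easy to compute by hand when you want to check where a request should go. The helper below is a hypothetical sketch of "path with its last segment removed"; the bucket and key names are examples only.

```python
def directory_of(path: str) -> str:
    """Return the request path up to and including its last '/'.

    A path without any '/' is returned unchanged.
    """
    head, sep, _last_segment = path.rpartition("/")
    return head + sep if sep else path


# Two keys in the same prefix share one routing key...
key_a = directory_of("/releases/v1/app.zip")
key_b = directory_of("/releases/v1/app.tar.gz")

# ...and a key in a different prefix gets a different one.
key_c = directory_of("/releases/v2/app.zip")
```

Query strings are not part of the path, so the `partNumber` and `uploadId` parameters of an `UploadPart` request do not change the routing key, and every part follows its object.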

On Kubernetes, the official operator deploys this router for you — see How to scale out with the Kubernetes operator. On any other platform, configure the equivalent on your load balancer. For HAProxy:

balance hash path,regsub([^/]*$,x)
hash-type consistent

For nginx, derive the directory with a map and hash on it:

map $uri $uri_dir {
    ~^(?<dir>.*/) $dir;
    default       $uri;
}
upstream dgp {
    hash $uri_dir consistent;
    ...
}

Be aware of the limit of this approach: when you add or remove an instance, part of the hash ring moves, so a multipart upload that is in flight on a moved prefix fails and the client has to restart it from the beginning.
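A small consistent-hash ring makes the effect of scaling concrete. This is a generic textbook ring with virtual nodes, not the algorithm HAProxy or nginx implement internally; the instance names, prefix names, and virtual-node count are assumptions for the illustration.

```python
import bisect
import hashlib
from typing import List


def _point(value: str) -> int:
    """Map a string to a position on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, instances: List[str], vnodes: int = 100) -> None:
        points = sorted(
            (_point(f"{name}#{i}"), name) for name in instances for i in range(vnodes)
        )
        self._positions = [position for position, _ in points]
        self._owners = [owner for _, owner in points]

    def owner(self, directory: str) -> str:
        """Return the instance owning the first ring point after the directory's hash."""
        index = bisect.bisect_right(self._positions, _point(directory))
        return self._owners[index % len(self._owners)]  # wrap around the ring


prefixes = [f"/bucket/prefix-{i}/" for i in range(1000)]
before = Ring(["dgp-0", "dgp-1", "dgp-2"])
after = Ring(["dgp-0", "dgp-1", "dgp-2", "dgp-3"])

# Prefixes whose owner changed when dgp-3 joined.
moved = [p for p in prefixes if before.owner(p) != after.owner(p)]
```

Only a fraction of the prefixes move, and every moved prefix lands on the new instance. Uploads in flight on those prefixes are the ones that fail, so scaling during a quiet period limits the damage.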

6. Mind upgrades across the fleet

During a rolling upgrade, a newer binary may migrate the DB schema forward; older instances still running will download a DB they can't fully read. Upgrade all instances before making IAM mutations, or accept that mid-rollout mutations are lost on older readers. Details: How to upgrade the proxy.
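The "upgrade everything first" rule can be expressed as a pre-flight check in rollout tooling. The helper below is a hypothetical sketch: how you collect each instance's version (image tag, deployment metadata, and so on) is up to you, and the version strings are made-up examples.

```python
from typing import Mapping


def safe_to_mutate_iam(versions: Mapping[str, str]) -> bool:
    """Return True only when every instance reports the same non-empty version.

    `versions` maps an instance name to its version string, gathered by
    whatever means your deployment offers (an assumption of this sketch).
    """
    distinct = set(versions.values())
    return len(versions) > 0 and len(distinct) == 1 and "" not in distinct


mid_rollout = {"dgp-writer": "1.5.0", "dgp-reader-1": "1.4.2"}  # hypothetical versions
rollout_done = {"dgp-writer": "1.5.0", "dgp-reader-1": "1.5.0"}
```

With these inputs, `safe_to_mutate_iam(mid_rollout)` is false and `safe_to_mutate_iam(rollout_done)` is true, which matches the guidance to hold IAM changes until the whole fleet runs one binary.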

Verify

# 1. On the writer: create a throwaway user (admin GUI or API), then force a pull on a reader
curl -b cookies -X POST https://dgp-reader-1:9000/_/api/admin/config/sync-now

# 2. The reader sees the new user
curl -b cookies https://dgp-reader-1:9000/_/api/admin/users | jq '.[] | .name'

# 3. The new user's credentials work against the reader
aws s3 ls --endpoint-url https://dgp-reader-1:9000

Watch the reader's logs for [config-sync] lines — a download on ETag change is the success signal; continuous ETag-mismatch retries mean two writers.
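Telling an occasional mismatch from a continuous one can be automated over a captured log. The sketch below counts consecutive mismatch lines among the [config-sync] lines. The threshold of 3 is an arbitrary assumption, not a documented value, and any non-mismatch [config-sync] line is treated as ending the run.

```python
from typing import Iterable

SYNC_TAG = "[config-sync]"
MISMATCH_MARKER = "[config-sync] ETag mismatch"  # start of the message quoted in this guide


def longest_mismatch_run(lines: Iterable[str]) -> int:
    """Return the longest run of consecutive [config-sync] lines reporting a mismatch."""
    longest = current = 0
    for line in lines:
        if SYNC_TAG not in line:
            continue  # unrelated log line, ignore it
        if MISMATCH_MARKER in line:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # any other config-sync line breaks the run
    return longest


def looks_like_two_writers(lines: Iterable[str], threshold: int = 3) -> bool:
    """Flag a log whose mismatch run reaches `threshold` (an assumed cut-off)."""
    return longest_mismatch_run(lines) >= threshold
```

Feed it the reader's log lines for the last few poll cycles; a single mismatch followed by a successful download stays below the threshold.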

How to use non-CAS backends safely — the conditional-write validation that runs at startup when the sync bucket is set, and what it refuses
How to back up and restore — sync replicates state; it does not protect it
How to manage IAM as code — the GitOps alternative to a designated writer
How to monitor with Prometheus and Grafana — scraping multiple targets
Configuration reference — config-sync fields