HA Controller (Redis)
PrexorCloud HA is active-active with lease-scoped work. Multiple controllers run simultaneously against the same MongoDB and Valkey; any healthy controller serves REST and gRPC. Mutating work is serialised through Valkey leases with monotonic fencing tokens — there is no single primary waiting for failover. This guide adds a second controller to a single-controller install, validates the failover, and covers operator-facing HA semantics.
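The fencing behaviour described above can be sketched in a few lines. This is illustrative Python, not PrexorCloud code; `FencedStore` and its token handling are simplified assumptions showing why a stale lease holder cannot corrupt state:

```python
# Illustrative sketch of monotonic fencing tokens (NOT PrexorCloud's implementation).
class FencedStore:
    """Accepts a write only if its fencing token is the highest seen for that key."""

    def __init__(self):
        self.data = {}
        self.highest_token = {}

    def write(self, key, value, token):
        # A lease re-grant always carries a strictly larger token, so any
        # late write from the previous holder arrives with a smaller one.
        if token <= self.highest_token.get(key, 0):
            return False  # stale holder: reject the write
        self.highest_token[key] = token
        self.data[key] = value
        return True

store = FencedStore()
assert store.write("group:X", "v1", token=1)      # original lease holder
assert store.write("group:X", "v2", token=2)      # new holder after failover
assert not store.write("group:X", "v3", token=1)  # old holder's late write rejected
```

Because rejection is decided by the store, correctness does not depend on the old controller noticing it lost the lease.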
What you’ll build
```mermaid
flowchart LR
    D["dashboard / prexorctl"] --> LB["L4 load balancer<br/>:8080 :9090"]
    LB --> C1["controller-1"]
    LB --> C2["controller-2"]
    C1 <-->|leases + pubsub| V[("Valkey")]
    C2 <-->|leases + pubsub| V
    C1 -->|state| M[("MongoDB")]
    C2 -->|state| M
    C1 --> N1["daemon node-1"]
    C2 --> N2["daemon node-2"]
```

Two controllers, one Valkey, one MongoDB, lease-scoped writes, and roughly 15-second failover with no manual intervention.
Prerequisites
- PrexorCloud v1.0+ controller already running (single-node) on `controller-1`.
- A shared MongoDB (replica set recommended) reachable from both controller hosts.
- A shared Valkey (or Redis-protocol-compatible store) reachable from both controller hosts. Valkey ships with the project’s Compose stack.
- Reasonably synchronised clocks across controllers (NTP). Fencing protects against split-brain writes; clock skew only affects lease-expiry timing.
- An L4 load balancer (HAProxy, nginx-stream, AWS NLB, Cloudflare Spectrum) in front of `:8080` (REST) and `:9090` (gRPC). Round-robin is fine.
1. Confirm the existing controller runs the production profile
HA requires the production wiring graph (`RedisRuntimeServices`).
Check `/etc/prexorcloud/controller.yml`:

```yaml
runtime:
  profile: production
redis:
  uri: "redis://valkey.internal:6379"
mongo:
  uri: "mongodb://mongo-rs/prexorcloud?replicaSet=rs0"
```

`ConfigValidator` rejects a production config without a configured coordination store. If yours says `profile: development`, switch to `production` and restart the controller before continuing.
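The rejection rule is simple enough to state as code. A hypothetical check mirroring what `ConfigValidator` enforces (function name and error text are assumptions, not the real API):

```python
# Hypothetical validation mirroring ConfigValidator's rule: a production
# profile must name a coordination store (redis.uri).
def validate(config: dict) -> list[str]:
    errors = []
    profile = config.get("runtime", {}).get("profile")
    if profile == "production" and not config.get("redis", {}).get("uri"):
        errors.append("production profile requires a coordination store (redis.uri)")
    return errors

assert validate({"runtime": {"profile": "development"}}) == []
assert validate({"runtime": {"profile": "production"},
                 "redis": {"uri": "redis://valkey.internal:6379"}}) == []
assert validate({"runtime": {"profile": "production"}}) != []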
2. Add the second controller
On the new host (`controller-2`), point the installer at the existing MongoDB and Valkey and skip bootstrapping:

```shell
sudo prexorctl setup \
  --component controller \
  --controller-mongo-uri "mongodb://mongo-rs/prexorcloud?replicaSet=rs0" \
  --controller-redis-uri "redis://valkey.internal:6379" \
  --bootstrap=false \
  --non-interactive
```

`--bootstrap=false` skips admin-user creation (the existing controller already has one) and skips CA generation (the new controller reads the existing CA from Mongo). Within ~15 seconds:

```shell
prexorctl status
# CONTROLLERS
# controller-1   READY   leases=4
# controller-2   READY   leases=2
```

Both controllers compete for leases; mutating work distributes automatically. Reads are served from any controller.
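In automation, the "within ~15 seconds" window is easiest to handle with a readiness poll. A hypothetical helper, where `probe` stands in for any check against the new controller's `/api/v1/system/ready`:

```python
import time

# Hypothetical poll: `probe` is any callable returning True once the
# new controller answers its readiness check with HTTP 200.
def wait_ready(probe, timeout_s=30, interval_s=1.0):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

# Stubbed probe for demonstration: becomes ready on the third poll.
calls = iter([False, False, True])
assert wait_ready(lambda: next(calls), timeout_s=5, interval_s=0.01)
```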
3. Put the load balancer in front
Example HAProxy snippet:
```
frontend prexor-rest
    bind :8080
    default_backend prexor-rest-be

backend prexor-rest-be
    option httpchk GET /api/v1/system/ready
    server c1 controller-1.internal:8080 check inter 5s fall 2 rise 2
    server c2 controller-2.internal:8080 check inter 5s fall 2 rise 2

frontend prexor-grpc
    bind :9090
    mode tcp
    default_backend prexor-grpc-be

backend prexor-grpc-be
    mode tcp
    balance roundrobin
    server c1 controller-1.internal:9090 check
    server c2 controller-2.internal:9090 check
```

`/api/v1/system/ready` is the single readiness endpoint; it returns 200 only when both `coordination.store=available` and `state.store=available`.
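The readiness contract reduces to a small predicate. A hypothetical sketch (the handler name and signature are not PrexorCloud's; only the 200-iff-both-stores rule comes from the docs above):

```python
# Sketch of the combined readiness decision behind /api/v1/system/ready:
# 200 only when both backing stores report available, else 503.
def ready_status(coordination_available: bool, state_available: bool) -> int:
    return 200 if (coordination_available and state_available) else 503

assert ready_status(True, True) == 200
assert ready_status(True, False) == 503   # Valkey up, Mongo down: not ready
assert ready_status(False, True) == 503   # Mongo up, Valkey down: not ready
```

This is why the LB drains a controller that loses either store, not just one that crashes outright.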
Point `prexorctl` at the load balancer:

```shell
prexorctl config set controller "https://prexor.example.com:8080"
prexorctl status   # should still show two controllers
```

Daemons keep their existing certs and reconnect through the LB to whichever controller serves them; node ownership is tracked via `prexor:v1:node:<id>:owner` keys.
4. Validate failover
Stop one controller and watch the survivor pick up the leases:
```shell
# On controller-1:
sudo systemctl stop prexorcloud-controller
```

```shell
# On any host that can reach controller-2:
prexorctl --controller https://controller-2:8080 status
# CONTROLLERS
# controller-2   READY   leases=6   ← absorbed all leases
```

Within `lease.timeoutSeconds` (default 15s) the survivor reconciles live node sessions, persisted workflows, and runtime records before issuing additional mutations. The harness verifies this at four points in `RecoveryTest`: drain, deployment, placement-time, and in-flight module mutation.
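The takeover timing can be modelled with a toy lease. The class, field names, and acquisition rule below are illustrative assumptions; only the 15-second default comes from `lease.timeoutSeconds`:

```python
# Toy lease model: a lease can be taken over only once the previous
# holder has gone timeout_s without renewing it. Illustrative only.
class Lease:
    def __init__(self, timeout_s=15):
        self.timeout_s = timeout_s
        self.holder = None
        self.renewed_at = None

    def acquire(self, who, now):
        expired = self.renewed_at is None or now - self.renewed_at >= self.timeout_s
        if self.holder is None or expired or self.holder == who:
            self.holder, self.renewed_at = who, now
            return True
        return False

lease = Lease()
assert lease.acquire("controller-1", now=0)
assert not lease.acquire("controller-2", now=10)  # controller-1's lease still live
assert lease.acquire("controller-2", now=15)      # expired: survivor takes over
```

The worst case is a controller dying immediately after a renewal, which is where the ~15-second failover bound comes from.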
Restart controller-1:
```shell
sudo systemctl start prexorcloud-controller
prexorctl status
# CONTROLLERS
# controller-1   READY   leases=2   ← reacquires its share
# controller-2   READY   leases=4
```

How to verify it works
The four signals you want green at all times:
- Both `/api/v1/system/ready` endpoints return 200. LB health checks catch this immediately.
- `prexorctl status` lists every expected controller in `READY`.
- `prexorcloud.lease.acquired.total` (Prometheus, per controller) is monotonically increasing. Long flat lines mean a controller stopped doing mutating work.
- `prexorcloud.fencing.token.rejected.total` is zero or near zero. Bursts mean clock skew or a stuck-process fencing fight; investigate.
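The last two signals translate naturally into alert rules. An illustrative Prometheus fragment, assuming the dotted metric names above are exposed with the usual underscore mapping; thresholds and windows are guesses to tune, not project defaults:

```yaml
# Illustrative alerting rules; metric spellings and thresholds are assumptions.
groups:
  - name: prexorcloud-ha
    rules:
      - alert: ControllerStoppedMutating
        expr: increase(prexorcloud_lease_acquired_total[15m]) == 0
        for: 15m
        annotations:
          summary: "Controller {{ $labels.instance }} acquired no leases in 15m"
      - alert: FencingTokenRejections
        expr: increase(prexorcloud_fencing_token_rejected_total[5m]) > 0
        annotations:
          summary: "Fencing rejections on {{ $labels.instance }}; check NTP / clock skew"
```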
The nightly perf-baseline job (`.github/workflows/nightly.yml`) exercises a multi-controller path; the drift comparator alerts on regressions.
What HA gives you over a single controller
| Feature | Single | HA |
|---|---|---|
| Write availability through one controller dying | no | yes (~15s) |
| SSE replay across controller restart | depends on profile | yes (Valkey-buffered) |
| Module mutation handoff during restart | no | yes |
| Workflow handoff (drain, deploy, healing) | no | yes |
| Persisted REST + workload rate-limit windows | yes (production) | yes |
| Cluster event fanout (Redis pub/sub) | local only | pub/sub fanout |
Common pitfalls
| Symptom | Likely cause |
|---|---|
| Both controllers stop accepting writes after the add | Pointed at the same Mongo but different Valkey instances. Confirm `redis.uri` matches on both. |
| Operations on group X hang for >60s | Stuck lease. `redis-cli -u $REDIS_URI GET prexor:v1:lease:group:X` to inspect; `DEL` to force-release if needed (fencing tokens reject stale writes). |
| `prexorcloud.fencing.token.rejected` spikes | Clock skew between controllers. NTP-sync the hosts; the fencing-token math itself is safe. |
| Daemons flap between controllers | Node-ownership records are timing out due to network blips. Increase `node.ownership.ttlSeconds`. |
| Restore races with HA | `prexorctl restore` does not currently grab all mutation leases. Stop every controller before restoring. |
Where to go next
- Concepts → Architecture §“HA model” — why active-active rather than primary/standby.
- Guides → Backup + Restore — special HA care during restores.
- Operations → Monitoring — Prometheus alerts specific to HA (lease churn, fencing rejects).
- Operations → HA Setup — operator runbook with the full failure-mode catalog.