# Upgrading
PrexorCloud upgrades are designed around two facts: controllers are active-active under shared MongoDB + Valkey, and daemons are stateless re-attachers. Together, these let you upgrade in place without a maintenance window, provided you follow the steps in order.
## What you'll learn
- The pre-flight checks every upgrade requires, regardless of size
- How to roll a single controller (downtime) versus an HA pair (zero downtime)
- Daemon and module upgrades — drain, replace, rejoin
- The rollback path when an upgrade goes wrong
## Pre-flight (always)
- Read the release notes for every version between current and target. Pay attention to:
  - Config schema changes (new required keys, deprecated keys).
  - Mongo schema migrations — the controller logs `migration applied:` on startup; some require a manual data backfill (release notes call these out).
  - Module SDK or capability changes that might break installed modules.
- Verify the current install is healthy:

  ```sh
  prexorctl status
  curl -fs http://localhost:8080/api/v1/system/ready
  ```

- Take a backup. Always.

  ```sh
  prexorctl backup create --label "pre-upgrade-$(date -u +%Y%m%d)"
  ```

- Check module compatibility. Confirm each installed module is compatible with the target release.

  ```sh
  prexorctl module list
  ```

- Verify the new release is signed.

  ```sh
  cosign verify-blob \
    --certificate-identity-regexp "^https://github.com/prexorjustin/prexorcloud/.github/workflows/release.yml@refs/tags/" \
    --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
    --signature checksums.txt.sig \
    --certificate checksums.txt.pem \
    checksums.txt
  sha256sum -c checksums.txt
  ```
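One more habit worth keeping: record the currently installed versions before you touch anything, so the rollback step later has a concrete `<previous-version>` to target. A minimal sketch, assuming the apt packages used in the steps below:

```sh
# Record installed versions for rollback reference (apt-based installs).
dpkg-query -W -f='${Package} ${Version}\n' \
  prexorcloud-controller prexorcloud-daemon | tee pre-upgrade-versions.txt
```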
## Single-controller upgrade
This path causes ~10–60s of downtime. Use it only when you’re not running HA.
```sh
# 1. Stop the controller.
sudo systemctl stop prexorcloud-controller
```

```sh
# 2. Replace the binary / package / image.

# A — package manager:
sudo apt-get install --only-upgrade prexorcloud-controller

# B — manual jar swap:
sudo cp prexorcloud-controller-<new-version>.jar /opt/prexorcloud/lib/

# C — Docker Compose:
docker compose pull controller
docker compose up -d controller
```

```sh
# 3. Watch it come back up.
sudo systemctl start prexorcloud-controller
sudo journalctl -u prexorcloud-controller -f
```

Watch for:
- `migration applied:` — normal when restoring an older config or bumping schema.
- `migration failed:` — stop, restore the pre-flight backup, open an issue. Do not improvise schema fixes.
- `coordination.store=available` — Valkey reachable.
- `state.store=available` — Mongo reachable.
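If you'd rather not eyeball the full stream, a quick filter for just these markers (a sketch; the marker strings are taken from the list above):

```sh
# Surface only the startup markers of interest from the last few minutes.
sudo journalctl -u prexorcloud-controller --since "5 min ago" \
  | grep -Ei 'migration (applied|failed)|coordination\.store|state\.store'
```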
Verify:
```sh
curl -fs http://localhost:8080/api/v1/system/ready
prexorctl status
```

If `/system/ready` does not go green within two minutes:

```sh
sudo journalctl -u prexorcloud-controller --since "5 min ago" | grep -i ERROR
```

Most upgrade failures are config drift (a new required key) or a Mongo migration that needs manual intervention.
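For scripted upgrades, the two-minute window translates naturally into a bounded poll; a minimal sketch using only the readiness endpoint above:

```sh
# Poll readiness every 5 seconds, for up to 2 minutes.
ready=false
for i in $(seq 1 24); do
  if curl -fs http://localhost:8080/api/v1/system/ready >/dev/null; then
    ready=true; break
  fi
  sleep 5
done
$ready || echo "controller not ready after 2 minutes; check the journal" >&2
```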
## HA controller upgrade (zero downtime)
Upgrade the controllers one at a time. The surviving controller picks up leases automatically within `nodeTimeoutSeconds` of the stopped one losing its session.
```sh
# On controller-1:
sudo systemctl stop prexorcloud-controller

# controller-2 acquires leases within ~lease-timeout seconds.
# Verify on controller-2:
curl -fs http://controller-2:8080/api/v1/system/ready
prexorctl status

# Upgrade and restart controller-1.
sudo apt-get install --only-upgrade prexorcloud-controller
sudo systemctl start prexorcloud-controller

# Wait until controller-1 reports ready.
curl -fs http://controller-1:8080/api/v1/system/ready

# Then repeat on controller-2.
```

```mermaid
sequenceDiagram
    participant C1 as Controller-1
    participant V as Valkey (leases)
    participant C2 as Controller-2
    Note over C1,C2: Both serving traffic, leases distributed
    C1->>V: stop heartbeat
    Note over C1: stopped for upgrade
    V-->>C2: lease expired (after ~timeout)
    C2->>V: acquire lease, bump fencing token
    Note over C2: serves all traffic with fresh tokens
    C1->>V: restart, request leases
    V-->>C1: distribute new lease set
    Note over C1,C2: Both serving again
```
While controllers run mixed versions, the schema must be backwards-compatible. PrexorCloud guarantees this within a single minor release (e.g. 0.7.x ↔ 0.7.y) and during one major hop (e.g. 0.7 ↔ 0.8). Skipping majors (0.7 → 0.9) is not supported during a rolling upgrade — stop all controllers, upgrade Mongo schema, then start them.
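A quick way to confirm you are not about to skip a major during the roll (a sketch, assuming the apt packaging from the steps above):

```sh
# Compare the installed controller version against the upgrade candidate
# before starting the rolling upgrade.
dpkg-query -W -f='installed: ${Version}\n' prexorcloud-controller
apt-cache policy prexorcloud-controller | grep -i candidate
```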
### Lease handoff timing
Failover is bounded by `scheduler.nodeTimeoutSeconds` (default 90s for node sessions) and the lease TTL (typically `scheduler.evaluationIntervalSeconds` × 2). For controller-restart failover, the surviving controller sees lease expiry within ~30 seconds of the stopped controller's last heartbeat and resumes mutations immediately. In-flight operations under the stopped controller are bounded by fencing — the new lease holder bumps the fencing token, and the stopped controller cannot write under the old token if it comes back unaware.
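To make the bound concrete, here is the TTL arithmetic as a runnable sketch; the 15-second evaluation interval is a hypothetical value chosen to match the ~30-second figure above, not a documented default:

```sh
# Illustrative lease-TTL arithmetic (the interval value is an assumption).
EVAL_INTERVAL_S=15                      # hypothetical scheduler.evaluationIntervalSeconds
LEASE_TTL_S=$((EVAL_INTERVAL_S * 2))    # "evaluationIntervalSeconds × 2" per the text
echo "lease TTL ${LEASE_TTL_S}s => failover within ~${LEASE_TTL_S}s of last heartbeat"
```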
## Daemon upgrade
Daemons are upgraded one at a time. The controller continues to schedule onto un-upgraded daemons.
```sh
# Drain the node first so running instances finish gracefully.
prexorctl node drain <node-id> --shutdown=false --timeout 5m
```
```sh
# Wait until the node reports zero running instances.
prexorctl node info <node-id>
```
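If you want the wait unattended, a polling sketch; the grep pattern assumes `prexorctl node info` prints a human-readable running-instance count, so adjust it to the actual output format:

```sh
# Re-check every 15 seconds until the running-instance count reaches zero.
# The matched string is an assumption about the output format.
while prexorctl node info <node-id> | grep -Eqi 'running instances?: *[1-9]'; do
  sleep 15
done
echo "node drained"
```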
```sh
# Stop, upgrade, start.
sudo systemctl stop prexorcloud-daemon
sudo apt-get install --only-upgrade prexorcloud-daemon
sudo systemctl start prexorcloud-daemon
```
```sh
# Confirm.
prexorctl node list
prexorctl node undrain <node-id>
```

The daemon's existing mTLS certificate carries across upgrades — nothing to re-issue. If the upgrade changes the gRPC contract enough that the cert can no longer authenticate (very rare; we treat this as a major-version event), re-issue per Rotate Secrets.
## Module upgrade
State-preserving hot reload is intentionally not supported. Upgrading a module triggers a planned controller-side reload for that bundle:
```sh
prexorctl module install ./my-module-2.0.0.bundle
# This creates a new module-package record. Existing instances keep
# the previous version until the group is redeployed.
```
```sh
prexorctl group deploy <group> --module my-module=2.0.0
```

`group deploy` performs a rolling restart of the affected instances.
Watch progress:

```sh
prexorctl group info <group>
prexorctl workflow list --filter "group=<group>"
```
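To keep the progress view refreshing until the rollout settles, the command above wraps cleanly in `watch` (a convenience sketch):

```sh
# Refresh the group view every 5 seconds; Ctrl-C once the rollout is done.
watch -n 5 "prexorctl group info <group>"
```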
## Dashboard upgrade

The dashboard is a separate Node project. Backward / forward compatibility is guaranteed across one minor version.
```sh
# Compose:
docker compose pull dashboard
docker compose up -d dashboard
```

```sh
# Bare-metal: ship the new bundle, restart the systemd unit / nginx /
# whatever is serving the static files.
```

## Rollback
If the upgrade fails:
- Stop the controller(s).
- Reinstall the previous package version:
  ```sh
  sudo apt-get install prexorcloud-controller=<previous-version>
  ```

- Restore the backup taken in pre-flight (only required if a Mongo schema migration ran during the failed upgrade — release notes will say). See Backups and DR.
- Start the controller(s) and verify.
For HA, roll back the upgraded controllers in reverse order before restoring data.
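A post-rollback sanity pass using the same checks as pre-flight, plus an optional package hold so an unattended `apt-get upgrade` cannot re-apply the bad version (the hold is an optional extra, not a documented step):

```sh
# Confirm the rolled-back controller is healthy.
prexorctl status
curl -fs http://localhost:8080/api/v1/system/ready

# Optional: pin the package until the underlying issue is resolved.
sudo apt-mark hold prexorcloud-controller
```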
## Validation checklist
After a successful upgrade, confirm:
- `/api/v1/system/ready` returns 200 on every controller.
- `prexorctl status` lists all expected nodes in `READY`.
- `prexorctl group list` shows expected groups, no `desiredVersion != currentVersion` drift.
- `prexorctl module list` shows each installed module in `ACTIVE`.
- `prexorctl crash list --since "10 min ago"` is empty (or only shows pre-existing entries).
- No new errors: `journalctl -u prexorcloud-controller --since "10 min ago" | grep -i ERROR`.
- Audit log shows `controller.startup.completed` for the new version.
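A minimal sketch that automates the command-line checks above (the controller hostnames are placeholders; the audit-log check is left manual):

```sh
#!/bin/sh
# Post-upgrade validation sweep using the documented checks.
for host in controller-1 controller-2; do   # substitute your controllers
  if curl -fs "http://$host:8080/api/v1/system/ready" >/dev/null; then
    echo "$host: ready"
  else
    echo "$host: NOT ready" >&2
  fi
done
prexorctl status
prexorctl group list
prexorctl module list
prexorctl crash list --since "10 min ago"
journalctl -u prexorcloud-controller --since "10 min ago" \
  | grep -i ERROR || echo "no new controller errors"
```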
## Common failures
| Symptom | Likely cause | Fix |
|---|---|---|
| Controller fails to start, log says `unknown config key` | Removed key still present in `controller.yml` | Edit out the key, restart. |
| Controller starts but `coordination.store=unavailable` | New release requires a Valkey/Redis feature | Upgrade Valkey/Redis to the documented minimum. |
| Daemons disconnect after upgrade | mTLS client trust changed (very rare) | Re-issue daemon certificates per Rotate Secrets. |
| Module install rejected after upgrade | Manifest schema bumped | Re-publish the module against the new SDK; existing installs keep running. |
| Audit log spikes with `migration applied` | Normal — schema migrations run once on startup | None. Confirm no `migration failed` follows. |
| HA peer can't take leases after one upgrades | Mixed versions skipped more than one major | Stop both, upgrade in lockstep instead of rolling. |
## Why HA rolling works
Two pieces make it safe:
- Lease + fencing tokens. When controller-2 takes a lease that controller-1 just lost, the fencing token bumps. controller-1 cannot mutate under the old token even if it comes back without noticing. See Architecture.
- Persisted workflows. In-flight rolling restarts, drains, placements, and module mutations live in the `workflow_intents` Mongo collection. The new lease holder reads the intent and resumes deterministically (inspection sketch below).
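If you want to see what the new lease holder will resume, the collection can be inspected directly; a sketch, where the database name `prexorcloud` and the `status` field are assumptions about the schema:

```sh
# List a few workflow intents that are not yet complete
# (database name and field names are assumed; check your schema).
mongosh prexorcloud --quiet --eval \
  'db.workflow_intents.find({ status: { $ne: "completed" } }).limit(10)'
```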
The harness exercises this — `RecoveryTest` runs four mid-failover scenarios (drain, deployment, placement-time, in-flight module mutation) and asserts no duplicate side effects.
## Next up
- Backups and DR — pre-flight backup procedure
- HA Setup — multi-controller install + lease semantics
- Production Checklist — pre-launch hardening