# Backups and Disaster Recovery
A backup you have never restored is not a backup. This page defines what PrexorCloud considers a backup, what it captures, the RPO/RTO targets the project commits to, and the manual + automated drills that verify those targets still hold.
## What you’ll learn
- What is in a PrexorCloud backup and what isn’t
- The recommended backup cadence and retention shape
- The restore procedure, including selective and dry-run modes
- The RPO/RTO contract per data tier and the drill that exercises it
## Recovery targets
| Tier | Source | RPO | RTO | What “recovered” means |
|---|---|---|---|---|
| 1 — Durable state | MongoDB | ≤ 1h | ≤ 30 min | Controller boots, dashboard logs in, every pre-incident group / template / deployment / audit row is back, daemons reconnect with their existing certs. |
| 2 — Coordination | Valkey | best-effort | ≤ 5 min | Empty Valkey is acceptable. Controller rebuilds leases on first reconciliation; JWT revocations are forgiven; SSE clients reconnect with replay-from-zero. |
| 3 — Filesystem | `config/`, `data/certs/`, `modules/` | ≤ 24h | ≤ 30 min | `controller.yml`, the CA, and module data are recoverable. The CA private key is the only irreplaceable material — if lost, every daemon must rejoin from scratch. |
| 4 — Daemon hosts | `daemon.yml` + per-daemon mTLS | n/a | ≤ 15 min/node | Daemon restored; reconnects and reconciles instances from the controller. |
RPO is measured as time-since-last-successful-backup. The targets above are predicated on a backup cadence of at least hourly Mongo, daily filesystem, optional Valkey. If your cadence is slower, your RPO is whatever your cadence is — PrexorCloud does not run the cron for you.
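A simple watchdog makes that measurable. The sketch below assumes backups land under `<install-root>/backups/<id>/` (see "Take a backup" below) and uses `/opt/prexorcloud` as a stand-in install root:

```bash
#!/usr/bin/env bash
# Hypothetical RPO watchdog: fail when the newest manifest is older than the
# Tier 1 target. BACKUP_ROOT is an assumption; substitute your install root.
BACKUP_ROOT=/opt/prexorcloud/backups
MAX_AGE_SECS=3600   # Tier 1: RPO <= 1h
newest=$(find "$BACKUP_ROOT" -name manifest.json -printf '%T@\n' 2>/dev/null | sort -n | tail -1)
now=$(date +%s)
if [ -z "$newest" ] || [ $((now - ${newest%.*})) -gt "$MAX_AGE_SECS" ]; then
  echo "RPO breach: no backup newer than ${MAX_AGE_SECS}s" >&2
  exit 1
fi
echo "RPO OK: last backup $((now - ${newest%.*}))s ago"
```

Wire it into whatever monitoring you already page from.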
## What is in a backup

`prexorctl backup create` produces a single tarball that captures:
| Tier | Source | Loss impact |
|---|---|---|
| Durable platform | MongoDB (full dump) | Catastrophic — every group, deployment, audit entry, module record. |
| Coordination | Valkey RDB snapshot (optional) | Tolerable — login lockouts and SSE replay reset, JWT revocations forgiven. |
| Filesystem | `config/`, `data/certs/`, `modules/` | Catastrophic for config and CA; recoverable for module storage. |
| Per-module storage | Mongo (`mod_*` collections) and Valkey (`prexor:v1:platform:<id>:*`) | Module-defined. |
Each manifest records `createdAt`, `controllerVersion`, `host`, the scopes captured, and the file references. The on-disk artefact is the source of truth; the `backups` Mongo collection is a searchable index.
## What is not in a backup
- Live state of running MC instances. Per-world / per-player game data is the operator’s responsibility; most server jars include hot snapshots.
- DNS, load balancer, and reverse proxy config in front of the controller.
- Operator dashboard browser sessions (cookies; transient anyway).
## Recommended cadence
| Frequency | Scope | Retention |
|---|---|---|
| Hourly | Mongo only | 24 hours |
| Daily | Full (Mongo + Valkey + filesystem) | 14 days |
| Weekly | Full + off-host ship | 90 days |
| Pre-upgrade | Full | Until next stable upgrade window |
The audit log lives in Mongo, so the hourly cadence keeps audit-trail loss bounded.
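One way to express that cadence as cron entries, assuming `prexorctl` is on `PATH` for a dedicated `prexor` service user (the user name and schedule offsets are assumptions; the flags are the ones documented below):

```bash
# /etc/cron.d/prexorcloud-backup: illustrative schedule, not shipped by PrexorCloud.
# Hourly Mongo-only backup; midnight is skipped because the daily full covers it.
0 1-23 * * * prexor prexorctl backup create --scope mongo
# Daily full backup (Mongo + Valkey + filesystem) at 00:00 UTC.
0 0 * * *    prexor prexorctl backup create
# Prune to the daily retention once the full backup has finished.
30 0 * * *   prexor prexorctl backup prune --keep 14
```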
## Take a backup

```bash
# Full backup, default scope (Mongo + Valkey + filesystem).
prexorctl backup create

# Mongo only — fastest path; suitable for the hourly cadence.
prexorctl backup create --scope mongo

# Pre-upgrade snapshot with a label.
prexorctl backup create --label "pre-v1.4-upgrade"

# List manifests.
prexorctl backup list

# Verify integrity (checksums + structural restore-dry-run).
prexorctl backup verify <manifest-id>

# Prune older than retention.
prexorctl backup prune --keep 14
```

The CLI wraps `BackupCreator` and `BackupCatalog` — the same code the controller uses internally. Output goes to `<install-root>/backups/<id>/`.
## Manual fallback

When the CLI is unavailable, the manual procedure lives in the backup runbook. TL;DR:

```bash
BK=/var/backups/prexorcloud/$(date -u +%Y-%m-%dT%H%M%SZ)
sudo mkdir -p "$BK"
mongodump --uri "$MONGO_URI" --gzip --out "$BK/mongo"
# BGSAVE is asynchronous: wait for LASTSAVE to advance before copying the RDB.
last=$(redis-cli -u "$REDIS_URI" LASTSAVE)
redis-cli -u "$REDIS_URI" BGSAVE
while [ "$(redis-cli -u "$REDIS_URI" LASTSAVE)" = "$last" ]; do sleep 1; done
sudo cp /var/lib/valkey/dump.rdb "$BK/valkey-dump.rdb"
sudo tar -czf "$BK/etc-prexorcloud.tar.gz" -C /etc prexorcloud
```

Write a `manifest.json` matching the CLI’s schema and ship the directory off-host (encrypted with `age` or `gpg`).
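The manifest fields named earlier (`createdAt`, `controllerVersion`, `host`, scopes, file references) suggest a shape like the sketch below. This is illustrative only, not the CLI's actual schema; diff it against a CLI-produced manifest before relying on it:

```bash
# Illustrative manifest: field names beyond createdAt/controllerVersion/host/scopes are guesses.
cat > "$BK/manifest.json" <<EOF
{
  "createdAt": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "controllerVersion": "manual-fallback",
  "host": "$(hostname -f)",
  "scopes": ["mongo", "valkey", "filesystem"],
  "files": {
    "mongo": "mongo/",
    "valkey": "valkey-dump.rdb",
    "filesystem": "etc-prexorcloud.tar.gz"
  }
}
EOF
```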
## Off-host shipping

A backup that lives only on the controller is one disk failure from useless.

```bash
# Encrypt then ship.
sudo tar -cf - -C /var/backups/prexorcloud "$BK_NAME" \
  | age -r age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx... \
  > "$BK_NAME.tar.age"
aws s3 cp "$BK_NAME.tar.age" s3://your-backups/prexorcloud/
```

Use whatever encrypted off-host store fits your workflow — S3, restic, borg, Backblaze B2.
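The reverse path, for when you need the artefact back, assuming the `age` identity file lives somewhere that survives the controller (the key path is a placeholder):

```bash
# Pull the artefact down and decrypt straight into the local backups directory.
aws s3 cp "s3://your-backups/prexorcloud/$BK_NAME.tar.age" .
age -d -i ~/keys/prexorcloud-backups.agekey "$BK_NAME.tar.age" \
  | sudo tar -xf - -C /var/backups/prexorcloud
```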
## Restore
Restore replays a backup taken with the procedure above. The flow is the same whether you are recovering from a corrupt Mongo, a lost host, or a botched upgrade.
### Decision tree
| Scenario | Restore? |
|---|---|
| Single controller died, HA peer healthy | No — fail over. See the recover-controller runbook. |
| Mongo corrupted / dropped | Yes, Mongo + filesystem. |
| Valkey emptied / lost | Usually no; controllers rebuild. |
| Bad upgrade rolled back the binary | Maybe — only if the failed upgrade ran a Mongo migration. Release notes call this out. |
| Bad config push | Filesystem only (`config/controller.yml`). |
| Module accidentally uninstalled | Selective Mongo restore of `module_packages` and `mod_<moduleId>_*` (sketch below). |
When in doubt, restore the full backup to a staging controller first and validate before pointing production at it.
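For the selective module case above, stock `mongorestore` namespace filters against the dump inside the backup are one approach. A sketch: the `prexorcloud` database name is an assumption, and `<moduleId>` stays whatever the affected module's ID is:

```bash
# Restore only the module records; every other namespace in the dump is skipped.
mongorestore --uri "$MONGO_URI" --gzip --dir "<backup>/mongo" \
  --nsInclude 'prexorcloud.module_packages' \
  --nsInclude 'prexorcloud.mod_<moduleId>_*'
```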
### Run a restore

```bash
# Always dry-run first.
prexorctl restore /var/backups/prexorcloud/<id>/manifest.json --dry-run

# Apply.
prexorctl restore /var/backups/prexorcloud/<id>/manifest.json \
  --filesystem --datastores
```

`--dry-run` runs the restore validator and reports scope conflicts without mutating anything. The validator runs again before any APPLY, regardless of `--dry-run` — the dry-run mode just stops there.
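Assuming the scope flags compose with `--dry-run` the same way (verify on your version), the bad-config-push case from the decision tree can be rehearsed and applied at filesystem scope only:

```bash
# Filesystem-only restore for a bad config push; Mongo and Valkey stay untouched.
prexorctl restore /var/backups/prexorcloud/<id>/manifest.json --dry-run --filesystem
prexorctl restore /var/backups/prexorcloud/<id>/manifest.json --filesystem
```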
### Validation checklist
After restore, before declaring it good:
- `/api/v1/system/ready` returns 200.
- `prexorctl status` shows expected controllers, nodes, groups.
- `prexorctl group list` shows expected groups.
- `prexorctl module list` shows installed modules in `ACTIVE`.
- Spot-check the audit log: `db.audit_log.find().sort({createdAt:-1}).limit(20)`.
- All daemons reconnect — `prexorctl node list` shows `READY`.
- Smoke-test a deploy on a non-prod group.
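The scriptable part of that checklist condenses to a few lines; a sketch, with the controller URL as a placeholder:

```bash
#!/usr/bin/env bash
set -euo pipefail
# Post-restore smoke check: readiness endpoint plus the CLI views from the checklist.
curl -fsS https://controller.example.com/api/v1/system/ready >/dev/null && echo "ready: 200"
prexorctl status
prexorctl group list
prexorctl module list
prexorctl node list
```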
If daemon mTLS material was not in the restored backup, daemons fail at the TLS layer. Re-issue:
```bash
prexorctl token create --description "rejoin-after-restore" --ttl 1h

# On each daemon:
sudo prexorctl setup --role daemon --rejoin --join-token <token>
```

## The DR drill
The drill exists because backup tooling that never runs a restore isn’t backup tooling.
### Nightly automated drill
The `dr-drill` job in `.github/workflows/nightly.yml` runs the end-to-end cycle every night against ephemeral Mongo + Valkey service containers:

```bash
cd java
./gradlew :cloud-test-harness:drDrill
```

The harness:
1. Boots an in-process controller with a real Mongo + Valkey.
2. Seeds a deterministic fixture (one template, two groups with distinct platform / scaling / priority shapes).
3. Snapshots declarative state.
4. Calls `POST /api/v1/backups`; verifies the manifest with `POST /api/v1/backups/{id}/verify`.
5. Stops the controller, drops the Mongo database, flushes the Valkey logical DB.
6. Brings the controller back; asserts the seeded fixtures vanished.
7. Calls `POST /api/v1/restore` with `dryRun=true`, then `dryRun=false` (`filesystem=true`, `datastores=true`).
8. Re-logs in as admin and re-snapshots state, asserting an exact match against step 3.
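To reproduce the cycle locally, throwaway service containers work. A sketch, assuming the harness reaches Mongo and Valkey on their default localhost ports (image tags and wiring are assumptions):

```bash
# Ephemeral stores, discarded after the drill.
docker run -d --rm --name dr-mongo  -p 27017:27017 mongo:7
docker run -d --rm --name dr-valkey -p 6379:6379  valkey/valkey:8

cd java && ./gradlew :cloud-test-harness:drDrill

docker stop dr-mongo dr-valkey
```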
The job is `@Tag("dr")` and excluded from the default test pass; the `drDrill` task opts in. CI failure on this job is a real DR regression — investigate before merging.
The CI job does not measure wall-clock RTO; that target is still verified by the operator’s manual drill. The job’s value is catching backup schema drift, restore-validator regressions, and post-restore state divergence between releases.
### Manual quarterly drill
Even with nightly CI green, run a real-environment drill at least quarterly. See Disaster Drill for the step-by-step.
Production credibility is the median time-to-restore from your last drill, not the last green backup-create job.
## Common failures

| Symptom | Likely cause | Fix |
|---|---|---|
| `mongorestore` fails with unsupported BSON version | Restoring with a much-older `mongorestore` | Use the binary matching the source Mongo version. |
| Controller starts then exits with `migration failed` | Schema mismatch | Restore into the same controller version that took the backup. |
| Daemons can’t connect: `peer not found in trust store` | CA was not restored | Restore `data/certs/`, restart controller. |
| `coordination.store=unavailable` after restore | Valkey URI changed | Update `controller.yml`, restart. |
| Modules show `LOAD_FAILED` | Bundle file removed but record kept | Reinstall via `prexorctl module install`. |
| First login rejected with `Locked` | Restored login-attempt counters | Wait the lockout window or `prexorctl user unlock <username>`. |
## Next up

- Disaster Drill — walk a real scenario step by step
- HA Setup — failure modes that don’t need a restore
- Configuration Reference — `backup.directory`, `backup.retentionCount`, `scheduler.auditRetentionDays`