Backups and Disaster Recovery

A backup you have never restored is not a backup. This page defines what PrexorCloud considers a backup, what it captures, the RPO/RTO targets the project commits to, and the manual + automated drills that verify those targets still hold.

What you’ll learn

  • What is in a PrexorCloud backup and what isn’t
  • The recommended backup cadence and retention shape
  • The restore procedure, including selective and dry-run modes
  • The RPO/RTO contract per data tier and the drill that exercises it

Recovery targets

| Tier | Source | RPO | RTO | What “recovered” means |
| --- | --- | --- | --- | --- |
| 1 — Durable state | MongoDB | ≤ 1h | ≤ 30 min | Controller boots, dashboard logs in, every pre-incident group / template / deployment / audit row is back, daemons reconnect with their existing certs. |
| 2 — Coordination | Valkey | best-effort | ≤ 5 min | Empty Valkey is acceptable. Controller rebuilds leases on first reconciliation; JWT revocations are forgiven; SSE clients reconnect with replay-from-zero. |
| 3 — Filesystem | config/, data/certs/, modules/ | ≤ 24h | ≤ 30 min | controller.yml, the CA, and module data are recoverable. The CA private key is the only irreplaceable material — if lost, every daemon must rejoin from scratch. |
| 4 — Daemon hosts | daemon.yml + per-daemon mTLS | n/a | ≤ 15 min/node | Daemon restored; reconnects and reconciles instances from the controller. |

RPO is measured as time-since-last-successful-backup. The targets above are predicated on a backup cadence of at least hourly Mongo, daily filesystem, optional Valkey. If your cadence is slower, your RPO is whatever your cadence is — PrexorCloud does not run the cron for you.
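
A quick way to check your actual RPO is the age of the newest artefact on disk. A minimal sketch, assuming GNU coreutils and an install root of /var/lib/prexorcloud (substitute yours; the output path is described under “Take a backup” below):

```sh
# Worst-case RPO right now = seconds since the newest backup directory was written.
# Assumes GNU stat and /var/lib/prexorcloud as the install root (adjust to yours);
# pair with `prexorctl backup verify` — a directory on disk is not proof of success.
newest=$(ls -td /var/lib/prexorcloud/backups/*/ | head -n 1)
echo "RPO: $(( $(date +%s) - $(stat -c %Y "$newest") ))s since last backup"
```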

What is in a backup

prexorctl backup create produces a single tarball that captures:

| Tier | Source | Loss impact |
| --- | --- | --- |
| Durable platform | MongoDB (full dump) | Catastrophic — every group, deployment, audit entry, module record. |
| Coordination | Valkey RDB snapshot (optional) | Tolerable — login lockouts and SSE replay reset, JWT revocations forgiven. |
| Filesystem | config/, data/certs/, modules/ | Catastrophic for config and CA; recoverable for module storage. |
| Per-module storage | Mongo (mod_* collections) and Valkey (prexor:v1:platform:<id>:*) | Module-defined. |

Each manifest records createdAt, controllerVersion, host, the scopes captured, and the file references. The on-disk artefact is the source of truth; the backups Mongo collection is a searchable index.

What is not in a backup

  • Live state of running MC instances. Per-world / per-player game data is the operator’s responsibility; most server jars include hot snapshots.
  • DNS, load balancer, and reverse proxy config in front of the controller.
  • Operator dashboard browser sessions (cookies; transient anyway).

Recommended cadence

| Frequency | Scope | Retention |
| --- | --- | --- |
| Hourly | Mongo only | 24 hours |
| Daily | Full (Mongo + Valkey + filesystem) | 14 days |
| Weekly | Full + off-host ship | 90 days |
| Pre-upgrade | Full | Until next stable upgrade window |

The audit log lives in Mongo, so the hourly cadence keeps audit-trail loss bounded.
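
This cadence maps directly onto cron. A sketch, assuming prexorctl is on root’s PATH; the weekly off-host ship script (ship-backup.sh here) is hypothetical and yours to write:

```sh
# /etc/cron.d/prexorcloud-backups (sketch only; adjust user, paths, and times).
# Hourly: Mongo only (keeps audit-trail loss bounded).
0 * * * *   root  prexorctl backup create --scope mongo
# Daily: full backup, then prune to the 14-day retention.
15 2 * * *  root  prexorctl backup create && prexorctl backup prune --keep 14
# Weekly (Sunday): labelled full backup, then ship off-host (hypothetical script).
30 3 * * 0  root  prexorctl backup create --label weekly && /usr/local/sbin/ship-backup.sh
```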

Take a backup

```sh
# Full backup, default scope (Mongo + Valkey + filesystem).
prexorctl backup create
# Mongo only — fastest path; suitable for the hourly cadence.
prexorctl backup create --scope mongo
# Pre-upgrade snapshot with a label.
prexorctl backup create --label "pre-v1.4-upgrade"
# List manifests.
prexorctl backup list
# Verify integrity (checksums + structural restore-dry-run).
prexorctl backup verify <manifest-id>
# Prune older than retention.
prexorctl backup prune --keep 14
```

The CLI wraps BackupCreator and BackupCatalog — the same code the controller uses internally. Output goes to <install-root>/backups/<id>/.

Manual fallback

When the CLI is unavailable, the manual procedure lives in the backup runbook. TL;DR:

```sh
BK=/var/backups/prexorcloud/$(date -u +%Y-%m-%dT%H%M%SZ)
sudo mkdir -p "$BK"
mongodump --uri "$MONGO_URI" --gzip --out "$BK/mongo"
# BGSAVE is asynchronous: wait for LASTSAVE to advance before copying the RDB.
LAST=$(redis-cli -u "$REDIS_URI" LASTSAVE)
redis-cli -u "$REDIS_URI" BGSAVE
while [ "$(redis-cli -u "$REDIS_URI" LASTSAVE)" = "$LAST" ]; do sleep 1; done
sudo cp /var/lib/valkey/dump.rdb "$BK/valkey-dump.rdb"
sudo tar -czf "$BK/etc-prexorcloud.tar.gz" -C /etc prexorcloud
```

Write a manifest.json matching the CLI’s schema and ship the directory off-host (encrypted with age or gpg).
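
The CLI’s schema is authoritative; purely as an illustration, a hand-written manifest might look like the sketch below. Only createdAt, controllerVersion, and host are documented above; the scopes and files field names are assumptions:

```sh
# Illustrative manifest shape only; match the real schema from the runbook.
cat > "$BK/manifest.json" <<'EOF'
{
  "createdAt": "2025-01-01T02:00:00Z",
  "controllerVersion": "1.4.0",
  "host": "controller-1",
  "scopes": ["mongo", "valkey", "filesystem"],
  "files": ["mongo/", "valkey-dump.rdb", "etc-prexorcloud.tar.gz"]
}
EOF
```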

Off-host shipping

A backup that lives only on the controller is one disk failure from useless.

```sh
# Encrypt then ship.
sudo tar -cf - -C /var/backups/prexorcloud "$BK_NAME" \
  | age -r age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx... \
  > "$BK_NAME.tar.age"
aws s3 cp "$BK_NAME.tar.age" s3://your-backups/prexorcloud/
```

Use whatever encrypted off-host store fits your workflow — S3, restic, borg, Backblaze B2.
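
With restic, for example, client-side encryption and retention come built in. A sketch, assuming an already-initialised S3 repository and RESTIC_PASSWORD plus AWS credentials in the environment:

```sh
# Back up the artefact directory to an existing restic repository (restic init first).
restic -r s3:s3.amazonaws.com/your-backups/prexorcloud backup /var/backups/prexorcloud
# Retention roughly matching the cadence table (14 dailies, ~90 days of weeklies).
restic -r s3:s3.amazonaws.com/your-backups/prexorcloud forget --keep-daily 14 --keep-weekly 13 --prune
```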

Restore

Restore replays a backup taken with the procedure above. The flow is the same whether you are recovering from a corrupt Mongo, a lost host, or a botched upgrade.

Decision tree

| Scenario | Restore? |
| --- | --- |
| Single controller died, HA peer healthy | No — fail over. See the recover-controller runbook. |
| Mongo corrupted / dropped | Yes, Mongo + filesystem. |
| Valkey emptied / lost | Usually no; controllers rebuild. |
| Bad upgrade rolled back the binary | Maybe — only if the failed upgrade ran a Mongo migration. Release notes call this out. |
| Bad config push | Filesystem only (config/controller.yml). |
| Module accidentally uninstalled | Selective Mongo restore of module_packages and mod_<moduleId>_* (see the sketch below). |
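
For the module-uninstall row, mongorestore’s namespace filters can replay just those collections from a dump. A sketch, assuming the database is named prexorcloud (check your MONGO_URI) and a module id of myMod:

```sh
# Selective restore of one module's records; everything else in the dump is skipped.
mongorestore --uri "$MONGO_URI" --gzip --dir "$BK/mongo" \
  --nsInclude 'prexorcloud.module_packages' \
  --nsInclude 'prexorcloud.mod_myMod_*'
```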

When in doubt, restore the full backup to a staging controller first and validate before pointing production at it.

Run a restore

```sh
# Always dry-run first.
prexorctl restore /var/backups/prexorcloud/<id>/manifest.json --dry-run
# Apply.
prexorctl restore /var/backups/prexorcloud/<id>/manifest.json \
  --filesystem --datastores
```

--dry-run runs the restore validator and reports scope conflicts without mutating anything. The same validator runs before any real apply as well; dry-run mode simply stops after validation.

Validation checklist

After restore, before declaring it good:

  • /api/v1/system/ready returns 200.
  • prexorctl status shows expected controllers, nodes, groups.
  • prexorctl group list shows expected groups.
  • prexorctl module list shows installed modules in ACTIVE.
  • Spot-check the audit log: db.audit_log.find().sort({createdAt:-1}).limit(20).
  • All daemons reconnect — prexorctl node list shows READY.
  • Smoke-test a deploy on a non-prod group.
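
Most of the checklist scripts cleanly; the listing below is a sketch that assumes the controller answers on localhost:8080 (adjust host and port). The operator still eyeballs each listing against the expected inventory, and the deploy smoke-test stays manual:

```sh
#!/usr/bin/env sh
# Post-restore validation sketch: exits non-zero at the first hard failure.
set -e
curl -fsS http://localhost:8080/api/v1/system/ready > /dev/null
prexorctl status        # expected controllers, nodes, groups
prexorctl group list    # expected groups
prexorctl module list   # modules in ACTIVE
prexorctl node list     # daemons READY
echo "basic checks passed; now spot-check the audit log and smoke-test a deploy"
```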

If daemon mTLS material was not in the restored backup, daemons fail at the TLS layer. Re-issue:

```sh
prexorctl token create --description "rejoin-after-restore" --ttl 1h
# On each daemon:
sudo prexorctl setup --role daemon --rejoin --join-token <token>
```

The DR drill

The drill exists because backup tooling that is never restored isn’t backup tooling.

Nightly automated drill

The dr-drill job in .github/workflows/nightly.yml runs the end-to-end cycle every night against ephemeral Mongo + Valkey service containers:

```sh
cd java
./gradlew :cloud-test-harness:drDrill
```

The harness:

  1. Boots an in-process controller with a real Mongo + Valkey.
  2. Seeds a deterministic fixture (one template, two groups with distinct platform / scaling / priority shapes).
  3. Snapshots declarative state.
  4. Calls POST /api/v1/backups; verifies the manifest with POST /api/v1/backups/{id}/verify.
  5. Stops the controller, drops the Mongo database, flushes the Valkey logical DB.
  6. Brings the controller back; asserts the seeded fixtures vanished.
  7. Calls POST /api/v1/restore with dryRun=true, then dryRun=false (filesystem=true, datastores=true).
  8. Logs back in as admin, re-snapshots declarative state, and asserts an exact match against step 3.

The drill tests are tagged @Tag("dr") and excluded from the default test pass; the drDrill task opts them in. A CI failure on this job is a real DR regression — investigate before merging.

The CI job does not measure wall-clock RTO; that target is still verified by the operator’s manual drill. The job’s value is catching backup schema drift, restore-validator regressions, and post-restore state divergence between releases.
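
The same backup → verify → restore cycle can be replayed by hand against a staging controller. A curl sketch, assuming bearer-token auth, port 8443, and that the create call returns the backup id in an id field (payload and response shapes are illustrative):

```sh
# Manual replay of the drill's API cycle (illustrative shapes; requires jq).
TOKEN=...   # operator JWT
BASE=https://staging-controller:8443/api/v1
id=$(curl -fsS -X POST -H "Authorization: Bearer $TOKEN" "$BASE/backups" | jq -r '.id')
curl -fsS -X POST -H "Authorization: Bearer $TOKEN" "$BASE/backups/$id/verify"
curl -fsS -X POST -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' \
  -d '{"dryRun": true, "filesystem": true, "datastores": true}' "$BASE/restore"
# Repeat the last call with "dryRun": false to apply.
```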

Manual quarterly drill

Even with nightly CI green, run a real-environment drill at least quarterly. See Disaster Drill for the step-by-step.

Production credibility is the median time-to-restore from your last drill, not the last green backup-create job.
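
During the drill, capture that number explicitly rather than estimating it afterwards. A sketch, with the manifest path and log file as placeholders:

```sh
# Time the restore step of the drill; $MANIFEST points at the backup under test.
start=$(date +%s)
prexorctl restore "$MANIFEST" --filesystem --datastores
echo "restore wall-clock: $(( $(date +%s) - start ))s" | tee -a drill-times.log
```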

Common failures

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| mongorestore fails with unsupported BSON version | Restoring with a much older mongorestore | Use the binary matching the source Mongo version. |
| Controller starts then exits with migration failed | Schema mismatch | Restore into the same controller version that took the backup. |
| Daemons can’t connect: peer not found in trust store | CA was not restored | Restore data/certs/, restart the controller. |
| coordination.store=unavailable after restore | Valkey URI changed | Update controller.yml, restart. |
| Modules show LOAD_FAILED | Bundle file removed but record kept | Reinstall via prexorctl module install. |
| First login rejected with Locked | Restored login-attempt counters | Wait out the lockout window or run prexorctl user unlock <username>. |
