# Backup + Restore
A PrexorCloud backup captures four things: durable platform state
(Mongo), coordination state (Valkey), the controller filesystem
(`/etc/prexorcloud/`), and per-module storage. This guide runs you
through `prexorctl backup` for the happy path, `prexorctl restore` for
a recovery, and the nightly DR drill that exercises both end-to-end.
## What you’ll build
```mermaid
flowchart LR
  Ctl["controller"] --> B["prexorctl backup create"]
  B --> M["mongo dump<br/><sub>.gz</sub>"]
  B --> V["valkey RDB"]
  B --> F["etc-prexorcloud.tar.gz"]
  B --> J["manifest.json"]
  J --> S3["off-host store<br/><sub>S3 / borg / restic</sub>"]
  S3 --> R["prexorctl restore <id>"]
```
End state: a daily full backup, an off-host copy, a quarterly verified restore drill, and a CI job that gates merges on the backup→restore loop staying green.
## Prerequisites
- PrexorCloud v1.0+ controller. The CLI talks to the same
  `BackupCreator` the controller uses internally.
- Disk space for the backup directory (the Mongo dump dominates; rule
  of thumb: 1 GiB per 100 instances per month of audit retention).
- An off-host destination (S3 bucket, borg repo, restic repo, or just
  another host’s `/var/backups`). Optional but strongly recommended.
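The rule of thumb converts to a quick capacity estimate with shell arithmetic. The fleet size and retention values below are hypothetical placeholders; plug in your own numbers:

```shell
# Back-of-envelope sizing from the rule of thumb:
# 1 GiB per 100 instances per month of audit retention.
instances=450        # hypothetical fleet size
retention_months=3   # hypothetical audit retention
keep_days=14         # how many daily backups you keep locally

per_backup_gib=$(( instances * retention_months / 100 ))
total_gib=$(( per_backup_gib * keep_days ))
echo "per backup: ~${per_backup_gib} GiB; ${keep_days}-day window: ~${total_gib} GiB"
```

With these placeholder numbers that is roughly 13 GiB per backup and 182 GiB for a 14-day local window, so size the backup volume well above the live data set.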
## 1. Take a backup
The CLI wraps the controller’s backup logic and writes a single tarball plus a manifest:
```shell
prexorctl backup create --description "pre-1.0.1-upgrade"
# -> Backup bk-2026-05-10-001 created
#    files: mongo, valkey, filesystem
#    size: 142 MiB
#    path: /var/backups/prexorcloud/bk-2026-05-10-001/
```

What got captured:
| Tier | Source | Loss impact |
|---|---|---|
| Durable platform | Mongo `mongodump` | Catastrophic — every group, deployment, audit entry, module record. |
| Coordination | Valkey `BGSAVE` snapshot | Tolerable — login lockouts and SSE replay reset; in-flight workflows resume from Mongo intent. |
| Filesystem | tar of `/etc/prexorcloud/` | Catastrophic for config and CA; recoverable for module storage. |
| Per-module storage | Captured inside Mongo + Valkey backups | Module-defined. |
The Mongo dump is produced with `mongodump --gzip --out`; the Valkey
snapshot is the post-`BGSAVE` `dump.rdb`; the filesystem tarball
excludes runtime logs.
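To illustrate the log-excluding tarball in the filesystem tier, here is the general shape of such a tar invocation. The tree layout and exclude pattern are illustrative (the real tarball is produced by the controller, not by hand), and the sketch uses a temp directory so it is safe to run anywhere:

```shell
# Stand-in for /etc/prexorcloud/: a temp tree with config and logs.
src=$(mktemp -d)
mkdir -p "$src/certs" "$src/logs"
echo 'ca material'   > "$src/certs/ca.pem"
echo 'runtime noise' > "$src/logs/controller.log"

# Tar the tree, excluding runtime logs (GNU tar exclusion syntax).
out=$(mktemp -d)/etc-prexorcloud.tar.gz
tar -czf "$out" --exclude='logs/*' -C "$src" .

# List members: the cert survives, the log contents do not.
tar -tzf "$out"
```

Excluding logs keeps the tarball small and deterministic; everything under it that matters for recovery (config, CA material) is captured.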
List backups:
```shell
prexorctl backup list
# ID                 CREATED               SIZE     SCOPE
# bk-2026-05-10-001  2026-05-10T08:00:00Z  142 MiB  mongo,valkey,filesystem
# bk-2026-05-09-001  2026-05-09T08:00:00Z  138 MiB  mongo,valkey,filesystem
```

## 2. Schedule + ship off-host
Run nightly via a systemd timer or cron. Encrypt with `age` and ship to
your off-host store:
```ini
[Service]
Type=oneshot
ExecStart=/usr/local/bin/prexorctl backup create --description nightly
ExecStartPost=/bin/sh -c 'BK=$(prexorctl backup list --json | jq -r ".[0].path"); \
  tar -cf - -C /var/backups/prexorcloud "$(basename $BK)" \
  | age -r age1xxx... > /tmp/$(basename $BK).tar.age; \
  aws s3 cp /tmp/$(basename $BK).tar.age s3://prexor-backups/'
ExecStartPost=/usr/local/bin/prexorctl backup prune --keep-days 14

[Timer]
OnCalendar=daily
Persistent=true
```

Replace `age1xxx…` with your real recipient (`age-keygen | tee ~/.config/age/key.txt`) and the S3 URI with your bucket. The same shape
works with `borg create`, `restic backup`, or `rclone copy`.
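On a mirror host that only holds the shipped copies, `prexorctl backup prune` is not available, but an mtime-based `find` gives the same retention windowing. A sketch against a temp directory (the directory names are stand-ins):

```shell
root=$(mktemp -d)   # stand-in for the mirror's backup directory
mkdir -p "$root/bk-2026-04-20-001" "$root/bk-2026-05-10-001"
touch -d '20 days ago' "$root/bk-2026-04-20-001"   # GNU touch; simulate age

# Delete top-level backup dirs whose mtime is more than 14 days old.
find "$root" -mindepth 1 -maxdepth 1 -type d -mtime +14 -exec rm -rf {} +

ls "$root"   # only the recent backup remains
```

`-mindepth 1 -maxdepth 1` keeps the sweep to whole backup directories, so a partially-old directory is never half-deleted.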
Recommended cadence:
| Frequency | Scope | Retention |
|---|---|---|
| Hourly | Mongo only | 24 hours |
| Daily | Full + off-host ship | 14 days |
| Weekly | Full + off-host ship | 90 days |
| Pre-upgrade | Full | Until next stable upgrade window |
## 3. Verify a backup without restoring
`prexorctl backup verify` validates checksums, parses the Mongo dump
without restoring, and runs a structural restore dry-run:
```shell
prexorctl backup verify bk-2026-05-10-001
# checksums OK
# manifest schema OK
# mongo dump bson OK (12 collections, 142,884 docs)
# valkey rdb OK (preamble version 11)
# filesystem tar OK
```

A backup you’ve never restored is not a backup. Run a real restore drill in a throwaway environment at least quarterly — see step 5.
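Between the built-in verify and a full drill, a cheap middle ground is checksumming the backup files at creation time and re-checking the shipped copy before trusting it. The `SHA256SUMS` file and payload names below are illustrative, not the real PrexorCloud manifest schema:

```shell
bk=$(mktemp -d)/bk-demo   # stand-in for a backup directory
mkdir -p "$bk"
echo 'fake mongo dump' > "$bk/mongo.archive.gz"
echo 'fake rdb'        > "$bk/dump.rdb"

# At creation time: record checksums alongside the payload.
( cd "$bk" && sha256sum mongo.archive.gz dump.rdb > SHA256SUMS )

# Before trusting a copy: re-check. Exits non-zero on any mismatch,
# which makes it easy to gate a restore script on the result.
( cd "$bk" && sha256sum -c SHA256SUMS )
```

This catches silent corruption in transit or at rest, which checksum-free copies (`aws s3 cp`, plain `scp`) will not surface on their own.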
## 4. Restore
Stop every controller talking to the target Mongo + Valkey first:
```shell
sudo systemctl stop prexorcloud-controller   # on each controller host
```

Then restore:

```shell
prexorctl restore bk-2026-05-10-001 --dry-run
# … reports what would be replaced …

prexorctl restore bk-2026-05-10-001 \
  --datastores \
  --filesystem
# RestoreExecutor: drop+restore mongo... OK
# RestoreExecutor: replace valkey RDB... OK
# RestoreExecutor: untar /etc/prexorcloud... OK
# Run `systemctl start prexorcloud-controller` to bring the cluster back.
```

`--datastores` covers Mongo and Valkey; `--filesystem` covers
`/etc/prexorcloud/`. Use `--filesystem` alone for a config-only
recovery, `--datastores` alone if your config is intact.
Bring the controller back:
```shell
sudo systemctl start prexorcloud-controller
sudo journalctl -u prexorcloud-controller -f
```

Watch for `migration applied:` (normal for older backups into the same
controller version), `coordination.store=available`, and
`state.store=available`. Daemons reconnect automatically; if their certs
predate the restored CA, re-issue with `prexorctl token create --description rejoin --ttl 1h` and `prexorctl setup --rejoin`.
## How to verify it works
After every restore drill, confirm:
- `prexorctl status` lists every controller and node as healthy.
- `prexorctl group list` shows the same groups as before, all `currentRevision == desiredRevision`.
- `prexorctl module list` shows installed modules in `ACTIVE`.
- `prexorctl crash list --since "1 day ago"` matches the backup era.
- A smoke deploy on a non-prod group succeeds.
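To make the checklist machine-enforceable, one pattern is a small pass/fail wrapper that greps each command's output. The `check` helper and the `echo` stand-ins below are illustrative; substitute the real `prexorctl` invocations from the list above:

```shell
fails=0
check () {   # usage: check "description" "expected substring" cmd args...
  desc=$1; want=$2; shift 2
  if "$@" | grep -q "$want"; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc"; fails=$((fails + 1))
  fi
}

# Stand-ins; in a real drill these would be the prexorctl commands above.
check "controllers healthy" "healthy" echo "controller-1 healthy"
check "modules ACTIVE"      "ACTIVE"  echo "mod-a ACTIVE"

echo "failures: $fails"
```

Exiting with `$fails` at the end (in a real script) lets cron, CI, or your alerting hook treat any failed check as a failed drill.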
## 5. The nightly DR drill
The repo’s `.github/workflows/nightly.yml` runs a `dr-drill` job that
spins up ephemeral Mongo + Valkey containers, takes a real backup,
restores into a second pair of containers, and verifies cluster state
matches. If a drill fails, the workflow opens an issue automatically.
The harness lives under `cloud-test-harness:drDrill`.
You can run the same drill locally:
```shell
cd java
./gradlew :cloud-test-harness:drDrill
```

This is the single best signal that your backup/restore pipeline is healthy without needing a real disaster.
## Common pitfalls
| Symptom | Likely cause |
|---|---|
| `mongorestore` fails with unsupported BSON version | The local `mongorestore` is older than the source Mongo. Match versions. |
| Controller starts then logs `migration failed` | Restoring across a major-version gap. Restore into the same controller version that took the backup. |
| Daemons can’t connect: `peer not found in trust store` | CA was not in the restored backup. Restore `data/certs/`. |
| First login rejected with `Locked` | Restored login-attempt counters. Wait out the lockout or `prexorctl user unlock <username>`. |
| Module shows `LOAD_FAILED` after restore | Module jar was outside the backup. Reinstall: `prexorctl module install <bundle>`. |
## Where to go next
- Operations → Backups & DR — the full operator runbook with manual
  `mongodump`/`mongorestore` fallbacks.
- Operations → Disaster Drill — what the nightly drill actually tests, and how to interpret a failure.
- Guides → HA Controller (Redis) — the HA shape changes restore semantics; read this if you run multiple controllers.