Production Checklist
A green install isn’t a production install. This is the list to walk before you put PrexorCloud in front of paying users — covering auth, network exposure, signing, backups, alerts, and the boring details that matter at 03:14.
What you’ll learn
- The exact knobs that flip a fresh install from “demo” to “production”
- Sizing rules of thumb for controllers, daemons, MongoDB, and Valkey
- The minimum alert set that pages when something is actually broken
- The bootstrap-secret hygiene every new operator forgets once
1. Lock down the bootstrap
Within five minutes of first login:
```bash
# Change the admin password.
prexorctl user set-password admin

# Shred the bootstrap file.
sudo shred -u /etc/prexorcloud/config/.initial-admin-password

# Confirm it is gone.
sudo ls -la /etc/prexorcloud/config/ | grep -i initial
```

If that file ever made it into a backup, treat the password as compromised and rotate every operator credential.
2. Set the runtime profile to production
```yaml
runtime:
  profile: production
```

The `production` profile requires Valkey/Redis at startup. The wiring graph swaps in real coordination accessors for leases, JWT revocation, login lockouts, SSE replay, and rate limits. Without this flag the controller runs single-writer with in-memory equivalents — never deploy that profile to production.
3. Restrict network exposure
The controller listens on 0.0.0.0 by default. Pin it down:
```yaml
network:
  allowedSubnets:
    - "10.0.0.0/8"    # operator VPN
    - "10.42.0.0/16"  # daemon subnet

http:
  cors:
    allowedOrigins:
      - "https://dash.example.com"
```

Put the REST + dashboard behind a TLS-terminating reverse proxy. PrexorCloud serves plain HTTP — TLS termination is the proxy's job. If you do that, configure trusted proxy CIDRs so `allowedSubnets` evaluates the real client IP and not the proxy's.

Daemons connect over gRPC on :9090; that port should be reachable from daemon hosts only.
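As one possible shape for the TLS-terminating proxy described above, a minimal nginx sketch. The certificate paths, the controller's local REST port (8080, taken from the Prometheus scrape targets later in this page), and the SSE-related tuning are assumptions — adapt them to your proxy of choice:

```nginx
# Hypothetical nginx server block: TLS termination for the REST API + dashboard.
server {
    listen 443 ssl;
    server_name dash.example.com;

    ssl_certificate     /etc/nginx/tls/dash.example.com.crt;   # assumed path
    ssl_certificate_key /etc/nginx/tls/dash.example.com.key;   # assumed path

    location / {
        proxy_pass http://127.0.0.1:8080;   # controller serves plain HTTP locally
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;

        # Keep long-lived SSE streams flowing instead of buffering them.
        proxy_buffering off;
        proxy_read_timeout 3600s;
    }
}
```

Whatever proxy you use, make sure its forwarded-client-IP header matches what you configure in the trusted proxy CIDRs.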
4. Generate a real JWT signing secret
```bash
openssl rand -base64 48
```

Paste it into `controller.yml`:

```yaml
security:
  jwtSecret: "<generated>"
  jwtExpirationMinutes: 1440
  jwtPreviousSecrets: []   # populated only during rotation
```

Auto-generated secrets work for first boot, but rotate to a managed value before going live. See Rotate Secrets.
5. Turn lockout on (it’s on by default — confirm)
```yaml
security:
  lockout:
    enabled: true
    maxAttempts: 5
    windowSeconds: 900
    lockoutSeconds: 900
```

Production-profile lockout state lives in Valkey and is shared across controllers — one IP can't cycle controllers to bypass the lockout.
6. Enforce module signatures
If you plan to install third-party modules:
```yaml
modules:
  signing:
    required: true
    mode: COSIGN_BUNDLE
    trustRoot: "/etc/prexorcloud/config/security/module-trust.pem"
    rekor:
      policy: REQUIRE_SET   # offline Rekor SET enforcement
      publicKey: "/etc/prexorcloud/config/security/rekor.pub"
```

`required: true` makes module install fail closed when a signature cannot be verified. `REQUIRE_SET` binds the signature to a Rekor log entry without contacting Rekor at install time. See Cosign Pipeline.
7. Configure backups before deploying real workloads
Schedule the CLI backup as a cron / systemd timer on the controller host:
```bash
prexorctl backup create --scope mongo

# /etc/cron.daily/prexorcloud-backup-full
prexorctl backup create --scope full
```

Recommended baseline:
| Frequency | Scope | Retention |
|---|---|---|
| Hourly | Mongo only | 24h |
| Daily | Full (Mongo + Valkey + filesystem) | 14 days |
| Weekly | Full + off-host ship | 90 days |
| Pre-upgrade | Full | Until next stable upgrade window |
Ship the dailies off-host. A backup that lives only on the controller is one disk failure from useless.
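If you prefer systemd timers over cron for the daily full backup, a minimal sketch — the unit names, the `prexorctl` install path, and the randomized delay are assumptions, not shipped units:

```ini
# /etc/systemd/system/prexorcloud-backup-full.service  (hypothetical unit)
[Unit]
Description=PrexorCloud daily full backup

[Service]
Type=oneshot
ExecStart=/usr/local/bin/prexorctl backup create --scope full

# /etc/systemd/system/prexorcloud-backup-full.timer  (hypothetical unit)
[Unit]
Description=Run the PrexorCloud full backup daily

[Timer]
OnCalendar=daily
RandomizedDelaySec=15m
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with `systemctl enable --now prexorcloud-backup-full.timer`; `Persistent=true` catches up on a run missed while the host was down.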
8. Wire Prometheus + alerts
Scrape /metrics:
```yaml
scrape_configs:
  - job_name: prexorcloud
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ['controller-1:8080', 'controller-2:8080']
```

Install at minimum these alerts (full set in Monitoring):

```yaml
groups:
  - name: prexorcloud-must-page
    rules:
      - alert: PrexorCloudControllerDown
        expr: up{job="prexorcloud"} == 0
        for: 2m
        labels: { severity: critical }
      - alert: PrexorCloudCrashLoop
        expr: increase(prexorcloud_crash_loops_total[1h]) > 0
        labels: { severity: critical }
      - alert: PrexorCloudSchedulerLag
        expr: prexorcloud_scheduler_last_tick_lag_ms > 30000
        for: 2m
        labels: { severity: warning }
      - alert: PrexorCloudHttpErrorBudget
        expr: |
          sum(rate(prexorcloud_http_requests_total{status_class="5xx"}[5m]))
          /
          sum(rate(prexorcloud_http_requests_total[5m])) > 0.05
        for: 5m
        labels: { severity: critical }
```

PrexorCloud does not ship a Grafana dashboard pack — the metrics are stable and labelled; build the panels you need.
9. Size the boxes
Rule-of-thumb starting points; trend with your own load.
| Resource | Headroom |
|---|---|
| Controller CPU | 1 vCPU per ~500 instances + 1 vCPU per 10k SSE clients |
| Controller heap | 1 GiB baseline + 1 MiB per active instance + module overhead |
| Daemon CPU | Driven by hosted MC heaps; daemon overhead negligible |
| Daemon memory | Σ MC instance heaps + ~256 MiB daemon overhead |
| MongoDB | ~1 GiB per 100 instances per month of audit retention |
| Valkey | ~50 MiB per 1000 instances; SSE replay buffer dominates |
Set JVM `-Xmx` slightly below the cgroup / container limit so the kernel OOM killer doesn't reach for the controller before the JVM can fail with a clean OutOfMemoryError.
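To turn the rules of thumb above into concrete numbers, a back-of-envelope sketch. The example workload (2,000 instances, 20,000 SSE clients, one month of audit retention) is an assumption chosen for illustration, not a recommendation:

```python
import math

def controller_vcpus(instances: int, sse_clients: int) -> int:
    """1 vCPU per ~500 instances plus 1 vCPU per 10k SSE clients."""
    return math.ceil(instances / 500) + math.ceil(sse_clients / 10_000)

def controller_heap_mib(instances: int, module_overhead_mib: int = 0) -> int:
    """1 GiB baseline + 1 MiB per active instance + module overhead."""
    return 1024 + instances + module_overhead_mib

def mongo_gib(instances: int, retention_months: int) -> float:
    """~1 GiB per 100 instances per month of audit retention."""
    return instances / 100 * retention_months

def valkey_mib(instances: int) -> float:
    """~50 MiB per 1000 instances; SSE replay buffer dominates."""
    return instances / 1000 * 50

# Example: 2,000 instances, 20,000 SSE clients, 1 month of audit retention.
print(controller_vcpus(2_000, 20_000))   # -> 6 vCPUs
print(controller_heap_mib(2_000))        # -> 3024 MiB (~3 GiB heap)
print(mongo_gib(2_000, 1))               # -> 20.0 GiB
print(valkey_mib(2_000))                 # -> 100.0 MiB
```

Treat the output as a starting allocation; trend real usage for a few weeks before locking in sizes.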
10. Ulimits and host hygiene
For the daemon process specifically (it spawns many MC JVMs):
```
prexorcloud soft nofile 131072
prexorcloud hard nofile 131072
prexorcloud soft nproc  unlimited
prexorcloud hard nproc  unlimited
```

The systemd reference units in `deploy/systemd/` already set `LimitNOFILE=131072` and `TasksMax=infinity` on the daemon. Match those limits if you're rolling your own units.
Disable swap on Mongo and Valkey hosts. Confirm chronyd or
systemd-timesyncd is enabled on every controller — fencing tokens
tolerate skew but lease expiry timing is real.
11. Plan controller HA
Single-controller is fine for staging. Production deserves at least two controllers sharing one MongoDB and one Valkey. The HA model is active-active with lease-scoped work, not active-passive.
```bash
# On controller-2:
sudo prexorctl setup --role controller \
  --mongo-uri "$EXISTING_MONGO_URI" \
  --redis-uri "$EXISTING_VALKEY_URI" \
  --bootstrap=false
```

`--bootstrap=false` skips admin-user creation and CA generation; the new controller reads the existing CA from Mongo. See HA Setup.
12. Run a DR drill before launch
```bash
cd java
./gradlew :cloud-test-harness:drDrill
```

The harness boots a real controller against an ephemeral Mongo + Valkey, takes a backup, wipes both stores, restores, and asserts that state matches. Run it once before launch to confirm your local environment isn't missing anything. The same job runs nightly in CI (`.github/workflows/nightly.yml :: dr-drill`); a CI failure on that job is a real DR regression.
A real-environment quarterly drill remains on top of CI — see Disaster Drill.
13. Verify cosign before extracting any release
Every release tag publishes:
- `prexorctl` archives, signed via cosign keyless on `release.yml`
- GHCR images for controller / daemon / dashboard, signed via cosign keyless on `release-images.yml`

Verify before you run:

```bash
cosign verify-blob \
  --certificate-identity-regexp "^https://github.com/prexorjustin/prexorcloud/.github/workflows/release.yml@refs/tags/" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  --signature checksums.txt.sig \
  --certificate checksums.txt.pem \
  checksums.txt

sha256sum -c checksums.txt
```

For images:

```bash
cosign verify \
  --certificate-identity-regexp "^https://github.com/prexorjustin/prexorcloud/.github/workflows/release-images.yml@refs/tags/" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  ghcr.io/prexorjustin/prexorcloud-controller:<semver>
```

Both must succeed. See Cosign Pipeline.
14. Final pre-launch checklist
- `runtime.profile=production`
- `network.allowedSubnets` restricted to operator + daemon CIDRs
- Reverse proxy in front of REST/dashboard with `http.trustedProxyCidrs` matching
- `security.jwtSecret` set to a managed value (not auto-generated)
- `security.lockout.enabled=true`
- `modules.signing.required=true` with a configured `trustRoot`
- Bootstrap admin password rotated; bootstrap file shredded
- At least two controllers configured against the same Mongo + Valkey
- MongoDB and Valkey behind their own auth / TLS, not shared with other services
- Backup cron in place; off-host shipping configured
- Prometheus scraping `/metrics`; pager set up for the SEV-1 alerts
- DR drill green locally; CI nightly green for the last 7 days
- systemd unit hardened (`ProtectSystem=strict`, `NoNewPrivileges`, scoped `ReadWritePaths`)
- Cosign verification documented in the runbook for every operator on rotation
Next up
- Configuration Reference — every key, every default
- Monitoring — Prometheus + alert rules
- HA Setup — multi-controller deployment
- Disaster Drill — walk a real scenario