
Crash Recovery

A crash is any instance exit that wasn’t an operator-initiated stop. The daemon classifies it, the controller persists a CrashRecord, the Player Journey Bus appends INSTANCE_CRASHED for every affected player, and the scheduler decides whether to restart or quarantine the group. This guide shows the operator-visible side of all four steps and how to act on them.

What you’ll build

flowchart LR
  P["MC process"] -->|exit != 0| D["daemon classify"]
  D --> C["CrashReport gRPC"]
  C --> CR["controller<br/>CrashRecord<br/><sub>Mongo</sub>"]
  CR --> J["Player Journey<br/>INSTANCE_CRASHED"]
  CR --> S["scheduler<br/>restart / pause"]
  CR --> A["webhook-alerts<br/>Discord ping"]

End state: every crash is one Mongo document, one SSE event, one webhook delivery (if configured), one journey entry per affected player, and a deterministic restart-vs-pause decision.

Prerequisites

  • PrexorCloud v1.0+ controller and at least one daemon.
  • A group with ≥1 running instance. Examples below use lobby.
  • Optional: the webhook-alerts module installed for Discord/Slack notifications — see Recipes → Discord Notifications.

1. Inspect a crash

Force a crash for the demonstration (skip if you have a real one):

Terminal window
prexorctl instance stop lobby-1 --force --no-graceful

--no-graceful sends SIGKILL instead of the graceful sequence (the save-all and stop console commands followed by SIGTERM), so the daemon classifies the exit as KILLED rather than CLEAN. List recent crashes:

Terminal window
prexorctl crash list --since "5 min ago"
# CRASH-ID INSTANCE GROUP EXIT CLASS AGE
# crash-A1B2C3D4 lobby-1 lobby 137 KILLED 12s
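
The 137 in the EXIT column follows the POSIX convention of 128 plus the terminating signal number, which is how the daemon can distinguish a SIGKILL from a normal exit. bash can decode any such status:

Terminal window
kill -l 137
# KILL   (137 = 128 + 9, SIGKILL, which matches class KILLED)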

Inspect:

Terminal window
prexorctl crash info crash-A1B2C3D4

You get exit code, classification (CLEAN, KILLED, OOM, STARTUP_FAILURE, RUNTIME_FAILURE), uptime in ms, console tail (default last 200 lines), and the responsible node. The console tail is captured by the daemon’s stdio reader; the full record lives in the crash_records Mongo collection. See Concepts → Cluster Model for how the daemon classifies exits.
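
For scripting or audits you can read the full record straight from Mongo. A minimal sketch: the crash_records collection name comes from above, but the database name prexorcloud and the group field are assumptions, so adjust to your deployment:

Terminal window
# database and field names are assumptions; only crash_records is documented above
mongosh prexorcloud --quiet --eval \
  'db.crash_records.find({ group: "lobby" }).sort({ _id: -1 }).limit(1)'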

2. Read the journey for affected players

Every player who was on the instance at crash time gets an INSTANCE_CRASHED entry on the Player Journey Bus.

Terminal window
prexorctl player journey <player-uuid> --limit 20
# 2026-05-10T12:00:01Z PLAYER_CONNECTED proxy-1
# 2026-05-10T12:00:02Z PLAYER_TRANSFER proxy-1 -> lobby-1
# 2026-05-10T12:15:42Z INSTANCE_CRASHED lobby-1 exit=137 class=KILLED
# 2026-05-10T12:15:43Z PLAYER_TRANSFER lobby-1 -> lobby-2 (fallback)

The proxy plugin walked the fallbackGroups chain on KickedFromServerEvent and sent the player to a healthy lobby. For this to work the group must be behind a Velocity/Bungee proxy with a Network Composition — see Your First Network.

3. Understand the restart decision

The crash-loop detector runs in the controller and counts crashes per group in a rolling window. By default, 3 crashes within 60 seconds pause the group. Tune the thresholds in /etc/prexorcloud/controller.yml:

scheduler:
  crashLoop:
    windowSeconds: 60
    maxCrashes: 3
    backoffSeconds: 30 # delay between auto-restart attempts
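
To see the breaker trip in a sandbox, force three kills inside one window; a sketch that reuses only the force-stop command from step 1 and assumes lobby has three running instances (don't do this in production):

Terminal window
for i in 1 2 3; do
  prexorctl instance stop "lobby-$i" --force --no-graceful   # three KILLED exits within 60 s
  sleep 5
done
prexorctl group info lobby   # STATE should now read PAUSED, reason crash-loop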

When tripped, the controller emits GROUP_CRASH_LOOP and sets the group to paused with reason crash-loop. New placements stop until you resume. Inspect:

Terminal window
prexorctl group info lobby
# STATE PAUSED
# REASON crash-loop (3 crashes in 47s)
prexorctl crash list --group lobby --since "5 min ago"
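
The same events also arrive on the controller's SSE stream mentioned at the top, which is handy for dashboards. A hypothetical sketch in which the endpoint path, port, and auth header are all assumptions; check your controller's API reference:

Terminal window
# endpoint, port, and token are hypothetical; only the SSE stream itself is documented above
curl -N -H "Authorization: Bearer $PREXOR_TOKEN" \
  https://controller.example:8443/events |
  grep --line-buffered -E 'INSTANCE_CRASHED|GROUP_CRASH_LOOP'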

4. Resume the group

Fix the underlying issue (bad template, OOM, port conflict, missing plugin) and resume:

Terminal window
prexorctl group resume lobby
# Group lobby resumed. Scheduler reactivated.

The scheduler immediately re-evaluates desired state and re-places missing instances. If the root cause persists, the loop trips again and the group re-pauses; that’s the safety mechanism doing its job.
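
A simple way to watch the re-placement converge, using only commands shown above:

Terminal window
watch -n 2 'prexorctl group info lobby'   # STATE should leave PAUSED once instances are re-placed
prexorctl crash list --group lobby --since "1 min ago"   # stays empty if the fix held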

How to verify it works

After a real crash, work through this after-incident checklist (a scripted version follows the list):

  • prexorctl crash list --since "<incident start>" shows the crash.
  • prexorctl crash info <id> console tail surfaces the root cause.
  • For affected players: prexorctl player journey <uuid> shows the INSTANCE_CRASHED → PLAYER_TRANSFER redirect.
  • prexorctl group info <group> shows either restored or PAUSED — if paused, address the cause and group resume.
  • If webhook-alerts is installed, the configured webhooks received one POST per instance_crashed and (if tripped) one for crash_loop.
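
The same checklist as a one-shot sweep; a sketch composed only of commands from this guide, with the crash ID and player UUID left for you to substitute:

Terminal window
INCIDENT_START="10 min ago"   # adjust to the incident window
GROUP=lobby
prexorctl crash list --since "$INCIDENT_START"
prexorctl crash info <crash-id>                       # substitute an ID from the list
prexorctl group info "$GROUP"                         # restored, or PAUSED with a reason
# prexorctl player journey <player-uuid> --limit 20   # repeat per affected player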

OOM-specific recovery

OutOfMemoryError shows up as class=OOM in the crash record. Check the heap-dump path (set in the group’s resources.jvmArgs or the template):

templates/lobby/jvm-args.txt
-Xmx2G
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/prexorcloud/heapdumps/
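
When an OOM lands, first confirm the dump was actually written; the JVM names it java_pid<pid>.hprof inside HeapDumpPath, and it is roughly heap-sized:

Terminal window
ls -lh /var/lib/prexorcloud/heapdumps/   # expect java_pid<pid>.hprof, sized near -Xmx

Copy the .hprof off the node and open it in a heap analyzer such as Eclipse MAT to find the dominating allocations before you simply raise the limit.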

If you see repeated OOMs, bump resources.memoryMB on the group and roll a deploy:

Terminal window
prexorctl group update lobby --memory 3072
prexorctl deploy lobby --strategy rolling
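
Raise -Xmx together with the container limit, and keep headroom between the two for off-heap usage (metaspace, direct buffers, thread stacks); a sketch, assuming the template file shown above is the one your deploys pick up:

Terminal window
sed -i 's/-Xmx2G/-Xmx2560M/' templates/lobby/jvm-args.txt   # ~512 MB headroom under the 3072 MB limit
prexorctl group update lobby --memory 3072
prexorctl deploy lobby --strategy rolling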

Common pitfalls

| Symptom | Likely cause |
| --- | --- |
| crash list empty after a visible crash | Daemon couldn't reach the controller. Check prexorctl node list for the daemon's state. |
| Group keeps re-pausing post-fix | Crash-loop counter window not yet expired. Wait windowSeconds, or restart the controller to clear in-process counters. |
| Players see Connection lost instead of fallback | No Network Composition for the group's proxy. Apply one. |
| class=STARTUP_FAILURE repeats | Template is broken. Roll back: prexorctl template rollback <name>. |

Where to go next