Scheduling + Auto-Scaling
The scheduler is the scariest subsystem in the controller and the one operators look at most. It decides where instances run, when scaling fires, when crashes pause a group, when drains complete. This page is the mental model — what fires when, what gates each decision, and how declarative overlays change the inputs without changing the engine.
What you’ll learn
- How the scheduler is structured: per-group leases, a single tick loop, topological group ordering.
- The placement algorithm: weighted node scoring with affinity, anti-affinity, capacity, and spread.
- The three scaling modes (`STATIC`, `DYNAMIC`, `MANUAL`) and the cooldowns that prevent flapping.
- How Event Choreography overlays time-bound config changes onto groups.
- What crash-loop detection does, and how drains work.
How the scheduler ticks
The scheduler runs a periodic evaluation loop. Each tick:
- Acquire the per-group lease for every group eligible to evaluate. Groups whose leases are held by another controller are skipped this tick (active-active gating — see Cluster Model).
- Topologically sort the leased groups by `dependsOn` (Kahn’s algorithm). Cyclic dependencies are rejected at config-load time.
- For each group, walk the evaluation pipeline:
  - Compute desired state (apply Choreography overlays, scaling rules, maintenance flags).
  - Compare with observed state (`ClusterState`).
  - For each missing instance, plan and dispatch a start.
  - For each excess instance, plan and dispatch a stop.
- Release leases.
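The dependency ordering in the second step is standard Kahn’s algorithm. A minimal sketch (the group-map shape and function name are illustrative, not the controller’s actual types):

```python
from collections import deque


def topo_sort(groups: dict) -> list:
    """Order groups so every group comes after its dependsOn targets.

    groups maps group name -> list of group names it depends on.
    Raises ValueError on a cycle (mirroring config-load rejection).
    """
    # indegree = number of unsatisfied dependencies per group
    indegree = {g: len(deps) for g, deps in groups.items()}
    # reverse edges: dependency -> groups that depend on it
    dependents = {g: [] for g in groups}
    for g, deps in groups.items():
        for d in deps:
            dependents[d].append(g)

    ready = deque(sorted(g for g, n in indegree.items() if n == 0))
    order = []
    while ready:
        g = ready.popleft()
        order.append(g)
        for dep in dependents[g]:
            indegree[dep] -= 1
            if indegree[dep] == 0:
                ready.append(dep)
    if len(order) != len(groups):
        raise ValueError("cyclic dependsOn detected")
    return order
```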
The tick interval is configurable (`scheduler.tickIntervalSeconds`, default 5). `prexorcloud_scheduler_last_tick_lag_ms` and `prexorcloud_scheduler_tick_duration` are the two metrics that matter for diagnosing scheduler health — see Observability.
```mermaid
flowchart LR
    Tick[Tick fires] --> Lease[Acquire group leases]
    Lease --> Topo[Topological sort]
    Topo --> Plan[Plan desired state]
    Plan --> Diff[Diff vs observed]
    Diff --> Up{Missing instances?}
    Up -- yes --> Place[Pick node + place]
    Up -- no --> Down{Excess instances?}
    Place --> Down
    Down -- yes --> Stop[Stop excess]
    Down -- no --> Done[Release leases]
    Stop --> Done
```

Placement: the weighted node selector
When the scheduler decides an instance is missing, it asks
WeightedNodeSelector to pick a node. The selector scores every
candidate node and returns the highest-scoring one.
The score
The score for a candidate node combines four signals:
| Signal | Direction | What it captures |
|---|---|---|
| Affinity match | required | Group affinity: { region: eu-west } requires the node to carry the matching label |
| Anti-affinity match | excluding | Group antiAffinity: { gpu: true } excludes nodes carrying the label |
| Capacity headroom | maximizing | Free CPU, free memory, free instance slots |
| Anti-spread | maximizing | Avoid stacking too many of the same group on one node |
Affinity and anti-affinity are filters, not scores — they determine eligibility. A node that fails an affinity match cannot be picked at all. Anti-spread and headroom are weighted scores within the eligible set.
The default weighting is biased toward spread: running a group’s two instances on two nodes means a single node failure costs one instance, not both. Operators can tune the weights via `scheduler.placement.weights`.
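The filter-then-score flow can be sketched as follows. The node and group dict shapes, the weight values, and the scoring formulas are all illustrative assumptions, not the controller’s real implementation:

```python
def pick_node(nodes, group, weights=None):
    """Return the highest-scoring eligible node, or None if none qualify.

    Affinity and anti-affinity act as hard filters; headroom and
    anti-spread are weighted scores within the eligible set.
    """
    weights = weights or {"headroom": 0.4, "spread": 0.6}  # spread-biased default

    def eligible(node):
        labels = node["labels"]
        # affinity: every required label must match exactly
        if any(labels.get(k) != v for k, v in group.get("affinity", {}).items()):
            return False
        # anti-affinity: any matching label excludes the node outright
        if any(labels.get(k) == v for k, v in group.get("antiAffinity", {}).items()):
            return False
        return True

    def score(node):
        # headroom: the scarcest free resource bounds the score
        headroom = min(node["freeCpu"], node["freeMem"], node["freeSlots"] / 10)
        # anti-spread: fewer instances of this group already here is better
        spread = 1.0 / (1 + node["groupInstances"].get(group["name"], 0))
        return weights["headroom"] * headroom + weights["spread"] * spread

    candidates = [n for n in nodes if eligible(n)]
    return max(candidates, key=score, default=None)
```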
Port allocation
Within the chosen node, the daemon scans the group’s port range
(ports.range: 25600-25699) sequentially for the first unused port. If
the range is exhausted, the placement fails with a clear error and the
scheduler will retry on the next tick (likely picking a different node).
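A sketch of that sequential scan, assuming the range string format from the example above (the function name and error type are illustrative):

```python
def allocate_port(port_range, in_use):
    """Scan a 'lo-hi' port range for the first port not in use.

    Raises RuntimeError when the range is exhausted; the scheduler
    then retries on the next tick, likely on a different node.
    """
    lo, hi = (int(p) for p in port_range.split("-"))
    for port in range(lo, hi + 1):
        if port not in in_use:
            return port
    raise RuntimeError("port range %s exhausted" % port_range)
```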
Failure handling
A placement attempt can fail at three points:
| Failure | Behaviour |
|---|---|
| No eligible nodes | Instance stays `SCHEDULED`; controller retries next tick. Emits an `INSTANCE_UNPLACED` event. |
| Port allocation failed on chosen node | Same — retry next tick. |
| Daemon never acks the start within `scheduler.startTimeoutSeconds` | Plan retried on the next tick (idempotent — same `planHash`). After `scheduler.startMaxAttempts`, instance fails with a `START_FAILED` event. |
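The idempotent retry works because a re-dispatched plan carries the same hash, so the daemon can deduplicate it. A minimal sketch of a stable plan hash (the plan shape and helper name are assumptions):

```python
import hashlib
import json


def plan_hash(plan):
    """Stable digest of a start plan: key order must not change the hash,
    so the same logical plan re-dispatched on a later tick dedupes."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```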
Scaling
Three scaling modes per group, declared in the group config:
STATIC
The scheduler maintains exactly min instances. To scale, edit the group.
Right for groups with deterministic IDs — the proxy group, a hub, a
creative-mode overworld.
DYNAMIC
Auto-scale between min and max based on player-load thresholds.
```yaml
scaling:
  mode: DYNAMIC
  min: 2
  max: 10
  target: 0.7           # average player load triggers scale up
  scaleDownTarget: 0.3  # average load triggers scale down
  cooldownSeconds: 60
  scaleStep: 1          # how many instances to add per scaling event
```

The `ScalingEvaluator`:
- Computes average player load across the group’s `RUNNING` instances.
- If load > `target` and we’re under `max`, schedules a scale-up.
- If load < `scaleDownTarget` and we’re over `min`, schedules a scale-down.
- Applies `cooldownSeconds` after each scaling event — no further scaling in either direction until the cooldown expires. This prevents flapping.
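The evaluator’s decision logic can be sketched as a pure function of config, observed load, and the last scaling event (the shapes and function name are illustrative):

```python
def evaluate_scaling(cfg, loads, running, last_event, now):
    """Return the instance delta for a DYNAMIC group: +step, -step, or 0.

    cfg carries min/max/target/scaleDownTarget/cooldownSeconds/scaleStep;
    loads is the per-instance player load of RUNNING instances.
    """
    if now - last_event < cfg["cooldownSeconds"]:
        return 0  # cooldown gates both directions to prevent flapping
    if not loads:
        return 0
    avg = sum(loads) / len(loads)
    step = cfg.get("scaleStep", 1)
    if avg > cfg["target"] and running < cfg["max"]:
        return min(step, cfg["max"] - running)       # scale up, clamped to max
    if avg < cfg["scaleDownTarget"] and running > cfg["min"]:
        return -min(step, running - cfg["min"])      # scale down, clamped to min
    return 0
```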
Scaling events emit INSTANCE_SCHEDULED (up) or INSTANCE_STOPPING
(down) on the SSE bus and increment
prexorcloud_scaling_events_total{group, direction}.
MANUAL
The scheduler does not add or remove instances. Operators do it explicitly:
```shell
prexorctl group scale lobby --target 5
```

Useful for staging environments where you want full control, or for in-progress migrations where automatic scaling would interfere.
Crash-loop detection
When an instance crashes, the daemon classifies the exit (OOM, SIGKILL,
clean, unknown), captures the console tail, and reports a CrashReport
to the controller. The controller appends to the crashes collection
and updates the in-memory CrashLoopDetector.
The detector watches a sliding window per group: if N crashes occur within window W, the group is automatically paused. Defaults:
```yaml
scheduler:
  crashLoop:
    threshold: 3        # crashes
    windowSeconds: 300  # within 5 minutes
    action: PAUSE_GROUP
```

A paused group is left alone — no new instances are scheduled, and no existing instances are restarted by the scheduler. Existing instances continue running until they crash or stop. Operators investigate, fix, and unpause:
```shell
prexorctl group resume lobby
```

The pause emits a GROUP_PAUSED event with the reason
(crash_loop_threshold_exceeded); the dashboard shows the group with a
red pause badge and the recent crash records inline.
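The sliding-window check can be sketched as a per-group deque of crash timestamps (the class name matches the text; the method shape is an assumption):

```python
from collections import defaultdict, deque


class CrashLoopDetector:
    """Per-group sliding window: threshold crashes within window_seconds
    means the group should be paused. Illustrative sketch."""

    def __init__(self, threshold=3, window_seconds=300):
        self.threshold = threshold
        self.window = window_seconds
        self.crashes = defaultdict(deque)

    def record(self, group, ts):
        """Record a crash at timestamp ts; return True if pause fires."""
        q = self.crashes[group]
        q.append(ts)
        # evict crashes that have aged out of the window
        while q and ts - q[0] > self.window:
            q.popleft()
        return len(q) >= self.threshold
```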
Drains
Operators take a node out of service via prexorctl node drain <id>
(gated on nodes.manage). The drain workflow:
- The node is marked drainable. The scheduler stops placing new instances on it.
- For each `RUNNING` instance on the node, the controller emits `INSTANCE_STOPPING` and instructs the daemon to stop gracefully.
- The proxy plugins observe the SSE event and migrate connected players to fallback instances via the Network Composition chain.
- Once the daemon reports the instance `STOPPED`, the controller picks a new node and schedules a replacement (subject to the group’s scaling rules).
- When the node has no remaining instances, it is `DRAINED`. Operators can stop the daemon process, take the host offline, or do maintenance.
Drains are a workflow — their state is persisted to workflow_intents in
MongoDB and can resume across controller failover. The owner of the
node-ownership lease drives the drain. If the controller dies mid-drain,
another controller picks up the workflow and continues.
The full state machine for drains is in the recovery runbook.
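As a rough illustration of one scheduler pass over a draining node (the node/instance shapes and return convention are assumptions, not the persisted workflow types):

```python
def drain_step(node):
    """One pass over a draining node.

    Returns (node_state, instance_ids_to_stop): RUNNING instances get an
    INSTANCE_STOPPING dispatch; once the node is empty it is DRAINED.
    """
    running = [i["id"] for i in node["instances"] if i["state"] == "RUNNING"]
    if running:
        return "DRAINING", running   # dispatch graceful stops this pass
    if node["instances"]:
        return "DRAINING", []        # stops still in flight, keep waiting
    return "DRAINED", []             # nothing left; host can go offline
```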
Event Choreography overlays
The scaler is a pure function of group config. To get time-bound behaviour (“scale lobby up Friday peak hours, scale down at 02:00 UTC, maintenance window Sunday morning”) without writing a parallel scheduler, PrexorCloud overlays config via Event Choreography.
```yaml
events:
  - id: peak-hours
    cron: "0 19 * * 5"   # Fri 19:00
    duration: "PT7H"
    targetGroup: lobby
    overlay:
      minInstances: 4
      maxInstances: 20
      scalingMode: DYNAMIC
  - id: maintenance-window
    cron: "0 8 * * 0"    # Sun 08:00
    duration: "PT1H"
    targetGroup: lobby
    overlay:
      maintenance: true
      maintenanceMessage: "<yellow>Patching, back in 1h"
```

`EventChoreographer` is a 5-field cron parser with Vixie OR semantics on
day-of-month / day-of-week and IANA timezone support. It evaluates active
overlays each scheduler tick and feeds them into
SchedulerDesiredStatePlanner.planGroup. The scaler does the same job
it always did; overlays are just additional inputs.
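The overlay merge itself is a shallow, key-restricted map merge onto the group’s base config. A sketch under assumed shapes (the allowed-key set follows the restriction documented below; the function name is illustrative):

```python
# Only these fields are overlay-able in v1; anything else is ignored.
ALLOWED_OVERLAY_KEYS = {"minInstances", "maxInstances", "scalingMode", "maintenance"}


def apply_overlays(group_cfg, active_overlays):
    """Merge currently-active Choreography overlays onto a group's config.

    Later entries in active_overlays win; unknown keys are dropped so an
    overlay cannot mutate arbitrary group settings.
    """
    merged = dict(group_cfg)
    for overlay in active_overlays:
        for key, value in overlay.items():
            if key in ALLOWED_OVERLAY_KEYS:
                merged[key] = value
    return merged
```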
State changes emit CHOREOGRAPHY_OVERLAY_ACTIVATED /
_DEACTIVATED on the SSE bus, so the dashboard shows a “peak hours
active” pill on the affected group.
Read-only REST surface:
- `GET /api/v1/events` — configured entries.
- `GET /api/v1/events/active` — currently firing overlays.
Both gated on events.view. There is no UI for editing overlays in v1 —
they are config-driven. Only minInstances, maxInstances,
scalingMode, and maintenance are overlay-able.
Maintenance mode
A group can be put into maintenance mode (per-group) or the whole cluster can be (global). Maintenance:
- Stops the scheduler from scheduling new instances in the group.
- Does not stop existing instances.
- Optionally rejects new player connections at the proxy with a custom message.
Useful for planned outages, large config rollouts, or pinning a known-bad group while you debug.
```shell
prexorctl group maintenance lobby --enable --message "Patching, back in 30m"
prexorctl group maintenance lobby --disable
```

Choreography overlays can flip this on and off automatically — see the overlay example above.
Tuning knobs
The scheduler is conservative by default. The knobs operators most often adjust:
| Knob | Default | What it changes |
|---|---|---|
| `scheduler.tickIntervalSeconds` | 5 | How often the loop runs. Lower for faster reaction, higher for less work under quiet load. |
| `scheduler.startTimeoutSeconds` | 60 | How long to wait for a daemon to ack a start before retrying. Increase on slow disks. |
| `scheduler.startMaxAttempts` | 3 | Per-instance start retries before failing the placement. |
| `scaling.cooldownSeconds` | 60 | Per-group scaling cooldown. Increase if you see flapping. |
| `crashLoop.threshold` / `windowSeconds` | 3 / 300 | When to auto-pause a group. |
| `placement.weights` | balanced | Affinity, anti-spread, headroom weights. |
The full configuration reference lives at Operations / Configuration.
Next up
- Deployments — rolling restarts as the template-swap mechanism.
- Cluster Model — leases, fencing, what drives per-group lease ownership.
- Events — every scheduler decision emits an SSE event.
- Observability — the metrics that tell you the scheduler is healthy.