Monitoring + Metrics
PrexorCloud emits three signals: metrics for “is the cluster degrading over time?”, logs for “what did we do at 03:14?”, and SSE events for “what is happening right now?”. This page covers the metrics. Logs live in Logs and Audit; SSE is documented under Architecture.
What you’ll learn
- How to scrape PrexorCloud with Prometheus
- The full canonical metric set, broken down by area
- PromQL recipes for the questions operators ask first
- Alert rules with sensible thresholds
What you do not get
- A pre-built Grafana dashboard pack. By design — see Architecture decisions. Metric names and labels are stable; build the panels you need.
- Distributed tracing. PrexorCloud is two services with one well-defined gRPC contract; OTel adds runtime cost without buying anything.
- In-app alert configuration. Use Alertmanager.
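For the last point, a minimal Alertmanager routing sketch that pairs with the severity labels used by the alert rules further down this page (receiver names are placeholders and carry no real integrations):

```yaml
route:
  receiver: ops-default
  routes:
    - matchers:
        - severity="critical"
      receiver: ops-pager

receivers:
  - name: ops-default   # attach a real integration (webhook, email, ...) here
  - name: ops-pager
```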
Scrape config
The controller serves Prometheus exposition at GET /metrics. No auth
by default — gate it via reverse-proxy ACL if needed.
```yaml
scrape_configs:
  - job_name: prexorcloud
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'controller-1:8080'
          - 'controller-2:8080'
```

`metrics.enabled` is on by default. Set `metrics.enabled=false` if you
want to disable the endpoint completely.
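If you do need to gate the endpoint, a minimal reverse-proxy sketch (nginx is used here as an assumption; any proxy with IP allow-listing works, and the CIDR is illustrative):

```nginx
# Only the Prometheus scrape network may reach /metrics.
location /metrics {
    allow 10.20.0.0/16;
    deny  all;
    proxy_pass http://controller-1:8080;
}
```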
Naming convention: prexorcloud_<area>_<thing>_<unit>. Labels are
short and stable.
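For example, a scrape of /metrics returns lines in the standard Prometheus exposition format (the values and label values below are illustrative):

```text
# TYPE prexorcloud_nodes_ready gauge
prexorcloud_nodes_ready 11
# TYPE prexorcloud_crashes_total counter
prexorcloud_crashes_total{group="lobby",exit_reason="oom"} 3
```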
Cluster metrics
| Metric | Type | Labels |
|---|---|---|
| prexorcloud_nodes_total | gauge | — |
| prexorcloud_nodes_ready | gauge | — |
| prexorcloud_groups_total | gauge | — |
| prexorcloud_instances_total | gauge | — |
| prexorcloud_instances_by_state | gauge | state, group |
| prexorcloud_players_total | gauge | — |
| prexorcloud_players_by_group | gauge | group |
| prexorcloud_crashes_total | counter | group, exit_reason |
| prexorcloud_crash_loops_total | counter | group |
| prexorcloud_scaling_events_total | counter | group, direction |
| prexorcloud_deployments_active | gauge | — |
Per-node
| Metric | Type | Labels |
|---|---|---|
| prexorcloud_node_cpu_usage | gauge | node |
| prexorcloud_node_memory_used_bytes | gauge | node |
| prexorcloud_node_memory_total_bytes | gauge | node |
| prexorcloud_node_disk_used_bytes | gauge | node |
| prexorcloud_node_instances | gauge | node |
| prexorcloud_node_heartbeat_latency_ms | histogram | node |
Scheduler
| Metric | Type | Labels |
|---|---|---|
| prexorcloud_scheduler_tick_duration | histogram | — |
| prexorcloud_scheduler_tick_failures_total | counter | — |
| prexorcloud_scheduler_groups_per_tick | gauge | — |
| prexorcloud_scheduler_last_tick_lag_ms | gauge | — |
gRPC
| Metric | Type | Labels |
|---|---|---|
| prexorcloud_grpc_daemon_sessions_active | gauge | — |
| prexorcloud_grpc_inbound_messages_total | counter | payload_case |
| prexorcloud_grpc_outbound_messages_total | counter | payload_case |
| prexorcloud_grpc_outbound_dropped_total | counter | reason |
Coordination + auth
| Metric | Type | Labels |
|---|---|---|
| prexorcloud_coordination_lease_acquire_total | counter | scope |
| prexorcloud_coordination_lease_renew_total | counter | scope |
| prexorcloud_coordination_lease_contention_total | counter | scope |
| prexorcloud_coordination_jwt_revocations_total | counter | — |
| prexorcloud_sse_clients_active | gauge | — |
| prexorcloud_sse_replay_buffer_depth | gauge | — |
HTTP
| Metric | Type | Labels |
|---|---|---|
| prexorcloud_http_requests_total | counter | method, status_class |
| prexorcloud_http_request_duration_ms | histogram | method, status_class |
Module classloader
These pair with the leaked-classloader endpoint at
GET /api/v1/modules/platform/leaked-classloaders.
| Metric | Type | Labels |
|---|---|---|
| prexorcloud_module_classloader_tracked_total | counter | moduleId |
| prexorcloud_module_classloader_collected_total | counter | moduleId |
| prexorcloud_module_classloader_leaked | counter | moduleId |
| prexorcloud_module_classloader_pending | gauge | — |
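To cross-check the metrics against the live endpoint, a quick query sketch (the host matches the scrape config above; bearer-token auth and the `$PREXOR_TOKEN` variable are assumptions, adjust to your deployment):

```sh
# List currently tracked leaked classloaders straight from the controller.
curl -s -H "Authorization: Bearer $PREXOR_TOKEN" \
  http://controller-1:8080/api/v1/modules/platform/leaked-classloaders
```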
PromQL recipes
Crash rate per group over the last hour:
```promql
rate(prexorcloud_crashes_total[1h])
```

Scheduler tick p95 (target: under 200ms at 1k groups):

```promql
histogram_quantile(0.95, rate(prexorcloud_scheduler_tick_duration_bucket[5m]))
```

Lease contention rate (early warning of HA noise):

```promql
rate(prexorcloud_coordination_lease_contention_total[5m])
```

HTTP error budget (5xx ratio):

```promql
sum(rate(prexorcloud_http_requests_total{status_class="5xx"}[5m]))
  / sum(rate(prexorcloud_http_requests_total[5m]))
```

Instance state distribution per group (stacked area panel):

```promql
sum by (group, state) (prexorcloud_instances_by_state)
```

Per-node memory pressure:

```promql
prexorcloud_node_memory_used_bytes / prexorcloud_node_memory_total_bytes
```

Module classloader leak signal:

```promql
rate(prexorcloud_module_classloader_leaked[1h])
```

Alerts
This is the recommended baseline. Tune thresholds to your environment.
```yaml
groups:
  - name: prexorcloud
    rules:
      - alert: PrexorCloudControllerDown
        expr: up{job="prexorcloud"} == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Controller scrape target is down"

      - alert: PrexorCloudCrashLoop
        expr: increase(prexorcloud_crash_loops_total[1h]) > 0
        labels: { severity: critical }
        annotations:
          summary: "Crash loop in group {{ $labels.group }}"

      - alert: PrexorCloudSchedulerLag
        expr: prexorcloud_scheduler_last_tick_lag_ms > 30000
        for: 2m
        labels: { severity: warning }
        annotations:
          summary: "Scheduler tick is more than 30s behind"

      - alert: PrexorCloudLeaseContention
        expr: rate(prexorcloud_coordination_lease_contention_total[5m]) > 1
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Sustained lease contention: multiple controllers fighting for the same scope"

      - alert: PrexorCloudClassloaderLeak
        expr: increase(prexorcloud_module_classloader_leaked[24h]) > 0
        labels: { severity: warning }
        annotations:
          summary: "Module {{ $labels.moduleId }} leaked a classloader"

      - alert: PrexorCloudHttpErrorBudget
        expr: |
          sum(rate(prexorcloud_http_requests_total{status_class="5xx"}[5m]))
            / sum(rate(prexorcloud_http_requests_total[5m])) > 0.05
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "HTTP 5xx ratio > 5% for 5 minutes"

      - alert: PrexorCloudNodeOffline
        expr: prexorcloud_nodes_total - prexorcloud_nodes_ready > 0
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "{{ $value }} daemon node(s) not ready"
```

Building Grafana boards
Suggested rows for a “single pane of glass” board (we don’t ship one, but this is what we’d build first):
- Cluster overview — `nodes_ready`, `nodes_total`, `groups_total`, `instances_total`, `players_total`. Big-number panels.
- Instance state breakdown — stacked series of `prexorcloud_instances_by_state` by group.
- Scheduler health — tick p95 + tick lag + scheduler failure rate.
- HTTP — RPS by `method`, p95 by `status_class`, 5xx ratio.
- HA health — lease acquire / renew / contention rates by `scope`.
- Per-node — CPU, memory, disk, instance count, heartbeat latency.
- Modules — classloader tracked / collected / leaked / pending.
The controller version label on up{job="prexorcloud"} is your
canonical “what’s running where” dimension; pin it to a row header.
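For example, a quick breakdown of which controller versions are currently live (this assumes the label is exported as `version`; check your scrape target's labels for the exact name):

```promql
# Count scrape targets per controller version, only counting targets that are up.
count by (version) (up{job="prexorcloud"} == 1)
```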
Performance baselines
The nightly CI job :cloud-test-harness:perfBaselines (tagged
@Tag("perf")) runs four scenarios — controller cold start,
coordination-store latency, SSE latency, and scheduler tick at 1k
groups — and surfaces drift > 25% as a soft signal in the run summary.
This is a regression-detection nudge, not a CI gate. See the
perf-baselines doc
for the methodology.
Capture local numbers and trend them in your own monitoring.
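To capture a local run, invoking the task directly should be enough; a sketch assuming the standard Gradle wrapper at the repository root:

```sh
# Runs the four perf scenarios locally; trend the printed numbers in your
# own monitoring rather than comparing against the CI baseline directly.
./gradlew :cloud-test-harness:perfBaselines
```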
Diagnostics bundle
prexorctl diagnostics bundle produces a tar.gz with redacted
controller config, /system/readiness, /system/overview,
/system/settings, Valkey keyspace summary, lease state, and log
statistics. Attach it to incident reports — secrets are blanked by
default; review before sharing.
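A typical capture-and-review loop (the archive name below is an assumption; use whatever path the command prints):

```sh
# Generate the bundle, then inspect its contents before sharing it anywhere.
prexorctl diagnostics bundle
tar -tzf prexorcloud-diagnostics-*.tar.gz
```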
Next up
- Logs and Audit — controller / daemon / module logs and the Mongo audit trail
- Production Checklist — alert wiring step-by-step
- Architecture — what each metric measures