Monitoring + Metrics

PrexorCloud emits three signals: metrics for “is the cluster degrading over time?”, logs for “what did we do at 03:14?”, and SSE events for “what is happening right now?”. This page covers the metrics. Logs live in Logs and Audit; SSE is documented under Architecture.

What you’ll learn

  • How to scrape PrexorCloud with Prometheus
  • The full canonical metric set, broken down by area
  • PromQL recipes for the questions operators ask first
  • Alert rules with sensible thresholds

What you do not get

  • A pre-built Grafana dashboard pack. By design — see Architecture decisions. Metric names and labels are stable; build the panels you need.
  • Distributed tracing. PrexorCloud is two services with one well-defined gRPC contract; OTel adds runtime cost without buying anything.
  • In-app alert configuration. Use Alertmanager.

Scrape config

The controller serves Prometheus exposition at GET /metrics. No auth by default — gate it via reverse-proxy ACL if needed.

prometheus.yml
scrape_configs:
  - job_name: prexorcloud
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'controller-1:8080'
          - 'controller-2:8080'
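
If you do gate it, an allowlist at the reverse proxy is enough. A minimal sketch for nginx, assuming Prometheus scrapes from 10.0.0.5 and the controller listens locally on 8080 (both placeholders):

location /metrics {
    allow 10.0.0.5;    # the Prometheus server (placeholder address)
    deny  all;         # everyone else gets 403
    proxy_pass http://127.0.0.1:8080;
}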

The endpoint is controlled by metrics.enabled, which is on by default; set metrics.enabled=false to disable it entirely.

Naming convention: prexorcloud_<area>_<thing>_<unit> (for example, prexorcloud_node_heartbeat_latency_ms: area node, thing heartbeat_latency, unit ms). Labels are short and stable.
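
Each scrape returns the standard Prometheus text exposition format. A representative fragment (values and label values are illustrative):

# TYPE prexorcloud_nodes_ready gauge
prexorcloud_nodes_ready 4
# TYPE prexorcloud_instances_by_state gauge
prexorcloud_instances_by_state{state="running",group="lobby"} 12
prexorcloud_instances_by_state{state="starting",group="lobby"} 1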

Cluster metrics

Metric                             Type       Labels
prexorcloud_nodes_total            gauge      -
prexorcloud_nodes_ready            gauge      -
prexorcloud_groups_total           gauge      -
prexorcloud_instances_total        gauge      -
prexorcloud_instances_by_state     gauge      state, group
prexorcloud_players_total          gauge      -
prexorcloud_players_by_group       gauge      group
prexorcloud_crashes_total          counter    group, exit_reason
prexorcloud_crash_loops_total      counter    group
prexorcloud_scaling_events_total   counter    group, direction
prexorcloud_deployments_active     gauge      -

Per-node

Metric                                 Type       Labels
prexorcloud_node_cpu_usage             gauge      node
prexorcloud_node_memory_used_bytes     gauge      node
prexorcloud_node_memory_total_bytes    gauge      node
prexorcloud_node_disk_used_bytes       gauge      node
prexorcloud_node_instances             gauge      node
prexorcloud_node_heartbeat_latency_ms  histogram  node

Scheduler

Metric                                     Type       Labels
prexorcloud_scheduler_tick_duration        histogram  -
prexorcloud_scheduler_tick_failures_total  counter    -
prexorcloud_scheduler_groups_per_tick      gauge      -
prexorcloud_scheduler_last_tick_lag_ms     gauge      -

gRPC

Metric                                     Type       Labels
prexorcloud_grpc_daemon_sessions_active    gauge      -
prexorcloud_grpc_inbound_messages_total    counter    payload_case
prexorcloud_grpc_outbound_messages_total   counter    payload_case
prexorcloud_grpc_outbound_dropped_total    counter    reason

Coordination + auth

Metric                                           Type       Labels
prexorcloud_coordination_lease_acquire_total     counter    scope
prexorcloud_coordination_lease_renew_total       counter    scope
prexorcloud_coordination_lease_contention_total  counter    scope
prexorcloud_coordination_jwt_revocations_total   counter    -
prexorcloud_sse_clients_active                   gauge      -
prexorcloud_sse_replay_buffer_depth              gauge      -

HTTP

Metric                                 Type       Labels
prexorcloud_http_requests_total        counter    method, status_class
prexorcloud_http_request_duration_ms   histogram  method, status_class

Module classloader

These pair with the leaked-classloader endpoint at GET /api/v1/modules/platform/leaked-classloaders.

Metric                                          Type       Labels
prexorcloud_module_classloader_tracked_total    counter    moduleId
prexorcloud_module_classloader_collected_total  counter    moduleId
prexorcloud_module_classloader_leaked           counter    moduleId
prexorcloud_module_classloader_pending          gauge      -
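
To cross-check the counters against the live list, query the endpoint directly. Host and port here come from the scrape example above; add whatever auth your deployment enforces on the API:

curl -s http://controller-1:8080/api/v1/modules/platform/leaked-classloaders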

PromQL recipes

Crash rate per group over the last hour:

sum by (group) (rate(prexorcloud_crashes_total[1h]))

Scheduler tick p95 (target: under 200ms at 1k groups):

histogram_quantile(0.95, rate(prexorcloud_scheduler_tick_duration_bucket[5m]))

Lease contention rate (early-warning of HA noise):

rate(prexorcloud_coordination_lease_contention_total[5m])

HTTP error budget (5xx ratio):

sum(rate(prexorcloud_http_requests_total{status_class="5xx"}[5m]))
/ sum(rate(prexorcloud_http_requests_total[5m]))

Instance state distribution per group (stacked area panel):

sum by (group, state) (prexorcloud_instances_by_state)

Per-node memory pressure:

prexorcloud_node_memory_used_bytes / prexorcloud_node_memory_total_bytes

Module classloader leak signal:

rate(prexorcloud_module_classloader_leaked[1h])
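
HTTP p95 latency by method pairs naturally with the error-budget query above. This assumes the usual _bucket series that client libraries emit for histograms, as with the scheduler recipe:

histogram_quantile(0.95, sum by (le, method) (rate(prexorcloud_http_request_duration_ms_bucket[5m])))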

Alerts

This is the recommended baseline. Tune thresholds to your environment.

groups:
  - name: prexorcloud
    rules:
      - alert: PrexorCloudControllerDown
        expr: up{job="prexorcloud"} == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Controller scrape target is down"
      - alert: PrexorCloudCrashLoop
        expr: increase(prexorcloud_crash_loops_total[1h]) > 0
        labels: { severity: critical }
        annotations:
          summary: "Crash loop in group {{ $labels.group }}"
      - alert: PrexorCloudSchedulerLag
        expr: prexorcloud_scheduler_last_tick_lag_ms > 30000
        for: 2m
        labels: { severity: warning }
        annotations:
          summary: "Scheduler tick is more than 30s behind"
      - alert: PrexorCloudLeaseContention
        expr: rate(prexorcloud_coordination_lease_contention_total[5m]) > 1
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Sustained lease contention — multiple controllers fighting for the same scope"
      - alert: PrexorCloudClassloaderLeak
        expr: increase(prexorcloud_module_classloader_leaked[24h]) > 0
        labels: { severity: warning }
        annotations:
          summary: "Module {{ $labels.moduleId }} leaked a classloader"
      - alert: PrexorCloudHttpErrorBudget
        expr: |
          sum(rate(prexorcloud_http_requests_total{status_class="5xx"}[5m])) /
          sum(rate(prexorcloud_http_requests_total[5m])) > 0.05
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "HTTP 5xx ratio > 5% for 5 minutes"
      - alert: PrexorCloudNodeOffline
        expr: prexorcloud_nodes_total - prexorcloud_nodes_ready > 0
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "{{ $value }} daemon node(s) not ready"

Building Grafana boards

Suggested rows for a “single pane of glass” board (we don’t ship one, but this is what we’d build first):

  1. Cluster overview — nodes_ready, nodes_total, groups_total, instances_total, players_total. Big-number panels.
  2. Instance state breakdown — stacked series of prexorcloud_instances_by_state by group.
  3. Scheduler health — tick p95 + tick lag + scheduler failure rate.
  4. HTTP — RPS by method, p95 by status_class, 5xx ratio.
  5. HA health — lease acquire / renew / contention rates by scope.
  6. Per-node — CPU, memory, disk, instance count, heartbeat latency.
  7. Modules — classloader tracked / collected / leaked / pending.

The controller version label on up{job="prexorcloud"} is your canonical “what’s running where” dimension; pin it to a row header.

Performance baselines

The nightly CI job :cloud-test-harness:perfBaselines (tagged @Tag("perf")) runs four scenarios — controller cold start, coordination-store latency, SSE latency, and scheduler tick at 1k groups — and surfaces drift > 25% as a soft signal in the run summary. This is a regression-detection nudge, not a CI gate. See the perf-baselines doc for the methodology.
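
To run the same scenarios locally (assuming the standard Gradle wrapper at the repository root):

./gradlew :cloud-test-harness:perfBaselines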

Capture local numbers and trend them in your own monitoring.

Diagnostics bundle

prexorctl diagnostics bundle produces a tar.gz with redacted controller config, /system/readiness, /system/overview, /system/settings, Valkey keyspace summary, lease state, and log statistics. Attach it to incident reports — secrets are blanked by default; review before sharing.
