12,000+ signals per training run
The deepest read of a GPU training run in the industry: more training-run signals than any general-purpose observability product. Every one of them is collected by a lightweight on-node agent, with no SDK, no code changes, and no special logging sink. This is everything the verdict is allowed to cite.
Counting methodology ships with our press kit — every claim on this page is verifiable against public documentation.
What the agent collects
GPU telemetry
per physical GPU
- Memory — used and total framebuffer bytes, and the allocation curve over the run
- Compute — engine-active and SM-active duty cycles (the difference distinguishes a busy-waiting GPU from a productive one)
- Health — XID errors, correctable and uncorrectable ECC counts, HBM row-remap counts and the row-remap-failure flag, NVLink errors, clock-throttle reasons (thermal / power), temperature, performance state, and a device-health roll-up
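To make the compute signal concrete, here is a minimal sketch of how the gap between engine-active and SM-active duty cycles could separate a busy-waiting GPU from a productive one. The `GpuSample` shape, the `classify_compute` name, and every threshold are assumptions for illustration, not the agent's actual logic.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One duty-cycle sample for a physical GPU (hypothetical shape)."""
    engine_active: float  # engine-active duty cycle, 0.0 to 1.0
    sm_active: float      # SM-active duty cycle, 0.0 to 1.0


def classify_compute(sample: GpuSample,
                     idle_below: float = 0.05,
                     busy_threshold: float = 0.9,
                     gap_threshold: float = 0.5) -> str:
    """Label a sample as idle, busy-waiting, or productive.

    All thresholds are illustrative assumptions, not tuned values.
    """
    if sample.engine_active < idle_below:
        return "idle"
    # A large gap means the engine reports busy while the SMs do little work.
    gap = sample.engine_active - sample.sm_active
    if sample.engine_active >= busy_threshold and gap >= gap_threshold:
        return "busy-waiting"
    return "productive"
```

A sample with the engine at 98% and the SMs at 10% would be labelled busy-waiting under these assumed thresholds, while 95% and 90% would be labelled productive.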
CUDA & NCCL runtime tracing
kernel-level, via eBPF — no SDK
- CUDA kernel-launch counts and whether a process ever attached to the GPU — the signal that separates an idle hang from real work
- CUDA API call durations and host-to-device / device-to-host memory-copy volumes
- NCCL communicator and collective state — in-flight collectives, completion counts, and enqueue durations — used to tell an advancing run from a frozen one
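One way to picture how collective state tells an advancing run from a frozen one is to compare two snapshots of the NCCL counters. The sketch below assumes a hypothetical `NcclSnapshot` record and an arbitrary five-minute stall window; neither comes from the product itself.

```python
from dataclasses import dataclass


@dataclass
class NcclSnapshot:
    """Counters for one communicator at one point in time (hypothetical shape)."""
    timestamp_s: float
    completed: int   # cumulative completed collectives
    in_flight: int   # collectives enqueued but not yet completed


def run_state(prev: NcclSnapshot, curr: NcclSnapshot,
              stall_after_s: float = 300.0) -> str:
    """Compare two snapshots and label the run.

    stall_after_s is an illustrative assumption, not a recommended value.
    """
    if curr.completed > prev.completed:
        return "advancing"
    elapsed = curr.timestamp_s - prev.timestamp_s
    # No completions, work still pending, and enough time has passed.
    if curr.in_flight > 0 and elapsed >= stall_after_s:
        return "frozen"
    return "inconclusive"
```

Returning "inconclusive" rather than guessing keeps this signal from being trusted alone, in line with the corroboration rule described at the end of this page.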
Job logs
standard output & standard error
- Loss trajectory, gradient norm, learning rate, and checkpoint events — read via framework-specific parsers
- Framework progress lines, Python tracebacks, and the distributed launcher’s failure summary (including signal-kill summaries)
- Everything captured at the kernel level and attributed by training rank, node, and process — without the application writing anywhere special
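As an illustration of what a framework-specific parser might look like, the sketch below pulls loss, gradient norm, and learning rate out of a progress line. The `key=value` line format and the `parse_progress_line` name are assumptions; real frameworks print these values in many different shapes.

```python
import re
from typing import Dict, Optional

# Matches key=number pairs such as loss=2.3145 or lr=3.0e-04.
_KV = re.compile(r"(\w+)=([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")
_FIELDS = ("step", "loss", "grad_norm", "lr")


def parse_progress_line(line: str) -> Optional[Dict[str, float]]:
    """Return the known training fields found in a line, or None.

    Assumes a hypothetical 'key=value' progress format. Lines without
    a loss value (tracebacks, banners) are treated as non-progress lines.
    """
    pairs = {key: float(value) for key, value in _KV.findall(line)}
    if "loss" not in pairs:
        return None
    return {name: pairs[name] for name in _FIELDS if name in pairs}
```

Under this assumed format, a line such as `[rank 0] step=120 loss=2.3145 grad_norm=0.87 lr=3.0e-04` yields all four fields, and a traceback header yields None.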
Slurm scheduler records
job, step, node & controller
- Full job and step lifecycle as events — submitted, started, blocked (with the pending reason), suspended / resumed, requeued, updated, and the terminal disposition (completed, failed, cancelled, timeout, node-fail, preempted) — with exit codes, signals, elapsed time, the node list, restart count, and the submit line
- Per-node state (idle, allocated, down, drain) and down/drain seconds with reasons; per-node CPU load, memory, and GPU allocation
- Controller and scheduler health — queue depths, job throughput and failure counters, and RPC stats — plus per-job time limit, priority, and I/O paths
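The terminal dispositions listed above correspond to Slurm job states. A minimal sketch of normalizing a raw state string into those labels could look like this; the function name and the input handling are assumptions, and states outside the six named on this page are deliberately left unmapped.

```python
from typing import Optional

# Slurm job state name -> disposition label used on this page.
TERMINAL_DISPOSITIONS = {
    "COMPLETED": "completed",
    "FAILED": "failed",
    "CANCELLED": "cancelled",
    "TIMEOUT": "timeout",
    "NODE_FAIL": "node-fail",
    "PREEMPTED": "preempted",
}


def terminal_disposition(state: str) -> Optional[str]:
    """Map a raw state string to a terminal disposition, or None.

    Only the first token is used, so a value like 'CANCELLED by 1001'
    still maps to 'cancelled'. Non-terminal states return None.
    """
    tokens = state.strip().split()
    if not tokens:
        return None
    return TERMINAL_DISPOSITIONS.get(tokens[0].upper())
```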
Topology & attribution
the map that ties it together
- A continuously maintained map from container to pod to Slurm job to rank, so every metric, log line, and kernel event ties back to the specific training run and GPU that produced it
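The attribution map can be pictured as a lookup keyed by container that is updated as workloads start and stop. The sketch below is a simplified in-memory version; `Attribution`, `TopologyMap`, and their fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Attribution:
    """Who owns a container (hypothetical record)."""
    pod: str
    slurm_job_id: int
    rank: int
    gpu_index: int


class TopologyMap:
    """In-memory container -> pod -> Slurm job -> rank lookup."""

    def __init__(self) -> None:
        self._by_container: Dict[str, Attribution] = {}

    def upsert(self, container_id: str, attribution: Attribution) -> None:
        # Called whenever a container starts or its ownership changes.
        self._by_container[container_id] = attribution

    def remove(self, container_id: str) -> None:
        # Called when a container exits; unknown ids are ignored.
        self._by_container.pop(container_id, None)

    def attribute(self, container_id: str, event: dict) -> Optional[dict]:
        """Return a copy of the event tagged with its owner, or None if unknown."""
        owner = self._by_container.get(container_id)
        if owner is None:
            return None
        return {**event, "pod": owner.pod, "slurm_job_id": owner.slurm_job_id,
                "rank": owner.rank, "gpu_index": owner.gpu_index}
```

Returning None for an unknown container, instead of a best guess, keeps unattributed events from being pinned on the wrong run.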
This breadth is what lets the agent corroborate a cause across data stores before reaching a conclusion. No single signal is trusted alone: the verdict is only as strong as the evidence that agrees across metrics, logs, and scheduler records.
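The corroboration rule can be sketched as a small voting function: a cause is reported only when distinct stores agree on it. The function name, the `(store, cause)` evidence shape, and the default of two agreeing stores are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def corroborated_cause(evidence: Iterable[Tuple[str, str]],
                       min_stores: int = 2) -> Optional[str]:
    """Return the cause backed by the most distinct stores, or None.

    evidence is an iterable of (store, cause) pairs, for example
    ("metrics", "nccl_hang"). None is returned when no cause reaches
    min_stores, or when two causes tie for the top spot.
    """
    stores_by_cause: Dict[str, Set[str]] = defaultdict(set)
    for store, cause in evidence:
        stores_by_cause[cause].add(store)

    ranked = sorted(stores_by_cause.items(),
                    key=lambda item: len(item[1]), reverse=True)
    if not ranked:
        return None
    top_cause, top_stores = ranked[0]
    if len(top_stores) < min_stores:
        return None
    # An exact tie between causes is treated as "not corroborated".
    if len(ranked) > 1 and len(ranked[1][1]) == len(top_stores):
        return None
    return top_cause
```

Repeated evidence from the same store counts once, so a noisy log stream cannot outvote the metrics and scheduler records on its own.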



