Evidence Index

12,000+ signals per training run

The deepest read of a GPU training run in the industry, with more training-run signals than any general-purpose observability product. Every one of them is collected by a lightweight on-node agent: no SDK, no code changes, no special logging sink. This index lists everything the verdict is allowed to cite.

Counting methodology ships with our press kit — every claim on this page is verifiable against public documentation.

What the agent collects

GPU telemetry

per physical GPU

  • Memory — used and total framebuffer bytes, and the allocation curve over the run
  • Compute — engine-active and SM-active duty cycles (the difference distinguishes a busy-waiting GPU from a productive one)
  • Health — XID errors, correctable and uncorrectable ECC counts, HBM row-remap counts and the row-remap-failure flag, NVLink errors, clock-throttle reasons (thermal / power), temperature, performance state, and a device-health roll-up
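
As a rough illustration of the compute distinction above, the sketch below labels a single sample from the two duty cycles. The field names and both thresholds are assumptions made for this example, not the agent's actual schema or tuning.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    # Hypothetical field names; both values are fractions in [0.0, 1.0].
    engine_active: float  # engine-active duty cycle
    sm_active: float      # SM-active duty cycle


def classify(sample: GpuSample,
             busy_threshold: float = 0.5,
             productive_threshold: float = 0.1) -> str:
    """Label a sample as idle, busy-waiting, or productive.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    if sample.engine_active < busy_threshold:
        return "idle"
    if sample.sm_active < productive_threshold:
        # The engine reports activity, but the SMs are barely executing.
        return "busy-waiting"
    return "productive"
```

In this sketch, a sample with a high engine-active cycle and a near-zero SM-active cycle is the busy-waiting case; a real classifier would presumably look at a window of samples rather than one.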

CUDA & NCCL runtime tracing

kernel-level, via eBPF — no SDK

  • CUDA kernel-launch counts and whether a process ever attached to the GPU — the signal that separates an idle hang from real work
  • CUDA API call durations and host-to-device / device-to-host memory-copy volumes
  • NCCL communicator and collective state — in-flight collectives, completion counts, and enqueue durations — used to tell an advancing run from a frozen one
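
To make the advancing-versus-frozen distinction concrete, here is a minimal sketch that compares two snapshots of one communicator. The `CollectiveSnapshot` type, its fields, and the three labels are hypothetical names chosen for illustration; the agent's real data model may differ.

```python
from dataclasses import dataclass


@dataclass
class CollectiveSnapshot:
    # Hypothetical snapshot of one NCCL communicator at one point in time.
    in_flight: int   # collectives enqueued but not yet completed
    completed: int   # running total of completed collectives


def run_state(earlier: CollectiveSnapshot, later: CollectiveSnapshot) -> str:
    """Compare two snapshots taken some interval apart."""
    if later.completed > earlier.completed:
        return "advancing"   # completions moved forward during the interval
    if later.in_flight > 0:
        return "frozen"      # work is pending, but nothing completed
    return "quiet"           # nothing pending and nothing completed
```

The interval between snapshots matters: a long collective on a large model can legitimately show no completions for a while, so any real check would need a timeout chosen for the workload.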

Job logs

standard output & standard error

  • Loss trajectory, gradient norm, learning rate, and checkpoint events — read via framework-specific parsers
  • Framework progress lines, Python tracebacks, and the distributed launcher’s failure summary (including signal-kill summaries)
  • Everything is captured at the kernel level and attributed by training rank, node, and process, without the application writing anywhere special
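
A framework-specific parser can be very small. The sketch below assumes a hypothetical progress-line format of space-separated `key=value` pairs (`step`, `loss`, `grad_norm`, `lr`); real frameworks print these differently, which is why a parser per framework is needed.

```python
import re

# Assumed line shape: "step=120 loss=2.3456 grad_norm=0.98 lr=3.0e-04".
# This format is an illustration, not any particular framework's output.
_PAIR = re.compile(r"\b(step|loss|grad_norm|lr)=(\S+)")


def parse_progress(line: str) -> dict:
    """Extract training metrics from one stdout/stderr line.

    Returns an empty dict when the line carries no recognised metric.
    """
    metrics = {}
    for key, raw in _PAIR.findall(line):
        try:
            metrics[key] = int(raw) if key == "step" else float(raw)
        except ValueError:
            # Skip values that are not numeric rather than guessing.
            continue
    return metrics
```

Lines that are not progress lines, such as the first line of a Python traceback, simply yield an empty result and can be routed to a different parser.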

Slurm scheduler records

job, step, node & controller

  • Full job and step lifecycle as events — submitted, started, blocked (with the pending reason), suspended / resumed, requeued, updated, and the terminal disposition (completed, failed, cancelled, timeout, node-fail, preempted) — with exit codes, signals, elapsed time, the node list, restart count, and the submit line
  • Per-node state (idle, allocated, down, drain) and down/drain seconds with reasons; per-node CPU load, memory, and GPU allocation
  • Controller and scheduler health — queue depths, job throughput and failure counters, and RPC stats — plus per-job time limit, priority, and I/O paths
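
For a sense of how a terminal disposition can be read from a scheduler record, the sketch below parses one pipe-delimited line with the fields job ID, state, and exit code, in the style of `sacct --parsable2` output. The three-field layout and the output dictionary are assumptions for this example; the state names and the `code:signal` exit format follow Slurm's conventions.

```python
# Slurm terminal job states mapped to the dispositions listed above.
TERMINAL = {
    "COMPLETED": "completed",
    "FAILED": "failed",
    "CANCELLED": "cancelled",
    "TIMEOUT": "timeout",
    "NODE_FAIL": "node-fail",
    "PREEMPTED": "preempted",
}


def parse_record(line: str) -> dict:
    """Parse an assumed 'JobID|State|ExitCode' record."""
    job_id, state, exit_code = line.strip().split("|")
    # Slurm may append detail to the state, e.g. "CANCELLED by 1001".
    state_word = state.split()[0]
    code, _, signal = exit_code.partition(":")
    return {
        "job_id": job_id,
        # None means the job has not reached a terminal state yet.
        "disposition": TERMINAL.get(state_word),
        "exit_code": int(code),
        "signal": int(signal or 0),
    }
```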

Topology & attribution

the map that ties it together

  • A continuously maintained map from container to pod to Slurm job to rank, so every metric, log line, and kernel event ties back to the specific training run and GPU that produced it
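
One way to picture such a map is a lookup keyed by container that is updated as workloads start and stop. Everything in the sketch below (`Attribution`, `TopologyMap`, and the field names) is a hypothetical illustration, not the product's data model.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Attribution:
    # Hypothetical ownership chain for one container.
    pod: str
    slurm_job_id: str
    rank: int
    gpu_index: int


class TopologyMap:
    """Container-keyed lookup, kept current as workloads come and go."""

    def __init__(self) -> None:
        self._by_container: dict = {}

    def upsert(self, container_id: str, owner: Attribution) -> None:
        self._by_container[container_id] = owner

    def remove(self, container_id: str) -> None:
        self._by_container.pop(container_id, None)

    def attribute(self, container_id: str, event: dict) -> dict:
        """Return a copy of the event tagged with its owner, if known."""
        owner = self._by_container.get(container_id)
        if owner is None:
            return {**event, "attributed": False}
        return {**event, "attributed": True, **asdict(owner)}
```

Unknown containers are flagged rather than dropped, so an event that arrives before the map catches up can still be attributed later.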

This breadth is what lets the agent corroborate a cause across stores before reaching a conclusion. No single signal is trusted alone: the verdict is only as strong as the evidence that agrees across metrics, logs, and scheduler records.
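
The corroboration rule can be sketched as a simple agreement count. The source names, the cause labels, and the two-source minimum below are assumptions for illustration; the actual decision logic is not described on this page.

```python
# Evidence sources named above; the set itself is an assumption of this sketch.
SOURCES = {"metrics", "logs", "scheduler"}


def corroborated_causes(evidence, min_sources: int = 2) -> list:
    """Return causes supported by at least `min_sources` distinct sources.

    `evidence` is an iterable of (source, cause) pairs. Repeated signals
    from the same source count once, so no single store can carry a verdict.
    """
    sources_by_cause: dict = {}
    for source, cause in evidence:
        if source in SOURCES:
            sources_by_cause.setdefault(cause, set()).add(source)
    return sorted(cause for cause, seen in sources_by_cause.items()
                  if len(seen) >= min_sources)
```

For example, an out-of-memory traceback in the logs plus a framebuffer at its limit in the metrics would agree on one cause, while a lone XID error from the metrics would not be enough on its own.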

Stop guessing.
Get the verdict.
Fix it fast.

30-minute demo on a sample cluster — we walk through a real failure report.

Book a Demo