The catalog

20+ named failure modes

On every run, the agent rules each family in or out, with counterfactuals, before it reaches a conclusion. These are not anomaly scores: they are named, documented failure modes, grouped by where in the run's life the problem appears. Select any mode to see how it is judged.

A. Startup & configuration

B. GPU memory & hardware

C. Distributed coordination & hangs

D. Throughput & utilization

E. Training correctness

F. Cluster & scheduler

G. Healthy baseline

Loss divergence / not-learning

The most dangerous false negative: GPUs busy, no OOM, hardware clean, and the model learned nothing. The loss spiked and then flatlined on a dead plateau (or went NaN) while every infrastructure tool read "healthy". Reading the loss trajectory is the only way to catch it, and the agent reads the loss.
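As a rough illustration of what "reading the loss trajectory" can mean, the sketch below labels a list of per-step loss values as NaN, spike-then-plateau, flat from the start, or still improving. It is not the agent's actual method: the function name `classify_loss`, the label strings, and every threshold (`spike_factor`, `plateau_tol`, `window`) are hypothetical choices made for this example, and it assumes a positive loss such as cross-entropy.

```python
import math
from statistics import median


def classify_loss(losses, spike_factor=3.0, plateau_tol=0.01, window=5):
    """Label a loss trajectory. Labels and thresholds are illustrative assumptions."""
    # Any NaN or infinity means the run diverged numerically.
    if any(not math.isfinite(x) for x in losses):
        return "nan"
    # Too few points to compare the start of the run with its end.
    if len(losses) < 2 * window:
        return "insufficient_data"

    head_med = median(losses[:window])
    tail = losses[-window:]
    tail_med = median(tail)

    # Spike: some step jumps far above the best loss seen before it.
    running_min = losses[0]
    spiked = False
    for x in losses[1:]:
        if x > spike_factor * running_min:
            spiked = True
        running_min = min(running_min, x)

    # Plateau: the last `window` values barely move relative to their median.
    flat = (max(tail) - min(tail)) <= plateau_tol * max(abs(tail_med), 1e-12)

    if spiked and flat:
        return "spike_then_plateau"
    # Flat tail that is not meaningfully below where the run started.
    if flat and tail_med >= 0.95 * head_med:
        return "not_learning"
    return "ok"


if __name__ == "__main__":
    healthy = [2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.9, 0.8, 0.7, 0.6]
    diverged = [2.0, 1.5, 1.0, 0.8, 6.9, 6.9, 6.9, 6.9, 6.9, 6.9]
    print(classify_loss(healthy))
    print(classify_loss(diverged))
```

None of these checks look at GPU utilization, memory, or hardware counters, which is the point of the passage above: a run can pass every infrastructure check and still fail here.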


Stop guessing.
Get the verdict.
Fix it fast.