The catalog
20+ named failure modes
The agent rules each family in or out on every run, with counterfactuals, before it reaches a conclusion. These are not anomaly scores: they are named, documented failure modes, grouped by where in the run's life the problem appears. Select any mode to see how it is judged.
A. Startup & configuration
B. GPU memory & hardware
C. Distributed coordination & hangs
D. Throughput & utilization
E. Training correctness
F. Cluster & scheduler
G. Healthy baseline
E1
Loss divergence / not-learning
The most dangerous false negative: GPUs busy, no OOM, hardware clean, and the model learned nothing. The loss spiked and then flatlined at a dead plateau (or went NaN) while every infrastructure tool read "healthy". Reading the loss trajectory is the only way to catch it, so the agent reads the loss.
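As a rough illustration of what "reading the loss trajectory" can mean, here is a minimal sketch. It is not the agent's actual judging logic: the function name `classify_loss`, the verdict labels, and the thresholds (`window`, `spike_factor`, `min_improvement`) are all hypothetical, and it assumes a positive-valued loss logged once per step.

```python
import math
from statistics import median


def classify_loss(losses, window=20, spike_factor=3.0, min_improvement=0.05):
    """Classify a per-step loss history (hypothetical heuristic).

    Assumes losses are positive. Compares the median of the first
    `window` steps against the median of the last `window` steps.
    """
    # Any NaN or inf anywhere in the history is an immediate failure.
    if any(not math.isfinite(x) for x in losses):
        return "non_finite"

    # Too few points to compare the start against the end.
    if len(losses) < 2 * window:
        return "insufficient_data"

    head = median(losses[:window])   # typical loss at the start
    tail = median(losses[-window:])  # typical loss at the end

    # The run ended far above where it started: divergence.
    if tail > spike_factor * head:
        return "diverged"

    # The run ended without improving by at least `min_improvement`
    # relative to the start: a plateau, i.e. not learning.
    if tail > head * (1.0 - min_improvement):
        return "not_learning"

    return "learning"


if __name__ == "__main__":
    healthy = [2.0 * 0.99 ** i for i in range(200)]
    flat = [2.0] * 100
    print(classify_loss(healthy))
    print(classify_loss(flat))
```

Medians are used instead of single points so that one noisy step at either end does not flip the verdict. Note that none of the inputs here come from GPU or scheduler telemetry, which is the point of this failure mode: a run can pass every infrastructure check and still be flagged on the loss alone.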



