Est. 2023 · San Jose, CA

Who We Are

RidgeScope is built by engineers who spent 20 years preventing outages that cost a million dollars a minute. Now we're bringing that discipline to the most expensive compute there is: GPU training.

How It Started

Our Story

For years, we helped hundreds of companies keep their applications running. Then we watched their GPU clusters become the most expensive — and least observable — infrastructure they own.

“A multi-node training job that crashes, hangs, or silently produces a bad model burns days of expensive GPU time before anyone notices. In the silent case the run completes cleanly, so every tool reads green.”

The pattern we kept seeing on GPU clusters

Existing observability is built by DevOps people, for DevOps people, and it is blind to the training itself. It demands SDKs that ML teams won't add, reads infrastructure health instead of the loss trajectory, and can't separate a code fault from a node failure. So engineers burn up to a day grepping logs by hand, roughly $10,000 of GPU time on a 128-GPU cluster.
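
As a rough sanity check on that figure, the sketch below back-computes the per-GPU-hour price it implies. It assumes "a day" means 24 hours with all 128 GPUs billed for the whole time; the function name is ours, for illustration only, and real pricing varies by provider and hardware.

```python
def implied_gpu_hour_rate(total_cost_usd: float, gpu_count: int, hours: float) -> float:
    """Return the per-GPU-hour price implied by a total cost (illustrative helper)."""
    return total_cost_usd / (gpu_count * hours)


# Figures from the text: about $10,000 lost over up to a day on a 128-GPU cluster.
# Assumption: the full 24 hours are billed on every GPU.
rate = implied_gpu_hour_rate(10_000, 128, 24)
print(f"${rate:.2f} per GPU-hour")  # prints: $3.26 per GPU-hour
```

Under those assumptions the claim corresponds to a little over three dollars per GPU-hour.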

ML engineers deserve a verdict — what failed, why, and what to do — not another dashboard to read by hand. That's why we built RidgeScope.

You train the models. We rule on every run.

20+ Years of reliability engineering
4m Average time to root cause

What We Believe

Our Vision

1

Zero instrumentation is the only honest option

ML teams won't add SDKs or code changes to their training jobs — and they shouldn't have to. The evidence already exists: GPU telemetry, CUDA and NCCL traces, job logs, scheduler records. A tool that demands instrumentation has already failed.

2

“Infrastructure green” is not “training healthy”

A run can complete cleanly, keep every GPU busy — and produce a model that learned nothing. Health must be confirmed on positive evidence: real compute, a falling loss, clean counters. Never on the mere absence of errors.
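
To make "positive evidence" concrete, here is a minimal sketch of such a check. It is not RidgeScope's implementation: the RunEvidence type, the field names, and the thresholds are all hypothetical, chosen only to illustrate that a run is declared healthy when every signal is affirmatively present, and that missing evidence counts against it.

```python
from dataclasses import dataclass, field


@dataclass
class RunEvidence:
    """Hypothetical bundle of signals collected for one training run."""
    gpu_utilization: list[float] = field(default_factory=list)  # fractions in 0..1
    loss: list[float] = field(default_factory=list)             # loss per logged step
    error_counters: dict[str, int] = field(default_factory=dict)


def check_health(ev: RunEvidence, min_util: float = 0.5, min_loss_drop: float = 0.05):
    """Return (healthy, reasons). Thresholds are illustrative assumptions."""
    # No evidence is not the same as good evidence.
    if not ev.gpu_utilization or len(ev.loss) < 2:
        return False, ["insufficient evidence"]

    reasons = []

    # Real compute: average utilization must clear the floor.
    mean_util = sum(ev.gpu_utilization) / len(ev.gpu_utilization)
    if mean_util < min_util:
        reasons.append("low GPU utilization")

    # Falling loss: the last value must sit meaningfully below the first.
    first, last = ev.loss[0], ev.loss[-1]
    if first - last < min_loss_drop * abs(first):
        reasons.append("loss did not fall")

    # Clean counters: any nonzero error counter is reported.
    dirty = sorted(name for name, count in ev.error_counters.items() if count)
    if dirty:
        reasons.append("nonzero counters: " + ", ".join(dirty))

    return not reasons, reasons


# Every GPU busy, no errors, but the loss is flat: infrastructure green, training not healthy.
flat_run = RunEvidence(
    gpu_utilization=[0.97, 0.98, 0.96],
    loss=[2.31, 2.30, 2.31],
    error_counters={"ecc_errors": 0},
)
healthy, reasons = check_health(flat_run)
print(healthy, reasons)
```

The example run keeps its GPUs busy and reports no errors, yet the check rejects it because the loss never moved.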

3

There is no single root cause

A training failure is a chain that crosses layers — framework, CUDA and NCCL, node hardware, the scheduler. No system can reliably point to one root cause — that promise is false. We bring every layer together so you understand the whole story.

4

AI still needs humans

We show what we think broke the run — cited to the exact metric and log, with a confidence score and the limits of the evidence stated. We do not pretend AI can do all the work: your engineers understand the training. We give them the evidence to make the call.

5

No vendor lock-in

Your telemetry belongs to you. Export it, use other tools alongside ours, run local LLMs so nothing leaves your environment — air-gapped if you need it. We will never lock you in.

6

We're here for you

We learn your training stack and know what matters to you. The people who build the product handle your support — when a run worth thousands of dollars is stuck, you call, we answer.

The People

The team

Evgeny Potapov

CEO, co-founder

SRE/DevOps expert with a passion for software development and engineering management, bringing over 15 years of entrepreneurial experience.

Andrei Shamakhov

CTO, co-founder

15 years scaling full-stack architecture and leading monitoring teams, combining deep distributed systems experience with an AI research background.

Elena Kuznetsova

Engineering Lead, co-founder

15 years of experience in developing large-scale distributed systems and managing remote cross-functional teams.

Karen Martirosyan

Full-Stack Engineer

5 years of experience developing scalable web platforms using React and Python, with expertise in modern development practices.

Artur Asadullin

Lead Infrastructure Engineer

10 years of experience in DevOps and Site Reliability Engineering, specializing in observability and building resilient systems.

Ivan Batsuev

AI, ML & Data Engineer

Over 15 years of experience in web development and 5 years building data and machine learning systems, with dozens of successful projects delivered across the stack.

Vasilii Davydov

UX/Product Designer

15 years of UX and product design experience creating user-centered solutions for high-traffic B2B and B2C platforms.

Andrei Surkov

COO

Project Manager with over 20 years of experience leading complex projects across technology and digital products. Works with startups and cross-functional teams, helping turn ideas into scalable, well-executed solutions.

Angel Tevatrosyan

General Counsel

International lawyer with an extensive corporate and VC background, including numerous investment and M&A transactions, restructuring, fund governance, compliance, and investor & admin relations.

Our Location

Based in San Jose, CA

Silicon Valley — where reliability engineering meets innovation.

HQ: 100 S Murphy Ave, Suite 200
Sunnyvale, CA 94086

Stop guessing.
Get the verdict.
Fix it fast.

30-minute demo on a sample cluster — we walk through a real failure report.

Book a Demo