THE STACK

Agent-era infrastructure.

Pulsar validates data. Pasteur stress-tests models. Topos audits generated code. Together, they turn agent behavior into evidence teams can review before deployment.

Pulsar

Data

Find persistent structure in messy datasets, then flag decisions made outside known regions.

Pasteur

In stealth

Models

Stress-test model behavior and surface operational failure modes before signoff.

Topos

In dev

Code

Audit agent-written software for structure, security, and maintainability before merge.

EXAMPLE USE CASES

Concrete work, not abstract scores.

01 / Pulsar

Rare-disease trial matching

Find phenotype segments and missing screening signals before manual chart review.

Explore Pulsar →

02 / Pasteur

Hospital model signoff

Map where a clinical decision-support model should be trusted, flagged, or blocked.

Explore Pasteur →

03 / Topos

Agent engineering governance

Catch insecure or structurally weak generated code before merge.

Explore Topos →

THE OPERATING ENVELOPE

Average-case accuracy is not a safety claim.

A benchmark is an average over a frozen test set. Deployed agents meet rare inputs, drift, and chain decisions across dozens of steps.

WHAT A SINGLE NUMBER HIDES

A strong overall score can still hide the cases that matter most. Epic's sepsis model passed its vendor benchmark, then scored just 0.63 AUC in external validation — missing 67% of sepsis patients. We surface the subgroups where accuracy collapses.
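To illustrate how a single overall score can mask a failing subgroup, here is a minimal Python sketch. The records are synthetic, and the function names and the "common" and "rare" group labels are hypothetical, chosen only for illustration; this is not a description of how Krv's products compute anything.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Fraction of records where the prediction matches the label."""
    records = list(records)
    correct = sum(1 for _, y_true, y_pred in records if y_true == y_pred)
    return correct / len(records)


def accuracy_by_subgroup(records):
    """Accuracy computed separately for each subgroup label.

    records: iterable of (subgroup, y_true, y_pred) tuples.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / totals[group] for group in totals}


# Synthetic data (an assumption for illustration only):
# 90 "common" cases with 88 correct, 10 "rare" cases with 4 correct.
records = (
    [("common", 1, 1)] * 88
    + [("common", 1, 0)] * 2
    + [("rare", 1, 1)] * 4
    + [("rare", 1, 0)] * 6
)

print(f"overall: {overall_accuracy(records):.0%}")
for group, acc in accuracy_by_subgroup(records).items():
    print(f"{group}: {acc:.0%}")
```

In this toy dataset the overall figure is 92% while the rare subgroup sits at 40%, which is the kind of gap an averaged score does not show.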

Test sets are frozen; the world keeps moving — missing labs, new workflows, a shifted population. Deployed agents routinely show a ~37% gap between benchmark and real-world performance.

A step that is 95% accurate gives a chain that is only about 60% reliable over ten steps (0.95^10 ≈ 0.599), because error compounds with every tool call. It's one reason 78% of agentic pilots stall before production.
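The compounding arithmetic can be checked with a short Python sketch. It assumes each step succeeds independently with the same probability, which is a simplification; the function name `chain_reliability` is hypothetical.

```python
def chain_reliability(step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps are independent and share the same accuracy,
    so the result is simply step_accuracy raised to the number of steps.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_accuracy ** steps


# How a 95%-accurate step degrades as the chain grows.
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps: {chain_reliability(0.95, n):.1%}")
```

At ten steps this prints 59.9%. Real agent steps are rarely independent, so the figure is a rough guide, not a measurement.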

Models pass benchmarks through shortcuts they can't repeat live, overstating real reliability by 20–40%. We check the structure behind a decision, not just whether the output matched.

The job is to show exactly where a system is safe, where it should be flagged, and where it must be blocked — as evidence a governance committee or auditor can act on.

We started where being wrong costs a life — the same mathematics maps the boundary wherever mistakes are expensive.

GET STARTED

Bring proof to your agents.

The methods behind Krv are peer-reviewed and field-tested well beyond healthcare — including work published in Nature Energy. See what they reveal about your data, models, and code.

Clinical AI · Drug Discovery · Energy Systems · Agent Code

Book a demo · View on GitHub →