Agent-era infrastructure.
Pulsar validates data. Pasteur stress-tests models. Topos audits generated code. Together, they turn agent behavior into evidence teams can review before deployment.
Data
Find persistent structure in messy datasets, then flag decisions made outside known regions.
Models
Stress-test model behavior and surface operational failure modes before signoff.
Code
Audit agent-written software for structure, security, and maintainability before merge.
Concrete work, not abstract scores.
Rare-disease trial matching
Find phenotype segments and missing screening signals before manual chart review.
Explore Pulsar→Hospital model signoff
Map where a clinical decision-support model should be trusted, flagged, or blocked.
Explore Pasteur→Agent engineering governance
Catch insecure or structurally weak generated code before merge.
Explore Topos→Average-case accuracy is not a safety claim.
A benchmark is an average over a frozen test set. Deployed agents meet rare inputs, drift, and chain decisions across dozens of steps.
A strong overall score can still hide the cases that matter most. Epic's sepsis model passed its vendor benchmark, then scored just 0.63 AUC in external validation — missing 67% of sepsis patients. We surface the subgroups where accuracy collapses.
Test sets are frozen; the world keeps moving — missing labs, new workflows, a shifted population. Deployed agents routinely show a ~37% gap between benchmark and real-world performance.
A 95%-accurate step is only 59%reliable over ten steps — error compounds with every tool call. It's one reason 78% of agentic pilots stall before production.
Models pass benchmarks through shortcuts they can't repeat live, overstating real reliability by 20–40%. We check the structure behind a decision, not just whether the output matched.
The job is to show exactly where a system is safe, where it should be flagged, and where it must be blocked — as evidence a governance committee or auditor can act on.
We started where being wrong costs a life — the same mathematics maps the boundary wherever mistakes are expensive.
Bring proof to your agents.
The methods behind Krv are peer-reviewed and field-tested well beyond healthcare — including work published in Nature Energy. See what they reveal about your data, models, and code.