Opinion
Current evaluations in academia and industry are falling behind the technology. We are building the stack to provide principled evaluations of code, models, and data.
By Jeremy Wayland
AI capabilities are expanding at light speed, but our ability to measure them remains stubbornly primitive.
Most teams still rely on binary “checkmark” metrics: Did the test pass? Was the accuracy high? But high-stakes environments—from clinical medicine to energy infrastructure—demand more than a pass/fail verdict.
At Krv, we believe that the next era of AI deployment depends on moving beyond these checkmarks toward a principled evaluation layer.
We are building for a future of Agentic Trust. To bridge the gap between research and reality, our strategy centers on three interconnected pillars: Data, Models, and Code. Together, they form a complete evaluation surface for the agentic systems that are replacing monolithic deployments.
We are engineering systems faster than we can understand them. Capability is entering the world at an unprecedented pace—and we are still unequipped to properly evaluate how these systems behave once they leave the lab.
This is the chasm. Not a gap between benchmark and deployment alone, but between building and knowing: we can ship agents, orchestrators, and models at scale, yet we lack the tools to hold them accountable, probe their failure modes, and explain why they act as they do.
In our takeaways from the Pediatric & Lifespan Data Science Conference, we saw the same pattern in clinical settings: models that excel in the sandbox routinely underperform in the field. That pattern shows up wherever AI meets high-stakes operations.
Bridging the chasm means building evaluation infrastructure: methods and tools that measure compositional systems end to end, enforce principled accountability, and probe behavior in ways that yield real explainability. At Krv, our strategy addresses three distinct surfaces:
Data: the context used to ground decisions and retrieve information.
Models: the capabilities invoked as tools and specialist oracles.
Code: the execution layer that integrates and acts on the environment.
Miss one layer, and the failure modes are compositional: pristine models fed biased data; robust tools wired together by fragile glue code; or excellent benchmarks on data that doesn't resemble production.
Below, we walk through our stack for each pillar, in the order agents actually use them.
The first lever agents pull is context. At Krv, we believe data evaluation must go beyond simple statistics.
Summary statistics hide the semantic islands and structural anomalies that actually break deployments. We approach data evaluation by mapping the manifold.
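To make the point about summary statistics concrete, here is a minimal sketch in Python. It is not Krv's method or any part of its stack: the two datasets are made up, and `largest_gap_ratio` is a hypothetical toy heuristic introduced only for this illustration. The two samples share the same mean and standard deviation, yet one is evenly spread and the other splits into two separated islands.

```python
import statistics


def largest_gap_ratio(values):
    """Ratio of the largest neighbor gap to the median gap in sorted data.

    A ratio near 1 suggests evenly spread points; a large ratio suggests
    separated clusters. Toy heuristic for illustration only.
    """
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return max(gaps) / statistics.median(gaps)


# Hypothetical one-dimensional feature: evenly spread values.
spread = [-1.5, -0.9, -0.3, 0.3, 0.9, 1.5]

# Hypothetical feature with two tight clusters, rescaled so that its
# mean and standard deviation match those of `spread`.
clusters_raw = [-1.1, -1.0, -0.9, 0.9, 1.0, 1.1]
scale = statistics.pstdev(spread) / statistics.pstdev(clusters_raw)
clustered = [x * scale for x in clusters_raw]

for name, data in [("spread", spread), ("clustered", clustered)]:
    print(
        name,
        "mean:", round(statistics.mean(data), 6),
        "std:", round(statistics.pstdev(data), 6),
        "gap ratio:", round(largest_gap_ratio(data), 2),
    )
```

The mean and standard deviation are indistinguishable across the two samples, while the gap ratio differs by an order of magnitude. Real structural analysis needs far richer tools than a one-dimensional gap check, but the sketch shows why a dashboard of aggregate statistics can look healthy while the data has a shape that matters.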
We believe specialist models must be treated as precision instruments, not oracles.
A model that performs well on a static test set often collapses when it meets the stochastic reality of production inputs. At Krv, we treat the orchestration layer itself as a primary evaluation surface.
If a system is going to fail, we want it to fail in simulation—not in the ICU or on the trading floor.
We believe that “Pass@k” and passing unit tests are insufficient proxies for safety and quality.
They tell you whether code ran, but say little about whether it is secure, maintainable, or safe to merge.
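For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator popularized by the HumanEval benchmark: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples passes. The sketch below is a plain Python rendering of that formula, not part of Krv's tooling; the function name `pass_at_k` is our own.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n generated samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all failures.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples, 3 pass the tests.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

Note what the inputs are: a count of samples and a count of test passes. Nothing in the calculation reflects whether the passing code is secure, maintainable, or safe to merge, which is why the score can only ever be a partial signal.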
Krv is building the foundational layer for Agentic Trust.
As agents take on more autonomy, the bottleneck for deployment becomes verification. You cannot trust an orchestrator if you cannot evaluate the data it grounds on, the models it invokes, or the code it executes.
By providing a unified open stack for Data (Trailed & Pulsar), Models (Pasteur), and Code (Topos), we are giving teams the empirical evidence required to bridge the evaluation chasm.
The challenges of AI evaluation are too large for any one team to solve alone. If you believe in moving beyond binary metrics, follow our work on GitHub.
Why clinical AI fails at deployment—and why agentic architectures need rigorously evaluated specialist tools.
Open-source framework for defining and optimizing code generation priorities using finite lattices.
High-performance Rust implementation of topological data analysis for massive datasets.
Representation learning methods for robust data evaluation and embedding extraction.