Field Notes from the 3rd Annual Pediatric & Lifespan Data Science Conference

We recently attended the 3rd Annual Pediatric and Lifespan Data Science Conference. The presentations and hallway conversations left us with a familiar feeling—not discouragement, but clarity. Clinical leaders and researchers made one thing plain: the clinical AI field has a deployment problem, and the way we evaluate models today is a large part of why.

This opinion piece distills our three main takeaways from the conference: the reality of the model evaluation gap, the misconception that AGI will autonomously “solve” medicine, and why the future relies on agents equipped with specialized tools.

Why Clinical AI Keeps Failing at Deployment

During a keynote panel, Dr. Anthony Chang put it plainly: there is a chasm between research performance and deployment success. Models that look excellent in retrospective validation routinely underperform or go unused once they meet the complexity of real clinical environments.

This is not a new observation, but the directness of the clinical voices in the room made it land differently. Vendors and EHR platforms are shipping functionality that clinicians did not ask for and do not find useful. Features arrive because they are technically possible, not because they solve a real workflow problem. The asymmetry between evidence and adoption is the field's defining failure mode—a paradox Eric Topol has written about extensively.

The root of this chasm lies in three specific holes in how we evaluate clinical AI:

Sandbox Performance vs. Robustness: Most evaluation stops at hold-out accuracy on clean data. It doesn't reveal how a model degrades under input noise or patient distribution shifts. In our sepsis evaluation post, we demonstrated how testing for robustness flipped the apparent winner between two architectures. The deployment decision would have been wrong if we had stopped at the headline numbers.
Safety and Compliance: Bias and hallucination checks are increasingly well understood, but for many procurement teams they remain a checkbox exercise rather than an institutional habit.
Measuring Clinical Utility: The most crucial gap. An accurate model ignored by clinicians has zero clinical impact. Pontiro is doing notable work here, building structured frameworks for measuring whether AI tools deliver real clinical value. Jan Beger at GE HealthCare has similarly called for shared evaluation standards that prioritize clinical safety, economic value, and patient experience over raw technical accuracy.
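
To make the first gap concrete, here is a minimal sketch of a noise sweep on synthetic data. It is a toy illustration, not the analysis from our sepsis post: the cohort, the two models (model_a, model_b), and the noise levels are all hypothetical, chosen only to show how a ranking on clean data can reverse once inputs get noisy.

```python
import random
from statistics import mean

def make_cohort(n, seed=0):
    """Synthetic cohort: one latent signal measured by four redundant features."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        s = rng.uniform(-1.0, 1.0)
        rows.append([s, s, s, s])
        labels.append(s > 0)
    return rows, labels

def model_a(x):
    # Hypothetical model A: relies on a single feature.
    return x[0] > 0

def model_b(x):
    # Hypothetical model B: averages all features, with a slightly off threshold.
    return mean(x) > 0.02

def accuracy_under_noise(model, rows, labels, noise_sd, seed=1):
    """Accuracy after adding independent Gaussian noise to every feature."""
    rng = random.Random(seed)
    hits = 0
    for x, y in zip(rows, labels):
        noisy = [v + rng.gauss(0.0, noise_sd) for v in x]
        hits += model(noisy) == y
    return hits / len(rows)

rows, labels = make_cohort(2000)
for sd in (0.0, 0.25, 0.5):
    a = accuracy_under_noise(model_a, rows, labels, sd)
    b = accuracy_under_noise(model_b, rows, labels, sd)
    print(f"noise_sd={sd:.2f}  model_a={a:.3f}  model_b={b:.3f}")
```

In this setup model A should lead on clean inputs, while model B, which pools redundant measurements, should hold up better as noise grows. A real evaluation would replace the synthetic cohort with local data and the Gaussian noise with perturbations that reflect how inputs actually degrade at the deploying site.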

Why AGI Won't Solve This

Before we talk about fixing procurement, we need to address an assumption that is quietly shaping product strategy: the idea that a sufficiently large artificial general intelligence (AGI) will eventually just know medicine the way GPT-4 knows language.

It won't. The emergent properties of LLMs depend on training at massive scale on publicly accessible data. Healthcare data, however, is tightly guarded behind institutional moats. Medicine simply cannot contribute proportionally to the public training corpus.

You can train an LLM on every published medical textbook and journal. But as Dr. Aldo Faisal noted at the conference, that is like an English literature major reading a medical textbook and writing an elegant description. The fluency is real, but the clinical understanding is superficial. Foundation models trained on multimodal clinical data will excel at specific tasks, but they will hit ceilings because the data needed to push past them sits inside individual health systems, out of reach of any public training corpus.

There is a deeper problem too. Language has rules that generalize across billions of examples. The human body and the operational complexity of healthcare do not. Even if every clinical record ever produced were pooled into a single training run, it is far from clear that this would be sufficient to build a model capable of reasoning reliably across the dynamic, high-stakes decisions medicine demands. The data ceiling and the complexity ceiling are not the same problem—but they both point in the same direction.

The Case for Specialized Agents

If a monolithic AGI isn't the answer, what does effective clinical AI look like? The consensus points toward an agentic architecture.

Consider the Model Context Protocol (MCP) in software engineering—it lets AI agents connect to external tools at runtime rather than carrying all knowledge internally. Healthcare needs the same pattern. A general-purpose AI agent won't contain all relevant medical knowledge. Instead, it will act as an orchestrator, querying deeply trusted, institution-trained tools.

Those tools could be a sepsis risk model built on your patient population, or a readmission predictor calibrated to your workflows. They are precision instruments that a general-purpose agent invokes to get reliable, contextualized predictions. For this architecture to work safely, each instrument must be rigorously evaluated for robustness in the environment where it will be used.
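
As a rough sketch of that pattern, the orchestrator below holds no clinical knowledge of its own and only routes requests to registered tools. This is not the MCP SDK; it only mimics the register-and-invoke idea. Every name here (ClinicalTool, Orchestrator, toy_sepsis_risk, the site label hospital_a) is hypothetical, and the risk formula uses made-up coefficients with no clinical meaning.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class ClinicalTool:
    """A locally validated model exposed to the agent as a callable tool."""
    name: str
    validated_for: str  # site whose population the tool was evaluated on
    predict: Callable[[Dict[str, float]], float]

class Orchestrator:
    """Routes requests to registered tools; holds no clinical knowledge itself."""

    def __init__(self, site: str) -> None:
        self.site = site
        self._tools: Dict[str, ClinicalTool] = {}

    def register(self, tool: ClinicalTool) -> None:
        # Refuse tools that were not evaluated in this deployment environment.
        if tool.validated_for != self.site:
            raise ValueError(
                f"{tool.name} was validated for {tool.validated_for}, not {self.site}"
            )
        self._tools[tool.name] = tool

    def call(self, tool_name: str, patient: Dict[str, float]) -> float:
        if tool_name not in self._tools:
            raise KeyError(f"no registered tool named {tool_name!r}")
        return self._tools[tool_name].predict(patient)

def toy_sepsis_risk(patient: Dict[str, float]) -> float:
    # Made-up coefficients for illustration only; not a clinical model.
    z = 0.03 * (patient["heart_rate"] - 90) + 0.8 * (patient["lactate"] - 2.0)
    return 1.0 / (1.0 + math.exp(-z))

agent = Orchestrator(site="hospital_a")
agent.register(ClinicalTool("sepsis_risk", "hospital_a", toy_sepsis_risk))
risk = agent.call("sepsis_risk", {"heart_rate": 118, "lactate": 3.1})
print(f"sepsis_risk -> {risk:.2f}")
```

The design choice worth noting is the check in register: a tool validated on another site's population is rejected outright, which is one simple way to encode the requirement that each instrument be evaluated where it will be used.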

A New Procurement Problem

As AI becomes more deeply embedded in EHR platforms, health systems will face an unprecedented procurement challenge. The question will not simply be “does this model work?” It will be: Which of these dozens of embedded AI tools should we adopt? Are they safe? Are clinicians actually using them, and are they improving outcomes?

The clinical voices at the conference were not being unreasonable when they pushed back on vendors. They were simply asking for functionality that has been properly evaluated for their specific context. To bridge this gap, the field needs infrastructure that stress-tests models before deployment, requires robust failure profiles, and continuously monitors clinical utility post-launch.

What We're Building

We are building the platform that equips health systems to make those decisions—stress-testing models before deployment, enforcing safety and compliance standards, and measuring clinical utility before and after launch.

The goal is not to replace clinical judgment. It is to give procurement and deployment teams the empirical evidence they deserve. The same proportional evidence standard that Nature Medicine called for in research must apply to every model a health system deploys. The crash test matters. Demand it.
