Opinion
An opinion piece drawn from conference panels, work with hospital teams, and the evidence gaps laid out in our sepsis technical post.
By Jeremy Wayland
We recently attended the 3rd Annual Pediatric and Lifespan Data Science Conference. The presentations and hallway conversations left us with a familiar feeling—not discouragement, but clarity. Clinical leaders and researchers made one thing plain: the clinical AI field has a deployment problem, and the way we evaluate models today is a large part of why.
This opinion piece distills our three main takeaways from the conference: the reality of the model evaluation gap, the misconception that AGI will autonomously “solve” medicine, and why the future relies on agents equipped with specialized tools.
During a keynote panel, Dr. Anthony Chang put it plainly: there is a chasm between research performance and deployment success. Models that look excellent in retrospective validation routinely underperform or go unused once they meet the complexity of real clinical environments.
This is not a new observation, but the directness of the clinical voices in the room made it land differently. Vendors and EHR platforms are shipping functionality that clinicians did not ask for and do not find useful. Features arrive because they are technically possible, not because they solve a real workflow problem. This mismatch between evidence and adoption, in which well-evidenced AI goes unapplied while weakly evidenced features spread widely, is the field's defining failure mode, a paradox Eric Topol has written about extensively.
The root of this chasm lies in three specific holes in how we evaluate clinical AI:
Before we talk about fixing procurement, we need to address an assumption that is quietly shaping product strategy: the idea that a sufficiently large artificial general intelligence (AGI) will eventually just know medicine the way GPT-4 knows language.
It won't. The emergent properties of LLMs depend on training at massive scale on publicly accessible data. Healthcare data, however, is tightly guarded behind institutional moats, so medicine cannot contribute proportionally to the public training corpus.
You can train an LLM on every published medical textbook and journal. But as Dr. Aldo Faisal noted at the conference, that is like an English literature major reading a medical textbook and writing an elegant description. The fluency is real, but the clinical understanding is superficial. Foundation models trained on multimodal clinical data will excel at specific tasks, but they will hit ceilings because the data required to push past those ceilings sits inside individual health systems, out of reach of public training corpora.
There is a deeper problem too. Language has rules that generalize across billions of examples. The human body and the operational complexity of healthcare do not. Even if every clinical record ever produced were pooled into a single training run, it is far from clear that this would be sufficient to build a model capable of reasoning reliably across the dynamic, high-stakes decisions medicine demands. The data ceiling and the complexity ceiling are not the same problem—but they both point in the same direction.
If a monolithic AGI isn't the answer, what does effective clinical AI look like? The consensus points toward an agentic architecture.
Consider the Model Context Protocol (MCP) in software engineering—it lets AI agents connect to external tools at runtime rather than carrying all knowledge internally. Healthcare needs the same pattern. A general-purpose AI agent won't contain all relevant medical knowledge. Instead, it will act as an orchestrator, querying deeply trusted, institution-trained tools.
Those tools could be a sepsis risk model built on your patient population, or a readmission predictor calibrated to your workflows. They are precision instruments that a general-purpose agent invokes to get reliable, contextualized predictions. For this architecture to work securely, each instrument must be rigorously evaluated for robustness in the environment where it will be used.
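To make the orchestrator pattern concrete, here is a minimal sketch in Python. It is not the Model Context Protocol itself, and it is not our platform's implementation: it is a plain in-process tool registry that mimics the idea of an agent calling out to named, locally validated tools. Every name in it (ClinicalTool, Orchestrator, toy_sepsis_risk, the site label "hospital_a", and the feature names) is a hypothetical placeholder, and the scoring rule is a toy that has no clinical validity.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A tool maps a dict of patient features to a risk score in [0, 1].
ToolFn = Callable[[Dict[str, float]], float]


@dataclass
class ClinicalTool:
    """A hypothetical institution-trained tool the agent can invoke."""
    name: str
    description: str
    validated_site: str  # assumption: the site where the tool was evaluated
    predict: ToolFn


class Orchestrator:
    """Holds no medical knowledge itself; it only routes calls to tools."""

    def __init__(self, site: str) -> None:
        self.site = site
        self._tools: Dict[str, ClinicalTool] = {}

    def register(self, tool: ClinicalTool) -> None:
        # Refuse tools that were not evaluated in this deployment environment.
        if tool.validated_site != self.site:
            raise ValueError(
                f"{tool.name} was validated at {tool.validated_site}, "
                f"not at {self.site}"
            )
        self._tools[tool.name] = tool

    def call(self, tool_name: str, features: Dict[str, float]) -> float:
        # Look the tool up by name and delegate the prediction to it.
        if tool_name not in self._tools:
            raise KeyError(f"no registered tool named {tool_name}")
        return self._tools[tool_name].predict(features)


def toy_sepsis_risk(features: Dict[str, float]) -> float:
    """Toy placeholder rule for illustration only; not a clinical model."""
    score = 0.0
    if features.get("heart_rate", 0.0) > 100.0:
        score += 0.4
    if features.get("temperature_c", 37.0) > 38.0:
        score += 0.4
    return score


if __name__ == "__main__":
    agent = Orchestrator(site="hospital_a")
    agent.register(
        ClinicalTool(
            name="sepsis_risk",
            description="Toy sepsis risk score",
            validated_site="hospital_a",
            predict=toy_sepsis_risk,
        )
    )
    print(agent.call("sepsis_risk", {"heart_rate": 120.0, "temperature_c": 38.5}))
```

The design choice worth noticing is the check in register: the orchestrator accepts a tool only if it was evaluated for the site where it will run, which is the point of the paragraph above. In a real system that check would rest on a documented evaluation record, not a string comparison.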
As AI becomes more deeply embedded in EHR platforms, health systems will face an unprecedented procurement challenge. The question will not simply be “does this model work?” It will be: Which of these dozens of embedded AI tools should we adopt? Are they safe? Are clinicians actually using them, and are they improving outcomes?
The clinical voices at the conference were not being unreasonable when they pushed back on vendors. They were simply asking for functionality that has been properly evaluated for their specific context. To bridge this gap, the field needs infrastructure that stress-tests models before deployment, requires robust failure profiles, and continuously monitors clinical utility post-launch.
We are building the platform that equips health systems to make those decisions—stress-testing models before deployment, enforcing safety and compliance standards, and measuring clinical utility before and after launch.
The goal is not to replace clinical judgment. It is to give procurement and deployment teams the empirical evidence they deserve. The same proportional evidence standard that Nature Medicine called for in research must apply to every model a health system deploys. The crash test matters. Demand it.
Hosted by CHOC (part of Rady Children's Health) in Irvine, CA. Two days on AI, precision medicine, and health equity across the lifespan.
Where the evidence is strong, AI is not being applied. Where it is soft, it is being widely adopted. The defining failure mode of clinical AI today.
GE HealthCare's Head of AI Advocacy on why current evaluation frameworks overweight technical accuracy and underweight clinical utility, safety, and economic value.
Aldo Faisal's foundation health model project: a multimodal model trained on health data, not just text about health.
Frameworks for measuring real-world ROI and adoption of clinical AI, independent of vendor-reported metrics.
Our companion technical post: training, evaluating, and stress-testing two sepsis architectures on PhysioNet 2019.