Pasteurizing Sepsis Prediction

Two recent pieces in Nature Medicine frame a hard truth about clinical AI: a model that wins on hold-out accuracy can still be the wrong model to deploy. We set out to show what proportional evidence looks like in practice.

Fig 1. The rank flip. LSTM wins on standard metrics; Attention wins on robustness. Read on for the evidence behind each side of this flip.

The Evidence Gap

An editorial in Nature Medicine argues that claims of clinical AI value must be backed by proportional evidence — the stronger the claim, the stronger the evidence required. As the editors put it:

“A system may perform very well in retrospective validation and still fail to improve care if its outputs are poorly timed, difficult to interpret, inconsistently acted upon or disruptive to clinical workflows.”

In a separate correspondence, Omar et al. argue that apparent contradictions in LLM performance are often a testing problem, not a capability problem. In the review they cite, diagnostic accuracy ranged from 25% to 98% across similar evaluations — a nearly fourfold spread.

These are not abstract concerns. For a hospital considering an early sepsis detection tool, the stakes are life and death. Standard metrics — hold-out accuracy, AUROC, F1 — answer one question: does it work on average? They do not tell you how the model breaks, when, or for whom. And if you deploy a model without documenting those failure modes, you are carrying unquantified liability.

We built Pasteur to produce exactly the crash-test evidence that Nature Medicine says is missing. It stress-tests clinical AI the way a crash test stress-tests a car — and the resulting report documents what was tested, how each model broke, and why the chosen architecture is safer to deploy.

Training Pipeline

We reproduce the multi-center sepsis benchmark described by Moor et al. (2023) and train on the PhysioNet 2019 Challenge ICU data, keeping the same train/validation/test logic and benchmark framing.

Dataset & Subset Selection

The full PhysioNet 2019 dataset contains two hospital cohorts: Hospital A (20,336 patients, 8.8% sepsis prevalence) and Hospital B (20,000 patients, 5.7% prevalence). For our experiments we draw a stratified patient-level subsample of 5,000 patients from each cohort, targeting an approximate sepsis prevalence of 10% for Hospital A (500 sepsis / 4,500 non-sepsis) and 5% for Hospital B (250 / 4,750).

Subsampling operates at the patient level, not the row level: patients are classified as “ever sepsis” or “never sepsis” by their maximum SepsisLabel, then each pool is sampled independently at the target ratio. All hourly rows for selected patients are retained. The process is seed-controlled for reproducibility.
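
The patient-level draw described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline's actual code: the function name subsample_patients, the tuple layout of the rows, and the toy cohort are all assumptions made for the example.

```python
import random
from collections import defaultdict

def subsample_patients(rows, n_sepsis, n_non_sepsis, seed=0):
    """Pick patients (not rows) at a target sepsis / non-sepsis count.

    rows: iterable of (patient_id, hour, sepsis_label) tuples.
    Returns the set of selected patient ids.
    """
    # A patient is "ever sepsis" if the maximum hourly label is 1.
    max_label = defaultdict(int)
    for patient_id, _hour, label in rows:
        max_label[patient_id] = max(max_label[patient_id], label)

    # Sort ids so the draw depends only on the seed, not on input order.
    sepsis_pool = sorted(p for p, y in max_label.items() if y == 1)
    non_sepsis_pool = sorted(p for p, y in max_label.items() if y == 0)

    # Each pool is sampled independently at its own target count.
    rng = random.Random(seed)
    chosen = rng.sample(sepsis_pool, n_sepsis) + rng.sample(non_sepsis_pool, n_non_sepsis)
    return set(chosen)

# Toy cohort (hypothetical): 100 patients, 5 hourly rows each, 20 ever-sepsis.
rows = []
for pid in range(100):
    ever_sepsis = pid < 20
    for hour in range(5):
        rows.append((pid, hour, 1 if (ever_sepsis and hour >= 3) else 0))

selected = subsample_patients(rows, n_sepsis=5, n_non_sepsis=45, seed=42)
# All hourly rows of a selected patient are retained.
kept_rows = [r for r in rows if r[0] in selected]
```

Sampling ids first and filtering rows afterwards is what keeps each patient's full hourly history intact.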

The subsample is split 80/10/10 (train / val / test), stratified by per-patient sepsis label so prevalence is preserved within each partition. Hospital A provides train, validation, and test splits; Hospital B provides a held-out cross-site test set (Test B) for generalizability evaluation.
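
As a rough sketch of what a stratified 80/10/10 patient split looks like, the snippet below splits each label group separately and then merges the pieces. The function name stratified_split and the toy label dictionary are hypothetical; the real pipeline may use a library splitter instead.

```python
import random

def stratified_split(patient_labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split patient ids into train / val / test, one label group at a time.

    patient_labels: dict of patient_id -> 0/1 per-patient sepsis label.
    """
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label in (0, 1):
        ids = sorted(p for p, y in patient_labels.items() if y == label)
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]  # remainder goes to test
    return splits

# Toy Hospital A subsample: 5,000 patients, 500 of them ever-sepsis.
labels = {pid: int(pid < 500) for pid in range(5000)}
splits = stratified_split(labels, seed=7)
```

Because each label group is cut at the same fractions, every partition in this toy example keeps the 10% prevalence of the subsample.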

Features

Raw inputs: 35 clinical features (8 vitals + 27 labs) plus 6 demographics, recorded hourly per patient.
Unsupervised feature engineering: missingness indicators, cumulative measurement counts, forward-fill, 9 derived clinical scores (qSOFA, SOFA, SIRS, MEWS, Shock Index, and more), and rolling lookback statistics (min/max/mean/median/var over 4, 8, and 16-hour windows). All computed per-patient before imputation to avoid leaking imputed values into rolling statistics.
Final feature space: ~780 features per hour, arranged into 6-step sliding windows with a 6-hour prediction horizon.
Label strategy: we follow the labeling convention used by Moor et al., widening the positive window so that it starts 6 hours earlier and ends 24 hours later, to capture the physiological lead-up to sepsis. All evaluation uses the raw SepsisLabel and the PhysioNet utility score described by Reyna et al., a clinically motivated score that rewards timely detection and penalizes false alarms.
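
To make the feature steps above concrete, here is a small sketch for a single variable of a single patient. It covers the missingness indicator, cumulative measurement count, forward-fill, rolling statistics computed on raw (not imputed) values, and the grouping into 6-step windows. The function names and the toy heart-rate series are assumptions for illustration; the derived clinical scores and the remaining statistics are omitted.

```python
import math
import statistics

def engineer_series(values, lookbacks=(4, 8, 16)):
    """Expand one patient's hourly series for one variable into features.

    values: list of floats, with math.nan where nothing was recorded.
    Returns one feature dict per hour.
    """
    out = []
    last_seen = math.nan  # forward-fill state
    count = 0             # cumulative number of real measurements
    for t, v in enumerate(values):
        missing = math.isnan(v)
        if not missing:
            last_seen = v
            count += 1
        feats = {"missing": int(missing), "count": count, "ffill": last_seen}
        for w in lookbacks:
            # Rolling stats read the raw series, so imputed values never leak in.
            window = [x for x in values[max(0, t - w + 1): t + 1] if not math.isnan(x)]
            feats[f"mean_{w}h"] = statistics.fmean(window) if window else math.nan
            feats[f"min_{w}h"] = min(window) if window else math.nan
            feats[f"max_{w}h"] = max(window) if window else math.nan
        out.append(feats)
    return out

def sliding_windows(hourly_rows, steps=6):
    """Group consecutive hourly feature rows into fixed-length windows."""
    return [hourly_rows[i:i + steps] for i in range(len(hourly_rows) - steps + 1)]

# Toy series (hypothetical): heart rate with charting gaps.
heart_rate = [80.0, math.nan, 90.0, math.nan, math.nan, 110.0, 120.0, math.nan]
features = engineer_series(heart_rate)
windows = sliding_windows(features, steps=6)
```

Running the same expansion over every raw variable and every statistic is what grows a few dozen raw inputs into a feature space of several hundred columns per hour.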

Two Architectures

LSTMs process data like a narrative, remembering the past in order; attention-based models (transformers) look at the whole history at once to spot critical “warning signs” regardless of when they occurred.

Attention (ReZero Transformer)

2-layer, 128-dim, 8-head with causal masking
Sinusoidal positional encoding
MLP classification head
40 epochs, lr ≈ 6.5e-5
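
Three of the ingredients listed above are easy to show without a deep-learning framework: the fixed sinusoidal positional encoding, the causal mask, and the ReZero-style residual, in which each sublayer output is scaled by a learnable scalar that starts at zero. The sketch below uses plain Python lists and hypothetical function names; it illustrates the ideas only and is not the trained model.

```python
import math

def sinusoidal_positions(n_steps, d_model):
    """Fixed sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(n_steps):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def causal_mask(n_steps):
    """mask[q][k] is True when query step q may attend to key step k."""
    return [[k <= q for k in range(n_steps)] for q in range(n_steps)]

def rezero_residual(x, sublayer_out, alpha):
    """ReZero residual: x + alpha * F(x), with alpha initialised to 0."""
    return [xi + alpha * fi for xi, fi in zip(x, sublayer_out)]

pe = sinusoidal_positions(n_steps=6, d_model=128)  # 6-step window, 128-dim model
mask = causal_mask(6)
```

With alpha at zero every layer starts out as the identity, and the causal mask ensures that the representation at a given hour never reads later hours within the window.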

LSTM

3-layer, 256-dim, multi-layer dropout
Last-timestep hidden state classification
46 epochs, lr ≈ 1.6e-4

Artifacts

Final model settings are selected through a structured hyperparameter search, locked in configuration files, and retrained deterministically on the same cached splits. We also cross-check the implementation against the public Borgwardt Lab reference repository that accompanies the multicenter sepsis study. Each paired run produces checkpoints, prediction arrays, and an audit trail of resolved hyperparameters — no silent defaults.

Pasteur Simulations

Pasteur stress-tests models across three dimensions. Every simulation runs on three splits: Validation A (in-distribution), Test A (same hospital), and Test B (different hospital — cross-site generalizability).

Fig 2. Pasteur stress-testing pipeline. Three simulation batteries (Stability, Resiliency, Generalizability) across three evaluation splits.

Stability (Jitter)

Do small changes in input cause unreasonable prediction swings? Stability scores (where 1.0 represents zero change in prediction under noise) measure performance retention.

Calibrated Gaussian noise (scaled as multiples of the training set standard deviation for each feature) is injected into clinical measurements. The goal is to approximate ordinary bedside variability and assay imprecision rather than catastrophic device failure. We anchor the perturbation scales to familiar clinical intuition: chemistry measurements are stressed over relatively narrow ranges, while noisier bedside respiratory signals are stressed over wider ones. Fraser's review of biological variation in laboratory medicine provides the general rationale for calibrating perturbations to expected measurement variability. Features are grouped by clinical draw pattern — physiologically coupled measurements are perturbed together as a bundle, while independently measured signals are stressed in isolation. Eight sub-batteries cover the full clinical feature space.
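
A minimal sketch of the jitter idea follows: noise is drawn per feature with a standard deviation equal to a chosen multiple of that feature's training-set standard deviation, and every feature in a bundle is perturbed in the same pass. The function names, the example bundle, and in particular the stability formula are assumptions for illustration; the article does not spell out how Pasteur computes its stability score.

```python
import random

def jitter(rows, bundle, train_std, scale, seed=0):
    """Return a copy of rows with Gaussian noise added to the bundled features.

    rows: list of dicts mapping feature name -> value (one dict per patient-hour).
    bundle: feature names perturbed together in this sub-battery.
    train_std: dict of feature name -> training-set standard deviation.
    scale: noise level as a multiple of the training standard deviation.
    """
    rng = random.Random(seed)
    noisy = []
    for row in rows:
        new_row = dict(row)
        for name in bundle:
            value = row.get(name)
            if value is not None:  # missing measurements stay missing
                new_row[name] = value + rng.gauss(0.0, scale * train_std[name])
        noisy.append(new_row)
    return noisy

def stability(clean_preds, noisy_preds):
    """Hypothetical score: 1.0 means the predictions did not move at all."""
    shift = sum(abs(a - b) for a, b in zip(clean_preds, noisy_preds)) / len(clean_preds)
    return 1.0 - shift

# Toy example (hypothetical bundle and standard deviations).
rows = [{"HR": 80.0, "Resp": 18.0, "Lactate": None},
        {"HR": 95.0, "Resp": 22.0, "Lactate": 2.4}]
train_std = {"HR": 15.0, "Resp": 5.0}
noisy_rows = jitter(rows, bundle=["HR", "Resp"], train_std=train_std, scale=0.1, seed=1)
```

Scoring the model on rows and on noisy_rows, then comparing the two sets of predictions, gives one point on the stability curve for that bundle and noise scale.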

Resiliency (Blackout)

How much performance do we lose when data goes missing?

Random fractions of feature values are masked to NaN on an hourly basis, simulating lab delays, EHR downtime, and charting gaps. This reflects real-world ICU charting constraints; the PhysioNet 2019 Challenge data described by Reyna et al. are highly sparse at the hourly level, especially for laboratory variables. Features are stratified by clinical ordering frequency — high-frequency labs start at 30% missing, while infrequent labs (already ≥80% missing at baseline) start at 70%. Bundled sub-batteries mask all features in a panel simultaneously; independent sub-batteries mask each feature separately.
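
The difference between bundled and independent masking can be sketched as follows. In the bundled case a single random draw per hour decides whether the whole panel goes missing; in the independent case each feature gets its own draw. The function name blackout, the panel, and the toy rows are hypothetical; the 30% level matches the starting fraction quoted above for high-frequency labs.

```python
import math
import random

def blackout(rows, panel, fraction, bundled, seed=0):
    """Mask a fraction of hourly values to NaN for the features in `panel`.

    bundled=True  -> one draw per hour; the whole panel goes missing together.
    bundled=False -> one independent draw per feature per hour.
    """
    rng = random.Random(seed)
    masked = []
    for row in rows:
        new_row = dict(row)
        drop_panel = bundled and rng.random() < fraction
        for name in panel:
            drop = drop_panel if bundled else rng.random() < fraction
            if drop:
                new_row[name] = math.nan
        masked.append(new_row)
    return masked

# Toy data (hypothetical): a two-feature panel plus one untouched vital sign.
rows = [{"WBC": 9.0, "Platelets": 200.0, "HR": 80.0} for _ in range(200)]
panel = ["WBC", "Platelets"]
bundled_rows = blackout(rows, panel, fraction=0.30, bundled=True, seed=3)
independent_rows = blackout(rows, panel, fraction=0.30, bundled=False, seed=3)
```

Bundled masking mimics a whole lab panel arriving late, while independent masking mimics scattered charting gaps.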

Generalizability (Cohort)

Does performance hold for every patient subpopulation?

No noise or masking here, only cohort slicing. Patients are segmented by clinical queries (early ICU stay, age ≥ 70, tachycardia, elevated lactate, hypoxemia) and by data-driven clusters. Because the PhysioNet Challenge includes patients from multiple hospitals, we can train on Hospital A, evaluate standard in-distribution performance on Test A, and then measure cross-site generalizability on Hospital B. To understand where generalization breaks down, we map every evaluation patient back into subcohorts defined on the Hospital A training set, so performance on the external cohort can be interpreted in terms of known clinical profiles rather than only as a single aggregate number.
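
The two kinds of slicing can be sketched as simple predicates plus a nearest-centroid lookup. Only the age ≥ 70 cutoff comes from the text; the other thresholds (24 hours, 100 bpm, 2.0 mmol/L, 90%) are common clinical rules of thumb used here as assumptions, and the centroid-based assignment is one plausible reading of "data-driven clusters", not a description of Pasteur's actual method.

```python
import math

# Age >= 70 comes from the text; every other threshold is an illustrative assumption.
COHORT_QUERIES = {
    "early_icu_stay": lambda r: r["ICULOS"] <= 24,
    "age_70_plus": lambda r: r["Age"] >= 70,
    "tachycardia": lambda r: r["HR"] > 100,
    "elevated_lactate": lambda r: r["Lactate"] > 2.0,
    "hypoxemia": lambda r: r["O2Sat"] < 90,
}

def slice_cohorts(rows):
    """Return cohort name -> rows matching that clinical query.

    Missing values are stored as NaN, which fails every comparison,
    so a row with a missing lab never lands in that lab's cohort.
    """
    return {name: [r for r in rows if query(r)] for name, query in COHORT_QUERIES.items()}

def assign_cluster(features, train_centroids):
    """Index of the nearest centroid; centroids are fit on the training set only."""
    distances = [math.dist(features, c) for c in train_centroids]
    return distances.index(min(distances))

# Toy rows (hypothetical values).
rows = [
    {"ICULOS": 5, "Age": 75, "HR": 110.0, "Lactate": math.nan, "O2Sat": 96.0},
    {"ICULOS": 40, "Age": 52, "HR": 85.0, "Lactate": 3.1, "O2Sat": 88.0},
]
cohorts = slice_cohorts(rows)
cluster = assign_cluster((0.9, 0.8), train_centroids=[(0.0, 0.0), (1.0, 1.0)])
```

Reusing centroids from the Hospital A training set for Hospital B patients is what lets an external-site score be read against a known clinical profile.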

Results

Four views of the same head-to-head: standard evals, stability, generalizability, and resiliency.

Fig 3.1. Standard Evals. Standard evaluation metrics comparing Attention vs LSTM across Validation A and Test A splits.

At a fixed specificity of 0.70 (a 30% false-positive rate), the LSTM detects 86% of sepsis cases in validation vs. 72% for Attention. On the held-out test set: 76% vs. 55%. F1, Recall, Precision — LSTM leads across the board. If this were the whole story, the deployment decision would be obvious.
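
For readers unfamiliar with the metric, sensitivity at a fixed specificity can be computed by scanning candidate thresholds, keeping those that leave at least 70% of non-sepsis cases unflagged, and reporting the best sensitivity among them. The sketch below is a simple quadratic-time version with made-up scores; the function name and the toy data are assumptions, and the article's own evaluation code may differ.

```python
def sensitivity_at_specificity(scores, labels, min_specificity=0.70):
    """Best sensitivity among thresholds whose specificity >= min_specificity.

    A case is flagged as sepsis when its score is >= the threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    # Every observed score is a candidate threshold; inf flags nobody.
    for threshold in sorted(set(scores)) + [float("inf")]:
        specificity = sum(s < threshold for s in neg) / len(neg)
        if specificity >= min_specificity:
            sensitivity = sum(s >= threshold for s in pos) / len(pos)
            best = max(best, sensitivity)
    return best

# Toy scores (hypothetical): 10 non-sepsis and 4 sepsis cases.
neg_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
pos_scores = [0.35, 0.75, 0.85, 0.99]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

result = sensitivity_at_specificity(scores, labels, min_specificity=0.70)  # 0.75
```

In this toy example a threshold of 0.75 leaves 7 of 10 negatives unflagged and catches 3 of 4 positives, which is how figures like "86% at specificity 0.70" should be read.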

Key Numbers

Metric	Attention	LSTM
Sensitivity @ Specificity ≥ 0.70 (val_A)	0.72	0.86
Sensitivity @ Specificity ≥ 0.70 (test_A)	0.55	0.76
F1 (val_A)	0.19	0.27
Recall (val_A)	0.35	0.49
Jitter stability — min across all strata	0.990	0.999
Cluster 0 (High-Risk) utility — test_A	0.159	0.062
Cluster 0 (High-Risk) utility — test_B	0.158	0.035
Utility loss @ 70% missing (mean)	~17%	26–35%
Utility loss @ 85% missing	22%	37%

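For each metric the better value is the higher one, except for utility loss, where lower is better. Standard metrics favor LSTM; the subgroup-utility and missing-data metrics favor Attention, while jitter stability stays near 1.0 for both models (0.990 vs. 0.999).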

Key Takeaways

Standard evaluations are necessary but not sufficient. LSTM won on every traditional metric — sensitivity, recall, F1 — and would have been the wrong deployment choice.
Simulations surface safety and explainability. Jitter, blackout, and cohort batteries map exactly where a model breaks, under what conditions, and for which patients. This is the proportional evidence that Nature Medicine is calling for.
Attention is the safer architecture to deploy for sepsis. It degrades more gracefully under missing data and maintains stronger performance on the high-risk minority subgroup across hospitals. Not because it scores higher on average, but because it fails more consistently and on patients it was already struggling with — rather than surprising you with new failure modes at go-live.
There will be no single golden metric. Old metrics are highly useful; they just are not the whole story. The solution will not be one-size-fits-all. At Krv we do not treat this as a one-way street: clinical use cases should shape the metrics, and the metrics should shape the model we are willing to deploy.

Simulate, Select, Harden

Selection is only step one. Once Pasteur has mapped exactly where a model breaks — which subgroups, which features, which failure modes — we use that diagnostic to go further: hardening the model itself. Targeted fine-tuning against the identified weak points. Retraining with stress-augmented data. Preserving the performance metrics the vendor promised while systematically improving robustness on the cases that matter most.

The goal is not just to pick the best model off the shelf. It is to take the model you have selected and make it safer for your hospital, your patients, your data — before a single real patient sees it.

A documented stress-test report is not just good engineering — it is evidence that limits liability. The next time you evaluate a clinical AI tool, demand more than accuracy. Ask how it performs on your specific patient population. Ask what happens when a monitor disconnects. Ask for the crash test.