Unpacking Multimodal AI in Healthcare: When Does Fusing EHR and X-Rays Really Help?

1: The Promise and Perils of Multimodal AI in Healthcare

Multimodal learning—combining different data types, such as electronic health records (EHR) and medical images—rests on an appealing intuition: a model that reads a patient's labs, history, vital signs, and chest X-ray together should form a more complete picture than any single source. The open question is whether that intuition holds up against real clinical data.

The two modalities are genuinely complementary. EHRs hold structured, longitudinal data—demographics, diagnoses, medications, lab results, and physiological measurements captured over time. Chest X-rays (CXR) add a view of anatomy and cardiopulmonary pathology that the record alone cannot supply. Fusing them should let a model lean on each where it is strongest.

In practice, though, most evidence comes from proof-of-concept studies, and existing clinical ML benchmarks evaluate performance under near-ideal conditions, assuming every modality is present and well-structured. That "complete modality" assumption rarely holds in a hospital, which leaves the practical question—when does fusion actually help?—largely unanswered.

Real clinical data is messy. Missing modalities are common: a patient may have extensive EHR data but no recent X-ray, or the reverse. The modalities also differ sharply in information density—hourly vital signs versus a single static image—which can skew how a model learns. That raises three questions the field has mostly sidestepped. Do more complex fusion architectures actually beat simpler ones? How well do these methods cope when a modality is missing? And does fusion widen or narrow performance gaps across patient subgroups?

To address these gaps, researchers from Hong Kong Baptist University and collaborators built CareBench, a benchmark for systematically evaluating EHR–CXR fusion. Their paper, "When Does Multimodal Learning Help in Healthcare? A Benchmark on EHR and Chest X-Ray Fusion," sets out to answer four questions:

When does multimodal fusion genuinely improve clinical prediction?
How do different fusion strategies compare against each other?
How robust are existing multimodal methods to missing modalities?
Does multimodal fusion preserve algorithmic fairness across patient subgroups?

The work is released as an open-source framework, so the analyses below can be reproduced and extended.

2: Building CareBench: What They Did

The researchers developed CareBench as a robust platform to systematically investigate multimodal learning by integrating Electronic Health Records (EHR) and Chest X-rays (CXR). Their methodology involved constructing standardized patient cohorts, extracting a rich set of features, implementing a diverse array of unimodal and multimodal models, and evaluating them across clinically relevant tasks under various conditions, including realistic data missingness and fairness considerations.

Data Sources and Cohort Construction

The foundation of CareBench lies in two large-scale, real-world intensive care unit (ICU) databases: MIMIC-IV and MIMIC-CXR. MIMIC-IV contains de-identified records of adult patients admitted to Beth Israel Deaconess Medical Center (BIDMC) between 2008 and 2019, providing a wealth of structured EHR data. MIMIC-CXR is a publicly available dataset of chest radiographs from BIDMC, allowing for the matching of patients across both datasets.

To ensure clinical relevance and consistency, two distinct cohorts of ICU stays were constructed from MIMIC-IV:

The Base Cohort: This cohort began with 73,181 ICU stays from MIMIC-IV. Rigorous exclusion criteria were applied, removing stays with missing essential clinical documentation (e.g., discharge notes, diagnostic codes), implausible temporal records, stays shorter than 6 hours (often observational or step-down care), non-urgent/elective admissions, and repeated ICU episodes within the same hospitalization. To focus on acute, clinically meaningful episodes with sufficient longitudinal information for robust prediction, stays shorter than 48 hours were further excluded. This meticulous process resulted in a base cohort of 26,947 ICU stays.
The Matched Subset: For multimodal evaluation, a subset of the base cohort was identified where at least one chest radiograph was available within a specific temporal window (24 hours prior to 48 hours after ICU admission). This yielded a matched subset of 7,149 ICU stays, representing patients with both structured EHR data and CXR images available. This "complete modality" subset let the authors measure fusion under ideal conditions before testing what happens when data is missing.

Feature Extraction and Preprocessing

A comprehensive set of structured EHR features was extracted from MIMIC-IV v2.2, encompassing clinically relevant variables across multiple physiological domains. These included vital signs (e.g., heart rate, respiratory rate, blood pressure, oxygen saturation, temperature, glucose), neurological status (Glasgow Coma Scale), cardiac rhythm, respiratory support parameters (O2 flow, FiO2), fluid balance (urine output), and body weight. Features with a missingness rate greater than 90% were empirically excluded, and treatment-related variables were removed to prevent label leakage. The EHR data was then resampled at an hourly resolution, missing values were imputed (forward filling and median imputation), and continuous variables were robustly normalized using median and interquartile range (IQR).

For CXR data, only frontal-view images with an Anterior-Posterior (AP) projection acquired during the patient's current ICU stay were included. The most recent CXR prior to the prediction timepoint was selected to best reflect the patient's latest cardiopulmonary status.

Benchmark Models

CareBench includes a broad spectrum of models, categorized to address various aspects of multimodal fusion:

Uni-modal Baselines: To establish a reference for each modality, the benchmark included Long Short-Term Memory networks (LSTM) and Transformer models for EHR (capturing temporal dependencies) and a ResNet-50 model (pretrained on ImageNet) for CXR.
Complete-Modality Multimodal Fusion Methods: This group of models assumes all modalities are present during both training and inference. It includes simple strategies like Late Fusion (concatenating unimodal embeddings) and more advanced architectures such as UTDE, DAFT, MMTM, AUG, and InfoReg, which employ sophisticated mechanisms for cross-modal interaction.
Missing-Modality Multimodal Fusion Methods: Crucially, the benchmark included models specifically designed to handle incomplete inputs, covering diverse paradigms like mixture-of-experts designs, disentangled representations, or latent-space imputation. These include HEALNet, Flex-MoE, DrFuse, UMSE, ShaSpec, M3Care, MedFuse, and SMIL. For models not explicitly designed for missing modalities, zero-imputation was applied to absent CXR features in experiments simulating missingness.

Downstream Tasks and Evaluation Metrics

Models were evaluated on three clinically relevant tasks:

Phenotyping Classification: Predicting 25 acute and chronic conditions (e.g., congestive heart failure, COPD) present during an ICU stay. This is a multi-label classification problem, assessed using Area Under the Receiver Operating Characteristic Curve (AUROC), Area Under the Precision-Recall Curve (AUPRC), and F1 score.
Mortality Prediction: Predicting in-hospital mortality within the first 48 hours of ICU admission (binary classification). Metrics included AUROC, AUPRC, and F1 score.
Length-of-Stay (LoS) Prediction: Estimating the remaining hospital stay (RLOS) in clinically meaningful intervals (e.g., 2-3 days, 14+ days). This is a multi-class classification problem with ordinal structure, evaluated using Accuracy (ACC), F1 score, and Cohen's Kappa weighted quadratic.

All tasks used patient-level train/validation/test splits. To ensure fair comparison, Bayesian hyperparameter optimization was performed, and permutation tests were used to assess statistical significance of improvements over unimodal baselines.

3: Key Findings and Insights

CareBench's rigorous evaluation yielded several critical insights into the behavior of multimodal learning, particularly concerning its benefits, the effectiveness of different fusion strategies, robustness to missing data, and algorithmic fairness.

When Multimodal Fusion Helps (Q1)

Key Finding

Multimodal fusion generally improves predictive performance when all data modalities are complete, with gains concentrated in diseases that require complementary information from both EHR and CXR. However, these benefits rapidly degrade under realistic missingness conditions unless models are explicitly designed to handle incomplete inputs.

The study found that under complete-modality settings (the matched subset), multimodal models consistently achieved the best performance across all tasks. For instance, in phenotyping, leading fusion approaches like InfoReg achieved an AUPRC of 0.495 and AUG an AUROC of 0.742, significantly outperforming the best unimodal EHR Transformer (AUPRC 0.472, AUROC 0.724). Similarly, for mortality prediction, DrFuse achieved the strongest AUPRC of 0.481 compared to the LSTM's 0.451, and MedFuse attained the highest AUROC of 0.851 (vs. Transformer's 0.821). While gains for Length-of-Stay (LoS) prediction were more modest, ShaSpec still achieved the highest Kappa of 0.198 relative to the best unimodal model (LSTM, 0.192). These results clearly indicate that when both EHR and CXR are available, CXRs provide valuable complementary information that enriches patient representations and improves predictions.

Within phenotyping, the benefit was not uniform across conditions but concentrated in "modality-distributed phenotypes." Diseases such as congestive heart failure, coronary heart disease, COPD, and liver disease benefited most. These conditions are characterized by manifestations that require both structural insights from CXR (e.g., pulmonary infiltrates, cardiac enlargement) and longitudinal risk factors, lab measurements, and comorbidity history from EHR. This highlights how multimodal models can effectively leverage complementary signals that neither modality could access alone. In contrast, conditions like sepsis, primarily driven by acute physiological changes, showed more limited fusion benefits, as CXR offers only indirect or nonspecific cues for such rapidly evolving states.

Comparing Fusion Strategies (Q2)

Key Finding

Advanced cross-modal learning mechanisms that facilitate information exchange significantly outperform simple concatenation (late fusion), capturing clinically meaningful dependencies. However, the rich temporal structure of EHR often creates a strong modality imbalance that complex architectures alone cannot overcome; methods explicitly designed to balance modalities perform best.

The benchmark demonstrated that cross-modal learning mechanisms capture clinically meaningful dependencies that simple concatenation misses. Methods designed for tighter cross-modal interaction, such as InfoReg (0.495 AUPRC), AUG (0.493 AUPRC), DrFuse (0.493 AUPRC), and UTDE (0.492 AUPRC), all significantly surpassed naive Late Fusion (0.489 AUPRC) for phenotyping. This performance gap underscores the clinical reality that imaging findings must be interpreted in context with physiological state. For example, in pneumonia (PNA) diagnosis, advanced methods like DrFuse (0.500) and UTDE (0.495) showed further gains beyond late fusion (0.493), reflecting the clinical diagnostic process where CXR infiltrates have different implications depending on EHR indicators like fever or leukocytosis.

A striking pattern emerged regarding modality imbalance: EHR's longitudinal richness creates a dominance that architectural complexity alone cannot overcome. EHR provides dense, continuous signals (hourly vital signs, lab trends) capturing disease progression in granular detail. CXR, on the other hand, is a single, static snapshot. This inherent information asymmetry means EHR often dominates the learning process. Models that explicitly address this imbalance, such as AUG (iteratively boosting weaker modalities) and InfoReg (slowing learning from information-rich modalities), consistently achieved superior performance, even with relatively simpler fusion architectures. This finding suggests that effective multimodal fusion must prioritize balancing the information density differences between modalities.

“EHR's rich temporal structure introduces strong modality imbalance that architectural complexity alone cannot overcome.”

“The success of InfoReg... and AUG... demonstrates that addressing this clinical data imbalance is more critical than architectural sophistication.”

Robustness to Modality Missingness (Q3)

Key Finding

In real-world settings with prevalent missing modalities, naive application of complete-data models often fails. Specialized architectures explicitly designed for handling incomplete inputs are essential and significantly outperform both unimodal baselines and complete-data models in these scenarios. Severe missingness amplifies modality imbalance, highlighting the need for explicit balancing mechanisms.

Modality missingness is a pervasive issue in clinical practice, with approximately 75% of ICU stays in the base cohort lacking paired chest X-rays. The study revealed that in this realistic base cohort, the EHR-only Transformer baseline often outperformed many multimodal models designed for complete data, such as MMTM (mortality F1 of 0.663 vs. Transformer's 0.679, LoS Kappa of 0.1714 vs. Transformer's 0.204). This stark finding indicates that naively applying models designed for complete-case scenarios does not guarantee a benefit and often fails once missingness is introduced.

However, multimodal models explicitly designed to handle incomplete inputs, such as MedFuse, M3Care, and HEALNet, significantly outperformed both the unimodal baseline and complete-case fusion methods in the base cohort. For mortality prediction, MedFuse achieved the highest AUROC of 0.874, and for LoS prediction, it led with the best Kappa (0.213) and F1-score (0.203), surpassing the strong EHR Transformer.

The pervasive missingness of CXR data (~25% availability in the base cohort) further amplifies the inherent modality imbalance, leading to EHR-driven gradients dominating the learning process. Controlled experiments varying CXR missingness from 0% to 80% showed that models like InfoReg and AUG degraded notably slower than others. This resilience stems from their explicit mechanisms to counteract modality imbalance, ensuring that CXR features still learn discriminative patterns even when scarce.

Algorithmic Fairness Across Subgroups (Q4)

Key Finding

Multimodal fusion does not inherently improve algorithmic fairness; in fact, it can exacerbate subgroup disparities. These disparities primarily manifest as unequal sensitivity (under-detection) across demographic groups, rather than false positive rates.

Beyond predictive accuracy, the study rigorously assessed algorithmic fairness by stratifying model performance across racial subgroups using metrics like AUPRC gap, TPR gap, FPR gap, and ECE gap. The result cuts against a common assumption: multimodal fusion does not inherently improve algorithmic fairness. Despite improving overall predictive performance, all multimodal models exhibited larger AUPRC gaps than the unimodal Transformer baseline, indicating increased performance disparity across race subgroups. Several high-performing multimodal methods, including DrFuse, ShaSpec, SMIL, and InfoReg, consistently demonstrated higher gaps across all fairness metrics compared to the unimodal Transformer.

“Multimodal fusion does not inherently improve algorithmic fairness, with subgroup disparities primarily arising from unequal sensitivity across demographic groups.”

This suggests that performance gains from multimodal fusion do not automatically translate into improved fairness and may, in some cases, worsen subgroup disparities. Further analysis revealed that these fairness violations are driven more by unequal sensitivity (under-detection) than false positives (over-detection). TPR gaps were consistently larger than FPR gaps across nearly all models, highlighting that certain demographic groups are more prone to under-detection of conditions. This has critical implications for clinical deployment, as under-detection can lead to delayed or missed diagnoses for vulnerable populations.

4: Strengths, Limitations, and Future Directions

CareBench is a systematic, well-controlled analysis of multimodal learning in healthcare. Its constraints, noted below, also mark out where the work should go next.

Strengths of CareBench

Comprehensive and Systematic Benchmark: CareBench stands out as the first benchmark to systematically evaluate EHR-CXR fusion across a broad spectrum of models, clinical tasks, and, critically, real-world conditions like modality missingness and algorithmic fairness, rather than complete-data performance alone.
Rigorous Evaluation Protocol: The study moves beyond aggregate predictive performance, incorporating detailed analyses of model behavior under varying degrees of modality missingness and across different patient subgroups. The use of permutation tests for statistical significance adds robustness to the findings, ensuring that observed gains are not merely due to chance.
Actionable Guidance for Clinical Deployment: By clarifying *when* multimodal learning helps (complete data, modality-distributed diseases), *when it fails* (naive models with missingness), and *why* (modality imbalance, lack of fairness consideration), CareBench gives builders of clinical systems something concrete to design against.
Open-Source and Reproducible Toolkit: The release of a flexible, open-source benchmarking toolkit, including a data extraction pipeline and a unified framework for integrating models, is a significant contribution. This fosters reproducibility, enables plug-and-play integration of new models and datasets, and supports future extensions, accelerating research in the field.

Limitations and Future Directions

Dependency on MIMIC Data: While MIMIC-IV and MIMIC-CXR are large, widely accepted, and de-identified datasets, they represent a specific patient population from a single US academic medical center. The generalizability of these findings to other diverse healthcare systems, different clinical contexts, or more varied demographics (e.g., pediatric populations) needs further investigation. Future benchmarks could incorporate data from multiple institutions and diverse geographical regions.
Limited Modality Scope: The current benchmark focuses exclusively on EHR and chest X-ray images. Real-world clinical decision-making often involves a much wider array of data modalities, including free-text clinical notes, pathology reports, genomic data, waveforms (e.g., ECG), and other imaging types (e.g., CT scans, MRI). Expanding CareBench to include these additional modalities would offer a more complete picture of multimodal AI's potential and challenges.
Simulated Missingness Assumptions: While the study rigorously evaluated robustness to missing modalities by systematically varying CXR missingness and applying zero-imputation, the underlying assumption of "missing at random" might not fully capture the complexities of real-world clinical data incompleteness. Missingness in clinical data is often non-random and clinically driven (e.g., a patient might not get an X-ray because their condition doesn't warrant it, or due to logistical constraints). Future work could explore more sophisticated missing data mechanisms and their impact.
Narrow Focus on Fairness Attributes: The fairness analysis primarily focused on racial subgroups and specific fairness metrics (AUPRC gap, TPR gap, FPR gap, ECE gap). Fairness is a multifaceted concept, and disparities can arise across other sensitive attributes such as age, gender, socioeconomic status, and different disease severities. Future research should broaden the scope of fairness evaluation to encompass a wider range of demographic factors and fairness definitions (e.g., equal opportunity, equalized odds).
Modality imbalance remains unsolved: The finding that EHR's temporal richness dominates fusion—and that architectural complexity alone cannot fix it—points to a deeper problem than the current adaptive-learning fixes address. Integrating modalities with very different information densities and time structures may need new architectural or theoretical approaches, not just reweighting.

CareBench's contribution is to replace a vague optimism about "more data, better models" with specific conditions: fusion helps with complete data and modality-distributed diseases, fails when naive models meet missingness, and does not by itself deliver fairness. Those are the constraints any team deploying EHR–CXR fusion should design around.