A vendor pitches you a "foundation model" for radiology AI. It sounds impressive—like a cornerstone, a base to build upon, something universal. Then you read the fine print: chest CT only. Optimized for lung nodules specifically. Trained on 50,000 scans from three academic centers.

Foundation model? More like a very specialized tool with a misleading name.

In artificial intelligence, "foundation model" has become the buzzword of the decade. In most of AI, it refers to large-scale models like GPT-4 or BERT—trained on massive, diverse datasets, capable of generalization across a wide range of tasks. But in medical imaging, the term has been co-opted for something much narrower: models that are pre-trained on a subset of medical images and then fine-tuned for related applications.

The promise? These models can accelerate clinical AI development by requiring far less training data than starting from scratch. The problem? Nearly all of them are remarkably specific—and their value (and limitations) depend critically on whose data they learned from.

What "Foundation Model" Actually Means in Healthcare

In computer vision and natural language processing, foundation models are trained on enormous, heterogeneous datasets—millions of images from the internet, billions of words from books and websites. This breadth enables them to generalize: a model trained on ImageNet can recognize cats, cars, and clouds. A language model trained on web text can answer questions, write code, and translate languages.

Medical "foundation models," by contrast, are nearly always trained on much narrower datasets:

  • Single imaging modality: CT, MRI, X-ray, ultrasound—but rarely combinations
  • Single anatomical region: Brain, chest, abdomen—usually not multi-organ
  • Pre-trained without specific tasks: Using self-supervised methods like masked pixel filling or contrastive learning—not diagnosis or classification

Example: A "foundation model" for brain MRI might be pre-trained using self-supervised learning—predicting masked portions of images or learning to distinguish similar vs. dissimilar scans—without any diagnostic labels. You can then fine-tune it to detect multiple sclerosis lesions, brain tumors, or stroke—tasks within the same modality and body region. But you can't use it for chest X-rays, abdominal CT, or even brain ultrasound. And you certainly can't use it for ECG interpretation or pathology slide analysis.
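The masked-prediction pretext task described above can be sketched in a few lines. This is a deliberately toy illustration, not any vendor's actual training pipeline: the "scan" is a random array, and the "model" is a trivial mean-fill predictor standing in for a deep network. The point is that the training signal (reconstruct what was hidden) requires no diagnostic labels at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "scan": an 8x8 grayscale array standing in for one MRI slice.
image = rng.random((8, 8))

def mask_random_block(img, block=4, rng=rng):
    """Zero out a random block x block region; return the masked image and mask."""
    h, w = img.shape
    r = rng.integers(0, h - block + 1)
    c = rng.integers(0, w - block + 1)
    mask = np.zeros_like(img, dtype=bool)
    mask[r:r + block, c:c + block] = True
    masked = img.copy()
    masked[mask] = 0.0
    return masked, mask

masked, mask = mask_random_block(image)

# A trivial "model": predict every hidden pixel as the mean of the visible ones.
# A real foundation model uses a deep network here; the training signal is the same.
prediction = masked.copy()
prediction[mask] = masked[~mask].mean()

# Self-supervised loss: reconstruction error on the hidden region only.
# No diagnostic labels appear anywhere in this loop.
loss = float(np.mean((prediction[mask] - image[mask]) ** 2))
print(f"reconstruction loss on masked region: {loss:.4f}")
```

During pre-training, minimizing this kind of loss over many unlabeled scans is what forces the encoder to learn anatomy-aware features that fine-tuning can later reuse.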

So why call it a "foundation" model? Because within its narrow domain, it provides a starting point—a set of learned features that can be adapted to related tasks with far less data than training from scratch.1,2

"A medical foundation model isn't a foundation for all of medicine. It's a foundation for one specific slice of one specific imaging modality on one specific body part."

Why Most Medical Foundation Models Are So Narrow

The narrowness isn't arbitrary. It reflects fundamental challenges in medical AI development.

Medical Data Is Fragmented by Modality

Unlike natural images, which all share similar properties (RGB pixels, edges, textures), medical imaging modalities are fundamentally different from each other:

  • CT: Grayscale with a wide contrast range that carries quantitative meaning: each voxel measures X-ray attenuation in Hounsfield units
  • MRI: Usually grayscale with a much narrower range (typically about 7 bits), and many different weightings that change what the signal means
  • Ultrasound: Acquired as video, though usually only selected 2D frames are stored. Most images reflect echogenicity, although some quantitative properties can be measured. Nearly always acquired with a hand-held probe, making it highly operator-dependent
  • PET: Measures metabolic activity at lower spatial resolution and requires co-registration with anatomic imaging
  • Radiography: The "typical" X-ray most people think of, such as chest X-rays or bone X-rays to evaluate fractures and arthritis

A model trained on CT scans learns to recognize patterns in Hounsfield units—features that are meaningless in MRI or ultrasound. This isn't like transferring knowledge from photos of cats to photos of dogs. It's like transferring knowledge from photos to audio spectrograms.

Anatomical Specialization Is Essential

Even within a single modality, different body regions have entirely different normal anatomy, pathology, and clinical context:

  • A chest CT model learns ribs, lungs, mediastinum, vessels—none of which appear in brain imaging
  • An abdominal MRI model learns liver, kidneys, bowel—structures irrelevant to spine imaging. And because MRI sequences use different weightings, a given organ can look dramatically different across acquisitions

Training a model to "understand" all anatomical regions would require massive, diverse datasets and even greater computational resources.3

Self-Supervised Pre-Training, Task-Specific Fine-Tuning

Unlike earlier medical AI that required labeled data from the start, foundation models use self-supervised learning for pre-training—learning from unlabeled images by predicting masked regions, reconstructing corrupted images, or distinguishing similar vs. dissimilar scans. This matters because creating labels for medical images is time-consuming, requires expert knowledge, and demands that those labels be represented in a consistent fashion for the computer to learn from.

However, to build a clinically useful tool, you still need labeled data for fine-tuning—diagnostic annotations, pathology outcomes, treatment responses. The foundation model provides learned image representations; the fine-tuning teaches it what those representations mean clinically. While self-supervised pre-training is task-agnostic, the modality and anatomical constraints remain.4

The Real Value: Transfer Learning with Less Data

Despite being narrow, medical foundation models offer genuine value. The key advantage isn't breadth—it's data efficiency.

Training from Scratch: The Old Way

In the early days of AI (before foundation models were available), we would:

  1. Start with a randomly initialized neural network
  2. Collect and label thousands of MRI scans with metastases annotations
  3. Train the model from scratch, requiring substantial computational resources
  4. Hope you have enough data for the model to learn meaningful features

Problem: You need a lot of labeled data—often tens of thousands of examples—to achieve good performance. If you're working on a rare disease or a small institution's dataset, you're out of luck.

Transfer Learning: The New Way

The next step was to use a pre-trained model that had been trained on ImageNet (a large collection of photographs). While these weren't medical images, they did share some properties with medical images, such as edges and basic textures. So we would:

  1. Start with a model that has already learned general representations of photographic structure
  2. Perform 'transfer learning': take the existing weights and adjust them for medical images. This is quite similar to how foundation models are used.

And now we arrive at foundation models: models built with the intent that they perform no task out of the box, but are ready to be adapted to one. In that case, we would:

  1. Start with a model that has already learned general representations of brain MRI structure
  2. Fine-tune it on a much smaller dataset of metastases cases—maybe hundreds instead of thousands
  3. Achieve comparable or better performance with far less data and compute

Why it works: The foundation model has already learned low-level features (edges, textures, intensity patterns) and mid-level features (anatomical structures, tissue boundaries) through self-supervised learning. You're teaching it to apply those representations to a specific clinical task.

This is transfer learning—and it's the main reason medical foundation models exist. They reduce the data requirements for developing clinical AI tools, making it feasible to build models for rare diseases, underserved populations, or institution-specific use cases.2,5
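The fine-tuning workflow can be sketched with a toy example. Everything here is synthetic and illustrative: the "frozen encoder" is just a fixed random projection standing in for a pre-trained network, and the "diagnostic label" is an arbitrary function of the fake pixel data. What the sketch shows is the key structural idea: the backbone's weights stay frozen, and only a small classification head is trained on the limited labeled set.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a pre-trained encoder: a fixed (frozen) random projection that
# maps raw "images" to feature vectors. In practice this is a deep network
# whose weights came from self-supervised pre-training.
def frozen_encoder(x, W=rng.standard_normal((16, 4))):
    return np.tanh(x @ W)  # W is never updated during fine-tuning

# Small labeled fine-tuning set: 40 synthetic "scans" (16 pixels each).
X = rng.standard_normal((40, 16))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # arbitrary synthetic "diagnosis"

features = frozen_encoder(X)  # (40, 4): only the head below is trained

# Fine-tune a lightweight logistic-regression head with gradient descent.
w, b = np.zeros(4), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # sigmoid predictions
    w -= 0.5 * (features.T @ (p - y) / len(y))     # gradient step on weights
    b -= 0.5 * float(np.mean(p - y))               # gradient step on bias

acc = float(np.mean((p > 0.5) == (y > 0.5)))
print(f"training accuracy with a frozen backbone: {acc:.2f}")
```

Because only five parameters (four weights plus a bias) are learned, a few hundred labeled cases can suffice; that is the data-efficiency argument in miniature.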

Real-World Impact: Data Efficiency in Practice

Studies show that using pre-trained medical foundation models can reduce the labeled data requirement by 5-10x compared to training from scratch. For rare diseases with limited cases, this can be the difference between "impossible to build" and "clinically viable."

Example: A foundation model pre-trained on 100,000 chest X-rays can be fine-tuned to detect pediatric pneumonia with as few as 500 labeled pediatric cases—whereas training from scratch might require 5,000-10,000 cases to achieve the same accuracy.5

What Foundation Models Don't Automatically Provide

While foundation models improve data efficiency, they don't solve all clinical AI challenges:

  • Explainability: Self-supervised pre-training learns abstract representations, not clinical concepts. The model may detect disease accurately without explaining why in clinically interpretable terms. You still need separate approaches (attention maps, saliency methods) to generate explanations—and these may not align with radiologist reasoning.
  • Localization: Foundation models learn image-level features, not necessarily spatial localization. If your clinical task requires precise lesion boundaries or anatomic localization, you may need additional fine-tuning strategies or task-specific architectures—transfer learning doesn't automatically provide pixel-level precision.
  • Clinical context: Models trained on images alone don't understand patient history, lab values, or clinical presentation. Multi-modal foundation models (combining imaging + EHR) are emerging but remain rare.

Foundation models are powerful starting points, but they're not plug-and-play solutions for all clinical AI needs. Understanding their limitations is as important as understanding their strengths.

Why Demographics Matter: The Hidden Variable in Foundation Models

Here's the part most foundation model papers bury in supplementary materials—or omit entirely: Who is represented in the training data?

When you fine-tune a foundation model on your local dataset, you're not starting from a blank slate. You're starting from a model that has already learned patterns from its pre-training data. If that data is biased, your fine-tuned model inherits that bias—even if your fine-tuning data is perfectly balanced.

Bias Propagation Through Transfer Learning

Foundation models encode the demographic distributions they were trained on. If the pre-training dataset is:

  • 90% white patients → the model learns features optimized for white patients
  • Predominantly from academic medical centers → it learns patterns specific to tertiary care populations
  • Imbalanced by sex, age, or socioeconomic status → it under-represents minority groups

When you fine-tune this model on a small dataset (which is the whole point of transfer learning), the foundation model's biases are likely to dominate. Your 200 fine-tuning examples can't override the millions of examples the model saw during pre-training.6,7

Result: The model performs worse on populations underrepresented in the foundation training set—even if you tried to include them in your fine-tuning data.

Example: Skin Lesion Classification

Dermatology AI models, including foundation models for skin lesion classification, have been shown to perform significantly worse on darker skin tones. Why? Because the pre-training datasets (often scraped from public dermatology atlases) overwhelmingly feature lighter skin tones.

Even if you fine-tune the model on a balanced dataset with diverse skin tones, the foundation model has already learned features optimized for lighter skin—edge detection, texture analysis, color distributions that work best for one demographic.8

Fine-tuning can improve performance on underrepresented groups, but it can't fully undo the foundation model's priors. The bias is baked in from the start.

The Pulse Oximetry Parallel

This is conceptually similar to the pulse oximetry problem (discussed in our AI bias article): pulse oximeters were calibrated on light-skinned individuals, leading to systematic errors in darker-skinned patients. You can recalibrate a pulse oximeter, but the original calibration creates a baseline bias that's difficult to eliminate.

Foundation models work the same way. Their "calibration" is the pre-training dataset. If that dataset is demographically skewed, every downstream application inherits that skew—unless you invest significant effort to correct it (which requires labeled data from underrepresented groups, defeating the purpose of data efficiency).

What to Ask Before Adopting a Foundation Model

If you're considering using a foundation model—either for research or to build clinical AI tools—these questions are critical:

1. What Is the Scope of the Model?

Don't be fooled by the "foundation" label. Ask:

  • Which imaging modality? (CT, MRI, X-ray, ultrasound, PET?)
  • Which anatomical region? (Brain, chest, abdomen, extremities?)
  • Which tasks was it pre-trained on? (Segmentation, classification, detection?)

If the vendor can't clearly define the model's scope, that's a red flag. A true foundation model should explicitly state its domain of applicability—and its limitations.

2. What Are the Demographics of the Training Data?

This is the question most vendors hate. But it's essential. Ask:

  • What are the racial and ethnic demographics? If the model was trained on 90% white patients, it may underperform on minority populations.
  • What is the age distribution? Models trained predominantly on adults may fail on pediatric or geriatric patients.
  • What is the sex distribution? Imbalances can lead to disparate performance by sex.
  • Which institutions contributed data? Academic medical centers have different patient populations than community hospitals or international sites.

If the vendor responds with "the model is unbiased" or "we don't track demographics," walk away. Every dataset has a demographic distribution. Refusing to disclose it is a red flag.7

3. Has Performance Been Validated on Diverse Populations?

Pre-training on diverse data is good. Validating performance on diverse populations is better. Ask:

  • Has the model been tested on external datasets with different demographics?
  • Are performance metrics (sensitivity, specificity, AUC) reported stratified by demographic subgroups?
  • What is the performance gap between best- and worst-performing subgroups?

A 95% AUC "overall" means nothing if it's 97% for white patients and 88% for Black patients. Demand subgroup metrics.9
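Stratified evaluation is straightforward to compute once you have per-patient scores, labels, and subgroup tags. The sketch below uses entirely made-up numbers (not from any study) to show how an encouraging overall AUC can mask a gap between subgroups; the rank-based `auc` helper implements the standard probabilistic definition of ROC AUC.

```python
def auc(scores, labels):
    """Probability that a random positive scores higher than a random negative
    (equivalent to the area under the ROC curve; ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical validation set: model score, true label, and a demographic
# subgroup tag for each patient. All values are illustrative.
scores   = [0.9, 0.8, 0.3, 0.2, 0.7, 0.6, 0.55, 0.4]
labels   = [1,   1,   0,   0,   1,   0,   1,    0]
subgroup = ["A", "A", "A", "A", "B", "B", "B",  "B"]

overall = auc(scores, labels)
by_group = {
    g: auc([s for s, t in zip(scores, subgroup) if t == g],
           [y for y, t in zip(labels, subgroup) if t == g])
    for g in sorted(set(subgroup))
}
print(f"overall AUC: {overall:.2f}, by subgroup: {by_group}")
```

On these toy numbers the overall AUC is about 0.94, yet group A scores 1.00 while group B scores 0.75. That is exactly the disparity an "overall" metric hides.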

4. How Much Fine-Tuning Data Do I Need?

Foundation models promise data efficiency—but how much is "enough"? Ask:

  • What is the minimum recommended fine-tuning dataset size for my use case?
  • How does performance scale with fine-tuning data size? (Provide a learning curve.)
  • If I fine-tune on a small dataset, what are the risks of overfitting or bias amplification?

Vendors should provide guidance on data requirements, not just promise that "it works with less data." How much less? And what are the tradeoffs?

The Future: Toward True Medical Foundation Models

Despite current limitations, the trajectory is promising. Researchers are working toward foundation models that are genuinely broad—multi-modal, multi-organ, multi-task.

Multi-Modal Foundation Models

The next generation of medical AI will integrate across modalities:

  • Imaging + EHR: Combining radiology, pathology, and clinical data for holistic patient understanding
  • Multi-scale imaging: Linking whole-organ imaging (CT/MRI) with cellular-level imaging (histopathology)
  • Temporal integration: Learning from longitudinal data—how patients change over time

These models will require massive, curated datasets—but they promise to be true "foundations" for diverse downstream tasks.10

Self-Supervised Learning at Scale

Rather than relying on expensive expert labels, future foundation models will learn from unlabeled data using self-supervised techniques:

  • Contrastive learning: Learning to distinguish similar vs. dissimilar images without labels
  • Masked prediction: Predicting missing parts of an image or medical record
  • Temporal prediction: Predicting future scans based on past ones

This approach allows training on millions of unlabeled scans—expanding dataset diversity and reducing demographic bias (if the unlabeled data is itself diverse).11
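Contrastive learning, the first technique in the list above, can be illustrated with a minimal InfoNCE-style loss. This is a toy sketch on random vectors, not a production training loop: the "scan" embeddings are synthetic, and the augmented view is just the same vector plus noise. The loss rewards the model when the two views of the same scan are more similar to each other than to unrelated scans, with no labels involved.

```python
import numpy as np

rng = np.random.default_rng(1)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive (InfoNCE-style) loss: pull two views of the same scan
    together, push unrelated scans away. No labels required."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [cos(anchor, positive)] + [cos(anchor, n) for n in negatives]
    logits = np.array(sims) / temperature
    # Softmax cross-entropy with the positive pair as the "correct class".
    return float(-logits[0] + np.log(np.exp(logits).sum()))

scan = rng.standard_normal(8)                        # embedding of one unlabeled scan
augmented = scan + 0.05 * rng.standard_normal(8)     # second view of the same scan
others = [rng.standard_normal(8) for _ in range(4)]  # embeddings of unrelated scans

loss = info_nce(scan, augmented, others)
print(f"contrastive loss: {loss:.3f}")
```

Minimizing this loss over millions of unlabeled scans is what lets self-supervised pre-training scale without expert annotation.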

Federated Learning for Demographic Diversity

One reason medical foundation models are demographically skewed is that data is concentrated in a few large institutions—often with homogeneous patient populations.

Federated learning enables training models across multiple institutions without sharing patient data. Institutions train local models on their own data, then share only model updates (not data itself). This allows foundation models to learn from diverse populations while preserving privacy.12

If widely adopted, federated learning could produce foundation models trained on truly representative datasets—incorporating community hospitals, international sites, and underrepresented populations that academic-only datasets miss.
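The core aggregation step of federated learning (federated averaging, or FedAvg) is simple enough to sketch directly. The institutions, weights, and dataset sizes below are entirely hypothetical; the point is that only weight vectors cross institutional boundaries, never patient data, and each site's contribution is weighted by its local dataset size.

```python
def fedavg(site_weights, site_sizes):
    """Average model weights across sites, weighted by local dataset size.
    Each site shares only its trained weight vector, never its patient data."""
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]

# Three hypothetical institutions (e.g., an academic center, a community
# hospital, an international site), each reporting locally trained weights.
weights = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]
sizes = [1000, 200, 300]  # local dataset sizes

global_weights = fedavg(weights, sizes)
print(global_weights)
```

In a real deployment this averaging step repeats over many rounds, with the aggregated model redistributed to all sites for further local training; note that size-weighted averaging also means large homogeneous sites can still dominate the result unless weighting is adjusted.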


The Bottom Line on Medical Foundation Models

Medical "foundation models" aren't as foundational as their name suggests. They're specialized, narrow, domain-specific tools—pre-trained on one modality, one body part, one type of task.

But within their domain, they're powerful. They enable clinical AI development with far less data than traditional approaches, making it feasible to build tools for rare diseases, small institutions, and niche applications.

The catch? Their value depends on whose data they learned from. If the pre-training dataset is demographically skewed, every downstream application inherits that bias—even if you fine-tune on balanced data.

So before you adopt a medical foundation model, ask the hard questions: What's the scope? Who's in the training data? How does it perform across demographics? And how much fine-tuning data do I actually need?

Because a foundation built on shaky ground won't support the clinical tools we need. But one built with transparency, diversity, and validation? That's a foundation worth building on.


Key Takeaways

  • "Foundation" Is a Misnomer: Most medical foundation models are narrow—one modality, one body part. They use self-supervised pre-training (not task-specific labels), but remain specialized to their domain.
  • Data Efficiency Is the Real Benefit: Foundation models reduce labeled data requirements by 5-10x through transfer learning, making clinical AI feasible for rare diseases and small datasets.
  • Explainability Isn't Automatic: Self-supervised learning creates abstract representations, not clinically interpretable features. Localization and explanation require additional work beyond transfer learning.
  • Demographics Matter Critically: Models trained on skewed populations propagate bias to downstream tasks—even with balanced fine-tuning data. Always ask: who's in the pre-training dataset?
  • Validation Must Be Stratified: Demand performance metrics broken down by demographic subgroups. "95% AUC overall" hides disparities between populations.
  • Ask Hard Questions Before Adopting: What's the scope? What are training demographics? How does it perform on my population? Does it provide the localization/explainability I need?

References & Further Reading

  1. Moor M, Banerjee O, Abad ZSH, et al. Foundation Models for Generalist Medical Artificial Intelligence. Nature. 2023;616(7956):259-265. doi:10.1038/s41586-023-05881-4
  2. Krishnan R, Rajpurkar P, Topol EJ. Self-supervised Learning in Medicine and Healthcare. Nat Biomed Eng. 2022;6(12):1346-1352. doi:10.1038/s41551-022-00914-1
  3. Willemink MJ, Koszek WA, Hardell C, et al. Preparing Medical Imaging Data for Machine Learning. Radiology. 2020;295(1):4-15. doi:10.1148/radiol.2020192224
  4. Tajbakhsh N, Shin JY, Gurudu SR, et al. Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning? IEEE Trans Med Imaging. 2016;35(5):1299-1312. doi:10.1109/TMI.2016.2535302
  5. Raghu M, Zhang C, Kleinberg J, Bengio S. Transfusion: Understanding Transfer Learning for Medical Imaging. Advances in Neural Information Processing Systems. 2019;32.
  6. Rouzrokh P, Wyles CC, Philbrick KA, Ramazanian T, Weston AD, Cai JC, Taunton MJ, Kremers WK, Lewallen DG, Erickson BJ. A Deep Learning Tool for Automated Radiographic Measurement of Acetabular Component Inclination and Version After Total Hip Arthroplasty. Part 1: Mitigating Bias in Machine Learning—Data Handling. J Arthroplasty. 2022;37(6S):S406-S413. doi:10.1016/j.arth.2022.02.092
  7. Zhang Y, Wyles CC, Makhni MC, Maradit Kremers H, Sellon JL, Erickson BJ. Part 2: Mitigating Bias in Machine Learning—Model Development. J Arthroplasty. 2022;37(6S):S414-S420. doi:10.1016/j.arth.2022.02.085
  8. Daneshjou R, Vodrahalli K, Novoa RA, et al. Disparities in Dermatology AI Performance on a Diverse, Curated Clinical Image Set. Sci Adv. 2022;8(32):eabq6147. doi:10.1126/sciadv.abq6147
  9. Faghani S, Khosravi B, Moassefi M, Rouzrokh P, Erickson BJ. Part 3: Mitigating Bias in Machine Learning—Performance Metrics, Healthcare Applications, and Fairness in Machine Learning. J Arthroplasty. 2022;37(6S):S421-S428. doi:10.1016/j.arth.2022.02.087
  10. Acosta JN, Falcone GJ, Rajpurkar P, Topol EJ. Multimodal Biomedical AI. Nat Med. 2022;28(9):1773-1784. doi:10.1038/s41591-022-01981-2
  11. Azizi S, Mustafa B, Ryan F, et al. Big Self-Supervised Models Advance Medical Image Classification. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021:3478-3488.
  12. Rieke N, Hancox J, Li W, et al. The Future of Digital Health with Federated Learning. NPJ Digit Med. 2020;3:119. doi:10.1038/s41746-020-00323-1