Commercial vendors marketing radiology AI systems increasingly position their products as "foundation models"—terminology that evokes images of cornerstone technologies, broadly applicable architectural bases upon which diverse applications can be constructed, and universal tools capable of addressing wide-ranging clinical needs. However, examination of technical specifications and training details often reveals substantial limitations that contradict the breadth implied by the "foundation" designation. A representative example involves a vendor-promoted foundation model that, upon detailed review, proves to be restricted to chest computed tomography examinations only, specifically optimized for lung nodule detection tasks, and trained on a dataset of 50,000 scans acquired from three academic medical centers. Such systems, despite their marketing as foundational technologies, more accurately represent highly specialized tools with application domains far narrower than their nomenclature suggests.

Within the broader artificial intelligence research community, "foundation model" has emerged as a defining concept of the current decade, referring to large-scale models exemplified by systems such as GPT-4 and BERT that are trained on massive, heterogeneous datasets encompassing billions of text tokens or millions of diverse images, and that demonstrate remarkable generalization capabilities across wide ranges of tasks including language translation, question answering, image classification, and content generation. However, in medical imaging applications, the term has been appropriated to describe something substantially more constrained: models that undergo pre-training on curated subsets of medical images using self-supervised learning approaches, and are subsequently fine-tuned for specific clinical applications within closely related domains. These medical "foundation models" offer genuine value through their ability to accelerate clinical AI development by dramatically reducing the quantity of labeled training data required compared to training models from randomly initialized parameters. However, nearly all such models exhibit remarkable specificity regarding imaging modality, anatomical region, and patient population characteristics, and their clinical utility and generalization performance depend critically on the demographic composition and clinical characteristics of the populations represented in their training data.

What "Foundation Model" Actually Means in Healthcare

In computer vision and natural language processing domains, foundation models are characterized by training on enormous, heterogeneous datasets that span diverse content types and domains—millions of images scraped from internet sources representing virtually every visual category, billions of text tokens extracted from books, websites, scientific articles, and social media encompassing multiple languages and knowledge domains. This remarkable breadth of training data enables broad generalization capabilities: a computer vision model trained on ImageNet's diverse image collection can recognize thousands of object categories including animals, vehicles, household objects, and natural scenes; a language model trained on web-scraped text can perform diverse tasks including question answering, code generation, language translation, and content summarization without task-specific training. These systems justify the "foundation" designation through their ability to transfer learned representations across fundamentally different application domains.

Medical "foundation models," in marked contrast, are nearly always trained on substantially more restricted datasets characterized by specificity rather than breadth. These models typically exhibit three major constraints that limit their generalization domains. First, they are restricted to single imaging modalities—computed tomography, magnetic resonance imaging, radiography, or ultrasound, but rarely trained on combinations of modalities despite the potential value of cross-modal learning. Second, they focus on single anatomical regions such as brain, chest, or abdomen rather than incorporating multi-organ or whole-body imaging data. Third, they undergo pre-training without explicit diagnostic task supervision, instead utilizing self-supervised learning methods such as masked image region prediction, image reconstruction from corrupted inputs, or contrastive learning to distinguish similar from dissimilar examinations. Consider a representative example: a foundation model for brain magnetic resonance imaging might undergo pre-training using self-supervised techniques that predict masked image portions or learn latent representations that cluster similar brain anatomy, all without exposure to any diagnostic outcome labels during pre-training. Such a model can subsequently be fine-tuned using relatively small labeled datasets to detect multiple sclerosis lesions, classify brain tumors, or identify acute ischemic stroke—all tasks within the same imaging modality and anatomical region. However, this same model cannot be applied to chest radiographs, abdominal computed tomography, or even brain ultrasound imaging, nor can it be adapted to electrocardiogram interpretation or digital pathology slide analysis. 
The "foundation" designation, therefore, refers not to a universal foundation spanning all of medicine or even all of medical imaging, but rather to a specialized starting point providing learned feature representations that can be efficiently adapted to related tasks within a narrowly defined domain with substantially less labeled training data than would be required to train models from randomly initialized parameters.1,2

Why Most Medical Foundation Models Are So Narrow

The characteristic narrowness of medical foundation models does not reflect arbitrary design choices or developer oversight but rather emerges from fundamental technical and practical challenges inherent in medical AI development. These constraints arise from the intrinsic properties of medical imaging data and the computational requirements for training broadly generalizable models across heterogeneous medical data types.

Medical Data Is Fragmented by Modality

Unlike natural photographic images, which share common structural properties including three-channel RGB color representations, consistent statistical distributions of edges and textures, and uniform pixel value interpretations across images, medical imaging modalities exhibit fundamentally different physical principles, data representations, and semantic interpretations that severely limit cross-modality transfer learning. Computed tomography generates grayscale images with wide dynamic range representing quantitative measurements of X-ray attenuation in Hounsfield units, where specific numerical values carry precise physical meaning related to tissue density. Magnetic resonance imaging produces images, typically grayscale with more limited intensity ranges (approximately 7-bit depth), where signal intensity depends on multiple acquisition parameters and tissue relaxation properties, with the same anatomical structure appearing dramatically different across different pulse sequences (T1-weighted, T2-weighted, FLAIR, diffusion-weighted). Ultrasound imaging typically consists of video sequences from which individual two-dimensional frames are selected, representing tissue echogenicity and acoustic impedance properties, with image quality and content highly dependent on operator technique given the handheld probe acquisition method. Positron emission tomography captures metabolic activity patterns at relatively low spatial resolution, requiring co-registration with complementary anatomic imaging for interpretation. Conventional radiography produces projection images representing integrated X-ray attenuation along beam paths through three-dimensional anatomy, used extensively for chest imaging and skeletal evaluation. A deep learning model trained to extract features from CT scans develops internal representations based on Hounsfield unit patterns and attenuation characteristics that are fundamentally meaningless when applied to MRI or ultrasound data. 
This cross-modality transfer challenge differs qualitatively from transfer learning between related natural image domains (such as cats and dogs), resembling instead the impossible task of transferring knowledge learned from photographic images to audio spectrograms—fundamentally different data types requiring distinct feature extraction strategies.
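The quantitative nature of CT intensities described above can be made concrete with a small sketch. This is a hypothetical illustration, not vendor code: the `window_ct` helper maps Hounsfield units into a display range for a chosen window, showing how the same voxel values carry calibrated physical meaning that has no analogue in MRI or ultrasound.

```python
import numpy as np

def window_ct(hu_image, center, width):
    """Map Hounsfield units into [0, 1] for a given display window.
    CT intensities are quantitative: -1000 HU is air, 0 HU is water,
    and roughly +700 HU and above is dense bone."""
    lo, hi = center - width / 2, center + width / 2
    clipped = np.clip(hu_image, lo, hi)
    return (clipped - lo) / (hi - lo)

# A toy 'scan' containing an air, soft-tissue, and bone voxel
hu = np.array([-1000.0, 40.0, 700.0])

lung = window_ct(hu, center=-600, width=1500)  # lung window
soft = window_ct(hu, center=40, width=400)     # soft-tissue window
```

The same three voxels land at very different display intensities under the two windows, and identical HU values mean the same tissue on any calibrated scanner. MRI signal intensities, by contrast, have no such fixed scale, which is one reason features learned from CT do not transfer.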

Anatomical Specialization Is Essential

Even when restricting attention to a single imaging modality, different anatomical regions present entirely distinct normal anatomical structures, pathological processes, and clinical contexts that severely limit model generalization across body regions. A foundation model trained on chest CT examinations develops learned representations for identifying ribs, pulmonary parenchyma, mediastinal structures, and vascular anatomy—none of which appear in brain imaging where the relevant structures include gray matter, white matter, ventricles, and specific brain regions. Similarly, an abdominal MRI foundation model learns to recognize liver parenchyma, kidneys, bowel loops, and mesenteric structures that are completely irrelevant to spine imaging focused on vertebral bodies, intervertebral discs, neural foramina, and spinal cord anatomy. Furthermore, because MRI signal characteristics depend on pulse sequence parameters, the same organ may exhibit dramatically different appearance patterns across different acquisition protocols, requiring the model to learn sequence-specific representations in addition to anatomical structure. Training a single foundation model to "understand" all anatomical regions across all imaging modalities would require aggregating massive, diverse datasets encompassing all combinations of modality, body region, and acquisition parameters, along with computational resources orders of magnitude larger than those required for modality- and anatomy-specific models.3

Self-Supervised Pre-Training, Task-Specific Fine-Tuning

Unlike earlier medical AI systems that required labeled data from the start, foundation models use self-supervised learning for pre-training: learning from unlabeled images by predicting masked regions, reconstructing corrupted images, or distinguishing similar from dissimilar scans. This approach is attractive because creating labels for medical images is time-consuming, requires expert knowledge, and demands that the labels be represented in a consistent fashion for the computer to learn from.

However, to build a clinically useful tool, you still need labeled data for fine-tuning—diagnostic annotations, pathology outcomes, treatment responses. The foundation model provides learned image representations; the fine-tuning teaches it what those representations mean clinically. While self-supervised pre-training is task-agnostic, the modality and anatomical constraints remain.4
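A minimal numpy sketch of the masked-prediction idea, under obvious simplifying assumptions (a linear model standing in for a deep network, no gradient updates shown): hide a fraction of the pixels, predict them from the visible ones, and score the error only on the hidden pixels. No diagnostic labels appear anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_reconstruction_loss(image, weights, mask_frac=0.3):
    """One evaluation of a masked-prediction objective: hide a random
    fraction of pixels, predict them from the visible ones with a
    linear model, and measure error only on the hidden pixels."""
    x = image.ravel()
    mask = rng.random(x.size) < mask_frac  # True = hidden pixel
    visible = np.where(mask, 0.0, x)       # corrupt the input
    pred = weights @ visible               # toy 'encoder-decoder'
    return np.mean((pred[mask] - x[mask]) ** 2)

# Toy 8x8 'scan' and a near-identity linear model
image = rng.random((8, 8))
weights = np.eye(64) + 0.01 * rng.standard_normal((64, 64))
loss = masked_reconstruction_loss(image, weights)
```

A real foundation model would minimize this kind of loss over millions of scans with a deep network, but the structure of the objective is the same: the supervision signal comes from the image itself.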

The Real Value: Transfer Learning with Less Data

Despite being narrow, medical foundation models offer genuine value. The key advantage isn't breadth—it's data efficiency.

Training from Scratch: The Old Way

In the early days of AI (before foundation models were available), we would:

  1. Start with a randomly initialized neural network
  2. Collect and label thousands of MRI scans with metastases annotations
  3. Train the model from scratch, requiring substantial computational resources
  4. Hope you have enough data for the model to learn meaningful features

Problem: You need a lot of labeled data—often tens of thousands of examples—to achieve good performance. If you're working on a rare disease or a small institution's dataset, you're out of luck.

Transfer Learning: The New Way

The next step was to use a pre-trained model that had been trained on ImageNet (a large collection of photographs). While these weren't medical images, they did share some properties with medical images, such as edges and basic textures. So we would:

  1. Start with a model that has already learned general representations of photographic structure
  2. Perform 'transfer learning': take the existing weights and adjust them for medical images. This is quite similar to how foundation models are used.

And now we arrive at foundation models: models built not to perform any particular task out of the box, but to be efficiently adapted to one. In that case, we would:

  1. Start with a model that has already learned general representations of brain MRI structure
  2. Fine-tune it on a much smaller dataset of metastases cases—maybe hundreds instead of thousands
  3. Achieve comparable or better performance with far less data and compute

Why it works: The foundation model has already learned low-level features (edges, textures, intensity patterns) and mid-level features (anatomical structures, tissue boundaries) through self-supervised learning. You're teaching it to apply those representations to a specific clinical task.
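The freeze-and-fine-tune recipe can be sketched in a few lines. This is an illustrative toy, assuming a frozen "encoder" (here just a fixed nonlinear projection standing in for pretrained layers) and a small linear classification head trained on a handful of labeled examples — the cheap, data-efficient part of transfer learning.

```python
import numpy as np

rng = np.random.default_rng(1)

def pretrained_features(x):
    """Stand-in for a frozen foundation-model encoder: a fixed
    projection here; in practice, learned network layers."""
    W = np.linspace(-1, 1, 5 * x.shape[1]).reshape(5, x.shape[1])
    return np.tanh(x @ W.T)

def fine_tune_head(feats, labels, lr=0.5, steps=200):
    """Train only a small logistic-regression head on top of the
    frozen features; the encoder's weights never change."""
    w = np.zeros(feats.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w)))            # sigmoid
        w -= lr * feats.T @ (p - labels) / len(labels)    # gradient step
    return w

# Tiny synthetic 'fine-tuning set': two well-separated clusters
x = np.vstack([rng.normal(-1, 0.3, (20, 3)), rng.normal(1, 0.3, (20, 3))])
y = np.array([0] * 20 + [1] * 20)

feats = pretrained_features(x)
w = fine_tune_head(feats, y)
acc = np.mean(((feats @ w) > 0).astype(int) == y)
```

Only the head's handful of weights are trained, which is why a few hundred labeled cases can suffice when the frozen features are already informative.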

This is transfer learning—and it's the main reason medical foundation models exist. They reduce the data requirements for developing clinical AI tools, making it feasible to build models for rare diseases, underserved populations, or institution-specific use cases.2,5

Real-World Impact: Data Efficiency in Practice

Studies show that using pre-trained medical foundation models can reduce the labeled data requirement by 5-10x compared to training from scratch. For rare diseases with limited cases, this can be the difference between "impossible to build" and "clinically viable."

Example: A foundation model pre-trained on 100,000 chest X-rays can be fine-tuned to detect pediatric pneumonia with as few as 500 labeled pediatric cases—whereas training from scratch might require 5,000-10,000 cases to achieve the same accuracy.5

What Foundation Models Don't Automatically Provide

While foundation models improve data efficiency, they don't solve all clinical AI challenges:

  • Explainability: Self-supervised pre-training learns abstract representations, not clinical concepts. The model may detect disease accurately without explaining why in clinically interpretable terms. You still need separate approaches (attention maps, saliency methods) to generate explanations—and these may not align with radiologist reasoning.
  • Localization: Foundation models learn image-level features, not necessarily spatial localization. If your clinical task requires precise lesion boundaries or anatomic localization, you may need additional fine-tuning strategies or task-specific architectures—transfer learning doesn't automatically provide pixel-level precision.
  • Clinical context: Models trained on images alone don't understand patient history, lab values, or clinical presentation. Multi-modal foundation models (combining imaging + EHR) are emerging but remain rare.

Foundation models are powerful starting points, but they're not plug-and-play solutions for all clinical AI needs. Understanding their limitations is as important as understanding their strengths.

Why Demographics Matter: The Hidden Variable in Foundation Models

Here's the part most foundation model papers bury in supplementary materials—or omit entirely: Who is represented in the training data?

When you fine-tune a foundation model on your local dataset, you're not starting from a blank slate. You're starting from a model that has already learned patterns from its pre-training data. If that data is biased, your fine-tuned model inherits that bias—even if your fine-tuning data is perfectly balanced.

Bias Propagation Through Transfer Learning

Foundation models encode the demographic distributions they were trained on. If the pre-training dataset is:

  • 90% white patients → the model learns features optimized for white patients
  • Predominantly from academic medical centers → it learns patterns specific to tertiary care populations
  • Imbalanced by sex, age, or socioeconomic status → it under-represents minority groups

When you fine-tune this model on a small dataset (which is the whole point of transfer learning), the foundation model's biases are likely to dominate. Your 200 fine-tuning examples can't override the millions of examples the model saw during pre-training.6,7

Result: The model performs worse on populations underrepresented in the foundation training set—even if you tried to include them in your fine-tuning data.

Example: Skin Lesion Classification

Dermatology AI models, including foundation models for skin lesion classification, have been shown to perform significantly worse on darker skin tones. Why? Because the pre-training datasets (often scraped from public dermatology atlases) overwhelmingly feature lighter skin tones.

Even if you fine-tune the model on a balanced dataset with diverse skin tones, the foundation model has already learned features optimized for lighter skin—edge detection, texture analysis, color distributions that work best for one demographic.8

Fine-tuning can improve performance on underrepresented groups, but it can't fully undo the foundation model's priors. The bias is baked in from the start.

The Pulse Oximetry Parallel

This is conceptually similar to the pulse oximetry problem (discussed in our AI bias article): pulse oximeters were calibrated on light-skinned individuals, leading to systematic errors in darker-skinned patients. You can recalibrate a pulse oximeter, but the original calibration creates a baseline bias that's difficult to eliminate.

Foundation models work the same way. Their "calibration" is the pre-training dataset. If that dataset is demographically skewed, every downstream application inherits that skew—unless you invest significant effort to correct it (which requires labeled data from underrepresented groups, defeating the purpose of data efficiency).

What to Ask Before Adopting a Foundation Model

If you're considering using a foundation model—either for research or to build clinical AI tools—these questions are critical:

1. What Is the Scope of the Model?

Don't be fooled by the "foundation" label. Ask:

  • Which imaging modality? (CT, MRI, X-ray, ultrasound, PET?)
  • Which anatomical region? (Brain, chest, abdomen, extremities?)
  • Which tasks was it pre-trained on? (Segmentation, classification, detection?)

If the vendor can't clearly define the model's scope, that's a red flag. A true foundation model should explicitly state its domain of applicability—and its limitations.

2. What Are the Demographics of the Training Data?

This is the question most vendors hate. But it's essential. Ask:

  • What are the racial and ethnic demographics? If the model was trained on 90% white patients, it may underperform on minority populations.
  • What is the age distribution? Models trained predominantly on adults may fail on pediatric or geriatric patients.
  • What is the sex distribution? Imbalances can lead to disparate performance by sex.
  • Which institutions contributed data? Academic medical centers have different patient populations than community hospitals or international sites.

If the vendor responds with "the model is unbiased" or "we don't track demographics," walk away. Every dataset has a demographic distribution. Refusing to disclose it is a red flag.7

3. Has Performance Been Validated on Diverse Populations?

Pre-training on diverse data is good. Validating performance on diverse populations is better. Ask:

  • Has the model been tested on external datasets with different demographics?
  • Are performance metrics (sensitivity, specificity, AUC) reported stratified by demographic subgroups?
  • What is the performance gap between best- and worst-performing subgroups?

A 95% AUC "overall" means nothing if it's 97% for white patients and 88% for Black patients. Demand subgroup metrics.9
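Computing subgroup metrics is straightforward, which makes their absence from a vendor report all the more telling. A small sketch with hypothetical predictions and two made-up subgroups, A and B, reporting per-group sensitivity and the best-to-worst gap:

```python
import numpy as np

def stratified_sensitivity(y_true, y_pred, groups):
    """Sensitivity (true-positive rate) per demographic subgroup, plus
    the gap between best- and worst-performing groups - the number an
    'overall' metric hides."""
    out = {}
    for g in np.unique(groups):
        positives = (groups == g) & (y_true == 1)
        out[str(g)] = float(np.mean(y_pred[positives] == 1))
    gap = max(out.values()) - min(out.values())
    return out, gap

# Hypothetical labels and predictions for eight positive cases
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1])
y_pred = np.array([1, 1, 1, 1, 1, 1, 0, 0])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

per_group, gap = stratified_sensitivity(y_true, y_pred, groups)
# sensitivity: A = 1.0, B = 0.5; gap = 0.5
```

An aggregate sensitivity of 0.75 here would look acceptable while concealing that group B's sensitivity is half of group A's.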

4. How Much Fine-Tuning Data Do I Need?

Foundation models promise data efficiency—but how much is "enough"? Ask:

  • What is the minimum recommended fine-tuning dataset size for my use case?
  • How does performance scale with fine-tuning data size? (Provide a learning curve.)
  • If I fine-tune on a small dataset, what are the risks of overfitting or bias amplification?

Vendors should provide guidance on data requirements, not just promise that "it works with less data." How much less? And what are the tradeoffs?

The Future: Toward True Medical Foundation Models

Despite current limitations, the trajectory is promising. Researchers are working toward foundation models that are genuinely broad—multi-modal, multi-organ, multi-task.

Multi-Modal Foundation Models

The next generation of medical AI will integrate across modalities:

  • Imaging + EHR: Combining radiology, pathology, and clinical data for holistic patient understanding
  • Multi-scale imaging: Linking whole-organ imaging (CT/MRI) with cellular-level imaging (histopathology)
  • Temporal integration: Learning from longitudinal data—how patients change over time

These models will require massive, curated datasets—but they promise to be true "foundations" for diverse downstream tasks.10

Self-Supervised Learning at Scale

Rather than relying on expensive expert labels, future foundation models will learn from unlabeled data using self-supervised techniques:

  • Contrastive learning: Learning to distinguish similar vs. dissimilar images without labels
  • Masked prediction: Predicting missing parts of an image or medical record
  • Temporal prediction: Predicting future scans based on past ones

This approach allows training on millions of unlabeled scans—expanding dataset diversity and reducing demographic bias (if the unlabeled data is itself diverse).11
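The contrastive objective in the list above can be illustrated with an InfoNCE-style loss on normalized embeddings. This is a schematic sketch with made-up vectors, not any particular published implementation: the anchor should score higher against its positive (an augmented view of the same scan) than against unrelated negatives.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss on L2-normalized embeddings:
    low when the anchor is most similar to its positive view, high
    when a negative wins. No labels are required."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # positive should win

rng = np.random.default_rng(2)
anchor = np.array([1.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0])     # similar view of the 'same scan'
negatives = rng.standard_normal((5, 3))  # embeddings of unrelated scans
loss_similar = info_nce_loss(anchor, positive, negatives)
loss_dissimilar = info_nce_loss(anchor, -positive, negatives)
```

Minimizing this loss over many scans pushes views of the same examination together and different examinations apart, yielding useful representations without any diagnostic labels.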

Federated Learning for Demographic Diversity

One reason medical foundation models are demographically skewed is that data is concentrated in a few large institutions—often with homogeneous patient populations.

Federated learning enables training models across multiple institutions without sharing patient data. Institutions train local models on their own data, then share only model updates (not data itself). This allows foundation models to learn from diverse populations while preserving privacy.12
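The aggregation step at the heart of this scheme (FedAvg-style weighted averaging) is simple to sketch. The site names and study counts below are hypothetical; the point is that only weight vectors cross institutional boundaries, never images or patient records.

```python
import numpy as np

def federated_average(local_updates, num_samples):
    """One FedAvg round: each institution contributes only its locally
    trained weights; the server returns their sample-size-weighted
    average as the new global model. No patient data is shared."""
    total = sum(num_samples)
    return sum(w * (n / total) for w, n in zip(local_updates, num_samples))

# Hypothetical weight vectors from three sites of different sizes
site_weights = [np.array([1.0, 0.0]),   # academic center, 8,000 studies
                np.array([0.0, 1.0]),   # community hospital, 1,000 studies
                np.array([0.5, 0.5])]   # international site, 1,000 studies
site_sizes = [8000, 1000, 1000]

global_model = federated_average(site_weights, site_sizes)
# global_model is approximately [0.85, 0.15]
```

Note the tradeoff visible even in this toy: sample-size weighting lets the largest (often most homogeneous) site dominate the average, so diversity of contributing sites matters, not just their number.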

If widely adopted, federated learning could produce foundation models trained on truly representative datasets—incorporating community hospitals, international sites, and underrepresented populations that academic-only datasets miss.


The Bottom Line on Medical Foundation Models

Medical "foundation models" aren't as foundational as their name suggests. They're specialized, narrow, domain-specific tools—pre-trained on one modality, one body part, one type of task.

But within their domain, they're powerful. They enable clinical AI development with far less data than traditional approaches, making it feasible to build tools for rare diseases, small institutions, and niche applications.

The catch? Their value depends on whose data they learned from. If the pre-training dataset is demographically skewed, every downstream application inherits that bias—even if you fine-tune on balanced data.

So before you adopt a medical foundation model, ask the hard questions: What's the scope? Who's in the training data? How does it perform across demographics? And how much fine-tuning data do I actually need?

Because a foundation built on shaky ground won't support the clinical tools we need. But one built with transparency, diversity, and validation? That's a foundation worth building on.


Key Takeaways

  • "Foundation" Is a Misnomer: Most medical foundation models are narrow—one modality, one body part. They use self-supervised pre-training (not task-specific labels), but remain specialized to their domain.
  • Data Efficiency Is the Real Benefit: Foundation models reduce labeled data requirements by 5-10x through transfer learning, making clinical AI feasible for rare diseases and small datasets.
  • Explainability Isn't Automatic: Self-supervised learning creates abstract representations, not clinically interpretable features. Localization and explanation require additional work beyond transfer learning.
  • Demographics Matter Critically: Models trained on skewed populations propagate bias to downstream tasks—even with balanced fine-tuning data. Always ask: who's in the pre-training dataset?
  • Validation Must Be Stratified: Demand performance metrics broken down by demographic subgroups. "95% AUC overall" hides disparities between populations.
  • Ask Hard Questions Before Adopting: What's the scope? What are training demographics? How does it perform on my population? Does it provide the localization/explainability I need?

References & Further Reading

  1. Moor M, Banerjee O, Abad ZSH, et al. Foundation Models for Generalist Medical Artificial Intelligence. Nature. 2023;616(7956):259-265. doi:10.1038/s41586-023-05881-4
  2. Krishnan R, Rajpurkar P, Topol EJ. Self-supervised Learning in Medicine and Healthcare. Nat Biomed Eng. 2022;6(12):1346-1352. doi:10.1038/s41551-022-00914-1
  3. Willemink MJ, Koszek WA, Hardell C, et al. Preparing Medical Imaging Data for Machine Learning. Radiology. 2020;295(1):4-15. doi:10.1148/radiol.2020192224
  4. Tajbakhsh N, Shin JY, Gurudu SR, et al. Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning? IEEE Trans Med Imaging. 2016;35(5):1299-1312. doi:10.1109/TMI.2016.2535302
  5. Raghu M, Zhang C, Kleinberg J, Bengio S. Transfusion: Understanding Transfer Learning for Medical Imaging. Advances in Neural Information Processing Systems. 2019;32.
  6. Rouzrokh P, Wyles CC, Philbrick KA, Ramazanian T, Weston AD, Cai JC, Taunton MJ, Kremers WK, Lewallen DG, Erickson BJ. A Deep Learning Tool for Automated Radiographic Measurement of Acetabular Component Inclination and Version After Total Hip Arthroplasty. Part 1: Mitigating Bias in Machine Learning—Data Handling. J Arthroplasty. 2022;37(6S):S406-S413. doi:10.1016/j.arth.2022.02.092
  7. Zhang Y, Wyles CC, Makhni MC, Maradit Kremers H, Sellon JL, Erickson BJ. Part 2: Mitigating Bias in Machine Learning—Model Development. J Arthroplasty. 2022;37(6S):S414-S420. doi:10.1016/j.arth.2022.02.085
  8. Daneshjou R, Vodrahalli K, Novoa RA, et al. Disparities in Dermatology AI Performance on a Diverse, Curated Clinical Image Set. Sci Adv. 2022;8(32):eabq6147. doi:10.1126/sciadv.abq6147
  9. Faghani S, Khosravi B, Moassefi M, Rouzrokh P, Erickson BJ. Part 3: Mitigating Bias in Machine Learning—Performance Metrics, Healthcare Applications, and Fairness in Machine Learning. J Arthroplasty. 2022;37(6S):S421-S428. doi:10.1016/j.arth.2022.02.087
  10. Acosta JN, Falcone GJ, Rajpurkar P, Topol EJ. Multimodal Biomedical AI. Nat Med. 2022;28(9):1773-1784. doi:10.1038/s41591-022-01981-2
  11. Azizi S, Mustafa B, Ryan F, et al. Big Self-Supervised Models Advance Medical Image Classification. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021:3478-3488.
  12. Rieke N, Hancox J, Li W, et al. The Future of Digital Health with Federated Learning. NPJ Digit Med. 2020;3:119. doi:10.1038/s41746-020-00323-1