Commercial vendors marketing radiology AI systems increasingly position their products as "foundation models"—terminology that evokes images of cornerstone technologies, broadly applicable architectural bases upon which diverse applications can be constructed, and universal tools capable of addressing wide-ranging clinical needs. However, examination of technical specifications and training details often reveals substantial limitations that contradict the breadth implied by the "foundation" designation. A representative example involves a vendor-promoted foundation model that, upon detailed review, proves to be restricted to chest computed tomography examinations only, specifically optimized for lung nodule detection tasks, and trained on a dataset of 50,000 scans acquired from three academic medical centers. Such systems, despite their marketing as foundational technologies, more accurately represent highly specialized tools with application domains far narrower than their nomenclature suggests.

Within the broader artificial intelligence research community, "foundation model" has emerged as a defining concept of the current decade, referring to large-scale models exemplified by systems such as GPT-4 and BERT that are trained on massive, heterogeneous datasets encompassing billions of text tokens or millions of diverse images, and that demonstrate remarkable generalization capabilities across wide ranges of tasks including language translation, question answering, image classification, and content generation. However, in medical imaging applications, the term has been appropriated to describe something substantially more constrained: models that undergo pre-training on curated subsets of medical images using self-supervised learning approaches, and are subsequently fine-tuned for specific clinical applications within closely related domains. These medical "foundation models" offer genuine value through their ability to accelerate clinical AI development by dramatically reducing the quantity of labeled training data required compared to training models from randomly initialized parameters. However, nearly all such models exhibit remarkable specificity regarding imaging modality, anatomical region, and patient population characteristics, and their clinical utility and generalization performance depend critically on the demographic composition and clinical characteristics of the populations represented in their training data.

What "Foundation Model" Actually Means in Healthcare

In computer vision and natural language processing domains, foundation models are characterized by training on enormous, heterogeneous datasets that span diverse content types and domains—millions of images scraped from internet sources representing virtually every visual category, billions of text tokens extracted from books, websites, scientific articles, and social media encompassing multiple languages and knowledge domains. This remarkable breadth of training data enables broad generalization capabilities: a computer vision model trained on ImageNet's diverse image collection can recognize thousands of object categories including animals, vehicles, household objects, and natural scenes; a language model trained on web-scraped text can perform diverse tasks including question answering, code generation, language translation, and content summarization without task-specific training. These systems justify the "foundation" designation through their ability to transfer learned representations across fundamentally different application domains.

Medical "foundation models," in marked contrast, are nearly always trained on substantially more restricted datasets characterized by specificity rather than breadth. These models typically exhibit three major constraints that limit their generalization domains. First, they are restricted to single imaging modalities—computed tomography, magnetic resonance imaging, radiography, or ultrasound, but rarely trained on combinations of modalities despite the potential value of cross-modal learning. Second, they focus on single anatomical regions such as brain, chest, or abdomen rather than incorporating multi-organ or whole-body imaging data. Third, they undergo pre-training without explicit diagnostic task supervision, instead utilizing self-supervised learning methods such as masked image region prediction, image reconstruction from corrupted inputs, or contrastive learning to distinguish similar from dissimilar examinations. Consider a representative example: a foundation model for brain magnetic resonance imaging might undergo pre-training using self-supervised techniques that predict masked image portions or learn latent representations that cluster similar brain anatomy, all without exposure to any diagnostic outcome labels during pre-training. Such a model can subsequently be fine-tuned using relatively small labeled datasets to detect multiple sclerosis lesions, classify brain tumors, or identify acute ischemic stroke—all tasks within the same imaging modality and anatomical region. However, this same model cannot be applied to chest radiographs, abdominal computed tomography, or even brain ultrasound imaging, nor can it be adapted to electrocardiogram interpretation or digital pathology slide analysis. 
The "foundation" designation, therefore, refers not to a universal foundation spanning all of medicine or even all of medical imaging, but rather to a specialized starting point providing learned feature representations that can be efficiently adapted to related tasks within a narrowly defined domain with substantially less labeled training data than would be required to train models from randomly initialized parameters.1,2

Why Most Medical Foundation Models Are So Narrow

The characteristic narrowness of medical foundation models does not reflect arbitrary design choices or developer oversight but rather emerges from fundamental technical and practical challenges inherent in medical AI development. These constraints arise from the intrinsic properties of medical imaging data and the computational requirements for training broadly generalizable models across heterogeneous medical data types.

Medical Data Is Fragmented by Modality

Unlike natural photographic images, which share common structural properties including three-channel RGB color representations, consistent statistical distributions of edges and textures, and uniform pixel value interpretations across images, medical imaging modalities exhibit fundamentally different physical principles, data representations, and semantic interpretations that severely limit cross-modality transfer learning. Computed tomography generates grayscale images with wide dynamic range representing quantitative measurements of X-ray attenuation in Hounsfield units, where specific numerical values carry precise physical meaning related to tissue density. Magnetic resonance imaging produces images, typically grayscale with more limited intensity ranges (approximately 7-bit depth), where signal intensity depends on multiple acquisition parameters and tissue relaxation properties, with the same anatomical structure appearing dramatically different across different pulse sequences (T1-weighted, T2-weighted, FLAIR, diffusion-weighted). Ultrasound imaging typically consists of video sequences from which individual two-dimensional frames are selected, representing tissue echogenicity and acoustic impedance properties, with image quality and content highly dependent on operator technique given the handheld probe acquisition method. Positron emission tomography captures metabolic activity patterns at relatively low spatial resolution, requiring co-registration with complementary anatomic imaging for interpretation. Conventional radiography produces projection images representing integrated X-ray attenuation along beam paths through three-dimensional anatomy, used extensively for chest imaging and skeletal evaluation. A deep learning model trained to extract features from CT scans develops internal representations based on Hounsfield unit patterns and attenuation characteristics that are fundamentally meaningless when applied to MRI or ultrasound data. 
This cross-modality transfer challenge differs qualitatively from transfer learning between related natural image domains (such as cats and dogs), resembling instead the impossible task of transferring knowledge learned from photographic images to audio spectrograms—fundamentally different data types requiring distinct feature extraction strategies.
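The quantitative nature of CT intensities described above can be made concrete with a small sketch. This is a hypothetical illustration, not vendor code: the `window_ct` helper maps Hounsfield units into a display range for a chosen window, showing how the same voxel values carry calibrated physical meaning that has no analogue in MRI or ultrasound.

```python
import numpy as np

def window_ct(hu_image, center, width):
    """Map Hounsfield units into [0, 1] for a given display window.
    CT intensities are quantitative: -1000 HU is air, 0 HU is water,
    and roughly +700 HU and above is dense bone."""
    lo, hi = center - width / 2, center + width / 2
    clipped = np.clip(hu_image, lo, hi)
    return (clipped - lo) / (hi - lo)

# A toy 'scan' containing an air, soft-tissue, and bone voxel
hu = np.array([-1000.0, 40.0, 700.0])

lung = window_ct(hu, center=-600, width=1500)  # lung window
soft = window_ct(hu, center=40, width=400)     # soft-tissue window
```

The same three voxels land at very different display intensities under the two windows, and identical HU values mean the same tissue on any calibrated scanner. MRI signal intensities, by contrast, have no such fixed scale, which is one reason features learned from CT do not transfer.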

Anatomical Specialization Is Essential

Even when restricting attention to a single imaging modality, different anatomical regions present entirely distinct normal anatomical structures, pathological processes, and clinical contexts that severely limit model generalization across body regions. A foundation model trained on chest CT examinations develops learned representations for identifying ribs, pulmonary parenchyma, mediastinal structures, and vascular anatomy—none of which appear in brain imaging where the relevant structures include gray matter, white matter, ventricles, and specific brain regions. Similarly, an abdominal MRI foundation model learns to recognize liver parenchyma, kidneys, bowel loops, and mesenteric structures that are completely irrelevant to spine imaging focused on vertebral bodies, intervertebral discs, neural foramina, and spinal cord anatomy. Furthermore, because MRI signal characteristics depend on pulse sequence parameters, the same organ may exhibit dramatically different appearance patterns across different acquisition protocols, requiring the model to learn sequence-specific representations in addition to anatomical structure. Training a single foundation model to "understand" all anatomical regions across all imaging modalities would require aggregating massive, diverse datasets encompassing all combinations of modality, body region, and acquisition parameters, along with computational resources orders of magnitude larger than those required for modality- and anatomy-specific models.3

Self-Supervised Pre-Training, Task-Specific Fine-Tuning

Unlike earlier medical AI systems that required labeled data from the start, foundation models use self-supervised learning for pre-training: learning from unlabeled images by predicting masked regions, reconstructing corrupted images, or distinguishing similar from dissimilar scans. This approach is attractive because creating labels for medical images is time-consuming, requires expert knowledge, and demands that the labels be represented in a consistent fashion for the computer to learn from.

However, to build a clinically useful tool, you still need labeled data for fine-tuning—diagnostic annotations, pathology outcomes, treatment responses. The foundation model provides learned image representations; the fine-tuning teaches it what those representations mean clinically. While self-supervised pre-training is task-agnostic, the modality and anatomical constraints remain.4
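A minimal numpy sketch of the masked-prediction idea, under obvious simplifying assumptions (a linear model standing in for a deep network, no gradient updates shown): hide a fraction of the pixels, predict them from the visible ones, and score the error only on the hidden pixels. No diagnostic labels appear anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_reconstruction_loss(image, weights, mask_frac=0.3):
    """One evaluation of a masked-prediction objective: hide a random
    fraction of pixels, predict them from the visible ones with a
    linear model, and measure error only on the hidden pixels."""
    x = image.ravel()
    mask = rng.random(x.size) < mask_frac  # True = hidden pixel
    visible = np.where(mask, 0.0, x)       # corrupt the input
    pred = weights @ visible               # toy 'encoder-decoder'
    return np.mean((pred[mask] - x[mask]) ** 2)

# Toy 8x8 'scan' and a near-identity linear model
image = rng.random((8, 8))
weights = np.eye(64) + 0.01 * rng.standard_normal((64, 64))
loss = masked_reconstruction_loss(image, weights)
```

A real foundation model would minimize this kind of loss over millions of scans with a deep network, but the structure of the objective is the same: the supervision signal comes from the image itself.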

The Real Value: Transfer Learning with Less Data

Despite being narrow, medical foundation models offer genuine value. The key advantage isn't breadth—it's data efficiency.

Training from Scratch: The Old Way

In the early days of AI (before foundation models were available), we would:

  1. Start with a randomly initialized neural network
  2. Collect and label thousands of MRI scans with metastases annotations
  3. Train the model from scratch, requiring substantial computational resources
  4. Hope you have enough data for the model to learn meaningful features

Problem: You need a lot of labeled data—often tens of thousands of examples—to achieve good performance. If you're working on a rare disease or a small institution's dataset, you're out of luck.

Transfer Learning: The New Way

The next step was to use a pre-trained model that had been trained on ImageNet (a large collection of photographs). While these weren't medical images, they did share some properties with medical images, such as edges and basic textures. So we would:

  1. Start with a model that has already learned general representations of photographic structure
  2. Perform 'transfer learning': take the existing weights and adjust them for medical images. This is quite similar to how foundation models are used.

And now we arrive at foundation models: models built not to perform any particular task out of the box, but to be efficiently adapted to one. In that case, we would:

  1. Start with a model that has already learned general representations of brain MRI structure
  2. Fine-tune it on a much smaller dataset of metastases cases—maybe hundreds instead of thousands
  3. Achieve comparable or better performance with far less data and compute

Why it works: The foundation model has already learned low-level features (edges, textures, intensity patterns) and mid-level features (anatomical structures, tissue boundaries) through self-supervised learning. You're teaching it to apply those representations to a specific clinical task.
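The freeze-and-fine-tune recipe can be sketched in a few lines. This is an illustrative toy, assuming a frozen "encoder" (here just a fixed nonlinear projection standing in for pretrained layers) and a small linear classification head trained on a handful of labeled examples — the cheap, data-efficient part of transfer learning.

```python
import numpy as np

rng = np.random.default_rng(1)

def pretrained_features(x):
    """Stand-in for a frozen foundation-model encoder: a fixed
    projection here; in practice, learned network layers."""
    W = np.linspace(-1, 1, 5 * x.shape[1]).reshape(5, x.shape[1])
    return np.tanh(x @ W.T)

def fine_tune_head(feats, labels, lr=0.5, steps=200):
    """Train only a small logistic-regression head on top of the
    frozen features; the encoder's weights never change."""
    w = np.zeros(feats.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w)))            # sigmoid
        w -= lr * feats.T @ (p - labels) / len(labels)    # gradient step
    return w

# Tiny synthetic 'fine-tuning set': two well-separated clusters
x = np.vstack([rng.normal(-1, 0.3, (20, 3)), rng.normal(1, 0.3, (20, 3))])
y = np.array([0] * 20 + [1] * 20)

feats = pretrained_features(x)
w = fine_tune_head(feats, y)
acc = np.mean(((feats @ w) > 0).astype(int) == y)
```

Only the head's handful of weights are trained, which is why a few hundred labeled cases can suffice when the frozen features are already informative.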

This is transfer learning—and it's the main reason medical foundation models exist. They reduce the data requirements for developing clinical AI tools, making it feasible to build models for rare diseases, underserved populations, or institution-specific use cases.2,5

Real-World Impact: Data Efficiency in Practice

Studies show that using pre-trained medical foundation models can reduce the labeled data requirement by 5-10x compared to training from scratch. For rare diseases with limited cases, this can be the difference between "impossible to build" and "clinically viable."

Example: A foundation model pre-trained on 100,000 chest X-rays can be fine-tuned to detect pediatric pneumonia with as few as 500 labeled pediatric cases—whereas training from scratch might require 5,000-10,000 cases to achieve the same accuracy.5

What Foundation Models Don't Automatically Provide

While foundation models improve data efficiency, they don't solve all clinical AI challenges:

  • Explainability: Self-supervised pre-training learns abstract representations, not clinical concepts. The model may detect disease accurately without explaining why in clinically interpretable terms. You still need separate approaches (attention maps, saliency methods) to generate explanations—and these may not align with radiologist reasoning.
  • Localization: Foundation models learn image-level features, not necessarily spatial localization. If your clinical task requires precise lesion boundaries or anatomic localization, you may need additional fine-tuning strategies or task-specific architectures—transfer learning doesn't automatically provide pixel-level precision.
  • Clinical context: Models trained on images alone don't understand patient history, lab values, or clinical presentation. Multi-modal foundation models (combining imaging + EHR) are emerging but remain rare.

Foundation models are powerful starting points, but they're not plug-and-play solutions for all clinical AI needs. Understanding their limitations is as important as understanding their strengths.

Why Demographics Matter: The Hidden Variable in Foundation Models

Here's the part most foundation model papers bury in supplementary materials—or omit entirely: Who is represented in the training data?

When you fine-tune a foundation model on your local dataset, you're not starting from a blank slate. You're starting from a model that has already learned patterns from its pre-training data. If that data is biased, your fine-tuned model inherits that bias—even if your fine-tuning data is perfectly balanced.

Bias Propagation Through Transfer Learning

Foundation models encode the demographic distributions they were trained on. If the pre-training dataset is:

  • 90% white patients → the model learns features optimized for white patients
  • Predominantly from academic medical centers → it learns patterns specific to tertiary care populations
  • Imbalanced by sex, age, or socioeconomic status → it under-represents minority groups

When you fine-tune this model on a small dataset (which is the whole point of transfer learning), the foundation model's biases are likely to dominate. Your 200 fine-tuning examples can't override the millions of examples the model saw during pre-training.6,7

Result: The model performs worse on populations underrepresented in the foundation training set—even if you tried to include them in your fine-tuning data.

Example: Skin Lesion Classification

Dermatology AI models, including foundation models for skin lesion classification, have been shown to perform significantly worse on darker skin tones. Why? Because the pre-training datasets (often scraped from public dermatology atlases) overwhelmingly feature lighter skin tones.

Even if you fine-tune the model on a balanced dataset with diverse skin tones, the foundation model has already learned features optimized for lighter skin—edge detection, texture analysis, color distributions that work best for one demographic.8

Fine-tuning can improve performance on underrepresented groups, but it can't fully undo the foundation model's priors. The bias is baked in from the start.

The Pulse Oximetry Parallel

This is conceptually similar to the pulse oximetry problem (discussed in our AI bias article): pulse oximeters were calibrated on light-skinned individuals, leading to systematic errors in darker-skinned patients. You can recalibrate a pulse oximeter, but the original calibration creates a baseline bias that's difficult to eliminate.

Foundation models work the same way. Their "calibration" is the pre-training dataset. If that dataset is demographically skewed, every downstream application inherits that skew—unless you invest significant effort to correct it (which requires labeled data from underrepresented groups, defeating the purpose of data efficiency).

What to Ask Before Adopting a Foundation Model

If you're considering using a foundation model—either for research or to build clinical AI tools—these questions are critical:

1. What Is the Scope of the Model?

Don't be fooled by the "foundation" label. Ask:

  • Which imaging modality? (CT, MRI, X-ray, ultrasound, PET?)
  • Which anatomical region? (Brain, chest, abdomen, extremities?)
  • Which tasks was it pre-trained on? (Segmentation, classification, detection?)

If the vendor can't clearly define the model's scope, that's a red flag. A true foundation model should explicitly state its domain of applicability—and its limitations.

2. What Are the Demographics of the Training Data?

This is the question most vendors hate. But it's essential. Ask:

  • What are the racial and ethnic demographics? If the model was trained on 90% white patients, it may underperform on minority populations.
  • What is the age distribution? Models trained predominantly on adults may fail on pediatric or geriatric patients.
  • What is the sex distribution? Imbalances can lead to disparate performance by sex.
  • Which institutions contributed data? Academic medical centers have different patient populations than community hospitals or international sites.

If the vendor responds with "the model is unbiased" or "we don't track demographics," walk away. Every dataset has a demographic distribution. Refusing to disclose it is a red flag.7

3. Has Performance Been Validated on Diverse Populations?

Pre-training on diverse data is good. Validating performance on diverse populations is better. Ask:

  • Has the model been tested on external datasets with different demographics?
  • Are performance metrics (sensitivity, specificity, AUC) reported stratified by demographic subgroups?
  • What is the performance gap between best- and worst-performing subgroups?

A 95% AUC "overall" means nothing if it's 97% for white patients and 88% for Black patients. Demand subgroup metrics.9
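Computing subgroup metrics is straightforward, which makes their absence from a vendor report all the more telling. A small sketch with hypothetical predictions and two made-up subgroups, A and B, reporting per-group sensitivity and the best-to-worst gap:

```python
import numpy as np

def stratified_sensitivity(y_true, y_pred, groups):
    """Sensitivity (true-positive rate) per demographic subgroup, plus
    the gap between best- and worst-performing groups - the number an
    'overall' metric hides."""
    out = {}
    for g in np.unique(groups):
        positives = (groups == g) & (y_true == 1)
        out[str(g)] = float(np.mean(y_pred[positives] == 1))
    gap = max(out.values()) - min(out.values())
    return out, gap

# Hypothetical labels and predictions for eight positive cases
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1])
y_pred = np.array([1, 1, 1, 1, 1, 1, 0, 0])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

per_group, gap = stratified_sensitivity(y_true, y_pred, groups)
# sensitivity: A = 1.0, B = 0.5; gap = 0.5
```

An aggregate sensitivity of 0.75 here would look acceptable while concealing that group B's sensitivity is half of group A's.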

4. How Much Fine-Tuning Data Do I Need?

Foundation models promise data efficiency—but how much is "enough"? Ask:

  • What is the minimum recommended fine-tuning dataset size for my use case?
  • How does performance scale with fine-tuning data size? (Provide a learning curve.)
  • If I fine-tune on a small dataset, what are the risks of overfitting or bias amplification?

Vendors should provide guidance on data requirements, not just promise that "it works with less data." How much less? And what are the tradeoffs?

The Future: Toward True Medical Foundation Models

Despite current limitations, the trajectory is promising. Researchers are working toward foundation models that are genuinely broad—multi-modal, multi-organ, multi-task.

Multi-Modal Foundation Models

The next generation of medical AI will integrate across modalities:

  • Imaging + EHR: Combining radiology, pathology, and clinical data for holistic patient understanding
  • Multi-scale imaging: Linking whole-organ imaging (CT/MRI) with cellular-level imaging (histopathology)
  • Temporal integration: Learning from longitudinal data—how patients change over time

These models will require massive, curated datasets—but they promise to be true "foundations" for diverse downstream tasks.10

Self-Supervised Learning at Scale

Rather than relying on expensive expert labels, future foundation models will learn from unlabeled data using self-supervised techniques:

  • Contrastive learning: Learning to distinguish similar vs. dissimilar images without labels
  • Masked prediction: Predicting missing parts of an image or medical record
  • Temporal prediction: Predicting future scans based on past ones

This approach allows training on millions of unlabeled scans—expanding dataset diversity and reducing demographic bias (if the unlabeled data is itself diverse).11
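The contrastive objective in the list above can be illustrated with an InfoNCE-style loss on normalized embeddings. This is a schematic sketch with made-up vectors, not any particular published implementation: the anchor should score higher against its positive (an augmented view of the same scan) than against unrelated negatives.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss on L2-normalized embeddings:
    low when the anchor is most similar to its positive view, high
    when a negative wins. No labels are required."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # positive should win

rng = np.random.default_rng(2)
anchor = np.array([1.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0])     # similar view of the 'same scan'
negatives = rng.standard_normal((5, 3))  # embeddings of unrelated scans
loss_similar = info_nce_loss(anchor, positive, negatives)
loss_dissimilar = info_nce_loss(anchor, -positive, negatives)
```

Minimizing this loss over many scans pushes views of the same examination together and different examinations apart, yielding useful representations without any diagnostic labels.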

Federated Learning for Demographic Diversity

One reason medical foundation models are demographically skewed is that data is concentrated in a few large institutions—often with homogeneous patient populations.

Federated learning enables training models across multiple institutions without sharing patient data. Institutions train local models on their own data, then share only model updates (not data itself). This allows foundation models to learn from diverse populations while preserving privacy.12
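The aggregation step at the heart of this scheme (FedAvg-style weighted averaging) is simple to sketch. The site names and study counts below are hypothetical; the point is that only weight vectors cross institutional boundaries, never images or patient records.

```python
import numpy as np

def federated_average(local_updates, num_samples):
    """One FedAvg round: each institution contributes only its locally
    trained weights; the server returns their sample-size-weighted
    average as the new global model. No patient data is shared."""
    total = sum(num_samples)
    return sum(w * (n / total) for w, n in zip(local_updates, num_samples))

# Hypothetical weight vectors from three sites of different sizes
site_weights = [np.array([1.0, 0.0]),   # academic center, 8,000 studies
                np.array([0.0, 1.0]),   # community hospital, 1,000 studies
                np.array([0.5, 0.5])]   # international site, 1,000 studies
site_sizes = [8000, 1000, 1000]

global_model = federated_average(site_weights, site_sizes)
# global_model is approximately [0.85, 0.15]
```

Note the tradeoff visible even in this toy: sample-size weighting lets the largest (often most homogeneous) site dominate the average, so diversity of contributing sites matters, not just their number.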

If widely adopted, federated learning could produce foundation models trained on truly representative datasets—incorporating community hospitals, international sites, and underrepresented populations that academic-only datasets miss.


The Bottom Line on Medical Foundation Models

Medical "foundation models" aren't as foundational as their name suggests. They're specialized, narrow, domain-specific tools—pre-trained on one modality, one body part, one type of task.

But within their domain, they're powerful. They enable clinical AI development with far less data than traditional approaches, making it feasible to build tools for rare diseases, small institutions, and niche applications.

The catch? Their value depends on whose data they learned from. If the pre-training dataset is demographically skewed, every downstream application inherits that bias—even if you fine-tune on balanced data.

So before you adopt a medical foundation model, ask the hard questions: What's the scope? Who's in the training data? How does it perform across demographics? And how much fine-tuning data do I actually need?

Because a foundation built on shaky ground won't support the clinical tools we need. But one built with transparency, diversity, and validation? That's a foundation worth building on.


Key Takeaways

  • "Foundation" Is a Misnomer: Most medical foundation models are narrow—one modality, one body part. They use self-supervised pre-training (not task-specific labels), but remain specialized to their domain.
  • Data Efficiency Is the Real Benefit: Foundation models reduce labeled data requirements by 5-10x through transfer learning, making clinical AI feasible for rare diseases and small datasets.
  • Explainability Isn't Automatic: Self-supervised learning creates abstract representations, not clinically interpretable features. Localization and explanation require additional work beyond transfer learning.
  • Demographics Matter Critically: Models trained on skewed populations propagate bias to downstream tasks—even with balanced fine-tuning data. Always ask: who's in the pre-training dataset?
  • Validation Must Be Stratified: Demand performance metrics broken down by demographic subgroups. "95% AUC overall" hides disparities between populations.
  • Ask Hard Questions Before Adopting: What's the scope? What are training demographics? How does it perform on my population? Does it provide the localization/explainability I need?

References & Further Reading

  1. Moor M, Banerjee O, Abad ZSH, et al. Foundation Models for Generalist Medical Artificial Intelligence. Nature. 2023;616(7956):259-265. doi:10.1038/s41586-023-05881-4
  2. Krishnan R, Rajpurkar P, Topol EJ. Self-supervised Learning in Medicine and Healthcare. Nat Biomed Eng. 2022;6(12):1346-1352. doi:10.1038/s41551-022-00914-1
  3. Willemink MJ, Koszek WA, Hardell C, et al. Preparing Medical Imaging Data for Machine Learning. Radiology. 2020;295(1):4-15. doi:10.1148/radiol.2020192224
  4. Tajbakhsh N, Shin JY, Gurudu SR, et al. Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning? IEEE Trans Med Imaging. 2016;35(5):1299-1312. doi:10.1109/TMI.2016.2535302
  5. Raghu M, Zhang C, Kleinberg J, Bengio S. Transfusion: Understanding Transfer Learning for Medical Imaging. Advances in Neural Information Processing Systems. 2019;32.
  6. Rouzrokh P, Wyles CC, Philbrick KA, Ramazanian T, Weston AD, Cai JC, Taunton MJ, Kremers WK, Lewallen DG, Erickson BJ. A Deep Learning Tool for Automated Radiographic Measurement of Acetabular Component Inclination and Version After Total Hip Arthroplasty. Part 1: Mitigating Bias in Machine Learning—Data Handling. J Arthroplasty. 2022;37(6S):S406-S413. doi:10.1016/j.arth.2022.02.092
  7. Zhang Y, Wyles CC, Makhni MC, Maradit Kremers H, Sellon JL, Erickson BJ. Part 2: Mitigating Bias in Machine Learning—Model Development. J Arthroplasty. 2022;37(6S):S414-S420. doi:10.1016/j.arth.2022.02.085
  8. Daneshjou R, Vodrahalli K, Novoa RA, et al. Disparities in Dermatology AI Performance on a Diverse, Curated Clinical Image Set. Sci Adv. 2022;8(32):eabq6147. doi:10.1126/sciadv.abq6147
  9. Faghani S, Khosravi B, Moassefi M, Rouzrokh P, Erickson BJ. Part 3: Mitigating Bias in Machine Learning—Performance Metrics, Healthcare Applications, and Fairness in Machine Learning. J Arthroplasty. 2022;37(6S):S421-S428. doi:10.1016/j.arth.2022.02.087
  10. Acosta JN, Falcone GJ, Rajpurkar P, Topol EJ. Multimodal Biomedical AI. Nat Med. 2022;28(9):1773-1784. doi:10.1038/s41591-022-01981-2
  11. Azizi S, Mustafa B, Ryan F, et al. Big Self-Supervised Models Advance Medical Image Classification. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021:3478-3488.
  12. Rieke N, Hancox J, Li W, et al. The Future of Digital Health with Federated Learning. NPJ Digit Med. 2020;3:119. doi:10.1038/s41746-020-00323-1