Healthcare institutions evaluating artificial intelligence systems for clinical deployment routinely encounter validation studies reporting impressive overall accuracy metrics—frequently exceeding 90% in aggregate performance measures. However, examination of supplementary materials or subgroup analyses often reveals substantial performance disparities across demographic categories. A representative example involves an AI system demonstrating 90% overall accuracy that, upon stratified analysis, achieves 92% accuracy for white patients but only 78% accuracy for Black patients—a 14-percentage-point gap that raises fundamental questions about clinical deployment ethics and health equity. This scenario is not hypothetical; it reflects a pervasive pattern documented across health systems, in which algorithms perform excellently for majority populations while exhibiting dangerous failure modes for underrepresented demographic groups. These disparities typically arise not from intentional developer malfeasance but from subtle, often invisible biases embedded within healthcare data—biases that remain undetected unless systematic investigation and subgroup analysis are performed during both development and validation.

Why Bias in Medical AI Is Different

Algorithmic bias in healthcare applications differs fundamentally from bias in other machine learning domains such as social media content recommendation or financial loan approval systems. While bias in these non-medical contexts raises legitimate concerns about fairness and equity, healthcare bias carries direct implications for patient safety and clinical outcomes that elevate it to a qualitatively different category of ethical concern. When an algorithm systematically under-serves or misclassifies members of specific population groups, the consequences extend beyond unfairness to encompass tangible clinical harms: patients experience delayed diagnoses or misdiagnoses, missed referrals to specialists for life-saving interventions, and inappropriate treatment recommendations, and they ultimately suffer worse health outcomes than demographically different patients with equivalent clinical presentations. Furthermore, these failures create a pernicious feedback loop: patients who experience algorithmic discrimination lose trust in healthcare systems and disengage from care, reducing the data available to train future AI systems on these populations and thereby potentially exacerbating rather than ameliorating existing disparities over successive model iterations. Healthcare bias is particularly insidious relative to other domains due to the unique characteristics of medical data and clinical decision-making processes.

Healthcare Data Reflects Historical Inequities

Medical records and healthcare databases do not represent neutral, objective repositories of biological information but rather encode centuries of documented healthcare disparities arising from structural racism, socioeconomic inequities, geographic access barriers, and systemic discrimination. These historical inequities manifest across multiple dimensions of healthcare data. Access barriers ensure that underserved populations have systematically fewer clinical encounters documented in electronic health records, fewer diagnostic tests ordered and performed, and less comprehensive clinical documentation compared to well-resourced populations with ready healthcare access. Diagnostic bias affects how clinicians interpret and document symptoms, with identical patient complaints described and coded differently depending on demographic characteristics—for example, chest pain in white male patients may be documented as potential cardiac ischemia warranting urgent evaluation, while identical symptoms in women or Black patients may be characterized as anxiety or panic with recommendations for psychiatric rather than cardiac evaluation. Treatment disparities result in different populations receiving systematically different interventions even when presenting with identical conditions and disease severity, creating differential outcome patterns that become encoded in outcome labels used to train predictive models. When machine learning algorithms are trained on this historically biased data, they do not perceive or recognize bias as such. Rather, they identify statistical patterns and associations present in the training data and optimize their predictions to match these patterns—including patterns that reflect bias rather than biology. Machine learning algorithms, whether traditional statistical models or deep neural networks, learn correlations and relationships present in training data without any capacity to understand the social, historical, or structural contexts that generated those relationships.10

"Objective" Algorithms Trained on Subjective Systems

There exists a widespread misconception that algorithms achieve objectivity through their mathematical nature, operating free from the subjective biases that affect human decision-making. However, this assumption fails to account for a fundamental reality: every algorithm trained on healthcare data is necessarily trained on data generated by humans making inherently subjective clinical decisions embedded within biased healthcare delivery systems. These subjective decisions permeate every aspect of the clinical data used for AI training, including:

  • Which patients receive referrals to specialist care (reflecting not only clinical need but also insurance coverage, geographic proximity to specialists, implicit bias in referral decision-making, and patient ability to advocate for specialized evaluation)
  • Which symptoms receive documentation in the medical record versus dismissal as non-significant (reflecting clinician attention patterns, documentation time availability, and differential credibility assigned to patient reports across demographic groups)
  • Which diagnostic tests are ordered for which patients (reflecting differential application of clinical guidelines, insurance authorization patterns, and risk-benefit calculations that may vary across patient populations)
  • How pain and other subjective symptoms are assessed, believed, and treated (a domain with particularly well-documented racial and gender disparities)

Consider a representative example: an AI system trained to predict which patients require cardiology consultation will necessarily learn from historical cardiology referral patterns present in its training data. If historical practice patterns reflect systematic under-referral of women and racial minorities presenting with chest pain symptoms—a well-documented disparity in cardiovascular care—the AI system will incorporate and replicate this bias in its predictions, because biased historical referral patterns constitute the "ground truth" outcome labels on which the model was trained. From an algorithmic performance perspective, the model is functioning correctly—accurately predicting historical referral patterns. However, the labels themselves encode bias, and optimizing prediction accuracy for biased labels perpetuates rather than corrects systemic inequities.

Where Bias Enters the Pipeline

Algorithmic bias in medical AI does not arise from a single point of failure but rather accumulates progressively across the entire development and deployment lifecycle, from initial data collection and curation through model training and validation to post-deployment performance monitoring. Understanding the multiple entry points through which bias infiltrates AI systems represents a necessary precondition for designing effective mitigation strategies. Recent research has systematically characterized bias sources across this pipeline, identifying critical intervention points during data handling, model development, and performance evaluation phases.7-9

1. Dataset Composition: Who Is Represented?

The overwhelming majority of medical AI systems are trained using data derived from academic medical centers, large urban healthcare systems, and well-resourced institutions that serve patient populations with characteristics systematically different from the broader population requiring healthcare services. These training datasets typically underrepresent rural populations with limited access to tertiary care facilities, non-English speaking patients who face language barriers to care and may have incomplete documentation in English-language electronic health records, and specific racial and ethnic minority groups whose representation at academic medical centers may not reflect their prevalence in the general population. This data composition bias—arising from non-representative sampling during the data collection phase—represents one of the most fundamental sources of algorithmic inequity because machine learning models generalize poorly to population subgroups absent or underrepresented in training data.7 The statistical distributions and patterns the model learns from majority populations in the training data may not transfer effectively to minority populations with different distributions of clinical features, disease presentations, or physiological parameters. For example, a diabetic retinopathy screening algorithm trained predominantly on fundus photographs from white patients exhibited substantially degraded performance when applied to Black and Hispanic patients, missing significant pathology due to differences in fundus pigmentation affecting image contrast and variations in image acquisition parameters across different patient populations.
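This kind of composition bias can be surfaced before any model is trained with a simple representation audit. The sketch below is illustrative Python: the row-per-patient structure, the self-reported `race_ethnicity` field, the census-style target shares, and the half-of-target flag rule are all hypothetical conventions, not a standard.

```python
# Illustrative representation audit: compare the demographic mix of a training
# cohort against the population the model is intended to serve.
from collections import Counter

def representation_audit(train_rows, target_shares, key="race_ethnicity"):
    """Flag subgroups whose share of the training data falls well below
    their share of the intended deployment population."""
    counts = Counter(row[key] for row in train_rows)
    n = sum(counts.values())
    report = {}
    for group, target in target_shares.items():
        observed = counts.get(group, 0) / n
        report[group] = {
            "train_share": round(observed, 3),
            "target_share": target,
            # Hypothetical rule: flag anything below half its target share.
            "underrepresented": observed < 0.5 * target,
        }
    return report

# Toy cohort drawn mostly from a single academic center.
cohort = ([{"race_ethnicity": "White"}] * 820 +
          [{"race_ethnicity": "Black"}] * 50 +
          [{"race_ethnicity": "Hispanic"}] * 60 +
          [{"race_ethnicity": "Asian"}] * 30)
print(representation_audit(cohort, {"White": 0.60, "Black": 0.13,
                                    "Hispanic": 0.19, "Asian": 0.06}))
```

An audit like this does not fix underrepresentation, but it makes the gap explicit before any performance numbers are ever computed.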

2. Labeling Bias: What Gets Called "Disease"?

Ground truth outcome labels used to train supervised learning models frequently represent proxies for actual disease state rather than direct measures of underlying pathology. Diagnosis codes extracted from billing and administrative data reflect which conditions were recognized, documented, and coded for reimbursement purposes rather than the complete spectrum of pathology present in patients. Laboratory value thresholds used to define normal versus abnormal status derive from reference ranges that may have been established using non-representative populations, potentially yielding inappropriate thresholds for demographic groups underrepresented in normative studies. Procedure utilization as an outcome label—for example, defining the need for advanced imaging based on who actually received an MRI scan—conflates clinical need with healthcare access, physician decision-making patterns, insurance authorization, and patient ability to attend scheduled procedures. The most extensively documented example of labeling bias creating algorithmic discrimination comes from the widely-deployed algorithm studied by Obermeyer and colleagues, which predicted healthcare resource needs based on healthcare costs incurred by patients, operating under the assumption that sicker patients generate higher healthcare expenditures.1 However, this assumption fails to account for systematic disparities in care delivery: Black patients receive less healthcare and consequently incur lower costs compared to white patients with equivalent disease severity and clinical need. As a result, the algorithm systematically classified Black patients as healthier and less in need of care management interventions compared to equally ill white patients, not because the algorithmic model was technically flawed, but because the outcome labels themselves encoded racial disparities in care delivery. The model optimized prediction accuracy for a biased label, thereby perpetuating existing inequities. Careful label investigation and validation against actual clinical ground truth rather than administrative or utilization proxies represents an essential step in avoiding such bias propagation.7
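The mechanism is easy to demonstrate in miniature. The following is a toy simulation, not the actual analysis by Obermeyer and colleagues; the 30% access penalty and the Gaussian "need" scores are invented solely to show how a model that predicts a cost label perfectly still inherits the disparity baked into that label.

```python
# Toy simulation of labeling bias: the label is cost, not illness, so equal
# sickness does not produce equal predicted "risk."
import random

random.seed(0)

def simulate_patient(group):
    need = random.gauss(50, 15)                    # true clinical need (unobserved)
    access = 1.0 if group == "white" else 0.7      # assumed disparity in care received
    cost = need * access + random.gauss(0, 5)      # spending reflects need AND access
    return {"group": group, "need": need, "cost": cost}

patients = [simulate_patient(g) for g in ("white", "black") for _ in range(5000)]

# Suppose a model predicts the cost label perfectly; enroll the top decile by cost.
threshold = sorted((p["cost"] for p in patients), reverse=True)[len(patients) // 10]
for g in ("white", "black"):
    flagged = [p for p in patients if p["group"] == g and p["cost"] >= threshold]
    mean_need = sum(p["need"] for p in flagged) / len(flagged)
    print(f"{g}: {len(flagged)} flagged, mean true need among flagged = {mean_need:.1f}")
```

The output shows far fewer Black patients clearing the enrollment threshold, and those who do are substantially sicker than their white counterparts: the qualitative signature Obermeyer and colleagues reported.1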

3. Proxy Variables: Race as a Shortcut

Some clinical algorithms explicitly incorporate race or ethnicity as input features, justified through appeals to biological differences across racial categories that are often based on flawed or outdated assumptions. However, even algorithms that deliberately exclude race as an explicit variable frequently recreate racial bias through the use of proxy variables—features highly correlated with race that enable the algorithm to implicitly reconstruct racial categories and differentially predict outcomes based on these reconstructed groupings. Common proxy variables that correlate strongly with race in U.S. healthcare data include:

  • Residential ZIP code (reflecting historical redlining practices and ongoing residential segregation)
  • Insurance status and payer type (Medicaid enrollment is heavily racialized due to eligibility criteria and enrollment patterns)
  • Primary language preference (serving as a proxy for immigration status and ethnicity)
  • Hospital of admission (geographic healthcare segregation results in differential racial composition across hospitals even within the same metropolitan area)
  • Numerous other variables that carry racial information despite not being explicitly demographic identifiers

The phenomenon of redundant encoding means that even when race is deliberately excluded from the feature set, algorithms can leverage these correlated variables to achieve predictions that vary systematically by race, effectively recreating race-based discrimination through indirect pathways.

The eGFR Controversy

For multiple decades, widely used equations for estimating glomerular filtration rate (eGFR) incorporated a race correction factor that adjusted creatinine-based GFR estimates upward for patients identified as Black, based on assumptions that Black individuals possess greater average muscle mass and consequently generate more creatinine at equivalent levels of kidney function. However, this race-based adjustment resulted in systematic overestimation of kidney function in Black patients, leading to delayed recognition of chronic kidney disease progression, delayed referrals to nephrologists for specialized management, delayed placement on kidney transplant waitlists, and delayed initiation of dialysis when clinically indicated. In 2021, following mounting evidence of harm and advocacy from nephrology professional societies and patient advocates, medical organizations recommended removal of race from eGFR calculation equations, acknowledging that the purported biological justification for racial adjustment was based on flawed assumptions conflating race (a social construct) with ancestry and genetics.2 However, by the time this policy change occurred, millions of Black patients had potentially experienced delayed or inadequate nephrology care due to systematically biased estimates of their kidney function. This case illustrates a critical lesson: even algorithmic adjustments presented as biologically or physiologically justified can cause substantial harm when they rely on flawed assumptions about racial differences, particularly when race is used as a crude proxy for complex genetic, environmental, and social factors.
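To make the mechanism concrete, here is the 2009 CKD-EPI creatinine equation with its race term written out (coefficients as published in 2009; the 2021 refit removed the race multiplier). The patient values in the example are illustrative.

```python
# 2009 CKD-EPI creatinine equation (superseded in 2021 by a race-free refit).
# Shown to make the race "correction" explicit.
def egfr_ckd_epi_2009(scr_mg_dl, age, female, black):
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    return (141
            * min(scr_mg_dl / kappa, 1.0) ** alpha
            * max(scr_mg_dl / kappa, 1.0) ** -1.209
            * 0.993 ** age
            * (1.018 if female else 1.0)
            * (1.159 if black else 1.0))  # race term: +15.9% for patients coded as Black

# Identical creatinine, age, and sex; only the race flag differs.
print(round(egfr_ckd_epi_2009(1.4, 60, female=False, black=False), 1))  # ~54
print(round(egfr_ckd_epi_2009(1.4, 60, female=False, black=True), 1))   # ~63
```

At the same creatinine, the race multiplier alone moves the estimate from roughly 54 to 63 mL/min/1.73 m², enough to shift a patient across staging and referral thresholds, which is precisely how the delayed-care harms described above accumulated.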

Real Clinical Examples: Bias at the Bedside

Race-Based Algorithms in Nephrology

Beyond eGFR, race appears in algorithms for:

  • Kidney stone risk: Lower predicted risk for Black patients (despite similar disease prevalence)
  • Preeclampsia screening: Different thresholds by race for proteinuria
  • Urinary tract infection diagnosis: Race-adjusted prediction rules

The common thread: Using race as a biological variable when it's actually a social variable capturing disparities in care, environment, and discrimination.

Pulse Oximetry + AI: Compounding Error

Pulse oximeters are less accurate in patients with darker skin—overestimating oxygen saturation by 2-3% on average.3

Now imagine training an AI on pulse oximetry data to predict:

  • Who needs supplemental oxygen
  • Who should be admitted to ICU
  • Who meets criteria for ECMO

The AI learns from systematically inaccurate readings in Black patients. Result: It underestimates disease severity and under-triages care.

This isn't theoretical. During COVID-19, pulse oximetry errors contributed to delayed recognition of hypoxemia, and delayed care, for minority patients; any oximetry-driven triage tool inherits exactly that error.
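A back-of-the-envelope simulation makes the compounding visible. Every number here is an assumption for illustration: a uniform 2-3% upward bias, a 92% triage cutoff, and a synthetic borderline-hypoxemic population.

```python
# Toy simulation: a +2-3% upward SpO2 bias pushes truly hypoxemic patients
# above a fixed triage cutoff, so any model consuming oximeter readings
# silently under-triages the affected group (occult hypoxemia).
import random

random.seed(1)
TRIAGE_CUTOFF = 92  # hypothetical "needs escalation" threshold

def measured_spo2(true_sat, dark_skin):
    bias = random.uniform(2, 3) if dark_skin else 0.0  # assumed device bias
    return true_sat + bias + random.gauss(0, 1)        # plus ordinary noise

misses = {}
for dark_skin in (False, True):
    missed = 0
    for _ in range(10_000):
        true_sat = random.uniform(88, 94)              # borderline-hypoxemic range
        if true_sat < TRIAGE_CUTOFF and measured_spo2(true_sat, dark_skin) >= TRIAGE_CUTOFF:
            missed += 1
    misses[dark_skin] = missed

print(f"hypoxemic patients read as 'fine': lighter skin {misses[False]}, "
      f"darker skin {misses[True]}")
```

In this toy setup, several times more truly hypoxemic patients slip past the cutoff in the biased group, and a triage model trained on these readings would learn that pattern as if it were physiology.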

Under-Triage of Minority Patients

Triage algorithms predict who needs higher-acuity care. But if trained on historical patterns where:

  • Minority patients are less likely to be admitted to ICU (due to bias or structural barriers)
  • Pain is under-recognized and under-treated in Black patients
  • Women's cardiac symptoms are more likely to be dismissed

Then the AI will replicate those patterns—flagging fewer minority and female patients as "high-risk" even when clinically equivalent to white male patients.

Why Clinicians Are Often the Last to Know

Bias Is Invisible at Point of Care

When you use an AI tool, you see:

  • This patient's prediction
  • This patient's recommendation

You don't see:

  • How predictions vary across demographics
  • False negative rates by race
  • Whether the model underperforms for your population vs. the validation cohort

Bias is a population-level phenomenon. Individual physicians can't detect it from individual cases.

Lack of Transparency in Commercial Tools

Most hospital-purchased AI tools don't disclose:

  • Training data demographics
  • Subgroup performance metrics
  • Disparate impact assessments
  • Bias mitigation strategies employed

You're expected to trust that "the algorithm works"—but you have no way to verify fairness.

What Fairness Means Clinically

"Fairness" sounds simple. It's not. There are multiple, mathematically incompatible definitions—and the choice of performance metrics used to evaluate fairness fundamentally shapes what "success" looks like.9

Group Fairness (Demographic Parity)

Definition: The algorithm predicts the same rate of positive outcomes across groups.

Example: If 10% of white patients are flagged as "high-risk," then 10% of Black patients should also be flagged.

Problem: If disease prevalence differs across groups (due to social determinants, not biology), forcing equal rates means the algorithm will under-predict in the higher-prevalence group.
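A minimal sketch of the parity check itself, using toy predictions and arbitrary group labels:

```python
# Demographic parity compares positive-prediction (flag) rates across groups.
def flag_rate(preds, groups, group):
    idx = [i for i, g in enumerate(groups) if g == group]
    return sum(preds[i] for i in idx) / len(idx)

preds  = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # 1 = flagged "high-risk"
groups = ["A"] * 5 + ["B"] * 5
print(flag_rate(preds, groups, "A"),      # 0.4
      flag_rate(preds, groups, "B"))      # 0.2 -> parity violated
```

Note that nothing in this check looks at whether the flags were correct, which is exactly why forcing the rates to match can misfire when true prevalence differs.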

Individual Fairness (Equal Treatment)

Definition: Similar patients get similar predictions, regardless of group membership.

Example: Two patients with identical vitals, labs, and history get the same risk score—whether they're Black, white, male, female, etc.

Problem: "Similar" is subjective. If the definition of "similar" ignores structural inequities (e.g., access to care, neighborhood safety, environmental exposures), you're comparing apples to oranges.

Equalized Odds (Equal Accuracy)

Definition: The model has the same sensitivity and specificity across groups.

Example: Among patients who truly have disease, the model catches 85% regardless of race. Among patients who don't, it correctly rules out 90% regardless of race.

Problem: Achieving this mathematically often requires different decision thresholds for different groups—which raises its own ethical questions.
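The corresponding check computes error rates within each group rather than flag rates. A minimal sketch with toy data:

```python
# Equalized odds: sensitivity and specificity computed per group must match.
def sens_spec(y_true, y_pred, groups, group):
    tp = fn = tn = fp = 0
    for t, p, g in zip(y_true, y_pred, groups):
        if g != group:
            continue
        if t == 1 and p == 1:   tp += 1
        elif t == 1 and p == 0: fn += 1
        elif t == 0 and p == 0: tn += 1
        else:                   fp += 1
    return tp / (tp + fn), tn / (tn + fp)

y_true = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0, 0, 1]
groups = ["A"] * 5 + ["B"] * 5
for g in ("A", "B"):
    se, sp = sens_spec(y_true, y_pred, groups, g)
    print(f"group {g}: sensitivity {se:.2f}, specificity {sp:.2f}")
```

Here both groups land at identical sensitivity and specificity, so equalized odds holds for this toy data; on real data, achieving that often requires the group-specific thresholds noted above.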

The Fairness Impossibility Theorem

You cannot simultaneously satisfy every definition of fairness. When disease prevalence differs across groups, demographic parity, equalized odds, and calibration are mathematically incompatible: optimizing for one criterion typically worsens another.

This means there is no "unbiased" algorithm in a mathematical sense. There are only different tradeoffs.

The question isn't "Is this fair?" It's "Which type of fairness matters most for this clinical use case—and who bears the cost of our choice?"
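The arithmetic behind this impossibility is worth seeing once. Hold sensitivity and specificity fixed across two groups (so equalized odds is satisfied) and let prevalence differ; positive predictive value must then diverge, so predictive parity fails:

```python
# Worked example: equal error rates + unequal prevalence => unequal PPV.
def ppv(prevalence, sensitivity, specificity):
    tp = prevalence * sensitivity              # true positives per capita
    fp = (1 - prevalence) * (1 - specificity)  # false positives per capita
    return tp / (tp + fp)

sens, spec = 0.85, 0.90                        # identical in both groups
print(f"PPV at 10% prevalence: {ppv(0.10, sens, spec):.2f}")  # ~0.49
print(f"PPV at 20% prevalence: {ppv(0.20, sens, spec):.2f}")  # ~0.68
```

A positive flag simply means less in the lower-prevalence group, no matter how the model was trained; this is a property of the arithmetic, not of any particular algorithm.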

Tradeoffs Between Accuracy and Equity

Sometimes, optimizing for fairness reduces overall accuracy.

Example: A sepsis predictor is 90% accurate overall but only 82% accurate for Hispanic patients (due to sparse training data). You can:

  • Option A: Use the model as-is (90% overall, disparate performance)
  • Option B: Retrain to equalize performance across groups (maybe 87% overall, 87% for all groups)
  • Option C: Don't deploy—collect better data first

There's no obviously "right" answer. It depends on your values:

  • Maximize benefit for the majority? (Option A)
  • Equalize outcomes across groups, even if average drops? (Option B)
  • Wait until you can do both? (Option C—but patients suffer in the interim)

Why "Removing Race" Isn't Enough

The instinctive response to race-based bias: Just don't use race as a variable.

Problem: Race is correlated with everything in healthcare data.

  • ZIP code
  • Hospital of admission
  • Insurance type
  • Comorbidity burden (driven by access, not biology)
  • Lab values (affected by different baseline health states)

Remove race from the model, and the algorithm will reconstruct it from these proxies. This is called redundant encoding—when group membership can be inferred from other variables. Model selection, architecture choices, and training techniques all influence how these patterns get encoded.8
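A standard way to test for redundant encoding is a probe model: try to predict the protected attribute from the supposedly race-blind features. The sketch below uses synthetic data and scikit-learn; the proxy strengths and feature names are invented for illustration.

```python
# Redundant-encoding probe: if group membership is recoverable from the
# "race-blind" features, downstream models can recover it too.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
group = rng.integers(0, 2, n)                    # protected attribute, held OUT of X
zip_income = rng.normal(60 - 15 * group, 10, n)  # proxy: segregated neighborhoods
medicaid = rng.binomial(1, 0.15 + 0.25 * group)  # proxy: racialized payer mix
labs = rng.normal(0, 1, n)                       # non-proxy clinical noise
X = np.column_stack([zip_income, medicaid, labs])

X_tr, X_te, g_tr, g_te = train_test_split(X, group, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, g_tr)
auc = roc_auc_score(g_te, probe.predict_proba(X_te)[:, 1])
print(f"AUC for recovering group from 'race-blind' features: {auc:.2f}")
# Anything well above 0.5 means the feature set redundantly encodes the group.
```

If the probe succeeds, simply deleting the race column was never going to blind the model.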

Case Study: Predicting No-Shows

A hospital builds an AI to predict which patients will miss appointments (to overbook strategically).

They don't use race. But they use:

  • Missed appointments in the past (correlated with transportation barriers, work schedule inflexibility—both racialized)
  • Insurance type (Medicaid heavily racialized due to historical policy)
  • Neighborhood (segregation proxy)

Result: The model flags Black and Hispanic patients as "high no-show risk" at higher rates. Clinic staff assume they're unreliable. They get less flexible scheduling, more restrictive policies, worse rapport.

The algorithm never saw race. But it learned racism anyway.

Practical Guidance for Physicians

Questions to Ask Vendors

Before your hospital adopts an AI tool, demand answers:

  1. What was the demographic composition of the training data?
    If they can't tell you, that's a red flag.7
  2. What is the model's performance across racial/ethnic subgroups?
    Overall accuracy means nothing if it's 95% for white patients and 75% for Black patients. Demand stratified metrics—sensitivity, specificity, PPV, NPV—by demographic group.9
  3. Were disparate impact assessments performed?
    Did you test whether the model systematically over- or under-predicts for specific groups?
  4. Does the model use race as a variable?
    If yes, why? What's the biological justification (vs. social determinant)?2
  5. How do you handle missing data?
    Minority patients often have sparser EHR data. How does the model account for this?7
  6. Was external validation performed on diverse populations?
    Internal testing on the same institution's data isn't enough. Models must be tested on external datasets representing different demographics and practice settings.9
  7. What's your plan for ongoing bias monitoring?
    Performance should be tracked by subgroup post-deployment, with clear thresholds for intervention when disparities emerge. (A minimal sketch of such monitoring follows this list.)
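As a concrete illustration of question 7, here is a minimal monitoring sketch. Everything about it is an illustrative assumption rather than a standard: the field names, the 5-percentage-point alert gap, and the choice of sensitivity as the tracked metric.

```python
# Post-deployment fairness monitoring sketch: recompute stratified sensitivity
# on each review window and alert when a subgroup lags the average.
from collections import defaultdict

ALERT_GAP = 0.05  # hypothetical tolerance: flag gaps > 5 percentage points

def subgroup_sensitivity(records):
    """records: dicts with 'group', 'y_true', 'y_pred' (0/1 outcomes)."""
    tp, pos = defaultdict(int), defaultdict(int)
    for r in records:
        if r["y_true"] == 1:
            pos[r["group"]] += 1
            tp[r["group"]] += r["y_pred"]
    return {g: tp[g] / pos[g] for g in pos}

def fairness_alerts(records):
    sens = subgroup_sensitivity(records)
    mean_sens = sum(sens.values()) / len(sens)  # unweighted across groups
    return [g for g, s in sens.items() if mean_sens - s > ALERT_GAP]

window = ([{"group": "A", "y_true": 1, "y_pred": 1}] * 90 +
          [{"group": "A", "y_true": 1, "y_pred": 0}] * 10 +
          [{"group": "B", "y_true": 1, "y_pred": 1}] * 75 +
          [{"group": "B", "y_true": 1, "y_pred": 0}] * 25)
print(fairness_alerts(window))  # ['B'] -> group B's sensitivity lags the mean
```

The point is not this particular rule but that the check runs automatically, on live data, with a predefined threshold that triggers human review.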

Warning Signs of Biased Systems

Red flags that an algorithm may be biased:

  • Validation study doesn't report subgroup performance
  • Training data is from a single health system or region
  • Model uses "cost" or "utilization" as a proxy for illness severity
  • Vendor can't explain what features drive predictions
  • No plan for monitoring fairness post-deployment

If vendors dismiss fairness concerns as "not relevant to our model," walk away.

The Clinician's Ethical Role

You are the final common pathway for AI in medicine. Even if the algorithm is biased, you can mitigate harm by:

  • Questioning predictions that don't match clinical judgment — especially for patients from underrepresented groups
  • Documenting when you override the AI — this creates the data needed to detect bias
  • Advocating for better tools — demand fairness assessments before procurement
  • Reporting disparate outcomes — if minority patients seem to have worse outcomes with AI-assisted care, raise it

The algorithm can't be racist. But the system can be. And you're part of the system.


Moving Forward: What Responsible AI Looks Like

Bias in medical AI systems should not be conceptualized primarily as a technical problem amenable to purely technical solutions through algorithmic refinement or optimization. Rather, it represents fundamentally a matter of healthcare justice and equity that requires sustained vigilance, institutional transparency, and clear accountability structures. Addressing algorithmic bias demands recognition that the problem extends beyond algorithm design to encompass data provenance, label quality, deployment contexts, and ongoing monitoring—a systems-level challenge requiring multidisciplinary collaboration among clinicians, data scientists, ethicists, patients, and healthcare administrators.

Complete elimination of bias from medical AI systems represents an unattainable goal given that bias is encoded within the historical healthcare data from which algorithms learn. However, the impossibility of perfect fairness should not counsel resignation or inaction. Rather, responsible AI development and deployment requires that institutions and developers acknowledge the presence of bias, implement systematic approaches to measure and quantify disparities across demographic subgroups, and deploy evidence-based mitigation strategies to reduce inequitable performance to the greatest extent possible. This commitment to bias measurement and mitigation necessitates several concrete institutional practices:

  • Demanding transparency from commercial vendors regarding the demographic composition and characteristics of the training datasets used to develop algorithms
  • Insisting on pre-deployment testing that includes validation on demographically diverse populations representative of the intended deployment context, rather than relying solely on aggregate performance metrics
  • Implementing post-deployment monitoring systems that track algorithm performance stratified by demographic subgroup to detect emergent disparities
  • Maintaining institutional willingness to retire or modify tools that demonstrate persistent disparate impact despite mitigation efforts
  • Fundamentally reorienting AI governance frameworks to center equity as a primary consideration rather than treating fairness assessment as an afterthought or optional enhancement

Consider again the paradigmatic case of an algorithm demonstrating 92% accuracy for white patients but only 78% accuracy for Black patients—a 14-percentage-point performance gap that reflects unacceptable inequity in a clinical tool. The appropriate institutional response to such disparate performance is clear: the algorithm should not be adopted for clinical use until this performance gap is substantially narrowed or eliminated, regardless of its impressive aggregate accuracy metrics or vendor assurances. Premature deployment of such a tool privileges the majority population that benefits from high accuracy while systematically disadvantaging the minority population that experiences degraded performance and consequently inferior clinical care. The cost of that 14-percentage-point accuracy gap is not borne abstractly by the algorithm or the institution but concretely by individual patients who experience misdiagnosis, delayed treatment, or inappropriate care recommendations—patients whom healthcare systems are ethically and professionally obligated to serve with equity.


Key Takeaways

  • Bias reflects training data, not developer intent: Machine learning algorithms learn statistical patterns from historical healthcare data that encodes centuries of systemic inequities including access barriers, diagnostic bias in symptom interpretation and documentation, and treatment disparities across demographic groups. These learned patterns perpetuate rather than correct existing inequities.
  • Outcome labels frequently represent biased proxies: Ground truth labels used to train supervised learning models often reflect who received care rather than who needed care, with healthcare utilization and costs serving as particularly problematic proxies that systematically underestimate disease severity in underserved populations experiencing care disparities.
  • Excluding race as an explicit variable is insufficient: Algorithms can reconstruct racial categories through redundant encoding using proxy variables highly correlated with race including residential ZIP code, insurance type, primary language, and hospital of admission. Simply removing race from the feature set does not eliminate racial bias in predictions.
  • Multiple fairness definitions are mathematically incompatible: Group fairness (demographic parity), individual fairness (equal treatment of similar cases), and equalized odds (equal accuracy across groups) represent distinct fairness frameworks that cannot be simultaneously optimized. Algorithm designers must make explicit choices about which fairness criterion to prioritize, and these choices have distributional consequences.
  • Critical vendor accountability questions: Healthcare institutions evaluating AI systems should demand detailed information regarding training data demographic composition, algorithm performance metrics stratified by demographic subgroups, disparate impact assessments, biological versus social justifications for including race or proxies, approaches to handling missing data that may be differential across groups, external validation on diverse populations, and plans for ongoing fairness monitoring post-deployment.
  • Clinicians serve as critical gatekeepers: Individual physicians can mitigate algorithmic bias through critical evaluation of predictions that conflict with clinical judgment particularly for patients from underrepresented groups, systematic documentation of algorithm overrides to create data enabling bias detection, institutional advocacy for fairness assessments before procurement decisions, and reporting of observed disparate outcomes in AI-assisted care.

References & Further Reading

  1. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342
  2. Vyas DA, Eisenstein LG, Jones DS. Hidden in Plain Sight — Reconsidering the Use of Race Correction in Clinical Algorithms. N Engl J Med. 2020;383(9):874-882. doi:10.1056/NEJMms2004740
  3. Sjoding MW, Dickson RP, Iwashyna TJ, Gay SE, Valley TS. Racial Bias in Pulse Oximetry Measurement. N Engl J Med. 2020;383(25):2477-2478. doi:10.1056/NEJMc2029240
  4. FDA. Artificial Intelligence and Machine Learning in Software as a Medical Device. Updated 2023. FDA.gov
  5. World Health Organization. Ethics and Governance of Artificial Intelligence for Health. WHO Guidance. 2021. WHO.int
  6. Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data. JAMA Intern Med. 2018;178(11):1544-1547.
  7. Rouzrokh P, Khosravi B, Faghani S, et al. Mitigating Bias in Radiology Machine Learning: 1. Data Handling. Radiol Artif Intell. 2022;4(5):e210290. doi:10.1148/ryai.210290
  8. Zhang K, Khosravi B, Vahdati S, et al. Mitigating Bias in Radiology Machine Learning: 2. Model Development. Radiol Artif Intell. 2022;4(5):e220010. doi:10.1148/ryai.220010
  9. Faghani S, Khosravi B, Moassefi M, et al. Mitigating Bias in Radiology Machine Learning: 3. Performance Metrics. Radiol Artif Intell. 2022;4(5):e220061. doi:10.1148/ryai.220061
  10. Erickson BJ, Kitamura F. Artificial Intelligence in Radiology: a Primer for Radiologists. Radiol Clin North Am. 2021;59(6):991-1003. doi:10.1016/j.rcl.2021.07.004