A validation study lands on your desk reporting 90% accuracy. Impressive—until you open the supplement and find the subgroup breakdown: 92% for white patients, 78% for Black patients. That 14-point gap is the whole story, and it raises a hard question about whether the tool should be deployed at all. This is not a hypothetical. It is a pattern that shows up across health systems, where an algorithm performs brilliantly for the majority and fails quietly for everyone else. The cause is rarely anyone's bad intent; it is bias baked into the healthcare data itself—bias that stays invisible unless someone runs the subgroup analysis to find it.
Why Bias in Medical AI Is Different
Bias in a movie recommender or a loan model is a real problem, but bias in a clinical model is a different category, because the cost is paid in patient safety. When an algorithm systematically under-serves a group, the harm isn't just unfairness—it's delayed diagnoses, missed referrals for life-saving care, and worse outcomes than a different patient with the same presentation would have had. Worse, it can feed on itself: patients who experience that discrimination lose trust and disengage, which thins out the data available to train the next model on those same populations, widening the gap with each iteration. What makes healthcare bias especially hard to root out is the nature of the data and the decisions it records.
Healthcare Data Reflects Historical Inequities
A medical record is not a neutral repository of biology. It encodes decades of documented disparities (structural racism, socioeconomic inequity, access barriers, discrimination) along several axes. Access: underserved patients have fewer documented encounters, fewer tests ordered, and thinner records than well-resourced patients. Diagnosis: the same complaint gets described and coded differently by demographic; chest pain in a white man may be worked up as possible cardiac ischemia, while the same symptom in a woman or a Black patient is more likely written off as anxiety. Treatment: different groups receive different interventions for identical conditions, and those differences get baked into the outcome labels a model trains on. The model never "sees" any of this as bias. It just finds the statistical patterns in the data and optimizes to match them, including the patterns that reflect bias rather than biology. An algorithm, classical or deep, learns correlations; it has no way to understand the history that produced them [10].
"Objective" Algorithms Trained on Subjective Systems
A common assumption holds that an algorithm, being mathematical, is objective—free of the biases that color human judgment. That misses the point: every model trained on healthcare data learns from decisions made by humans inside a biased delivery system. Those subjective calls run through the whole dataset—who gets referred to a specialist (shaped by insurance, proximity, implicit bias, and a patient's ability to advocate), which symptoms get documented versus dismissed (shaped by attention, time, and how much a patient is believed), which tests get ordered, and how pain is assessed and treated (a domain with well-documented racial and gender disparities). Take a model that predicts which patients need a cardiology consult. It learns from historical referral patterns—and if those patterns under-referred women and minorities with chest pain, as the evidence says they did, the model absorbs and repeats that bias, because the biased referrals are its ground-truth labels. By its own metric the model is working perfectly: it predicts the historical pattern accurately. But the labels encode bias, and optimizing to match them perpetuates the inequity rather than correcting it.
Where Bias Enters the Pipeline
Bias doesn't enter at one point; it accumulates across the whole lifecycle, from data collection and curation through training and validation to post-deployment monitoring. You can't mitigate what you can't locate, so it helps to know the main entry points. Recent work has mapped them across data handling, model development, and evaluation [7-9].
1. Dataset Composition: Who Is Represented?
Most medical AI is trained on data from academic medical centers and large, well-resourced urban systems, whose populations differ systematically from the broader patient population. These datasets tend to underrepresent rural patients far from tertiary care, non-English speakers whose records may be incomplete, and minority groups whose presence at academic centers doesn't match their share of the general population. This composition bias, introduced at the data-collection stage, is one of the most basic sources of inequity, because models generalize poorly to groups they barely saw in training [7]. Patterns learned from the majority may simply not transfer to minority patients with different clinical features or presentations. A diabetic-retinopathy screener trained mostly on fundus photographs from white patients, for instance, can underperform on Black and Hispanic patients, plausibly because differences in fundus pigmentation affect image contrast and because acquisition parameters vary across settings.
2. Labeling Bias: What Gets Called "Disease"?
The labels we train on are often proxies for disease, not disease itself. Billing codes capture what was recognized, documented, and coded for reimbursement, not the full picture of a patient's pathology. Lab thresholds for "normal" rest on reference ranges that may have been set on non-representative populations. Using procedure utilization as a label (say, defining "needs advanced imaging" by who actually got an MRI) confuses clinical need with access, physician habit, insurance authorization, and whether the patient could make the appointment. The best-documented case is the algorithm Obermeyer and colleagues studied, which predicted health needs from healthcare costs on the assumption that sicker patients cost more [1]. But Black patients receive less care and therefore incur lower costs at equivalent severity, so the model rated them healthier and less in need of management than equally sick white patients. The math wasn't flawed; the label was. Optimizing accuracy against a biased label simply reproduces the bias. The lesson is to validate labels against real clinical ground truth, not administrative or utilization proxies [7].
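A toy sketch of the cost-as-label problem, using invented numbers rather than anything from the study: two hypothetical groups have identical illness burden, but group B generates 25% less spending at the same severity. Ranking patients by cost (what a cost-trained model rewards) is then compared with ranking by illness (what clinical need would select). The group names, `access_factor` values, and programme capacity `k` are all assumptions made for illustration.

```python
# Toy cohort: each group has ten patients with illness burden 1..10
# (e.g. a count of active chronic conditions). All values are invented.
patients = [
    {"id": i, "group": group, "illness": illness}
    for i, (group, illness) in enumerate(
        [(g, s) for g in ("A", "B") for s in range(1, 11)]
    )
]

# Assumption: group B receives less care at equal severity, so each unit of
# illness produces less spending.
COST_PER_UNIT = 1000.0
access_factor = {"A": 1.0, "B": 0.75}

for p in patients:
    p["cost"] = p["illness"] * COST_PER_UNIT * access_factor[p["group"]]


def top_k(rows, key, k):
    """Return the k rows with the highest value of `key`."""
    return sorted(rows, key=lambda r: r[key], reverse=True)[:k]


def share_of_group(rows, group):
    """Fraction of `rows` that belong to `group`."""
    return sum(r["group"] == group for r in rows) / len(rows)


k = 6  # hypothetical capacity of the care-management programme
by_cost = top_k(patients, "cost", k)        # selection driven by the proxy label
by_illness = top_k(patients, "illness", k)  # selection driven by clinical need

share_b_cost = share_of_group(by_cost, "B")
share_b_illness = share_of_group(by_illness, "B")
print(f"Group B share ranked by cost:    {share_b_cost:.2f}")
print(f"Group B share ranked by illness: {share_b_illness:.2f}")
```

In this toy cohort group B fills half the slots when patients are ranked by illness but only a third when they are ranked by cost, even though the ranking itself is computed correctly in both cases.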
3. Proxy Variables: Race as a Shortcut
Some algorithms use race or ethnicity directly, justified by claimed biological differences that are often flawed or outdated. But even models that deliberately drop race can recreate racial bias through proxies—features so correlated with race that the model effectively reconstructs the category anyway. In U.S. healthcare data, the usual suspects are ZIP code (a legacy of redlining and ongoing segregation), insurance type (Medicaid enrollment is heavily racialized), primary language (a proxy for immigration status and ethnicity), and hospital of admission (geographic segregation gives hospitals different racial mixes even within one city). This is redundant encoding: remove race from the feature set and the model rebuilds it from the correlates, producing predictions that still vary systematically by race.
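Redundant encoding can be illustrated with a few lines of code. The sketch below uses synthetic counts (not real data) in which race is never given to the "model", yet a simple lookup on two proxy features recovers it far better than guessing the majority group. The ZIP labels, insurance categories, and counts are hypothetical.

```python
from collections import Counter, defaultdict

# Synthetic counts per (zip_code, insurance) cell, invented to mimic
# residential and insurance segregation.
cells = {
    ("zip_A", "commercial"): {"white": 80, "black": 10},
    ("zip_A", "medicaid"):   {"white": 15, "black": 10},
    ("zip_B", "commercial"): {"white": 5,  "black": 30},
    ("zip_B", "medicaid"):   {"white": 5,  "black": 95},
}
records = [
    (zip_code, insurance, race)
    for (zip_code, insurance), counts in cells.items()
    for race, n in counts.items()
    for _ in range(n)
]

# "Model": for each combination of proxy features, guess the most common race.
by_features = defaultdict(Counter)
for zip_code, insurance, race in records:
    by_features[(zip_code, insurance)][race] += 1
guess = {feats: c.most_common(1)[0][0] for feats, c in by_features.items()}

correct = sum(guess[(z, ins)] == race for z, ins, race in records)
proxy_accuracy = correct / len(records)

# Baseline: always guess the overall majority group, using no features.
race_counts = Counter(race for _, _, race in records)
baseline_accuracy = race_counts.most_common(1)[0][1] / len(records)

print(f"majority-guess baseline: {baseline_accuracy:.2f}")
print(f"recovered from proxies:  {proxy_accuracy:.2f}")
```

With these invented counts the proxies recover race for 88% of records against a 58% baseline. A real audit would run the same check on the model's actual feature set.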
The eGFR Controversy
For decades, the standard equations for estimating glomerular filtration rate (eGFR) included a race correction that adjusted creatinine-based estimates upward for Black patients, on the assumption that Black individuals have more muscle mass and thus higher baseline creatinine. The effect was to overestimate kidney function in Black patients—delaying recognition of progressing chronic kidney disease, nephrology referral, transplant waitlisting, and dialysis. In 2021, after mounting evidence of harm and pressure from nephrology societies and patient advocates, the major organizations recommended removing race from the equation, acknowledging that the biological rationale conflated race (a social construct) with ancestry and genetics. By then, the biased estimates had likely cost millions of Black patients timely kidney care. The lesson: even an adjustment dressed up as physiology can do real harm when it rests on a crude racial assumption standing in for complex genetic, environmental, and social factors.
Real Clinical Examples: Bias at the Bedside
Race-Based Algorithms in Nephrology
Beyond eGFR, race appears in algorithms for:
- Kidney stone risk: Lower predicted risk for Black patients (despite similar disease prevalence)
- Preeclampsia screening: Different thresholds by race for proteinuria
- Urinary tract infection diagnosis: Race-adjusted prediction rules
The common thread: Using race as a biological variable when it's actually a social variable capturing disparities in care, environment, and discrimination.
Pulse Oximetry + AI: Compounding Error
Pulse oximeters are less accurate in patients with darker skin—overestimating oxygen saturation by 2-3% on average.
Now imagine training an AI on pulse oximetry data to predict:
- Who needs supplemental oxygen
- Who should be admitted to ICU
- Who meets criteria for ECMO
The AI learns from systematically inaccurate readings in Black patients. Result: It underestimates disease severity and under-triages care.
This isn't hypothetical. During COVID-19, pulse oximeters were shown to overestimate oxygen saturation in patients with darker skin, and occult hypoxemia went undetected more often in Black patients—exactly the kind of systematic measurement error that any triage tool built on those readings would inherit and propagate.
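How a measurement error turns into under-triage can be sketched with a fixed threshold rule. Everything here is hypothetical: the rule (flag for oxygen when the reading is below 92), the 3-point device overestimate in one group, and the list of true saturations, which is deliberately identical for both groups.

```python
# Hypothetical triage rule: flag for supplemental oxygen when the
# pulse-oximeter reading falls below THRESHOLD.
THRESHOLD = 92

# Assumption for illustration: the device reads 3 points high in group "dark"
# and is unbiased in group "light".
device_bias = {"light": 0, "dark": 3}

# True arterial saturations, the same in both groups by construction.
true_sao2 = [86, 88, 89, 90, 91, 93, 95, 97]


def flagged(group):
    """True saturations of patients the rule flags, given the biased reading."""
    return [s for s in true_sao2 if s + device_bias[group] < THRESHOLD]


def missed_hypoxemia(group):
    """Truly hypoxemic patients (true value below THRESHOLD) who are not flagged."""
    caught = set(flagged(group))
    return [s for s in true_sao2 if s < THRESHOLD and s not in caught]


for group in device_bias:
    print(group, "flagged:", flagged(group), "missed:", missed_hypoxemia(group))
```

The rule is applied identically to both groups, yet it misses the patients at 89, 90, and 91 in the group whose readings run high. A model trained on those readings would learn from the same shifted inputs.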
Under-Triage of Minority Patients
Triage algorithms predict who needs higher-acuity care. But if trained on historical patterns where:
- Minority patients are less likely to be admitted to ICU (due to bias or structural barriers)
- Pain is under-recognized and under-treated in Black patients
- Women's cardiac symptoms are more likely to be dismissed
Then the AI will replicate those patterns—flagging fewer minority and female patients as "high-risk" even when clinically equivalent to white male patients.
Why Clinicians Are Often the Last to Know
Bias Is Invisible at Point of Care
When you use an AI tool, you see:
- This patient's prediction
- This patient's recommendation
You don't see:
- How predictions vary across demographics
- False negative rates by race
- Whether the model underperforms for your population vs. the validation cohort
Bias is a population-level phenomenon. Individual physicians can't detect it from individual cases.
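The arithmetic behind an aggregate figure hiding a subgroup gap can be worked through with the numbers from the opening example (92%, 78%, 90% overall). The sketch assumes, for simplicity, a cohort made up of only those two groups; the cohort mix it solves for is implied by the figures, not reported anywhere in the text.

```python
# Accuracy figures from the opening example.
acc_white, acc_black, acc_overall = 0.92, 0.78, 0.90

# overall = w * acc_white + (1 - w) * acc_black, solved for the white share w.
share_white = (acc_overall - acc_black) / (acc_white - acc_black)
share_black = 1 - share_white

# Expected errors per 1,000 patients within each group.
errors_per_1000 = {
    "white": round(1000 * (1 - acc_white)),
    "black": round(1000 * (1 - acc_black)),
}

print(f"implied cohort mix: {share_white:.1%} white, {share_black:.1%} Black")
print("errors per 1,000 patients in each group:", errors_per_1000)
```

Under that two-group assumption the headline 90% is consistent with a cohort that is roughly six-sevenths white, which is why the aggregate tracks the majority's accuracy so closely while the minority's error rate is nearly three times higher.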
Lack of Transparency in Commercial Tools
Most hospital-purchased AI tools don't disclose:
- Training data demographics
- Subgroup performance metrics
- Disparate impact assessments
- Bias mitigation strategies employed
You're expected to trust that "the algorithm works"—but you have no way to verify fairness.
What Fairness Means Clinically
"Fairness" sounds simple. It's not. There are multiple, mathematically incompatible definitions—and the choice of performance metrics used to evaluate fairness fundamentally shapes what "success" looks like.9
Group Fairness (Demographic Parity)
Definition: The algorithm predicts the same rate of positive outcomes across groups.
Example: If 10% of white patients are flagged as "high-risk," then 10% of Black patients should also be flagged.
Problem: If disease prevalence differs across groups (due to social determinants, not biology), forcing equal flag rates means the algorithm must either miss truly sick patients in the higher-prevalence group or over-flag healthy patients in the lower-prevalence group.
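A minimal sketch of checking demographic parity, and of the problem just described, on invented data: both hypothetical groups are flagged at 10%, but group B has twice the disease prevalence, so parity caps how many of its sick patients can be caught. The group labels and the three lists are assumptions for illustration.

```python
from collections import defaultdict


def flag_rates(groups, flagged):
    """Fraction of patients flagged high-risk within each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for g, f in zip(groups, flagged):
        totals[g] += 1
        positives[g] += int(f)
    return {g: positives[g] / totals[g] for g in totals}


# Ten patients per group; one patient flagged in each (10% flag rate).
groups = ["A"] * 10 + ["B"] * 10
flagged = [1] + [0] * 9 + [1] + [0] * 9
# True disease: 1 of 10 in group A, 2 of 10 in group B.
disease = [1] + [0] * 9 + [1, 1] + [0] * 8

rates = flag_rates(groups, flagged)
parity_gap = max(rates.values()) - min(rates.values())

# Sensitivity within group B: flagged patients among those truly sick.
sick_b = [f for g, f, d in zip(groups, flagged, disease) if g == "B" and d]
sensitivity_b = sum(sick_b) / len(sick_b)

print("flag rates:", rates, "parity gap:", parity_gap)
print("sensitivity in group B:", sensitivity_b)
```

Parity is satisfied exactly (gap of zero), yet half of the sick patients in group B go unflagged, because a 10% flag rate cannot cover a 20% prevalence.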
Individual Fairness (Equal Treatment)
Definition: Similar patients get similar predictions, regardless of group membership.
Example: Two patients with identical vitals, labs, and history get the same risk score—whether they're Black, white, male, female, etc.
Problem: "Similar" is subjective. If the definition of "similar" ignores structural inequities (e.g., access to care, neighborhood safety, environmental exposures), you're comparing apples to oranges.
Equalized Odds (Equal Error Rates)
Definition: The model has the same sensitivity and specificity across groups.
Example: Among patients who truly have disease, the model catches 85% regardless of race. Among patients who don't, it correctly rules out 90% regardless of race.
Problem: Achieving this mathematically often requires different decision thresholds for different groups—which raises its own ethical questions.
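Checking equalized odds comes down to computing sensitivity and specificity separately for each group and comparing them. The sketch below does that on invented confusion-matrix counts: hypothetical group A matches the 85% / 90% example, while group B has lower sensitivity. The helper names and counts are assumptions for illustration.

```python
def confusion_lists(tp, fn, tn, fp):
    """Build (y_true, y_pred) lists from confusion-matrix counts."""
    y_true = [1] * (tp + fn) + [0] * (tn + fp)
    y_pred = [1] * tp + [0] * fn + [0] * tn + [1] * fp
    return y_true, y_pred


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Invented counts for two hypothetical groups.
cohorts = {
    "A": confusion_lists(tp=17, fn=3, tn=18, fp=2),
    "B": confusion_lists(tp=14, fn=6, tn=18, fp=2),
}
metrics = {g: sensitivity_specificity(*data) for g, data in cohorts.items()}

sens_gap = abs(metrics["A"][0] - metrics["B"][0])
spec_gap = abs(metrics["A"][1] - metrics["B"][1])

for g, (sens, spec) in metrics.items():
    print(f"group {g}: sensitivity {sens:.2f}, specificity {spec:.2f}")
print(f"sensitivity gap {sens_gap:.2f}, specificity gap {spec_gap:.2f}")
```

Here specificity is equal across groups but sensitivity differs by 15 points, so equalized odds is violated: the model misses more true cases in group B.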
The Fairness Impossibility Theorem
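No imperfect model can satisfy every fairness criterion at once. When disease prevalence differs between groups, a model cannot have equal positive predictive value across groups and also equal false-positive and false-negative rates across groups. Meeting one criterion forces a gap in another.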
This means there is no "unbiased" algorithm in a mathematical sense. There are only different tradeoffs.
The question isn't "Is this fair?" It's "Which type of fairness matters most for this clinical use case—and who bears the cost of our choice?"
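One way to see the incompatibility is through a standard confusion-matrix identity: FPR = prevalence / (1 - prevalence) × (1 - PPV) / PPV × sensitivity. If two groups share the same PPV and sensitivity but differ in prevalence, their false-positive rates cannot match. The sketch below plugs in illustrative numbers; the prevalences and the 60% PPV are assumptions, and the 85% sensitivity is borrowed from the equalized-odds example.

```python
def implied_fpr(prevalence, ppv, sensitivity):
    """False-positive rate forced by prevalence, PPV and sensitivity.

    Derivation from the definitions, for a cohort of size N:
      TP = prevalence * sensitivity * N
      FP = (1 - prevalence) * FPR * N
      FP / TP = (1 - PPV) / PPV
    Solving for FPR gives the expression returned below.
    """
    return prevalence / (1 - prevalence) * (1 - ppv) / ppv * sensitivity


# Same PPV and sensitivity in both groups; only prevalence differs.
fpr_low_prev = implied_fpr(prevalence=0.10, ppv=0.60, sensitivity=0.85)
fpr_high_prev = implied_fpr(prevalence=0.20, ppv=0.60, sensitivity=0.85)

print(f"FPR at 10% prevalence: {fpr_low_prev:.3f}")
print(f"FPR at 20% prevalence: {fpr_high_prev:.3f}")
```

Holding PPV and sensitivity equal, the higher-prevalence group ends up with more than double the false-positive rate. Equalizing the false-positive rate instead would require giving up equal PPV or equal sensitivity.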
Tradeoffs Between Accuracy and Equity
Sometimes, optimizing for fairness reduces overall accuracy.
Example: A sepsis predictor is 90% accurate overall but only 82% accurate for Hispanic patients (due to sparse training data). You can:
- Option A: Use the model as-is (90% overall, disparate performance)
- Option B: Retrain to equalize performance across groups (maybe 87% overall, 87% for all groups)
- Option C: Don't deploy—collect better data first
There's no obviously "right" answer. It depends on your values:
- Maximize benefit for the majority? (Option A)
- Equalize outcomes across groups, even if average drops? (Option B)
- Wait until you can do both? (Option C—but patients suffer in the interim)
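To make "who bears the cost" concrete, the options can be translated into error counts. The sketch assumes a hypothetical cohort of 10,000 patients of whom 20% are Hispanic; neither figure comes from the example, and the 87% for Option B is the example's own tentative number.

```python
N = 10_000             # hypothetical cohort size
HISPANIC_SHARE = 0.20  # assumption; the example does not state the mix


def errors_by_group(overall_acc, hispanic_acc):
    """Split total misclassifications into Hispanic and all other patients."""
    n_hispanic = round(N * HISPANIC_SHARE)
    total_errors = round(N * (1 - overall_acc))
    hispanic_errors = round(n_hispanic * (1 - hispanic_acc))
    return {"hispanic": hispanic_errors, "other": total_errors - hispanic_errors}


option_a = errors_by_group(overall_acc=0.90, hispanic_acc=0.82)  # deploy as-is
option_b = errors_by_group(overall_acc=0.87, hispanic_acc=0.87)  # equalized

print("Option A errors:", option_a)
print("Option B errors:", option_b)
```

Under these assumptions Option B removes 100 errors among Hispanic patients and adds 400 among everyone else. Whether that exchange is acceptable is the values question, and the numbers only make it explicit.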
Why "Removing Race" Isn't Enough
The instinctive response to race-based bias: Just don't use race as a variable.
Problem: Race is correlated with everything in healthcare data.
- ZIP code
- Hospital of admission
- Insurance type
- Comorbidity burden (driven by access, not biology)
- Lab values (affected by different baseline health states)
Remove race from the model, and the algorithm will reconstruct it from these proxies. This is called redundant encoding: group membership can be inferred from other variables. Model selection, architecture choices, and training techniques all influence how these patterns get encoded [8].
Case Study: Predicting No-Shows
A hospital builds an AI to predict which patients will miss appointments (to overbook strategically).
They don't use race. But they use:
- Missed appointments in the past (correlated with transportation barriers, work schedule inflexibility—both racialized)
- Insurance type (Medicaid heavily racialized due to historical policy)
- Neighborhood (segregation proxy)
Result: The model flags Black and Hispanic patients as "high no-show risk" at higher rates. Clinic staff come to see the flagged patients as unreliable, and those patients get less flexible scheduling, more restrictive policies, and worse rapport.
The algorithm never saw race. But it learned racism anyway.
Practical Guidance for Physicians
Questions to Ask Vendors
Before your hospital adopts an AI tool, demand answers:
- What was the demographic composition of the training data? If they can't tell you, that's a red flag [7].
- What is the model's performance across racial/ethnic subgroups? Overall accuracy means nothing if it's 95% for white patients and 75% for Black patients. Demand stratified metrics (sensitivity, specificity, PPV, NPV) by demographic group [9].
- Were disparate impact assessments performed? Did the vendor test whether the model systematically over- or under-predicts for specific groups?
- Does the model use race as a variable? If yes, why? What's the biological justification (vs. social determinant) [2]?
- How do you handle missing data? Minority patients often have sparser EHR data. How does the model account for this [7]?
- Was external validation performed on diverse populations? Internal testing on the same institution's data isn't enough. Models must be tested on external datasets representing different demographics and practice settings [9].
- What's your plan for ongoing bias monitoring? Performance should be tracked by subgroup post-deployment, with clear thresholds for intervention when disparities emerge.
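Subgroup monitoring with an intervention threshold could look something like the sketch below. It computes sensitivity per group from logged cases and raises a review flag when the gap exceeds a tolerance. The event format, the 5-point `MAX_GAP`, and the toy log are all hypothetical; an institution would set its own metric and threshold.

```python
from collections import defaultdict

MAX_GAP = 0.05  # hypothetical tolerance for the sensitivity gap between groups


def subgroup_sensitivity(events):
    """events: iterable of (group, had_outcome, was_flagged) tuples."""
    true_pos, positives = defaultdict(int), defaultdict(int)
    for group, had_outcome, was_flagged in events:
        if had_outcome:
            positives[group] += 1
            true_pos[group] += int(was_flagged)
    return {g: true_pos[g] / positives[g] for g in positives}


def needs_review(events, max_gap=MAX_GAP):
    """Return (alert, per-group sensitivity); alert is True when the gap is too wide."""
    sens = subgroup_sensitivity(events)
    gap = max(sens.values()) - min(sens.values())
    return gap > max_gap, sens


# Toy log for one monitoring period (invented counts).
events = (
    [("A", True, True)] * 18 + [("A", True, False)] * 2
    + [("B", True, True)] * 7 + [("B", True, False)] * 3
    + [("A", False, False)] * 50 + [("B", False, False)] * 20
)

alert, sens = needs_review(events)
print("sensitivity by group:", sens)
print("review needed:", alert)
```

In practice small subgroups produce noisy estimates, so a real monitor would likely add confidence intervals or a minimum case count before alerting.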
Warning Signs of Biased Systems
Red flags that an algorithm may be biased:
- Validation study doesn't report subgroup performance
- Training data is from a single health system or region
- Model uses "cost" or "utilization" as a proxy for illness severity
- Vendor can't explain what features drive predictions
- No plan for monitoring fairness post-deployment
If vendors dismiss fairness concerns as "not relevant to our model," walk away.
The Clinician's Ethical Role
You are the final common pathway for AI in medicine. Even if the algorithm is biased, you can mitigate harm by:
- Questioning predictions that don't match clinical judgment — especially for patients from underrepresented groups
- Documenting when you override the AI — this creates the data needed to detect bias
- Advocating for better tools — demand fairness assessments before procurement
- Reporting disparate outcomes — if minority patients seem to have worse outcomes with AI-assisted care, raise it
The algorithm can't be racist. But the system can be. And you're part of the system.
Moving Forward: What Responsible AI Looks Like
Bias in medical AI is not primarily a technical problem with a technical fix. It is a question of healthcare equity, and it demands vigilance, transparency, and accountability rather than a better loss function. The problem lives across the whole pipeline—data provenance, label quality, deployment context, ongoing monitoring—which means addressing it takes clinicians, data scientists, ethicists, patients, and administrators working together, not engineers alone.
Eliminating bias entirely is not realistic—it is encoded in the historical data the models learn from. But the impossibility of perfect fairness is not an excuse for inaction. Responsible practice means naming the bias, measuring disparities across subgroups, and applying evidence-based mitigation to shrink them as far as possible. In concrete terms: demand that vendors disclose the demographics of their training data; require pre-deployment testing on populations that match where the tool will actually be used, not just an aggregate accuracy figure; monitor stratified performance after go-live; be willing to retire a tool whose disparate impact persists despite mitigation; and treat equity as a primary design criterion rather than a box checked at the end.
Return to the algorithm that is 92% accurate for white patients and 78% for Black patients. The right institutional response is not complicated: do not adopt it until that gap closes, however good the headline number or the vendor's reassurances. Deploying it anyway buys high accuracy for the majority at the direct expense of the minority. And that 14-point gap is not an abstraction on a slide—it is paid by specific patients in missed diagnoses, delayed treatment, and worse care, the very patients a health system is obligated to serve equitably.
Key Takeaways
- Bias reflects training data, not developer intent: Machine learning algorithms learn statistical patterns from historical healthcare data that encodes long-standing systemic inequities, including access barriers, diagnostic bias in how symptoms are interpreted and documented, and treatment disparities across demographic groups. Models that learn these patterns perpetuate existing inequities rather than correcting them.
- Outcome labels are frequently biased proxies: The ground-truth labels used to train supervised models often reflect who received care rather than who needed it. Healthcare utilization and cost are particularly problematic proxies, because they systematically underestimate disease severity in underserved populations.
- Excluding race as an explicit variable is insufficient: Algorithms can reconstruct racial categories through redundant encoding, using proxies highly correlated with race such as residential ZIP code, insurance type, primary language, and hospital of admission. Removing race from the feature set does not remove racial bias from the predictions.
- Fairness definitions are mathematically incompatible: Group fairness (demographic parity), individual fairness (similar predictions for similar patients), and equalized odds (equal sensitivity and specificity across groups) are distinct criteria that generally cannot all be satisfied at once. Algorithm designers must choose explicitly which criterion to prioritize, and that choice determines who bears the cost.
- Vendors should be held accountable: Institutions evaluating AI systems should demand the demographic composition of the training data, performance metrics stratified by subgroup, disparate impact assessments, the justification (biological versus social) for including race or its proxies, the approach to missing data that may differ across groups, external validation on diverse populations, and a plan for fairness monitoring after deployment.
- Clinicians are critical gatekeepers: Individual physicians can mitigate algorithmic bias by questioning predictions that conflict with clinical judgment (particularly for patients from underrepresented groups), documenting algorithm overrides so that bias becomes detectable, advocating for fairness assessments before procurement, and reporting disparate outcomes observed in AI-assisted care.
References & Further Reading
1. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342
2. Vyas DA, Eisenstein LG, Jones DS. Hidden in Plain Sight - Reconsidering the Use of Race Correction in Clinical Algorithms. N Engl J Med. 2020;383(9):874-882. doi:10.1056/NEJMms2004740
3. Sjoding MW, Dickson RP, Iwashyna TJ, Gay SE, Valley TS. Racial Bias in Pulse Oximetry Measurement. N Engl J Med. 2020;383(25):2477-2478. doi:10.1056/NEJMc2029240
4. FDA. Artificial Intelligence and Machine Learning in Software as a Medical Device. Updated 2023. FDA.gov
5. World Health Organization. Ethics and Governance of Artificial Intelligence for Health. WHO Guidance. 2021. WHO.int
6. Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data. JAMA Intern Med. 2018;178(11):1544-1547.
7. Rouzrokh P, Wyles CC, Philbrick KA, Ramazanian T, Weston AD, Cai JC, Taunton MJ, Kremers WK, Lewallen DG, Erickson BJ. A Deep Learning Tool for Automated Radiographic Measurement of Acetabular Component Inclination and Version After Total Hip Arthroplasty. Part 1: Mitigating Bias in Machine Learning - Data Handling. J Arthroplasty. 2022;37(6S):S406-S413. doi:10.1016/j.arth.2022.02.092
8. Zhang Y, Wyles CC, Makhni MC, Maradit Kremers H, Sellon JL, Erickson BJ. Part 2: Mitigating Bias in Machine Learning - Model Development. J Arthroplasty. 2022;37(6S):S414-S420. doi:10.1016/j.arth.2022.02.085
9. Faghani S, Khosravi B, Moassefi M, Rouzrokh P, Erickson BJ. Part 3: Mitigating Bias in Machine Learning - Performance Metrics, Healthcare Applications, and Fairness in Machine Learning. J Arthroplasty. 2022;37(6S):S421-S428. doi:10.1016/j.arth.2022.02.087
10. Erickson BJ, Kitamura F. Artificial Intelligence in Radiology: a Primer for Radiologists. Radiol Clin North Am. 2021;59(6):991-1003. doi:10.1016/j.rcl.2021.07.004