You're reviewing a new AI tool for your hospital. The validation study shows 90% accuracy overall. Impressive. But buried in the supplementary materials, you notice: 92% for white patients, 78% for Black patients.
Do you adopt it?
This isn't a hypothetical. It's happening in health systems across the country—algorithms that work brilliantly for some patients and fail dangerously for others. Not because developers intended harm, but because bias in healthcare data is invisible until you look for it.
"The algorithm isn't biased. The system that created the data is biased. The algorithm just learned what we taught it."
Why Bias in Medical AI Is Different
Bias in healthcare AI isn't like bias in social media ads or loan approvals. It's not just unfair—it's dangerous. Because when an algorithm systematically under-serves a population, people get sicker. They get misdiagnosed. They don't get referred for life-saving interventions.
And unlike other domains, healthcare bias is particularly insidious because:
Healthcare Data Reflects Historical Inequities
Medical records aren't neutral. They encode centuries of disparities:
- Access barriers: Underserved populations have fewer encounters, fewer tests, less documentation
- Diagnostic bias: The same symptoms documented differently across demographics (e.g., chest pain recorded as "anxiety," shortness of breath as "panic")
- Treatment disparities: Different populations receive different interventions for the same conditions
- Structural racism: Redlining, insurance discrimination, provider distribution—all captured in the data
When you train AI on this data, it doesn't see bias. It sees patterns. And it optimizes for those patterns—including the biased ones. Machine learning algorithms, whether traditional models or deep neural networks, learn relationships in training data without understanding the social context that created those relationships.10
"Objective" Algorithms Trained on Subjective Systems
We assume algorithms are neutral because they're mathematical. But every algorithm is trained on data generated by humans making subjective decisions:
- Which patients get referred for specialist care
- Which symptoms get documented vs. dismissed
- Which tests get ordered (and which don't)
- How pain is assessed and believed
Example: An AI trained to predict "who needs a cardiology consult" will learn from historical referral patterns. If cardiologists historically under-referred women and minorities for chest pain, the AI will replicate that bias—because that's the "ground truth" it was trained on.
The algorithm isn't wrong. The labels are wrong.
Where Bias Enters the Pipeline
Bias doesn't come from a single source. It accumulates across the entire AI development lifecycle—from data handling through model development to performance evaluation. Understanding where bias enters is the first step toward mitigation.7-9
1. Dataset Composition: Who Is Represented?
Most medical AI is trained on data from academic medical centers, which systematically underrepresent:
- Rural populations
- Uninsured patients
- Non-English speakers
- Specific racial and ethnic minorities
Result: Models perform worse on the populations they haven't seen during training. This data composition bias—arising from non-representative sampling during data collection—is one of the most fundamental sources of algorithmic inequity.7
Example: A diabetic retinopathy screening AI trained predominantly on images from white patients missed significant pathology in Black and Hispanic patients due to differences in fundus pigmentation and image acquisition parameters.
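A first, mechanical defense is to audit the training cohort's composition before any modeling begins. Below is a minimal sketch in Python, assuming a hypothetical demographic series and reference (catchment-area) proportions; the group labels, numbers, and the 50% underrepresentation tolerance are illustrative assumptions, not values from any real product:

```python
import pandas as pd

def audit_composition(train_groups: pd.Series, reference: dict, min_ratio: float = 0.5):
    """Flag groups whose share of the training cohort falls well below their
    share of a reference (e.g., catchment-area) population."""
    train_frac = train_groups.value_counts(normalize=True)
    for group, ref_frac in reference.items():
        frac = train_frac.get(group, 0.0)
        status = "UNDERREPRESENTED" if frac < min_ratio * ref_frac else "ok"
        print(f"{group:<10} train={frac:.1%}  reference={ref_frac:.1%}  {status}")

# Hypothetical cohort and catchment proportions, for illustration only.
cohort = pd.Series(["White"] * 800 + ["Black"] * 80 + ["Hispanic"] * 70 + ["Asian"] * 50)
catchment = {"White": 0.60, "Black": 0.20, "Hispanic": 0.12, "Asian": 0.08}
audit_composition(cohort, catchment)
```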
2. Labeling Bias: What Gets Called "Disease"?
Ground truth labels are often proxies for actual disease:
- Diagnosis codes: Reflect what was documented and billed, not necessarily what exists
- Lab thresholds: Reference ranges derived from non-representative populations
- Procedure use: "Who got an MRI" ≠ "who needed an MRI"
The Obermeyer Problem: A widely-used algorithm predicted healthcare needs based on cost—assuming sicker patients cost more. But Black patients receive less care (and thus cost less) for the same severity of illness. So the algorithm systematically classified Black patients as "less sick" than equally ill white patients.1
The model worked perfectly. The labels were biased. Careful label investigation and validation against clinical ground truth—rather than administrative proxies—is essential to avoid perpetuating these disparities.7
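One practical safeguard is to interrogate the proxy label itself: at the same level of clinically measured illness, does the label differ systematically by group? A minimal sketch of that check, with hypothetical column names (annual_cost, n_chronic_conditions, race) and made-up numbers chosen only to reproduce the Obermeyer pattern:

```python
import pandas as pd

def label_audit(df: pd.DataFrame, proxy="annual_cost",
                clinical="n_chronic_conditions", group="race"):
    """At the same level of clinically measured illness, does the proxy label
    (here, cost) differ systematically by group? If so, the label is biased."""
    table = df.groupby([group, clinical])[proxy].mean().unstack(group)
    print("Mean annual cost at each illness level, by group:")
    print(table.round(0).to_string())

# Made-up numbers reproducing the pattern: equal illness, lower recorded cost in one group.
df = pd.DataFrame({
    "race": ["White", "Black"] * 200,
    "n_chronic_conditions": [3, 3, 5, 5] * 100,
    "annual_cost": [9000, 6500, 15000, 11000] * 100,
})
label_audit(df)
```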
3. Proxy Variables: Race as a Shortcut
Some algorithms use race explicitly. Others use proxies:
- ZIP code (redlining proxy)
- Insurance status (socioeconomic proxy)
- Primary language (immigration status proxy)
- Hospital of admission (geographic segregation proxy)
Even when race isn't in the model, everything correlated with race can recreate bias.
The eGFR Controversy
For decades, kidney function equations included a "race correction factor"—giving Black patients a higher estimated GFR for the same creatinine level. The assumption: Black patients have more muscle mass.
Result: Black patients were systematically delayed in referral for transplant, dialysis, and nephrology care.
In 2021, medical societies finally removed race from eGFR calculations—but millions of patients were harmed in the interim.
Lesson: "Biologically justified" adjustments can still cause harm when they're based on flawed assumptions about race.
Real Clinical Examples: Bias at the Bedside
Race-Based Algorithms in Nephrology
Beyond eGFR, race appears in algorithms for:
- Kidney stone risk: Lower predicted risk for Black patients (despite similar disease prevalence)
- Preeclampsia screening: Different thresholds by race for proteinuria
- Urinary tract infection diagnosis: Race-adjusted prediction rules
The common thread: Using race as a biological variable when it's actually a social variable capturing disparities in care, environment, and discrimination.
Pulse Oximetry + AI: Compounding Error
Pulse oximeters are less accurate in patients with darker skin—overestimating oxygen saturation by 2-3% on average.
Now imagine training an AI on pulse oximetry data to predict:
- Who needs supplemental oxygen
- Who should be admitted to ICU
- Who meets criteria for ECMO
The AI learns from systematically inaccurate readings in Black patients. Result: It underestimates disease severity and under-triages care.
This isn't theoretical—it happened during COVID-19, when pulse oximetry-based AI triage systems contributed to delayed care for minority patients with hypoxemia.
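A back-of-the-envelope simulation makes the compounding concrete. The sketch below assumes a fixed escalation rule (measured SpO2 below 92%) and applies a hypothetical 2-3 point overestimate to one group's readings; all numbers are illustrative, not taken from any study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Identical underlying illness in two groups of patients (illustrative values).
true_spo2 = rng.normal(loc=91, scale=3, size=10_000)

# Group A: accurate readings. Group B: the device overestimates by 2-3 points.
measured_a = true_spo2
measured_b = true_spo2 + rng.uniform(2, 3, size=true_spo2.size)

THRESHOLD = 92  # escalate care when measured SpO2 falls below this value
print(f"Escalated, accurate readings:      {(measured_a < THRESHOLD).mean():.1%}")
print(f"Escalated, overestimated readings: {(measured_b < THRESHOLD).mean():.1%}")
# Same illness, far fewer escalations in the group with biased readings; any model
# trained on these measurements inherits the same gap.
```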
Under-Triage of Minority Patients
Triage algorithms predict who needs higher-acuity care. But if trained on historical patterns where:
- Minority patients are less likely to be admitted to ICU (due to bias or structural barriers)
- Pain is under-recognized and under-treated in Black patients
- Women's cardiac symptoms are more likely to be dismissed
Then the AI will replicate those patterns—flagging fewer minority and female patients as "high-risk" even when clinically equivalent to white male patients.
Why Clinicians Are Often the Last to Know
Bias Is Invisible at Point of Care
When you use an AI tool, you see:
- This patient's prediction
- This patient's recommendation
You don't see:
- How predictions vary across demographics
- False negative rates by race
- Whether the model underperforms for your population vs. the validation cohort
Bias is a population-level phenomenon. Individual physicians can't detect it from individual cases.
Lack of Transparency in Commercial Tools
Most hospital-purchased AI tools don't disclose:
- Training data demographics
- Subgroup performance metrics
- Disparate impact assessments
- Bias mitigation strategies employed
You're expected to trust that "the algorithm works"—but you have no way to verify fairness.
What Fairness Means Clinically
"Fairness" sounds simple. It's not. There are multiple, mathematically incompatible definitions—and the choice of performance metrics used to evaluate fairness fundamentally shapes what "success" looks like.9
Group Fairness (Demographic Parity)
Definition: The algorithm predicts the same rate of positive outcomes across groups.
Example: If 10% of white patients are flagged as "high-risk," then 10% of Black patients should also be flagged.
Problem: If disease prevalence differs across groups (due to social determinants, not biology), forcing equal rates means the algorithm will under-predict in the higher-prevalence group.
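Auditing demographic parity requires nothing more than the model's flags and a group label. A minimal sketch with hypothetical predictions and group assignments:

```python
import numpy as np
import pandas as pd

def flag_rates(y_pred: np.ndarray, group: np.ndarray) -> pd.Series:
    """Demographic parity check: fraction flagged 'high-risk' within each group."""
    return pd.Series(y_pred).groupby(pd.Series(group)).mean()

# Hypothetical predictions, for illustration only.
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0])
group = np.array(["White"] * 6 + ["Black"] * 6)
print(flag_rates(y_pred, group))   # unequal rates => demographic parity not met
```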
Individual Fairness (Equal Treatment)
Definition: Similar patients get similar predictions, regardless of group membership.
Example: Two patients with identical vitals, labs, and history get the same risk score—whether they're Black, white, male, female, etc.
Problem: "Similar" is subjective. If the definition of "similar" ignores structural inequities (e.g., access to care, neighborhood safety, environmental exposures), you're comparing apples to oranges.
Equalized Odds (Equal Accuracy)
Definition: The model has the same sensitivity and specificity across groups.
Example: Among patients who truly have disease, the model catches 85% regardless of race. Among patients who don't, it correctly rules out 90% regardless of race.
Problem: Achieving this mathematically often requires different decision thresholds for different groups—which raises its own ethical questions.
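Equalized odds can be audited the same way: compute sensitivity and specificity separately within each group and compare. A minimal sketch on synthetic data, in which the hypothetical model is deliberately built to miss more true cases in one group:

```python
import numpy as np
import pandas as pd

def equalized_odds_table(y_true, y_pred, group) -> pd.DataFrame:
    """Sensitivity (TPR) and specificity (TNR) computed separately by group."""
    df = pd.DataFrame({"y": y_true, "pred": y_pred, "g": group})
    rows = {}
    for g, sub in df.groupby("g"):
        sensitivity = sub.loc[sub["y"] == 1, "pred"].mean()
        specificity = 1 - sub.loc[sub["y"] == 0, "pred"].mean()
        rows[g] = {"sensitivity": sensitivity, "specificity": specificity}
    return pd.DataFrame(rows).T.round(2)

# Synthetic data: the simulated model catches 85% of true cases in group A, 70% in group B.
rng = np.random.default_rng(1)
n = 2000
group = np.where(rng.random(n) < 0.5, "A", "B")
y_true = rng.integers(0, 2, n)
p_detect = np.where(group == "A", 0.85, 0.70)
y_pred = np.where(y_true == 1, rng.random(n) < p_detect, rng.random(n) < 0.10).astype(int)

print(equalized_odds_table(y_true, y_pred, group))
```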
The Fairness Impossibility Theorem
You cannot simultaneously achieve all types of fairness. Optimizing for one often worsens others.
This means there is no "unbiased" algorithm in a mathematical sense. There are only different tradeoffs.
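A small worked example shows why. Using the identity flag rate = TPR × prevalence + FPR × (1 − prevalence): if error rates are identical across groups (so equalized odds holds) but prevalence differs, the flag rates must differ, and demographic parity fails. The numbers below are assumed purely for illustration.

```python
# Flag rate = TPR * prevalence + FPR * (1 - prevalence)
tpr, fpr = 0.85, 0.10                              # same error rates in both groups
prevalence = {"group_A": 0.05, "group_B": 0.15}    # assumed, illustrative prevalences

for g, p in prevalence.items():
    flag_rate = tpr * p + fpr * (1 - p)
    print(f"{g}: prevalence={p:.0%}, flag rate={flag_rate:.1%}")
# Different flag rates (about 14% vs. 21%) despite identical TPR/FPR,
# so demographic parity is violated by construction.
```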
The question isn't "Is this fair?" It's "Which type of fairness matters most for this clinical use case—and who bears the cost of our choice?"
Tradeoffs Between Accuracy and Equity
Sometimes, optimizing for fairness reduces overall accuracy.
Example: A sepsis predictor is 90% accurate overall but only 82% accurate for Hispanic patients (due to sparse training data). You can:
- Option A: Use the model as-is (90% overall, disparate performance)
- Option B: Retrain to equalize performance across groups (maybe 87% overall, 87% for all groups; one reweighting approach is sketched at the end of this section)
- Option C: Don't deploy—collect better data first
There's no obviously "right" answer. It depends on your values:
- Maximize benefit for the majority? (Option A)
- Equalize outcomes across groups, even if average drops? (Option B)
- Wait until you can do both? (Option C—but patients suffer in the interim)
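For teams leaning toward Option B, one common mitigation is to reweight training samples so that small groups contribute as much to the loss as large ones. A minimal sketch using scikit-learn on synthetic data; the features, labels, and group split are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def group_balanced_weights(group: np.ndarray) -> np.ndarray:
    """Weight each sample inversely to its group's frequency, so that small
    groups contribute as much to the training loss as large ones."""
    values, counts = np.unique(group, return_counts=True)
    freq = dict(zip(values, counts / counts.sum()))
    return np.array([1.0 / freq[g] for g in group])

# Hypothetical training data: an 85/15 group split, five generic features.
rng = np.random.default_rng(2)
n = 1000
group = np.where(rng.random(n) < 0.85, "majority", "minority")
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=n) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y, sample_weight=group_balanced_weights(group))
```

Reweighting is only one lever, and it cannot substitute for collecting more representative data, which is what Option C buys.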
Why "Removing Race" Isn't Enough
The instinctive response to race-based bias: Just don't use race as a variable.
Problem: Race is correlated with everything in healthcare data.
- ZIP code
- Hospital of admission
- Insurance type
- Comorbidity burden (driven by access, not biology)
- Lab values (affected by different baseline health states)
Remove race from the model, and the algorithm will reconstruct it from these proxies. This is called redundant encoding—when group membership can be inferred from other variables. Model selection, architecture choices, and training techniques all influence how these patterns get encoded.8
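You can test for redundant encoding directly: try to predict the protected attribute from the model's own input features. If a simple classifier recovers it well above chance, the information is still in the data. A minimal sketch; the feature names and synthetic correlations are assumptions chosen to make the effect visible:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def redundant_encoding_check(X: pd.DataFrame, protected: np.ndarray) -> float:
    """Cross-validated AUC for predicting the protected attribute from the
    model's own features. Values well above 0.5 indicate redundant encoding."""
    return cross_val_score(GradientBoostingClassifier(), X, protected,
                           cv=5, scoring="roc_auc").mean()

# Synthetic illustration: proxy features correlated with the protected attribute.
rng = np.random.default_rng(3)
n = 2000
protected = rng.integers(0, 2, n)
X = pd.DataFrame({
    "zip_income_decile": protected * 3 + rng.integers(1, 5, n),
    "medicaid": (protected & (rng.random(n) < 0.8)).astype(int),
    "prior_no_shows": rng.poisson(1.0, n),
})
print(f"AUC for recovering the protected attribute: {redundant_encoding_check(X, protected):.2f}")
```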
Case Study: Predicting No-Shows
A hospital builds an AI to predict which patients will miss appointments (to overbook strategically).
They don't use race. But they use:
- Missed appointments in the past (correlated with transportation barriers, work schedule inflexibility—both racialized)
- Insurance type (Medicaid heavily racialized due to historical policy)
- Neighborhood (segregation proxy)
Result: The model flags Black and Hispanic patients as "high no-show risk" at higher rates. Clinic staff assume they're unreliable. They get less flexible scheduling, more restrictive policies, worse rapport.
The algorithm never saw race. But it learned racism anyway.
Practical Guidance for Physicians
Questions to Ask Vendors
Before your hospital adopts an AI tool, demand answers:
- What was the demographic composition of the training data? If they can't tell you, that's a red flag.7
- What is the model's performance across racial/ethnic subgroups? Overall accuracy means nothing if it's 95% for white patients and 75% for Black patients. Demand stratified metrics—sensitivity, specificity, PPV, NPV—by demographic group.9
- Were disparate impact assessments performed? Did you test whether the model systematically over- or under-predicts for specific groups?
- Does the model use race as a variable? If yes, why? What's the biological justification (vs. social determinant)?2
- How do you handle missing data? Minority patients often have sparser EHR data. How does the model account for this?7
- Was external validation performed on diverse populations? Internal testing on the same institution's data isn't enough. Models must be tested on external datasets representing different demographics and practice settings.9
- What's your plan for ongoing bias monitoring? Performance should be tracked by subgroup post-deployment, with clear thresholds for intervention when disparities emerge (a simple gap check is sketched after this list).
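Once you have per-group numbers, whether vendor-reported or locally validated, a simple disparity-gap check can gate both procurement and ongoing monitoring. A minimal sketch with hypothetical reported metrics and an assumed five-point tolerance:

```python
import pandas as pd

def disparity_gaps(per_group: pd.DataFrame, max_gap: float = 0.05) -> pd.DataFrame:
    """Given per-group metrics (rows = groups, columns = metrics), report the
    worst-case gap for each metric and whether it exceeds a tolerance."""
    gaps = per_group.max() - per_group.min()
    return pd.DataFrame({"gap": gaps.round(3), "exceeds_tolerance": gaps > max_gap})

# Hypothetical vendor-reported numbers, for illustration only.
reported = pd.DataFrame(
    {"sensitivity": [0.91, 0.78], "specificity": [0.88, 0.86], "ppv": [0.62, 0.48]},
    index=["White", "Black"],
)
print(disparity_gaps(reported))
```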
Warning Signs of Biased Systems
Red flags that an algorithm may be biased:
- Validation study doesn't report subgroup performance
- Training data is from a single health system or region
- Model uses "cost" or "utilization" as a proxy for illness severity
- Vendor can't explain what features drive predictions
- No plan for monitoring fairness post-deployment
If vendors dismiss fairness concerns as "not relevant to our model," walk away.
The Clinician's Ethical Role
You are the final common pathway for AI in medicine. Even if the algorithm is biased, you can mitigate harm by:
- Questioning predictions that don't match clinical judgment — especially for patients from underrepresented groups
- Documenting when you override the AI — this creates the data needed to detect bias
- Advocating for better tools — demand fairness assessments before procurement
- Reporting disparate outcomes — if minority patients seem to have worse outcomes with AI-assisted care, raise it
The algorithm can't be racist. But the system can be. And you're part of the system.
Moving Forward: What Responsible AI Looks Like
Bias in medical AI isn't a technical problem with a technical solution. It's a justice problem that requires ongoing vigilance, transparency, and accountability.
We can't eliminate bias entirely—it's baked into the data we learn from. But we can acknowledge it, measure it, and mitigate it.
That means:
- Demanding transparency from vendors
- Testing algorithms on diverse populations before deployment
- Monitoring performance by subgroup after deployment
- Being willing to retire tools that exacerbate disparities
- Centering equity in AI governance, not treating it as an afterthought
The algorithm that's 92% accurate for white patients and 78% for Black patients?
Don't adopt it. Not until it works equally well for everyone.
Because the cost of that 14-point gap isn't borne by the algorithm. It's borne by the patients we're supposed to serve.
Key Takeaways
- Bias Reflects Data, Not Intent: AI learns from historical healthcare data that encodes systemic inequities—access barriers, diagnostic bias, treatment disparities.
- Labels Are Often Proxies: "Ground truth" based on who got care (not who needed it) perpetuates disparities. Cost ≠ illness severity.
- Removing Race Isn't Enough: Algorithms reconstruct race from proxies (ZIP code, insurance, hospital). Redundant encoding is pervasive.
- Multiple Definitions of Fairness: Group fairness, individual fairness, and equalized odds are mathematically incompatible. Tradeoffs are inevitable.
- Demand Transparency: Ask vendors: What's the training data demographics? Subgroup performance? Disparate impact assessment? Ongoing monitoring plan?
- Physicians Are Gatekeepers: You can mitigate bias by questioning predictions, documenting overrides, and advocating for equitable tools.
References & Further Reading
- Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342
- Vyas DA, Eisenstein LG, Jones DS. Hidden in Plain Sight — Reconsidering the Use of Race Correction in Clinical Algorithms. N Engl J Med. 2020;383(9):874-882. doi:10.1056/NEJMms2004740
- Sjoding MW, Dickson RP, Iwashyna TJ, Gay SE, Valley TS. Racial Bias in Pulse Oximetry Measurement. N Engl J Med. 2020;383(25):2477-2478. doi:10.1056/NEJMc2029240
- FDA. Artificial Intelligence and Machine Learning in Software as a Medical Device. Updated 2023. FDA.gov
- World Health Organization. Ethics and Governance of Artificial Intelligence for Health. WHO Guidance. 2021. WHO.int
- Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data. JAMA Intern Med. 2018;178(11):1544-1547.
- Rouzrokh P, Wyles CC, Philbrick KA, Ramazanian T, Weston AD, Cai JC, Taunton MJ, Kremers WK, Lewallen DG, Erickson BJ. Part 1: Mitigating Bias in Machine Learning—Data Handling. J Arthroplasty. 2022;37(6S):S406-S413. doi:10.1016/j.arth.2022.02.092
- Zhang Y, Wyles CC, Makhni MC, Maradit Kremers H, Sellon JL, Erickson BJ. Part 2: Mitigating Bias in Machine Learning—Model Development. J Arthroplasty. 2022;37(6S):S414-S420. doi:10.1016/j.arth.2022.02.085
- Faghani S, Khosravi B, Moassefi M, Rouzrokh P, Erickson BJ. Part 3: Mitigating Bias in Machine Learning—Performance Metrics, Healthcare Applications, and Fairness in Machine Learning. J Arthroplasty. 2022;37(6S):S421-S428. doi:10.1016/j.arth.2022.02.087
- Erickson BJ, Kitamura F. Artificial Intelligence in Radiology: a Primer for Radiologists. Radiol Clin North Am. 2021;59(6):991-1003. doi:10.1016/j.rcl.2021.07.004