AI Bias & Health Equity
Understanding and Mitigating Algorithmic Bias in Medical AI
⚠️ Critical Reality:
AI systems trained on biased data can perpetuate and amplify healthcare disparities. A 2019 study
(Obermeyer et al., Science) found that a widely used commercial algorithm systematically
under-identified Black patients for extra care because it used healthcare costs as a proxy for
health needs. This document covers how to identify, measure, and mitigate bias in medical AI systems.
1. Sources of Bias in Medical AI
📊 Landmark Case: Obermeyer et al. (2019), Science:
The Algorithm: Used by 200+ hospitals. The study estimates that algorithms of this kind are
applied to roughly 200 million people in the United States each year.
The Bias: The algorithm used healthcare costs to predict health needs. Black patients
had lower costs at the same illness severity because of systemic barriers to care access.
The Impact: Black patients needed to be significantly sicker than White patients
to qualify for high-risk care management programs.
The Fix: Replace the cost proxy with direct health measures (comorbidities, lab
values). The authors estimated that removing the disparity would raise the share of Black
patients among those automatically flagged for extra help from 17.7% to 46.5%.
Common Bias Sources:
| Bias Type | Source | Example | Mitigation |
| --- | --- | --- | --- |
| Historical Bias | Past discrimination encoded in data | Less cardiac care for women → model learns women need less care | Audit outcomes, adjust for known disparities |
| Representation Bias | Underrepresented populations in training data | Skin cancer datasets 90%+ light-skinned patients | Oversample, synthetic data, federated learning |
| Measurement Bias | Proxy variables that track the true target differently across groups | Using cost as proxy for health needs (Obermeyer case) | Use direct measures, not proxies |
| Aggregation Bias | Models that don't account for subgroup differences | HbA1c levels relate to blood glucose differently across racial/ethnic groups | Stratified models, interaction terms |
| Evaluation Bias | Testing on non-representative populations | FDA clearance on predominantly White cohorts | Require diverse validation sets |
2. Fairness Metrics
You can't fix what you don't measure. These metrics quantify different aspects of algorithmic
fairness. No single metric captures all concerns—use multiple.
Key Fairness Metrics:
| Metric | Definition | Formula | When to Use |
| --- | --- | --- | --- |
| Demographic Parity | Equal positive rates across groups | P(Ŷ=1 ∣ A=0) = P(Ŷ=1 ∣ A=1) | Resource allocation, screening programs |
| Equalized Odds | Equal TPR and FPR across groups | P(Ŷ=1 ∣ Y, A=0) = P(Ŷ=1 ∣ Y, A=1) for Y∈{0,1} | Diagnostic tests, risk prediction |
| Predictive Parity | Equal PPV across groups | P(Y=1 ∣ Ŷ=1, A=0) = P(Y=1 ∣ Ŷ=1, A=1) | When false positives are costly |
| Calibration | Predicted probabilities match actual rates | P(Y=1 ∣ Ŷ=p, A=0) = P(Y=1 ∣ Ŷ=p, A=1) = p | Risk scores, probability estimates |
| Individual Fairness | Similar individuals get similar predictions | d(i,j) small → ∣Ŷ_i - Ŷ_j∣ small | Personalized treatment recommendations |
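The group-level metrics above reduce to confusion-matrix counts computed separately for each group. The sketch below shows one way to do that with NumPy. The function name `group_rates` and the toy arrays are illustrative assumptions, not part of any library or dataset.

```python
import numpy as np

def _safe_div(num, den):
    # Return NaN instead of raising when a group has no cases in the denominator.
    return num / den if den > 0 else float("nan")

def group_rates(y_true, y_pred, group):
    """Per-group selection rate, TPR, FPR, and PPV for binary labels and predictions."""
    y_true, y_pred, group = (np.asarray(a) for a in (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        mask = group == g
        yt, yp = y_true[mask], y_pred[mask]
        tp = int(np.sum((yt == 1) & (yp == 1)))
        fp = int(np.sum((yt == 0) & (yp == 1)))
        fn = int(np.sum((yt == 1) & (yp == 0)))
        tn = int(np.sum((yt == 0) & (yp == 0)))
        rates[g.item()] = {
            "selection_rate": _safe_div(tp + fp, tp + fp + fn + tn),  # demographic parity
            "tpr": _safe_div(tp, tp + fn),                            # equalized odds (part 1)
            "fpr": _safe_div(fp, fp + tn),                            # equalized odds (part 2)
            "ppv": _safe_div(tp, tp + fp),                            # predictive parity
        }
    return rates

# Toy data, made up for illustration: two groups of four patients each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
rates = group_rates(y_true, y_pred, group)
```

In this toy data both groups have a selection rate of 0.5, so demographic parity holds, yet TPR and FPR differ between the groups, so equalized odds does not. That is the practical reason to report several metrics side by side.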
⚠️ Impossibility Theorem:
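Kleinberg, Mullainathan & Raghavan (2016) proved that when base rates differ between groups
and predictions are imperfect, a risk score cannot be both calibrated within each group and
balanced in its error rates across groups (the condition behind equalized odds). Chouldechova
(2017) showed the same conflict between predictive parity and equal false positive and false
negative rates. Demographic parity conflicts with equalized odds under the same conditions,
because a classifier with equal TPR and FPR across groups selects more patients from the
higher-prevalence group. You must choose which fairness criterion matters most for your use case.
This is a value judgment, not a purely technical decision.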
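One way to see the tension numerically is a confusion-matrix identity (the form used by Chouldechova, 2017, a related result): once prevalence, TPR, and PPV are fixed, the FPR is determined. The sketch below is a small illustration; the function name `implied_fpr` and the prevalence values are assumptions chosen for the example.

```python
def implied_fpr(prevalence, tpr, ppv):
    """FPR forced by a given prevalence, TPR, and PPV.

    Derivation: TP = prevalence * TPR, FP = TP * (1 - PPV) / PPV,
    and FPR = FP / (1 - prevalence).
    """
    return prevalence / (1 - prevalence) * (1 - ppv) / ppv * tpr

# Two hypothetical groups with the same TPR and PPV but different base rates.
fpr_a = implied_fpr(prevalence=0.30, tpr=0.80, ppv=0.60)  # about 0.229
fpr_b = implied_fpr(prevalence=0.10, tpr=0.80, ppv=0.60)  # about 0.059
```

Holding TPR and PPV equal across the two groups forces their FPRs apart, so equal PPV and equal error rates cannot both hold unless the base rates match or the classifier is perfect.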
3. Bias Mitigation Strategies
🛠️ Pre-Processing (Before Training):
• Reweighting: Assign higher weights to underrepresented groups during
training.
• Resampling: Oversample minority groups or undersample majority groups.
• Synthetic Data: Generate synthetic samples for underrepresented populations
(GANs, SMOTE).
• Feature Removal: Remove proxy variables that correlate with protected
attributes (but may reduce accuracy).
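As a minimal sketch of reweighting, the function below assigns each sample a weight inversely proportional to its group's frequency, so every group contributes equally to the total weight. The name `inverse_frequency_weights` and the group labels are hypothetical. Other schemes, such as weighting by group and outcome jointly, are also common.

```python
import numpy as np

def inverse_frequency_weights(group):
    """Weight each sample by N / (K * n_g): N samples, K groups, n_g samples in its group."""
    group = np.asarray(group)
    values, counts = np.unique(group, return_counts=True)
    per_group = len(group) / (len(values) * counts)
    lookup = dict(zip(values.tolist(), per_group.tolist()))
    return np.array([lookup[g] for g in group.tolist()])

# Illustrative data: group "a" has 8 samples, group "b" has 2.
group = ["a"] * 8 + ["b"] * 2
weights = inverse_frequency_weights(group)
```

Many training APIs accept per-sample weights directly. For example, scikit-learn estimators typically take them through the `sample_weight` argument of `fit`.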
🛠️ In-Processing (During Training):
• Adversarial Debiasing: Train adversary to predict protected attribute from
model outputs; main model tries to fool adversary.
• Fairness Constraints: Add fairness metrics as regularization terms in loss
function.
• Group-Specific Models: Train separate models for different demographic
groups (controversial, may violate anti-discrimination law).
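The fairness-constraint idea can be sketched as a loss function with an added penalty. The example below assumes a logistic model, a binary protected attribute coded 0/1, and a demographic-parity penalty on the gap in mean predicted probability. The function name `penalized_loss`, the weight `lam`, and the toy data are all illustrative assumptions.

```python
import numpy as np

def penalized_loss(w, X, y, group, lam=1.0):
    """Binary cross-entropy plus lam times the squared demographic-parity gap."""
    X, y, group = np.asarray(X, dtype=float), np.asarray(y), np.asarray(group)
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities
    eps = 1e-12                          # guards against log(0)
    bce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    gap = p[group == 0].mean() - p[group == 1].mean()
    return bce + lam * gap ** 2

# Toy data: the second feature is identical to the group indicator.
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
y = np.array([0, 1, 0, 1])
group = np.array([0, 0, 1, 1])
w = np.array([0.0, 2.0])  # weights that lean on the group-correlated feature

loss_plain = penalized_loss(w, X, y, group, lam=0.0)
loss_penalized = penalized_loss(w, X, y, group, lam=1.0)
```

Because `w` relies on a feature that tracks group membership, the penalized loss is higher than the plain loss. Minimizing it with a general-purpose optimizer pushes the model away from such weights, and `lam` controls how much accuracy is traded for parity.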
🛠️ Post-Processing (After Training):
• Threshold Adjustment: Use different decision thresholds for different groups
to equalize error rates.
• Calibration: Adjust predicted probabilities to match observed outcomes by
group.
• Reject Option: For borderline cases, defer to human decision-maker aware of
potential bias.
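Threshold adjustment can be sketched as choosing, for each group, the highest cutoff that still reaches a target sensitivity. The function name `thresholds_for_target_tpr`, the target of 0.8, and the scores below are assumptions for illustration. A real deployment would fit thresholds on held-out data and check the resulting false positive rates as well.

```python
import numpy as np

def thresholds_for_target_tpr(scores, y_true, group, target_tpr=0.8):
    """Per-group cutoff so that at least target_tpr of true positives score at or above it."""
    scores, y_true, group = (np.asarray(a) for a in (scores, y_true, group))
    thresholds = {}
    for g in np.unique(group):
        pos = np.sort(scores[(group == g) & (y_true == 1)])
        if len(pos) == 0:
            raise ValueError(f"group {g!r} has no positive cases")
        # Number of positives that must be flagged (small tolerance for float error).
        k = max(1, int(np.ceil(target_tpr * len(pos) - 1e-9)))
        thresholds[g.item()] = float(pos[len(pos) - k])
    return thresholds

# Illustrative scores: group "b" positives score lower than group "a" positives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
y_true = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]
group = ["a"] * 6 + ["b"] * 6
thresholds = thresholds_for_target_tpr(scores, y_true, group)
```

With a single shared cutoff of 0.6, group "a" would reach a TPR of 0.8 while group "b" would reach only 0.2. The per-group cutoffs bring both to 0.8.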
4. Regulatory Requirements
📋 FDA Guidance (2021-2024):
AI/ML-Based SaMD Action Plan (January 2021): Commits FDA to supporting methods for
identifying and eliminating algorithmic bias. FDA now expects bias assessment in 510(k)
submissions for AI/ML devices.
Required Elements:
• Demographic breakdown of training and validation datasets
• Performance metrics stratified by race, ethnicity, sex, age
• Discussion of known limitations and intended use populations
• Post-market monitoring plan for real-world performance across groups
2024 Update: In June 2024 FDA issued draft guidance on Diversity Action Plans, which the
Food and Drug Omnibus Reform Act of 2022 requires for certain clinical studies of drugs and
devices.
📋 ONC Health IT Certification (2024):
HTI-1 Rule: Requires transparency for predictive decision support
interventions:
• Source of training data (including demographic composition)
• Performance characteristics across subpopulations
• Known limitations and risks
• Intended use and contraindications
This information must be available to clinicians at the point of care—not
buried in documentation.
⚖️ Legal Liability:
Healthcare organizations can be held liable for discriminatory outcomes from AI systems under
Section 1557 of the Affordable Care Act (prohibits discrimination in
healthcare) and ADA Title III (requires accessible medical services).
Ignorance of bias is not a legal defense.
5. Best Practices Checklist
Pre-Deployment Bias Audit:
| Checkpoint | Question | Acceptable State |
| --- | --- | --- |
| Dataset Composition | Does training data reflect target population demographics? | Within 10% of local/regional demographics for race, sex, age |
| Performance Stratification | Are accuracy, sensitivity, specificity reported by subgroup? | All major subgroups analyzed, no >10% performance gaps |
| Proxy Variables | Are any features proxies for protected attributes? | Documented, justified, or removed |
| Fairness Metrics | Which fairness criteria were optimized? | Explicit choice documented, trade-offs acknowledged |
| External Validation | Tested on population different from training? | Yes, performance degradation <15% |
| Monitoring Plan | How will bias be detected post-deployment? | Automated alerts for performance drift by subgroup |
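The performance-stratification and monitoring checkpoints can be automated with a simple gap check. The sketch below assumes the 10% limit means an absolute gap of 0.10 between the best and worst subgroup, which is one reading of the checklist. The function name `audit_subgroup_gaps` and the metric values are hypothetical.

```python
def audit_subgroup_gaps(metrics_by_group, max_gap=0.10):
    """Flag each metric whose best-to-worst subgroup gap exceeds max_gap (absolute)."""
    findings = {}
    metric_names = next(iter(metrics_by_group.values())).keys()
    for name in metric_names:
        values = {g: m[name] for g, m in metrics_by_group.items()}
        gap = max(values.values()) - min(values.values())
        if gap > max_gap:
            findings[name] = {
                "gap": round(gap, 4),
                "worst_group": min(values, key=values.get),
            }
    return findings

# Made-up subgroup metrics for illustration only.
metrics_by_group = {
    "A": {"sensitivity": 0.90, "specificity": 0.85},
    "B": {"sensitivity": 0.72, "specificity": 0.83},
}
findings = audit_subgroup_gaps(metrics_by_group)
```

Here sensitivity is flagged because the gap between groups is 0.18, while specificity passes with a gap of 0.02. Running the same check on a schedule against recent production data is one way to implement subgroup drift alerts.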
Key Takeaways:
- Bias in medical AI is documented, measurable, and harmful—ignoring it is not an option
- No single fairness metric is sufficient—use multiple and document trade-offs
- FDA and ONC now require bias assessment and transparency for AI/ML medical devices
- Legal liability exists under ACA Section 1557 and ADA Title III
- Mitigation is possible at every stage: pre-processing, in-processing, post-processing
- Continuous monitoring post-deployment is essential—bias can emerge or worsen over time