
AI Bias & Health Equity

Understanding and Mitigating Algorithmic Bias in Medical AI

⚠️ Critical Reality: AI systems trained on biased data can perpetuate and amplify healthcare disparities. A 2019 study (Obermeyer et al., Science) found that a widely used commercial algorithm systematically under-identified Black patients for extra care because it used healthcare costs as a proxy for health needs. This document covers how to identify, measure, and mitigate bias in medical AI systems.

1. Sources of Bias in Medical AI

📊 Landmark Case: Obermeyer et al. (2019), Science:
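The Algorithm: A widely used commercial risk-prediction tool. The study notes that algorithms of this type are applied to roughly 200 million people in the United States each year.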
The Bias: Algorithm used healthcare costs to predict health needs. Black patients had lower costs at same illness severity due to systemic barriers to care access.
The Impact: Black patients needed to be significantly sicker than White patients to qualify for high-risk care management programs.
The Fix: Replace the cost proxy with direct health measures (comorbidities, lab values). The authors estimated that correcting the bias would raise the share of Black patients among those flagged for extra care from 17.7% to 46.5%.
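
The effect of the label choice can be illustrated with a toy example. This is a minimal sketch on invented data, not the study's method: the patient rows, the `top_k` helper, and the use of a chronic-condition count as the "direct health measure" are all assumptions made for illustration.

```python
# Hypothetical toy data: (patient_id, group, chronic_conditions, annual_cost_usd).
# Black patients here have lower costs at the same illness burden,
# mirroring the pattern described above.
patients = [
    ("p1", "White", 5, 12000),
    ("p2", "Black", 5, 8000),
    ("p3", "White", 3, 9000),
    ("p4", "Black", 4, 7000),
    ("p5", "White", 2, 6000),
    ("p6", "Black", 3, 5000),
]

def top_k(rows, key_index, k):
    """Return the ids of the k patients ranked highest on one column."""
    ranked = sorted(rows, key=lambda r: r[key_index], reverse=True)
    return [r[0] for r in ranked[:k]]

def black_share(ids):
    """Fraction of the selected patients who are Black."""
    group_of = {r[0]: r[1] for r in patients}
    return sum(group_of[i] == "Black" for i in ids) / len(ids)

by_cost = top_k(patients, key_index=3, k=3)   # rank on the cost proxy
by_need = top_k(patients, key_index=2, k=3)   # rank on illness burden

print(by_cost, black_share(by_cost))
print(by_need, black_share(by_need))
```

With this invented data, ranking on cost selects one Black patient out of three, while ranking on illness burden selects two out of three, even though the patients are identical in both runs.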

Common Bias Sources:

| Bias Type | Source | Example | Mitigation |
|---|---|---|---|
| Historical Bias | Past discrimination encoded in data | Less cardiac care for women → model learns women need less care | Audit outcomes, adjust for known disparities |
| Representation Bias | Underrepresented populations in training data | Skin cancer datasets with 90%+ light-skinned patients | Oversample, synthetic data, federated learning |
| Measurement Bias | Proxy variables that correlate with protected attributes | Using cost as a proxy for health needs (Obermeyer case) | Use direct measures, not proxies |
| Aggregation Bias | Models that don't account for subgroup differences | HbA1c levels differ across racial/ethnic groups, so one pooled model fits some groups poorly | Stratified models, interaction terms |
| Evaluation Bias | Testing on non-representative populations | FDA clearance based on predominantly White cohorts | Require diverse validation sets |

2. Fairness Metrics

You can't fix what you don't measure. These metrics quantify different aspects of algorithmic fairness. No single metric captures all concerns—use multiple.

Key Fairness Metrics:

| Metric | Definition | Formula | When to Use |
|---|---|---|---|
| Demographic Parity | Equal positive rates across groups | P(Ŷ=1 ∣ A=0) = P(Ŷ=1 ∣ A=1) | Resource allocation, screening programs |
| Equalized Odds | Equal TPR and FPR across groups | P(Ŷ=1 ∣ Y=y, A=0) = P(Ŷ=1 ∣ Y=y, A=1) for y ∈ {0,1} | Diagnostic tests, risk prediction |
| Predictive Parity | Equal PPV across groups | P(Y=1 ∣ Ŷ=1, A=0) = P(Y=1 ∣ Ŷ=1, A=1) | When false positives are costly |
| Calibration | Predicted probabilities match actual rates in every group | P(Y=1 ∣ S=p, A=a) = p for each group a, where S is the predicted probability | Risk scores, probability estimates |
| Individual Fairness | Similar individuals get similar predictions | d(i,j) small → ∣Ŷ_i − Ŷ_j∣ small | Personalized treatment recommendations |

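
The group-level metrics in the table can be computed from a confusion matrix per group. The following is a minimal sketch using only the standard library; the function name `group_rates` and the example arrays are hypothetical, and a production audit would also need confidence intervals and minimum sample sizes.

```python
from collections import defaultdict

def _ratio(num, den):
    """Safe division; NaN when a group has no cases in the denominator."""
    return num / den if den else float("nan")

def group_rates(y_true, y_pred, groups):
    """Per-group positive rate, TPR, FPR and PPV for binary labels/predictions."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for y, yhat, g in zip(y_true, y_pred, groups):
        if yhat == 1:
            key = "tp" if y == 1 else "fp"
        else:
            key = "fn" if y == 1 else "tn"
        counts[g][key] += 1

    report = {}
    for g, c in counts.items():
        n = sum(c.values())
        report[g] = {
            "positive_rate": _ratio(c["tp"] + c["fp"], n),   # demographic parity
            "tpr": _ratio(c["tp"], c["tp"] + c["fn"]),       # equalized odds (1/2)
            "fpr": _ratio(c["fp"], c["fp"] + c["tn"]),       # equalized odds (2/2)
            "ppv": _ratio(c["tp"], c["tp"] + c["fp"]),       # predictive parity
        }
    return report

# Invented example: two groups of four patients each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

report = group_rates(y_true, y_pred, groups)
for g, metrics in report.items():
    print(g, metrics)
```

Comparing the rows of `report` side by side shows which criterion is violated: a gap in `positive_rate` breaks demographic parity, a gap in `tpr` or `fpr` breaks equalized odds, and a gap in `ppv` breaks predictive parity.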
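⚠️ Impossibility Theorem:
Kleinberg, Mullainathan & Raghavan (2016) proved that when base rates differ between groups and prediction is imperfect, a risk score cannot be calibrated within each group while also balancing scores for the positive and negative classes across groups (a condition closely related to equalized odds). Related results show that predictive parity and equal error rates conflict under the same conditions. You must choose which fairness criterion matters most for your use case. This is a value judgment, not a technical decision.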
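
A closely related tension, between equalized odds and predictive parity, follows directly from Bayes' rule. The derivation below is an illustrative sketch rather than the theorem's proof; the symbols $p_a$, $t$, and $f$ and the numbers in the worked example are assumptions chosen for illustration.

```latex
% Base rate in group a: p_a = P(Y=1 \mid A=a)
\mathrm{PPV}_a
  = P(Y=1 \mid \hat{Y}=1, A=a)
  = \frac{\mathrm{TPR}_a \, p_a}{\mathrm{TPR}_a \, p_a + \mathrm{FPR}_a \,(1 - p_a)}

% Suppose equalized odds holds: TPR_0 = TPR_1 = t and FPR_0 = FPR_1 = f, with t > 0 and f > 0.
% Then PPV_a is strictly increasing in p_a, so PPV_0 = PPV_1 only if p_0 = p_1.

% Worked example with t = 0.8, f = 0.2, p_0 = 0.1, p_1 = 0.3:
\mathrm{PPV}_0 = \frac{0.8 \cdot 0.1}{0.8 \cdot 0.1 + 0.2 \cdot 0.9} = \frac{0.08}{0.26} \approx 0.31
\qquad
\mathrm{PPV}_1 = \frac{0.8 \cdot 0.3}{0.8 \cdot 0.3 + 0.2 \cdot 0.7} = \frac{0.24}{0.38} \approx 0.63
```

In this example the classifier has identical error rates in both groups, yet a positive prediction is about twice as likely to be correct in the group with the higher base rate.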

3. Bias Mitigation Strategies

🛠️ Pre-Processing (Before Training):
• Reweighting: Assign higher weights to underrepresented groups during training.
• Resampling: Oversample minority groups or undersample majority groups.
• Synthetic Data: Generate synthetic samples for underrepresented populations (GANs, SMOTE).
• Feature Removal: Remove proxy variables that correlate with protected attributes (but may reduce accuracy).
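
Reweighting can be sketched as inverse-frequency weights, so that every group contributes the same total weight to the training loss. This is one simple scheme among several, shown as an assumption rather than a recommended method; the function name `group_balanced_weights` and the group labels are hypothetical.

```python
from collections import Counter

def group_balanced_weights(groups):
    """Per-sample weights n / (k * n_g): each of the k groups gets equal total weight."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    per_group = {g: n / (k * c) for g, c in counts.items()}
    return [per_group[g] for g in groups]

# Invented example: group B makes up only 20% of the training set.
groups = ["A"] * 8 + ["B"] * 2
weights = group_balanced_weights(groups)

print(weights[0], weights[-1])   # weight for an A sample, weight for a B sample
print(sum(weights))              # total weight equals the number of samples
```

Many training APIs accept per-sample weights of this kind. Note that upweighting a very small group also amplifies any label noise in that group, so the weighted model should still be evaluated per subgroup.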
🛠️ In-Processing (During Training):
• Adversarial Debiasing: Train adversary to predict protected attribute from model outputs; main model tries to fool adversary.
• Fairness Constraints: Add fairness metrics as regularization terms in loss function.
• Group-Specific Models: Train separate models for different demographic groups (controversial, may violate anti-discrimination law).
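
A fairness constraint used as a regularization term can be sketched as an ordinary loss plus a penalty on the gap between groups. The example below assumes a demographic-parity-style penalty (squared difference in mean predicted score) added to binary cross-entropy; the function name `penalized_loss`, the weight `lam`, and the data are hypothetical, and real training would compute this inside an autodiff framework so it can be minimized by gradient descent.

```python
import math

def penalized_loss(y_true, scores, groups, lam=1.0):
    """Binary cross-entropy plus lam * (largest gap in mean score between groups)^2."""
    eps = 1e-12  # avoids log(0)
    bce = -sum(
        y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps)
        for y, s in zip(y_true, scores)
    ) / len(y_true)

    # Mean predicted score per group.
    by_group = {}
    for s, g in zip(scores, groups):
        by_group.setdefault(g, []).append(s)
    means = [sum(v) / len(v) for v in by_group.values()]

    gap = max(means) - min(means)
    return bce + lam * gap ** 2, bce, gap

# Invented example: the model scores group A much higher than group B.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.2]
groups = ["A", "A", "B", "B"]

total, bce, gap = penalized_loss(y_true, scores, groups, lam=1.0)
print(total, bce, gap)
```

Raising `lam` trades predictive fit for a smaller between-group gap. Which gap to penalize (mean score, TPR, FPR) is the same value judgment described in the fairness metrics section.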
🛠️ Post-Processing (After Training):
• Threshold Adjustment: Use different decision thresholds for different groups to equalize error rates.
• Calibration: Adjust predicted probabilities to match observed outcomes by group.
• Reject Option: For borderline cases, defer to human decision-maker aware of potential bias.
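
Threshold adjustment and the reject option can be combined in a few lines. This is a minimal sketch under stated assumptions: thresholds are chosen per group to reach a common target sensitivity, and scores within a fixed margin of the threshold are referred to a clinician. The names `threshold_for_tpr` and `decide`, the margin of 0.05, and the data are all hypothetical.

```python
import math

def threshold_for_tpr(scores, y_true, target_tpr):
    """Highest threshold at which TPR (score >= threshold) reaches target_tpr."""
    positives = sorted((s for s, y in zip(scores, y_true) if y == 1), reverse=True)
    if not positives:
        raise ValueError("group has no positive cases")
    needed = max(1, math.ceil(target_tpr * len(positives)))
    return positives[needed - 1]

def decide(score, threshold, margin=0.05):
    """Reject option: borderline scores go to a human reviewer."""
    if abs(score - threshold) < margin:
        return "refer"
    return "flag" if score >= threshold else "no flag"

# Invented validation data: group B receives systematically lower scores.
data = {
    "A": {"scores": [0.9, 0.8, 0.7, 0.4, 0.6, 0.3], "y_true": [1, 1, 1, 1, 0, 0]},
    "B": {"scores": [0.7, 0.6, 0.5, 0.2, 0.4, 0.1], "y_true": [1, 1, 1, 1, 0, 0]},
}

thresholds = {
    g: threshold_for_tpr(d["scores"], d["y_true"], target_tpr=0.75)
    for g, d in data.items()
}
print(thresholds)
print(decide(0.52, thresholds["B"]))
```

Equalizing sensitivity this way generally changes the false positive rate in each group, so both error rates should be reported after adjustment. Tied scores at the threshold can push the achieved TPR above the target.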

4. Regulatory Requirements

📋 FDA Guidance (2021-2024):
AI/ML-Based SaMD Action Plan: FDA now expects bias assessment in 510(k) submissions for AI/ML devices.

Expected Elements:
• Demographic breakdown of training and validation datasets
• Performance metrics stratified by race, ethnicity, sex, age
• Discussion of known limitations and intended use populations
• Post-market monitoring plan for real-world performance across groups

2024 Update: FDA issued draft guidance on Diversity Action Plans for certain clinical studies, including device studies.

📋 ONC Health IT Certification (2024):
HTI-1 Rule: Requires transparency for predictive decision support interventions:

• Source of training data (including demographic composition)
• Performance characteristics across subpopulations
• Known limitations and risks
• Intended use and contraindications

This information must be available to clinicians at the point of care—not buried in documentation.
⚖️ Legal Liability:
Healthcare organizations can be held liable for discriminatory outcomes from AI systems under Section 1557 of the Affordable Care Act (prohibits discrimination in healthcare) and ADA Title III (requires accessible medical services). Ignorance of bias is not a legal defense.

5. Best Practices Checklist

Pre-Deployment Bias Audit:

| Checkpoint | Question | Acceptable State |
|---|---|---|
| Dataset Composition | Does training data reflect target population demographics? | Within 10% of local/regional demographics for race, sex, age |
| Performance Stratification | Are accuracy, sensitivity, specificity reported by subgroup? | All major subgroups analyzed, no >10% performance gaps |
| Proxy Variables | Are any features proxies for protected attributes? | Documented, justified, or removed |
| Fairness Metrics | Which fairness criteria were optimized? | Explicit choice documented, trade-offs acknowledged |
| External Validation | Tested on population different from training? | Yes, performance degradation <15% |
| Monitoring Plan | How will bias be detected post-deployment? | Automated alerts for performance drift by subgroup |
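
Two of the checkpoints above, dataset composition and performance stratification, lend themselves to automated checks. The sketch below assumes the 10% limits are absolute differences (percentage points), which the checklist does not specify; the function names `composition_ok` and `audit_gaps` and all numbers are hypothetical.

```python
def composition_ok(train_share, population_share, tolerance=0.10):
    """Per-group check: is the training share within `tolerance` of the population share?"""
    return {
        g: abs(train_share.get(g, 0.0) - p) <= tolerance
        for g, p in population_share.items()
    }

def audit_gaps(metric_by_group, max_gap=0.10):
    """Compare the best and worst subgroup on one metric (e.g. sensitivity)."""
    best = max(metric_by_group.values())
    worst_group = min(metric_by_group, key=metric_by_group.get)
    gap = best - metric_by_group[worst_group]
    return {"gap": gap, "worst_group": worst_group, "passes": gap <= max_gap}

# Invented audit inputs.
train_share = {"A": 0.70, "B": 0.10, "C": 0.20}
population_share = {"A": 0.55, "B": 0.25, "C": 0.20}
sensitivity = {"A": 0.91, "B": 0.78, "C": 0.88}

composition = composition_ok(train_share, population_share)
result = audit_gaps(sensitivity)
print(composition)
print(result)
```

The same `audit_gaps` check can be rerun on a schedule against recent production data, which gives a simple form of the automated subgroup drift alert named in the monitoring checkpoint.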

Key Takeaways:

  • Bias in medical AI is documented, measurable, and harmful—ignoring it is not an option
  • No single fairness metric is sufficient—use multiple and document trade-offs
  • FDA expects bias assessment for AI/ML medical devices, and ONC requires transparency for predictive decision support in certified health IT
  • Legal liability exists under ACA Section 1557 and ADA Title III
  • Mitigation is possible at every stage: pre-processing, in-processing, post-processing
  • Continuous monitoring post-deployment is essential—bias can emerge or worsen over time