
Medical Data Privacy & AI Training

The Uncomfortable Truth About Patient Data in AI Models

⚠️ The Question Nobody Answers: When you use a commercial AI system with patient data, is that data being used to train the vendor's models? Many vendors evade this question. The answer determines whether your patients' PHI is becoming part of a commercial product sold to your competitors.

1. Is Your Patient Data Training AI Models?

This is the billion-dollar question. Many AI vendors' business models depend on continuously improving their models using customer data. Here's what's actually happening:

How Patient Data Becomes Training Data

Your EHR (patient records) → AI Vendor (processes PHI) → Training Pipeline ("de-identified") → Improved Model (sold to competitors)

Vendor Business Models:

Model Type | Uses Your Data for Training? | Can You Opt Out? | Examples
Traditional SaaS | No (subscription-only revenue) | N/A (not collected) | Some EHR-embedded AI
Platform Model | Yes (data improves product for all) | Sometimes (check BAA) | Many cloud AI services
Freemium | Yes (you're the product) | Rarely | Free AI tools, trials
Research Partnerships | Yes (explicit research agreements) | Requires patient consent | Academic medical centers

🚩 Evasive Answers to Watch For:
"We use data to improve our services" — This almost always means training. Ask: "Does 'improve' mean updating models with customer data?"

"All data is de-identified before use" — De-identification often fails (see below). Ask: "What specific de-identification method? HIPAA Safe Harbor or Expert Determination?"

"We don't sell your data" — Technically true if they're selling model improvements derived from your data, not the raw data itself.

"Data is only used with consent" — Whose consent? Yours? The patient's? Clarify who can consent and what they're consenting to.

2. Why De-Identification Often Fails

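HIPAA allows two methods for de-identification: Safe Harbor (remove 18 specified identifiers, with no actual knowledge that the remaining data could identify the individual) and Expert Determination (a qualified expert applies statistical methods and documents that the re-identification risk is very small). Both have serious limitations in the AI era.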
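
To make the Safe Harbor idea concrete, here is a minimal sketch of the kind of field removal and generalization it involves. It covers only a few of the 18 identifier categories. The field names (`name`, `mrn`, `zip`, `birth_date`, `age`) and the `low_population_zip3` parameter are hypothetical, chosen for illustration rather than taken from any real EHR schema.

```python
from datetime import date

# Hypothetical field names for direct identifiers that are dropped outright.
DIRECT_IDENTIFIERS = {"name", "ssn", "mrn", "phone", "email", "address"}


def safe_harbor_generalize(record, low_population_zip3=frozenset()):
    """Return a copy of `record` with a Safe Harbor-style subset of rules applied.

    Illustrative only: real Safe Harbor covers 18 identifier categories.
    `low_population_zip3` is an assumed, caller-supplied set of 3-digit ZIP
    prefixes covering 20,000 or fewer people.
    """
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # Keep only the first three ZIP digits; sparse areas become "000".
    if "zip" in out:
        zip3 = str(out.pop("zip"))[:3]
        out["zip3"] = "000" if zip3 in low_population_zip3 else zip3

    # Dates are reduced to the year.
    if "birth_date" in out:
        out["birth_year"] = out.pop("birth_date").year

    # Ages over 89 are aggregated, and the birth year that would reveal
    # such an age is removed as well.
    if isinstance(out.get("age"), int) and out["age"] > 89:
        out["age"] = "90+"
        out.pop("birth_year", None)

    return out


example = {
    "name": "Jane Doe",
    "zip": "02138",
    "birth_date": date(1980, 5, 17),
    "age": 44,
    "diagnosis": "asthma",
}
print(safe_harbor_generalize(example))
```

Even after this kind of generalization, the remaining fields (3-digit ZIP, birth year, diagnosis) can still act as quasi-identifiers, which is why the attacks described next matter.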

🔍 Re-identification Attacks That Work:

Linkage Attacks: Combine "de-identified" medical data with public datasets (voter records, social media) to re-identify patients. Latanya Sweeney's widely cited analysis of 1990 U.S. Census data estimated that 87% of the U.S. population could be uniquely identified by 5-digit ZIP code, birth date, and gender alone.

Model Inversion: Query an AI model repeatedly to reconstruct training data. Researchers have extracted faces and medical conditions from supposedly private models.

Membership Inference: Determine if a specific person's data was in the training set. This alone can reveal sensitive information (e.g., "Was this patient in the HIV clinic dataset?").
📊 Landmark Study (Nature Communications, 2019): Researchers estimated that 99.98% of Americans could be correctly re-identified in any dataset using just 15 demographic attributes. "De-identified" medical data is far from anonymous when cross-referenced with other data sources.
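
A linkage attack of the kind described above can be sketched in a few lines. The example joins toy "de-identified" clinical rows to a toy public roll on three quasi-identifiers. All names and records are invented placeholders. The clinical rows deliberately keep a full ZIP code and birth date, which Safe Harbor would not permit; they stand in for weakly de-identified data.

```python
from collections import defaultdict

# Quasi-identifiers shared by both datasets (assumed field names).
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def quasi_key(row):
    """Build the join key from the quasi-identifier fields."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def link_records(deidentified_rows, public_rows):
    """Return (name, diagnosis) pairs where exactly one public record matches."""
    names_by_key = defaultdict(list)
    for person in public_rows:
        names_by_key[quasi_key(person)].append(person["name"])

    matches = []
    for row in deidentified_rows:
        candidates = names_by_key.get(quasi_key(row), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches.append((candidates[0], row["diagnosis"]))
    return matches


clinical_rows = [
    {"zip": "60601", "birth_date": "1971-03-09", "sex": "F", "diagnosis": "type 2 diabetes"},
    {"zip": "60601", "birth_date": "1988-11-23", "sex": "M", "diagnosis": "asthma"},
    {"zip": "60614", "birth_date": "1990-01-15", "sex": "F", "diagnosis": "migraine"},
]
voter_roll = [
    {"name": "Pat Example", "zip": "60601", "birth_date": "1971-03-09", "sex": "F"},
    {"name": "Sam Sample", "zip": "60601", "birth_date": "1988-11-23", "sex": "M"},
    {"name": "Lee Placeholder", "zip": "60614", "birth_date": "1990-01-15", "sex": "F"},
    {"name": "Kim Placeholder", "zip": "60614", "birth_date": "1990-01-15", "sex": "F"},
]

for name, diagnosis in link_records(clinical_rows, voter_roll):
    print(name, "->", diagnosis)
```

The third clinical row is not linked because two people share its quasi-identifiers. That ambiguity is the intuition behind k-anonymity: a record is harder to pin on one person when several people look the same in the released data.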
✓ What to Demand: If a vendor claims data is "de-identified," ask: (1) Which HIPAA method (Safe Harbor or Expert Determination)? (2) Who performed the expert determination (if applicable)? (3) What re-identification risk assessment was done? (4) Do you apply differential privacy or other modern techniques?
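
For readers unfamiliar with differential privacy, the following is a minimal sketch of its simplest form: the Laplace mechanism applied to a count query. It is not how a vendor would protect a trained model (that usually involves techniques such as differentially private training, which are beyond this sketch). The function names and the `epsilon` value are illustrative assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of records matching `predicate`.

    Adding or removing one patient changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Toy data: 40 of 100 patients belong to a sensitive clinic cohort.
patients = [{"in_cohort": True}] * 40 + [{"in_cohort": False}] * 60
noisy = dp_count(patients, lambda p: p["in_cohort"], epsilon=1.0)
print(round(noisy, 2))
```

A smaller `epsilon` adds more noise and gives stronger privacy at the cost of accuracy. A vendor that claims to use differential privacy should be able to state the epsilon it uses and what it is applied to.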

3. Data Sales & Third-Party Sharing

Even if a vendor doesn't directly "sell" your data, they may share it with partners, subsidiaries, or "trusted third parties" in ways that effectively monetize your PHI.

Common Data Sharing Scenarios:

Scenario | Is This "Selling"? | HIPAA Status
Selling raw patient data to data brokers | Yes | HIPAA violation without authorization
Selling model improvements derived from your data | Debatable | Legal gray area (not explicitly prohibited)
Sharing with "affiliates" or subsidiaries | Depends on contracts | May be permitted under an organized health care arrangement
Sharing with cloud providers (AWS, Azure, GCP) | No (infrastructure) | Permitted with BAA
Sharing with research partners | Depends on agreements | Requires patient authorization or an IRB-approved waiver
Using data to train models sold to competitors | Indirectly, yes | Not prohibited by HIPAA (BAA should address)

🚩 Read the Fine Print:
Check the vendor's privacy policy and BAA for these phrases:

"We may share data with our corporate family" — Could mean any subsidiary, anywhere, with varying privacy standards.

"We may share with trusted partners" — Who? For what purpose? This is often undefined.

"We may use data for research purposes" — Whose research? Published? Proprietary? Patient consent obtained?

"Data may be transferred internationally" — HIPAA doesn't restrict this, but other laws might (GDPR, state laws).
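
A first pass over a long BAA or privacy policy can be automated. The sketch below flags the four phrases listed above and pairs each with a follow-up question. The phrase list and the `scan_policy` name are illustrative. Simple substring matching will miss paraphrases, so treat this as a triage aid and not a substitute for legal review.

```python
# Phrases from the list above, each paired with a follow-up question.
RED_FLAGS = {
    "corporate family": "Which subsidiaries, in which jurisdictions?",
    "trusted partners": "Who are the partners, and for what purpose?",
    "research purposes": "Whose research, and was patient consent obtained?",
    "transferred internationally": "Which countries, and which laws apply (GDPR, state laws)?",
}


def scan_policy(text):
    """Return (line_number, phrase, question) for each red-flag phrase found."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for phrase, question in RED_FLAGS.items():
            if phrase in lowered:
                findings.append((line_number, phrase, question))
    return findings


sample_policy = (
    "4.1 We may share with trusted partners.\n"
    "4.2 Data is encrypted at rest.\n"
    "4.3 Data may be transferred internationally."
)

for line_number, phrase, question in scan_policy(sample_policy):
    print(f"line {line_number}: '{phrase}' -> ask: {question}")
```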

4. Patient Opt-Out Rights

Do patients have the right to opt out of having their data used for AI training? The answer is complicated and depends on how the data is used.

Opt-Out Scenarios:

✓ Treatment, Payment, Healthcare Operations (TPO)
HIPAA permits using PHI for TPO without patient consent. If AI is used directly for patient care (e.g., diagnostic support), patients generally cannot opt out.
✓ Research
Using PHI for research typically requires IRB approval and patient authorization (opt-in consent). Some research can use waivers, but this is narrowly defined.
✗ Commercial Product Development
Using PHI to train commercial AI models sold to third parties is NOT TPO. This should require patient authorization, but enforcement is weak and many vendors operate in this gray area.
✗ De-identified Data
Once data is de-identified per HIPAA, it's no longer PHI and patients have no opt-out rights. This is why the de-identification method matters critically.
📋 Best Practice: Implement a transparent patient notification process: "We use AI tools to assist with care. Your data may be used to improve these tools. Here's what that means, here's what's protected, and here's how to ask questions." Even if not legally required, transparency builds trust.
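
The four opt-out scenarios above can be encoded as a small lookup, for example to tag each vendor data flow in an internal inventory. This is a simplified sketch of the summary in this section, not a compliance tool. The `DataUse` and `Rule` names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class DataUse(Enum):
    TPO = "treatment, payment, healthcare operations"
    RESEARCH = "research"
    COMMERCIAL_TRAINING = "commercial product development"
    DEIDENTIFIED = "de-identified data"


@dataclass(frozen=True)
class Rule:
    authorization_expected: bool  # should patient authorization be obtained?
    note: str


# Simplified restatement of the four scenarios in this section.
RULES = {
    DataUse.TPO: Rule(False, "Permitted without consent; patients generally cannot opt out."),
    DataUse.RESEARCH: Rule(True, "Typically needs IRB approval and authorization; waivers are narrow."),
    DataUse.COMMERCIAL_TRAINING: Rule(True, "Not TPO; should require authorization, enforcement is weak."),
    DataUse.DEIDENTIFIED: Rule(False, "No longer PHI; no opt-out rights, so the method matters."),
}


def rule_for(use):
    """Look up the simplified rule for a given data use."""
    return RULES[use]


for use in DataUse:
    rule = rule_for(use)
    flag = "authorization expected" if rule.authorization_expected else "no authorization"
    print(f"{use.value}: {flag}. {rule.note}")
```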

5. Questions to Ask Every AI Vendor

Data Usage Interrogation:

Question | Acceptable Answer | 🚩 Red Flag
Is our data used to train your models? | "No" or "Only with explicit written consent" | "We use data to improve services" (vague)
Can we opt out of training data use? | "Yes, via contract amendment" | "Not possible with our architecture"
Do you sell or share data with third parties? | "No, except infrastructure providers under BAA" | "We share with partners to enhance offerings"
What de-identification method do you use? | HIPAA Safe Harbor or Expert Determination (with documentation) | "We anonymize data" (no specifics)
Do you apply differential privacy? | "Yes" or "Not applicable (we don't train on customer data)" | "What's that?" or silence
Can we audit your data usage? | "Yes, annual audit rights in BAA" | "Our systems are proprietary"
Where is data processed (geographically)? | US-only data centers | "Global infrastructure" or unclear
💼 Service Details: Avondale.AI offers AI Privacy Audits including BAA review for data usage clauses, vendor interrogation support, patient notification template creation, and de-identification methodology assessment. We help you protect patient privacy in the AI era.

Key Takeaways:

  • Many AI vendors use customer data to train models — get explicit answers in writing
  • De-identification is not anonymity — re-identification attacks are increasingly successful
  • "We don't sell data" may be technically true while still monetizing your PHI indirectly
  • Patient opt-out rights depend on data use (TPO vs research vs commercial)
  • BAA should explicitly prohibit using PHI for model training without consent
  • Transparency with patients about AI use builds trust even when not legally required