Article #11 in Medical ML for Ukrainian Doctors Series
By Oleh Ivchenko | Researcher, ONPU | Stabilarity Hub | February 8, 2026
Key Questions Addressed #
- What are the most significant high-profile failures of medical AI implementations?
- What technical, organizational, and deployment factors cause AI systems to fail?
- What lessons can Ukrainian healthcare learn to avoid repeating these failures?
Context: Why This Matters for Ukrainian Healthcare #
Global investment in healthcare AI exceeded $66.8 billion in 2021 alone, yet the field has produced spectacular failures alongside its successes. Understanding what went wrong, and why, is essential for any hospital considering AI adoption.
The High-Profile Failures: Case Studies #
```mermaid
timeline
    title High-Profile Medical AI Failures Timeline
    2011 : IBM Watson wins Jeopardy — pivot to healthcare begins
    2015 : IBM acquires multiple health data companies ($5B+)
    2017 : NHS Royal Free — DeepMind's Streams app breaches privacy rules
    2018 : Watson Health internal memo reveals "unsafe" recommendations
    2019 : Optum algorithm found to discriminate against 200M patients
    2020 : Google Health Thailand deployment fails due to workflow mismatch
    2021 : Epic Sepsis model — external validation shows poor performance
    2022 : IBM sells Watson Health for fraction of investment
```
IBM Watson Health: A $5 Billion Lesson #
Perhaps no failure looms larger than IBM Watson Health, IBM's flagship attempt to revolutionize cancer care with AI.
What Went Wrong #
- Synthetic training data: Trained on hypothetical cases, not real patients
- Poor real-world performance: 96% concordance at Memorial Sloan Kettering (MSK, its training partner) → 12% for gastric cancer in China
- Dangerous recommendations: Suggested chemotherapy for patients with severe infection
- Limited adaptability: Couldn’t incorporate breakthrough treatments
Google Health's Thailand Diabetic Retinopathy Deployment #
Google’s diabetic retinopathy AI claimed “>90% accuracy at human specialist level.” Field deployment told a different story.
The Promise #
- Accuracy: >90% (specialist-level)
- Processing time: <10 minutes per scan
- Target: 4.5 million Thai patients
The Reality #
- >21% of images rejected as unsuitable
- Poor lighting in rural clinic environments
- Only 10 patients screened in 2 hours
“Patients like the instant results, but the internet is slow and patients then complain. They’ve been waiting here since 6 a.m., and for the first two hours we could only screen 10 patients.”
— Thai clinic nurse (MIT Technology Review)
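One practical countermeasure, had it been in place, is a local image-quality gate that catches unusable photos at the clinic, before a slow upload to a remote model. Below is a minimal sketch of such a gate in Python, assuming OpenCV is available; the blur and brightness thresholds are illustrative placeholders, not validated clinical values.

```python
# Minimal sketch of a local image-quality gate for fundus photos, so
# obviously unusable images are caught at the camera instead of being
# rejected by a remote model after a slow upload.
# Assumptions: OpenCV (cv2) is installed; the thresholds below are
# illustrative placeholders, not validated values.
import cv2
import numpy as np

BLUR_THRESHOLD = 100.0        # variance of Laplacian; lower = blurrier
BRIGHTNESS_RANGE = (40, 220)  # acceptable mean pixel intensity (0-255)

def quality_check(path: str) -> tuple[bool, str]:
    """Return (ok, reason) for a single image file."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return False, "unreadable file"
    # Sharpness: variance of the Laplacian is a standard blur proxy.
    sharpness = cv2.Laplacian(img, cv2.CV_64F).var()
    if sharpness < BLUR_THRESHOLD:
        return False, f"too blurry (Laplacian var {sharpness:.0f})"
    # Exposure: mean intensity catches the dim rural-clinic lighting
    # behind the >21% rejection rate described above.
    brightness = float(np.mean(img))
    if not (BRIGHTNESS_RANGE[0] <= brightness <= BRIGHTNESS_RANGE[1]):
        return False, f"poor exposure (mean {brightness:.0f})"
    return True, "ok"

ok, reason = quality_check("fundus_001.jpg")
print("retake photo:" if not ok else "send to model:", reason)
```

The point is architectural: rejecting a photo at the camera costs seconds, while rejecting it after a slow upload costs the patient's place in a queue that started at 6 a.m.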
No Epic’s Sepsis Prediction Model #
Epic’s sepsis prediction model represents a failure at scale—deployed across hundreds of US hospitals, affecting millions of patients.
Algorithmic Bias: Systematic Discrimination at Scale #
The Optum Algorithm: 200 Million Patients Affected #
- 200M: patients affected annually
- $1,800: less spent on Black patients with the same illness burden
- 2.7x: improvement after correction
Root cause: Algorithm used healthcare spending as proxy for health need. Black patients spent less due to access barriers, not better health—creating systematic discrimination.
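This failure mode is easy to reproduce. The following minimal simulation (entirely synthetic data; every parameter is invented for illustration) ranks patients by spending, the proxy label, and shows that a group whose access to care is suppressed gets systematically under-enrolled even when its true illness burden is identical.

```python
# Minimal synthetic simulation of the Optum proxy-label failure:
# select patients by healthcare *spending* and a group facing access
# barriers is under-prioritized despite identical *need*.
# All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
group_b = rng.random(n) < 0.5                    # True = group with access barriers
need = rng.gamma(shape=2.0, scale=1.0, size=n)   # true illness burden, same for both groups
access = np.where(group_b, 0.6, 1.0)             # barriers suppress utilization
spending = need * access + rng.normal(0, 0.1, n) # observed cost: the proxy label

# "Model": rank patients by spending (here, a perfect predictor of
# the *wrong* target) and enroll the top 3% in care management.
enrolled = spending >= np.quantile(spending, 0.97)

# Among patients with the SAME high need, who gets selected?
high_need = need >= np.quantile(need, 0.97)
for name, mask in [("group A", ~group_b), ("group B", group_b)]:
    rate = enrolled[high_need & mask].mean()
    print(f"{name}: {rate:.1%} of equally high-need patients enrolled")
```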
Skin Cancer Detection: Racial Performance Gaps #
Dermatology AI models are trained overwhelmingly on images of light skin, so sensitivity drops sharply on darker skin tones: the performance gap is inherited directly from the demographic gaps in the training datasets.
Common Failure Patterns #
```mermaid
mindmap
  root((AI Failure Patterns))
    Technical Failures
      Shortcut learning
      Training/deployment mismatch
      Dataset shift
    Organizational Failures
      Ignoring clinician input
      Underestimating integration
      Missing baselines
    Deployment Failures
      Infrastructure gaps
      Alert fatigue
      Workflow disruption
    Bias Failures
      Demographic gaps in training
      Proxy variable discrimination
      Missing subgroup testing
```
Shortcut Learning: When AI Learns the Wrong Features #
| System | What It Actually Learned |
|---|---|
| COVID-19 Detection AI | Patient position (standing vs. lying) rather than lung pathology |
| Pneumonia AI | Hospital equipment/labels rather than disease patterns |
| Skin Cancer AI | Presence of rulers (dermatologists measure suspicious lesions) as a cancer indicator |
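Shortcut learning is also easy to demonstrate on synthetic data. In the sketch below (assuming scikit-learn is installed; the features and marker are invented), a spurious "marker" feature stands in for rulers or hospital tags: it perfectly predicts the label in the training data, so the model leans on it, and accuracy collapses toward chance at a site where the marker is absent.

```python
# Minimal synthetic demo of shortcut learning: a spurious "marker"
# feature (think: a ruler in the photo, a portable-scanner tag)
# correlates with the label in training but not at deployment.
# Assumes scikit-learn is installed; data is entirely synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5_000
y = rng.random(n) < 0.5                 # disease present?
signal = y + rng.normal(0, 2.0, n)      # weak true pathology signal
marker = np.where(y, 1.0, 0.0)          # training-site artifact: present iff diseased
X_train = np.column_stack([signal, marker + rng.normal(0, 0.1, n)])

clf = LogisticRegression().fit(X_train, y)
print("training-site accuracy:", clf.score(X_train, y))   # looks superb

# New hospital: no rulers/tags, so the shortcut feature is gone.
X_deploy = np.column_stack([signal, np.zeros(n)])
print("deployment accuracy:", clf.score(X_deploy, y))     # collapses toward chance
```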
Key Lessons for Ukrainian Healthcare #
```mermaid
flowchart TD
    A[AI Implementation Decision] --> B{Clinical Validation?}
    B -- No --> C[HIGH RISK: Stop]
    B -- Yes --> D{Real-world Workflow Testing?}
    D -- No --> E[HIGH RISK: Pilot First]
    D -- Yes --> F{Bias Testing Across Demographics?}
    F -- No --> G[MEDIUM RISK: Audit Required]
    F -- Yes --> H{Clinician Co-design?}
    H -- No --> I[MEDIUM RISK: Stakeholder Review]
    H -- Yes --> J[READY: Monitored Deployment]
    J --> K[Continuous Performance Monitoring]
    K --> L{Performance Drift?}
    L -- Yes --> M[Retrain / Retire Model]
    L -- No --> K
```
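The monitoring loop at the bottom of the flowchart does not require exotic tooling. A minimal sketch, assuming outcome labels eventually arrive and scikit-learn is available: recompute AUC on consecutive windows of scored patients and flag sustained degradation. The window size and AUC floor below are illustrative assumptions a hospital would set for itself.

```python
# Minimal sketch of the "Continuous Performance Monitoring" loop:
# recompute AUC on rolling windows of scored patients and flag drift
# when it falls below an agreed floor. Window size and threshold are
# illustrative assumptions, not recommendations.
import numpy as np
from sklearn.metrics import roc_auc_score

WINDOW = 500        # patients per evaluation window
AUC_FLOOR = 0.70    # minimum acceptable discrimination, agreed up front

def monitor(scores: np.ndarray, outcomes: np.ndarray) -> list[tuple[int, float, bool]]:
    """Return (window_index, auc, drift_flag) for consecutive windows."""
    report = []
    for i in range(0, len(scores) - WINDOW + 1, WINDOW):
        s, y = scores[i:i + WINDOW], outcomes[i:i + WINDOW]
        if y.min() == y.max():          # AUC undefined without both classes
            continue
        auc = roc_auc_score(y, s)
        report.append((i // WINDOW, auc, auc < AUC_FLOOR))
    return report

# Synthetic example: a model whose signal decays over time (dataset shift).
rng = np.random.default_rng(2)
y = (rng.random(3_000) < 0.1).astype(int)
decay = np.linspace(1.5, 0.1, 3_000)    # signal strength shrinking
scores = y * decay + rng.normal(0, 1, 3_000)
for w, auc, drift in monitor(scores, y):
    print(f"window {w}: AUC={auc:.2f}{'  <-- DRIFT: retrain/retire' if drift else ''}")
```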
What To Do #
- Validate locally: Never trust vendor performance claims without local testing (see the validation sketch after this list)
- Test on edge cases: Include diverse populations and challenging cases
- Measure baselines: Know current performance before deploying AI
- Plan for infrastructure: Consider internet, lighting, equipment quality
- Involve clinicians early: Design with end-users, not for them
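In practice, "validate locally" reduces to recomputing the vendor's headline metric on your own patients, overall and per subgroup, with uncertainty estimates. A minimal sketch follows, assuming pandas and scikit-learn; the synthetic frame and its column names (score, outcome, sex) are hypothetical placeholders for whatever a local EHR export actually contains.

```python
# Minimal sketch of local validation: recompute the vendor's headline
# AUC on your own patients, overall and per subgroup, with bootstrap
# confidence intervals. The synthetic frame below stands in for a real
# local export; column names are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_ci(y, s, n_boot=1000, seed=0):
    """Point AUC plus a 95% bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    point = roc_auc_score(y, s)
    idx = np.arange(len(y))
    boots = []
    for _ in range(n_boot):
        b = rng.choice(idx, size=len(idx), replace=True)
        if y[b].min() != y[b].max():    # need both classes for AUC
            boots.append(roc_auc_score(y[b], s[b]))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, lo, hi

# Stand-in for df = pd.read_csv("local_validation_export.csv"):
rng = np.random.default_rng(3)
n = 2_000
df = pd.DataFrame({
    "outcome": (rng.random(n) < 0.15).astype(int),
    "sex": rng.choice(["F", "M"], n),
})
# Model scores: deliberately weaker for one subgroup to show the point.
strength = np.where(df["sex"] == "F", 0.6, 1.4)
df["score"] = df["outcome"] * strength + rng.normal(0, 1, n)

y, s = df["outcome"].to_numpy(), df["score"].to_numpy()
print("overall AUC: %.2f (95%% CI %.2f-%.2f)" % auc_ci(y, s))
# A model can look fine overall and still fail a subgroup:
for value, g in df.groupby("sex"):
    print("sex=%s AUC: %.2f (95%% CI %.2f-%.2f)"
          % ((value,) + auc_ci(g["outcome"].to_numpy(), g["score"].to_numpy())))
```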
What To Avoid #
- Synthetic training data: Hypothetical cases ≠ real patients
- Single-site validation: Performance varies across settings
- Ignoring alert fatigue: Too many alerts = all alerts ignored
- Proxy variable bias: Spending ≠ health need
- Overconfidence in lab metrics: AUC in lab ≠ AUC in clinic
Conclusions #
Key Takeaways #
- Failures are systemic, not anomalies: Even well-funded, prestigious projects fail when they ignore clinical reality
- Lab performance ≠ clinical performance: The gap between research and deployment is where most AI fails
- Bias is built-in, not accidental: Training data reflects historical inequities; active mitigation is required
- Infrastructure matters as much as algorithms: Network speed, image quality, workflow integration determine success
- Learning from failures is more valuable than celebrating successes
Questions Answered #
What are the most significant failures?
IBM Watson Health ($5B loss), Google’s Thailand deployment (21% image rejection), Epic’s sepsis model (missed 67% of cases), and Optum’s biased algorithm (200M patients affected).
What factors cause failure?
Synthetic training data, infrastructure mismatches, algorithmic bias, alert fatigue, and underestimating deployment complexity.
What lessons apply to Ukraine?
Always validate locally, involve clinicians from day one, plan for infrastructure requirements, and never trust vendor performance claims without independent testing.
Next in Series: Article #12 – Physician Resistance: Causes and Solutions
Series: Medical ML for Ukrainian Doctors | Stabilarity Hub Research Initiative
Author: Oleh Ivchenko | ONPU Researcher | Stabilarity Hub
References (1) #
- Stabilarity Research Hub. [Medical ML] Failed Implementations: What Went Wrong. doi.org.
