[Medical ML] Failed Implementations: What Went Wrong

Posted on February 8, 2026 (updated March 6, 2026) by Yoman
Medical ML Diagnosis · Medical Research · Article 14 of 43
By Oleh Ivchenko · Research for academic purposes only. Not a substitute for medical advice or clinical diagnosis.
DOI: 10.5281/zenodo.18752878 [1] · Zenodo Archive · ORCID

Article #11 in Medical ML for Ukrainian Doctors Series

Understanding failed medical AI implementations

By Oleh Ivchenko | Researcher, ONPU | Stabilarity Hub | February 8, 2026


Key Questions Addressed

  1. What are the most significant high-profile failures of medical AI implementations?
  2. What technical, organizational, and deployment factors cause AI systems to fail?
  3. What lessons can Ukrainian healthcare learn to avoid repeating these failures?

Context: Why This Matters for Ukrainian Healthcare

Despite more than $66.8 billion invested globally in healthcare AI in 2021 alone, the field has produced spectacular failures alongside its successes. Understanding what went wrong, and why, is essential for any hospital considering AI adoption.


The High-Profile Failures: Case Studies

```mermaid
timeline
    title High-Profile Medical AI Failures Timeline
    2011 : IBM Watson wins Jeopardy — pivot to healthcare begins
    2015 : IBM acquires multiple health data companies ($5B+)
    2017 : NHS Royal Free — DeepMind's Streams app breaches privacy rules
    2018 : Watson Health internal memo reveals "unsafe" recommendations
    2019 : Optum algorithm found to discriminate against 200M patients
    2020 : Google Health Thailand deployment fails due to workflow mismatch
    2021 : Epic Sepsis model — external validation shows poor performance
    2022 : IBM sells Watson Health for fraction of investment
```

IBM Watson Health: A $5 Billion Lesson

Perhaps no failure looms larger than IBM Watson Health, the company’s flagship attempt to revolutionize cancer care with AI.

| Year | Event |
|---|---|
| 2011 | Watson wins Jeopardy; IBM pivots to healthcare |
| 2015–2016 | Aggressive acquisitions totaling $5+ billion |
| Peak | 7,000 employees dedicated to Watson Health |
| 2018 | Internal document reveals “unsafe and incorrect” recommendations |
| 2022 | IBM sells Watson Health for ~$1 billion (a ~$4B write-off) |

What Went Wrong

  • Synthetic training data: Trained on hypothetical cases, not real patients
  • Poor real-world performance: 96% concordance at MSK → 12% for gastric cancer in China
  • Dangerous recommendations: Suggested chemotherapy for patients with severe infection
  • Limited adaptability: Couldn’t incorporate breakthrough treatments

Google Health’s Thailand Diabetic Retinopathy Deployment

Google’s diabetic retinopathy AI claimed “>90% accuracy at human specialist level.” Field deployment told a different story.

The Promise

  • Accuracy: >90% (specialist-level)
  • Processing time: <10 minutes per scan
  • Target: 4.5 million Thai patients

The Reality

  • >21% of images rejected as unsuitable
  • Poor lighting in rural clinic environments
  • Only 10 patients screened in 2 hours

“Patients like the instant results, but the internet is slow and patients then complain. They’ve been waiting here since 6 a.m., and for the first two hours we could only screen 10 patients.”

— Thai clinic nurse (MIT Technology Review)

Epic’s Sepsis Prediction Model

Epic’s sepsis prediction model represents a failure at scale: deployed across hundreds of US hospitals, affecting millions of patients.

| Metric | Epic’s Claim | Actual Performance |
|---|---|---|
| AUC | 0.76–0.83 | 0.63 |
| Sensitivity | Not disclosed | 33% |
| False alarm ratio | Not disclosed | 109 alerts per 1 true intervention |

⚠️ Impact: In external validation the model missed 67% of patients who actually developed sepsis, and it identified only 7% of sepsis patients whom clinicians had missed, a far cry from its marketed performance.
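Vendor claims like these can be checked before go-live. Below is a minimal local-validation sketch in Python, assuming a chart-reviewed held-out set of local cases; the labels, scores, and the 0.5 alert threshold are all hypothetical stand-ins, not Epic's actual interface.

```python
# Minimal local-validation sketch (hypothetical data): computes the three
# metrics from the table above -- AUC, sensitivity, and alert burden.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-ins: replace with local chart-reviewed labels and the vendor's scores.
y_true = rng.integers(0, 2, size=1000)      # 1 = patient developed sepsis
scores = rng.random(1000)                   # vendor model's risk scores
alerts = scores >= 0.5                      # assumed vendor alert threshold

auc = roc_auc_score(y_true, scores)
tp = int(np.sum(alerts & (y_true == 1)))    # true alerts
fp = int(np.sum(alerts & (y_true == 0)))    # false alarms
sensitivity = tp / max(int(np.sum(y_true == 1)), 1)
alerts_per_true = (tp + fp) / max(tp, 1)    # alerts fired per true case caught

print(f"AUC={auc:.2f}  sensitivity={sensitivity:.1%}  "
      f"alerts per true case={alerts_per_true:.1f}")
```

A script like this, run on a few hundred locally labeled cases, would have exposed both the sensitivity gap and the 109:1 alert burden before any clinician saw an alert.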

Algorithmic Bias: Systematic Discrimination at Scale

The Optum Algorithm: 200 Million Patients Affected

  • 200M: patients affected annually
  • $1,800: less spent by Black patients with the same illness
  • 2.7x: improvement after correction

Root cause: The algorithm used healthcare spending as a proxy for health need. Black patients spent less due to access barriers, not better health, creating systematic discrimination.
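The proxy mechanism is easy to reproduce. The toy simulation below, with all parameters invented for illustration, labels "high need" by spending rather than illness; because access barriers suppress one group's spending at the same illness level, that group's sickest patients are flagged far less often.

```python
# Toy simulation of proxy-variable bias (all parameters invented).
# Selecting "high need" patients by spending instead of illness under-selects
# the group whose access barriers suppress spending at the same illness level.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
group_b = rng.random(n) < 0.5                      # group facing access barriers
illness = rng.gamma(shape=2.0, scale=1.0, size=n)  # true need, same for both groups

# Spending tracks illness, but access barriers cut group B's spending by 30%.
spending = illness * np.where(group_b, 0.7, 1.0) + rng.normal(0, 0.1, n)

# "Algorithm": flag the top 10% by spending for extra care programs.
flagged = spending >= np.quantile(spending, 0.90)

# Among the truly sickest 10%, how often does each group actually get flagged?
sickest = illness >= np.quantile(illness, 0.90)
for name, mask in [("group A", ~group_b), ("group B", group_b)]:
    print(f"{name}: {np.mean(flagged[sickest & mask]):.1%} of its sickest flagged")
```

No demographic variable appears in the "model", yet the output discriminates: the bias enters entirely through the choice of label.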

Skin Cancer Detection: Racial Performance Gaps

| System | Light Skin | Dark Skin | Drop |
|---|---|---|---|
| System A | 0.41 | 0.12 | −71% |
| System B | 0.69 | 0.23 | −67% |
| System C | 0.71 | 0.31 | −56% |
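Gaps like these stay invisible unless metrics are broken out per subgroup. A minimal audit sketch follows; the data frame and column names are hypothetical, and the same pattern applies to any demographic attribute.

```python
# Minimal subgroup audit sketch (hypothetical data and column names):
# an aggregate metric can mask exactly the gaps shown in the table above.
import pandas as pd
from sklearn.metrics import recall_score

# Stand-in validation set: true labels, model predictions, skin type.
df = pd.DataFrame({
    "label":     [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    "predicted": [1, 0, 1, 1, 0, 0, 0, 0, 1, 0],
    "skin_type": ["light"] * 5 + ["dark"] * 5,
})

overall = recall_score(df["label"], df["predicted"])
print(f"overall sensitivity: {overall:.2f}")   # single number hides the gap

for group, sub in df.groupby("skin_type"):     # ...until broken out by subgroup
    sens = recall_score(sub["label"], sub["predicted"])
    print(f"{group}: sensitivity {sens:.2f}")
```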

Common Failure Patterns

```mermaid
mindmap
  root((AI Failure Patterns))
    Technical Failures
      Shortcut learning
      Training/deployment mismatch
      Dataset shift
    Organizational Failures
      Ignoring clinician input
      Underestimating integration
      Missing baselines
    Deployment Failures
      Infrastructure gaps
      Alert fatigue
      Workflow disruption
    Bias Failures
      Demographic gaps in training
      Proxy variable discrimination
      Missing subgroup testing
```

Shortcut Learning: When AI Learns the Wrong Features

  • COVID-19 detection AI: learned to detect patient position (standing vs. lying) rather than lung pathology
  • Pneumonia AI: recognized hospital equipment and labels rather than disease patterns
  • Skin cancer AI: treated the presence of rulers (dermatologists measure suspicious lesions) as a cancer indicator
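Shortcut learning can be demonstrated in a few lines. The sketch below, on entirely synthetic data, plants a spurious "ruler present" feature that tracks the label at the training site but not at the deployment site; the accuracy collapse shows the model keyed on the shortcut, not the weak pathology signal.

```python
# Synthetic demonstration of shortcut learning: a spurious "ruler" feature
# correlates with the label during training but not at deployment.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

def make_data(n, ruler_corr):
    y = rng.integers(0, 2, size=n)
    signal = y + rng.normal(0, 2.0, n)     # weak, noisy real pathology signal
    # Ruler matches the label with probability ruler_corr, otherwise random.
    ruler = np.where(rng.random(n) < ruler_corr, y, rng.integers(0, 2, n))
    return np.column_stack([signal, ruler]), y

X_train, y_train = make_data(5000, ruler_corr=0.95)    # training hospital
X_deploy, y_deploy = make_data(5000, ruler_corr=0.0)   # deployment site, no rulers

model = LogisticRegression().fit(X_train, y_train)
print(f"train-site accuracy:  {model.score(X_train, y_train):.2f}")   # looks great
print(f"deploy-site accuracy: {model.score(X_deploy, y_deploy):.2f}") # collapses
```

The same mechanism explains the COVID-19 position artifact: the spurious cue is simply cheaper to learn than the pathology.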

Key Lessons for Ukrainian Healthcare

```mermaid
flowchart TD
    A[AI Implementation Decision] --> B{Clinical Validation?}
    B -- No --> C["HIGH RISK: Stop"]
    B -- Yes --> D{Real-world Workflow Testing?}
    D -- No --> E["HIGH RISK: Pilot First"]
    D -- Yes --> F{Bias Testing Across Demographics?}
    F -- No --> G["MEDIUM RISK: Audit Required"]
    F -- Yes --> H{Clinician Co-design?}
    H -- No --> I["MEDIUM RISK: Stakeholder Review"]
    H -- Yes --> J["READY: Monitored Deployment"]
    J --> K[Continuous Performance Monitoring]
    K --> L{Performance Drift?}
    L -- Yes --> M[Retrain / Retire Model]
    L -- No --> K
```
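The monitoring loop at the bottom of the flowchart can start as simply as a rolling metric over recently resolved cases. Here is a minimal sketch; the window size, warm-up count, and the 0.05 allowed AUC drop are arbitrary assumptions to adapt locally.

```python
# Minimal performance-drift monitor: rolling AUC over recently resolved cases.
# Window size, warm-up count, and allowed drop are assumptions, not standards.
from collections import deque
from sklearn.metrics import roc_auc_score

class DriftMonitor:
    def __init__(self, baseline_auc, window=500, max_drop=0.05):
        self.baseline = baseline_auc
        self.max_drop = max_drop
        self.labels = deque(maxlen=window)
        self.scores = deque(maxlen=window)

    def record(self, label, score):
        """Log one resolved case (true outcome + model score); True = drift alarm."""
        self.labels.append(label)
        self.scores.append(score)
        if len(self.labels) < 100 or len(set(self.labels)) < 2:
            return False                       # warm-up: not enough data yet
        auc = roc_auc_score(list(self.labels), list(self.scores))
        return auc < self.baseline - self.max_drop

# Usage: feed outcomes as they are confirmed; retrain or retire when it fires.
monitor = DriftMonitor(baseline_auc=0.80)
drifted = monitor.record(label=1, score=0.63)
```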

What To Do

  1. Validate locally: Never trust vendor performance claims without local testing
  2. Test on edge cases: Include diverse populations and challenging cases
  3. Measure baselines: Know current performance before deploying AI (see the comparison sketch after this list)
  4. Plan for infrastructure: Consider internet, lighting, equipment quality
  5. Involve clinicians early: Design with end-users, not for them
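For point 3, a baseline comparison can fit on one screen: score the current clinical pathway and the candidate model on the same held-out cases. The numbers below are hypothetical stand-ins.

```python
# Baseline-vs-model comparison sketch (hypothetical data): evaluate the
# current clinical pathway and the candidate AI on the same held-out cases.
import numpy as np
from sklearn.metrics import recall_score, precision_score

y_true    = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])  # chart-reviewed outcomes
clinician = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 1])  # current pathway's calls
model     = np.array([1, 1, 1, 1, 0, 1, 1, 0, 0, 1])  # candidate AI's calls

for name, pred in [("current pathway", clinician), ("candidate AI", model)]:
    print(f"{name}: sensitivity={recall_score(y_true, pred):.2f}, "
          f"precision={precision_score(y_true, pred):.2f}")
# Deploy only if the model beats the baseline on the metrics that matter locally.
```

In this toy example the model catches more cases but raises more false alarms, exactly the sensitivity-versus-alert-fatigue trade-off that sank Epic's model.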

What To Avoid

  1. Synthetic training data: Hypothetical cases ≠ real patients
  2. Single-site validation: Performance varies across settings
  3. Ignoring alert fatigue: Too many alerts = all alerts ignored
  4. Proxy variable bias: Spending ≠ health need
  5. Overconfidence in lab metrics: AUC in lab ≠ AUC in clinic

Conclusions

Key Takeaways

  1. Failures are systemic, not anomalies: Even well-funded, prestigious projects fail when they ignore clinical reality
  2. Lab performance ≠ clinical performance: The gap between research and deployment is where most AI fails
  3. Bias is built-in, not accidental: Training data reflects historical inequities; active mitigation is required
  4. Infrastructure matters as much as algorithms: Network speed, image quality, workflow integration determine success
  5. Learning from failures is more valuable than celebrating successes

Questions Answered

What are the most significant failures?

IBM Watson Health ($5B loss), Google’s Thailand deployment (21% image rejection), Epic’s sepsis model (missed 67% of cases), and Optum’s biased algorithm (200M patients affected).

What factors cause failure?

Synthetic training data, infrastructure mismatches, algorithmic bias, alert fatigue, and underestimating deployment complexity.

What lessons apply to Ukraine?

Always validate locally, involve clinicians from day one, plan for infrastructure requirements, and never trust vendor performance claims without independent testing.


Next in Series: Article #12 – Physician Resistance: Causes and Solutions

Series: Medical ML for Ukrainian Doctors | Stabilarity Hub Research Initiative


Author: Oleh Ivchenko | ONPU Researcher | Stabilarity Hub

References (1)

  1. Stabilarity Research Hub. [Medical ML] Failed Implementations: What Went Wrong. DOI: 10.5281/zenodo.18752878.
