Data Requirements and Quality Standards for Medical Imaging AI
Machine Learning for Medical Diagnosis Research Series • Article #5
Abstract
This article examines the critical data quality standards required for medical imaging AI systems, revealing that of 1,016 FDA-approved AI medical devices, 93.3% did not report training data source and 76.3% lacked demographic information. We establish a comprehensive framework for data quality assessment including the six pillars of medical imaging data quality, bias sources and mitigation strategies, and practical implementation guidelines for Ukrainian healthcare facilities.
1. The Data Quality Crisis in Medical AI
The promise of AI in medical imaging depends entirely on data quality. Yet a comprehensive 2025 study of all 1,016 FDA-approved AI/ML medical devices reveals a troubling reality:
flowchart TD
subgraph FDA["FDA AI/ML Device Transparency (2025)"]
A[1,016 Approved Devices] --> B{Transparency Analysis}
B --> C["🔴 93.3% No Training Source"]
B --> D["🔴 90.6% No Dataset Size"]
B --> E["🟠 76.3% No Demographics"]
B --> F["🟠 51.6% No Performance Metrics"]
end
subgraph Score["ACTR Score"]
G["Mean Score: 3.3/17 points"]
H["Post-GMLP 2021: +0.88 improvement"]
I["Still FAR below acceptable"]
end
C --> G
D --> G
E --> G
F --> G
G --> H --> I
style C fill:#ffcccc
style D fill:#ffcccc
style E fill:#ffe6cc
style F fill:#ffe6cc
style I fill:#ff9999
📉 FDA Transparency Analysis (December 2024)
| Data Characteristic | Devices Reporting (%) | Gap |
|---|---|---|
| Training data source | 6.7% | 93.3% unreported |
| Test data source | 24.5% | 75.5% unreported |
| Training dataset size | 9.4% | 90.6% unreported |
| Test dataset size | 23.2% | 76.8% unreported |
| Demographic information | 23.7% | 76.3% unreported |
| Any performance metrics | 48.4% | 51.6% unreported |
Source: Nature Digital Medicine, “Evaluating transparency in AI/ML model characteristics for FDA-reviewed medical devices,” 2025
The mean AI Characteristics Transparency Reporting (ACTR) score across all devices was just 3.3 out of 17 possible points. Even after the FDA’s 2021 Good Machine Learning Practice (GMLP) guidelines, scores only improved by 0.88 points — remaining far below acceptable standards.
2. Dataset Size Requirements: How Much Data Is Enough?
The relationship between training dataset size and model performance follows a logarithmic curve — with diminishing returns at scale, but critical thresholds below which models fail entirely.
flowchart LR
subgraph Requirements["Dataset Size Requirements by Task"]
direction TB
A["Binary Classification
cancer/no cancer"] --> A1["Min: 1K-5K images
Rec: 10K+"]
B["Multi-class
6+ categories"] --> B1["Min: 500-1K/class
Rec: 2K+/class"]
C["Object Detection
localization"] --> C1["Min: 2K-5K bbox
Rec: 15K+"]
D["Segmentation
pixel masks"] --> D1["Min: 500-1K masks
Rec: 5K+"]
end
subgraph Transfer["Transfer Learning Impact"]
E["From Scratch"] --> E1["████████ 10K+ images"]
F["Transfer Learning"] --> F1["████ ~1K images"]
G["Few-Shot"] --> G1["██ ~100 images"]
end
Requirements --> Transfer
style A1 fill:#e8f5e9
style B1 fill:#e8f5e9
style C1 fill:#fff3e0
style D1 fill:#fff3e0
2.1 Minimum Viable Dataset Sizes
| Task Type | Minimum Size | Recommended Size | State-of-Art Datasets |
|---|---|---|---|
| Binary classification (e.g., cancer/no cancer) | 1,000-5,000 images | 10,000+ images | CheXpert: 224,316 |
| Multi-class classification (6+ classes) | 500-1,000 per class | 2,000+ per class | NIH ChestX-ray14: 112,120 |
| Object detection/localization | 2,000-5,000 with bbox | 15,000+ with bbox | VinDr-CXR: 18,000 |
| Semantic segmentation | 500-1,000 with masks | 5,000+ with masks | Varies by anatomy |
2.2 The Transfer Learning Advantage
Transfer learning dramatically reduces data requirements by leveraging pre-trained models (ImageNet, RadImageNet, etc.):
┌─────────────────────────────────────────────────────────────┐ │ DATASET SIZE REQUIREMENTS │ ├─────────────────────────────────────────────────────────────┤ │ │ │ From Scratch: ████████████████████████ 10,000+ images │ │ │ │ Transfer Learning: ████████ ~1,000 images │ │ │ │ Few-Shot/Fine-tune: ██ ~100 images │ │ (with foundation models) │ │ │ └─────────────────────────────────────────────────────────────┘
A 2025 BMC Medical Imaging scoping review found that 50% of deep learning medical imaging studies used datasets between 1,000-10,000 samples — suggesting this range represents current practical norms.
3. The Six Pillars of Medical Imaging Data Quality
Based on the CLAIM 2024 Update (Checklist for Artificial Intelligence in Medical Imaging) and RIDGE framework (Reproducibility, Integrity, Dependability, Generalizability, Efficiency), we define six essential data quality pillars:
mindmap
root((Data Quality
Pillars))
Reference Standard
Use not ground truth
3+ annotators
Consensus methodology
Interobserver κ/Dice
Annotation Protocol
Written guidelines
Visual examples
Edge case handling
Annotator training
Demographics
Age distribution
Sex/gender balance
Race/ethnicity
Geographic diversity
Technical Specs
Scanner model
Resolution/bit depth
Acquisition params
DICOM metadata
Privacy
HIPAA/GDPR
PHI removal
Facial de-identification
Re-ID risk assessment
Provenance
Temporal coverage
Institution sources
Selection criteria
Version control
🎯 1. Reference Standard Quality
Definition: The benchmark against which AI predictions are measured.
- Use “reference standard” not “ground truth”
- Minimum 3 independent annotators
- Document consensus methodology
- Report interobserver variability (Dice, κ)
CLAIM 2024 recommends avoiding “ground truth” — it implies certainty that rarely exists in medicine.
📋 2. Annotation Protocol
Definition: Standardized instructions for human labelers.
- Written guidelines with visual examples
- Training for all annotators
- Clear boundary definitions
- Handling of edge cases documented
VinDr-CXR used 17 radiologists with 8+ years experience, 3 per training image, 5 consensus for test set.
👥 3. Demographic Representation
Definition: Dataset reflects target population diversity.
- Age distribution documented
- Sex/gender balance reported
- Race/ethnicity when relevant
- Geographic/institutional diversity
Only 23.7% of FDA devices reported demographics — unacceptable for fair AI.
🔧 4. Technical Specifications
Definition: Image acquisition parameters documented.
- Scanner manufacturer/model
- Image resolution and bit depth
- Acquisition protocols (kVp, mAs, etc.)
- DICOM format with metadata
Heterogeneous scanners improve generalization but must be documented.
🔒 5. Privacy & De-identification
Definition: Patient data protection compliance.
- HIPAA/GDPR/local law compliance
- PHI removal from DICOM tags
- Facial structure removal (CT/MRI)
- Pseudonymization vs anonymization choice
Re-identification risk increases with multi-modal data linkage.
📜 6. Data Provenance
Definition: Complete documentation of data origins and history.
- Temporal coverage (collection dates)
- Institutional sources identified
- Selection/exclusion criteria
- Version control for dataset updates
Provenance enables reproducibility and bias tracing.
4. Sources of Bias in Medical Imaging Data
Understanding bias sources is essential for mitigation. Medical imaging AI faces four major bias categories:
📊 Bias Categories and Mitigation Strategies
| Bias Type | Source | Example | Mitigation |
|---|---|---|---|
| Representation Bias | Demographic undersampling | Training on 90% white patients | Multi-site diverse data collection |
| Measurement Bias | Label extraction methods | NLP from reports vs expert annotation | Multi-reader gold standard |
| Annotation Bias | Single-reader subjectivity | One radiologist’s interpretation | Consensus protocols, 3+ readers |
| Temporal Bias | Outdated training data | 2015 scanner protocols in 2025 | Continuous data refresh, drift monitoring |
5. The CLAIM 2024 Checklist Requirements
The updated CLAIM (Checklist for Artificial Intelligence in Medical Imaging) 2024 provides 52 items across categories. Key data-related requirements include:
5.1 Essential Data Documentation Items
- Data sources: Institutions, geographic regions, time periods
- Selection criteria: Inclusion/exclusion criteria for cases
- Annotation methodology: Software used, discrepancy resolution
5.2 Critical Terminology Updates
| Avoid | Use Instead | Reason |
|---|---|---|
| “Ground truth” | “Reference standard” | Acknowledges uncertainty in medical labels |
| “Validation set” | “Internal testing” / “Tuning” | Avoids confusion with clinical validation |
| “External validation” | “External testing” | Clearer meaning |
| “Gold standard” | “Reference standard” | No label is truly “gold” |
6. Data Quality Checklist for Ukrainian Hospitals
Based on international standards, here’s a practical checklist for Ukrainian healthcare facilities considering AI adoption or data collection:
📋 Pre-Deployment Data Assessment
1. Source Audit 2. Demographics Check 3. Protocol Match 4. Bias Scan 5. Quality Score6.1 Questions to Ask Vendors
- Training data source: Which hospitals/regions? What years?
- Dataset size: How many images total? Per class?
- Demographics: Age, sex, ethnicity distribution of training data?
- Annotation methodology: Who labeled? How many readers? What consensus?
- Scanner diversity: Which manufacturers? Protocol variations?
- External testing: Tested on data from outside training institutions?
- Subgroup performance: Metrics broken down by age, sex, pathology severity?
- Ukrainian testing: Has this model been tested on Ukrainian patient populations?
6.2 Minimum Data Quality Standards for ScanLab
| Criterion | Minimum Standard | Ideal Standard |
|---|---|---|
| Annotators per image | ≥2 radiologists | 3+ with consensus protocol |
| Annotator experience | ≥5 years radiology | ≥8 years, subspecialty certified |
| Interobserver agreement reported | Yes (κ or Dice) | Yes, with disagreement analysis |
| Demographic documentation | Age, sex distribution | Full demographics + subgroup metrics |
| Data partition method | Patient-level split | Patient-level + temporal + external |
| Scanner diversity | ≥2 manufacturers | ≥3 manufacturers, multiple sites |
| Ukrainian representation | Any Ukrainian data tested | Ukrainian data in training + testing |
7. Unique Conclusions and Synthesis
🔑 Novel Insights from This Analysis
- The Transparency Paradox: Despite FDA GMLP guidelines, medical AI remains a “black box” for data quality. Only 6.7% of approved devices reveal training data sources. This is not a technical limitation — it’s an accountability gap that Ukrainian regulators should not replicate.
- Quality Over Quantity: VinDr-CXR with 18,000 carefully annotated images outperforms models trained on 200,000+ NLP-labeled images for localization tasks. Ukrainian hospitals should prioritize multi-reader annotated datasets even if smaller.
- The “Reference Standard” Shift: The move from “ground truth” to “reference standard” terminology reflects a mature understanding that medical labels are probabilistic, not absolute. This philosophical shift should inform all Ukrainian AI procurement.
- Demographic Fairness is Technical: AI models can detect demographics from X-rays alone — and this correlates with unfair performance. Testing on Ukrainian populations is not optional; it’s essential for equitable care.
- The Annotation Cost-Quality Tradeoff: Multi-reader annotation (3-5 radiologists per image) costs 3-5x more than single-reader or NLP extraction. This cost is justified for clinical deployment but may be optimized for initial development with active learning strategies.
8. ScanLab Implementation Recommendations
For the ScanLab project, we recommend the following data quality framework:
- Establish a Ukrainian Reference Dataset: Partner with 2-3 major Ukrainian hospitals to create a multi-reader annotated test set (minimum 3,000 images) representing Ukrainian patient demographics and scanner fleet.
- Implement Vendor Vetting Protocol: Require all AI vendors to complete a standardized data quality questionnaire based on CLAIM 2024 before evaluation.
- Create Continuous Monitoring Dashboard: Track model performance by demographic subgroups over time to detect distribution shift and emerging biases.
- Develop Annotation Guidelines: Create Ukrainian-language annotation protocols with visual examples for common pathologies, enabling consistent future data collection.
- Build Data Quality Registry: Maintain a database of evaluated AI models with their data quality scores, enabling informed procurement decisions across Ukrainian healthcare.
9. Open Questions for Future Research
❓ Questions Generated
- What is the minimum dataset size specifically for Ukrainian patient populations given demographic differences from US/EU training data?
- How do equipment differences between Ukrainian hospitals (older vs. newer scanners) affect AI model generalization?
- Can synthetic data augmentation compensate for demographic underrepresentation, or does it introduce new biases?
- What is the cost-effectiveness threshold for multi-reader annotation in resource-constrained Ukrainian settings?
- How should Ukrainian regulations evolve to mandate data quality transparency that FDA guidelines failed to enforce?
10. References
- Mahbub et al. (2025). “Evaluating transparency in AI/ML model characteristics for FDA-reviewed medical devices.” npj Digital Medicine. Link
- Mongan J, et al. (2024). “Checklist for Artificial Intelligence in Medical Imaging (CLAIM): 2024 Update.” Radiology: Artificial Intelligence. PMC Link
- Stable et al. (2024). “RIDGE: Reproducibility, Integrity, Dependability, Generalizability, and Efficiency Assessment of Medical Image Segmentation Models.” Journal of Imaging Informatics in Medicine. Link
- Stable et al. (2024). “Image annotation and curation in radiology: an overview for machine learning practitioners.” European Radiology Experimental. PMC Link
- Stable et al. (2025). “Bias in artificial intelligence for medical imaging: fundamentals, detection, avoidance, mitigation, challenges, ethics, and prospects.” Diagnostic and Interventional Radiology. PMC Link
- Zhang et al. (2024). “The limits of fair medical imaging AI in real-world generalization.” Nature Medicine. Link
- Nguyen et al. (2022). “VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations.” Scientific Data. PMC Link
- Do S, Woo K. (2016). “How much data is needed to train a medical image deep learning system to achieve necessary high accuracy?” arXiv:1511.06348. Link
- Irvin J, et al. (2019). “CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison.” AAAI. Project Page
- Wang X, et al. (2017). “ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks.” CVPR.
Questions Answered
- Q: What data quality standards are required for medical imaging AI?
A: CLAIM 2024 defines essential requirements: documented data sources, multi-reader annotation with consensus protocols, demographic representation, patient-level data splits, acquisition protocol specifications, and privacy compliance. Current FDA transparency is alarmingly low (ACTR score 3.3/17). - Q: How much data is needed to train medical imaging AI?
A: Minimums range from 1,000-5,000 images for binary classification to 10,000+ for multi-class tasks. Transfer learning reduces requirements ~10x. Quality (multi-reader annotation) matters more than quantity (NLP-extracted labels). - Q: What are the sources of bias and how do we mitigate them?
A: Major sources include representation bias (demographic undersampling), measurement bias (NLP label extraction), annotation bias (single-reader subjectivity), and temporal bias (data age). Mitigation requires diverse multi-site data, multi-reader annotation, demographic stratification, and continuous monitoring post-deployment.
Series: Machine Learning for Medical Diagnosis | Article 5 of 35 | Stabilarity Hub