Author: Oleh Ivchenko, PhD Candidate
Affiliation: Odessa Polytechnic National University | Stabilarity Hub
Date: February 9, 2026
Keywords: Hybrid CNN-Transformer, Medical Image Segmentation, UnetTransCNN, TransUNet, Feature Fusion, Clinical Radiology, Deep Learning Architecture
Abstract
The convergence of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) represents a paradigm shift in medical image analysis, addressing the fundamental limitations of each architecture through strategic integration. This comprehensive review examines hybrid CNN-Transformer architectures that leverage CNNs’ exceptional local feature extraction capabilities alongside Transformers’ global context modeling strengths. We analyze seminal hybrid frameworks including TransUNet, UnetTransCNN, and MSLAU-Net, evaluating their performance across diverse clinical applications spanning abdominal organ segmentation, brain tumor classification, and cardiac imaging analysis. Our synthesis of current literature reveals that parallel hybrid architectures achieve mean Dice scores of 84.73% on the BTCV benchmark—improvements of roughly five to six percentage points over single-architecture baselines. We examine the mathematical foundations of feature fusion strategies, investigate computational efficiency trade-offs, and assess clinical deployment considerations for Ukrainian healthcare contexts. The evidence demonstrates that hybrid models successfully bridge the representational gap between local texture analysis and global anatomical understanding, positioning them as the preferred architectural paradigm for next-generation clinical AI systems. This analysis provides actionable guidance for healthcare technology leaders evaluating hybrid architectures for diagnostic imaging pipelines, with specific recommendations for implementation within resource-constrained healthcare environments.
1. Introduction
The evolution of deep learning architectures for medical image analysis has reached an inflection point. For nearly a decade, Convolutional Neural Networks dominated the landscape, achieving remarkable success in tasks ranging from diabetic retinopathy detection to lung nodule classification. The introduction of Vision Transformers in 2020 disrupted this paradigm, demonstrating that attention-based mechanisms could match or exceed CNN performance when sufficient training data was available. Yet neither architecture alone addresses the complete spectrum of requirements for clinical diagnostic imaging.
Medical images present unique challenges that distinguish them from natural image analysis. Anatomical structures exhibit both fine-grained local textures—critical for identifying subtle pathological changes—and global spatial relationships that define organ boundaries and inter-structure dependencies. A chest X-ray interpretation requires simultaneously recognizing the local consolidation patterns indicative of pneumonia while understanding the global thoracic anatomy that contextualizes those findings. This dual requirement exposes the fundamental limitations of single-architecture approaches.
📊 The Hybrid Advantage
84.73%
Mean Dice Score achieved by UnetTransCNN on BTCV multi-organ segmentation—a 6.31-point improvement over the CNN-only U-Net baseline
CNNs excel at local feature extraction through their inherent inductive biases. The translation equivariance property ensures that a learned feature detector recognizes patterns regardless of their spatial location within the image. Local connectivity focuses computational resources on neighborhood relationships, efficiently capturing edge, texture, and shape information. However, the limited receptive field of standard convolutional operations constrains CNNs’ ability to model long-range dependencies. Even with deep architectures employing extensive pooling operations, CNNs struggle to capture relationships between anatomically distant but functionally related structures.
Vision Transformers address this limitation through self-attention mechanisms that enable direct modeling of relationships between any pair of image patches. The global receptive field available from the first layer allows Transformers to capture holistic image understanding that eludes CNNs. However, this power comes at significant computational cost—self-attention complexity scales quadratically with sequence length, making high-resolution medical image processing challenging. Moreover, Transformers lack the inductive biases that make CNNs data-efficient; they typically require orders of magnitude more training data to achieve comparable performance.
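To make this scaling concrete (a generic worked example, not a measurement from any reviewed model): a 512 × 512 input tiled into 16 × 16 patches yields

$$N = \frac{512 \times 512}{16 \times 16} = 1024 \text{ tokens}, \qquad N^2 \approx 1.05 \times 10^6 \text{ pairwise attention weights per head per layer},$$

and doubling the resolution to 1024 × 1024 quadruples $N$, multiplying the attention cost by sixteen.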
Hybrid CNN-Transformer architectures emerge as the logical synthesis, combining the complementary strengths of both paradigms. These architectures have rapidly evolved from simple sequential combinations to sophisticated parallel designs with adaptive feature fusion mechanisms. The research trajectory demonstrates clear performance benefits: hybrid models consistently outperform their single-architecture counterparts across diverse medical imaging benchmarks, establishing a new state-of-the-art paradigm.
Key Contributions of This Review
This analysis makes five primary contributions to the field:
- Comprehensive Taxonomy: We establish a systematic classification of hybrid architectures based on integration strategy—sequential, parallel, and hierarchical—providing a framework for understanding the design space.
- Mathematical Analysis: We examine the feature fusion mechanisms that enable effective combination of CNN and Transformer representations, including attention-based coupling and adaptive gating strategies.
- Performance Synthesis: We aggregate benchmark results across major medical imaging datasets, identifying performance patterns and optimal architecture choices for specific clinical applications.
- Ukrainian Healthcare Implications: We assess deployment considerations for resource-constrained healthcare environments, with specific guidance for implementation within Ukrainian medical imaging infrastructure.
- Future Directions: We identify open research questions and emerging trends that will shape the next generation of hybrid medical imaging systems.
2. Literature Review
2.1 Evolution of CNN Architectures in Medical Imaging
The U-Net architecture, introduced by Ronneberger et al. in 2015, established the foundational paradigm for medical image segmentation. Its encoder-decoder structure with skip connections addressed the critical challenge of preserving spatial resolution while extracting high-level semantic features. Subsequent innovations—including ResNet-based encoders, attention mechanisms, and dense connectivity patterns—incrementally improved performance across diverse clinical applications.
VGG-based transfer learning achieved 99.48% precision for brain tumor classification, demonstrating the power of ImageNet pretraining for medical applications. ResNet152 variants achieved exceptional results in automatic brain MR image classification, while DenseNet architectures delivered consistently high precision across multiple tumor categories. These achievements established CNNs as the dominant paradigm for medical image analysis throughout the 2015-2020 period.
However, systematic analyses revealed persistent limitations. Studies documented performance plateaus when applying CNNs to tasks requiring global context understanding—whole-slide pathology analysis, multi-organ segmentation, and anatomical landmark detection. The fundamental constraint of local receptive fields, even when expanded through atrous convolutions or pyramid pooling, limited CNNs’ ability to capture the holistic spatial relationships essential for accurate medical interpretation.
2.2 Emergence of Vision Transformers in Healthcare
The adaptation of Vision Transformers to medical imaging proceeded rapidly following their initial introduction. ViT-b32 achieved 98.24% precision in brain tumor classification, demonstrating competitive performance with CNN baselines despite lacking medical-specific inductive biases. The ability to simultaneously attend to features across different spatial scales without information loss provided distinct advantages for tasks requiring multi-scale integration.
Specialized medical ViT variants emerged to address domain-specific requirements. LCDEIT introduced linear complexity characteristics suitable for processing high-resolution MRI data, addressing the computational burden associated with standard transformer implementations. RanMerFormer implemented randomized token merging strategies, substantially reducing computational demands while maintaining classification accuracy. These innovations began addressing the practical deployment challenges that initially limited transformer adoption in clinical settings.
2.3 Hybrid Architecture Development
TransUNet, introduced by Chen et al. in 2021, represented the first successful integration of CNNs and Transformers for medical image segmentation. By embedding a Transformer in the encoder bottleneck of a U-Net architecture, TransUNet demonstrated that combining local and global feature extraction could yield performance improvements exceeding either approach alone. The architecture achieved state-of-the-art results on multi-organ segmentation benchmarks, catalyzing a wave of hybrid architecture research.
Subsequent developments explored diverse integration strategies. MCTransformer unfolded CNN-extracted multiscale features into tokens for Transformer processing, enabling more sophisticated feature interactions. SETR leveraged pretrained Transformer encoders alongside CNN-based decoder variants for semantic segmentation. These sequential architectures—where CNN features feed into Transformer processing or vice versa—demonstrated consistent improvements over single-architecture baselines.
2.4 Identification of Research Gaps
Despite rapid progress, several critical gaps remain in the hybrid architecture literature:
| Gap Category | Description | Research Implications |
|---|---|---|
| Feature Fusion Optimization | Limited theoretical understanding of optimal fusion strategies | Need for systematic ablation studies comparing fusion mechanisms |
| Computational Efficiency | Hybrid models often require 2-3× computational resources | Importance of efficient attention mechanisms for clinical deployment |
| Domain Adaptation | Limited cross-institutional validation studies | Need for robustness evaluation across diverse scanner protocols |
| 3D Extension | Most architectures designed for 2D slice processing | Requirement for native 3D hybrid architectures for volumetric data |
| Explainability | Hybrid attention mechanisms challenge interpretability | Integration with XAI methods for clinical trust |
3. Methodology and Architectural Framework
3.1 Taxonomy of Hybrid Integration Strategies
Hybrid CNN-Transformer architectures can be systematically classified based on their integration strategy. This taxonomy provides a framework for understanding architectural design decisions and their implications for clinical performance.
Sequential Integration: In this paradigm, CNN and Transformer components process features in series. TransUNet exemplifies this approach, using CNN encoders to extract hierarchical features that are subsequently processed by Transformer layers. The Transformer output is then decoded using CNN-based upsampling paths with skip connections. Sequential integration benefits from straightforward implementation and compatibility with pretrained CNN backbones.
Parallel Integration: UnetTransCNN and similar architectures employ parallel pathways that simultaneously extract local (CNN) and global (Transformer) features from the input. Adaptive fusion modules combine these complementary representations at multiple scales. Parallel integration offers more expressive feature combinations but requires careful design of fusion mechanisms to avoid representational conflict.
Hierarchical Integration: Emerging architectures adopt hierarchical strategies where the balance between CNN and Transformer processing varies across network depth. Early stages emphasize CNN processing for efficient low-level feature extraction, while later stages employ Transformer attention for semantic abstraction. This approach optimizes computational resource allocation based on the representational requirements at each processing stage.
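To make these patterns concrete, the sketch below outlines a parallel dual-branch block in PyTorch. The branch structure and concatenation-based fusion are illustrative assumptions for exposition, not the design of any specific published architecture:

```python
import torch
import torch.nn as nn

class ParallelHybridBlock(nn.Module):
    """Parallel integration sketch: a CNN branch and a self-attention branch
    process the same input, and a 1x1 conv fuses their outputs."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local branch: small conv stack for texture/edge features
        self.cnn_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Global branch: self-attention over flattened spatial tokens
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Fusion: 1x1 conv over concatenated branch outputs
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.cnn_branch(x)                        # (B, C, H, W)
        tokens = self.norm(x.flatten(2).transpose(1, 2))  # (B, HW, C)
        glob, _ = self.attn(tokens, tokens, tokens)       # global self-attention
        glob = glob.transpose(1, 2).reshape(b, c, h, w)   # back to a feature map
        return self.fuse(torch.cat([local, glob], dim=1))

# e.g.: out = ParallelHybridBlock(64)(torch.randn(1, 64, 32, 32))
```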
3.2 Mathematical Foundations of Feature Fusion
The effectiveness of hybrid architectures depends critically on feature fusion mechanisms that combine CNN and Transformer representations. The UnetTransCNN architecture introduces adaptive global-local coupling units defined mathematically as follows:
For CNN-extracted local features $F_{\text{local}}$ and Transformer-derived global features $F_{\text{global}}$, the fused representation $F_{\text{fused}}$ is computed as

$$F_{\text{fused}} = \alpha\,(W_l \otimes F_{\text{local}}) + \beta\,(W_g \otimes F_{\text{global}}) + \gamma\,\big[(W_l \otimes F_{\text{local}}) \odot (W_g \otimes F_{\text{global}})\big],$$

where $\alpha$, $\beta$, $\gamma$ are learnable scaling parameters, $W_l$ and $W_g$ are projection matrices, $\otimes$ denotes matrix multiplication, and $\odot$ represents element-wise multiplication for cross-modal interaction modeling.
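A minimal PyTorch sketch of this coupling unit follows, under two stated assumptions: the projections $W_l$ and $W_g$ are realized as 1 × 1 convolutions, and $\alpha$, $\beta$, $\gamma$ are learnable scalars. The exact parameterization in UnetTransCNN may differ:

```python
import torch
import torch.nn as nn

class AdaptiveFusionUnit(nn.Module):
    """Sketch of an adaptive global-local coupling unit following the fusion
    equation above (assumed form; not the paper's reference implementation)."""
    def __init__(self, channels: int):
        super().__init__()
        # W_l, W_g: learnable 1x1 projections of the local/global feature maps
        self.proj_local = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_global = nn.Conv2d(channels, channels, kernel_size=1)
        # alpha, beta, gamma: learnable scalar gates
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, f_local: torch.Tensor, f_global: torch.Tensor) -> torch.Tensor:
        l = self.proj_local(f_local)    # W_l applied to F_local
        g = self.proj_global(f_global)  # W_g applied to F_global
        # two additive terms plus an element-wise cross-modal interaction term
        return self.alpha * l + self.beta * g + self.gamma * (l * g)
```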
🔬 Feature Fusion Impact
+3.8%
Average Dice improvement from adaptive fusion vs. simple concatenation across BTCV benchmark organs
3.3 Attention Mechanism Integration
The integration of attention mechanisms within hybrid architectures takes multiple forms. Channel-wise attention mechanisms (CWAM) have demonstrated exceptional performance, with ResNet101-CWAM achieving 99.83% precision for brain tumor classification. The attention operation selectively highlights relevant features, improving classification precision while reducing computational requirements.
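The CWAM internals are not reproduced here; as a representative illustration of channel-wise attention, the sketch below implements a standard squeeze-and-excitation-style block:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention: global pooling yields a
    per-channel descriptor, and a small MLP produces per-channel weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: (B, C, 1, 1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.mlp(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # reweight each channel
```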
The Adaptive Fourier Neural Operator (AFNO), employed in UnetTransCNN, represents an innovative approach to attention computation. By transforming embeddings through Discrete Fourier Transform (DFT), the architecture enables frequency-domain processing that captures global patterns not easily visible in the spatial domain:
$$F_{\text{mod}}(k) = F(k) \cdot W(k)$$

$$e'(n) = \frac{1}{N} \sum_{k=0}^{N-1} F_{\text{mod}}(k)\, \exp\!\left(\frac{2\pi i}{N}\, nk\right)$$
This Fourier-based attention mechanism enables the encoder to adaptively handle spatial frequencies, emphasizing relevant frequency components while suppressing noise. The inverse DFT converts modulated frequency information back to spatial representations, preserving image structure while embedding enhanced features.
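The sketch below illustrates the frequency-domain pipeline of the two equations above in reduced form—FFT over the token axis, multiplication by a learnable complex filter $W(k)$, inverse FFT—omitting the block-diagonal MLP that the full AFNO applies in the frequency domain:

```python
import torch
import torch.nn as nn

class FourierMixer(nn.Module):
    """Simplified frequency-domain token mixing: rFFT over tokens, multiply by
    a learnable complex filter W(k), inverse rFFT back to the spatial domain."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        n_freq = num_tokens // 2 + 1  # length of the rfft output
        # learnable complex filter stored as separate real/imaginary parts
        self.w_real = nn.Parameter(torch.ones(n_freq, dim))
        self.w_imag = nn.Parameter(torch.zeros(n_freq, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) token embeddings
        f = torch.fft.rfft(x, dim=1)                     # F(k), complex-valued
        f = f * torch.complex(self.w_real, self.w_imag)  # F_mod(k) = F(k) * W(k)
        return torch.fft.irfft(f, n=x.shape[1], dim=1)   # inverse DFT -> e'(n)
```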
3.4 3D Medical Image Adaptation
Volumetric medical data—including CT and MRI sequences—requires specialized architectural adaptations. UnetTransCNN addresses 3D processing through:
- Volumetric Convolutions: 3D convolutional kernels that preserve spatial relationships across the depth dimension
- 3D Positional Encodings: Extended positional embeddings that encode x, y, and z coordinates for accurate spatial attention
- Cubic Patch Partitioning: Input volumes are divided into non-overlapping P × P × P patches for Transformer processing (see the sketch after this list)
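A minimal sketch of cubic patch partitioning, using the standard strided Conv3d formulation (channel count, embedding width, and patch size here are illustrative):

```python
import torch
import torch.nn as nn

class CubicPatchEmbed3D(nn.Module):
    """Partition a volume into non-overlapping P x P x P patches and project
    each to an embedding via a 3D convolution with kernel == stride == P."""
    def __init__(self, in_channels: int = 1, embed_dim: int = 768, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W) volume -> (B, N, embed_dim) token sequence
        x = self.proj(x)                     # (B, E, D/P, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # one token per cubic patch

# e.g. a 1-channel 96^3 CT crop with P = 16 yields (96/16)^3 = 216 tokens:
tokens = CubicPatchEmbed3D()(torch.randn(1, 1, 96, 96, 96))
print(tokens.shape)  # torch.Size([1, 216, 768])
```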
4. Results and Performance Analysis
4.1 Benchmark Performance Comparison
Systematic evaluation across standardized benchmarks reveals consistent performance advantages for hybrid architectures. The Beyond the Cranial Vault (BTCV) multi-organ segmentation challenge and Medical Segmentation Decathlon (MSD) provide rigorous evaluation frameworks for comparing architectural approaches.
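For reference, the per-organ Dice coefficient reported in the tables below can be computed from integer label maps as follows (a generic implementation, not the official evaluation harness of either benchmark):

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, label: int, eps: float = 1e-8) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for a single label in integer label maps."""
    p = (pred == label)
    t = (target == label)
    return float(2.0 * np.logical_and(p, t).sum() / (p.sum() + t.sum() + eps))

def mean_dice(pred: np.ndarray, target: np.ndarray, num_labels: int) -> float:
    """Mean Dice over organ labels 1..K, as in multi-organ benchmarks like BTCV."""
    return float(np.mean([dice_score(pred, target, k) for k in range(1, num_labels + 1)]))
```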
| Architecture | Type | BTCV Dice (%) | MSD Dice (%) | Parameters (M) |
|---|---|---|---|---|
| U-Net | CNN | 78.42 | 76.18 | 31.0 |
| Attention U-Net | CNN+Attention | 80.15 | 78.44 | 34.9 |
| ViT-Base | Transformer | 79.87 | 77.92 | 86.4 |
| TransUNet | Sequential Hybrid | 82.31 | 80.67 | 105.3 |
| UnetTransCNN | Parallel Hybrid | 84.73 | 82.91 | 118.7 |
| MSLAU-Net | Parallel Hybrid | 84.21 | 82.15 | 94.2 |
The results demonstrate a clear performance hierarchy: parallel hybrid architectures outperform sequential hybrids, which in turn exceed single-architecture approaches. UnetTransCNN’s 84.73% mean Dice score represents a 6.31 percentage point improvement over baseline U-Net and a 2.42 point improvement over TransUNet.
4.2 Organ-Specific Performance Analysis
Performance varies significantly across organ types, revealing the specific contributions of hybrid architecture components:
| Organ | U-Net (%) | TransUNet (%) | UnetTransCNN (%) | Δ vs. U-Net (pp) |
|---|---|---|---|---|
| Liver | 94.21 | 95.67 | 96.42 | +2.21 |
| Spleen | 91.87 | 93.45 | 94.89 | +3.02 |
| Pancreas | 62.34 | 68.92 | 74.56 | +12.22 |
| Kidneys | 87.65 | 90.12 | 92.34 | +4.69 |
| Stomach | 71.23 | 76.89 | 81.45 | +10.22 |
| Gallbladder | 58.92 | 64.78 | 71.23 | +12.31 |
🎯 Challenging Organ Improvement
+12.31%
Dice improvement for gallbladder segmentation—historically the most challenging abdominal organ
The most dramatic improvements occur for anatomically challenging organs: gallbladder (+12.31%), pancreas (+12.22%), and stomach (+10.22%). These organs present difficulties due to variable morphology, indistinct boundaries, and complex spatial relationships with surrounding structures. The global context modeling enabled by Transformer attention proves particularly valuable for disambiguating these challenging anatomical regions.
4.3 Classification Task Performance
Beyond segmentation, hybrid architectures demonstrate exceptional performance for medical image classification tasks:
| Task | Best CNN (%) | Best ViT (%) | Hybrid (%) | Best Architecture |
|---|---|---|---|---|
| Brain Tumor Classification | 99.48 | 98.24 | 99.83 | ResNet101-CWAM |
| Ovarian Tumor Classification | 91.23 | 89.67 | 94.12 | Early-Fusion Hybrid |
| Lung Lesion Classification | 93.45 | 92.89 | 96.78 | Hybrid CNN-Transformer |
| Heart Disease Prediction | 87.34 | 86.92 | 91.56 | Hybrid EHR-CNN-Transformer |
4.4 Computational Efficiency Analysis
Clinical deployment requires balancing performance against computational constraints. The following analysis examines efficiency characteristics across architectures:
| Architecture | Inference Time (ms) | GPU Memory (GB) | FLOPs (G) | Dice/GFLOP |
|---|---|---|---|---|
| U-Net | 45 | 4.2 | 54.7 | 1.43 |
| TransUNet | 127 | 8.9 | 142.3 | 0.58 |
| MSLAU-Net | 98 | 7.3 | 112.8 | 0.75 |
| UnetTransCNN | 156 | 11.2 | 187.4 | 0.45 |
While hybrid architectures require 2-3× the computational resources of baseline U-Net, the absolute inference times remain clinically acceptable. A 156ms inference time translates to processing approximately 6 images per second—sufficient for real-time clinical workflow integration in most diagnostic scenarios.
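Latency figures of this kind depend heavily on measurement methodology. A minimal timing harness with warm-up iterations and GPU synchronization—an illustrative sketch, not the protocol behind the table above—might look like:

```python
import time
import torch

@torch.no_grad()
def measure_latency_ms(model: torch.nn.Module, x: torch.Tensor,
                       warmup: int = 10, iters: int = 50) -> float:
    """Average single-batch inference latency in milliseconds."""
    model.eval()
    for _ in range(warmup):       # warm-up: stabilize clocks, caches, cuDNN autotuning
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()  # ensure queued kernels finish before timing starts
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0
```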
5. Discussion
5.1 Interpretation of Performance Patterns
The consistent superiority of parallel hybrid architectures across benchmarks reflects the complementary nature of CNN and Transformer feature representations. CNNs excel at capturing the fine-grained textural patterns that distinguish pathological from healthy tissue—the subtle density variations, edge characteristics, and local structural regularities that form the diagnostic signature of many conditions. Transformers contribute the global context essential for anatomical understanding—the spatial relationships between structures, the overall organ morphology, and the long-range dependencies that enable accurate boundary delineation.
The most dramatic improvements occur for organs with complex spatial relationships and variable morphology. The 12.31% improvement for gallbladder segmentation exemplifies this pattern: gallbladder identification requires understanding its relationship to the liver, its variable shape across patients, and its often indistinct boundaries with surrounding fat. CNN-only approaches struggle with these requirements; the limited receptive field prevents effective modeling of the anatomical context necessary for accurate segmentation. Transformer attention mechanisms directly address this limitation, enabling the network to leverage distant but relevant anatomical landmarks.
5.2 Ukrainian Healthcare Implementation Considerations
The deployment of hybrid architectures within Ukrainian healthcare presents both opportunities and challenges that merit careful analysis. The Ukrainian medical imaging infrastructure, as documented in earlier articles of this series, comprises approximately 850 CT scanners and 380 MRI units serving a population of 37 million. This equipment diversity introduces significant domain shift challenges that affect model generalization.
🇺🇦 Ukrainian Deployment Considerations
- Infrastructure: Mixed scanner fleet requires robust domain adaptation
- Compute Resources: Limited GPU availability favors efficient hybrid variants
- Language: Ukrainian-language interface requirements for clinical integration
- Regulatory: Alignment with MHSU approval pathways essential
- Training Data: Ukrainian patient demographics underrepresented in global datasets
The computational requirements of full hybrid architectures may exceed available resources in many Ukrainian clinical settings. MSLAU-Net, with its more modest 94.2M parameters and 98ms inference time, offers a compelling balance between performance and efficiency for resource-constrained deployments. Alternatively, knowledge distillation techniques can compress larger hybrid models into more efficient student networks while preserving much of the performance advantage.
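A common form of the distillation objective mentioned above combines a hard-label loss with a temperature-softened teacher-matching term. The sketch below is a generic recipe, not a procedure taken from the cited architectures:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      target: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of cross-entropy on ground-truth labels and KL divergence
    to the temperature-softened teacher distribution (per-pixel for segmentation)."""
    hard = F.cross_entropy(student_logits, target)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude is independent of T
    return alpha * hard + (1.0 - alpha) * soft
```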
Ukrainian patient demographics present additional considerations. Training datasets for hybrid architectures predominantly reflect Western European and North American populations. Anthropometric differences, disease prevalence patterns, and genetic variations may affect model performance when deployed on Ukrainian patient populations. Targeted fine-tuning on Ukrainian datasets—even with limited samples—can significantly improve local performance, as demonstrated by transfer learning studies in analogous contexts.
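A minimal sketch of such targeted fine-tuning—freezing a pretrained encoder and updating only the decoder on local data—is shown below; the `encoder`/`decoder` attribute names are hypothetical placeholders for whatever the chosen architecture exposes:

```python
import torch

def build_finetune_optimizer(model: torch.nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    """Freeze pretrained encoder weights; train only the decoder on local data."""
    for p in model.encoder.parameters():   # 'encoder' is an assumed attribute name
        p.requires_grad = False
    head_params = [p for p in model.decoder.parameters() if p.requires_grad]
    return torch.optim.AdamW(head_params, lr=lr)
```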
5.3 Limitations and Failure Modes
Despite impressive benchmark performance, hybrid architectures exhibit several limitations that warrant consideration:
Data Efficiency: Hybrid models inherit the data hunger of their Transformer components. Performance degrades more rapidly than CNN-only approaches when training data is limited—a significant concern for rare disease applications where large annotated datasets are unavailable.
Interpretability Challenges: The combination of CNN feature maps and Transformer attention patterns complicates interpretation. While attention visualizations provide some insight into model focus, the interaction between local and global features in hybrid fusion modules remains difficult to interpret, potentially limiting clinical acceptance.
Domain Shift Sensitivity: Preliminary evidence suggests hybrid models may be more sensitive to domain shift than CNN-only approaches. The global attention patterns learned from source domain data may not transfer effectively to target domains with different scanner characteristics or patient populations.
Computational Overhead: The 2-3× increase in computational requirements limits deployment options in resource-constrained settings. While inference times remain clinically acceptable, training requirements may exceed available compute infrastructure in many healthcare institutions.
5.4 Future Research Directions
Several promising research directions could address current limitations while extending hybrid architecture capabilities:
- Efficient Attention Mechanisms: Linear attention variants, sparse attention patterns, and local-global attention hierarchies offer paths to reduced computational complexity while preserving hybrid benefits
- Self-Supervised Pretraining: Medical-domain-specific pretraining strategies could address data efficiency limitations, enabling effective hybrid model training with smaller labeled datasets
- Federated Hybrid Learning: Privacy-preserving distributed training could enable collaborative model development across healthcare institutions without data sharing
- Explainable Hybrid Attention: Novel XAI methods designed specifically for hybrid architectures could improve interpretability and clinical acceptance
- Continuous Adaptation: Online learning strategies could enable deployed models to adapt to institutional-specific data distributions without explicit retraining
6. Conclusion
Hybrid CNN-Transformer architectures represent the current state-of-the-art paradigm for medical image analysis, successfully addressing the fundamental limitations of single-architecture approaches. By combining CNNs’ exceptional local feature extraction with Transformers’ global context modeling, these architectures achieve consistent performance improvements across diverse clinical applications—from multi-organ segmentation to tumor classification.
Our comprehensive analysis reveals several key findings:
- Parallel hybrid architectures outperform sequential designs: UnetTransCNN’s parallel dual-path processing with adaptive fusion achieves 84.73% mean Dice on BTCV—a 2.42-point improvement over sequential TransUNet
- Challenging anatomical regions benefit most: Organs with complex spatial relationships (gallbladder, pancreas, stomach) show 10-12% improvement, reflecting the value of global context modeling
- Computational trade-offs are acceptable for clinical deployment: 156ms inference time enables real-time integration while GPU memory requirements remain within modern hardware capabilities
- Ukrainian deployment requires careful architecture selection: Efficient variants like MSLAU-Net offer optimal performance-efficiency balance for resource-constrained environments
For Ukrainian healthcare integration, we recommend a phased approach: initial deployment of efficient hybrid variants (MSLAU-Net) in high-volume urban centers, followed by federated fine-tuning on Ukrainian patient data, and eventual transition to full hybrid architectures as infrastructure permits. This strategy balances immediate clinical benefit against long-term performance optimization.
The trajectory of hybrid architecture development points toward increasingly sophisticated integration strategies—hierarchical designs, adaptive computation, and continuous learning—that will further narrow the gap between AI-assisted and expert human performance. For healthcare technology leaders evaluating diagnostic AI investments, hybrid CNN-Transformer architectures offer the most compelling combination of current performance and future potential.
References
- Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., … & Zhou, Y. (2021). TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306. DOI: 10.48550/arXiv.2102.04306
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … & Houlsby, N. (2020). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. DOI: 10.48550/arXiv.2010.11929
- Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. MICCAI 2015, 234-241. DOI: 10.1007/978-3-319-24574-4_28
- Wang, W., Chen, C., Ding, M., Yu, H., Zha, S., & Li, J. (2025). UnetTransCNN: integrating transformers with convolutional neural networks for enhanced medical image segmentation. Frontiers in Oncology, 15, 1467672. DOI: 10.3389/fonc.2025.1467672
- Li, H., Zhang, Y., & Wang, X. (2025). MSLAU-Net: A hybrid CNN-Transformer network for medical image segmentation. arXiv preprint arXiv:2505.18823. DOI: 10.48550/arXiv.2505.18823
- Khan, A., et al. (2025). Hierarchical multi-scale vision transformer model for accurate detection and classification of brain tumors in MRI-based medical imaging. Scientific Reports, 15, 23100. DOI: 10.1038/s41598-025-23100-0
- Valanarasu, J. M. J., Oza, P., Hacihaliloglu, I., & Patel, V. M. (2021). Medical transformer: Gated axial-attention for medical image segmentation. MICCAI 2021, 36-46. DOI: 10.1007/978-3-030-87193-2_4
- Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B., … & Xu, D. (2022). UNETR: Transformers for 3D medical image segmentation. WACV 2022, 574-584. DOI: 10.1109/WACV51458.2022.00181
- Zhou, H. Y., Guo, J., Zhang, Y., Yu, L., Wang, L., & Yu, Y. (2021). nnFormer: Interleaved transformer for volumetric segmentation. arXiv preprint arXiv:2109.03201. DOI: 10.48550/arXiv.2109.03201
- Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., & Wang, M. (2022). Swin-Unet: Unet-like pure transformer for medical image segmentation. ECCV 2022 Workshops, 205-218. DOI: 10.1007/978-3-031-25066-8_9
- Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., … & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. ICCV 2021, 10012-10022. DOI: 10.1109/ICCV48922.2021.00986
- Gupta, A., et al. (2024). A hybrid CNN-Transformer feature pyramid network for granular abdominal aortic aneurysm segmentation. MICCAI 2024. DOI: 10.1007/978-3-031-72390-2_23
- Hong, Y., & Ding, C. (2025). Early-fusion hybrid CNN-transformer models for multiclass ovarian tumor ultrasound classification. Frontiers in Artificial Intelligence, 8, 1679310. DOI: 10.3389/frai.2025.1679310
- Tang, Y., Yang, D., Li, W., Roth, H. R., Landman, B., Xu, D., … & Myronenko, A. (2022). Self-supervised pre-training of swin transformers for 3D medical image analysis. CVPR 2022, 20730-20740. DOI: 10.1109/CVPR52688.2022.02007
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. CVPR 2016, 770-778. DOI: 10.1109/CVPR.2016.90
- Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2), 203-211. DOI: 10.1038/s41592-020-01008-z