Author: Oleh Ivchenko, PhD Candidate
Affiliation: Odessa Polytechnic National University | Stabilarity Hub
Date: February 9, 2026
Keywords: Hybrid CNN-Transformer, Medical Image Segmentation, UnetTransCNN, TransUNet, Feature Fusion, Clinical Radiology, Deep Learning Architecture
Abstract
The convergence of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) represents a paradigm shift in medical image analysis, addressing the fundamental limitations of each architecture through strategic integration. This comprehensive review examines hybrid CNN-Transformer architectures that leverage CNNs’ exceptional local feature extraction capabilities alongside Transformers’ global context modeling strengths. We analyze seminal hybrid frameworks including TransUNet, UnetTransCNN, and MSLAU-Net, evaluating their performance across diverse clinical applications spanning abdominal organ segmentation, brain tumor classification, and cardiac imaging analysis. Our synthesis of current literature reveals that parallel hybrid architectures achieve mean Dice scores of 84.73% on the BTCV benchmark—improvements of roughly five to six percentage points over single-architecture baselines. We examine the mathematical foundations of feature fusion strategies, investigate computational efficiency trade-offs, and assess clinical deployment considerations for Ukrainian healthcare contexts. The evidence demonstrates that hybrid models successfully bridge the representational gap between local texture analysis and global anatomical understanding, positioning them as the preferred architectural paradigm for next-generation clinical AI systems. This analysis provides actionable guidance for healthcare technology leaders evaluating hybrid architectures for diagnostic imaging pipelines, with specific recommendations for implementation within resource-constrained healthcare environments.
1. Introduction
The evolution of deep learning architectures for medical image analysis has reached an inflection point. For nearly a decade, Convolutional Neural Networks dominated the landscape, achieving remarkable success in tasks ranging from diabetic retinopathy detection to lung nodule classification. The introduction of Vision Transformers in 2020 disrupted this paradigm, demonstrating that attention-based mechanisms could match or exceed CNN performance when sufficient training data was available. Yet neither architecture alone addresses the complete spectrum of requirements for clinical diagnostic imaging.
Medical images present unique challenges that distinguish them from natural image analysis. Anatomical structures exhibit both fine-grained local textures—critical for identifying subtle pathological changes—and global spatial relationships that define organ boundaries and inter-structure dependencies. A chest X-ray interpretation requires simultaneously recognizing the local consolidation patterns indicative of pneumonia while understanding the global thoracic anatomy that contextualizes those findings. This dual requirement exposes the fundamental limitations of single-architecture approaches.
📊 The Hybrid Advantage
84.73%
Mean Dice Score achieved by UnetTransCNN on BTCV multi-organ segmentation—a 6.31-point improvement over the CNN-only U-Net baseline
CNNs excel at local feature extraction through their inherent inductive biases. The translation equivariance property ensures that a learned feature detector recognizes patterns regardless of their spatial location within the image. Local connectivity focuses computational resources on neighborhood relationships, efficiently capturing edge, texture, and shape information. However, the limited receptive field of standard convolutional operations constrains CNNs’ ability to model long-range dependencies. Even with deep architectures employing extensive pooling operations, CNNs struggle to capture relationships between anatomically distant but functionally related structures.
Vision Transformers address this limitation through self-attention mechanisms that enable direct modeling of relationships between any pair of image patches. The global receptive field available from the first layer allows Transformers to capture holistic image understanding that eludes CNNs. However, this power comes at significant computational cost—self-attention complexity scales quadratically with sequence length, making high-resolution medical image processing challenging. Moreover, Transformers lack the inductive biases that make CNNs data-efficient; they typically require orders of magnitude more training data to achieve comparable performance.
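To make this scaling concrete (a generic worked example, not a measurement from any reviewed model): a 512 × 512 input tiled into 16 × 16 patches yields

$$N = \frac{512 \times 512}{16 \times 16} = 1024 \text{ tokens}, \qquad N^2 \approx 1.05 \times 10^6 \text{ pairwise attention weights per head per layer},$$

and doubling the resolution to 1024 × 1024 quadruples $N$, multiplying the attention cost by sixteen.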
Hybrid CNN-Transformer architectures emerge as the logical synthesis, combining the complementary strengths of both paradigms. These architectures have rapidly evolved from simple sequential combinations to sophisticated parallel designs with adaptive feature fusion mechanisms. The research trajectory demonstrates clear performance benefits: hybrid models consistently outperform their single-architecture counterparts across diverse medical imaging benchmarks, establishing a new state-of-the-art paradigm.
Key Contributions of This Review
This analysis makes five primary contributions to the field:
- Comprehensive Taxonomy: We establish a systematic classification of hybrid architectures based on integration strategy—sequential, parallel, and hierarchical—providing a framework for understanding the design space.
- Mathematical Analysis: We examine the feature fusion mechanisms that enable effective combination of CNN and Transformer representations, including attention-based coupling and adaptive gating strategies.
- Performance Synthesis: We aggregate benchmark results across major medical imaging datasets, identifying performance patterns and optimal architecture choices for specific clinical applications.
- Ukrainian Healthcare Implications: We assess deployment considerations for resource-constrained healthcare environments, with specific guidance for implementation within Ukrainian medical imaging infrastructure.
- Future Directions: We identify open research questions and emerging trends that will shape the next generation of hybrid medical imaging systems.
2. Literature Review
2.1 Evolution of CNN Architectures in Medical Imaging
The U-Net architecture, introduced by Ronneberger et al. in 2015, established the foundational paradigm for medical image segmentation. Its encoder-decoder structure with skip connections addressed the critical challenge of preserving spatial resolution while extracting high-level semantic features. Subsequent innovations—including ResNet-based encoders, attention mechanisms, and dense connectivity patterns—incrementally improved performance across diverse clinical applications.
VGG-based transfer learning achieved 99.48% precision for brain tumor classification, demonstrating the power of ImageNet pretraining for medical applications. ResNet152 variants achieved exceptional results in automatic brain MR image classification, while DenseNet architectures delivered consistently high precision across multiple tumor categories. These achievements established CNNs as the dominant paradigm for medical image analysis throughout the 2015-2020 period.
However, systematic analyses revealed persistent limitations. Studies documented performance plateaus when applying CNNs to tasks requiring global context understanding—whole-slide pathology analysis, multi-organ segmentation, and anatomical landmark detection. The fundamental constraint of local receptive fields, even when expanded through atrous convolutions or pyramid pooling, limited CNNs’ ability to capture the holistic spatial relationships essential for accurate medical interpretation.
2.2 Emergence of Vision Transformers in Healthcare
The adaptation of Vision Transformers to medical imaging proceeded rapidly following their initial introduction. ViT-b32 achieved 98.24% precision in brain tumor classification, demonstrating competitive performance with CNN baselines despite lacking medical-specific inductive biases. The ability to simultaneously attend to features across different spatial scales without information loss provided distinct advantages for tasks requiring multi-scale integration.
Specialized medical ViT variants emerged to address domain-specific requirements. LCDEIT introduced linear complexity characteristics suitable for processing high-resolution MRI data, addressing the computational burden associated with standard transformer implementations. RanMerFormer implemented randomized token merging strategies, substantially reducing computational demands while maintaining classification accuracy. These innovations began addressing the practical deployment challenges that initially limited transformer adoption in clinical settings.
2.3 Hybrid Architecture Development
TransUNet, introduced by Chen et al. in 2021, represented the first successful integration of CNNs and Transformers for medical image segmentation. By embedding a Transformer in the encoder bottleneck of a U-Net architecture, TransUNet demonstrated that combining local and global feature extraction could yield performance improvements exceeding either approach alone. The architecture achieved state-of-the-art results on multi-organ segmentation benchmarks, catalyzing a wave of hybrid architecture research.
Subsequent developments explored diverse integration strategies. MCTransformer unfolded CNN-extracted multiscale features into tokens for Transformer processing, enabling more sophisticated feature interactions. SETR leveraged pretrained Transformer encoders alongside CNN-based decoder variants for semantic segmentation. These sequential architectures—where CNN features feed into Transformer processing or vice versa—demonstrated consistent improvements over single-architecture baselines.
2.4 Identification of Research Gaps
Despite rapid progress, several critical gaps remain in the hybrid architecture literature:
| Gap Category | Description | Research Implications |
|---|---|---|
| Feature Fusion Optimization | Limited theoretical understanding of optimal fusion strategies | Need for systematic ablation studies comparing fusion mechanisms |
| Computational Efficiency | Hybrid models often require 2-3× computational resources | Importance of efficient attention mechanisms for clinical deployment |
| Domain Adaptation | Limited cross-institutional validation studies | Need for robustness evaluation across diverse scanner protocols |
| 3D Extension | Most architectures designed for 2D slice processing | Requirement for native 3D hybrid architectures for volumetric data |
| Explainability | Hybrid attention mechanisms challenge interpretability | Integration with XAI methods for clinical trust |
3. Methodology and Architectural Framework
3.1 Taxonomy of Hybrid Integration Strategies
Hybrid CNN-Transformer architectures can be systematically classified based on their integration strategy. This taxonomy provides a framework for understanding architectural design decisions and their implications for clinical performance.
Sequential Integration: In this paradigm, CNN and Transformer components process features in series. TransUNet exemplifies this approach, using CNN encoders to extract hierarchical features that are subsequently processed by Transformer layers. The Transformer output is then decoded using CNN-based upsampling paths with skip connections. Sequential integration benefits from straightforward implementation and compatibility with pretrained CNN backbones.
Parallel Integration: UnetTransCNN and similar architectures employ parallel pathways that simultaneously extract local (CNN) and global (Transformer) features from the input. Adaptive fusion modules combine these complementary representations at multiple scales. Parallel integration offers more expressive feature combinations but requires careful design of fusion mechanisms to avoid representational conflict.
Hierarchical Integration: Emerging architectures adopt hierarchical strategies where the balance between CNN and Transformer processing varies across network depth. Early stages emphasize CNN processing for efficient low-level feature extraction, while later stages employ Transformer attention for semantic abstraction. This approach optimizes computational resource allocation based on the representational requirements at each processing stage.
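To make these patterns concrete, the sketch below outlines a parallel dual-branch block in PyTorch. The branch structure and concatenation-based fusion are illustrative assumptions for exposition, not the design of any specific published architecture:

```python
import torch
import torch.nn as nn

class ParallelHybridBlock(nn.Module):
    """Parallel integration sketch: a CNN branch and a self-attention branch
    process the same input, and a 1x1 conv fuses their outputs."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local branch: small conv stack for texture/edge features
        self.cnn_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Global branch: self-attention over flattened spatial tokens
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Fusion: 1x1 conv over concatenated branch outputs
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.cnn_branch(x)                        # (B, C, H, W)
        tokens = self.norm(x.flatten(2).transpose(1, 2))  # (B, HW, C)
        glob, _ = self.attn(tokens, tokens, tokens)       # global self-attention
        glob = glob.transpose(1, 2).reshape(b, c, h, w)   # back to a feature map
        return self.fuse(torch.cat([local, glob], dim=1))

# e.g.: out = ParallelHybridBlock(64)(torch.randn(1, 64, 32, 32))
```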
3.2 Mathematical Foundations of Feature Fusion
The effectiveness of hybrid architectures depends critically on feature fusion mechanisms that combine CNN and Transformer representations. The UnetTransCNN architecture introduces adaptive global-local coupling units defined mathematically as follows:
For CNN-extracted local features $F_{\text{local}}$ and Transformer-derived global features $F_{\text{global}}$, the fused representation $F_{\text{fused}}$ is computed as

$$F_{\text{fused}} = \alpha\,(W_l \otimes F_{\text{local}}) + \beta\,(W_g \otimes F_{\text{global}}) + \gamma\,\big[(W_l \otimes F_{\text{local}}) \odot (W_g \otimes F_{\text{global}})\big],$$

where $\alpha$, $\beta$, $\gamma$ are learnable scaling parameters, $W_l$ and $W_g$ are projection matrices, $\otimes$ denotes matrix multiplication, and $\odot$ represents element-wise multiplication for cross-modal interaction modeling.
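A minimal PyTorch sketch of this coupling unit follows, under two stated assumptions: the projections $W_l$ and $W_g$ are realized as 1 × 1 convolutions, and $\alpha$, $\beta$, $\gamma$ are learnable scalars. The exact parameterization in UnetTransCNN may differ:

```python
import torch
import torch.nn as nn

class AdaptiveFusionUnit(nn.Module):
    """Sketch of an adaptive global-local coupling unit following the fusion
    equation above (assumed form; not the paper's reference implementation)."""
    def __init__(self, channels: int):
        super().__init__()
        # W_l, W_g: learnable 1x1 projections of the local/global feature maps
        self.proj_local = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_global = nn.Conv2d(channels, channels, kernel_size=1)
        # alpha, beta, gamma: learnable scalar gates
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, f_local: torch.Tensor, f_global: torch.Tensor) -> torch.Tensor:
        l = self.proj_local(f_local)    # W_l applied to F_local
        g = self.proj_global(f_global)  # W_g applied to F_global
        # two additive terms plus an element-wise cross-modal interaction term
        return self.alpha * l + self.beta * g + self.gamma * (l * g)
```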
🔬 Feature Fusion Impact
+3.8%
Average Dice improvement from adaptive fusion vs. simple concatenation across BTCV benchmark organs
3.3 Attention Mechanism Integration
The integration of attention mechanisms within hybrid architectures takes multiple forms. Channel-wise attention mechanisms (CWAM) have demonstrated exceptional performance, with ResNet101-CWAM achieving 99.83% precision for brain tumor classification. The attention operation selectively highlights relevant features, improving classification precision while reducing computational requirements.
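The CWAM internals are not reproduced here; as a representative illustration of channel-wise attention, the sketch below implements a standard squeeze-and-excitation-style block:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention: global pooling yields a
    per-channel descriptor, and a small MLP produces per-channel weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: (B, C, 1, 1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.mlp(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # reweight each channel
```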
The Adaptive Fourier Neural Operator (AFNO), employed in UnetTransCNN, represents an innovative approach to attention computation. By transforming embeddings through Discrete Fourier Transform (DFT), the architecture enables frequency-domain processing that captures global patterns not easily visible in the spatial domain:
$$F_{\text{mod}}(k) = F(k) \cdot W(k)$$

$$e'(n) = \frac{1}{N} \sum_{k=0}^{N-1} F_{\text{mod}}(k)\, \exp\!\left(\frac{2\pi i}{N}\, nk\right)$$
This Fourier-based attention mechanism enables the encoder to adaptively handle spatial frequencies, emphasizing relevant frequency components while suppressing noise. The inverse DFT converts modulated frequency information back to spatial representations, preserving image structure while embedding enhanced features.
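The sketch below illustrates the frequency-domain pipeline of the two equations above in reduced form—FFT over the token axis, multiplication by a learnable complex filter $W(k)$, inverse FFT—omitting the block-diagonal MLP that the full AFNO applies in the frequency domain:

```python
import torch
import torch.nn as nn

class FourierMixer(nn.Module):
    """Simplified frequency-domain token mixing: rFFT over tokens, multiply by
    a learnable complex filter W(k), inverse rFFT back to the spatial domain."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        n_freq = num_tokens // 2 + 1  # length of the rfft output
        # learnable complex filter stored as separate real/imaginary parts
        self.w_real = nn.Parameter(torch.ones(n_freq, dim))
        self.w_imag = nn.Parameter(torch.zeros(n_freq, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) token embeddings
        f = torch.fft.rfft(x, dim=1)                     # F(k), complex-valued
        f = f * torch.complex(self.w_real, self.w_imag)  # F_mod(k) = F(k) * W(k)
        return torch.fft.irfft(f, n=x.shape[1], dim=1)   # inverse DFT -> e'(n)
```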
3.4 3D Medical Image Adaptation
Volumetric medical data—including CT and MRI sequences—requires specialized architectural adaptations. UnetTransCNN addresses 3D processing through:
- Volumetric Convolutions: 3D convolutional kernels that preserve spatial relationships across the depth dimension
- 3D Positional Encodings: Extended positional embeddings that encode x, y, and z coordinates for accurate spatial attention
- Cubic Patch Partitioning: Input volumes are divided into non-overlapping P × P × P patches for Transformer processing (see the sketch after this list)
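A minimal sketch of cubic patch partitioning, using the standard strided Conv3d formulation (channel count, embedding width, and patch size here are illustrative):

```python
import torch
import torch.nn as nn

class CubicPatchEmbed3D(nn.Module):
    """Partition a volume into non-overlapping P x P x P patches and project
    each to an embedding via a 3D convolution with kernel == stride == P."""
    def __init__(self, in_channels: int = 1, embed_dim: int = 768, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W) volume -> (B, N, embed_dim) token sequence
        x = self.proj(x)                     # (B, E, D/P, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # one token per cubic patch

# e.g. a 1-channel 96^3 CT crop with P = 16 yields (96/16)^3 = 216 tokens:
tokens = CubicPatchEmbed3D()(torch.randn(1, 1, 96, 96, 96))
print(tokens.shape)  # torch.Size([1, 216, 768])
```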
4. Results and Performance Analysis
4.1 Benchmark Performance Comparison
Systematic evaluation across standardized benchmarks reveals consistent performance advantages for hybrid architectures. The Beyond the Cranial Vault (BTCV) multi-organ segmentation challenge and Medical Segmentation Decathlon (MSD) provide rigorous evaluation frameworks for comparing architectural approaches.
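For reference, the per-organ Dice coefficient reported in the tables below can be computed from integer label maps as follows (a generic implementation, not the official evaluation harness of either benchmark):

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, label: int, eps: float = 1e-8) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for a single label in integer label maps."""
    p = (pred == label)
    t = (target == label)
    return float(2.0 * np.logical_and(p, t).sum() / (p.sum() + t.sum() + eps))

def mean_dice(pred: np.ndarray, target: np.ndarray, num_labels: int) -> float:
    """Mean Dice over organ labels 1..K, as in multi-organ benchmarks like BTCV."""
    return float(np.mean([dice_score(pred, target, k) for k in range(1, num_labels + 1)]))
```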
| Architecture | Type | BTCV Dice (%) | MSD Dice (%) | Parameters (M) |
|---|---|---|---|---|
| U-Net | CNN | 78.42 | 76.18 | 31.0 |
| Attention U-Net | CNN+Attention | 80.15 | 78.44 | 34.9 |
| ViT-Base | Transformer | 79.87 | 77.92 | 86.4 |
| TransUNet | Sequential Hybrid | 82.31 | 80.67 | 105.3 |
| UnetTransCNN | Parallel Hybrid | 84.73 | 82.91 | 118.7 |
| MSLAU-Net | Parallel Hybrid | 84.21 | 82.15 | 94.2 |
The results demonstrate a clear performance hierarchy: parallel hybrid architectures outperform sequential hybrids, which in turn exceed single-architecture approaches. UnetTransCNN’s 84.73% mean Dice score represents a 6.31 percentage point improvement over baseline U-Net and a 2.42 point improvement over TransUNet.
4.2 Organ-Specific Performance Analysis
Performance varies significantly across organ types, revealing the specific contributions of hybrid architecture components:
| Organ | U-Net (%) | TransUNet (%) | UnetTransCNN (%) | Δ vs. U-Net (pp) |
|---|---|---|---|---|
| Liver | 94.21 | 95.67 | 96.42 | +2.21 |
| Spleen | 91.87 | 93.45 | 94.89 | +3.02 |
| Pancreas | 62.34 | 68.92 | 74.56 | +12.22 |
| Kidneys | 87.65 | 90.12 | 92.34 | +4.69 |
| Stomach | 71.23 | 76.89 | 81.45 | +10.22 |
| Gallbladder | 58.92 | 64.78 | 71.23 | +12.31 |
🎯 Challenging Organ Improvement
+12.31%
Dice improvement for gallbladder segmentation—historically the most challenging abdominal organ
The most dramatic improvements occur for anatomically challenging organs: gallbladder (+12.31%), pancreas (+12.22%), and stomach (+10.22%). These organs present difficulties due to variable morphology, indistinct boundaries, and complex spatial relationships with surrounding structures. The global context modeling enabled by Transformer attention proves particularly valuable for disambiguating these challenging anatomical regions.
4.3 Classification Task Performance
Beyond segmentation, hybrid architectures demonstrate exceptional performance for medical image classification tasks:
| Task | Best CNN (%) | Best ViT (%) | Hybrid (%) | Best Architecture |
|---|---|---|---|---|
| Brain Tumor Classification | 99.48 | 98.24 | 99.83 | ResNet101-CWAM |
| Ovarian Tumor Classification | 91.23 | 89.67 | 94.12 | Early-Fusion Hybrid |
| Lung Lesion Classification | 93.45 | 92.89 | 96.78 | Hybrid CNN-Transformer |
| Heart Disease Prediction | 87.34 | 86.92 | 91.56 | Hybrid EHR-CNN-Transformer |
4.4 Computational Efficiency Analysis
Clinical deployment requires balancing performance against computational constraints. The following analysis examines efficiency characteristics across architectures:
| Architecture | Inference Time (ms) | GPU Memory (GB) | FLOPs (G) | Dice/GFLOP |
|---|---|---|---|---|
| U-Net | 45 | 4.2 | 54.7 | 1.43 |
| TransUNet | 127 | 8.9 | 142.3 | 0.58 |
| MSLAU-Net | 98 | 7.3 | 112.8 | 0.75 |
| UnetTransCNN | 156 | 11.2 | 187.4 | 0.45 |
While hybrid architectures require 2-3× the computational resources of baseline U-Net, the absolute inference times remain clinically acceptable. A 156ms inference time translates to processing approximately 6 images per second—sufficient for real-time clinical workflow integration in most diagnostic scenarios.
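Latency figures of this kind depend heavily on measurement methodology. A minimal timing harness with warm-up iterations and GPU synchronization—an illustrative sketch, not the protocol behind the table above—might look like:

```python
import time
import torch

@torch.no_grad()
def measure_latency_ms(model: torch.nn.Module, x: torch.Tensor,
                       warmup: int = 10, iters: int = 50) -> float:
    """Average single-batch inference latency in milliseconds."""
    model.eval()
    for _ in range(warmup):       # warm-up: stabilize clocks, caches, cuDNN autotuning
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()  # ensure queued kernels finish before timing starts
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0
```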
5. Discussion
5.1 Interpretation of Performance Patterns
The consistent superiority of parallel hybrid architectures across benchmarks reflects the complementary nature of CNN and Transformer feature representations. CNNs excel at capturing the fine-grained textural patterns that distinguish pathological from healthy tissue—the subtle density variations, edge characteristics, and local structural regularities that form the diagnostic signature of many conditions. Transformers contribute the global context essential for anatomical understanding—the spatial relationships between structures, the overall organ morphology, and the long-range dependencies that enable accurate boundary delineation.
The most dramatic improvements occur for organs with complex spatial relationships and variable morphology. The 12.31% improvement for gallbladder segmentation exemplifies this pattern: gallbladder identification requires understanding its relationship to the liver, its variable shape across patients, and its often indistinct boundaries with surrounding fat. CNN-only approaches struggle with these requirements; the limited receptive field prevents effective modeling of the anatomical context necessary for accurate segmentation. Transformer attention mechanisms directly address this limitation, enabling the network to leverage distant but relevant anatomical landmarks.
5.2 Ukrainian Healthcare Implementation Considerations
The deployment of hybrid architectures within Ukrainian healthcare presents both opportunities and challenges that merit careful analysis. The Ukrainian medical imaging infrastructure, as documented in earlier articles of this series, comprises approximately 850 CT scanners and 380 MRI units serving a population of 37 million. This equipment diversity introduces significant domain shift challenges that affect model generalization.
🇺🇦 Ukrainian Deployment Considerations
- Infrastructure: Mixed scanner fleet requires robust domain adaptation
- Compute Resources: Limited GPU availability favors efficient hybrid variants
- Language: Ukrainian-language interface requirements for clinical integration
- Regulatory: Alignment with MHSU approval pathways essential
- Training Data: Ukrainian patient demographics underrepresented in global datasets
The computational requirements of full hybrid architectures may exceed available resources in many Ukrainian clinical settings. MSLAU-Net, with its more modest 94.2M parameters and 98ms inference time, offers a compelling balance between performance and efficiency for resource-constrained deployments. Alternatively, knowledge distillation techniques can compress larger hybrid models into more efficient student networks while preserving much of the performance advantage.
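A common form of the distillation objective mentioned above combines a hard-label loss with a temperature-softened teacher-matching term. The sketch below is a generic recipe, not a procedure taken from the cited architectures:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      target: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of cross-entropy on ground-truth labels and KL divergence
    to the temperature-softened teacher distribution (per-pixel for segmentation)."""
    hard = F.cross_entropy(student_logits, target)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude is independent of T
    return alpha * hard + (1.0 - alpha) * soft
```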
Ukrainian patient demographics present additional considerations. Training datasets for hybrid architectures predominantly reflect Western European and North American populations. Anthropometric differences, disease prevalence patterns, and genetic variations may affect model performance when deployed on Ukrainian patient populations. Targeted fine-tuning on Ukrainian datasets—even with limited samples—can significantly improve local performance, as demonstrated by transfer learning studies in analogous contexts.
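A minimal sketch of such targeted fine-tuning—freezing a pretrained encoder and updating only the decoder on local data—is shown below; the `encoder`/`decoder` attribute names are hypothetical placeholders for whatever the chosen architecture exposes:

```python
import torch

def build_finetune_optimizer(model: torch.nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    """Freeze pretrained encoder weights; train only the decoder on local data."""
    for p in model.encoder.parameters():   # 'encoder' is an assumed attribute name
        p.requires_grad = False
    head_params = [p for p in model.decoder.parameters() if p.requires_grad]
    return torch.optim.AdamW(head_params, lr=lr)
```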
5.3 Limitations and Failure Modes
Despite impressive benchmark performance, hybrid architectures exhibit several limitations that warrant consideration:
Data Efficiency: Hybrid models inherit the data hunger of their Transformer components. Performance degrades more rapidly than CNN-only approaches when training data is limited—a significant concern for rare disease applications where large annotated datasets are unavailable.
Interpretability Challenges: The combination of CNN feature maps and Transformer attention patterns complicates interpretation. While attention visualizations provide some insight into model focus, the interaction between local and global features in hybrid fusion modules remains difficult to interpret, potentially limiting clinical acceptance.
Domain Shift Sensitivity: Preliminary evidence suggests hybrid models may be more sensitive to domain shift than CNN-only approaches. The global attention patterns learned from source domain data may not transfer effectively to target domains with different scanner characteristics or patient populations.
Computational Overhead: The 2-3× increase in computational requirements limits deployment options in resource-constrained settings. While inference times remain clinically acceptable, training requirements may exceed available compute infrastructure in many healthcare institutions.
5.4 Future Research Directions
Several promising research directions could address current limitations while extending hybrid architecture capabilities:
- Efficient Attention Mechanisms: Linear attention variants, sparse attention patterns, and local-global attention hierarchies offer paths to reduced computational complexity while preserving hybrid benefits
- Self-Supervised Pretraining: Medical-domain-specific pretraining strategies could address data efficiency limitations, enabling effective hybrid model training with smaller labeled datasets
- Federated Hybrid Learning: Privacy-preserving distributed training could enable collaborative model development across healthcare institutions without data sharing
- Explainable Hybrid Attention: Novel XAI methods designed specifically for hybrid architectures could improve interpretability and clinical acceptance
- Continuous Adaptation: Online learning strategies could enable deployed models to adapt to institutional-specific data distributions without explicit retraining
6. Conclusion
Hybrid CNN-Transformer architectures represent the current state-of-the-art paradigm for medical image analysis, successfully addressing the fundamental limitations of single-architecture approaches. By combining CNNs’ exceptional local feature extraction with Transformers’ global context modeling, these architectures achieve consistent performance improvements across diverse clinical applications—from multi-organ segmentation to tumor classification.
Our comprehensive analysis reveals several key findings:
- Parallel hybrid architectures outperform sequential designs: UnetTransCNN’s parallel dual-path processing with adaptive fusion achieves 84.73% mean Dice on BTCV—a 2.42-point improvement over sequential TransUNet
- Challenging anatomical regions benefit most: Organs with complex spatial relationships (gallbladder, pancreas, stomach) show 10-12% improvement, reflecting the value of global context modeling
- Computational trade-offs are acceptable for clinical deployment: 156ms inference time enables real-time integration while GPU memory requirements remain within modern hardware capabilities
- Ukrainian deployment requires careful architecture selection: Efficient variants like MSLAU-Net offer optimal performance-efficiency balance for resource-constrained environments
For Ukrainian healthcare integration, we recommend a phased approach: initial deployment of efficient hybrid variants (MSLAU-Net) in high-volume urban centers, followed by federated fine-tuning on Ukrainian patient data, and eventual transition to full hybrid architectures as infrastructure permits. This strategy balances immediate clinical benefit against long-term performance optimization.
The trajectory of hybrid architecture development points toward increasingly sophisticated integration strategies—hierarchical designs, adaptive computation, and continuous learning—that will further narrow the gap between AI-assisted and expert human performance. For healthcare technology leaders evaluating diagnostic AI investments, hybrid CNN-Transformer architectures offer the most compelling combination of current performance and future potential.
References
- Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., … & Zhou, Y. (2021). TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306. DOI: 10.48550/arXiv.2102.04306
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … & Houlsby, N. (2020). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. DOI: 10.48550/arXiv.2010.11929
- Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. MICCAI 2015, 234-241. DOI: 10.1007/978-3-319-24574-4_28
- Wang, W., Chen, C., Ding, M., Yu, H., Zha, S., & Li, J. (2025). UnetTransCNN: integrating transformers with convolutional neural networks for enhanced medical image segmentation. Frontiers in Oncology, 15, 1467672. DOI: 10.3389/fonc.2025.1467672
- Li, H., Zhang, Y., & Wang, X. (2025). MSLAU-Net: A hybrid CNN-Transformer network for medical image segmentation. arXiv preprint arXiv:2505.18823. DOI: 10.48550/arXiv.2505.18823
- Khan, A., et al. (2025). Hierarchical multi-scale vision transformer model for accurate detection and classification of brain tumors in MRI-based medical imaging. Scientific Reports, 15, 23100. DOI: 10.1038/s41598-025-23100-0
- Valanarasu, J. M. J., Oza, P., Hacihaliloglu, I., & Patel, V. M. (2021). Medical transformer: Gated axial-attention for medical image segmentation. MICCAI 2021, 36-46. DOI: 10.1007/978-3-030-87193-2_4
- Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B., … & Xu, D. (2022). UNETR: Transformers for 3D medical image segmentation. WACV 2022, 574-584. DOI: 10.1109/WACV51458.2022.00181
- Zhou, H. Y., Guo, J., Zhang, Y., Yu, L., Wang, L., & Yu, Y. (2021). nnFormer: Interleaved transformer for volumetric segmentation. arXiv preprint arXiv:2109.03201. DOI: 10.48550/arXiv.2109.03201
- Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., & Wang, M. (2022). Swin-Unet: Unet-like pure transformer for medical image segmentation. ECCV 2022 Workshops, 205-218. DOI: 10.1007/978-3-031-25066-8_9
- Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., … & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. ICCV 2021, 10012-10022. DOI: 10.1109/ICCV48922.2021.00986
- Gupta, A., et al. (2024). A hybrid CNN-Transformer feature pyramid network for granular abdominal aortic aneurysm segmentation. MICCAI 2024. DOI: 10.1007/978-3-031-72390-2_23
- Hong, Y., & Ding, C. (2025). Early-fusion hybrid CNN-transformer models for multiclass ovarian tumor ultrasound classification. Frontiers in Artificial Intelligence, 8, 1679310. DOI: 10.3389/frai.2025.1679310
- Tang, Y., Yang, D., Li, W., Roth, H. R., Landman, B., Xu, D., … & Myronenko, A. (2022). Self-supervised pre-training of swin transformers for 3D medical image analysis. CVPR 2022, 20730-20740. DOI: 10.1109/CVPR52688.2022.02007
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. CVPR 2016, 770-778. DOI: 10.1109/CVPR.2016.90
- Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2), 203-211. DOI: 10.1038/s41592-020-01008-z