RViT-FusionNet: A Local Cross-Attention Feature Fusion-based Hybrid Framework for Brain Tumor Classification
Abstract
Accurate brain tumor classification via MRI is essential for diagnosis and treatment planning. This study introduces RViT-FusionNet, a hybrid deep learning model that integrates convolutional and transformer architectures for enhanced tumor detection. The model employs ResNet-50 to capture local textural details and a Vision Transformer to extract global context. A Local Cross-Attention (LCA) module is proposed to align and merge these features, allowing the network to model local structures and long-range dependencies concurrently. To enhance generalization across varied imaging conditions and tumor types, a domain discriminator is included to discern spatial and domain-specific patterns, encouraging the learning of domain-invariant representations. The approach is validated on four public MRI datasets, yielding classification accuracies of 99.08% ± 0.16%, 99.56% ± 0.17%, 96.20% ± 0.25%, and 94.76% ± 0.35% for glioma, meningioma, pituitary tumors, and healthy cases, respectively. For interpretability, Grad-CAM is used to generate saliency maps highlighting tumor regions, confirming that the model attends to clinically relevant areas. These findings demonstrate that RViT-FusionNet combines strong classification performance with high interpretability, positioning it as a robust and scalable tool for computer-assisted, multi-class brain tumor diagnosis in clinical practice. The complete source code will be made available in a GitHub repository after acceptance of the manuscript.
*Equal contribution with first author