JAIT 2025 Vol.16(2): 264-273
doi: 10.12720/jait.16.2.264-273

Explainable Disease Classification: Exploring Grad-CAM Analysis of CNNs and ViTs

Ali Alqutayfi 1,*, Wadha Almattar 1,2,*, Sadam Al-Azani 3, Fakhri Alam Khan 1,3,4, Abdullah Al Qahtani 5, Solaiman Alageel 6, and Mohammed Alzahrani 6
1. King Fahd University of Petroleum and Minerals, Information and Computer Science Department, Saudi Arabia
2. Imam Abdulrahman Bin Faisal University, Computer Science Department, Saudi Arabia
3. SDAIA-KFUPM Joint Research Center for Artificial Intelligence, Saudi Arabia
4. Interdisciplinary Research Center of Intelligent Secure Systems, KFUPM, Saudi Arabia
5. Imam Abdulrahman Bin Faisal University, Ophthalmology-College of Medicine, Saudi Arabia
6. Dammam Medical Complex, Radiology Department, Eastern Health Cluster, Saudi Arabia
Email: s202032080@kfupm.edu.sa (A.Z.A.); wmalmattar@iau.edu.sa (W.M.M.); sadam.azani@kfupm.edu.sa (S.A.A.); fakhri.khan@kfupm.edu.sa (F.A.K.); aaoqahtani@iau.edu.sa (A.A.A.); salogail@moh.gov.sa (S.M.A.); i46@me.com (M.Y.A.)
*Corresponding author

Manuscript received September 27, 2024; revised October 16, 2024; accepted December 6, 2024; published February 17, 2025.

Abstract—Deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), are playing an increasingly crucial role in early diagnosis and treatment across medical fields. As these AI models are integrated into clinical practice, the need for explainable AI tools, like Gradient-weighted Class Activation Mapping (Grad-CAM), becomes paramount to building clinician trust and ensuring the reliability of AI-driven diagnoses. However, a gap exists in the literature regarding comprehensive, quantitative, and qualitative comparisons of CNN and ViT performance across diverse medical imaging tasks, particularly those involving variations in object scale. This study compares CNN-based and ViT-based models for two medical imaging tasks: diabetic retinopathy detection from fundus images (small objects) and pneumonia detection from chest X-rays (large objects). We evaluate popular CNN architectures (ResNet, EfficientNet, VGG, Inception) and ViT models (ViT-Base, ViT-Large, ViT-Huge), using both quantitative metrics and expert qualitative assessments. We also analyze Grad-CAM’s effectiveness for visualizing regions of interest in these models. Our results show that ViT-Large outperforms other models on X-rays, while EfficientNet excels on fundus images. However, Grad-CAM struggles to highlight small regions of interest, particularly in diabetic retinopathy, revealing a limitation in current explainable AI methods. This work underscores the need for optimization of explainability tools and contributes to a better understanding of CNN and ViT strengths in medical imaging.
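
For readers who want a concrete picture of the Grad-CAM procedure evaluated in this study, a minimal PyTorch sketch is given below. The ResNet-50 backbone, the choice of target layer, and the input size are illustrative assumptions for the example, not the authors' exact experimental configuration.

import torch
import torch.nn.functional as F
from torchvision import models

# Assumed setup: a pretrained ResNet-50, one of the CNN families named in
# the abstract. The last convolutional block is a typical Grad-CAM target.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
target_layer = model.layer4[-1]

activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["value"] = output.detach()

def bwd_hook(module, grad_input, grad_output):
    gradients["value"] = grad_output[0].detach()

target_layer.register_forward_hook(fwd_hook)
target_layer.register_full_backward_hook(bwd_hook)

def grad_cam(image, class_idx=None):
    """Return a normalized Grad-CAM heatmap for a 1xCxHxW input tensor."""
    logits = model(image)                 # forward pass records activations
    if class_idx is None:
        class_idx = int(logits.argmax())  # explain the predicted class
    model.zero_grad()
    logits[0, class_idx].backward()       # backward pass records gradients

    # Grad-CAM: weight each feature map by its spatially averaged gradient,
    # sum over channels, apply ReLU, and upsample to the input resolution.
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0]                      # heatmap in [0, 1]

Calling grad_cam(x) on a preprocessed 1x3x224x224 image tensor yields a heatmap that can be overlaid on the input; region-of-interest visualizations of the kind discussed in the abstract are typically produced this way, and the coarse spatial resolution of the upsampled map is one reason such methods can under-localize small lesions.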
 
Keywords—explainable AI, Vision Transformer (ViT), Convolutional Neural Network (CNN), medical imaging, Gradient-weighted Class Activation Mapping (Grad-CAM)

Cite: Ali Alqutayfi, Wadha Almattar, Sadam Al-Azani, Fakhri Alam Khan, Abdullah Al Qahtani, Solaiman Alageel, and Mohammed Alzahrani, "Explainable Disease Classification: Exploring Grad-CAM Analysis of CNNs and ViTs," Journal of Advances in Information Technology, Vol. 16, No. 2, pp. 264-273, 2025. doi: 10.12720/jait.16.2.264-273

Copyright © 2025 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
