JAIT 2025 Vol.16(4): 478-490
doi: 10.12720/jait.16.4.478-490

Hybrid Models for Facial Emotion Recognition and Intensity Detection: Generalization Across Human and Cartoon Faces Using CNN and Vision Transformer

Amornvit Vatcharaphrueksadee 1,2,*, Maleerat Maliyaem 1, and Pannapa Sawakchart 2
1. Faculty of Information Technology and Digital Innovation, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand
2. Faculty of Information Technology and Digital Innovation, North Bangkok University, Bangkok, Thailand
Email: amornvit.va@northbkk.ac.th (A.V.); maleerat.m@itd.kmutnb.ac.th (M.M.); pannapa.sawakchart@gmail.com (P.S.)
*Corresponding author

Manuscript received November 21, 2024; revised December 24, 2024; accepted January 23, 2025; published April 9, 2025.

Abstract—Facial Emotion Recognition (FER) is pivotal in advancing human-computer interaction, affective computing, and immersive virtual environments. This study introduces a hybrid model that combines Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to improve FER accuracy and enable robust emotion intensity detection. The hybrid architecture leverages CNNs for effective local feature extraction while using ViTs to capture complex global context, thus enhancing the model's cross-domain adaptability. The model's generalization capability was rigorously evaluated on both a human face dataset (CK+) and a cartoon face dataset (FERG-DB), addressing significant challenges in cross-dataset and cross-domain FER. Experimental results indicate that the hybrid model achieves classification accuracies of 94.85% on the human face dataset (CK+) and 93.75% on the cartoon dataset (FERG-DB), outperforming standalone architectures. Additionally, the model accurately detects emotion intensity with a Mean Squared Error (MSE) of 0.038, demonstrating a generalization improvement of approximately 7% across domains. These results underscore the proposed model's versatility for real-world and virtual applications, marking a significant advancement in facial emotion recognition.
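To make the hybrid pipeline concrete, the following is a minimal NumPy sketch of the two-stage idea the abstract describes: a convolutional stage extracts local features, the feature map is split into patch tokens, and a self-attention stage aggregates global context before classification and intensity heads. This is an illustrative toy, not the authors' implementation; the image size, kernel size, patch layout, seven-class output, and sigmoid intensity head are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, kernel):
    """Valid-mode 2-D convolution: the CNN-style local feature extractor."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def self_attention(tokens):
    """Single-head self-attention over patch tokens: the ViT-style global stage."""
    d = tokens.shape[1]
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # softmax attention
    return weights @ v

# Toy 12x12 "face" image with random weights throughout (untrained).
img = rng.normal(size=(12, 12))
feat = np.maximum(conv2d(img, rng.normal(size=(3, 3))), 0)   # ReLU local features, 10x10
# Split the feature map into 2x2 = 4 non-overlapping 5x5 patches, flattened to tokens.
patches = feat.reshape(2, 5, 2, 5).transpose(0, 2, 1, 3).reshape(4, 25)
tokens = self_attention(patches)
pooled = tokens.mean(axis=0)                                  # global average over tokens
logits = pooled @ rng.normal(size=(25, 7))                    # 7 basic emotion classes
intensity = 1 / (1 + np.exp(-pooled @ rng.normal(size=25)))   # scalar intensity in (0, 1)
print(logits.shape, float(intensity))
```

In a trained system the convolution and attention weights would be learned jointly, and the intensity head would be regressed against labeled intensity targets (the abstract reports an MSE of 0.038 for that head); here the weights are random purely to show the data flow.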
 
Keywords—Facial Emotion Recognition (FER), Convolutional Neural Networks (CNN), Vision Transformer (ViT), emotion intensity detection, hybrid model, cross-domain generalization, human and cartoon faces

Cite: Amornvit Vatcharaphrueksadee, Maleerat Maliyaem, and Pannapa Sawakchart, "Hybrid Models for Facial Emotion Recognition and Intensity Detection: Generalization Across Human and Cartoon Faces Using CNN and Vision Transformer," Journal of Advances in Information Technology, Vol. 16, No. 4, pp. 478-490, 2025. doi: 10.12720/jait.16.4.478-490

Copyright © 2025 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
