Open Access
ITM Web Conf.
Volume 85, 2026
Intelligent Systems for a Sustainable Future (ISSF 2026)
Article Number 01003
Number of page(s) 10
Section AI for Healthcare, Agriculture, Smart Society & Computer Vision
DOI https://doi.org/10.1051/itmconf/20268501003
Published online 09 April 2026