| Field | Value |
|---|---|
| Issue | ITM Web Conf., Volume 85, 2026: Intelligent Systems for a Sustainable Future (ISSF 2026) |
| Article Number | 01003 |
| Number of page(s) | 10 |
| Section | AI for Healthcare, Agriculture, Smart Society & Computer Vision |
| DOI | https://doi.org/10.1051/itmconf/20268501003 |
| Published online | 09 April 2026 |
Comparative Analysis of Deep Learning Architectures for Human Action Recognition using the Stanford 40 Dataset
1 Research Scholar, Department of Computer Science and Engineering, Sri Chandrasekharendra Saraswathi Viswa Mahavidyalaya (SCSVMV) Deemed to be University, Kanchipuram, India
2 Associate Professor, Department of Computer Science and Engineering, Sri Chandrasekharendra Saraswathi Viswa Mahavidyalaya (SCSVMV) Deemed to be University, Kanchipuram, India
* Corresponding author
Abstract
Human Action Recognition (HAR) in still images is a well-established computer vision task with applications in sports analysis, security, and human-computer interaction. Current trends in HAR research show a proliferation of deep learning methods concentrated in a few publication hot spots, which motivates a systematic, fair evaluation of state-of-the-art convolutional neural networks (CNNs). This work conducts a systematic evaluation of seven popular CNN families: ResNet, Inception, MobileNet, DenseNet, VGGNet, EfficientNet, and EfficientNetV2, together with representative variants within each family. The models are trained and evaluated on the Stanford 40 Human Action Recognition dataset under a common experimental setup, constrained to a limited training budget (three epochs) to keep the large number of model evaluations comparable. Models are assessed using accuracy, precision, recall, and F1-score metrics. The results reveal performance trends tied to model family and depth. Mid-range models, such as ResNet-50 and the DenseNet variants, yield the best trade-offs between performance and resource consumption. Lightweight models perform the worst but offset the loss in accuracy with training efficiency. The best overall model is EfficientNetV2-L, which leads on every evaluated metric. It owes this performance to training-aware architecture design, improved compound scaling, fused MBConv blocks, and a larger input resolution, all of which enable effective learning even under the study's low training budget. In contrast with previous studies that advocate highly specialized hybrid models, this study provides a framework for the systematic evaluation of CNN families under an identical training budget.
Overall, this study provides realistic image-based HAR baselines and informs model selection strategies and trade-offs for budget-limited training contexts.
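The scoring step of the protocol described above can be sketched in plain Python: each trained CNN's label predictions on the test split are reduced to accuracy plus macro-averaged precision, recall, and F1, so that all models are compared on identical metrics. This is a minimal illustrative sketch, not the authors' code; the action labels and predictions below are invented examples, and a real run would use the full 40 Stanford 40 classes.

```python
# Hypothetical sketch of the per-model scoring used to compare CNNs:
# accuracy and macro-averaged precision/recall/F1 over class labels.
# Labels and predictions here are illustrative, not from the paper.
from collections import Counter

def macro_metrics(y_true, y_pred, labels):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # predicted p, but true class was t
            fn[t] += 1          # missed an instance of class t
    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,   # macro average
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy comparison: one hypothetical model's predictions on four test images.
labels = ["riding_a_bike", "cooking", "fishing"]
y_true = ["riding_a_bike", "cooking", "cooking", "fishing"]
y_pred = ["riding_a_bike", "cooking", "fishing", "fishing"]
scores = macro_metrics(y_true, y_pred, labels)
print(scores)  # accuracy 0.75; macro precision, recall, and F1 below 1.0
```

Running the same function over every model's predictions yields directly comparable rows for a results table, which is the essence of the fixed-budget comparison the study performs.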
Key words: Convolutional Neural Networks / Human Action Recognition / Performance Evaluation / Stanford 40 Actions Dataset / Image Classification / Deep Learning
© The Authors, published by EDP Sciences, 2026
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

