| Issue | ITM Web Conf., Volume 80, 2025: 2025 2nd International Conference on Advanced Computer Applications and Artificial Intelligence (ACAAI 2025) |
|---|---|
| Article Number | 01002 |
| Number of page(s) | 9 |
| Section | Machine Learning & Deep Learning Algorithms |
| DOI | https://doi.org/10.1051/itmconf/20258001002 |
| Published online | 16 December 2025 |
Hybrid LSTM & Transformer for 3D Human Pose Estimation
School of Data Science, The Chinese University of Hong Kong, Shenzhen, Guangdong, China
* Corresponding author: 225040158@link.cuhk.edu.cn
3D human pose estimation (3DHPE), driven by advances in deep neural networks, has become a prominent research focus in computer vision and robotics. Unlike conventional image-based methods, video-based methods have access to rich temporal information, which allows spatiotemporal correlations to be exploited for more accurate and robust results. However, modeling both local motion continuity and long-range temporal dependencies remains a significant challenge: recurrent architectures such as LSTMs excel at capturing local dynamics but struggle with long-term information retention, whereas Transformer-based models provide strong global reasoning yet are less sensitive to fine-grained motion details. To address these complementary weaknesses, this work proposes a hybrid LSTM–Transformer network that integrates the local feature sensitivity of LSTMs with the global coordination ability of Transformers. Multiple fusion strategies are investigated to evaluate their impact on 3DHPE performance. Comprehensive experiments on benchmark datasets demonstrate that the proposed single-branch LSTM+Transformer architecture achieves the most competitive results, with a mean per-joint position error (MPJPE) of 49.51 mm, outperforming both Transformer-only and LSTM-only models. This hybrid framework provides a novel and effective paradigm for future research in video-based human motion understanding.
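For readers unfamiliar with the reported metric, the following is a minimal sketch of how MPJPE is commonly computed: the Euclidean distance between each predicted and ground-truth joint, averaged over all joints and frames. The function name `mpjpe`, the nested-list layout `[frames][joints][3]`, and the toy coordinates are illustrative assumptions, not taken from the paper. Evaluation protocols often align the root joint before measuring; the abstract does not state which protocol was used, so no alignment is applied here.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error over all frames and joints.

    pred, target: nested lists shaped [frames][joints][3], in the same
    units (e.g. millimetres). The layout is an assumption of this sketch.
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)  # Euclidean distance for one joint
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count


# Toy example (hypothetical values): one frame, two joints.
target = [[[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]]]
pred = [[[3.0, 4.0, 0.0], [100.0, 0.0, 12.0]]]
print(mpjpe(pred, target))  # joint errors are 5 mm and 12 mm, so the mean is 8.5
```

Under this definition, the reported 49.51 mm means that predicted joints lie, on average, about 5 cm from their ground-truth positions.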
© The Authors, published by EDP Sciences, 2025
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

