Hybrid LSTM & Transformer for 3D Human Pose Estimation

Longjie Su

doi:10.1051/itmconf/20258001002

Open Access

Issue		ITM Web Conf., Volume 80, 2025: 2025 2nd International Conference on Advanced Computer Applications and Artificial Intelligence (ACAAI 2025)


Article Number		01002
Number of page(s)		9
Section		Machine Learning & Deep Learning Algorithms
DOI		https://doi.org/10.1051/itmconf/20258001002
Published online		16 December 2025

ITM Web of Conferences 80, 01002 (2025)


Longjie Su^*

School of Data Science, The Chinese University of Hong Kong, Shenzhen, Guangdong, China

^* Corresponding author

Abstract

3D human pose estimation (3DHPE) has become a prominent research focus in computer vision and robotics, driven by advances in deep neural networks. Unlike conventional image-based methods, video-based methods have access to rich temporal information, which allows spatiotemporal correlations to be exploited for more accurate and robust results. However, modeling both local motion continuity and long-range temporal dependencies remains a significant challenge: recurrent architectures such as LSTMs excel at capturing local dynamics but struggle to retain information over long sequences, whereas Transformer-based models provide strong global reasoning yet are less sensitive to fine-grained motion details. To address these complementary weaknesses, this work proposes a hybrid LSTM–Transformer network that integrates the local feature sensitivity of LSTMs with the global coordination ability of Transformers. Multiple fusion strategies are investigated to evaluate their impact on 3DHPE performance. Comprehensive experiments on benchmark datasets show that the proposed Single-branch LSTM+Transformer architecture achieves the most competitive results, with a mean per-joint position error (MPJPE) of 49.51 mm, outperforming both Transformer-only and LSTM-only models. This hybrid framework offers a novel and effective paradigm for future research in video-based human motion understanding.
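For readers unfamiliar with the metric quoted above, MPJPE is commonly defined as the Euclidean distance between each predicted joint and its ground-truth position, averaged over all joints and frames. The following is a minimal sketch of that standard definition, not the author's evaluation code: the function name `mpjpe`, the nested-list input layout, and the toy values are assumptions made for illustration, and any alignment step used in the paper's protocol (for example, root-joint centering) is not stated in the abstract and is therefore omitted.

```python
import math


def mpjpe(predicted, target):
    """Mean per-joint position error between two pose sequences.

    Both arguments are nested lists shaped [frames][joints][3], expressed
    in the same unit (e.g. millimetres) and the same coordinate frame.
    This layout is an assumption for the sketch, not the paper's format.
    """
    total, count = 0.0, 0
    for pred_frame, true_frame in zip(predicted, target):
        for pred_joint, true_joint in zip(pred_frame, true_frame):
            # Euclidean distance between one predicted joint and its ground truth.
            total += math.dist(pred_joint, true_joint)
            count += 1
    if count == 0:
        raise ValueError("empty pose sequence")
    return total / count


# Toy example: one frame, two joints. The first joint is exact and the
# second is off by a 3-4-5 triangle, so the per-joint errors are 0 and 5.
pred = [[[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]]
true = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
print(mpjpe(pred, true))  # 2.5
```

Under this definition, the reported 49.51 mm means that predicted joints lie, on average, about 5 cm from their ground-truth positions.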

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
