ViT VO - A Visual Odometry technique Using CNN-Transformer Hybrid Architecture

Jayaraj P. B; Ebin J; Karthik R; Pournami P N

doi:10.1051/itmconf/20235401004

Open Access

Issue		ITM Web Conf. Volume 54, 2023 2nd International Conference on Advances in Computing, Communication and Security (I3CS-2023)


Article Number		01004
Number of page(s)		8
Section		Computing
DOI		https://doi.org/10.1051/itmconf/20235401004
Published online		04 July 2023

ITM Web of Conferences 54, 01004 (2023)

Jayaraj P. B¹*, Ebin J¹, Karthik R² and Pournami P N¹

¹ National Institute of Technology Calicut, India
² SED/ISG, Advanced Inertial Systems, ISRO Inertial Systems Unit, Thiruvananthapuram, Kerala, India

Abstract

Localization is one of the main tasks in the operation of autonomous agents (e.g., vehicles and robots). It allows them to track their paths and to detect and avoid obstacles. Visual Odometry (VO) is one of the techniques used for agent localization: it estimates the motion of an agent from the images taken by cameras attached to it. Conventional VO algorithms require specific workarounds for challenges posed by the working environment and the captured sensor data. Deep learning approaches, on the other hand, have shown tremendous efficiency and accuracy in tasks that require a high degree of adaptability and scalability. In this work, a novel deep learning model is proposed to perform VO tasks for space robotic applications. The model consists of an optical flow estimation module, which abstracts away scene-specific details from the input video sequence and produces an intermediate representation. The CNN module that follows learns relative poses from the optical flow estimates. The final module is a state-of-the-art Vision Transformer, which learns absolute poses from the relative poses learnt by the CNN module. The model is trained on the KITTI dataset and has obtained a promising accuracy of approximately 2%. It has outperformed the baseline model, MagicVO, on a few sequences in the dataset.
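
To make the distinction between relative and absolute poses concrete, the following is a minimal sketch of the classical geometric relation between them: each relative pose is a homogeneous transform from one frame to the next, and chaining these transforms gives the absolute pose of every frame. This is not the learned Vision Transformer module described in the abstract. The helper names `relative_pose` and `accumulate` are hypothetical, and motion is assumed to be planar for simplicity.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def relative_pose(yaw, dx, dy):
    """Hypothetical helper: 4x4 transform of frame k expressed in frame k-1.

    Assumes planar motion: a rotation by `yaw` (radians) about the vertical
    axis and a translation (dx, dy) measured in frame k-1.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, dx],
            [s, c, 0.0, dy],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def accumulate(relative_poses):
    """Chain relative poses into absolute poses, starting from the identity."""
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    trajectory = [identity]
    for rel in relative_poses:
        # Absolute pose of frame k = absolute pose of frame k-1 times T_(k-1,k).
        trajectory.append(matmul4(trajectory[-1], rel))
    return trajectory


if __name__ == "__main__":
    # Illustrative input only: move one unit forward and turn left by
    # 90 degrees, four times, which traces a closed square.
    steps = [relative_pose(math.pi / 2, 1.0, 0.0)] * 4
    for pose in accumulate(steps):
        print(round(pose[0][3], 6), round(pose[1][3], 6))
```

Because every absolute pose is a product of all preceding relative poses, small errors in the relative estimates accumulate along the trajectory. This drift is one reason a sequence-level module operating on the relative poses can be useful.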

Key words: Visual Odometry / Deep Learning / Optical Flow / Convolutional Neural Networks / Generative Adversarial Networks / Sequence-based Models

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.