Handwritten input to speech conversion using transfer learning

Open Access

Issue: ITM Web Conf., Volume 81, 2026, International Conference on Emerging Technologies for Multidisciplinary Innovation and Sustainability (ETMIS 2025)
Article Number: 01008
Number of page(s): 15
DOI: https://doi.org/10.1051/itmconf/20268101008
Published online: 23 January 2026

