Open Access
Issue: ITM Web Conf., Volume 47, 2022
2022 2nd International Conference on Computer, Communication, Control, Automation and Robotics (CCCAR2022)
Article Number: 02039
Number of pages: 7
Section: Algorithm Optimization and Application
DOI: https://doi.org/10.1051/itmconf/20224702039
Published online: 23 June 2022
1. M. Picheny, D. Nahamoo, V. Goel, B. Kingsbury, B. Ramabhadran and S. J. Rennie, Trends and Advances in Speech Recognition, IBM Journal of Research and Development, 55, 2011, pp. 2:1–2:18.
2. H. Yang, S. Sharma, S. van Vuuren and H. Hermansky, Relevance of time-frequency features for phonetic and speaker-channel classification, Speech Communication, 31, 2000, pp. 35–50.
3. Y. Wang, J. Yang and H. Liu, Improved Bottleneck Feature using Hierarchical Deep Belief Networks for Keyword Spotting in Continuous Speech, International Journal of Signal Processing, Image Processing and Pattern Recognition, 6, 2013, pp. 375–386.
4. D. Horii, A. Ito and T. Nose, Analysis of Feature Extraction by Convolutional Neural Network for Speech Emotion Recognition, 2021 IEEE 10th Global Conference on Consumer Electronics (GCCE), 2021, pp. 425–426.
5. J. Yi, H. Ni, Z. Wen and J. Tao, Improving BLSTM RNN based Mandarin speech recognition using accent dependent bottleneck features, 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016, pp. 1–5.
6. A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel and A. Vaswani, Bottleneck Transformers for Visual Recognition, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 16514–16524.
7. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser and I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
8. L. Dong, S. Xu and B. Xu, Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5884–5888.
9. N. S. Mamatov, N. A. Niyozmatova, S. S. Abdullaev, A. N. Samijonov and K. K. Erejepov, Speech Recognition Based on Transformer Neural Networks, 2021 International Conference on Information Science and Communications Technologies (ICISCT), 2021, pp. 1–5.
10. S. Kong, M. Kim, L. M. Hoang and E. Kim, Automatic LPI radar waveform recognition using CNN, IEEE Access, 2018, pp. 4207–4219.
11. T. Lin, Y. Wang, X. Liu and X. Qiu, A Survey of Transformers, 2021, arXiv:2106.04554.
12. P. Ma, S. Petridis and M. Pantic, End-To-End Audio-Visual Speech Recognition with Conformers, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 7613–7617.
13. A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, et al., Conformer: Convolution-augmented transformer for speech recognition, Interspeech, 2020, pp. 5036–5040.
14. A. Jansen and P. Niyogi, Point Process Models for Spotting Keywords in Continuous Speech, IEEE Transactions on Audio, Speech, and Language Processing, 17(8), 2009, pp. 1457–1470.
15. A. Jansen, Point Process Models for Event-Based Speech Recognition, Speech Communication, 51(12), 2009, pp. 1155–1168.
