Audio-Lyrics Multimodal Fusion for Music Genre Clustering with Dynamic Modality Weighting

Open Access

Issue		ITM Web Conf. Volume 84, 2026
		2026 International Conference on Advent Trends in Computational Intelligence and Data Science (ATCIDS 2026)


Article Number		03025
Number of page(s)		10
Section		Large Language Models, Generative AI, and Multimodal Learning
DOI		https://doi.org/10.1051/itmconf/20268403025
Published online		06 April 2026

