Open Access

|  |  |
|---|---|
| Issue | ITM Web Conf. Volume 78, 2025: International Conference on Computer Science and Electronic Information Technology (CSEIT 2025) |
| Article Number | 04005 |
| Number of page(s) | 11 |
| Section | Foundations and Frontiers in Multimodal AI, Large Models, and Generative Technologies |
| DOI | https://doi.org/10.1051/itmconf/20257804005 |
| Published online | 08 September 2025 |