Deepfake Detection: A Multimodal Survey

Open Access

Issue		ITM Web Conf. Volume 78, 2025 International Conference on Computer Science and Electronic Information Technology (CSEIT 2025)


Article Number		02027
Number of page(s)		18
Section		Machine Learning Applications in Vision, Security, and Healthcare
DOI		https://doi.org/10.1051/itmconf/20257802027
Published online		08 September 2025

