The rapid advancement of AI-generated content has made deepfakes increasingly realistic, posing serious risks to identity security, social trust, and public and democratic institutions. Existing detection systems, typically focused on a single modality such as video or audio, often fail to generalize to new manipulation techniques and cannot effectively detect hybrid or low-effort deepfakes. In this perspective letter, we advocate for a new paradigm in deepfake detection that emphasizes the integration of audio, video, and textual content. We examine the limitations of current systems, including their over-reliance on outdated datasets and their limited adversarial robustness. We outline the technical motivations for integrating these modalities and highlight emerging research directions. By aligning detection strategies with the multimodal nature of AI-driven manipulation, we call for a new generation of systems that are more generalizable and trustworthy.
Future-proofing deepfake detection by integrating audio, video, and text
ACM AI Letters, February 2026
Type: Journal
Date: 2026-02-17
Department: Data Science
Eurecom Ref: 8639
Copyright:
© ACM, 2026. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM AI Letters, February 2026 https://doi.org/10.1145/3797958
PERMALINK: https://www.eurecom.fr/publication/8639