Phoneme-Based Audio-Visual Face-Forgery Detector
Trained weights for the detector in "On Phoneme-Based Audio-Visual Face Forgery Detection". The detector is an unweighted-average ensemble of three models operating on phoneme-aligned articulatory, frequency, and noise-residual cues.
Code, feature extraction, and the inference CLI: https://github.com/vlaght/dslp-ensemble-study
License / usage. Weights released under CC-BY-NC-4.0 (non-commercial). Use is also subject to the licenses of the source datasets (FakeAVCeleb, DeepSpeak v2, TalkVid-Bench); research use only.
Components
| Model | Architecture | Input features |
|---|---|---|
| DSLP | dual-stream phoneme-aligned LSTM (2-layer, hidden 256) + learned phoneme embedding (dim 8), mean-pooled, MLP fusion; Focal loss | 58 pruned visual+audio features → 144 visual / 88 audio dims; 139 phoneme classes |
| TAFreq | 2-layer BiLSTM (hidden 128) + soft attention pooling; BCE | 36 frequency-domain features (14 DCT + 22 STFT) → 216 dims |
| TANoise | same as TAFreq | 19 noise-residual features (Laplacian / DoG / quadrant) → 114 dims |
| Ensemble | unweighted mean of the three sigmoid probabilities | — |
Per-video feature expansion: delta (×2) + temporal statistics (std for DSLP, mean+std for TAFreq/TANoise). Output: P(fake) ∈ [0,1], decision at 0.5.
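The dimensions in the table and the ensemble rule can be checked with a few lines of arithmetic. The sketch below assumes one reading of the expansion that reproduces the stated sizes: the delta step doubles the column count, after which DSLP appends one statistic per column and TAFreq/TANoise append two. The function names are hypothetical and are not taken from the repository. Whether a score of exactly 0.5 counts as fake is also an assumption.

```python
def expanded_dim(n_features: int, n_stats: int) -> int:
    """Column count after delta expansion (x2) and temporal statistics.

    Assumed reading: each column is kept and n_stats statistics are
    appended per column, so the multiplier is 2 * (1 + n_stats).
    """
    return n_features * 2 * (1 + n_stats)


def ensemble_p_fake(p_dslp: float, p_tafreq: float, p_tanoise: float) -> float:
    """Unweighted mean of the three sigmoid probabilities."""
    return (p_dslp + p_tafreq + p_tanoise) / 3.0


def decide(p_fake: float, threshold: float = 0.5) -> str:
    """Threshold P(fake) at 0.5; treating a tie as 'fake' is an assumption."""
    return "fake" if p_fake >= threshold else "real"


print(expanded_dim(58, n_stats=1))  # 232 = 144 visual + 88 audio (DSLP)
print(expanded_dim(36, n_stats=2))  # 216 (TAFreq)
print(expanded_dim(19, n_stats=2))  # 114 (TANoise)
print(decide(ensemble_p_fake(0.9, 0.6, 0.3)))
```

Under the same reading, the 144 / 88 split corresponds to 36 visual and 22 audio base features, which sum to the 58 pruned features in the table.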
Results
10-fold stratified cross-validation.
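For readers unfamiliar with the protocol, here is a minimal, dependency-free sketch of stratified k-fold splitting. It is a generic illustration, not the repository's code: the function names, the seed, and the round-robin assignment are assumptions, and the source does not say whether folds were grouped by video, identity, or anything else.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign every sample index to one of k folds, class by class,
    so each fold keeps roughly the overall real/fake ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds


def cv_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Toy labels: 30 real (0) and 70 fake (1) samples.
toy_labels = [0] * 30 + [1] * 70
for train, test in cv_splits(toy_labels, k=10):
    n_real = sum(1 for i in test if toy_labels[i] == 0)
    print(len(train), len(test), n_real)
```

Each sample lands in exactly one test fold, so every video is scored once by a model that did not see it during training.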
Ensemble
| Dataset | AUC | F1 | Accuracy |
|---|---|---|---|
| FakeAVCeleb (21,544 videos) | 0.9593 | 0.9599 | 0.9248 |
| DeepSpeak v2 (16,465 videos) | 0.9984 | 0.9783 | 0.9810 |
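AUC is computed from the ranking of the scores, while F1 and accuracy depend on the 0.5 decision threshold, which is why the three columns can differ noticeably. The sketch below shows the standard definitions in plain Python. Treating "fake" as the positive class for F1 is an assumption, as are the function names. How the reported numbers were aggregated across folds is not stated here.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random fake (1) outscores a random
    real (0); ties count as one half. O(n^2), fine for a small example."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_and_accuracy(labels, scores, threshold=0.5):
    """F1 (positive class = fake, an assumption) and accuracy at a threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return f1, (tp + tn) / len(labels)


labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2]
f1, acc = f1_and_accuracy(labels, scores)
print(round(roc_auc(labels, scores), 4))  # 0.8333
print(round(f1, 4), round(acc, 4))        # 0.6667 0.6
```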
Per-component AUC
| Model | FakeAVCeleb | DeepSpeak v2 |
|---|---|---|
| DSLP | 0.8606 | 0.9579 |
| TAFreq | 0.9674 | 0.9954 |
| TANoise | 0.8100 | 0.9356 |
| Ensemble | 0.9593 | 0.9984 |
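FakeAVCeleb is harder because many of its forgeries leave mouth motion largely intact. Note that on FakeAVCeleb TAFreq alone (0.9674) scores above the unweighted ensemble (0.9593), whereas on DeepSpeak v2 the ensemble is the best of the four rows.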
Files
| File | Contents |
|---|---|
| dslp.pth / dslp_artifacts.pkl | DSLP weights + scaler, phoneme encoder, feature cols, visual/audio split |
| tafreq.pth / tafreq_artifacts.pkl | TAFreq weights + scaler, col means, input dim |
| tanoise.pth / tanoise_artifacts.pkl | TANoise weights + scaler, col means, input dim |
| ensemble_manifest.json | dataset, video count, phoneme count, ensemble rule |
Training data
The released weights were trained on all datasets combined (--dataset ALL): FakeAVCeleb v1.2, an augmented set of authentic YouTube clips, TalkVid-Bench, and DeepSpeak v2. The combined set contains 24,767 videos with complete phoneme + frequency + noise features. Training used the full set with a 90/10 train/validation split for early stopping. There is no held-out test set for these weights; the cross-validated results are in the paper.
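The 90/10 split with early stopping can be sketched as follows. This is a generic illustration under stated assumptions: the patience value, the use of validation loss as the monitored quantity, the random (unstratified) split, and all names are hypothetical, since the card does not specify them.

```python
import random


def train_val_split(n_samples, val_fraction=0.1, seed=0):
    """Shuffle sample indices and hold out val_fraction of them for validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_val = int(round(n_samples * val_fraction))
    return idx[n_val:], idx[:n_val]


class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience  # hypothetical value, not from the source
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


train_idx, val_idx = train_val_split(24767)
print(len(train_idx), len(val_idx))  # 22290 2477

stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate([1.0, 0.8, 0.9, 0.85, 0.95]):
    if stopper.step(loss):
        print("stop at epoch", epoch)  # stop at epoch 4
        break
```

Because the validation split is used to choose the stopping point, it cannot double as an unbiased test set, which is consistent with the card pointing to the cross-validated results instead.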
Usage
See the repository for the CLI. With these files in trained/final/:
python cli/classify_ensemble.py --video path/to/video.mp4 # full detector
python cli/classify_dslp.py --video path/to/video.mp4 # single component
Live feature extraction uses MediaPipe FaceLandmarker and a frozen wav2vec 2.0 phoneme
recogniser (facebook/wav2vec2-lv-60-espeak-cv-ft), both fetched automatically.
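To score a folder of clips, the ensemble CLI can be wrapped in a small loop. The sketch below only uses the script path and the --video flag shown above; the helper names and the *.mp4 pattern are hypothetical, and since the CLI's output format is not documented here the wrapper just captures stdout as text. It is meant to be run from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_command(video, script="cli/classify_ensemble.py"):
    """Command line for one video, using the same Python interpreter."""
    return [sys.executable, script, "--video", str(video)]


def classify_directory(video_dir, pattern="*.mp4"):
    """Run the ensemble CLI on every matching file; collect exit code and stdout."""
    results = {}
    for video in sorted(Path(video_dir).glob(pattern)):
        proc = subprocess.run(build_command(video), capture_output=True, text=True)
        results[video.name] = (proc.returncode, proc.stdout.strip())
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    for name, (code, out) in classify_directory(sys.argv[1]).items():
        # A non-zero exit code may indicate failed extraction (e.g. no face or no speech).
        print(name, code, out)
```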
Limitations
- Trained on the listed datasets; generalisation to unseen generators/domains not guaranteed.
- Needs a visible speaking face and audible speech (phoneme alignment); silent or no-face clips fail extraction.
- Research artifact, not a production forensic tool.
Citation
@misc{boiko_phoneme_av_2026,
title = {On Phoneme-Based Audio-Visual Face Forgery Detection},
author = {Boiko, Vladislav},
year = {2026}
}
Datasets
This model was trained on the datasets below. If you use these weights, please cite them (their licenses require attribution; use is research-only / non-commercial):
@inproceedings{khalid_fakeavceleb_2021,
title = {FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset},
author = {Khalid, Hasam and Tariq, Shahroz and Kim, Minha and Woo, Simon S.},
booktitle = {Thirty-fifth Conference on Neural Information Processing Systems
Datasets and Benchmarks Track (Round 2)},
year = {2021},
url = {https://openreview.net/forum?id=TAXFsg6ZaOl}
}
@misc{barrington_deepspeak_2025,
title = {The DeepSpeak Dataset},
author = {Barrington, Sarah and Bohacek, Matyas and Farid, Hany},
year = {2025},
publisher = {arXiv},
doi = {10.48550/arXiv.2408.05366},
url = {http://arxiv.org/abs/2408.05366}
}
@misc{chen_talkvid_2025,
title = {TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking
Head Synthesis},
author = {Chen, Shunian and Huang, Hejin and Liu, Yexin and Ye, Zihan and Chen,
Pengcheng and Zhu, Chenghao and Guan, Michael and Wang, Rongsheng and
Chen, Junying and Li, Guanbin and Lim, Ser-Nam and Yang, Harry and
Wang, Benyou},
year = {2025},
publisher = {arXiv},
doi = {10.48550/arXiv.2508.13618},
url = {http://arxiv.org/abs/2508.13618}
}
The augmented set of authentic clips is sourced from YouTube; the URL list is in the code repository (subject to YouTube Terms of Service).