---
language: en
license: mit
tags:
- audio
- deepfake-detection
- lora
- speech
base_model: Speech-Arena-2025/DF_Arena_1B_V_1
library_name: peft
---

# VoxGuard LoRA - Deepfake Speech Detection

LoRA adapter for detecting AI-generated (deepfake) speech, fine-tuned on top of [Speech-Arena-2025/DF_Arena_1B_V_1](https://huggingface.co/Speech-Arena-2025/DF_Arena_1B_V_1) (1.15B parameters).

## Model Details

| | |
|---|---|
| **Base model** | Speech-Arena-2025/DF_Arena_1B_V_1 |
| **Method** | LoRA (Low-Rank Adaptation) |
| **LoRA config** | r=8, alpha=16, dropout=0.1, target_modules="all-linear" |
| **Trainable params** | ~10M / 1.15B (0.86%) |

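To give a feel for why the trainable share in the table is so small: a rank-r LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds two low-rank factors with `r * (d_in + d_out)` weights in total, and the low-rank update is scaled by `alpha / r`. The sketch below works this out in plain Python. The 1024-by-1024 layer shape is a hypothetical example, not a dimension taken from the base model.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


R, ALPHA = 8, 16  # rank and alpha from the LoRA config row in the table

# Hypothetical layer shape, for illustration only.
extra = lora_extra_params(d_in=1024, d_out=1024, r=R)
full = 1024 * 1024
scaling = ALPHA / R  # factor applied to the low-rank update B @ A

print(f"adapter weights: {extra}, full layer weights: {full}")
print(f"adapter / full: {extra / full:.2%}, scaling: {scaling}")
```

Summing this quantity over every adapted linear layer gives the adapter's total trainable parameter count.
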
## Training Data

- **Real speech:** LibriSpeech samples (280+ unique speakers across the clean and other subsets)
- **Fake speech:** 10,000+ samples generated with [Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS) voice cloning via the Replicate API
- **Augmentation:** Phone-call audio degradation (codec, noise, band-pass, clipping, reverb, packet loss)
- **Dataset:** [gereon/voxguard-synthetic-speech](https://huggingface.co/datasets/gereon/voxguard-synthetic-speech)

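The phone-call degradations listed above can be pictured with a minimal sketch. This is not the project's augmentation code: the function names, the parameter values, and the choice to show only three of the six effects (noise, clipping, packet loss) are assumptions for illustration, and it works on plain Python lists rather than real audio tensors.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise at roughly the requested signal-to-noise ratio."""
    power = sum(s * s for s in samples) / len(samples)
    noise_std = math.sqrt(power / (10 ** (snr_db / 10)))
    return [s + rng.gauss(0.0, noise_std) for s in samples]


def hard_clip(samples, limit):
    """Clamp every sample to [-limit, limit], as an overdriven line would."""
    return [max(-limit, min(limit, s)) for s in samples]


def drop_packets(samples, packet_len, loss_prob, rng):
    """Zero out whole packets of packet_len samples with probability loss_prob."""
    out = list(samples)
    for start in range(0, len(out), packet_len):
        if rng.random() < loss_prob:
            end = min(start + packet_len, len(out))
            out[start:end] = [0.0] * (end - start)
    return out


def degrade(samples, seed=0):
    """Chain the three effects; all parameter values are illustrative."""
    rng = random.Random(seed)
    noisy = add_noise(samples, snr_db=20, rng=rng)
    clipped = hard_clip(noisy, limit=0.8)
    # 160 samples is 20 ms at an 8 kHz sample rate.
    return drop_packets(clipped, packet_len=160, loss_prob=0.05, rng=rng)


# One second of a 440 Hz tone sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
degraded = degrade(tone)
```

Seeding the random generator keeps the degradation reproducible, which helps when comparing runs on the same clips.
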
## Results

### Augmented Model (root - recommended)

| Metric | Baseline | Val | Test |
|--------|----------|-----|------|
| Accuracy | 77.5% | 99.2% | 100% |
| F1 | 0.794 | 0.992 | 1.000 |

Trained for 20 epochs (8.7 hours) on 6K samples with phone-call audio augmentation.

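For reference, accuracy and F1 figures like those in the table are typically derived from confusion-matrix counts. The sketch below assumes that "fake" is the positive class and that F1 is the usual binary F1; the model card does not state either, and the toy labels are made up for illustration.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, binary F1) for two equal-length label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); returned as 0.0 when the denominator is zero.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels (1 = fake, 0 = real), not taken from the evaluation set.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```
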
### Non-Augmented Model (`non-augmented/`)

| Metric | Baseline | Val | Test |
|--------|----------|-----|------|
| Accuracy | 97.5% | 100% | 100% |
| F1 | 0.976 | 1.000 | 1.000 |

Trained on 2K samples and early-stopped at epoch 14 of 20 (best at epoch 2).

> Note: This is an intentional overfit. The goal is to maintain the base model's generalizability while learning the signatures of new deepfake models.

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

BASE_ID = "Speech-Arena-2025/DF_Arena_1B_V_1"

# Load the base model and its matching feature extractor.
base_model = AutoModelForAudioClassification.from_pretrained(BASE_ID)
feature_extractor = AutoFeatureExtractor.from_pretrained(BASE_ID)

# Attach the augmented adapter from the repository root (recommended).
model = PeftModel.from_pretrained(base_model, "gereon/voxguard-lora")
model.eval()  # disable dropout for inference

# For the non-augmented variant:
# model = PeftModel.from_pretrained(base_model, "gereon/voxguard-lora", subfolder="non-augmented")
```

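Once a model returns class logits, they still have to be turned into a decision. The helper below is a minimal sketch of that step in plain Python, assuming the model emits one logit per class. `pick_label` and the label names `bonafide` and `spoof` are hypothetical; the real mapping should be read from the loaded model's `config.id2label`.

```python
import math


def pick_label(logits, id2label):
    """Softmax the logits and return (label, probability) for the top class."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical mapping and logits, for illustration only.
label, prob = pick_label([0.2, 2.3], {0: "bonafide", 1: "spoof"})
print(label, round(prob, 3))
```

Returning the probability alongside the label makes it straightforward to apply a stricter decision threshold than a plain argmax.
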
## Related

- **Dataset:** [gereon/voxguard-synthetic-speech](https://huggingface.co/datasets/gereon/voxguard-synthetic-speech) - 10K+ synthetic speech samples
- **Code:** [gereonelvers/voxguard](https://github.com/gereonelvers/voxguard)
