---
language: en
license: mit
tags:
- audio
- deepfake-detection
- lora
- speech
base_model: Speech-Arena-2025/DF_Arena_1B_V_1
library_name: peft
---
# VoxGuard LoRA - Deepfake Speech Detection
LoRA adapter for detecting AI-generated (deepfake) speech, fine-tuned on top of [Speech-Arena-2025/DF_Arena_1B_V_1](https://huggingface.co/Speech-Arena-2025/DF_Arena_1B_V_1) (1.15B parameters).
## Model Details
| | |
|---|---|
| **Base model** | Speech-Arena-2025/DF_Arena_1B_V_1 |
| **Method** | LoRA (Low-Rank Adaptation) |
| **LoRA config** | r=8, alpha=16, dropout=0.1, target_modules="all-linear" |
| **Trainable params** | ~10M / 1.15B (0.86%) |
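
As a rough illustration of why so few parameters are trainable, the sketch below counts what LoRA adds to a single linear layer at rank 8. The 1024 x 1024 layer size is a hypothetical value chosen for the arithmetic, not a dimension taken from the base model.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA trains A (r x d_in) and B (d_out x r) beside a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


r, alpha = 8, 16          # values from the LoRA config row above
scaling = alpha / r       # the low-rank update B @ A is multiplied by alpha / r

# Hypothetical layer size, used only to make the arithmetic concrete.
d_in = d_out = 1024
added = lora_added_params(d_in, d_out, r)
frozen = d_in * d_out

print(f"added={added}, frozen={frozen}, ratio={added / frozen:.4f}, scaling={scaling}")
```

In PEFT, the configuration in the table corresponds to `LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, target_modules="all-linear")`.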
## Training Data
- **Real speech:** LibriSpeech samples (280+ unique speakers across clean and other subsets)
- **Fake speech:** 10,000+ samples generated with [Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS) voice cloning via the Replicate API
- **Augmentation:** Phone-call audio degradation (codec, noise, band-pass, clipping, reverb, packet loss)
- **Dataset:** [gereon/voxguard-synthetic-speech](https://huggingface.co/datasets/gereon/voxguard-synthetic-speech)
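
To make the augmentation item more concrete, here is a minimal sketch of three of the listed degradations (noise, clipping, packet loss) on a mono signal held as a list of floats. It is not the pipeline used for training: the function names, the 20 dB SNR, the 0.5 clip threshold, the 160-sample frame, and the 5% loss rate are all assumptions for illustration, and codec, band-pass, and reverb are left out.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_sigma = math.sqrt(signal_power / (10 ** (snr_db / 10)))
    return [s + rng.gauss(0.0, noise_sigma) for s in samples]


def hard_clip(samples, threshold):
    """Clamp every sample to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in samples]


def drop_packets(samples, frame_len, loss_prob, rng):
    """Zero out whole frames to imitate lost network packets."""
    out = list(samples)
    for start in range(0, len(out), frame_len):
        if rng.random() < loss_prob:
            end = min(start + frame_len, len(out))
            out[start:end] = [0.0] * (end - start)
    return out


def degrade(samples, rng):
    """Chain the three degradations; all parameter values are illustrative."""
    noisy = add_noise(samples, snr_db=20.0, rng=rng)
    clipped = hard_clip(noisy, threshold=0.5)
    return drop_packets(clipped, frame_len=160, loss_prob=0.05, rng=rng)


rng = random.Random(0)  # fixed seed so the example is repeatable
sample_rate = 8000
tone = [0.8 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
degraded = degrade(tone, rng)
```

A real pipeline would more likely operate on NumPy arrays or tensors; plain lists keep the sketch dependency-free.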
## Results
### Augmented Model (root - recommended)
| Metric | Baseline | Val | Test |
|--------|----------|-----|------|
| Accuracy | 77.5% | 99.2% | 100% |
| F1 | 0.794 | 0.992 | 1.000 |

Trained for 20 epochs (8.7 hrs) with phone-call audio augmentation on 6K samples.
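
For reference, the accuracy and F1 metrics in these tables can be computed as in the sketch below. It assumes binary labels with `1` (fake) as the positive class; the card does not say which class is treated as positive, and the label lists are made-up examples, not model outputs.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, with F1 taken on the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Made-up labels: 1 = fake, 0 = real (an assumed encoding).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, f1)
```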
### Non-Augmented Model (`non-augmented/`)
| Metric | Baseline | Val | Test |
|--------|----------|-----|------|
| Accuracy | 97.5% | 100% | 100% |
| F1 | 0.976 | 1.000 | 1.000 |

Trained on 2K samples; early stopping halted training at epoch 14 of 20, with the best result at epoch 2.

> Note: The overfit is intentional: the goal is to learn the signatures of new deepfake models while preserving the base model's generalizability.
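
The early-stopping behaviour reported for this run can be sketched as a patience rule. The card does not state the actual criterion: the patience of 12 epochs and the validation losses below are assumptions, chosen only because they reproduce "stopped at epoch 14, best at epoch 2".

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), with epochs counted from 1.

    Training stops once `patience` epochs pass without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses for a 20-epoch budget: best at epoch 2, flat afterwards.
val_losses = [0.30, 0.10] + [0.12] * 18
stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=12)
print(stop_epoch, best_epoch)
```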
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load the frozen base classifier, then attach the augmented LoRA adapter from the repo root.
base_model = AutoModelForAudioClassification.from_pretrained("Speech-Arena-2025/DF_Arena_1B_V_1")
model = PeftModel.from_pretrained(base_model, "gereon/voxguard-lora")

# The feature extractor is loaded from the base model and prepares raw audio for it.
feature_extractor = AutoFeatureExtractor.from_pretrained("Speech-Arena-2025/DF_Arena_1B_V_1")

# For the non-augmented variant:
# model = PeftModel.from_pretrained(base_model, "gereon/voxguard-lora", subfolder="non-augmented")
```
## Related
- **Dataset:** [gereon/voxguard-synthetic-speech](https://huggingface.co/datasets/gereon/voxguard-synthetic-speech) - 10K+ synthetic speech samples
- **Code:** [gereonelvers/voxguard](https://github.com/gereonelvers/voxguard)