---
language: en
license: mit
tags:
  - audio
  - deepfake-detection
  - lora
  - speech
base_model: Speech-Arena-2025/DF_Arena_1B_V_1
library_name: peft
---

# VoxGuard LoRA - Deepfake Speech Detection

LoRA adapter for detecting AI-generated (deepfake) speech, fine-tuned on top of [Speech-Arena-2025/DF_Arena_1B_V_1](https://huggingface.co/Speech-Arena-2025/DF_Arena_1B_V_1) (1.15B parameters).

## Model Details

| | |
|---|---|
| **Base model** | Speech-Arena-2025/DF_Arena_1B_V_1 |
| **Method** | LoRA (Low-Rank Adaptation) |
| **LoRA config** | r=8, alpha=16, dropout=0.1, target_modules="all-linear" |
| **Trainable params** | ~10M / 1.15B (0.86%) |
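
The ~0.86% trainable fraction follows directly from how LoRA factorizes each linear layer: instead of updating a `d_out × d_in` weight matrix, it trains two low-rank factors of shapes `d_out × r` and `r × d_in`. A minimal sketch of the parameter arithmetic (the layer dimensions below are illustrative, not the base model's actual shapes):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (frozen, trainable) parameter counts for one linear layer
    when LoRA adds the low-rank update B @ A, with A of shape (r, d_in)
    and B of shape (d_out, r)."""
    frozen = d_in * d_out           # original weight stays frozen
    trainable = r * (d_in + d_out)  # low-rank factors A and B
    return frozen, trainable

# Illustrative example: a 1280x1280 projection with r=8
frozen, trainable = lora_param_count(1280, 1280, 8)
print(frozen, trainable, trainable / frozen)  # 1638400 20480 0.0125
```

With `target_modules="all-linear"`, this per-layer overhead is applied to every linear projection in the base model, which is how the total lands near 10M trainable parameters.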

## Training Data

- **Real speech:** LibriSpeech samples (280+ unique speakers across clean and other subsets)
- **Fake speech:** 10,000+ samples generated with [Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS) voice cloning via Replicate API
- **Augmentation:** Phone-call audio degradation (codec, noise, band-pass, clipping, reverb, packet loss)
- **Dataset:** [gereon/voxguard-synthetic-speech](https://huggingface.co/datasets/gereon/voxguard-synthetic-speech)
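
The phone-call degradations listed above can be approximated in a few lines. A rough sketch in pure NumPy (not the project's actual augmentation pipeline): band-pass filtering to the ~300–3400 Hz telephone band, additive noise, hard clipping, and packet loss on a 16 kHz waveform:

```python
import numpy as np

def phone_call_augment(wav: np.ndarray, sr: int = 16000,
                       noise_db: float = -30.0, clip: float = 0.5,
                       drop_prob: float = 0.05, seed: int = 0) -> np.ndarray:
    """Rough phone-call degradation: band-pass, noise, clipping, packet loss."""
    rng = np.random.default_rng(seed)
    # Band-pass to the telephone band (~300-3400 Hz) via FFT masking.
    spec = np.fft.rfft(wav)
    freqs = np.fft.rfftfreq(len(wav), d=1.0 / sr)
    spec[(freqs < 300) | (freqs > 3400)] = 0.0
    out = np.fft.irfft(spec, n=len(wav))
    # Additive white noise at a fixed level relative to full scale.
    out = out + rng.normal(0.0, 10 ** (noise_db / 20), size=len(wav))
    # Hard clipping, as from an overdriven microphone or lossy codec.
    out = np.clip(out, -clip, clip)
    # Packet loss: zero out random 20 ms frames.
    frame = int(0.02 * sr)
    for start in range(0, len(out) - frame, frame):
        if rng.random() < drop_prob:
            out[start:start + frame] = 0.0
    return out

# Example: degrade one second of a 440 Hz tone
t = np.linspace(0, 1, 16000, endpoint=False)
degraded = phone_call_augment(0.8 * np.sin(2 * np.pi * 440 * t))
```

Codec compression and reverb are harder to fake convincingly in NumPy; a real pipeline would typically use torchaudio or ffmpeg for those stages.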

## Results

### Augmented Model (root - recommended)

| Metric | Baseline | Val | Test |
|--------|----------|-----|------|
| Accuracy | 77.5% | 99.2% | 100% |
| F1 | 0.794 | 0.992 | 1.000 |

Trained for 20 epochs (8.7 hrs) with phone-call audio augmentation on 6K samples.

### Non-Augmented Model (`non-augmented/`)

| Metric | Baseline | Val | Test |
|--------|----------|-----|------|
| Accuracy | 97.5% | 100% | 100% |
| F1 | 0.976 | 1.000 | 1.000 |

Early-stopped at epoch 14/20 (best at epoch 2) on 2K samples.

> Note: This overfitting is intentional: the goal is to preserve the base model's generalization ability while the adapter learns the signatures of new deepfake models.
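
For reference, the F1 scores above combine precision and recall over the binary real/fake decision. A quick sketch of how accuracy and F1 are computed from a confusion matrix (the counts below are illustrative, not the actual evaluation results):

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Accuracy and F1 for a binary (fake = positive) confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Illustrative counts, not the evaluation's actual confusion matrix
acc, f1 = binary_metrics(tp=95, fp=5, fn=3, tn=97)
print(round(acc, 3), round(f1, 3))  # 0.96 0.96
```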

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

base_model = AutoModelForAudioClassification.from_pretrained("Speech-Arena-2025/DF_Arena_1B_V_1")
model = PeftModel.from_pretrained(base_model, "gereon/voxguard-lora")
feature_extractor = AutoFeatureExtractor.from_pretrained("Speech-Arena-2025/DF_Arena_1B_V_1")

# For the non-augmented variant:
# model = PeftModel.from_pretrained(base_model, "gereon/voxguard-lora", subfolder="non-augmented")
```
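
The loaded model returns classification logits; turning them into a real/fake decision is a softmax plus argmax. A minimal sketch of that postprocessing (the id-to-label mapping below is an assumption — check the base model's `config.id2label` for the actual mapping):

```python
import numpy as np

def logits_to_prediction(logits: np.ndarray,
                         id2label: dict[int, str]) -> tuple[str, float]:
    """Softmax the logits and return (predicted label, confidence)."""
    z = logits - np.max(logits)            # shift for numerical stability
    probs = np.exp(z) / np.sum(np.exp(z))
    idx = int(np.argmax(probs))
    return id2label[idx], float(probs[idx])

# Hypothetical two-class head; verify against model.config.id2label.
label, conf = logits_to_prediction(np.array([0.3, 2.1]),
                                   {0: "real", 1: "fake"})
print(label, round(conf, 3))  # fake 0.858
```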

## Related

- **Dataset:** [gereon/voxguard-synthetic-speech](https://huggingface.co/datasets/gereon/voxguard-synthetic-speech) - 10K+ synthetic speech samples
- **Code:** [gereonelvers/voxguard](https://github.com/gereonelvers/voxguard)