---
title: Voice Detection API
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
app_port: 8000
---

# 🎙️ VoiceGuard: AI Voice Detection API

**State-of-the-Art Deepfake Audio Detection System**

VoiceGuard is an API for detecting AI-generated speech. It combines neural network inference with traditional forensic audio analysis, so that signal-level artifacts a neural model might miss are still caught.

## 🚀 Key Features

*   **Multi-Stage Detection Pipeline**: Fuses deep learning with signal processing forensics.
*   **Explainable AI**: Provides detailed, human-readable explanations for every detection.
*   **Dual Analysis Engine**:
    *   **Neural Model**: Wav2Vec2-based classifier with attentive pooling.
    *   **Forensic Analyzers**: Spectral, Temporal, Formant, and Artifact detection.
*   **Real-time Base64 Processing**: Optimized for low-latency API integration.
*   **Audio Quality Profiling**: Automatically assesses SNR, clipping, and silence ratios.
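As an illustration of the audio quality profiling feature, the sketch below computes a clipping ratio and a silence ratio over normalised samples. It is not the project's implementation: the function name `quality_profile` and both thresholds are assumptions, and SNR is left out because the README does not say how it is estimated.

```python
from typing import Dict, Sequence


def quality_profile(
    samples: Sequence[float],
    clip_level: float = 0.999,    # assumed threshold for a clipped sample
    silence_level: float = 1e-3,  # assumed threshold for a near-silent sample
) -> Dict[str, float]:
    """Return the fraction of clipped and near-silent samples.

    `samples` are expected to be floats normalised to [-1.0, 1.0].
    """
    n = len(samples)
    if n == 0:
        return {"clipping_ratio": 0.0, "silence_ratio": 0.0}
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    silent = sum(1 for s in samples if abs(s) <= silence_level)
    return {"clipping_ratio": clipped / n, "silence_ratio": silent / n}


# Eight samples: two at full scale, three at exactly zero.
profile = quality_profile([0.0, 0.0, 0.5, -1.0, 1.0, 0.2, 0.0, -0.3])
print(profile)
```

A high clipping or silence ratio would suggest that the input is too degraded for a confident verdict, whichever way the classifiers lean.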

## 🌐 Live Demo
Experience the API instantly on Hugging Face Spaces:
**[πŸ‘‰ Try VoiceGuard Demo](https://huggingface.co/spaces/Pandaisop/voice-detection-api)**

---

## 🛠️ Installation & Setup

### Prerequisites
*   Python 3.9+
*   RAM: 4GB+ (8GB recommended for optimal performance)

### 1. Clone the Repository
```bash
git clone <your-repo-url>
cd voice-detection-api
```

### 2. Install Dependencies
```bash
# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install requirements
pip install -r requirements.txt
```

### 3. Configure Model (Optional)
By default, the API downloads a pre-trained model. To use a locally trained model instead:
1.  Ensure your model files are in the `model/` directory.
2.  Update the `.env` file:
    ```bash
    MODEL_NAME=./model
    ```
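A minimal sketch of how an application might resolve this setting at startup, assuming the `.env` values have already been loaded into the process environment. The helper `resolve_model_source` and the `DEFAULT_MODEL` placeholder are hypothetical; the README does not name the default model.

```python
import os
from pathlib import Path

# Hypothetical placeholder: the README does not name the default model.
DEFAULT_MODEL = "<default-pretrained-model-id>"


def resolve_model_source(env=None):
    """Return (source, is_local) for the configured model.

    `env` defaults to os.environ; pass a plain dict to test without
    touching the real environment.
    """
    env = os.environ if env is None else env
    source = env.get("MODEL_NAME", DEFAULT_MODEL)
    # A value that points at an existing directory is treated as a local
    # model; anything else is assumed to be a remote model identifier.
    is_local = Path(source).is_dir()
    return source, is_local


print(resolve_model_source({"MODEL_NAME": "."}))
```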

### 4. Run the API
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```
The API will be available at `http://localhost:8000`.

---

## 📖 Usage

### API Endpoint: `/detect`
*   **Method**: `POST`
*   **Content-Type**: `application/json`

#### Request Body
```json
{
  "language": "English",
  "audioFormat": "mp3",
  "audioBase64": "<base64_encoded_audio_string>"
}
```
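The request above can be built and sent from Python using only the standard library. This is a client-side sketch, not part of the project: the helper names `build_detect_payload` and `detect` are made up for illustration, and the URL assumes the local server from the setup steps.

```python
import base64
import json
import urllib.request


def build_detect_payload(audio_bytes: bytes, language: str = "English",
                         audio_format: str = "mp3") -> dict:
    """Wrap raw audio bytes in the JSON shape that /detect expects."""
    return {
        "language": language,
        "audioFormat": audio_format,
        "audioBase64": base64.b64encode(audio_bytes).decode("ascii"),
    }


def detect(payload: dict, url: str = "http://localhost:8000/detect") -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Building a payload needs no running server; the bytes here are a stand-in
# for the contents of a real audio file.
payload = build_detect_payload(b"fake-audio-bytes")
print(sorted(payload))
```

With the server running, reading a real file in binary mode and passing its bytes through `build_detect_payload` and then `detect` should return a dictionary shaped like the response example.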

#### Response Example
```json
{
  "status": "success",
  "language": "English",
  "classification": "AI_GENERATED",
  "confidenceScore": 0.98,
  "explanation": "Strong indicators of AI-generated speech detected. Evidence: unnaturally uniform spectral texture, and metronomic pause timing. Neural model and forensic analyzers are in agreement.",
  "analyzersAgree": true,
  "inferenceTimeMs": 450.2
}
```
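On the client side, the response can be parsed into a typed object. The sketch below mirrors the fields shown above; the `DetectionResult` class and the `needs_review` policy, including its 0.8 threshold, are illustrative assumptions rather than part of the API.

```python
import json
from dataclasses import dataclass


@dataclass
class DetectionResult:
    status: str
    language: str
    classification: str
    confidence_score: float
    explanation: str
    analyzers_agree: bool
    inference_time_ms: float

    @classmethod
    def from_json(cls, text: str) -> "DetectionResult":
        """Map the API's camelCase keys onto snake_case attributes."""
        data = json.loads(text)
        return cls(
            status=data["status"],
            language=data["language"],
            classification=data["classification"],
            confidence_score=float(data["confidenceScore"]),
            explanation=data["explanation"],
            analyzers_agree=bool(data["analyzersAgree"]),
            inference_time_ms=float(data["inferenceTimeMs"]),
        )

    def needs_review(self, min_confidence: float = 0.8) -> bool:
        """Assumed client policy: review on disagreement or low confidence."""
        return (not self.analyzers_agree) or self.confidence_score < min_confidence


raw = json.dumps({
    "status": "success",
    "language": "English",
    "classification": "AI_GENERATED",
    "confidenceScore": 0.98,
    "explanation": "Strong indicators of AI-generated speech detected.",
    "analyzersAgree": True,
    "inferenceTimeMs": 450.2,
})
result = DetectionResult.from_json(raw)
print(result.classification, result.needs_review())
```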

### Web UI
Navigate to `http://localhost:8000` in your browser to access the built-in testing console. You can upload audio files directly to test the detection engine.

---

## 🧠 Model Architecture & Approach

VoiceGuard uses a **Hybrid Detection Architecture** to maximize robustness.

### 1. Neural Analysis Engine
*   **Backbone**: Fine-tuned **Wav2Vec 2.0** (XLSR-53) for extracting high-level speech representations.
*   **Classification Head**: **Attentive Statistics Pooling** layer that learns to weigh important frames, followed by a dense MLP classifier.
*   **Strategy**: Analyzes multiple overlapping segments of the audio so that a clip that is only partly synthetic is still caught.
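The overlapping-segment strategy can be sketched as follows. The window sizes, the function names, and the choice to take the maximum segment score as the clip score are all assumptions made for illustration; the README does not state how per-segment scores are aggregated.

```python
from typing import Callable, List, Sequence


def overlapping_segments(samples: Sequence[float], segment_len: int,
                         hop_len: int) -> List[Sequence[float]]:
    """Split `samples` into windows of `segment_len` spaced `hop_len` apart."""
    if segment_len <= 0 or hop_len <= 0:
        raise ValueError("segment_len and hop_len must be positive")
    if len(samples) <= segment_len:
        return [samples]
    last_start = len(samples) - segment_len
    starts = list(range(0, last_start + 1, hop_len))
    # Add a final window flush with the end so the tail is never dropped.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [samples[s:s + segment_len] for s in starts]


def clip_score(samples: Sequence[float],
               score_segment: Callable[[Sequence[float]], float],
               segment_len: int, hop_len: int) -> float:
    """Score every segment and keep the most suspicious one (assumed policy)."""
    segments = overlapping_segments(samples, segment_len, hop_len)
    return max(score_segment(seg) for seg in segments)


# Toy data: ten "samples" and a stand-in scorer that flags any segment
# containing the value 7.
segments = overlapping_segments(list(range(10)), segment_len=4, hop_len=2)
score = clip_score(list(range(10)), lambda seg: 1.0 if 7 in seg else 0.0, 4, 2)
print(len(segments), score)
```

Taking the maximum means a single synthetic segment is enough to raise the clip score, at the cost of more false positives than averaging would give.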

### 2. Forensic Analysis Engine
A suite of signal processing algorithms detects artifacts that neural models might miss:
*   **Spectral Analysis**: Detects unnatural smoothness in the frequency domain (typical of vocoders).
*   **Temporal Analysis**: Identifies robotic cadence and lack of natural micro-jitter in energy.
*   **Formant Analysis**: Checks for realistic formant transitions and vocal tract consistency.
*   **Artifact Detection**: Scans for phase discontinuities, digital silence, and synthesis clicks.
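Two of the cues named above are simple enough to sketch in plain Python: runs of digital silence, and the variation of frame energy over time. The functions below are hypothetical stand-ins, not the code in `app/core/forensics.py`, and any thresholds applied to their outputs would need tuning on real data.

```python
import math
from typing import List, Sequence


def longest_zero_run(samples: Sequence[float]) -> int:
    """Length of the longest run of exactly-zero samples (digital silence)."""
    longest = current = 0
    for s in samples:
        current = current + 1 if s == 0 else 0
        longest = max(longest, current)
    return longest


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """RMS energy of consecutive, non-overlapping frames."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def energy_variation(samples: Sequence[float], frame_len: int) -> float:
    """Coefficient of variation of frame RMS; 0.0 means perfectly uniform."""
    rms = frame_rms(samples, frame_len)
    if not rms:
        return 0.0
    mean = sum(rms) / len(rms)
    if mean == 0:
        return 0.0
    variance = sum((r - mean) ** 2 for r in rms) / len(rms)
    return math.sqrt(variance) / mean


zero_run = longest_zero_run([0.1, 0, 0, 0, 0.2, 0, 0])
uniform = energy_variation([0.5, -0.5] * 8, frame_len=4)
varied = energy_variation([0.1] * 4 + [0.9] * 4, frame_len=4)
print(zero_run, uniform, varied)
```

Natural recordings rarely contain long runs of exact zeros, and their frame energy fluctuates. A long zero run or a near-zero variation is therefore a hint of synthesis, though not proof.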

### 3. Decision Fusion
The **Fusion Engine** combines the probabilistic output of the Neural Model with the weighted findings of the Forensic Analyzers.
*   **Agreement Check**: If both engines agree, confidence is boosted.
*   **Disagreement Handling**: If engines disagree, the system lowers confidence and flags the result for manual review in the explanation.
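The agreement and disagreement rules can be sketched as a small fusion function. Everything numeric here is an assumption: the boost, the penalty, the 0.5 threshold, the weights, and the `HUMAN` label (the README only shows `AI_GENERATED`). Letting the neural model decide the label is also an illustrative choice, not a documented one.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FusionResult:
    classification: str
    confidence: float
    analyzers_agree: bool


def fuse(neural_prob_ai: float, forensic_scores: Dict[str, float],
         forensic_weights: Dict[str, float], boost: float = 0.1,
         penalty: float = 0.2, threshold: float = 0.5) -> FusionResult:
    """Combine a neural probability with weighted forensic scores in [0, 1]."""
    total_weight = sum(forensic_weights[name] for name in forensic_scores)
    if total_weight <= 0:
        raise ValueError("forensic weights must sum to a positive value")
    forensic_prob = sum(
        score * forensic_weights[name] for name, score in forensic_scores.items()
    ) / total_weight

    neural_says_ai = neural_prob_ai >= threshold
    forensic_says_ai = forensic_prob >= threshold
    agree = neural_says_ai == forensic_says_ai

    # Confidence in the label chosen by the neural model (assumed policy).
    label_prob = neural_prob_ai if neural_says_ai else 1.0 - neural_prob_ai
    if agree:
        confidence = min(1.0, label_prob + boost)
    else:
        confidence = max(0.0, label_prob - penalty)

    label = "AI_GENERATED" if neural_says_ai else "HUMAN"  # "HUMAN" is assumed
    return FusionResult(label, confidence, agree)


weights = {"spectral": 0.6, "temporal": 0.4}
agreeing = fuse(0.9, {"spectral": 0.8, "temporal": 0.7}, weights)
conflicting = fuse(0.9, {"spectral": 0.2, "temporal": 0.2}, weights)
print(agreeing, conflicting)
```

In the conflicting case the label is unchanged but the confidence drops, and `analyzers_agree` is the flag a caller would surface for manual review.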

---

## 🧪 Development

### Running Tests
```bash
pytest
```

### Project Structure
*   `app/main.py`: FastAPI entry point and route definitions.
*   `app/core/model.py`: Neural model inference logic.
*   `app/core/forensics.py`: Signal processing and forensic analyzers.
*   `app/core/explanation.py`: Logic for generating human-readable explanations.
*   `trainer/`: Scripts used for training and evaluating the model.

---

## 📄 License
MIT License. See `LICENSE` for more information.