---
title: VoiceGuard API
emoji: 🛡️
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 7860
---

# AI-Generated Voice Detector API

A REST API that classifies a voice recording as **AI-generated** or **Human**.  
Built for the **AI-Generated Voice Detection Challenge** with specific support for **Tamil, English, Hindi, Malayalam, and Telugu**.

---

## 🚀 Features

- **Multilingual Support**: Uses the state-of-the-art **MMS-300M (Massively Multilingual Speech)** model (`nii-yamagishilab/mms-300m-anti-deepfake`) derived from **XLS-R**, supporting 100+ languages including Indic languages.
- **Strict API Specification**: Compliant with challenge requirements (Base64 MP3 input, standardized JSON response).
- **Smart Hybrid Detection**: Combines Deep Learning embeddings with **Acoustic Heuristics** (Pitch, Flatness, Liveness) for "Conservative Consensus" detection.
- **Explainability**: Provides human-readable explanations for every decision.
- **Secure**: Protected via `x-api-key` header authentication.

---

## 🛠️ Tech Stack

- **Framework**: FastAPI (Python)
- **Model**: PyTorch + HuggingFace Transformers (`nii-yamagishilab/mms-300m-anti-deepfake`)
- **Toolkit**: **SpeechBrain** (installed in the environment for advanced audio processing)
- **Audio Processing**: `pydub` (ffmpeg) + `librosa`
- **Deployment**: Uvicorn

---

## 📥 Installation

### 1. Prerequisites
- **Python 3.8+**
- **FFmpeg**: Required by `pydub` for audio decoding.
  - **Linux**: `sudo apt install ffmpeg`
  - **macOS**: `brew install ffmpeg`
  - **Windows**: [Download here](https://ffmpeg.org/download.html) and add it to your `PATH`.

### 2. Setup (Linux / macOS)
```bash
# Create virtual environment
python3 -m venv venv

# Activate
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

### 3. Setup (Windows)
```powershell
# Create virtual environment
python -m venv venv

# Activate
.\venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 4. Configure Environment
Create a `.env` file in the root directory:
```bash
API_KEY=test-key-123
```
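
How the application loads this file is not shown in this README. As a minimal sketch, assuming a plain `KEY=VALUE` format, the key could be read with the standard library alone. The `load_env` helper is hypothetical; the project may well use a library such as `python-dotenv` instead.

```python
from pathlib import Path


def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines (hypothetical helper, not the project's code).

    Blank lines, '#' comments, and lines without '=' are skipped.
    Surrounding quotes around a value are removed.
    """
    values = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for raw_line in env_file.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


if __name__ == "__main__":
    # Prints None when no .env file or no API_KEY entry exists.
    print(load_env().get("API_KEY"))
```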

---

## ▶️ Running the Server

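**Universal Command** (the same on Linux, macOS, and Windows; `--reload` restarts the server on code changes and is meant for development):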
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```
*The server will start at `http://localhost:8000`.*

---

## 📡 API Usage

### Endpoint: `POST /api/voice-detection`

#### Headers
| Key | Value |
| -- | -- |
| `x-api-key` | The `API_KEY` value from your `.env` file (e.g. `test-key-123`) |
| `Content-Type` | `application/json` |
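
The check performed in `auth.py` is not reproduced here. One common way to validate such a header, shown as a sketch under that assumption, is a constant-time comparison so that response timing does not leak how much of the key matched. The function name `is_valid_api_key` is hypothetical.

```python
import hmac
from typing import Optional


def is_valid_api_key(provided: Optional[str], expected: str) -> bool:
    """Return True only if the supplied x-api-key equals the configured key.

    Hypothetical helper: hmac.compare_digest compares in constant time.
    A missing header or an unset server key is always rejected.
    """
    if not provided or not expected:
        return False
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

A route handler would typically answer with HTTP 401 or 403 when this returns `False`; which status code this API uses is not stated in the source.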

#### Request Body
```json
{
  "language": "Tamil",
  "audioFormat": "mp3",
  "audioBase64": "<BASE64_ENCODED_MP3_STRING>"
}
```
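
A body of this shape can be built from raw MP3 bytes with the standard library. The sketch below assumes the five languages named at the top of this README are the only accepted values, which the source does not confirm; `build_payload` and `SUPPORTED_LANGUAGES` are illustrative names.

```python
import base64
import json

# Assumption: only the languages listed in this README are accepted.
SUPPORTED_LANGUAGES = {"Tamil", "English", "Hindi", "Malayalam", "Telugu"}


def build_payload(mp3_bytes: bytes, language: str) -> str:
    """Return the JSON request body for POST /api/voice-detection."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    return json.dumps(
        {
            "language": language,
            "audioFormat": "mp3",
            # Base64 output is pure ASCII, so decoding to str is safe.
            "audioBase64": base64.b64encode(mp3_bytes).decode("ascii"),
        }
    )
```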

#### Response Example
```json
{
  "status": "success",
  "language": "Tamil",
  "classification": "HUMAN",
  "confidenceScore": 0.98,
  "explanation": "High pitch variance and natural prosody detected."
}
```
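
On the client side, a response like the one above can be parsed into a typed object. This is a sketch only: the `DetectionResult` class and `parse_response` function are hypothetical, and the two labels are taken from the values that appear in this README (`AI_GENERATED` and `HUMAN`).

```python
import json
from dataclasses import dataclass

VALID_LABELS = {"AI_GENERATED", "HUMAN"}


@dataclass
class DetectionResult:
    status: str
    language: str
    classification: str
    confidence_score: float
    explanation: str


def parse_response(body: str) -> DetectionResult:
    """Parse the JSON response and reject unknown classification labels."""
    data = json.loads(body)
    if data["classification"] not in VALID_LABELS:
        raise ValueError(f"unexpected classification: {data['classification']}")
    return DetectionResult(
        status=data["status"],
        language=data["language"],
        classification=data["classification"],
        confidence_score=float(data["confidenceScore"]),
        explanation=data["explanation"],
    )
```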

---

## 🧪 Testing

### 1. Run the Verification Script
A built-in script verifies the audio pipeline and model inference:
```bash
python verify_pipeline.py
```

### 2. Run End-to-End API Test
To test the running server end to end with a generated MP3 file:
```bash
# Ensure server is running in another terminal first!
python test_api.py
```

### 3. cURL Command
```bash
curl -X POST http://127.0.0.1:8000/api/voice-detection \
  -H "x-api-key: test-key-123" \
  -H "Content-Type: application/json" \
  -d '{
    "language": "English",
    "audioFormat": "mp3",
    "audioBase64": "SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU2LjM2LjEwMAAAAAAA..."
  }'
```
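
The same request can be sent from Python without third-party packages. The sketch below uses `urllib.request` and assumes the server is running locally on port 8000 as described above; `build_request` and `detect` are illustrative names, not part of the project.

```python
import json
import urllib.request


def build_request(
    payload: dict, api_key: str, base_url: str = "http://127.0.0.1:8000"
) -> urllib.request.Request:
    """Build (but do not send) the POST request for /api/voice-detection."""
    return urllib.request.Request(
        url=f"{base_url}/api/voice-detection",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def detect(payload: dict, api_key: str, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_request(payload, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Splitting request construction from sending keeps the first function easy to test without a running server.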

---

## 📂 Project Structure

```text
voice-detector/
├── app/
│   ├── main.py       # API Entry point & Routes
│   ├── infer.py      # Model Inference Logic (XLS-R + Classifier)
│   ├── audio.py      # Audio Normalization (Base64 -> 16kHz WAV)
│   └── auth.py       # Utilities
├── model/            # Model weights storage
├── requirements.txt  # Python dependencies
├── .env              # Config keys
├── verify_pipeline.py # System health check script
└── test_api.py       # Live API integration test
```

---

## 🧠 Model Logic (How it works)

1.  **Input**: Accepts a Base64-encoded MP3.
2.  **Normalization**: Converts it to a **16 kHz mono WAV**.
3.  **Encoder**: Feeds audio into **Wav2Vec2-XLS-R-53** to get a 1024-dimensional embedding.
4.  **Feature Extraction**: Calculates **Pitch Variance** to detect robotic flatness.
5.  **Classifier**: A linear layer combines `[Embedding (1024) + Pitch (1)]` to predict `AI_GENERATED` or `HUMAN`.
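
Steps 4 and 5 can be illustrated in plain Python. This is a sketch under stated assumptions, not the code in `infer.py`: the embedding and per-frame pitch track are taken as already computed, the weights and bias are placeholders rather than trained values, and a sigmoid over a single logit stands in for whatever output layer the real classifier uses.

```python
import math
import statistics


def pitch_variance(f0_hz):
    """Variance of the pitch track over voiced frames.

    Frames with f0 <= 0 (a common marker for unvoiced) are ignored;
    NaN values also fail the `> 0` test and are dropped.
    """
    voiced = [f for f in f0_hz if f > 0]
    return statistics.pvariance(voiced) if len(voiced) > 1 else 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(embedding, f0_hz, weights, bias):
    """Linear layer over [embedding (1024) + pitch variance (1)].

    Returns (label, confidence). `weights` must have 1025 entries.
    """
    features = list(embedding) + [pitch_variance(f0_hz)]
    if len(features) != len(weights):
        raise ValueError("weights must match the feature length")
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    p_ai = sigmoid(logit)
    label = "AI_GENERATED" if p_ai >= 0.5 else "HUMAN"
    return label, max(p_ai, 1.0 - p_ai)


if __name__ == "__main__":
    embedding = [0.0] * 1024             # placeholder embedding
    weights = [0.0] * 1024 + [-0.01]     # placeholder: more pitch variance leans HUMAN
    print(classify(embedding, [100.0, 140.0, 0.0], weights, bias=2.0))
```

With these placeholder weights, a flat pitch track leaves the logit at the bias and yields `AI_GENERATED`, while a varied one pushes the prediction toward `HUMAN`, which mirrors the "robotic flatness" idea in step 4.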