---
title: VoiceGuard API
emoji: 🛡️
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 7860
---
# AI-Generated Voice Detector API
A production-ready REST API that accurately detects whether a given voice recording is **AI-generated** or **Human**.
Built for the **AI-Generated Voice Detection Challenge** with specific support for **Tamil, English, Hindi, Malayalam, and Telugu**.
---
## 🚀 Features
- **Multilingual Support**: Uses the state-of-the-art **MMS-300M (Massively Multilingual Speech)** model (`nii-yamagishilab/mms-300m-anti-deepfake`) derived from **XLS-R**, supporting 100+ languages including Indic languages.
- **Strict API Specification**: Compliant with challenge requirements (Base64 MP3 input, standardized JSON response).
- **Smart Hybrid Detection**: Combines Deep Learning embeddings with **Acoustic Heuristics** (Pitch, Flatness, Liveness) for "Conservative Consensus" detection.
- **Explainability**: Provides human-readable explanations for every decision.
- **Secure**: Protected via `x-api-key` header authentication.
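The `x-api-key` check above can be sketched framework-agnostically (the helper name and default key are illustrative, not the repo's actual code):

```python
# Illustrative sketch of the x-api-key check; the project's auth.py may differ.
import hmac
import os

EXPECTED_KEY = os.getenv("API_KEY", "your-secret-key-123")

def is_authorized(headers):
    """Compare the x-api-key header to API_KEY in constant time."""
    provided = headers.get("x-api-key", "")
    return hmac.compare_digest(provided, EXPECTED_KEY)
```

In FastAPI this logic would typically live in a dependency that raises `HTTPException(status_code=401)` on mismatch.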
---
## 🛠️ Tech Stack
- **Framework**: FastAPI (Python)
- **Model**: PyTorch + HuggingFace Transformers (`nii-yamagishilab/mms-300m-anti-deepfake`)
- **Toolkit**: **SpeechBrain** (environment prepared for advanced audio processing)
- **Audio Processing**: `pydub` (ffmpeg) + `librosa`
- **Deployment**: Uvicorn
---
## 📥 Installation
### 1. Pre-requisites
- **Python 3.8+**
- **FFmpeg**: Required for audio processing (`pydub`).
- **Linux**: `sudo apt install ffmpeg`
  - **Windows**: [Download here](https://ffmpeg.org/download.html) and add it to your `PATH`.
### 2. Setup (Linux / macOS)
```bash
# Create virtual environment
python3 -m venv venv
# Activate
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
```
### 3. Setup (Windows)
```powershell
# Create virtual environment
python -m venv venv
# Activate
.\venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
### 4. Configure Environment
Create a `.env` file in the root directory:
```bash
API_KEY=your-secret-key-123
```
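If you prefer not to add a dependency such as `python-dotenv`, a minimal loader looks like this (illustrative sketch; the project may load the file differently):

```python
# Minimal .env loader sketch; skips blank lines and comments.
import os

def load_env(path=".env"):
    """Read KEY=VALUE pairs and set them as environment variables."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: real environment variables take precedence.
            os.environ.setdefault(key.strip(), value.strip())
```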
---
## ▶️ Running the Server
**Universal Command:**
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```
*The server will start at `http://localhost:8000`.*
---
## 📡 API Usage
### Endpoint: `POST /api/voice-detection`
#### Headers
| Key | Value |
| -- | -- |
| `x-api-key` | `your-secret-key-123` |
| `Content-Type` | `application/json` |
#### Request Body
```json
{
"language": "Tamil",
"audioFormat": "mp3",
"audioBase64": "<BASE64_ENCODED_MP3_STRING>"
}
```
#### Response Example
```json
{
"status": "success",
"language": "Tamil",
"classification": "HUMAN",
"confidenceScore": 0.98,
"explanation": "High pitch variance and natural prosody detected."
}
```
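Assembling a valid request body from a local MP3 can be sketched as follows (the function name and default language are illustrative):

```python
# Build the JSON request body from raw MP3 bytes.
import base64
import json

def build_payload(mp3_bytes, language="Tamil"):
    """Return the JSON request body with the audio Base64-encoded."""
    body = {
        "language": language,
        "audioFormat": "mp3",
        "audioBase64": base64.b64encode(mp3_bytes).decode("ascii"),
    }
    return json.dumps(body)
```

POST the returned string to `/api/voice-detection` with the headers above, using any HTTP client.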
---
## 🧪 Testing
### 1. Run the Verification Script
We have a built-in test suite that verifies the audio pipeline and model inference:
```bash
python verify_pipeline.py
```
### 2. Run End-to-End API Test
To test the actual running server with a real generated MP3 file:
```bash
# Ensure server is running in another terminal first!
python test_api.py
```
### 3. cURL Command
```bash
curl -X POST http://127.0.0.1:8000/api/voice-detection \
-H "x-api-key: your-secret-key-123" \
-H "Content-Type: application/json" \
-d '{
"language": "English",
"audioFormat": "mp3",
"audioBase64": "SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU2LjM2LjEwMAAAAAAA..."
}'
```
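To supply a full `audioBase64` value instead of the truncated sample above, encode the MP3 on the command line (the stand-in file is illustrative; GNU coreutils `base64 -w0` shown, macOS's BSD base64 uses `base64 -i` instead):

```bash
# Stand-in MP3 so the snippet runs as-is; replace with your real recording.
printf 'ID3 fake mp3 bytes' > sample.mp3

# Single-line Base64 string (no wrapping), ready for the JSON body.
AUDIO_B64=$(base64 -w0 sample.mp3)

# Substitute it into the curl request body, e.g.:
#   -d "{\"language\": \"English\", \"audioFormat\": \"mp3\", \"audioBase64\": \"$AUDIO_B64\"}"
echo "$AUDIO_B64"
```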
---
## 📁 Project Structure
```text
voice-detector/
├── app/
│   ├── main.py            # API entry point & routes
│   ├── infer.py           # Model inference logic (XLS-R + classifier)
│   ├── audio.py           # Audio normalization (Base64 -> 16kHz WAV)
│   └── auth.py            # API key authentication
├── model/                 # Model weights storage
├── requirements.txt       # Python dependencies
├── .env                   # Config keys
├── verify_pipeline.py     # System health check script
└── test_api.py            # Live API integration test
```
---
## 🧠 Model Logic (How it works)
1. **Input**: Takes Base64 MP3.
2. **Normalization**: Converts to **16,000Hz Mono WAV**.
3. **Encoder**: Feeds the audio into the **XLS-R** encoder (via `mms-300m-anti-deepfake`) to obtain a 1024-dimensional embedding.
4. **Feature Extraction**: Calculates **Pitch Variance** to detect robotic flatness.
5. **Classifier**: A linear layer combines `[Embedding (1024) + Pitch (1)]` to predict `AI_GENERATED` or `HUMAN`.
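Steps 4–5 above can be illustrated with a toy, pure-Python stand-in (values, weights, and function names are made up; the real classifier is a learned linear layer over the 1025-dimensional input):

```python
# Toy illustration of the hybrid feature vector and linear decision.
from statistics import pvariance

def hybrid_features(embedding, pitch_track):
    """Concatenate the encoder embedding with the pitch-variance scalar."""
    return list(embedding) + [pvariance(pitch_track)]

def classify(features, weights, bias):
    """Single linear unit: positive score -> AI_GENERATED, else HUMAN."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return "AI_GENERATED" if score > 0 else "HUMAN"
```

A perfectly flat pitch track yields zero variance, which (with suitable learned weights) pushes the score toward `AI_GENERATED`.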