---
title: VoiceGuard API
emoji: πŸ›‘οΈ
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 7860
---
# AI-Generated Voice Detector API
A production-ready REST API that accurately detects whether a given voice recording is **AI-generated** or **Human**.
Built for the **AI-Generated Voice Detection Challenge** with specific support for **Tamil, English, Hindi, Malayalam, and Telugu**.
---
## πŸš€ Features
- **Multilingual Support**: Uses the state-of-the-art **MMS-300M (Massively Multilingual Speech)** model (`nii-yamagishilab/mms-300m-anti-deepfake`) derived from **XLS-R**, supporting 100+ languages including Indic languages.
- **Strict API Specification**: Compliant with challenge requirements (Base64 MP3 input, standardized JSON response).
- **Smart Hybrid Detection**: Combines Deep Learning embeddings with **Acoustic Heuristics** (Pitch, Flatness, Liveness) for "Conservative Consensus" detection.
- **Explainability**: Provides human-readable explanations for every decision.
- **Secure**: Protected via `x-api-key` header authentication.
---
## πŸ› οΈ Tech Stack
- **Framework**: FastAPI (Python)
- **Model**: PyTorch + HuggingFace Transformers (`nii-yamagishilab/mms-300m-anti-deepfake`)
- **Toolkit**: **SpeechBrain** (voice activity detection and advanced audio processing)
- **Audio Processing**: `pydub` (ffmpeg) + `librosa`
- **Deployment**: Uvicorn
---
## πŸ“₯ Installation
### 1. Pre-requisites
- **Python 3.8+**
- **FFmpeg**: Required for audio processing (`pydub`).
  - **Linux**: `sudo apt install ffmpeg`
  - **Windows**: [Download here](https://ffmpeg.org/download.html) and add it to your `PATH`.
### 2. Setup (Linux / macOS)
```bash
# Create virtual environment
python3 -m venv venv
# Activate
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
```
### 3. Setup (Windows)
```powershell
# Create virtual environment
python -m venv venv
# Activate
.\venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
### 4. Configure Environment
Create a `.env` file in the root directory:
```bash
API_KEY=test-key-123
```
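A minimal sketch of how the key check might work (the helper name `is_authorized` is illustrative; the actual logic lives in `app/auth.py`):

```python
import os

def is_authorized(provided_key: str) -> bool:
    """Compare the x-api-key header value against the configured API_KEY."""
    # API_KEY is expected in the environment (e.g. loaded from .env at startup).
    return provided_key == os.environ.get("API_KEY", "")
```

The key is read at call time, so it can be rotated by restarting the server with an updated `.env`.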
---
## ▢️ Running the Server
**Universal Command:**
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```
*The server will start at `http://localhost:8000`.*
---
## πŸ“‘ API Usage
### Endpoint: `POST /api/voice-detection`
#### Headers
| Key | Value |
| -- | -- |
| `x-api-key` | `test-key-123` (the key set in your `.env`) |
| `Content-Type` | `application/json` |
#### Request Body
```json
{
  "language": "Tamil",
  "audioFormat": "mp3",
  "audioBase64": "<BASE64_ENCODED_MP3_STRING>"
}
```
#### Response Example
```json
{
  "status": "success",
  "language": "Tamil",
  "classification": "HUMAN",
  "confidenceScore": 0.98,
  "explanation": "High pitch variance and natural prosody detected."
}
```
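For quick local testing, the same request can be built with just the Python standard library (the file path, language, and key below are placeholders; the URL assumes the local dev server from this README):

```python
import base64
import json
import urllib.request

def build_request(mp3_path, language, api_key,
                  url="http://localhost:8000/api/voice-detection"):
    """Encode an MP3 file and build the POST request expected by the API."""
    with open(mp3_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    payload = {
        "language": language,
        "audioFormat": "mp3",
        "audioBase64": audio_b64,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

# To actually send it (server must be running):
# resp = urllib.request.urlopen(build_request("sample.mp3", "Tamil", "test-key-123"))
# print(json.loads(resp.read()))
```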
---
## πŸ§ͺ Testing
### 1. Run the Verification Script
A built-in verification script checks the audio pipeline and model inference:
```bash
python verify_pipeline.py
```
### 2. Run End-to-End API Test
To test the actual running server with a real generated MP3 file:
```bash
# Ensure server is running in another terminal first!
python test_api.py
```
### 3. cURL Command
```bash
curl -X POST http://127.0.0.1:8000/api/voice-detection \
  -H "x-api-key: test-key-123" \
  -H "Content-Type: application/json" \
  -d '{
    "language": "English",
    "audioFormat": "mp3",
    "audioBase64": "SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU2LjM2LjEwMAAAAAAA..."
  }'
```
---
## πŸ“‚ Project Structure
```text
voice-detector/
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ main.py            # API entry point & routes
β”‚   β”œβ”€β”€ infer.py           # Model inference logic (XLS-R + classifier)
β”‚   β”œβ”€β”€ audio.py           # Audio normalization (Base64 -> 16 kHz WAV)
β”‚   └── auth.py            # API-key authentication
β”œβ”€β”€ model/                 # Model weights storage
β”œβ”€β”€ requirements.txt       # Python dependencies
β”œβ”€β”€ .env                   # Config keys
β”œβ”€β”€ verify_pipeline.py     # System health check script
└── test_api.py            # Live API integration test
```
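The Base64 -> 16 kHz WAV step handled by `app/audio.py` can be sketched as follows (the `decode_base64_mp3` helper is illustrative; the `pydub` calls are shown commented out because they require ffmpeg):

```python
import base64
import io

def decode_base64_mp3(audio_b64):
    """Decode the request's audioBase64 field into an in-memory MP3 buffer."""
    return io.BytesIO(base64.b64decode(audio_b64))

# The buffer can then be normalized with pydub (needs ffmpeg installed), e.g.:
# from pydub import AudioSegment
# seg = AudioSegment.from_file(decode_base64_mp3(b64), format="mp3")
# seg = seg.set_frame_rate(16000).set_channels(1)  # 16 kHz mono
# seg.export("normalized.wav", format="wav")
```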
---
## 🧠 Model Logic (How it works)
1. **Input**: Takes a Base64-encoded MP3.
2. **Normalization**: Converts it to **16 kHz mono WAV**.
3. **Encoder**: Feeds the audio into the **MMS-300M (XLS-R-based)** encoder to get a 1024-dimensional embedding.
4. **Feature Extraction**: Calculates **pitch variance** to detect robotic flatness.
5. **Classifier**: A linear layer combines `[Embedding (1024) + Pitch (1)]` to predict `AI_GENERATED` or `HUMAN`.
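Step 5 can be sketched in plain Python (a toy linear head with placeholder weights, not the actual trained classifier from `app/infer.py`):

```python
import math

EMB_DIM = 1024  # embedding size produced by the encoder

def classify(embedding, pitch_variance, weights, bias):
    """Toy sketch of the final linear layer over [embedding | pitch].

    `weights` (length 1025) and `bias` stand in for trained parameters.
    """
    features = list(embedding) + [pitch_variance]   # 1024 + 1 = 1025 features
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    p_ai = 1.0 / (1.0 + math.exp(-logit))           # sigmoid -> P(AI_GENERATED)
    label = "AI_GENERATED" if p_ai >= 0.5 else "HUMAN"
    return {
        "classification": label,
        "confidenceScore": round(max(p_ai, 1.0 - p_ai), 2),
    }
```

The confidence score reported in the API response corresponds to the probability of the winning class.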