---
title: VoiceGuard API
emoji: 🛡️
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 7860
---

# AI-Generated Voice Detector API

A production-ready REST API that detects whether a given voice recording is AI-generated or human.
Built for the AI-Generated Voice Detection Challenge, with specific support for Tamil, English, Hindi, Malayalam, and Telugu.


πŸš€ Features

- **Multilingual Support**: uses the state-of-the-art MMS-300M (Massively Multilingual Speech) model (`nii-yamagishilab/mms-300m-anti-deepfake`), derived from XLS-R and supporting 100+ languages, including Indic languages.
- **Strict API Specification**: compliant with the challenge requirements (Base64 MP3 input, standardized JSON response).
- **Smart Hybrid Detection**: combines deep-learning embeddings with acoustic heuristics (pitch, flatness, liveness) for "Conservative Consensus" detection.
- **Explainability**: provides a human-readable explanation for every decision.
- **Secure**: protected via `x-api-key` header authentication.
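As a sketch of the kind of header check the authentication layer might perform (the actual implementation is in `app/auth.py` and may differ), API keys should be compared in constant time to avoid timing side channels; Python's stdlib `hmac.compare_digest` does exactly that:

```python
import hmac

def is_valid_api_key(provided: str, expected: str) -> bool:
    """Constant-time comparison of the x-api-key header value.

    Illustrative helper only; the real auth.py may be structured differently.
    """
    return hmac.compare_digest(provided.encode(), expected.encode())
```

In FastAPI this check would typically live in a dependency that reads the `x-api-key` header and raises a 401 on mismatch.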

πŸ› οΈ Tech Stack

- **Framework**: FastAPI (Python)
- **Model**: PyTorch + Hugging Face Transformers (`nii-yamagishilab/mms-300m-anti-deepfake`)
- **Toolkit**: SpeechBrain (environment ready for advanced audio processing)
- **Audio Processing**: pydub (FFmpeg) + librosa
- **Deployment**: Uvicorn

πŸ“₯ Installation

### 1. Prerequisites

- Python 3.8+
- **FFmpeg**: required for audio processing (`pydub`).
  - Linux: `sudo apt install ffmpeg`
  - Windows: download FFmpeg from the official website and add it to your `PATH`.

### 2. Setup (Linux / macOS)

```bash
# Create virtual environment
python3 -m venv venv

# Activate
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

### 3. Setup (Windows)

```powershell
# Create virtual environment
python -m venv venv

# Activate
.\venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 4. Configure Environment

Create a `.env` file in the root directory:

```
API_KEY=test-key-123
```
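The server presumably reads `API_KEY` from this file at startup (commonly via `python-dotenv`). A stdlib-only sketch of that parsing, shown here purely to illustrate the expected file format:

```python
def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments.

    Illustrative only; the app itself likely uses python-dotenv.
    """
    env = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env
```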

## ▶️ Running the Server

**Universal command:**

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

The server will start at http://localhost:8000.


πŸ“‘ API Usage

**Endpoint:** `POST /api/voice-detection`

### Headers

| Key | Value |
| --- | --- |
| `x-api-key` | `your-secret-key-123` |
| `Content-Type` | `application/json` |

### Request Body

```json
{
  "language": "Tamil",
  "audioFormat": "mp3",
  "audioBase64": "<BASE64_ENCODED_MP3_STRING>"
}
```

### Response Example

```json
{
  "status": "success",
  "language": "Tamil",
  "classification": "HUMAN",
  "confidenceScore": 0.98,
  "explanation": "High pitch variance and natural prosody detected."
}
```

πŸ§ͺ Testing

1. Run the Verification Script

We have a built-in test suite that verifies the audio pipeline and model inference:

python verify_pipeline.py

### 2. Run the End-to-End API Test

To test the actual running server with a real generated MP3 file:

```bash
# Ensure the server is running in another terminal first!
python test_api.py
```

### 3. cURL Command

```bash
curl -X POST http://127.0.0.1:8000/api/voice-detection \
  -H "x-api-key: your-secret-key-123" \
  -H "Content-Type: application/json" \
  -d '{
    "language": "English",
    "audioFormat": "mp3",
    "audioBase64": "SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU2LjM2LjEwMAAAAAAA..."
  }'
```

πŸ“‚ Project Structure

voice-detector/
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ main.py       # API Entry point & Routes
β”‚   β”œβ”€β”€ infer.py      # Model Inference Logic (XLS-R + Classifier)
β”‚   β”œβ”€β”€ audio.py      # Audio Normalization (Base64 -> 16kHz WAV)
β”‚   └── auth.py       # Utilities
β”œβ”€β”€ model/            # Model weights storage
β”œβ”€β”€ requirements.txt  # Python dependencies
β”œβ”€β”€ .env              # Config keys
β”œβ”€β”€ verify_pipeline.py# System health check script
└── test_api.py       # Live API integration test

## 🧠 Model Logic (How It Works)

1. **Input**: accepts a Base64-encoded MP3.
2. **Normalization**: converts it to 16 kHz mono WAV.
3. **Encoder**: feeds the audio into the Wav2Vec2 XLS-R encoder (the base of MMS-300M) to obtain a 1024-dimensional embedding.
4. **Feature Extraction**: calculates pitch variance to detect robotic flatness.
5. **Classifier**: a linear layer combines [embedding (1024) + pitch (1)] to predict `AI_GENERATED` or `HUMAN`.
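The final step above can be sketched in NumPy as follows. This is an illustrative reconstruction, not the project's actual classifier: the weights here are random placeholders, and `pitch_variance` stands in for the heuristic computed in step 4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned parameters: 1024 embedding dims + 1 pitch dim.
W = rng.normal(size=1025) * 0.01
b = 0.0

def classify(embedding: np.ndarray, pitch_variance: float) -> str:
    """Linear layer over [embedding (1024) | pitch (1)], sigmoid, threshold at 0.5."""
    features = np.concatenate([embedding, [pitch_variance]])   # shape (1025,)
    score = 1.0 / (1.0 + np.exp(-(features @ W + b)))          # P(AI_GENERATED)
    return "AI_GENERATED" if score >= 0.5 else "HUMAN"
```

In the real pipeline the embedding would come from the encoder in step 3 and the weights from training; only the shape of the computation is shown here.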