---
title: VoiceGuard API
emoji: 🛡️
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 7860
---

# AI-Generated Voice Detector API

A production-ready REST API that detects whether a given voice recording is **AI-generated** or **Human**. Built for the **AI-Generated Voice Detection Challenge**, with specific support for **Tamil, English, Hindi, Malayalam, and Telugu**.

---

## 🚀 Features

- **Multilingual Support**: Uses the state-of-the-art **MMS-300M (Massively Multilingual Speech)** anti-deepfake model (`nii-yamagishilab/mms-300m-anti-deepfake`), derived from **XLS-R**, which supports 100+ languages including Indic languages.
- **Strict API Specification**: Compliant with the challenge requirements (Base64 MP3 input, standardized JSON response).
- **Smart Hybrid Detection**: Combines deep-learning embeddings with **acoustic heuristics** (pitch, flatness, liveness) for "conservative consensus" detection.
- **Explainability**: Provides a human-readable explanation for every decision.
- **Secure**: Protected via `x-api-key` header authentication.

---

## 🛠️ Tech Stack

- **Framework**: FastAPI (Python)
- **Model**: PyTorch + Hugging Face Transformers (`nii-yamagishilab/mms-300m-anti-deepfake`)
- **Toolkit**: **SpeechBrain** (environment ready for advanced audio processing)
- **Audio Processing**: `pydub` (FFmpeg) + `librosa`
- **Deployment**: Uvicorn

---

## 📥 Installation

### 1. Prerequisites

- **Python 3.8+**
- **FFmpeg**: Required for audio processing (`pydub`).
  - **Linux**: `sudo apt install ffmpeg`
  - **Windows**: [Download here](https://ffmpeg.org/download.html) and add it to your `PATH`.

### 2. Setup (Linux / macOS)

```bash
# Create virtual environment
python3 -m venv venv

# Activate
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

### 3. Setup (Windows)

```powershell
# Create virtual environment
python -m venv venv

# Activate
.\venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
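After installing, you can sanity-check that the FFmpeg binary is actually discoverable before starting the server. This is a minimal illustrative sketch; `check_ffmpeg` is a hypothetical helper, not a script shipped with the repo:

```python
import shutil


def check_ffmpeg() -> bool:
    """Return True if the ffmpeg binary is on PATH (required by pydub)."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    if check_ffmpeg():
        print("ffmpeg: found")
    else:
        print("ffmpeg: missing - install it before running the server")
```

If this prints `missing`, `pydub` will fail at decode time with a cryptic error, so it is worth checking up front.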
### 4. Configure Environment

Create a `.env` file in the root directory:

```bash
API_KEY=test-key-123
```

Requests must send this same value in the `x-api-key` header.

---

## ▶️ Running the Server

**Universal Command:**

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

*The server will start at `http://localhost:8000`.*

---

## 📡 API Usage

### Endpoint: `POST /api/voice-detection`

#### Headers

| Key | Value |
| -- | -- |
| `x-api-key` | The value of `API_KEY` from `.env` |
| `Content-Type` | `application/json` |

#### Request Body

```json
{
  "language": "Tamil",
  "audioFormat": "mp3",
  "audioBase64": "<Base64-encoded MP3 data>"
}
```

#### Response Example

```json
{
  "status": "success",
  "language": "Tamil",
  "classification": "HUMAN",
  "confidenceScore": 0.98,
  "explanation": "High pitch variance and natural prosody detected."
}
```

---

## 🧪 Testing

### 1. Run the Verification Script

A built-in test suite verifies the audio pipeline and model inference:

```bash
python verify_pipeline.py
```

### 2. Run the End-to-End API Test

To test the actual running server with a real generated MP3 file:

```bash
# Ensure the server is running in another terminal first!
python test_api.py
```

### 3. cURL Command

```bash
curl -X POST http://127.0.0.1:8000/api/voice-detection \
  -H "x-api-key: test-key-123" \
  -H "Content-Type: application/json" \
  -d '{
    "language": "English",
    "audioFormat": "mp3",
    "audioBase64": "SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU2LjM2LjEwMAAAAAAA..."
  }'
```

---

## 📂 Project Structure

```text
voice-detector/
├── app/
│   ├── main.py          # API entry point & routes
│   ├── infer.py         # Model inference logic (XLS-R + classifier)
│   ├── audio.py         # Audio normalization (Base64 -> 16 kHz WAV)
│   └── auth.py          # API-key auth utilities
├── model/               # Model weights storage
├── requirements.txt     # Python dependencies
├── .env                 # Config keys
├── verify_pipeline.py   # System health check script
└── test_api.py          # Live API integration test
```

---

## 🧠 Model Logic (How It Works)

1. **Input**: Takes a Base64-encoded MP3.
2. **Normalization**: Converts the audio to **16,000 Hz mono WAV**.
3. **Encoder**: Feeds the audio into **Wav2Vec2-XLS-R-53** to get a 1024-dimensional embedding.
4. **Feature Extraction**: Calculates **pitch variance** to detect robotic flatness.
5. **Classifier**: A linear layer combines `[Embedding (1024) + Pitch (1)]` to predict `AI_GENERATED` or `HUMAN`.
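The fusion in steps 4–5 can be sketched as follows. This is an illustrative NumPy stand-in, not the repo's actual classifier (which lives in `app/infer.py` and would use trained weights); the helper name, random weights, and toy pitch contour are assumptions for demonstration:

```python
import numpy as np


def classify(embedding, f0_track, w, b):
    """Fuse a 1024-d encoder embedding with scalar pitch variance via one linear unit."""
    pitch_var = float(np.var(f0_track))                  # step 4: low variance ~ robotic flatness
    features = np.concatenate([embedding, [pitch_var]])  # shape (1025,) = [Embedding (1024) + Pitch (1)]
    logit = float(features @ w + b)                      # step 5: single linear layer
    prob = 1.0 / (1.0 + np.exp(-logit))                  # sigmoid -> P(AI_GENERATED)
    label = "AI_GENERATED" if prob >= 0.5 else "HUMAN"
    return label, prob


# Toy inputs: random embedding, untrained random weights, and a pitch contour in Hz
rng = np.random.default_rng(0)
emb = rng.standard_normal(1024)
w = rng.standard_normal(1025) * 0.01
label, prob = classify(emb, np.array([181.0, 183.5, 179.2, 184.0]), w, b=0.0)
```

A human voice typically yields a noticeably non-zero `pitch_var`, while a flat synthetic contour drives it toward zero, which is the signal the trained linear layer can exploit alongside the embedding.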