# VoiceForge API Documentation

## Base URL

- Development: `http://localhost:8000`
- Production: `https://api.voiceforge.example.com`

## Authentication

Most endpoints require authentication via a JWT Bearer token.

```http
Authorization: Bearer <token>
```

---

## Endpoints

### Authentication

#### POST /api/v1/auth/register

Register a new user.

**Request:**

```json
{
  "email": "user@example.com",
  "password": "strongpassword123",
  "name": "Jane Doe"
}
```

**Response:**

```json
{
  "id": 1,
  "email": "user@example.com",
  "name": "Jane Doe",
  "created_at": "2024-01-01T10:00:00"
}
```

#### POST /api/v1/auth/login

Log in to obtain a JWT token.

**Request (Form Data):**

- username: user@example.com
- password: strongpassword123

**Response:**

```json
{
  "access_token": "ey...",
  "token_type": "bearer"
}
```

#### GET /api/v1/auth/me

Get the current user's profile.

---

### Health Check

#### GET /health

Check if the API is running.

**Response:**

```json
{
  "status": "healthy",
  "service": "voiceforge-api",
  "version": "1.0.0"
}
```

#### GET /health/memory

Get current memory usage and loaded models.

**Response:**

```json
{
  "memory_mb": 1523.4,
  "loaded_models": ["distil-small.en", "small"],
  "models_detail": {
    "distil-small.en": {"loaded": true, "idle_seconds": 45.2}
  }
}
```

#### POST /health/memory/cleanup

Unload idle models (inactive for more than 5 minutes) to free memory.

#### POST /health/memory/unload-all

Unload ALL models to free maximum memory (~1GB reduction).

---

### WebSocket Endpoints

#### WS /api/v1/ws/tts/{client_id}

Real-time TTS streaming via WebSocket (ultra-low latency).

**Protocol:**

- **Client sends:** JSON `{"text": "...", "voice": "...", "rate": "+0%", "pitch": "+0Hz"}`
- **Server sends:** Binary audio chunks followed by JSON `{"status": "complete", "ttfb_ms": 150}`

**Expected TTFB (time to first byte):** <500ms

---

### Speech-to-Text

#### GET /api/v1/stt/languages

Get the list of supported languages.


**Response:**

```json
{
  "languages": [
    {
      "code": "en-US",
      "name": "English (US)",
      "native_name": "English",
      "flag": "🇺🇸",
      "stt_supported": true,
      "tts_supported": true
    }
  ],
  "total": 10
}
```

#### POST /api/v1/stt/upload

Transcribe an uploaded audio file.

**Request:**

- Content-Type: `multipart/form-data`

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file (WAV, MP3, M4A, FLAC, OGG) |
| language | string | No | Language code (default: en-US) |
| enable_punctuation | boolean | No | Add punctuation (default: true) |
| enable_word_timestamps | boolean | No | Include word timing (default: true) |
| enable_diarization | boolean | No | Speaker detection (default: false) |
| speaker_count | integer | No | Expected speakers (2-10) |

**Response:**

```json
{
  "id": 1,
  "text": "Hello, world. This is a test transcription.",
  "segments": [
    {
      "text": "Hello, world.",
      "start_time": 0.0,
      "end_time": 1.5,
      "speaker": null,
      "confidence": 0.95
    }
  ],
  "words": [
    {
      "word": "Hello",
      "start_time": 0.0,
      "end_time": 0.5,
      "confidence": 0.98
    }
  ],
  "language": "en-US",
  "confidence": 0.95,
  "duration": 3.5,
  "word_count": 7,
  "processing_time": 1.23
}
```

#### POST /api/v1/stt/upload/quality

High-quality transcription mode (optimized for accuracy).

**Parameters (form-data):**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file |
| language | string | No | Language code (default: en-US) |
| preprocess | boolean | No | Apply noise reduction (default: false) |

**Features:**

- beam_size=5 for more accurate decoding (~40% fewer errors)
- condition_on_previous_text=False to reduce hallucinations
- Optional audio preprocessing for noisy environments

**Response:** Same as `/upload`

---

#### POST /api/v1/stt/upload/batch

Batch transcription for high throughput (2-3x speedup).


**Parameters (form-data):**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| files | file[] | Yes | Multiple audio files |
| language | string | No | Language code (default: en-US) |
| batch_size | integer | No | Batch size (default: 8) |

**Response:**

```json
{
  "count": 3,
  "results": [
    {"filename": "audio1.mp3", "text": "...", "processing_time": 2.1},
    {"filename": "audio2.mp3", "text": "...", "processing_time": 1.8}
  ],
  "mode": "batched",
  "batch_size": 8
}
```

#### POST /api/v1/stt/upload/diarize

Perform Speaker Diarization ("Who said what") on an audio file.

**Requirements:**

- `HF_TOKEN` must be set in `.env` (Hugging Face Token for pyannote model access)
- Uses `faster-whisper` for transcription + `pyannote.audio` for speaker identification

**Parameters (form-data):**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file |
| num_speakers | integer | No | Exact number of speakers (optional) |
| min_speakers | integer | No | Min expected speakers (optional) |
| max_speakers | integer | No | Max expected speakers (optional) |
| language | string | No | Language code, e.g. 'en' (auto-detected if not provided) |

**Response:**

```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 0.84,
      "text": "Hello Test",
      "speaker": "SPEAKER_00"
    }
  ],
  "speaker_stats": {
    "SPEAKER_00": 0.84
  },
  "language": "en",
  "status": "success"
}
```

### Transcripts & Analysis

#### GET /api/v1/transcripts

List all past transcriptions.

**Response:**

```json
[
  {
    "id": 1,
    "text": "Hello world...",
    "created_at": "2024-01-01T12:00:00",
    "word_count": 150,
    "language": "en-US"
  }
]
```

#### POST /api/v1/transcripts/{id}/analyze

Run NLP analysis (Sentiment, Keywords, Summary) on a transcript.


**Response:**

```json
{
  "status": "success",
  "analysis": {
    "sentiment": {"polarity": 0.5, "subjectivity": 0.1},
    "keywords": ["artificial intelligence", "voice", "app"],
    "summary": "This is a summary of the transcript."
  }
}
```

#### GET /api/v1/transcripts/{id}/export

Download transcript in a specific format.

**Query Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| format | string | Yes | txt, srt, vtt, pdf |

**Response:**

- File download (text/plain, text/vtt, application/pdf)

---

### Text-to-Speech

#### GET /api/v1/tts/voices

Get all available voices.

**Query Parameters:**

| Parameter | Type | Description |
|-----------|------|-------------|
| language | string | Filter by language code |

**Response:**

```json
{
  "voices": [
    {
      "name": "en-US-Wavenet-D",
      "language_code": "en-US",
      "language_name": "English (US)",
      "ssml_gender": "MALE",
      "natural_sample_rate": 24000,
      "voice_type": "WaveNet",
      "display_name": "D (Male, WaveNet)",
      "flag": "🇺🇸"
    }
  ],
  "total": 50,
  "language_filter": null
}
```

#### GET /api/v1/tts/voices/{language}

Get voices for a specific language.

**Parameters:**

| Parameter | Type | Description |
|-----------|------|-------------|
| language | path | Language code (e.g., en-US) |

#### POST /api/v1/tts/synthesize

Convert text to speech.


**Request:**

```json
{
  "text": "Hello, this is a test.",
  "language": "en-US",
  "voice": "en-US-Wavenet-D",
  "audio_encoding": "MP3",
  "speaking_rate": 1.0,
  "pitch": 0.0,
  "volume_gain_db": 0.0,
  "use_ssml": false
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| text | string | Yes | Text to synthesize (max 5000 chars) |
| language | string | No | Language code (default: en-US) |
| voice | string | No | Voice name |
| audio_encoding | string | No | MP3, LINEAR16, OGG_OPUS |
| speaking_rate | float | No | Speed 0.25-4.0 (default: 1.0) |
| pitch | float | No | Pitch -20 to 20 (default: 0.0) |
| volume_gain_db | float | No | Volume -96 to 16 dB |
| use_ssml | boolean | No | Treat `text` as SSML markup |

**Response:**

```json
{
  "audio_content": "<base64-encoded audio>",
  "audio_size": 12345,
  "duration_estimate": 2.5,
  "voice_used": "en-US-Wavenet-D",
  "language": "en-US",
  "encoding": "MP3",
  "sample_rate": 24000,
  "processing_time": 0.45
}
```

#### POST /api/v1/tts/synthesize/audio

Synthesize and return the audio file directly.

Same request as `/synthesize`, but returns the audio file as a download.

#### POST /api/v1/tts/stream

Stream synthesized audio for immediate playback.

**Request:** Same as `/synthesize`.

**Response:** Chunked audio stream (`audio/mpeg`). Ideal for long text, since it reduces time to first byte (TTFB).

#### POST /api/v1/tts/ssml

Synthesize audio using SSML for prosody control (rate, pitch, emphasis).

**Request:**

- `text`: Text to speak
- `voice`: Voice name (default: "en-US-AriaNeural")
- `rate`: Speed (e.g., "fast", "-10%")
- `pitch`: Pitch (e.g., "high", "+5Hz")
- `emphasis`: "strong", "moderate", "reduced"
- `auto_breaks`: true/false

**Response:** Audio file (`audio/mpeg`).

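
Because `/api/v1/tts/stream` returns chunked audio, a client can write bytes to disk (or hand them to a player) as they arrive instead of buffering the whole response. The following is a minimal sketch under stated assumptions, not an official client: the helper names `write_audio_chunks` and `stream_tts_to_file`, the 4096-byte chunk size, the 30-second timeout, and the optional `token` argument are all illustrative. The request body reuses the `/synthesize` fields.

```python
from typing import BinaryIO, Iterable, Optional


def write_audio_chunks(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write non-empty chunks to `sink` as they arrive; return total bytes written."""
    total = 0
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            sink.write(chunk)
            total += len(chunk)
    return total


def stream_tts_to_file(
    text: str,
    out_path: str,
    base_url: str = "http://localhost:8000",
    token: Optional[str] = None,  # hypothetical: JWT from /api/v1/auth/login
) -> int:
    """POST to /api/v1/tts/stream and save the chunked audio/mpeg response."""
    import requests  # third-party; imported lazily so the helper above has no dependencies

    headers = {"Authorization": f"Bearer {token}"} if token else {}
    with requests.post(
        f"{base_url}/api/v1/tts/stream",
        json={"text": text, "language": "en-US"},
        headers=headers,
        stream=True,  # do not buffer the whole body in memory
        timeout=30,   # assumed value, tune for long texts
    ) as response:
        response.raise_for_status()
        with open(out_path, "wb") as f:
            return write_audio_chunks(response.iter_content(chunk_size=4096), f)
```

Calling `stream_tts_to_file("Hello world", "output.mp3")` against a running development server would save the streamed audio and return the number of bytes written.
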

---

## Error Responses

All errors follow this format:

```json
{
  "error": "error_type",
  "message": "Human readable message",
  "detail": "Additional details (debug mode only)"
}
```

### Common Error Codes

| Code | Type | Description |
|------|------|-------------|
| 400 | validation_error | Invalid request parameters |
| 401 | unauthorized | Missing or invalid auth token |
| 403 | forbidden | Insufficient permissions |
| 404 | not_found | Resource not found |
| 413 | file_too_large | Upload exceeds size limit |
| 429 | rate_limited | Too many requests |
| 500 | internal_error | Server error |

---

## Rate Limits

| Tier | Limit |
|------|-------|
| Free | 60 requests/minute |
| Pro | 600 requests/minute |
| Enterprise | Custom |

---

## Supported Audio Formats

| Format | Extension | Notes |
|--------|-----------|-------|
| WAV | .wav | Best quality, no conversion |
| MP3 | .mp3 | Common, converted |
| M4A | .m4a | iOS format |
| FLAC | .flac | Lossless |
| OGG | .ogg | Open format |
| WebM | .webm | Browser recording |

---

## Code Examples

### Python

```python
import base64

import requests

# Transcribe audio
with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/v1/stt/upload",
        files={"file": f},
        data={"language": "en-US"},
    )
print(response.json()["text"])

# Synthesize speech
response = requests.post(
    "http://localhost:8000/api/v1/tts/synthesize",
    json={"text": "Hello world", "language": "en-US"},
)

# audio_content is base64-encoded, so decode it before writing to disk
audio = base64.b64decode(response.json()["audio_content"])
with open("output.mp3", "wb") as f:
    f.write(audio)
```

### cURL

```bash
# Transcribe
curl -X POST http://localhost:8000/api/v1/stt/upload \
  -F "file=@audio.wav" \
  -F "language=en-US"

# Synthesize
curl -X POST http://localhost:8000/api/v1/tts/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "language": "en-US"}'
```
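
### Python: Error Handling

The error format and rate limits documented in this reference lend themselves to a small client-side wrapper: turn the `error`/`message`/`detail` body into an exception and retry only when the status is `429`. The sketch below is illustrative rather than part of the VoiceForge API. `ApiError`, `raise_for_api_error`, `backoff_delay`, the `unknown_error` fallback, and the exponential backoff policy are all assumptions.

```python
from typing import Any, Mapping, Optional


class ApiError(Exception):
    """Error parsed from the documented {"error", "message", "detail"} body."""

    def __init__(
        self,
        status_code: int,
        error: str,
        message: str,
        detail: Optional[str] = None,
    ) -> None:
        super().__init__(f"{status_code} {error}: {message}")
        self.status_code = status_code
        self.error = error
        self.message = message
        self.detail = detail  # only sent when the server runs in debug mode

    @property
    def retryable(self) -> bool:
        # Assumption: only rate limiting (429) is worth retrying automatically.
        return self.status_code == 429


def raise_for_api_error(status_code: int, body: Mapping[str, Any]) -> None:
    """Raise ApiError for 4xx/5xx responses; do nothing otherwise."""
    if status_code < 400:
        return
    raise ApiError(
        status_code,
        body.get("error", "unknown_error"),  # fallback name is hypothetical
        body.get("message", ""),
        body.get("detail"),
    )


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), capped at `cap`."""
    return min(cap, base * (2 ** attempt))
```

With `requests`, this could be wired in as `raise_for_api_error(response.status_code, response.json())`, sleeping for `backoff_delay(attempt)` whenever the raised error is `retryable`.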