| # VoiceForge API Documentation | |
| ## Base URL | |
| - Development: `http://localhost:8000` | |
| - Production: `https://api.voiceforge.example.com` | |
| ## Authentication | |
| Most endpoints require authentication via a JWT Bearer token, sent in the `Authorization` header: | |
| ```http | |
| Authorization: Bearer <token> | |
| ``` | |
| --- | |
| ## Endpoints | |
| ### Authentication | |
| #### POST /api/v1/auth/register | |
| Register a new user. | |
| **Request:** | |
| ```json | |
| { | |
| "email": "user@example.com", | |
| "password": "strongpassword123", | |
| "name": "Jane Doe" | |
| } | |
| ``` | |
| **Response:** | |
| ```json | |
| { | |
| "id": 1, | |
| "email": "user@example.com", | |
| "name": "Jane Doe", | |
| "created_at": "2024-01-01T10:00:00" | |
| } | |
| ``` | |
| #### POST /api/v1/auth/login | |
| Log in to obtain a JWT access token. | |
| **Request (Form Data):** | |
| - username: user@example.com | |
| - password: strongpassword123 | |
| **Response:** | |
| ```json | |
| { | |
| "access_token": "ey...", | |
| "token_type": "bearer" | |
| } | |
| ``` | |
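The login flow above can be sketched with the standard library alone. This is a minimal sketch, not the service's official client: it assumes the form data is sent as `application/x-www-form-urlencoded` (the section only says "Form Data"), and `build_login_request` and `auth_header` are hypothetical helper names.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # development base URL from this document


def build_login_request(email: str, password: str) -> Request:
    """Build (but do not send) the login request with form-encoded credentials."""
    body = urlencode({"username": email, "password": password}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/v1/auth/login",
        data=body,
        # Assumption: the endpoint accepts URL-encoded form data.
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def auth_header(token_response: dict) -> dict:
    """Turn the login response into the Authorization header used by other endpoints."""
    return {"Authorization": f"Bearer {token_response['access_token']}"}


request = build_login_request("user@example.com", "strongpassword123")
headers = auth_header({"access_token": "ey...", "token_type": "bearer"})
```

Sending `request` with `urllib.request.urlopen` and passing `headers` on later calls (for example `GET /api/v1/auth/me`) would complete the flow.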
| #### GET /api/v1/auth/me | |
| Get current user profile. | |
| --- | |
| ### Health Check | |
| #### GET /health | |
| Check if the API is running. | |
| **Response:** | |
| ```json | |
| { | |
| "status": "healthy", | |
| "service": "voiceforge-api", | |
| "version": "1.0.0" | |
| } | |
| ``` | |
| #### GET /health/memory | |
| Get current memory usage and loaded models. | |
| **Response:** | |
| ```json | |
| { | |
| "memory_mb": 1523.4, | |
| "loaded_models": ["distil-small.en", "small"], | |
| "models_detail": { | |
| "distil-small.en": {"loaded": true, "idle_seconds": 45.2} | |
| } | |
| } | |
| ``` | |
| #### POST /health/memory/cleanup | |
| Unload models that have been idle for more than 5 minutes to free memory. | |
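A client can inspect the `GET /health/memory` response to decide whether a cleanup call is worthwhile. The sketch below assumes the 5-minute threshold maps to 300 seconds of `idle_seconds`; `idle_models` is a hypothetical helper, not part of the API.

```python
def idle_models(report: dict, threshold_seconds: float = 300.0) -> list[str]:
    """Return loaded models whose idle time exceeds the threshold."""
    return [
        name
        for name, detail in report.get("models_detail", {}).items()
        if detail.get("loaded") and detail.get("idle_seconds", 0.0) > threshold_seconds
    ]


# Sample payload copied from the /health/memory response above.
report = {
    "memory_mb": 1523.4,
    "loaded_models": ["distil-small.en", "small"],
    "models_detail": {
        "distil-small.en": {"loaded": True, "idle_seconds": 45.2}
    },
}

candidates = idle_models(report)           # default 5-minute threshold
short_idle = idle_models(report, 30.0)     # stricter 30-second threshold
```

With the sample payload nothing has been idle long enough for the default threshold, so a cleanup call would be unlikely to free anything.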
| #### POST /health/memory/unload-all | |
| Unload ALL models to free maximum memory (~1GB reduction). | |
| --- | |
| ### WebSocket Endpoints | |
| #### WS /api/v1/ws/tts/{client_id} | |
| Real-time TTS streaming over WebSocket, intended for low-latency playback. | |
| **Protocol:** | |
| - **Client sends:** JSON `{"text": "...", "voice": "...", "rate": "+0%", "pitch": "+0Hz"}` | |
| - **Server sends:** Binary audio chunks followed by JSON `{"status": "complete", "ttfb_ms": 150}` | |
| **Expected TTFB:** <500ms | |
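The message handling implied by this protocol can be sketched without committing to a particular WebSocket library. The code below assumes binary frames arrive as `bytes` and the completion message as a text frame; `build_tts_message` and `collect_audio` are hypothetical names, and the frame list only simulates what a real connection would yield.

```python
import json


def build_tts_message(text: str, voice: str, rate: str = "+0%", pitch: str = "+0Hz") -> str:
    """Serialize the JSON message the client sends to start synthesis."""
    return json.dumps({"text": text, "voice": voice, "rate": rate, "pitch": pitch})


def collect_audio(messages) -> tuple[bytes, dict]:
    """Gather binary chunks until the JSON completion message arrives."""
    audio = bytearray()
    for message in messages:
        if isinstance(message, (bytes, bytearray)):
            audio.extend(message)      # binary frame: raw audio chunk
            continue
        status = json.loads(message)   # text frame: JSON status
        if status.get("status") == "complete":
            return bytes(audio), status
    raise ConnectionError("stream ended before the completion message")


# Simulated frames standing in for what a WebSocket client would receive.
frames = [b"\x00\x01", b"\x02\x03", '{"status": "complete", "ttfb_ms": 150}']
outgoing = build_tts_message("Hello", "en-US-AriaNeural")
audio, status = collect_audio(frames)
```

In a real client, `outgoing` would be sent once after connecting and `collect_audio` would consume the incoming frame iterator; for playback, chunks could be handed to an audio sink as they arrive instead of being buffered.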
| --- | |
| ### Speech-to-Text | |
| #### GET /api/v1/stt/languages | |
| Get list of supported languages. | |
| **Response:** | |
| ```json | |
| { | |
| "languages": [ | |
| { | |
| "code": "en-US", | |
| "name": "English (US)", | |
| "native_name": "English", | |
| "flag": "🇺🇸", | |
| "stt_supported": true, | |
| "tts_supported": true | |
| } | |
| ], | |
| "total": 10 | |
| } | |
| ``` | |
| #### POST /api/v1/stt/upload | |
| Transcribe an uploaded audio file. | |
| **Request:** | |
| - Content-Type: `multipart/form-data` | |
| | Parameter | Type | Required | Description | | |
| |-----------|------|----------|-------------| | |
| | file | file | Yes | Audio file (WAV, MP3, M4A, FLAC, OGG) | | |
| | language | string | No | Language code (default: en-US) | | |
| | enable_punctuation | boolean | No | Add punctuation (default: true) | | |
| | enable_word_timestamps | boolean | No | Include word timing (default: true) | | |
| | enable_diarization | boolean | No | Speaker detection (default: false) | | |
| | speaker_count | integer | No | Expected speakers (2-10) | | |
| **Response:** | |
| ```json | |
| { | |
| "id": 1, | |
| "text": "Hello, world. This is a test transcription.", | |
| "segments": [ | |
| { | |
| "text": "Hello, world.", | |
| "start_time": 0.0, | |
| "end_time": 1.5, | |
| "speaker": null, | |
| "confidence": 0.95 | |
| } | |
| ], | |
| "words": [ | |
| { | |
| "word": "Hello", | |
| "start_time": 0.0, | |
| "end_time": 0.5, | |
| "confidence": 0.98 | |
| } | |
| ], | |
| "language": "en-US", | |
| "confidence": 0.95, | |
| "duration": 3.5, | |
| "word_count": 7, | |
| "processing_time": 1.23 | |
| } | |
| ``` | |
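As an illustration of working with this response, the sketch below turns `segments` into timestamped lines. The helper names are hypothetical, and the `HH:MM:SS.mmm` layout is just one convenient choice, not something the API prescribes.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segment_lines(result: dict) -> list[str]:
    """Render each segment as '[start -> end] speaker: text'."""
    lines = []
    for seg in result.get("segments", []):
        speaker = seg.get("speaker") or "-"  # speaker is null in the sample response
        start = format_timestamp(seg["start_time"])
        end = format_timestamp(seg["end_time"])
        lines.append(f"[{start} -> {end}] {speaker}: {seg['text']}")
    return lines


# Trimmed copy of the sample response above.
sample = {
    "segments": [
        {"text": "Hello, world.", "start_time": 0.0, "end_time": 1.5,
         "speaker": None, "confidence": 0.95}
    ]
}
lines = segment_lines(sample)
```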
| #### POST /api/v1/stt/upload/quality | |
| High-quality transcription mode (optimized for accuracy). | |
| **Parameters (form-data):** | |
| | Parameter | Type | Required | Description | | |
| |-----------|------|----------|-------------| | |
| | file | file | Yes | Audio file | | |
| | language | string | No | Language code (default: en-US) | | |
| | preprocess | boolean | No | Apply noise reduction (default: false) | | |
| **Features:** | |
| - beam_size=5 for more accurate decoding (~40% fewer errors) | |
| - condition_on_previous_text=False to reduce hallucinations | |
| - Optional audio preprocessing for noisy environments | |
| **Response:** Same as `/upload` | |
| --- | |
| #### POST /api/v1/stt/upload/batch | |
| Batch transcription for high throughput (2-3x speedup). | |
| **Parameters (form-data):** | |
| | Parameter | Type | Required | Description | | |
| |-----------|------|----------|-------------| | |
| | files | file[] | Yes | Multiple audio files | | |
| | language | string | No | Language code (default: en-US) | | |
| | batch_size | integer | No | Batch size (default: 8) | | |
| **Response:** | |
| ```json | |
| { | |
| "count": 3, | |
| "results": [ | |
| {"filename": "audio1.mp3", "text": "...", "processing_time": 2.1}, | |
| {"filename": "audio2.mp3", "text": "...", "processing_time": 1.8} | |
| ], | |
| "mode": "batched", | |
| "batch_size": 8 | |
| } | |
| ``` | |
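A batch response can be reduced to a filename-to-text mapping plus a total processing time. This is a sketch under the assumption that `processing_time` is reported in seconds per file; `summarize_batch` is a hypothetical helper.

```python
def summarize_batch(response: dict) -> dict:
    """Collect transcripts by filename and sum the per-file processing times."""
    results = response.get("results", [])
    return {
        "transcripts": {item["filename"]: item["text"] for item in results},
        "total_processing_time": round(
            sum(item["processing_time"] for item in results), 3
        ),
    }


# Entries copied from the sample response above.
sample = {
    "results": [
        {"filename": "audio1.mp3", "text": "...", "processing_time": 2.1},
        {"filename": "audio2.mp3", "text": "...", "processing_time": 1.8},
    ],
    "mode": "batched",
    "batch_size": 8,
}
summary = summarize_batch(sample)
```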
| #### POST /api/v1/stt/upload/diarize | |
| Perform Speaker Diarization ("Who said what") on an audio file. | |
| **Requirements:** | |
| - `HF_TOKEN` must be set in `.env` (Hugging Face Token for pyannote model access) | |
| - Uses `faster-whisper` for transcription + `pyannote.audio` for speaker identification | |
| **Parameters (form-data):** | |
| | Parameter | Type | Required | Description | | |
| |-----------|------|----------|-------------| | |
| | file | file | Yes | Audio file | | |
| | num_speakers | integer | No | Exact number of speakers | | |
| | min_speakers | integer | No | Minimum expected number of speakers | | |
| | max_speakers | integer | No | Maximum expected number of speakers | | |
| | language | string | No | Language code, e.g. 'en' (auto-detected if not provided) | | |
| **Response:** | |
| ```json | |
| { | |
| "segments": [ | |
| { | |
| "start": 0.0, | |
| "end": 0.84, | |
| "text": "Hello Test", | |
| "speaker": "SPEAKER_00" | |
| } | |
| ], | |
| "speaker_stats": { | |
| "SPEAKER_00": 0.84 | |
| }, | |
| "language": "en", | |
| "status": "success" | |
| } | |
| ``` | |
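The diarization response lends itself to two small post-processing steps: computing each speaker's share of talk time and rendering a readable transcript. The sketch assumes `speaker_stats` holds per-speaker durations in seconds, which matches the sample but is not stated explicitly; both helper names are hypothetical.

```python
def speaker_shares(speaker_stats: dict) -> dict:
    """Convert per-speaker durations into fractions of total talk time."""
    total = sum(speaker_stats.values())
    if total == 0:
        return {speaker: 0.0 for speaker in speaker_stats}
    return {speaker: duration / total for speaker, duration in speaker_stats.items()}


def speaker_transcript(segments: list) -> list:
    """Render 'SPEAKER: text' lines, merging consecutive segments from one speaker."""
    merged = []
    for seg in segments:
        if merged and merged[-1][0] == seg["speaker"]:
            merged[-1][1].append(seg["text"])
        else:
            merged.append((seg["speaker"], [seg["text"]]))
    return [f"{speaker}: {' '.join(texts)}" for speaker, texts in merged]


# Illustrative segments; only the first comes from the sample response above.
segments = [
    {"start": 0.0, "end": 0.84, "text": "Hello Test", "speaker": "SPEAKER_00"},
    {"start": 0.9, "end": 1.4, "text": "One more", "speaker": "SPEAKER_00"},
    {"start": 1.5, "end": 2.0, "text": "Hi there", "speaker": "SPEAKER_01"},
]
shares = speaker_shares({"SPEAKER_00": 0.84})
transcript = speaker_transcript(segments)
```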
| ### Transcripts & Analysis | |
| #### GET /api/v1/transcripts | |
| List all past transcriptions. | |
| **Response:** | |
| ```json | |
| [ | |
| { | |
| "id": 1, | |
| "text": "Hello world...", | |
| "created_at": "2024-01-01T12:00:00", | |
| "word_count": 150, | |
| "language": "en-US" | |
| } | |
| ] | |
| ``` | |
| #### POST /api/v1/transcripts/{id}/analyze | |
| Run NLP analysis (Sentiment, Keywords, Summary) on a transcript. | |
| **Response:** | |
| ```json | |
| { | |
| "status": "success", | |
| "analysis": { | |
| "sentiment": {"polarity": 0.5, "subjectivity": 0.1}, | |
| "keywords": ["artificial intelligence", "voice", "app"], | |
| "summary": "This is a summary of the transcript." | |
| } | |
| } | |
| ``` | |
| #### GET /api/v1/transcripts/{id}/export | |
| Download transcript in a specific format. | |
| **Query Parameters:** | |
| | Parameter | Type | Required | Description | | |
| |-----------|------|----------|-------------| | |
| | format | string | Yes | txt, srt, vtt, pdf | | |
| **Response:** | |
| - File download (text/plain, text/vtt, application/pdf) | |
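Building the export URL client-side lets an unsupported format fail before any request is made. A minimal sketch, assuming the `format` values are exactly those listed in the table; `export_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

EXPORT_FORMATS = ("txt", "srt", "vtt", "pdf")  # values listed in the table above


def export_url(base_url: str, transcript_id: int, fmt: str) -> str:
    """Return the export URL, rejecting formats the table does not list."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported export format: {fmt!r}")
    query = urlencode({"format": fmt})
    return f"{base_url}/api/v1/transcripts/{transcript_id}/export?{query}"


url = export_url("http://localhost:8000", 1, "srt")
```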
| --- | |
| ### Text-to-Speech | |
| #### GET /api/v1/tts/voices | |
| Get all available voices. | |
| **Query Parameters:** | |
| | Parameter | Type | Description | | |
| |-----------|------|-------------| | |
| | language | string | Filter by language code | | |
| **Response:** | |
| ```json | |
| { | |
| "voices": [ | |
| { | |
| "name": "en-US-Wavenet-D", | |
| "language_code": "en-US", | |
| "language_name": "English (US)", | |
| "ssml_gender": "MALE", | |
| "natural_sample_rate": 24000, | |
| "voice_type": "WaveNet", | |
| "display_name": "D (Male, WaveNet)", | |
| "flag": "🇺🇸" | |
| } | |
| ], | |
| "total": 50, | |
| "language_filter": null | |
| } | |
| ``` | |
| #### GET /api/v1/tts/voices/{language} | |
| Get voices for a specific language. | |
| **Path Parameters:** | |
| | Parameter | Type | Description | | |
| |-----------|------|-------------| | |
| | language | string | Language code (e.g., en-US) | | |
| #### POST /api/v1/tts/synthesize | |
| Convert text to speech. | |
| **Request:** | |
| ```json | |
| { | |
| "text": "Hello, this is a test.", | |
| "language": "en-US", | |
| "voice": "en-US-Wavenet-D", | |
| "audio_encoding": "MP3", | |
| "speaking_rate": 1.0, | |
| "pitch": 0.0, | |
| "volume_gain_db": 0.0, | |
| "use_ssml": false | |
| } | |
| ``` | |
| | Field | Type | Required | Description | | |
| |-------|------|----------|-------------| | |
| | text | string | Yes | Text to synthesize (max 5000 chars) | | |
| | language | string | No | Language code (default: en-US) | | |
| | voice | string | No | Voice name | | |
| | audio_encoding | string | No | MP3, LINEAR16, OGG_OPUS | | |
| | speaking_rate | float | No | Speed 0.25-4.0 (default: 1.0) | | |
| | pitch | float | No | Pitch -20 to 20 (default: 0.0) | | |
| | volume_gain_db | float | No | Volume -96 to 16 dB | | |
| | use_ssml | boolean | No | Treat as SSML markup | | |
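The limits in the table can be checked before sending a request. The sketch below assumes the ranges are inclusive, which the table does not say, and it does not replace the server's own validation; `validate_synthesis_request` is a hypothetical helper.

```python
def validate_synthesis_request(payload: dict) -> list[str]:
    """Return a list of problems found in a /tts/synthesize payload."""
    errors = []
    text = payload.get("text")
    if not isinstance(text, str) or not text:
        errors.append("text is required")
    elif len(text) > 5000:
        errors.append("text exceeds 5000 characters")

    # Numeric ranges from the field table (assumed inclusive).
    ranges = {
        "speaking_rate": (0.25, 4.0),
        "pitch": (-20.0, 20.0),
        "volume_gain_db": (-96.0, 16.0),
    }
    for field, (low, high) in ranges.items():
        value = payload.get(field)
        if value is not None and not low <= value <= high:
            errors.append(f"{field} must be between {low} and {high}")

    encoding = payload.get("audio_encoding")
    if encoding is not None and encoding not in ("MP3", "LINEAR16", "OGG_OPUS"):
        errors.append("audio_encoding must be MP3, LINEAR16, or OGG_OPUS")
    return errors


ok = validate_synthesis_request({"text": "Hello, this is a test.", "speaking_rate": 1.0})
bad = validate_synthesis_request({"text": "x" * 5001, "speaking_rate": 5.0})
```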
| **Response:** | |
| ```json | |
| { | |
| "audio_content": "<base64 encoded audio>", | |
| "audio_size": 12345, | |
| "duration_estimate": 2.5, | |
| "voice_used": "en-US-Wavenet-D", | |
| "language": "en-US", | |
| "encoding": "MP3", | |
| "sample_rate": 24000, | |
| "processing_time": 0.45 | |
| } | |
| ``` | |
| #### POST /api/v1/tts/synthesize/audio | |
| Synthesize and return audio file directly. | |
| Same request as `/synthesize`, but returns the audio file as a download. | |
| #### POST /api/v1/tts/stream | |
| Stream synthesized audio for immediate playback. | |
| **Request:** Same as `/synthesize`. | |
| **Response:** Chunked audio stream (`audio/mpeg`). Well suited to long text, because playback can start before synthesis finishes, which lowers time to first byte (TTFB). | |
| #### POST /api/v1/tts/ssml | |
| Synthesize audio using SSML for prosody control (rate, pitch, emphasis). | |
| **Request:** | |
| - `text`: Text to speak | |
| - `voice`: Voice name (default: "en-US-AriaNeural") | |
| - `rate`: Speed (e.g., "fast", "-10%") | |
| - `pitch`: Pitch (e.g., "high", "+5Hz") | |
| - `emphasis`: "strong", "moderate", "reduced" | |
| - `auto_breaks`: true/false | |
| **Response:** Audio file (`audio/mpeg`). | |
| --- | |
| ## Error Responses | |
| All errors follow this format: | |
| ```json | |
| { | |
| "error": "error_type", | |
| "message": "Human readable message", | |
| "detail": "Additional details (debug mode only)" | |
| } | |
| ``` | |
| ### Common Error Codes | |
| | Code | Type | Description | | |
| |------|------|-------------| | |
| | 400 | validation_error | Invalid request parameters | | |
| | 401 | unauthorized | Missing or invalid auth token | | |
| | 403 | forbidden | Insufficient permissions | | |
| | 404 | not_found | Resource not found | | |
| | 413 | file_too_large | Upload exceeds size limit | | |
| | 429 | rate_limited | Too many requests | | |
| | 500 | internal_error | Server error | | |
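Because every error shares one shape, a client can map it onto a single exception type. This is a sketch: `VoiceForgeError` and `error_from_response` are hypothetical names, and treating 429 and 500 as retryable is an assumption about sensible client behaviour, not documented policy.

```python
class VoiceForgeError(Exception):
    """Client-side wrapper around the documented error body."""

    def __init__(self, status_code: int, error: str, message: str, detail=None):
        super().__init__(f"{status_code} {error}: {message}")
        self.status_code = status_code
        self.error = error
        self.message = message
        self.detail = detail  # only present when the server runs in debug mode

    @property
    def retryable(self) -> bool:
        # Assumption: rate limiting and server errors may succeed on retry.
        return self.status_code in (429, 500)


def error_from_response(status_code: int, body: dict) -> VoiceForgeError:
    """Build an exception from an HTTP status code and the parsed JSON body."""
    return VoiceForgeError(
        status_code,
        body.get("error", "unknown"),
        body.get("message", ""),
        body.get("detail"),
    )


err = error_from_response(429, {"error": "rate_limited", "message": "Too many requests"})
```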
| --- | |
| ## Rate Limits | |
| | Tier | Limit | | |
| |------|-------| | |
| | Free | 60 requests/minute | | |
| | Pro | 600 requests/minute | | |
| | Enterprise | Custom | | |
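Clients that may exceed these limits can retry on `429` with exponential backoff. The sketch below is one common approach, not documented server behaviour: the document does not say whether a `Retry-After` header is returned, so the delays are chosen client-side. `call_with_backoff` is a hypothetical helper, and `send` is any callable returning `(status_code, body)`.

```python
import time


def call_with_backoff(send, max_attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call send() until it stops returning 429, doubling the wait each time."""
    for attempt in range(max_attempts):
        status_code, body = send()
        if status_code != 429:
            return status_code, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
    return status_code, body  # still rate limited after the final attempt


# Simulated responses: two rate-limit replies, then success.
responses = iter([(429, {}), (429, {}), (200, {"ok": True})])
waits = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
```

Injecting `sleep` keeps the example testable; production code would leave the default `time.sleep`.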
| --- | |
| ## Supported Audio Formats | |
| | Format | Extension | Notes | | |
| |--------|-----------|-------| | |
| | WAV | .wav | Best quality, no conversion | | |
| | MP3 | .mp3 | Widely used; converted before processing | | |
| | M4A | .m4a | iOS format | | |
| | FLAC | .flac | Lossless | | |
| | OGG | .ogg | Open format | | |
| | WebM | .webm | Browser recording | | |
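A quick extension check before uploading avoids a round trip for files the service does not list. This sketch only looks at the file name, not the actual content, and `is_supported_audio` is a hypothetical helper.

```python
from pathlib import Path

# Extensions taken from the table above.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension appears in the supported formats table."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


checks = {name: is_supported_audio(name) for name in ("talk.MP3", "recording.webm", "notes.txt")}
```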
| --- | |
| ## Code Examples | |
| ### Python | |
| ```python | |
| import base64 | |
| import requests | |
| # Pass headers={"Authorization": "Bearer <token>"} where an endpoint requires authentication. | |
| # Transcribe audio: the file is sent as multipart/form-data | |
| with open("audio.wav", "rb") as f: | |
|     response = requests.post( | |
|         "http://localhost:8000/api/v1/stt/upload", | |
|         files={"file": f}, | |
|         data={"language": "en-US"}, | |
|     ) | |
| response.raise_for_status()  # fail early on 4xx/5xx responses | |
| print(response.json()["text"]) | |
| # Synthesize speech: the audio is returned base64-encoded in the JSON body | |
| response = requests.post( | |
|     "http://localhost:8000/api/v1/tts/synthesize", | |
|     json={"text": "Hello world", "language": "en-US"}, | |
| ) | |
| response.raise_for_status() | |
| audio = base64.b64decode(response.json()["audio_content"]) | |
| with open("output.mp3", "wb") as f: | |
|     f.write(audio) | |
| ``` | |
| ### cURL | |
| ```bash | |
| # Transcribe | |
| curl -X POST http://localhost:8000/api/v1/stt/upload \ | |
| -F "file=@audio.wav" \ | |
| -F "language=en-US" | |
| # Synthesize | |
| curl -X POST http://localhost:8000/api/v1/tts/synthesize \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"text": "Hello", "language": "en-US"}' | |
| ``` | |