# VoiceForge API Documentation
## Base URL
- Development: `http://localhost:8000`
- Production: `https://api.voiceforge.example.com`
## Authentication
Most endpoints require a JWT Bearer token, obtained from `POST /api/v1/auth/login`. Send it in the `Authorization` header:
```http
Authorization: Bearer <token>
```
---
## Endpoints
### Authentication
#### POST /api/v1/auth/register
Register a new user.
**Request:**
```json
{
"email": "user@example.com",
"password": "strongpassword123",
"name": "Jane Doe"
}
```
**Response:**
```json
{
"id": 1,
"email": "user@example.com",
"name": "Jane Doe",
"created_at": "2024-01-01T10:00:00"
}
```
#### POST /api/v1/auth/login
Log in to obtain a JWT access token.
**Request (Form Data):**
- username: user@example.com
- password: strongpassword123
**Response:**
```json
{
"access_token": "ey...",
"token_type": "bearer"
}
```
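As a rough sketch of the login flow above, the helpers below form-encode the credentials and turn the token response into an `Authorization` header. The function names are illustrative and not part of the API, and the sketch assumes the form data is sent URL-encoded. Only the standard library is used, and the request is built but not sent.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # development base URL from this document


def build_login_request(username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the form-encoded login request."""
    body = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/auth/login",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def auth_header(token_response: dict) -> dict:
    """Turn the documented login response into the Authorization header."""
    return {"Authorization": f"Bearer {token_response['access_token']}"}


# Response shape as documented above; the token value is a placeholder
print(auth_header({"access_token": "ey...", "token_type": "bearer"}))
```

To actually log in, the request could be passed to `urllib.request.urlopen` and the JSON body handed to `auth_header`.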
#### GET /api/v1/auth/me
Get the profile of the currently authenticated user.
---
### Health Check
#### GET /health
Check if the API is running.
**Response:**
```json
{
"status": "healthy",
"service": "voiceforge-api",
"version": "1.0.0"
}
```
#### GET /health/memory
Get current memory usage and loaded models.
**Response:**
```json
{
"memory_mb": 1523.4,
"loaded_models": ["distil-small.en", "small"],
"models_detail": {
"distil-small.en": {"loaded": true, "idle_seconds": 45.2}
}
}
```
#### POST /health/memory/cleanup
Unload idle models (inactive > 5 minutes) to free memory.
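
Before calling the cleanup endpoint, a client might check the `/health/memory` response to see whether any model has been idle long enough to be unloaded. The helper below is a minimal sketch of that check. The function name is hypothetical, and the 300-second default simply mirrors the 5-minute threshold mentioned above.

```python
def idle_models(memory_report: dict, max_idle_seconds: float = 300.0) -> list:
    """Return names of loaded models idle for longer than the threshold."""
    details = memory_report.get("models_detail", {})
    return [
        name
        for name, info in details.items()
        if info.get("loaded") and info.get("idle_seconds", 0.0) > max_idle_seconds
    ]


# Sample taken from the /health/memory response documented above
report = {
    "memory_mb": 1523.4,
    "loaded_models": ["distil-small.en", "small"],
    "models_detail": {
        "distil-small.en": {"loaded": True, "idle_seconds": 45.2},
    },
}
print(idle_models(report))        # default 5-minute threshold
print(idle_models(report, 30.0))  # stricter, illustrative threshold
```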
#### POST /health/memory/unload-all
Unload ALL models to free maximum memory (~1GB reduction).
---
### WebSocket Endpoints
#### WS /api/v1/ws/tts/{client_id}
Real-time TTS streaming via WebSocket (ultra-low latency).
**Protocol:**
- **Client sends:** JSON `{"text": "...", "voice": "...", "rate": "+0%", "pitch": "+0Hz"}`
- **Server sends:** Binary audio chunks followed by JSON `{"status": "complete", "ttfb_ms": 150}`
**Expected TTFB:** <500ms
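
The protocol above can be sketched on the client side as a small message handler: binary frames are appended to an audio buffer, and the JSON text frame marks completion. The class and function names are hypothetical, and the WebSocket transport itself (for example a third-party client library) is deliberately left out, so the frames here are simulated.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


def build_tts_message(text: str, voice: str, rate: str = "+0%", pitch: str = "+0Hz") -> str:
    """Serialize the JSON message the client sends to start synthesis."""
    return json.dumps({"text": text, "voice": voice, "rate": rate, "pitch": pitch})


@dataclass
class TTSStreamReceiver:
    """Collects binary audio chunks until the server's completion message."""

    chunks: list = field(default_factory=list)
    complete: bool = False
    ttfb_ms: Optional[float] = None

    def handle(self, message) -> None:
        if isinstance(message, (bytes, bytearray)):
            self.chunks.append(bytes(message))  # binary frame: audio data
            return
        payload = json.loads(message)  # text frame: JSON status
        if payload.get("status") == "complete":
            self.complete = True
            self.ttfb_ms = payload.get("ttfb_ms")

    @property
    def audio(self) -> bytes:
        return b"".join(self.chunks)


# Simulated frames: two audio chunks followed by the completion message
receiver = TTSStreamReceiver()
for frame in (b"\x00\x01", b"\x02", '{"status": "complete", "ttfb_ms": 150}'):
    receiver.handle(frame)
print(receiver.complete, receiver.ttfb_ms, len(receiver.audio))
```

With a real connection, each received frame would be passed to `handle` until `complete` becomes true.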
---
### Speech-to-Text
#### GET /api/v1/stt/languages
Get list of supported languages.
**Response:**
```json
{
"languages": [
{
"code": "en-US",
"name": "English (US)",
"native_name": "English",
"flag": "🇺🇸",
"stt_supported": true,
"tts_supported": true
}
],
"total": 10
}
```
#### POST /api/v1/stt/upload
Transcribe an uploaded audio file.
**Request:**
- Content-Type: `multipart/form-data`
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file (WAV, MP3, M4A, FLAC, OGG) |
| language | string | No | Language code (default: en-US) |
| enable_punctuation | boolean | No | Add punctuation (default: true) |
| enable_word_timestamps | boolean | No | Include word timing (default: true) |
| enable_diarization | boolean | No | Speaker detection (default: false) |
| speaker_count | integer | No | Expected speakers (2-10) |
**Response:**
```json
{
"id": 1,
"text": "Hello, world. This is a test transcription.",
"segments": [
{
"text": "Hello, world.",
"start_time": 0.0,
"end_time": 1.5,
"speaker": null,
"confidence": 0.95
}
],
"words": [
{
"word": "Hello",
"start_time": 0.0,
"end_time": 0.5,
"confidence": 0.98
}
],
"language": "en-US",
"confidence": 0.95,
"duration": 3.5,
"word_count": 7,
"processing_time": 1.23
}
```
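One way to use the word-level fields in this response is to flag words that may need manual review. The sketch below filters the `words` array by `confidence`. The helper name and the 0.8 threshold are arbitrary choices for illustration, and the second word in the sample is invented to exercise the filter.

```python
def low_confidence_words(transcript: dict, threshold: float = 0.8) -> list:
    """Return (word, start_time, confidence) for words below the threshold."""
    return [
        (w["word"], w["start_time"], w["confidence"])
        for w in transcript.get("words", [])
        if w["confidence"] < threshold
    ]


transcript = {
    "words": [
        {"word": "Hello", "start_time": 0.0, "end_time": 0.5, "confidence": 0.98},
        # Hypothetical second word, added only to exercise the filter
        {"word": "world", "start_time": 0.6, "end_time": 1.0, "confidence": 0.62},
    ]
}
print(low_confidence_words(transcript))
```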
#### POST /api/v1/stt/upload/quality
High-quality transcription mode (optimized for accuracy).
**Parameters (form-data):**
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file |
| language | string | No | Language code (default: en-US) |
| preprocess | boolean | No | Apply noise reduction (default: false) |
**Features:**
- beam_size=5 for more accurate decoding (~40% fewer errors)
- condition_on_previous_text=False to reduce hallucinations
- Optional audio preprocessing for noisy environments
**Response:** Same as `/upload`
---
#### POST /api/v1/stt/upload/batch
Batch transcription for high throughput (2-3x speedup).
**Parameters (form-data):**
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| files | file[] | Yes | Multiple audio files |
| language | string | No | Language code (default: en-US) |
| batch_size | integer | No | Batch size (default: 8) |
**Response:**
```json
{
"count": 3,
"results": [
{"filename": "audio1.mp3", "text": "...", "processing_time": 2.1},
{"filename": "audio2.mp3", "text": "...", "processing_time": 1.8}
],
"mode": "batched",
"batch_size": 8
}
```
#### POST /api/v1/stt/upload/diarize
Perform speaker diarization ("who said what") on an audio file.
**Requirements:**
- `HF_TOKEN` must be set in `.env` (Hugging Face Token for pyannote model access)
- Uses `faster-whisper` for transcription + `pyannote.audio` for speaker identification
**Parameters (form-data):**
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file |
| num_speakers | integer | No | Exact number of speakers |
| min_speakers | integer | No | Minimum expected speakers |
| max_speakers | integer | No | Maximum expected speakers |
| language | string | No | Language code, e.g. 'en' (auto-detected if not provided) |
**Response:**
```json
{
"segments": [
{
"start": 0.0,
"end": 0.84,
"text": "Hello Test",
"speaker": "SPEAKER_00"
}
],
"speaker_stats": {
"SPEAKER_00": 0.84
},
"language": "en",
"status": "success"
}
```
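A diarized response like the one above can be turned into a readable dialogue with a few lines of client code. The sketch below renders one line per segment and totals speaking time per speaker. Both helper names are hypothetical, and reading `speaker_stats` as seconds of speech per speaker is an assumption based on the sample values. The second segment is invented for illustration.

```python
from collections import defaultdict


def render_dialogue(segments: list) -> str:
    """Format each segment as '[start-end] SPEAKER: text'."""
    return "\n".join(
        f'[{seg["start"]:.2f}-{seg["end"]:.2f}] {seg["speaker"]}: {seg["text"]}'
        for seg in segments
    )


def speaking_time(segments: list) -> dict:
    """Sum segment durations (in seconds) per speaker."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


segments = [
    {"start": 0.0, "end": 0.84, "text": "Hello Test", "speaker": "SPEAKER_00"},
    # Hypothetical second segment, not from the documented response
    {"start": 1.0, "end": 2.5, "text": "Hi there", "speaker": "SPEAKER_01"},
]
print(render_dialogue(segments))
print(speaking_time(segments))
```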
### Transcripts & Analysis
#### GET /api/v1/transcripts
List all past transcriptions.
**Response:**
```json
[
{
"id": 1,
"text": "Hello world...",
"created_at": "2024-01-01T12:00:00",
"word_count": 150,
"language": "en-US"
}
]
```
#### POST /api/v1/transcripts/{id}/analyze
Run NLP analysis (Sentiment, Keywords, Summary) on a transcript.
**Response:**
```json
{
"status": "success",
"analysis": {
"sentiment": {"polarity": 0.5, "subjectivity": 0.1},
"keywords": ["artificial intelligence", "voice", "app"],
"summary": "This is a summary of the transcript."
}
}
```
#### GET /api/v1/transcripts/{id}/export
Download transcript in a specific format.
**Query Parameters:**
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| format | string | Yes | txt, srt, vtt, pdf |
**Response:**
- File download. The content type depends on `format` (for example `text/plain`, `text/vtt`, or `application/pdf`).
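
A small sketch of building the export URL on the client, rejecting formats the table above does not list. `export_url` is a hypothetical helper, and the base URL is the development one from this document.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"
EXPORT_FORMATS = {"txt", "srt", "vtt", "pdf"}  # formats listed in the table above


def export_url(transcript_id: int, fmt: str) -> str:
    """Build the export URL, validating the format before any request is made."""
    fmt = fmt.lower()
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported export format: {fmt!r}")
    query = urlencode({"format": fmt})
    return f"{BASE_URL}/api/v1/transcripts/{transcript_id}/export?{query}"


print(export_url(1, "srt"))
```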
---
### Text-to-Speech
#### GET /api/v1/tts/voices
Get all available voices.
**Query Parameters:**
| Parameter | Type | Description |
|-----------|------|-------------|
| language | string | Filter by language code |
**Response:**
```json
{
"voices": [
{
"name": "en-US-Wavenet-D",
"language_code": "en-US",
"language_name": "English (US)",
"ssml_gender": "MALE",
"natural_sample_rate": 24000,
"voice_type": "WaveNet",
"display_name": "D (Male, WaveNet)",
"flag": "🇺🇸"
}
],
"total": 50,
"language_filter": null
}
```
#### GET /api/v1/tts/voices/{language}
Get voices for a specific language.
**Parameters:**
| Parameter | Type | Description |
|-----------|------|-------------|
| language | path | Language code (e.g., en-US) |
#### POST /api/v1/tts/synthesize
Convert text to speech.
**Request:**
```json
{
"text": "Hello, this is a test.",
"language": "en-US",
"voice": "en-US-Wavenet-D",
"audio_encoding": "MP3",
"speaking_rate": 1.0,
"pitch": 0.0,
"volume_gain_db": 0.0,
"use_ssml": false
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| text | string | Yes | Text to synthesize (max 5000 chars) |
| language | string | No | Language code (default: en-US) |
| voice | string | No | Voice name |
| audio_encoding | string | No | MP3, LINEAR16, OGG_OPUS |
| speaking_rate | float | No | Speed 0.25-4.0 (default: 1.0) |
| pitch | float | No | Pitch -20 to 20 (default: 0.0) |
| volume_gain_db | float | No | Volume -96 to 16 dB |
| use_ssml | boolean | No | Treat as SSML markup |
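
The limits in the table above can be checked on the client before a request is sent, which avoids a round trip that would end in a validation error. The function below is only a sketch of such a check. Its name and error messages are made up for illustration, and the server's own validation may differ.

```python
RANGES = {
    "speaking_rate": (0.25, 4.0),
    "pitch": (-20.0, 20.0),
    "volume_gain_db": (-96.0, 16.0),
}
ENCODINGS = {"MP3", "LINEAR16", "OGG_OPUS"}


def validate_synthesis_request(payload: dict) -> list:
    """Return a list of problems; an empty list means the payload looks valid."""
    errors = []
    text = payload.get("text")
    if not isinstance(text, str) or not text:
        errors.append("text is required")
    elif len(text) > 5000:
        errors.append("text exceeds 5000 characters")
    for name, (low, high) in RANGES.items():
        if name in payload and not low <= payload[name] <= high:
            errors.append(f"{name} must be between {low} and {high}")
    encoding = payload.get("audio_encoding")
    if encoding is not None and encoding not in ENCODINGS:
        errors.append("audio_encoding must be one of MP3, LINEAR16, OGG_OPUS")
    return errors


print(validate_synthesis_request({"text": "Hello", "speaking_rate": 5.0}))
```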
**Response:**
```json
{
"audio_content": "<base64 encoded audio>",
"audio_size": 12345,
"duration_estimate": 2.5,
"voice_used": "en-US-Wavenet-D",
"language": "en-US",
"encoding": "MP3",
"sample_rate": 24000,
"processing_time": 0.45
}
```
#### POST /api/v1/tts/synthesize/audio
Synthesize and return audio file directly.
Same request as `/synthesize`, but returns the audio file as a download.
#### POST /api/v1/tts/stream
Stream synthesized audio for immediate playback.
**Request:** Same as `/synthesize`.
**Response:** Chunked audio stream (`audio/mpeg`). Ideal for long text to reduce latency (TTFB).
#### POST /api/v1/tts/ssml
Synthesize audio using SSML for prosody control (rate, pitch, emphasis).
**Request:**
- `text`: Text to speak
- `voice`: Voice name (default: "en-US-AriaNeural")
- `rate`: Speed (e.g., "fast", "-10%")
- `pitch`: Pitch (e.g., "high", "+5Hz")
- `emphasis`: "strong", "moderate", "reduced"
- `auto_breaks`: true/false
**Response:** Audio file (`audio/mpeg`).
---
## Error Responses
All errors follow this format:
```json
{
"error": "error_type",
"message": "Human readable message",
"detail": "Additional details (debug mode only)"
}
```
### Common Error Codes
| Code | Type | Description |
|------|------|-------------|
| 400 | validation_error | Invalid request parameters |
| 401 | unauthorized | Missing or invalid auth token |
| 403 | forbidden | Insufficient permissions |
| 404 | not_found | Resource not found |
| 413 | file_too_large | Upload exceeds size limit |
| 429 | rate_limited | Too many requests |
| 500 | internal_error | Server error |
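
Since every error shares the same JSON shape, a client can map error responses onto a single exception type. The sketch below shows one possible mapping. `VoiceForgeError` and `parse_error` are hypothetical names, not part of any published SDK.

```python
import json


class VoiceForgeError(Exception):
    """Client-side wrapper around the documented error body."""

    def __init__(self, status_code: int, error: str, message: str, detail=None):
        super().__init__(f"{status_code} {error}: {message}")
        self.status_code = status_code
        self.error = error
        self.message = message
        self.detail = detail  # only present when the server runs in debug mode


def parse_error(status_code: int, body: str) -> VoiceForgeError:
    """Build an exception from an HTTP status code and a JSON error body."""
    data = json.loads(body)
    return VoiceForgeError(
        status_code,
        data.get("error", "unknown"),
        data.get("message", ""),
        data.get("detail"),
    )


err = parse_error(404, '{"error": "not_found", "message": "Resource not found"}')
print(err)
```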
---
## Rate Limits
| Tier | Limit |
|------|-------|
| Free | 60 requests/minute |
| Pro | 600 requests/minute |
| Enterprise | Custom |
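
To stay under these limits, a client can space its requests evenly instead of waiting for a `429`. The throttle below is a minimal sketch of that idea. The class name is hypothetical, and how the server actually counts requests per minute is not specified here, so even spacing is a conservative assumption.

```python
import time


class RequestThrottle:
    """Spaces calls evenly so roughly `per_minute` requests start per minute."""

    def __init__(self, per_minute: int, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next request may be sent; return seconds waited."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the following slot one interval after this one
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay


throttle = RequestThrottle(per_minute=60)  # Free tier from the table above
print(throttle.interval)  # seconds between requests
```

Calling `throttle.wait()` before each request keeps a single client at or below the chosen rate.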
---
## Supported Audio Formats
| Format | Extension | Notes |
|--------|-----------|-------|
| WAV | .wav | Best quality, no conversion |
| MP3 | .mp3 | Common, converted |
| M4A | .m4a | iOS format |
| FLAC | .flac | Lossless |
| OGG | .ogg | Open format |
| WebM | .webm | Browser recording |
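
A quick client-side check against the extensions in the table above can catch unsupported files before they are uploaded. This is a sketch with a hypothetical helper name. It only inspects the file extension, not the actual audio content.

```python
from pathlib import Path

# Extensions from the table above
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension is one of the documented formats."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_audio("meeting.MP3"), is_supported_audio("notes.txt"))
```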
---
## Code Examples
### Python
```python
import base64

import requests

BASE_URL = "http://localhost:8000"

# Transcribe audio
with open("audio.wav", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/api/v1/stt/upload",
        files={"file": f},
        data={"language": "en-US"},
    )
response.raise_for_status()  # stop early on an HTTP error
print(response.json()["text"])

# Synthesize speech
response = requests.post(
    f"{BASE_URL}/api/v1/tts/synthesize",
    json={"text": "Hello world", "language": "en-US"},
)
response.raise_for_status()

# The audio is returned base64-encoded in "audio_content"
audio = base64.b64decode(response.json()["audio_content"])
with open("output.mp3", "wb") as f:
    f.write(audio)
```
### cURL
```bash
# Transcribe
curl -X POST http://localhost:8000/api/v1/stt/upload \
-F "file=@audio.wav" \
-F "language=en-US"
# Synthesize
curl -X POST http://localhost:8000/api/v1/tts/synthesize \
-H "Content-Type: application/json" \
-d '{"text": "Hello", "language": "en-US"}'
```