# VoiceForge API Documentation

## Base URL

- Development: `http://localhost:8000`
- Production: `https://api.voiceforge.example.com`

## Authentication

Most endpoints require authentication via a JWT Bearer token, obtained from `POST /api/v1/auth/login` and sent in the `Authorization` header:

```http
Authorization: Bearer <token>
```

---

## Endpoints

### Authentication

#### POST /api/v1/auth/register

Register a new user.

**Request:**
```json
{
  "email": "user@example.com",
  "password": "strongpassword123",
  "name": "Jane Doe"
}
```

**Response:**
```json
{
  "id": 1,
  "email": "user@example.com",
  "name": "Jane Doe",
  "created_at": "2024-01-01T10:00:00"
}
```

#### POST /api/v1/auth/login

Log in to obtain a JWT access token.

**Request (Form Data):**
- username: user@example.com
- password: strongpassword123

**Response:**
```json
{
  "access_token": "ey...",
  "token_type": "bearer"
}
```
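
A minimal sketch of the login flow using only the Python standard library. It assumes the form data is sent URL-encoded (the text above only says "Form Data") and uses the development base URL; `build_login_request`, `auth_header`, and `login` are illustrative helper names, not part of the API.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # development base URL from this document


def build_login_request(email: str, password: str) -> urllib.request.Request:
    """Build the login request without sending it (URL-encoded form is an assumption)."""
    body = urllib.parse.urlencode({"username": email, "password": password}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/auth/login",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def auth_header(login_response: dict) -> dict:
    """Turn a login response into the Authorization header used by other endpoints."""
    return {"Authorization": f"Bearer {login_response['access_token']}"}


def login(email: str, password: str) -> dict:
    """Send the login request and return the header for subsequent calls."""
    with urllib.request.urlopen(build_login_request(email, password)) as resp:
        return auth_header(json.load(resp))
```

The header returned by `login(...)` can then be passed to any authenticated request, for example `GET /api/v1/auth/me`.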

#### GET /api/v1/auth/me

Get the profile of the currently authenticated user.

---

### Health Check

#### GET /health

Check if the API is running.

**Response:**
```json
{
  "status": "healthy",
  "service": "voiceforge-api",
  "version": "1.0.0"
}
```

#### GET /health/memory

Get current memory usage and loaded models.

**Response:**
```json
{
  "memory_mb": 1523.4,
  "loaded_models": ["distil-small.en", "small"],
  "models_detail": {
    "distil-small.en": {"loaded": true, "idle_seconds": 45.2}
  }
}
```

#### POST /health/memory/cleanup

Unload idle models (inactive > 5 minutes) to free memory.
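
As a sketch, a client could inspect the `/health/memory` response before triggering a cleanup. The 300-second threshold mirrors the 5-minute rule above; `idle_models` is a hypothetical helper, and the server may apply its own criteria.

```python
def idle_models(memory_report: dict, threshold_seconds: float = 300.0) -> list[str]:
    """Return names of loaded models that have been idle longer than the threshold."""
    details = memory_report.get("models_detail", {})
    return sorted(
        name
        for name, info in details.items()
        if info.get("loaded") and info.get("idle_seconds", 0.0) > threshold_seconds
    )


# Sample shaped like the /health/memory response; the "small" entry is made up.
report = {
    "memory_mb": 1523.4,
    "loaded_models": ["distil-small.en", "small"],
    "models_detail": {
        "distil-small.en": {"loaded": True, "idle_seconds": 45.2},
        "small": {"loaded": True, "idle_seconds": 612.0},
    },
}
print(idle_models(report))  # ['small']
```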

#### POST /health/memory/unload-all

Unload all models to free as much memory as possible (roughly 1 GB).

---

### WebSocket Endpoints

#### WS /api/v1/ws/tts/{client_id}

Real-time TTS streaming via WebSocket (ultra-low latency).

**Protocol:**
- **Client sends:** JSON `{"text": "...", "voice": "...", "rate": "+0%", "pitch": "+0Hz"}`
- **Server sends:** Binary audio chunks followed by JSON `{"status": "complete", "ttfb_ms": 150}`

**Expected TTFB:** <500ms
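
The protocol above can be sketched independently of any particular WebSocket library. The code assumes the client yields `bytes` for binary frames and `str` for text frames, and that the final JSON status arrives as a text frame; `build_tts_message` and `collect_stream` are illustrative names.

```python
import json
from typing import Iterable, Optional, Tuple, Union


def build_tts_message(text: str, voice: str, rate: str = "+0%", pitch: str = "+0Hz") -> str:
    """Serialize the JSON message the client sends to start synthesis."""
    return json.dumps({"text": text, "voice": voice, "rate": rate, "pitch": pitch})


def collect_stream(frames: Iterable[Union[bytes, str]]) -> Tuple[bytes, Optional[dict]]:
    """Concatenate binary audio chunks until the JSON completion message arrives."""
    audio = bytearray()
    status = None
    for frame in frames:
        if isinstance(frame, (bytes, bytearray)):
            audio.extend(frame)  # binary frame: raw audio chunk
        else:
            status = json.loads(frame)  # text frame: JSON status
            if status.get("status") == "complete":
                break
    return bytes(audio), status


# Simulated frames in the order described above.
audio, status = collect_stream([b"ab", b"cd", '{"status": "complete", "ttfb_ms": 150}'])
print(len(audio), status["ttfb_ms"])  # 4 150
```

In a real client, `frames` would be the messages received from the socket after sending `build_tts_message(...)`.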

---

### Speech-to-Text

#### GET /api/v1/stt/languages

Get list of supported languages.

**Response:**
```json
{
  "languages": [
    {
      "code": "en-US",
      "name": "English (US)",
      "native_name": "English",
      "flag": "🇺🇸",
      "stt_supported": true,
      "tts_supported": true
    }
  ],
  "total": 10
}
```

#### POST /api/v1/stt/upload

Transcribe an uploaded audio file.

**Request:**
- Content-Type: `multipart/form-data`

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file (WAV, MP3, M4A, FLAC, OGG) |
| language | string | No | Language code (default: en-US) |
| enable_punctuation | boolean | No | Add punctuation (default: true) |
| enable_word_timestamps | boolean | No | Include word timing (default: true) |
| enable_diarization | boolean | No | Speaker detection (default: false) |
| speaker_count | integer | No | Expected speakers (2-10) |

**Response:**
```json
{
  "id": 1,
  "text": "Hello, world. This is a test transcription.",
  "segments": [
    {
      "text": "Hello, world.",
      "start_time": 0.0,
      "end_time": 1.5,
      "speaker": null,
      "confidence": 0.95
    }
  ],
  "words": [
    {
      "word": "Hello",
      "start_time": 0.0,
      "end_time": 0.5,
      "confidence": 0.98
    }
  ],
  "language": "en-US",
  "confidence": 0.95,
  "duration": 3.5,
  "word_count": 7,
  "processing_time": 1.23
}
```
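
A short sketch of consuming this response: it renders segments as timestamped lines and flags low-confidence words for review. The helper names and the 0.9 threshold are illustrative choices, not part of the API.

```python
def format_segments(result: dict) -> list[str]:
    """Render each segment as '[start-end] text', prefixed by the speaker when present."""
    lines = []
    for seg in result.get("segments", []):
        prefix = f"{seg['speaker']}: " if seg.get("speaker") else ""
        lines.append(f"[{seg['start_time']:.2f}-{seg['end_time']:.2f}] {prefix}{seg['text']}")
    return lines


def low_confidence_words(result: dict, threshold: float = 0.9) -> list[str]:
    """Return words whose confidence falls below the threshold."""
    return [w["word"] for w in result.get("words", []) if w["confidence"] < threshold]


# Trimmed copy of the sample response above.
sample = {
    "segments": [
        {"text": "Hello, world.", "start_time": 0.0, "end_time": 1.5, "speaker": None, "confidence": 0.95}
    ],
    "words": [{"word": "Hello", "start_time": 0.0, "end_time": 0.5, "confidence": 0.98}],
}
print(format_segments(sample))       # ['[0.00-1.50] Hello, world.']
print(low_confidence_words(sample))  # []
```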

#### POST /api/v1/stt/upload/quality

High-quality transcription mode (optimized for accuracy).

**Parameters (form-data):**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file |
| language | string | No | Language code (default: en-US) |
| preprocess | boolean | No | Apply noise reduction (default: false) |

**Features:**
- beam_size=5 for more accurate decoding (~40% fewer errors)
- condition_on_previous_text=False to reduce hallucinations
- Optional audio preprocessing for noisy environments

**Response:** Same as `/upload`

---

#### POST /api/v1/stt/upload/batch

Batch transcription for high throughput (2-3x speedup).

**Parameters (form-data):**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| files | file[] | Yes | Multiple audio files |
| language | string | No | Language code (default: en-US) |
| batch_size | integer | No | Batch size (default: 8) |

**Response:**
```json
{
  "count": 2,
  "results": [
    {"filename": "audio1.mp3", "text": "...", "processing_time": 2.1},
    {"filename": "audio2.mp3", "text": "...", "processing_time": 1.8}
  ],
  "mode": "batched",
  "batch_size": 8
}
```
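
A minimal sketch of post-processing a batch response, mapping each filename to its text and totalling the reported processing time. `summarize_batch` is a hypothetical helper, and the sample texts are placeholders.

```python
def summarize_batch(response: dict) -> dict:
    """Index batch results by filename and total the per-file processing time."""
    results = response.get("results", [])
    return {
        "texts": {r["filename"]: r["text"] for r in results},
        "total_processing_time": sum(r["processing_time"] for r in results),
    }


batch = {
    "count": 2,
    "results": [
        {"filename": "audio1.mp3", "text": "first", "processing_time": 2.1},
        {"filename": "audio2.mp3", "text": "second", "processing_time": 1.8},
    ],
    "mode": "batched",
    "batch_size": 8,
}
summary = summarize_batch(batch)
print(summary["texts"]["audio2.mp3"])  # second
```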

#### POST /api/v1/stt/upload/diarize

Perform Speaker Diarization ("Who said what") on an audio file.

**Requirements:**
- `HF_TOKEN` must be set in `.env` (Hugging Face Token for pyannote model access)
- Uses `faster-whisper` for transcription + `pyannote.audio` for speaker identification

**Parameters (form-data):**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file |
| num_speakers | integer | No | Exact number of speakers |
| min_speakers | integer | No | Minimum expected speakers |
| max_speakers | integer | No | Maximum expected speakers |
| language | string | No | Language code, e.g. 'en' (auto-detected if not provided) |

**Response:**
```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 0.84,
      "text": "Hello Test",
      "speaker": "SPEAKER_00"
    }
  ],
  "speaker_stats": {
    "SPEAKER_00": 0.84
  },
  "language": "en",
  "status": "success"
}
```
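
To illustrate how the `segments` list can be turned into a "who said what" transcript, the sketch below merges consecutive segments from the same speaker and sums talk time per speaker. It assumes `speaker_stats` reports seconds of speech, which matches the sample above but is not stated explicitly; the helper names are illustrative.

```python
from collections import defaultdict


def talk_time(segments: list[dict]) -> dict:
    """Sum (end - start) per speaker, rounded to two decimals."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {speaker: round(seconds, 2) for speaker, seconds in totals.items()}


def as_dialogue(segments: list[dict]) -> list[str]:
    """Merge consecutive segments by the same speaker into 'SPEAKER: text' lines."""
    turns = []
    for seg in segments:
        if turns and turns[-1][0] == seg["speaker"]:
            turns[-1] = (seg["speaker"], turns[-1][1] + " " + seg["text"])
        else:
            turns.append((seg["speaker"], seg["text"]))
    return [f"{speaker}: {text}" for speaker, text in turns]


# First segment is from the sample above; the others are made up for illustration.
segments = [
    {"start": 0.0, "end": 0.84, "text": "Hello Test", "speaker": "SPEAKER_00"},
    {"start": 0.84, "end": 1.5, "text": "again", "speaker": "SPEAKER_00"},
    {"start": 1.5, "end": 3.0, "text": "Hi", "speaker": "SPEAKER_01"},
]
print(as_dialogue(segments))
```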

### Transcripts & Analysis

#### GET /api/v1/transcripts

List all past transcriptions.

**Response:**
```json
[
  {
    "id": 1,
    "text": "Hello world...",
    "created_at": "2024-01-01T12:00:00",
    "word_count": 150,
    "language": "en-US"
  }
]
```

#### POST /api/v1/transcripts/{id}/analyze

Run NLP analysis (Sentiment, Keywords, Summary) on a transcript.

**Response:**
```json
{
  "status": "success",
  "analysis": {
    "sentiment": {"polarity": 0.5, "subjectivity": 0.1},
    "keywords": ["artificial intelligence", "voice", "app"],
    "summary": "This is a summary of the transcript."
  }
}
```

#### GET /api/v1/transcripts/{id}/export

Download transcript in a specific format.

**Query Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| format | string | Yes | txt, srt, vtt, pdf |

**Response:**
- File download (text/plain, text/vtt, application/pdf)
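
A small sketch of building the export URL with client-side validation of the `format` parameter. `export_url` is a hypothetical helper, and the default base URL is the development one from this document.

```python
from urllib.parse import urlencode

EXPORT_FORMATS = {"txt", "srt", "vtt", "pdf"}  # formats listed in the table above


def export_url(transcript_id: int, fmt: str, base_url: str = "http://localhost:8000") -> str:
    """Return the export URL for a transcript, rejecting formats not listed above."""
    fmt = fmt.lower()
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported format {fmt!r}; expected one of {sorted(EXPORT_FORMATS)}")
    query = urlencode({"format": fmt})
    return f"{base_url}/api/v1/transcripts/{transcript_id}/export?{query}"


print(export_url(1, "srt"))
# http://localhost:8000/api/v1/transcripts/1/export?format=srt
```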

---

### Text-to-Speech

#### GET /api/v1/tts/voices

Get all available voices.

**Query Parameters:**

| Parameter | Type | Description |
|-----------|------|-------------|
| language | string | Filter by language code |

**Response:**
```json
{
  "voices": [
    {
      "name": "en-US-Wavenet-D",
      "language_code": "en-US",
      "language_name": "English (US)",
      "ssml_gender": "MALE",
      "natural_sample_rate": 24000,
      "voice_type": "WaveNet",
      "display_name": "D (Male, WaveNet)",
      "flag": "🇺🇸"
    }
  ],
  "total": 50,
  "language_filter": null
}
```

#### GET /api/v1/tts/voices/{language}

Get voices for a specific language.

**Parameters:**

| Parameter | Type | Description |
|-----------|------|-------------|
| language | path | Language code (e.g., en-US) |

#### POST /api/v1/tts/synthesize

Convert text to speech.

**Request:**
```json
{
  "text": "Hello, this is a test.",
  "language": "en-US",
  "voice": "en-US-Wavenet-D",
  "audio_encoding": "MP3",
  "speaking_rate": 1.0,
  "pitch": 0.0,
  "volume_gain_db": 0.0,
  "use_ssml": false
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| text | string | Yes | Text to synthesize (max 5000 chars) |
| language | string | No | Language code (default: en-US) |
| voice | string | No | Voice name |
| audio_encoding | string | No | MP3, LINEAR16, OGG_OPUS |
| speaking_rate | float | No | Speed 0.25-4.0 (default: 1.0) |
| pitch | float | No | Pitch -20 to 20 (default: 0.0) |
| volume_gain_db | float | No | Volume -96 to 16 dB |
| use_ssml | boolean | No | Treat as SSML markup |
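
The limits in the table can be checked on the client before sending a request. This is a sketch only: `validate_synthesize_request` is a hypothetical helper, and the server's own validation may differ in detail.

```python
ENCODINGS = {"MP3", "LINEAR16", "OGG_OPUS"}
RANGES = {
    "speaking_rate": (0.25, 4.0),
    "pitch": (-20.0, 20.0),
    "volume_gain_db": (-96.0, 16.0),
}
MAX_TEXT_CHARS = 5000


def validate_synthesize_request(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload fits the documented limits."""
    problems = []
    text = payload.get("text")
    if not isinstance(text, str) or not text:
        problems.append("text is required")
    elif len(text) > MAX_TEXT_CHARS:
        problems.append(f"text exceeds {MAX_TEXT_CHARS} characters")
    encoding = payload.get("audio_encoding")
    if encoding is not None and encoding not in ENCODINGS:
        problems.append(f"audio_encoding must be one of {sorted(ENCODINGS)}")
    for field, (low, high) in RANGES.items():
        value = payload.get(field)
        if value is not None and not low <= value <= high:
            problems.append(f"{field} must be between {low} and {high}")
    return problems


print(validate_synthesize_request({"text": "Hello", "speaking_rate": 5.0}))
# ['speaking_rate must be between 0.25 and 4.0']
```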

**Response:**
```json
{
  "audio_content": "<base64 encoded audio>",
  "audio_size": 12345,
  "duration_estimate": 2.5,
  "voice_used": "en-US-Wavenet-D",
  "language": "en-US",
  "encoding": "MP3",
  "sample_rate": 24000,
  "processing_time": 0.45
}
```

#### POST /api/v1/tts/synthesize/audio

Synthesize and return audio file directly.

Same request as `/synthesize`, but returns the audio file as a download.

#### POST /api/v1/tts/stream

Stream synthesized audio for immediate playback.

**Request:** Same as `/synthesize`.

**Response:** Chunked audio stream (`audio/mpeg`). Ideal for long text to reduce latency (TTFB).
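
A sketch of consuming the chunked stream incrementally instead of buffering the whole response. `save_stream` is a hypothetical helper that works with any binary file-like source, such as the response object returned by `urllib.request.urlopen`; the demonstration uses an in-memory buffer rather than a live request.

```python
import io
from typing import BinaryIO


def save_stream(source: BinaryIO, sink: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy audio from source to sink chunk by chunk; return the number of bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty read signals end of stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# In-memory stand-in for the HTTP response body.
fake_response = io.BytesIO(b"\x00" * 10000)
output = io.BytesIO()
print(save_stream(fake_response, output))  # 10000
```

A player could start decoding from `sink` as soon as the first chunk is written, which is where the latency benefit comes from.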

#### POST /api/v1/tts/ssml

Synthesize audio using SSML for prosody control (rate, pitch, emphasis).

**Request:**
- `text`: Text to speak
- `voice`: Voice name (default: "en-US-AriaNeural")
- `rate`: Speed (e.g., "fast", "-10%")
- `pitch`: Pitch (e.g., "high", "+5Hz")
- `emphasis`: "strong", "moderate", "reduced"
- `auto_breaks`: true/false

**Response:** Audio file (`audio/mpeg`).
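
For orientation, the sketch below shows the kind of SSML that `rate`, `pitch`, and `emphasis` map onto, using the standard `<prosody>` and `<emphasis>` elements. It is an assumption about what the server generates, not its actual implementation; `build_ssml` is a hypothetical helper, the output is a simplified fragment, and `auto_breaks` is left out.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    text: str,
    rate: Optional[str] = None,
    pitch: Optional[str] = None,
    emphasis: Optional[str] = None,
) -> str:
    """Wrap escaped text in <emphasis> and <prosody> elements as requested."""
    body = escape(text)  # escape &, <, > so the text cannot break the markup
    if emphasis:
        body = f"<emphasis level={quoteattr(emphasis)}>{body}</emphasis>"
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in (("rate", rate), ("pitch", pitch))
        if value
    )
    if attrs:
        body = f"<prosody{attrs}>{body}</prosody>"
    return f"<speak>{body}</speak>"


print(build_ssml("Hi & bye", rate="fast", pitch="+5Hz", emphasis="strong"))
# <speak><prosody rate="fast" pitch="+5Hz"><emphasis level="strong">Hi &amp; bye</emphasis></prosody></speak>
```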

---

## Error Responses

All errors follow this format:

```json
{
  "error": "error_type",
  "message": "Human readable message",
  "detail": "Additional details (debug mode only)"
}
```

### Common Error Codes

| Code | Type | Description |
|------|------|-------------|
| 400 | validation_error | Invalid request parameters |
| 401 | unauthorized | Missing or invalid auth token |
| 403 | forbidden | Insufficient permissions |
| 404 | not_found | Resource not found |
| 413 | file_too_large | Upload exceeds size limit |
| 429 | rate_limited | Too many requests |
| 500 | internal_error | Server error |
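
A sketch of mapping an error response onto a Python exception. `VoiceForgeError` and `parse_error` are illustrative names; the fallback for non-JSON bodies is an added precaution rather than documented behavior.

```python
import json


class VoiceForgeError(Exception):
    """Client-side exception carrying the fields of the error format above."""

    def __init__(self, status_code: int, error: str, message: str, detail=None):
        super().__init__(f"{status_code} {error}: {message}")
        self.status_code = status_code
        self.error = error
        self.message = message
        self.detail = detail  # only populated in debug mode, per the format above


def parse_error(status_code: int, body: bytes) -> VoiceForgeError:
    """Build an exception from an HTTP status code and raw response body."""
    try:
        payload = json.loads(body)
    except ValueError:  # body is not valid JSON
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    return VoiceForgeError(
        status_code,
        payload.get("error", "unknown"),
        payload.get("message", body.decode(errors="replace")),
        payload.get("detail"),
    )


err = parse_error(429, b'{"error": "rate_limited", "message": "Too many requests"}')
print(err)  # 429 rate_limited: Too many requests
```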

---

## Rate Limits

| Tier | Limit |
|------|-------|
| Free | 60 requests/minute |
| Pro | 600 requests/minute |
| Enterprise | Custom |
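
Two small helpers a client might use to stay within these limits: one spaces requests evenly, the other produces capped exponential backoff delays after a `429` response. Both are hypothetical, and whether the server sends a `Retry-After` header is not stated here.

```python
def min_interval(requests_per_minute: int) -> float:
    """Seconds to wait between requests to stay at or under the given rate."""
    return 60.0 / requests_per_minute


def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


print(min_interval(60))   # 1.0 second between requests on the Free tier
print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

In practice, a client would sleep for each delay in turn and retry until the request succeeds or the schedule is exhausted.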

---

## Supported Audio Formats

| Format | Extension | Notes |
|--------|-----------|-------|
| WAV | .wav | Best quality, no conversion |
| MP3 | .mp3 | Common, converted |
| M4A | .m4a | iOS format |
| FLAC | .flac | Lossless |
| OGG | .ogg | Open format |
| WebM | .webm | Browser recording |
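
A quick client-side check against the extensions in this table can avoid a wasted upload. `is_supported_audio` is a hypothetical helper; the server presumably inspects the file itself as well.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension appears in the table above (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_audio("Talk.MP3"))   # True
print(is_supported_audio("notes.txt"))  # False
```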

---

## Code Examples

### Python

```python
import base64

import requests

BASE_URL = "http://localhost:8000"

# Transcribe audio
with open("audio.wav", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/api/v1/stt/upload",
        files={"file": f},
        data={"language": "en-US"},
    )
response.raise_for_status()  # surface 4xx/5xx errors instead of failing on a missing key
print(response.json()["text"])

# Synthesize speech
response = requests.post(
    f"{BASE_URL}/api/v1/tts/synthesize",
    json={"text": "Hello world", "language": "en-US"},
)
response.raise_for_status()

# The audio is returned base64-encoded in "audio_content"
audio = base64.b64decode(response.json()["audio_content"])
with open("output.mp3", "wb") as f:
    f.write(audio)
```

### cURL

```bash
# Transcribe
curl -X POST http://localhost:8000/api/v1/stt/upload \
  -F "file=@audio.wav" \
  -F "language=en-US"

# Synthesize
curl -X POST http://localhost:8000/api/v1/tts/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "language": "en-US"}'
```