	# VoiceForge API Documentation

	## Base URL

	- Development: `http://localhost:8000`
	- Production: `https://api.voiceforge.example.com`

	## Authentication

	Most endpoints require authentication via JWT Bearer token.

	```http
	Authorization: Bearer <token>
	```

	---

	## Endpoints

	### Authentication

	#### POST /api/v1/auth/register

	Register a new user.

	Request:
	```json
	{
	  "email": "user@example.com",
	  "password": "strongpassword123",
	  "name": "Jane Doe"
	}
	```

	Response:
	```json
	{
	  "id": 1,
	  "email": "user@example.com",
	  "name": "Jane Doe",
	  "created_at": "2024-01-01T10:00:00"
	}
	```

	#### POST /api/v1/auth/login

	Log in to obtain a JWT access token.

	Request (Form Data):
	- username: user@example.com
	- password: strongpassword123

	Response:
	```json
	{
	  "access_token": "ey...",
	  "token_type": "bearer"
	}
	```
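
The login flow above can be sketched with the standard library alone. This is a minimal illustration, assuming the form data is sent URL-encoded (`application/x-www-form-urlencoded`); the helper names `build_login_request` and `bearer_header` are hypothetical and not part of the API.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # development base URL from this document


def build_login_request(email: str, password: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the form-encoded login request."""
    # Assumption: the form fields are URL-encoded; the email goes in "username".
    form = urllib.parse.urlencode({"username": email, "password": password})
    return urllib.request.Request(
        f"{base_url}/api/v1/auth/login",
        data=form.encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def bearer_header(login_response_body: bytes) -> dict:
    """Turn the login response JSON into an Authorization header."""
    payload = json.loads(login_response_body)
    return {"Authorization": f"Bearer {payload['access_token']}"}
```

To send the request, pass it to `urllib.request.urlopen` and feed the response body to `bearer_header`; the resulting dictionary can then be attached to calls against protected endpoints.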

	#### GET /api/v1/auth/me

	Get current user profile.

	---

	### Health Check

	#### GET /health

	Check if the API is running.

	Response:
	```json
	{
	  "status": "healthy",
	  "service": "voiceforge-api",
	  "version": "1.0.0"
	}
	```

	#### GET /health/memory

	Get current memory usage and loaded models.

	Response:
	```json
	{
	  "memory_mb": 1523.4,
	  "loaded_models": ["distil-small.en", "small"],
	  "models_detail": {
	    "distil-small.en": {"loaded": true, "idle_seconds": 45.2}
	  }
	}
	```

	#### POST /health/memory/cleanup

	Unload idle models (inactive > 5 minutes) to free memory.
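
As a rough illustration, a client could inspect the `/health/memory` response to see which models have passed the 5-minute idle threshold before calling the cleanup endpoint. The function name `idle_models` is hypothetical, and the assumption that the server compares `idle_seconds` against 300 seconds is an inference from the description above, not confirmed behavior.

```python
IDLE_THRESHOLD_SECONDS = 5 * 60  # "inactive > 5 minutes" from the cleanup description


def idle_models(memory_report: dict, threshold: float = IDLE_THRESHOLD_SECONDS) -> list:
    """Return the names of loaded models whose idle time exceeds the threshold."""
    details = memory_report.get("models_detail", {})
    return sorted(
        name
        for name, info in details.items()
        if info.get("loaded") and info.get("idle_seconds", 0.0) > threshold
    )
```

With the sample response shown earlier, `distil-small.en` has been idle for only 45.2 seconds, so this function would return an empty list.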

	#### POST /health/memory/unload-all

	Unload ALL models to free maximum memory (~1GB reduction).

	---

	### WebSocket Endpoints

	#### WS /api/v1/ws/tts/{client_id}

	Real-time TTS streaming via WebSocket (ultra-low latency).

	Protocol:
	- Client sends: JSON `{"text": "...", "voice": "...", "rate": "+0%", "pitch": "+0Hz"}`
	- Server sends: Binary audio chunks followed by JSON `{"status": "complete", "ttfb_ms": 150}`

	Expected TTFB: <500ms
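
The message flow above can be sketched independently of any particular WebSocket library. The sketch assumes binary frames arrive as `bytes` and the final status message arrives as a JSON string; `tts_request` and `collect_tts_stream` are hypothetical helper names.

```python
import json


def tts_request(text: str, voice: str, rate: str = "+0%", pitch: str = "+0Hz") -> str:
    """Serialize the JSON message the client sends to start synthesis."""
    return json.dumps({"text": text, "voice": voice, "rate": rate, "pitch": pitch})


def collect_tts_stream(frames):
    """Accumulate binary audio chunks until the JSON completion message arrives.

    `frames` is any iterable of received frames: bytes for audio chunks,
    str for JSON messages. Returns (audio_bytes, completion_message).
    """
    audio = bytearray()
    for frame in frames:
        if isinstance(frame, (bytes, bytearray)):
            audio.extend(frame)
            continue
        message = json.loads(frame)
        if message.get("status") == "complete":
            return bytes(audio), message
    raise ConnectionError("stream ended before the completion message arrived")
```

In a real client, `frames` would be the receive loop of whichever WebSocket library is in use, and audio chunks would typically be handed to a player as they arrive rather than buffered in full.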

	---

	### Speech-to-Text

	#### GET /api/v1/stt/languages

	Get list of supported languages.

	Response:
	```json
	{
	  "languages": [
	    {
	      "code": "en-US",
	      "name": "English (US)",
	      "native_name": "English",
	      "flag": "🇺🇸",
	      "stt_supported": true,
	      "tts_supported": true
	    }
	  ],
	  "total": 10
	}
	```

	#### POST /api/v1/stt/upload

	Transcribe an uploaded audio file.

	Request:
	- Content-Type: `multipart/form-data`

	| Parameter | Type | Required | Description |
	|-----------|------|----------|-------------|
	| file | file | Yes | Audio file (WAV, MP3, M4A, FLAC, OGG) |
	| language | string | No | Language code (default: en-US) |
	| enable_punctuation | boolean | No | Add punctuation (default: true) |
	| enable_word_timestamps | boolean | No | Include word timing (default: true) |
	| enable_diarization | boolean | No | Speaker detection (default: false) |
	| speaker_count | integer | No | Expected speakers (2-10) |
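
A small helper can assemble the non-file form fields and enforce the documented `speaker_count` range before anything is uploaded. This is a sketch: the name `stt_form_fields` is hypothetical, and encoding booleans as the strings `"true"` and `"false"` is an assumption about how the server parses form data.

```python
def stt_form_fields(
    language: str = "en-US",
    enable_punctuation: bool = True,
    enable_word_timestamps: bool = True,
    enable_diarization: bool = False,
    speaker_count=None,
) -> dict:
    """Build the form fields that accompany the uploaded file."""
    if speaker_count is not None and not 2 <= speaker_count <= 10:
        raise ValueError("speaker_count must be between 2 and 10")
    fields = {
        "language": language,
        # Assumption: booleans are sent as lowercase strings in form data.
        "enable_punctuation": str(enable_punctuation).lower(),
        "enable_word_timestamps": str(enable_word_timestamps).lower(),
        "enable_diarization": str(enable_diarization).lower(),
    }
    if speaker_count is not None:
        fields["speaker_count"] = str(speaker_count)
    return fields
```

The returned dictionary can be passed as the `data` argument of an HTTP client call alongside the `file` part.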

	Response:
	```json
	{
	  "id": 1,
	  "text": "Hello, world. This is a test transcription.",
	  "segments": [
	    {
	      "text": "Hello, world.",
	      "start_time": 0.0,
	      "end_time": 1.5,
	      "speaker": null,
	      "confidence": 0.95
	    }
	  ],
	  "words": [
	    {
	      "word": "Hello",
	      "start_time": 0.0,
	      "end_time": 0.5,
	      "confidence": 0.98
	    }
	  ],
	  "language": "en-US",
	  "confidence": 0.95,
	  "duration": 3.5,
	  "word_count": 7,
	  "processing_time": 1.23
	}
	```
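
To show how the response fields fit together, the following sketch renders segments as timestamped lines and flags words below a confidence threshold. Both function names and the 0.9 threshold are illustrative choices, not part of the API.

```python
def format_segments(result: dict) -> list:
    """Render each segment as '[start-end] text' using the segment timings."""
    return [
        f"[{seg['start_time']:.2f}-{seg['end_time']:.2f}] {seg['text']}"
        for seg in result.get("segments", [])
    ]


def low_confidence_words(result: dict, threshold: float = 0.9) -> list:
    """Return words whose confidence falls below the threshold, for review."""
    return [w["word"] for w in result.get("words", []) if w["confidence"] < threshold]
```

Note that `words` is only expected to be populated when `enable_word_timestamps` is on, so `low_confidence_words` falls back to an empty list when the key is missing.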

	#### POST /api/v1/stt/upload/quality

	High-quality transcription mode (optimized for accuracy).

	Parameters (form-data):

	| Parameter | Type | Required | Description |
	|-----------|------|----------|-------------|
	| file | file | Yes | Audio file |
	| language | string | No | Language code (default: en-US) |
	| preprocess | boolean | No | Apply noise reduction (default: false) |

	Features:
	- beam_size=5 for more accurate decoding (~40% fewer errors)
	- condition_on_previous_text=False to reduce hallucinations
	- Optional audio preprocessing for noisy environments

	Response: Same as `/upload`

	---

	#### POST /api/v1/stt/upload/batch

	Batch transcription for high throughput (2-3x speedup).

	Parameters (form-data):

	| Parameter | Type | Required | Description |
	|-----------|------|----------|-------------|
	| files | file[] | Yes | Multiple audio files |
	| language | string | No | Language code (default: en-US) |
	| batch_size | integer | No | Batch size (default: 8) |

	Response:
	```json
	{
	  "count": 3,
	  "results": [
	    {"filename": "audio1.mp3", "text": "...", "processing_time": 2.1},
	    {"filename": "audio2.mp3", "text": "...", "processing_time": 1.8}
	  ],
	  "mode": "batched",
	  "batch_size": 8
	}
	```

	#### POST /api/v1/stt/upload/diarize

	Perform Speaker Diarization ("Who said what") on an audio file.

	Requirements:
	- `HF_TOKEN` must be set in `.env` (Hugging Face Token for pyannote model access)
	- Uses `faster-whisper` for transcription + `pyannote.audio` for speaker identification

	Parameters (form-data):

	| Parameter | Type | Required | Description |
	|-----------|------|----------|-------------|
	| file | file | Yes | Audio file |
	| num_speakers | integer | No | Exact number of speakers (optional) |
	| min_speakers | integer | No | Min expected speakers (optional) |
	| max_speakers | integer | No | Max expected speakers (optional) |
	| language | string | No | Language code, e.g. 'en' (auto-detected if not provided) |

	Response:
	```json
	{
	  "segments": [
	    {
	      "start": 0.0,
	      "end": 0.84,
	      "text": "Hello Test",
	      "speaker": "SPEAKER_00"
	    }
	  ],
	  "speaker_stats": {
	    "SPEAKER_00": 0.84
	  },
	  "language": "en",
	  "status": "success"
	}
	```
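
A diarized response can be turned into a readable dialogue by merging consecutive segments from the same speaker. The sketch below also recomputes per-speaker totals from the segments; treating `speaker_stats` as total speaking time in seconds is an inference from the sample response (0.84 equals the single segment's duration), not documented behavior. Function names are hypothetical.

```python
from collections import defaultdict


def speaking_time(segments: list) -> dict:
    """Sum segment durations per speaker, rounded to two decimals."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {speaker: round(total, 2) for speaker, total in totals.items()}


def to_dialogue(segments: list) -> list:
    """Merge consecutive segments by the same speaker into 'SPEAKER: text' turns."""
    turns = []
    for seg in segments:
        if turns and turns[-1][0] == seg["speaker"]:
            # Same speaker as the previous turn: extend that turn's text.
            turns[-1] = (seg["speaker"], turns[-1][1] + " " + seg["text"])
        else:
            turns.append((seg["speaker"], seg["text"]))
    return [f"{speaker}: {text}" for speaker, text in turns]
```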

	### Transcripts & Analysis

	#### GET /api/v1/transcripts

	List all past transcriptions.

	Response:
	```json
	[
	  {
	    "id": 1,
	    "text": "Hello world...",
	    "created_at": "2024-01-01T12:00:00",
	    "word_count": 150,
	    "language": "en-US"
	  }
	]
	```

	#### POST /api/v1/transcripts/{id}/analyze

	Run NLP analysis (Sentiment, Keywords, Summary) on a transcript.

	Response:
	```json
	{
	  "status": "success",
	  "analysis": {
	    "sentiment": {"polarity": 0.5, "subjectivity": 0.1},
	    "keywords": ["artificial intelligence", "voice", "app"],
	    "summary": "This is a summary of the transcript."
	  }
	}
	```

	#### GET /api/v1/transcripts/{id}/export

	Download transcript in a specific format.

	Query Parameters:

	| Parameter | Type | Required | Description |
	|-----------|------|----------|-------------|
	| format | string | Yes | txt, srt, vtt, pdf |

	Response:
	- File download (text/plain, text/vtt, application/pdf)
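
Since `format` is required and limited to four values, a client may want to validate it before building the export URL. This is a minimal sketch; `export_url` is a hypothetical helper and the default base URL is the development one listed in this document.

```python
from urllib.parse import urlencode

EXPORT_FORMATS = ("txt", "srt", "vtt", "pdf")  # values listed in the table above


def export_url(transcript_id: int, fmt: str, base_url: str = "http://localhost:8000") -> str:
    """Build the export URL for a transcript, rejecting unsupported formats early."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"format must be one of {', '.join(EXPORT_FORMATS)}")
    query = urlencode({"format": fmt})
    return f"{base_url}/api/v1/transcripts/{transcript_id}/export?{query}"
```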

	---

	### Text-to-Speech

	#### GET /api/v1/tts/voices

	Get all available voices.

	Query Parameters:

	| Parameter | Type | Description |
	|-----------|------|-------------|
	| language | string | Filter by language code |

	Response:
	```json
	{
	  "voices": [
	    {
	      "name": "en-US-Wavenet-D",
	      "language_code": "en-US",
	      "language_name": "English (US)",
	      "ssml_gender": "MALE",
	      "natural_sample_rate": 24000,
	      "voice_type": "WaveNet",
	      "display_name": "D (Male, WaveNet)",
	      "flag": "🇺🇸"
	    }
	  ],
	  "total": 50,
	  "language_filter": null
	}
	```

	#### GET /api/v1/tts/voices/{language}

	Get voices for a specific language.

	Parameters:

	| Parameter | Type | Description |
	|-----------|------|-------------|
	| language | path | Language code (e.g., en-US) |

	#### POST /api/v1/tts/synthesize

	Convert text to speech.

	Request:
	```json
	{
	  "text": "Hello, this is a test.",
	  "language": "en-US",
	  "voice": "en-US-Wavenet-D",
	  "audio_encoding": "MP3",
	  "speaking_rate": 1.0,
	  "pitch": 0.0,
	  "volume_gain_db": 0.0,
	  "use_ssml": false
	}
	```

	| Field | Type | Required | Description |
	|-------|------|----------|-------------|
	| text | string | Yes | Text to synthesize (max 5000 chars) |
	| language | string | No | Language code (default: en-US) |
	| voice | string | No | Voice name |
	| audio_encoding | string | No | MP3, LINEAR16, OGG_OPUS |
	| speaking_rate | float | No | Speed 0.25-4.0 (default: 1.0) |
	| pitch | float | No | Pitch -20 to 20 (default: 0.0) |
	| volume_gain_db | float | No | Volume -96 to 16 dB |
	| use_ssml | boolean | No | Treat as SSML markup |
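
The limits in the table lend themselves to a client-side check that catches obvious mistakes before a request is sent. The sketch below is illustrative only: `validate_synthesize_request` is a hypothetical helper, and the server's own validation may differ (for example, in whether the bounds are inclusive).

```python
NUMERIC_RANGES = {
    "speaking_rate": (0.25, 4.0),
    "pitch": (-20.0, 20.0),
    "volume_gain_db": (-96.0, 16.0),
}
AUDIO_ENCODINGS = ("MP3", "LINEAR16", "OGG_OPUS")
MAX_TEXT_CHARS = 5000


def validate_synthesize_request(payload: dict) -> list:
    """Return a list of human-readable problems; an empty list means the payload looks valid."""
    problems = []
    text = payload.get("text")
    if not isinstance(text, str) or not text:
        problems.append("text is required")
    elif len(text) > MAX_TEXT_CHARS:
        problems.append(f"text exceeds {MAX_TEXT_CHARS} characters")
    for field, (low, high) in NUMERIC_RANGES.items():
        value = payload.get(field)
        # Assumption: the documented bounds are inclusive.
        if value is not None and not low <= value <= high:
            problems.append(f"{field} must be between {low} and {high}")
    encoding = payload.get("audio_encoding")
    if encoding is not None and encoding not in AUDIO_ENCODINGS:
        problems.append(f"audio_encoding must be one of {', '.join(AUDIO_ENCODINGS)}")
    return problems
```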

	Response:
	```json
	{
	  "audio_content": "<base64 encoded audio>",
	  "audio_size": 12345,
	  "duration_estimate": 2.5,
	  "voice_used": "en-US-Wavenet-D",
	  "language": "en-US",
	  "encoding": "MP3",
	  "sample_rate": 24000,
	  "processing_time": 0.45
	}
	```

	#### POST /api/v1/tts/synthesize/audio

	Synthesize and return audio file directly.

	Same request as `/synthesize`, but returns the audio file as a download.

	#### POST /api/v1/tts/stream

	Stream synthesized audio for immediate playback.

	Request: Same as `/synthesize`.

	Response: Chunked audio stream (`audio/mpeg`). Suited to long text: playback can start as soon as the first chunk arrives, which lowers time to first byte (TTFB).

	#### POST /api/v1/tts/ssml

	Synthesize audio using SSML for prosody control (rate, pitch, emphasis).

	Request:
	- `text`: Text to speak
	- `voice`: Voice name (default: "en-US-AriaNeural")
	- `rate`: Speed (e.g., "fast", "-10%")
	- `pitch`: Pitch (e.g., "high", "+5Hz")
	- `emphasis`: "strong", "moderate", "reduced"
	- `auto_breaks`: true/false

	Response: Audio file (`audio/mpeg`).

	---

	## Error Responses

	All errors follow this format:

	```json
	{
	  "error": "error_type",
	  "message": "Human readable message",
	  "detail": "Additional details (debug mode only)"
	}
	```

	### Common Error Codes

	| Code | Type | Description |
	|------|------|-------------|
	| 400 | validation_error | Invalid request parameters |
	| 401 | unauthorized | Missing or invalid auth token |
	| 403 | forbidden | Insufficient permissions |
	| 404 | not_found | Resource not found |
	| 413 | file_too_large | Upload exceeds size limit |
	| 429 | rate_limited | Too many requests |
	| 500 | internal_error | Server error |
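
Because every error shares one JSON shape, a client can map failures to a single exception type. The following is a sketch under stated assumptions: `VoiceForgeError` and `parse_error` are hypothetical names, and treating 429 and 500 as retryable is a common client-side convention rather than something this API specifies.

```python
import json

RETRYABLE_STATUSES = {429, 500}  # assumption: worth retrying after a delay


class VoiceForgeError(Exception):
    """Client-side wrapper around the documented error format."""

    def __init__(self, status: int, error: str, message: str, detail=None):
        super().__init__(f"{status} {error}: {message}")
        self.status = status
        self.error = error
        self.message = message
        self.detail = detail  # only present when the server runs in debug mode

    @property
    def retryable(self) -> bool:
        return self.status in RETRYABLE_STATUSES


def parse_error(status: int, body: str) -> VoiceForgeError:
    """Build a VoiceForgeError from a status code and raw response body."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    # Fall back to the raw body when the response is not in the documented format.
    return VoiceForgeError(
        status,
        payload.get("error", "unknown"),
        payload.get("message", body),
        payload.get("detail"),
    )
```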

	---

	## Rate Limits

	| Tier | Limit |
	|------|-------|
	| Free | 60 requests/minute |
	| Pro | 600 requests/minute |
	| Enterprise | Custom |
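
To stay under these limits, a client can throttle itself instead of waiting for a 429 response. The sketch below uses a sliding one-minute window; whether the server counts requests in a sliding or fixed window is not stated here, so this is a conservative assumption. `SlidingWindowLimiter` is a hypothetical class, and the clock is injectable so the logic can be tested without sleeping.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Track request timestamps and report how long to wait before the next one."""

    def __init__(self, max_requests: int = 60, window_seconds: float = 60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sent = deque()  # timestamps of requests inside the current window

    def wait_time(self) -> float:
        """Seconds to wait before another request fits in the window (0.0 if none)."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self._sent[0])

    def record(self) -> None:
        """Call once for every request actually sent."""
        self._sent.append(self._clock())
```

A typical loop would call `time.sleep(limiter.wait_time())`, send the request, then call `limiter.record()`. The default of 60 requests per 60 seconds matches the Free tier; Pro clients would pass `max_requests=600`.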

	---

	## Supported Audio Formats

	| Format | Extension | Notes |
	|--------|-----------|-------|
	| WAV | .wav | Best quality, no conversion |
	| MP3 | .mp3 | Common, converted |
	| M4A | .m4a | iOS format |
	| FLAC | .flac | Lossless |
	| OGG | .ogg | Open format |
	| WebM | .webm | Browser recording |
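
A quick extension check before uploading can avoid a round trip for files the API will not accept. This sketch only looks at the file name, not the actual contents; `is_supported_audio` is a hypothetical helper.

```python
from pathlib import Path

# Extensions taken from the table above.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}


def is_supported_audio(filename: str) -> bool:
    """Return True when the file extension matches a supported audio format."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```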

	---

	## Code Examples

	### Python

	```python
	import base64

	import requests

	# Transcribe audio: the file is sent as multipart/form-data
	with open("audio.wav", "rb") as f:
	    response = requests.post(
	        "http://localhost:8000/api/v1/stt/upload",
	        files={"file": f},
	        data={"language": "en-US"},
	    )
	response.raise_for_status()  # surface HTTP errors before parsing the body
	print(response.json()["text"])

	# Synthesize speech: the audio comes back base64-encoded in the JSON body
	response = requests.post(
	    "http://localhost:8000/api/v1/tts/synthesize",
	    json={"text": "Hello world", "language": "en-US"},
	)
	response.raise_for_status()
	audio = base64.b64decode(response.json()["audio_content"])
	with open("output.mp3", "wb") as f:
	    f.write(audio)
	```

	### cURL

	```bash
	# Transcribe
	curl -X POST http://localhost:8000/api/v1/stt/upload \
	  -F "file=@audio.wav" \
	  -F "language=en-US"

	# Synthesize
	curl -X POST http://localhost:8000/api/v1/tts/synthesize \
	  -H "Content-Type: application/json" \
	  -d '{"text": "Hello", "language": "en-US"}'
	```