---
license: apache-2.0
title: Forge-TTS
sdk: docker
emoji: 🏃
colorFrom: yellow
colorTo: indigo
short_description: TTS
---
# HF Spaces CPU TTS API (Docker)
API-only Space with separate endpoints for:

- **XTTS v2**: voice cloning with an uploaded reference clip; Polish by default
- **Parler-TTS mini multilingual v1.1**: fast, high-quality Polish TTS; style controlled by a text description
- **Piper**: backup, local voices; bring your own `.onnx` voice files
Runs on the HF Spaces free CPU tier (2 vCPU / 16 GB RAM) with CPU-friendly defaults:

- Sentence-based chunking to avoid timeouts on long text
- Streaming via SSE (each chunk returned as a standalone WAV)
- Optional `torch.compile` and optional dynamic int8 quantization hooks
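The chunker's actual implementation isn't shown in this README; a minimal sketch of sentence-based chunking under a `CHUNK_MAX_CHARS`-style limit (the function name and exact splitting regex are illustrative, not the server's code):

```python
import re

def chunk_sentences(text: str, max_chars: int = 260) -> list[str]:
    """Split text on sentence boundaries, then pack sentences into
    chunks no longer than max_chars (a single over-long sentence
    simply becomes its own chunk)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

# Usage: each returned chunk is synthesized (and streamed) separately.
chunk_sentences("Pierwsze zdanie. Drugie zdanie. Trzecie.", max_chars=20)
```

Packing by characters rather than tokens keeps the sketch model-agnostic; the real server also caps words and sentences per chunk (see the environment variables below).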
## Endpoints

### Health

- `GET /health`

### XTTS v2

- `POST /v1/xtts/synthesize` (multipart/form-data; returns WAV bytes)
- `POST /v1/xtts/stream` (SSE; base64 WAV chunks)

### Parler

- `POST /v1/parler/synthesize` (JSON; returns WAV bytes)
- `POST /v1/parler/stream` (JSON; SSE base64 WAV chunks)

### Piper

- `GET /v1/piper/voices`
- `POST /v1/piper/synthesize` (JSON; returns WAV bytes)

OpenAPI docs: `/docs`
## Usage examples

### XTTS voice cloning (file upload)
```bash
curl -X POST "http://localhost:7860/v1/xtts/synthesize" \
  -F "text=Cześć! To jest test głosu." \
  -F "language=pl" \
  -F "chunking=true" \
  -F "speaker_wav=@reference.wav" \
  --output out.wav
```
### XTTS streaming (SSE)

This streams multiple base64-encoded WAV chunks as SSE events. Your client should decode each `wav_b64` payload and play or append it.
```bash
curl -N -X POST "http://localhost:7860/v1/xtts/stream" \
  -H "Content-Type: application/json" \
  -d '{"text":"Cześć! To jest dłuższy tekst. Druga fraza. Trzecia fraza.","language":"pl","chunking":true}'
```
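The event schema beyond the `wav_b64` field isn't specified in this README; a minimal client-side sketch that parses SSE lines and decodes each chunk (the `wav_b64` field name comes from the note above, everything else is assumed):

```python
import base64
import json

def iter_wav_chunks(sse_lines):
    """Yield decoded WAV bytes from an iterable of SSE lines.
    Assumes each event's 'data:' payload is JSON containing a
    'wav_b64' field with a base64-encoded standalone WAV."""
    for line in sse_lines:
        line = line.strip()
        if line.startswith("data:"):
            payload = json.loads(line[len("data:"):].strip())
            if "wav_b64" in payload:
                yield base64.b64decode(payload["wav_b64"])
```

In practice you would feed it lines from a streaming HTTP response (e.g. `requests.post(..., stream=True).iter_lines(decode_unicode=True)`) and append each decoded chunk to your playback buffer.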
### Parler synthesis
```bash
curl -X POST "http://localhost:7860/v1/parler/synthesize" \
  -H "Content-Type: application/json" \
  -d '{
    "text":"Cześć! To Parler w języku polskim.",
    "description":"A calm female Polish voice, close-mic, warm tone, subtle smile, studio quality."
  }' \
  --output parler.wav
```
### Piper voices + synthesis
```bash
curl "http://localhost:7860/v1/piper/voices"

curl -X POST "http://localhost:7860/v1/piper/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text":"To jest Piper jako kopia zapasowa.","voice_id":"pl_PL-gosia-medium"}' \
  --output piper.wav
```
## Environment variables (important knobs)

### XTTS

- `XTTS_MODEL_NAME` (default: `tts_models/multilingual/multi-dataset/xtts_v2`)
- `XTTS_DEFAULT_LANGUAGE` (default: `pl`)
- `XTTS_TORCH_COMPILE=1` to attempt `torch.compile()` (best-effort)
- `XTTS_DYNAMIC_INT8=1` to attempt dynamic int8 quantization (best-effort)
### Parler

- `PARLER_MODEL_NAME` (default: `parler-tts/parler-tts-mini-multilingual-v1.1`)
- `PARLER_DEFAULT_DESCRIPTION` (default is a neutral Polish voice)
- `PARLER_SEED` (default: 0)
- `PARLER_TORCH_COMPILE=1` (best-effort)
- `PARLER_DYNAMIC_INT8=1` (best-effort)
### Chunking / joining

- `CHUNK_MAX_CHARS` (default: 260)
- `CHUNK_MAX_WORDS` (default: 40)
- `CHUNK_MAX_SENTENCES` (default: 8)
- `JOIN_SILENCE_MS` (default: 60)
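`JOIN_SILENCE_MS` implies chunks are joined with a short silence gap. The server's joiner isn't shown here, but a stdlib-only sketch of the idea (assuming all chunks share the same sample rate, sample width, and channel count) could look like:

```python
import io
import wave

def join_wavs(wav_blobs, silence_ms=60):
    """Concatenate standalone WAV chunks into one WAV, inserting
    silence_ms of silence between consecutive chunks."""
    out = io.BytesIO()
    writer = None
    silence = b""
    for blob in wav_blobs:
        with wave.open(io.BytesIO(blob)) as r:
            if writer is None:
                # First chunk fixes the output parameters.
                writer = wave.open(out, "wb")
                writer.setparams(r.getparams())
                n = int(r.getframerate() * silence_ms / 1000)
                silence = b"\x00" * (n * r.getsampwidth() * r.getnchannels())
            else:
                writer.writeframes(silence)
            writer.writeframes(r.readframes(r.getnframes()))
    if writer is not None:
        writer.close()  # patches the frame count in the header
    return out.getvalue()
```

Zero-filled PCM frames are silence for 16-bit WAV, which is what the TTS engines here emit; a gap of about 60 ms masks the chunk seams without sounding like a pause.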
### Piper

Bring your own Piper `.onnx` voices:

- Put voice files in `/data/piper` (auto-scanned), OR
- Set `PIPER_VOICES_JSON='{"voice_id":"/data/piper/voice.onnx"}'`
- Optionally set `PIPER_VOICES_DIR` (default: `/data/piper`)
## Notes on “streaming”
XTTS and Parler streaming here is implemented by:
- Sentence chunking (fast + stable on CPU)
- Returning each chunk as its own WAV event over SSE
This avoids needing the full WAV length upfront and prevents long-run timeouts on free Spaces CPU.