---
license: apache-2.0
title: Forge-TTS
sdk: docker
emoji: 🏃
colorFrom: yellow
colorTo: indigo
short_description: TTS
---

# Forge-TTS: HF Spaces CPU TTS API (Docker)

API-only Space with separate endpoints for:

- XTTS v2: voice cloning with an uploaded reference clip (Polish by default)
- Parler-TTS mini multilingual v1.1: fast, high-quality Polish TTS; style controlled by a text description
- Piper: backup local voices (bring your own `.onnx` voice files)

Runs on the HF Spaces free CPU tier (2 vCPU / 16 GB RAM) with CPU-friendly defaults:

- Sentence-based chunking to avoid timeouts on long text
- Streaming via SSE, with each chunk returned as a standalone WAV
- Optional `torch.compile` and dynamic int8 quantization hooks (both best-effort)

## Endpoints

### Health

- `GET /health`

### XTTS v2

- `POST /v1/xtts/synthesize` (multipart/form-data; returns WAV bytes)
- `POST /v1/xtts/stream` (SSE; base64 WAV chunks)

### Parler

- `POST /v1/parler/synthesize` (JSON; returns WAV bytes)
- `POST /v1/parler/stream` (JSON in; SSE base64 WAV chunks out)

### Piper

- `GET /v1/piper/voices`
- `POST /v1/piper/synthesize` (JSON; returns WAV bytes)

Interactive OpenAPI docs are served at `/docs`.

## Usage examples

### XTTS voice cloning (file upload)

```bash
curl -X POST "http://localhost:7860/v1/xtts/synthesize" \
  -F "text=Cześć! To jest test głosu." \
  -F "language=pl" \
  -F "chunking=true" \
  -F "speaker_wav=@reference.wav" \
  --output out.wav
```

### XTTS streaming (SSE)

This endpoint streams multiple base64-encoded WAV chunks as SSE events. Your client should decode each `wav_b64` payload and play or append it as it arrives.

```bash
curl -N -X POST "http://localhost:7860/v1/xtts/stream" \
  -H "Content-Type: application/json" \
  -d '{"text":"Cześć! To jest dłuższy tekst. Druga fraza. Trzecia fraza.","language":"pl","chunking":true}'
```
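A minimal client-side sketch of consuming this stream: it assumes each SSE `data:` line carries a JSON object with the `wav_b64` field mentioned above (other fields in the payload, and the exact transport library you use, are up to you).

```python
import base64
import json


def decode_sse_event(data_line: str) -> bytes:
    """Decode one SSE `data:` line into raw WAV bytes.

    Assumes the event payload is a JSON object with a `wav_b64`
    field, as described above; the payload may carry other fields too.
    """
    payload = json.loads(data_line.removeprefix("data:").strip())
    return base64.b64decode(payload["wav_b64"])


def iter_wav_chunks(lines):
    """Yield decoded WAV bytes for each `data:` line in an SSE stream,
    skipping comments and keep-alive lines."""
    for line in lines:
        if line.startswith("data:"):
            yield decode_sse_event(line)
```

Feed `iter_wav_chunks` the decoded lines of the HTTP response body; each yielded chunk is a complete standalone WAV you can play or append.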

### Parler synthesis

```bash
curl -X POST "http://localhost:7860/v1/parler/synthesize" \
  -H "Content-Type: application/json" \
  -d '{
    "text":"Cześć! To Parler w języku polskim.",
    "description":"A calm female Polish voice, close-mic, warm tone, subtle smile, studio quality."
  }' \
  --output parler.wav
```

### Piper voices + synthesis

```bash
curl "http://localhost:7860/v1/piper/voices"

curl -X POST "http://localhost:7860/v1/piper/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text":"To jest Piper jako kopia zapasowa.","voice_id":"pl_PL-gosia-medium"}' \
  --output piper.wav
```

## Environment variables (important knobs)

### XTTS

- `XTTS_MODEL_NAME` (default: `tts_models/multilingual/multi-dataset/xtts_v2`)
- `XTTS_DEFAULT_LANGUAGE` (default: `pl`)
- `XTTS_TORCH_COMPILE=1` to attempt `torch.compile()` (best-effort)
- `XTTS_DYNAMIC_INT8=1` to attempt dynamic int8 quantization (best-effort)

### Parler

- `PARLER_MODEL_NAME` (default: `parler-tts/parler-tts-mini-multilingual-v1.1`)
- `PARLER_DEFAULT_DESCRIPTION` (default: a neutral Polish voice)
- `PARLER_SEED` (default: 0)
- `PARLER_TORCH_COMPILE=1` (best-effort)
- `PARLER_DYNAMIC_INT8=1` (best-effort)
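"Best-effort" means a failed optimization never breaks synthesis: the unmodified model is used instead. A generic sketch of that pattern (the helper name is illustrative, not taken from the Space's code):

```python
import logging


def best_effort(transform, model, enabled: bool):
    """Apply `transform` to `model`, falling back to the original model
    if the flag is off or the transform raises.

    Mirrors the intent of the *_TORCH_COMPILE / *_DYNAMIC_INT8 flags:
    failures are logged, never fatal.
    """
    if not enabled:
        return model
    try:
        return transform(model)
    except Exception as exc:  # the optimization is optional
        logging.warning("optimization skipped: %s", exc)
        return model
```

In the real Space the transform would be something like `torch.compile` or `torch.ao.quantization.quantize_dynamic`, gated on the environment variable being `1`.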

### Chunking / joining

- `CHUNK_MAX_CHARS` (default: 260)
- `CHUNK_MAX_WORDS` (default: 40)
- `CHUNK_MAX_SENTENCES` (default: 8)
- `JOIN_SILENCE_MS` (default: 60)
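How these limits interact can be sketched as a greedy sentence packer (a simplified illustration; the Space's actual splitter may use a different sentence tokenizer). Chunks are later joined with `JOIN_SILENCE_MS` of silence between them.

```python
import re


def chunk_sentences(text, max_chars=260, max_words=40, max_sentences=8):
    """Greedily pack sentences into chunks, starting a new chunk
    whenever adding the next sentence would exceed any limit
    (CHUNK_MAX_CHARS / CHUNK_MAX_WORDS / CHUNK_MAX_SENTENCES)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        joined = " ".join(candidate)
        if current and (len(joined) > max_chars
                        or len(joined.split()) > max_words
                        or len(candidate) > max_sentences):
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks
```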

### Piper

Bring your own Piper `.onnx` voices:

- Put voice files in `/data/piper` (auto-scanned), or
- Set `PIPER_VOICES_JSON='{"voice_id":"/data/piper/voice.onnx"}'`
- Optionally set `PIPER_VOICES_DIR` to change the scanned directory (default: `/data/piper`)
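The voice lookup described above can be sketched as follows (the function name and the precedence of explicit JSON entries over scanned files are illustrative assumptions):

```python
import json
import os
from pathlib import Path


def discover_piper_voices():
    """Build a {voice_id: onnx_path} map for Piper.

    Scans PIPER_VOICES_DIR (default /data/piper) for *.onnx files,
    keyed by filename stem, then overlays explicit entries from
    PIPER_VOICES_JSON so they win on conflicts.
    """
    voices = {}
    voices_dir = Path(os.environ.get("PIPER_VOICES_DIR", "/data/piper"))
    if voices_dir.is_dir():
        for onnx in sorted(voices_dir.glob("*.onnx")):
            voices[onnx.stem] = str(onnx)
    voices.update(json.loads(os.environ.get("PIPER_VOICES_JSON", "{}")))
    return voices
```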

## Notes on “streaming”

XTTS and Parler “streaming” here is implemented as:

1. Sentence-based chunking (fast and stable on CPU)
2. Returning each chunk as its own WAV event over SSE

This avoids having to know the full WAV length upfront and prevents long-request timeouts on the free Spaces CPU tier.
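The server-side framing of step 2 can be sketched like this (the `wav_b64` field name comes from the API description above; the `index` field is an illustrative assumption):

```python
import base64
import json


def sse_wav_event(wav_bytes: bytes, index: int) -> str:
    """Frame one synthesized chunk as a Server-Sent Event.

    Each event carries a complete, standalone base64-encoded WAV,
    so clients can decode and play chunks as they arrive. The blank
    line after the data field terminates the event per the SSE format.
    """
    payload = {"index": index,
               "wav_b64": base64.b64encode(wav_bytes).decode("ascii")}
    return f"data: {json.dumps(payload)}\n\n"
```

A FastAPI handler would yield one such string per synthesized chunk from a `StreamingResponse` with the `text/event-stream` media type.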