	---
	license: apache-2.0
	title: Forge-TTS
	sdk: docker
	emoji: 🏃
	colorFrom: yellow
	colorTo: indigo
	short_description: TTS
	---
	# HF Spaces CPU TTS API (Docker)

	API-only Space with separate endpoints for:

	- XTTS v2 (voice cloning with an uploaded reference clip; Polish by default)
	- Parler-TTS mini multilingual v1.1 (fast, high-quality Polish TTS; style controlled by text description)
	- Piper (backup, local voices; bring your own `.onnx` voice files)

	Runs on the free HF Spaces CPU tier (2 vCPU / 16 GB RAM) with CPU-friendly defaults:
	- Sentence-based chunking to avoid timeouts on long text
	- Streaming via SSE (each chunk is returned as a standalone WAV)
	- Optional `torch.compile` and dynamic int8 quantization hooks (both best-effort)

	---

	## Endpoints

	### Health
	- `GET /health`

	### XTTS v2
	- `POST /v1/xtts/synthesize` (multipart/form-data; WAV bytes)
	- `POST /v1/xtts/stream` (SSE; base64 WAV chunks)

	### Parler
	- `POST /v1/parler/synthesize` (JSON; WAV bytes)
	- `POST /v1/parler/stream` (JSON; SSE base64 WAV chunks)

	### Piper
	- `GET /v1/piper/voices`
	- `POST /v1/piper/synthesize` (JSON; WAV bytes)

	OpenAPI docs:
	- `/docs`

	---

	## Usage examples

	### XTTS voice cloning (file upload)
	```bash
	curl -X POST "http://localhost:7860/v1/xtts/synthesize" \
	-F "text=Cześć! To jest test głosu." \
	-F "language=pl" \
	-F "chunking=true" \
	-F "speaker_wav=@reference.wav" \
	--output out.wav
	```

	### XTTS streaming (SSE)
	This endpoint streams the audio as a sequence of SSE events, each carrying one base64-encoded WAV chunk. The client should decode each `wav_b64` value and either play the chunk or append it to the output.
	```bash
	curl -N -X POST "http://localhost:7860/v1/xtts/stream" \
	-H "Content-Type: application/json" \
	-d '{"text":"Cześć! To jest dłuższy tekst. Druga fraza. Trzecia fraza.","language":"pl","chunking":true}'
	```
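
The client-side decoding described above could look roughly like the sketch below. It assumes each SSE event carries a JSON object in its `data:` line(s) with a `wav_b64` key; the exact event layout is not documented here, so check `/docs` before relying on it. The helper name `iter_wav_chunks` is hypothetical.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_wav_chunks(lines: Iterable[str]) -> Iterator[bytes]:
    """Yield decoded WAV bytes from an iterable of SSE text lines.

    Assumption: every audio event is a JSON object with a `wav_b64` key.
    Events that are not JSON, or lack that key, are skipped.
    """
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # SSE allows one optional space after the colon.
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line terminates the current event.
            body = "\n".join(data_parts)
            data_parts = []
            try:
                payload = json.loads(body)
            except json.JSONDecodeError:
                continue
            if isinstance(payload, dict) and "wav_b64" in payload:
                yield base64.b64decode(payload["wav_b64"])
        # Other SSE fields (event:, id:, retry:) and comments are ignored.
```

Any HTTP client that exposes the response as decoded text lines can feed this generator; each yielded value is a complete WAV file that can be played or written to disk on its own.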

	### Parler synth
	```bash
	curl -X POST "http://localhost:7860/v1/parler/synthesize" \
	-H "Content-Type: application/json" \
	-d '{
	"text":"Cześć! To Parler w języku polskim.",
	"description":"A calm female Polish voice, close-mic, warm tone, subtle smile, studio quality."
	}' \
	--output parler.wav
	```

	### Piper voices + synth
	```bash
	curl "http://localhost:7860/v1/piper/voices"

	curl -X POST "http://localhost:7860/v1/piper/synthesize" \
	-H "Content-Type: application/json" \
	-d '{"text":"To jest Piper jako kopia zapasowa.","voice_id":"pl_PL-gosia-medium"}' \
	--output piper.wav
	```

	---

	## Environment variables (important knobs)

	### XTTS
	- `XTTS_MODEL_NAME` (default: `tts_models/multilingual/multi-dataset/xtts_v2`)
	- `XTTS_DEFAULT_LANGUAGE` (default: `pl`)
	- `XTTS_TORCH_COMPILE=1` to attempt `torch.compile()` (best-effort)
	- `XTTS_DYNAMIC_INT8=1` to attempt dynamic int8 quantization (best-effort)

	### Parler
	- `PARLER_MODEL_NAME` (default: `parler-tts/parler-tts-mini-multilingual-v1.1`)
	- `PARLER_DEFAULT_DESCRIPTION` (default is neutral Polish)
	- `PARLER_SEED` (default: `0`)
	- `PARLER_TORCH_COMPILE=1` (best-effort)
	- `PARLER_DYNAMIC_INT8=1` (best-effort)
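
As an illustration of what "best-effort" means for the `*_TORCH_COMPILE` and `*_DYNAMIC_INT8` flags, the sketch below tries each optimization and falls back to the unmodified model on any failure. The function names, the `"1"`-only flag parsing, and the choice to quantize only `torch.nn.Linear` layers are assumptions for this example, not the Space's actual code.

```python
import os


def env_flag(name: str) -> bool:
    """Treat the variable as enabled only when it is set to "1"."""
    return os.environ.get(name, "0").strip() == "1"


def maybe_optimize(model, prefix: str):
    """Apply optional CPU optimizations; return the input model on failure.

    `prefix` is the variable prefix, e.g. "XTTS" or "PARLER".
    """
    if env_flag(f"{prefix}_DYNAMIC_INT8"):
        try:
            import torch

            # Dynamic quantization stores Linear weights as int8 and
            # quantizes activations on the fly at inference time.
            model = torch.quantization.quantize_dynamic(
                model, {torch.nn.Linear}, dtype=torch.qint8
            )
        except Exception as exc:  # best-effort: keep the float model
            print(f"[{prefix}] int8 quantization skipped: {exc}")

    if env_flag(f"{prefix}_TORCH_COMPILE"):
        try:
            import torch

            model = torch.compile(model)
        except Exception as exc:  # best-effort: keep the eager model
            print(f"[{prefix}] torch.compile skipped: {exc}")

    return model
```

With both flags unset the model is returned untouched, so the default path has no extra startup cost.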

	### Chunking / joining
	- `CHUNK_MAX_CHARS` (default: 260)
	- `CHUNK_MAX_WORDS` (default: 40)
	- `CHUNK_MAX_SENTENCES` (default: 8)
	- `JOIN_SILENCE_MS` (default: 60)
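
To make the three chunk limits concrete, here is a minimal sketch of greedy sentence packing that uses the documented defaults. The splitting regex and the function name `chunk_text` are assumptions; the Space's real chunker may split and pack differently.

```python
import re


def chunk_text(text, max_chars=260, max_words=40, max_sentences=8):
    """Pack whole sentences into chunks that respect all three limits.

    A single sentence that exceeds a limit on its own is kept as one
    oversized chunk rather than being split mid-sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?…])\s+", text.strip()) if s]

    chunks, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        joined = " ".join(candidate)
        too_big = (
            len(joined) > max_chars
            or len(joined.split()) > max_words
            or len(candidate) > max_sentences
        )
        if current and too_big:
            # Close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Lower limits produce more, shorter chunks: the first audio arrives sooner when streaming, at the cost of more joins (and more inserted silence) in the final output.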

	### Piper
	Bring your own Piper `.onnx` voices. Either:
	- Put voice files in `/data/piper` (scanned automatically), or
	- Set `PIPER_VOICES_JSON='{"voice_id":"/data/piper/voice.onnx"}'` to map voice IDs to files explicitly.

	To scan a different directory, set `PIPER_VOICES_DIR` (default: `/data/piper`).
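
One way the two voice sources could be combined is sketched below. It assumes that a scanned file's voice ID is its name without the `.onnx` extension (consistent with `pl_PL-gosia-medium` in the usage example) and that entries in `PIPER_VOICES_JSON` take precedence over scanned ones. Both rules, and the function name, are assumptions for illustration.

```python
import json
import os
from pathlib import Path


def discover_piper_voices(env=os.environ):
    """Return a {voice_id: path} map from the voices dir and the JSON map."""
    voices_dir = Path(env.get("PIPER_VOICES_DIR", "/data/piper"))
    voices = {}

    # Scan the directory; a missing directory simply yields no voices.
    if voices_dir.is_dir():
        for path in sorted(voices_dir.glob("*.onnx")):
            voices[path.stem] = str(path)

    # Explicit mappings are applied last, so they override scanned entries.
    explicit = env.get("PIPER_VOICES_JSON")
    if explicit:
        voices.update(json.loads(explicit))

    return voices
```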

	---

	## Notes on “streaming”
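	XTTS and Parler streaming here works at the chunk level. It is implemented by:
	1) Splitting the text into sentence-based chunks (fast and stable on CPU)
	2) Returning each synthesized chunk as its own WAV event over SSE

	Because every event is a complete WAV, the server never needs to know the total audio length upfront, and long texts do not hit request timeouts on the free Spaces CPU.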
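
A client that wants a single file can reassemble the streamed chunks itself. The sketch below concatenates standalone WAVs with a short gap between them, mirroring the `JOIN_SILENCE_MS` default of 60 ms. It assumes every chunk is uncompressed PCM with the same channel count, sample width, and sample rate, and it writes zero bytes as silence, which is correct for 16-bit PCM but not for 8-bit WAV. The function name `join_wavs` is hypothetical.

```python
import io
import wave


def join_wavs(chunks, silence_ms=60):
    """Concatenate standalone WAV byte strings into one WAV byte string."""
    if not chunks:
        raise ValueError("no chunks to join")

    fmt = None
    pcm_parts = []
    for chunk in chunks:
        with wave.open(io.BytesIO(chunk), "rb") as reader:
            this_fmt = (
                reader.getnchannels(),
                reader.getsampwidth(),
                reader.getframerate(),
            )
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                raise ValueError(f"format mismatch: {this_fmt} != {fmt}")
            pcm_parts.append(reader.readframes(reader.getnframes()))

    channels, sampwidth, framerate = fmt
    silence_frames = framerate * silence_ms // 1000
    silence = b"\x00" * (silence_frames * channels * sampwidth)

    out = io.BytesIO()
    with wave.open(out, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sampwidth)
        writer.setframerate(framerate)
        # Silence is inserted between chunks only, not at either end.
        writer.writeframes(silence.join(pcm_parts))
    return out.getvalue()
```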