---
license: apache-2.0
title: Forge-TTS
sdk: docker
emoji: 🏃
colorFrom: yellow
colorTo: indigo
short_description: TTS
---

# HF Spaces CPU TTS API (Docker)

API-only Space with **separate endpoints** for:

- **XTTS v2** (voice cloning with an uploaded reference clip; Polish by default)
- **Parler-TTS mini multilingual v1.1** (fast, high-quality Polish TTS; style controlled by text description)
- **Piper** (backup, local voices; bring your own `.onnx` voice files)

Runs on **HF Spaces free CPU (2 vCPU / 16 GB RAM)** with CPU-friendly defaults:

- **Chunking** (sentence-based) to avoid timeouts on long text
- **Streaming** via SSE (each chunk returned as a standalone WAV)
- Optional **torch.compile** and optional **dynamic int8 quantization** hooks

---

## Endpoints

### Health

- `GET /health`

### XTTS v2

- `POST /v1/xtts/synthesize` (multipart/form-data; WAV bytes)
- `POST /v1/xtts/stream` (SSE; base64 WAV chunks)

### Parler

- `POST /v1/parler/synthesize` (JSON; WAV bytes)
- `POST /v1/parler/stream` (JSON; SSE base64 WAV chunks)

### Piper

- `GET /v1/piper/voices`
- `POST /v1/piper/synthesize` (JSON; WAV bytes)

OpenAPI docs:

- `/docs`

---

## Usage examples

### XTTS voice cloning (file upload)

```bash
curl -X POST "http://localhost:7860/v1/xtts/synthesize" \
  -F "text=Cześć! To jest test głosu." \
  -F "language=pl" \
  -F "chunking=true" \
  -F "speaker_wav=@reference.wav" \
  --output out.wav
```

### XTTS streaming (SSE)

This streams **multiple WAV chunks** (base64) as events. The client should base64-decode the `wav_b64` value of each event, then play the chunk or append it to the output.

```bash
curl -N -X POST "http://localhost:7860/v1/xtts/stream" \
  -H "Content-Type: application/json" \
  -d '{"text":"Cześć! To jest dłuższy tekst. Druga fraza. Trzecia fraza.","language":"pl","chunking":true}'
```
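To make the decoding step concrete, here is a minimal client-side sketch in Python. It assumes that each SSE event carries a JSON object in its `data:` field and that the audio sits under the `wav_b64` key mentioned above; the exact event layout is not documented here, so treat the parsing details and the helper name `iter_wav_chunks` as illustrative.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_wav_chunks(lines: Iterable[str]) -> Iterator[bytes]:
    """Yield decoded WAV bytes for each SSE event found in `lines`.

    Assumption: every audio event's data is a JSON object with a `wav_b64`
    field. Events that are not JSON or lack that field are skipped.
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # In the SSE format, one optional space after the colon is dropped.
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line terminates the current event.
            body, data_parts = "\n".join(data_parts), []
            try:
                payload = json.loads(body)
            except json.JSONDecodeError:
                continue  # e.g. a plain-text end-of-stream marker
            if isinstance(payload, dict) and "wav_b64" in payload:
                yield base64.b64decode(payload["wav_b64"])
        # Other fields (event:, id:, ": comment") are ignored in this sketch.
```

Feed it text lines from any HTTP client that can read the response incrementally; each yielded item is a complete WAV file that can be played or written to disk on its own.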

### Parler synth

```bash
curl -X POST "http://localhost:7860/v1/parler/synthesize" \
  -H "Content-Type: application/json" \
  -d '{
    "text":"Cześć! To Parler w języku polskim.",
    "description":"A calm female Polish voice, close-mic, warm tone, subtle smile, studio quality."
  }' \
  --output parler.wav
```

### Piper voices + synth

```bash
curl "http://localhost:7860/v1/piper/voices"

curl -X POST "http://localhost:7860/v1/piper/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text":"To jest Piper jako kopia zapasowa.","voice_id":"pl_PL-gosia-medium"}' \
  --output piper.wav
```

---

## Environment variables (important knobs)

### XTTS

- `XTTS_MODEL_NAME` (default: `tts_models/multilingual/multi-dataset/xtts_v2`)
- `XTTS_DEFAULT_LANGUAGE` (default: `pl`)
- `XTTS_TORCH_COMPILE=1` to attempt `torch.compile()` (best-effort)
- `XTTS_DYNAMIC_INT8=1` to attempt dynamic int8 quantization (best-effort)

### Parler

- `PARLER_MODEL_NAME` (default: `parler-tts/parler-tts-mini-multilingual-v1.1`)
- `PARLER_DEFAULT_DESCRIPTION` (default is neutral Polish)
- `PARLER_SEED` (default: `0`)
- `PARLER_TORCH_COMPILE=1` (best-effort)
- `PARLER_DYNAMIC_INT8=1` (best-effort)
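As an illustration of what "best-effort" can mean for the `*_TORCH_COMPILE` and `*_DYNAMIC_INT8` flags, the sketch below tries each optimization and falls back to the unmodified model if anything fails. The helper names (`env_flag`, `maybe_optimize`), the order of the two steps, and the choice to quantize only `Linear` layers are assumptions, not a description of this Space's actual code.

```python
import os


def env_flag(name: str) -> bool:
    """Return True only when the variable is set to "1" (assumed convention)."""
    return os.environ.get(name, "0") == "1"


def maybe_optimize(model, prefix: str):
    """Apply optional CPU optimizations; keep the original model on failure.

    `prefix` is "XTTS" or "PARLER", matching the variable names above.
    """
    if env_flag(f"{prefix}_DYNAMIC_INT8"):
        try:
            import torch
            from torch.ao.quantization import quantize_dynamic

            # Dynamic quantization stores Linear weights as int8.
            model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
        except Exception as exc:  # best-effort: continue with what we have
            print(f"[{prefix}] int8 quantization skipped: {exc}")
    if env_flag(f"{prefix}_TORCH_COMPILE"):
        try:
            import torch

            # torch.compile is lazy: compilation errors can also surface
            # on the first forward call, not only here.
            model = torch.compile(model)
        except Exception as exc:  # best-effort: continue with what we have
            print(f"[{prefix}] torch.compile skipped: {exc}")
    return model
```

With both flags unset, `maybe_optimize(model, "XTTS")` returns the model untouched, so the default path stays the plain eager model.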

### Chunking / joining

- `CHUNK_MAX_CHARS` (default: 260)
- `CHUNK_MAX_WORDS` (default: 40)
- `CHUNK_MAX_SENTENCES` (default: 8)
- `JOIN_SILENCE_MS` (default: 60)
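The three `CHUNK_MAX_*` limits can be pictured as a greedy sentence packer. The following is a minimal sketch, assuming that a chunk is closed as soon as adding the next sentence would exceed any one of the limits; the regex-based sentence splitter and the function name `chunk_text` are illustrative, and the real splitter may behave differently.

```python
import os
import re

MAX_CHARS = int(os.environ.get("CHUNK_MAX_CHARS", "260"))
MAX_WORDS = int(os.environ.get("CHUNK_MAX_WORDS", "40"))
MAX_SENTENCES = int(os.environ.get("CHUNK_MAX_SENTENCES", "8"))


def chunk_text(text, max_chars=MAX_CHARS, max_words=MAX_WORDS,
               max_sentences=MAX_SENTENCES):
    """Greedily pack whole sentences into chunks that respect all limits.

    A sentence that exceeds a limit on its own becomes its own chunk
    instead of being split mid-sentence.
    """
    # Naive split: break after ., !, ? or … when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?…])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        joined = " ".join(candidate)
        fits = (
            len(candidate) <= max_sentences
            and len(joined) <= max_chars
            and len(joined.split()) <= max_words
        )
        if fits or not current:
            current = candidate
        else:
            chunks.append(" ".join(current))
            current = [sentence]
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Lower limits produce more, shorter chunks, which means more SSE events and a shorter wait for the first audio, at the cost of more joins.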

### Piper

Bring your own Piper `.onnx` voices:

- Put voice files in `/data/piper` (auto-scanned) **OR**
- Set `PIPER_VOICES_JSON='{"voice_id":"/data/piper/voice.onnx"}'`
- Optionally set `PIPER_VOICES_DIR` (default: `/data/piper`)
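One way the two voice sources could be combined is sketched below. It assumes that the file name without `.onnx` becomes the `voice_id` and that explicit `PIPER_VOICES_JSON` entries take precedence over scanned files; neither rule is stated above, and `discover_piper_voices` is a hypothetical helper.

```python
import json
import os
from pathlib import Path


def discover_piper_voices(env=os.environ):
    """Return a {voice_id: path} mapping of available Piper voices."""
    voices = {}
    # 1) Auto-scan the voices directory; the file stem is used as the id.
    voices_dir = Path(env.get("PIPER_VOICES_DIR", "/data/piper"))
    if voices_dir.is_dir():
        for onnx in sorted(voices_dir.glob("*.onnx")):
            voices[onnx.stem] = str(onnx)
    # 2) Explicit JSON entries override scanned ones (assumed precedence).
    raw = env.get("PIPER_VOICES_JSON", "").strip()
    if raw:
        voices.update(json.loads(raw))
    return voices
```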

---

## Notes on “streaming”

XTTS and Parler streaming here is implemented in two steps:

1) **Sentence chunking** (fast + stable on CPU)
2) Returning each chunk as its own **WAV** event over SSE

Because each event is a complete WAV file, the full audio length does not need to be known upfront, and audio starts arriving before the whole text has been synthesized, which avoids long-running request timeouts on the free Spaces CPU tier.
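Since every chunk is a standalone WAV, a client that wants a single file has to merge them itself. The sketch below uses only the standard-library `wave` module and inserts a short gap between chunks; the 60 ms default mirrors `JOIN_SILENCE_MS`, but whether the server joins audio this way is an assumption, as is the requirement that all chunks share the same 16-bit PCM format.

```python
import io
import wave


def join_wav_chunks(chunks, silence_ms=60):
    """Concatenate standalone PCM WAV files with silence between them.

    Assumes all chunks share channel count, sample width and sample rate,
    and that zero bytes represent silence (true for 16-bit signed PCM).
    """
    out = io.BytesIO()
    writer = None
    for chunk in chunks:
        with wave.open(io.BytesIO(chunk), "rb") as reader:
            params = reader.getparams()
            frames = reader.readframes(reader.getnframes())
        if writer is None:
            # First chunk defines the output format.
            writer = wave.open(out, "wb")
            writer.setnchannels(params.nchannels)
            writer.setsampwidth(params.sampwidth)
            writer.setframerate(params.framerate)
        elif silence_ms > 0:
            gap_frames = params.framerate * silence_ms // 1000
            frame_size = params.nchannels * params.sampwidth
            writer.writeframes(b"\x00" * (gap_frames * frame_size))
        writer.writeframes(frames)
    if writer is None:
        return b""
    writer.close()  # finalizes the header; the BytesIO stays readable
    return out.getvalue()
```

Combined with a decoder for the `wav_b64` events, `join_wav_chunks(list_of_wav_bytes)` yields one playable WAV once the stream has finished.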