Deploy: CPU-optimized TTS with critical thread safety fixes
Browse files- Dockerfile +30 -0
- README.md +112 -7
- __pycache__/app.cpython-311.pyc +0 -0
- app.py +705 -0
- requirements.txt +29 -0
Dockerfile
ADDED
|
@@ -0,0 +1,30 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
FROM python:3.11-slim

# Cache locations under /data (persistent on HF Spaces) and conservative
# CPU threading knobs for a 2 vCPU free-tier machine.
# NOTE(review): OMP_NUM_THREADS=1 here, while app.py falls back to "2" when
# the variable is unset — the container therefore runs with 1 torch thread.
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    HF_HOME=/data/hf \
    TRANSFORMERS_CACHE=/data/hf/transformers \
    TORCH_HOME=/data/hf/torch \
    COQUI_TOS_AGREED=1 \
    OMP_NUM_THREADS=1 \
    MKL_NUM_THREADS=1 \
    NUMEXPR_NUM_THREADS=2 \
    KMP_BLOCKTIME=1 \
    KMP_AFFINITY="granularity=fine,compact,1,0"

# System deps: ffmpeg/libsndfile1 for audio I/O, git + build-essential for
# pip packages that build from source.
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    ffmpeg \
    libsndfile1 \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python dependencies before copying the app to maximize layer caching.
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

COPY app.py /app/app.py

# HF Spaces routes external traffic to port 7860; one worker keeps RAM bounded.
EXPOSE 7860
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]
|
README.md
CHANGED
|
@@ -1,10 +1,115 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
---
|
| 9 |
|
| 10 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# HF Spaces CPU TTS API (Docker)
|
| 2 |
+
|
| 3 |
+
API-only Space with **separate endpoints** for:
|
| 4 |
+
|
| 5 |
+
- **XTTS v2** (voice cloning with an uploaded reference clip; Polish by default)
|
| 6 |
+
- **Parler-TTS mini multilingual v1.1** (fast, high-quality Polish TTS; style controlled by text description)
|
| 7 |
+
- **Piper** (backup, local voices; bring your own `.onnx` voice files)
|
| 8 |
+
|
| 9 |
+
Runs on **HF Spaces free CPU (2 vCPU / 16GB RAM)** with CPU-friendly defaults:
|
| 10 |
+
- **Chunking** (sentence-based) to avoid timeouts on long text
|
| 11 |
+
- **Streaming** via SSE (each chunk returned as a standalone WAV)
|
| 12 |
+
- Optional **torch.compile** and optional **dynamic int8 quantization** hooks
|
| 13 |
+
|
| 14 |
---
|
| 15 |
+
|
| 16 |
+
## Endpoints
|
| 17 |
+
|
| 18 |
+
### Health
|
| 19 |
+
- `GET /health`
|
| 20 |
+
|
| 21 |
+
### XTTS v2
|
| 22 |
+
- `POST /v1/xtts/synthesize` (multipart/form-data; WAV bytes)
|
| 23 |
+
- `POST /v1/xtts/stream` (SSE; base64 WAV chunks)
|
| 24 |
+
|
| 25 |
+
### Parler
|
| 26 |
+
- `POST /v1/parler/synthesize` (JSON; WAV bytes)
|
| 27 |
+
- `POST /v1/parler/stream` (JSON; SSE base64 WAV chunks)
|
| 28 |
+
|
| 29 |
+
### Piper
|
| 30 |
+
- `GET /v1/piper/voices`
|
| 31 |
+
- `POST /v1/piper/synthesize` (JSON; WAV bytes)
|
| 32 |
+
|
| 33 |
+
OpenAPI docs:
|
| 34 |
+
- `/docs`
|
| 35 |
+
|
| 36 |
---
|
| 37 |
|
| 38 |
+
## Usage examples
|
| 39 |
+
|
| 40 |
+
### XTTS voice cloning (file upload)
|
| 41 |
+
```bash
|
| 42 |
+
curl -X POST "http://localhost:7860/v1/xtts/synthesize" \
|
| 43 |
+
-F "text=Cześć! To jest test głosu." \
|
| 44 |
+
-F "language=pl" \
|
| 45 |
+
-F "chunking=true" \
|
| 46 |
+
-F "speaker_wav=@reference.wav" \
|
| 47 |
+
--output out.wav
|
| 48 |
+
```
|
| 49 |
+
|
| 50 |
+
### XTTS streaming (SSE)
|
| 51 |
+
This streams **multiple WAV chunks** (base64) as events. Your client should decode each `wav_b64` and play/append.
|
| 52 |
+
```bash
|
| 53 |
+
curl -N -X POST "http://localhost:7860/v1/xtts/stream" \
|
| 54 |
+
-H "Content-Type: application/json" \
|
| 55 |
+
-d '{"text":"Cześć! To jest dłuższy tekst. Druga fraza. Trzecia fraza.","language":"pl","chunking":true}'
|
| 56 |
+
```
|
| 57 |
+
|
| 58 |
+
### Parler synth
|
| 59 |
+
```bash
|
| 60 |
+
curl -X POST "http://localhost:7860/v1/parler/synthesize" \
|
| 61 |
+
-H "Content-Type: application/json" \
|
| 62 |
+
-d '{
|
| 63 |
+
"text":"Cześć! To Parler w języku polskim.",
|
| 64 |
+
"description":"A calm female Polish voice, close-mic, warm tone, subtle smile, studio quality."
|
| 65 |
+
}' \
|
| 66 |
+
--output parler.wav
|
| 67 |
+
```
|
| 68 |
+
|
| 69 |
+
### Piper voices + synth
|
| 70 |
+
```bash
|
| 71 |
+
curl "http://localhost:7860/v1/piper/voices"
|
| 72 |
+
|
| 73 |
+
curl -X POST "http://localhost:7860/v1/piper/synthesize" \
|
| 74 |
+
-H "Content-Type: application/json" \
|
| 75 |
+
-d '{"text":"To jest Piper jako kopia zapasowa.","voice_id":"pl_PL-gosia-medium"}' \
|
| 76 |
+
--output piper.wav
|
| 77 |
+
```
|
| 78 |
+
|
| 79 |
+
---
|
| 80 |
+
|
| 81 |
+
## Environment variables (important knobs)
|
| 82 |
+
|
| 83 |
+
### XTTS
|
| 84 |
+
- `XTTS_MODEL_NAME` (default: `tts_models/multilingual/multi-dataset/xtts_v2`)
|
| 85 |
+
- `XTTS_DEFAULT_LANGUAGE` (default: `pl`)
|
| 86 |
+
- `XTTS_TORCH_COMPILE=1` to attempt `torch.compile()` (best-effort)
|
| 87 |
+
- `XTTS_DYNAMIC_INT8=1` to attempt dynamic int8 quantization (best-effort)
|
| 88 |
+
|
| 89 |
+
### Parler
|
| 90 |
+
- `PARLER_MODEL_NAME` (default: `parler-tts/parler-tts-mini-multilingual-v1.1`)
|
| 91 |
+
- `PARLER_DEFAULT_DESCRIPTION` (default is neutral Polish)
|
| 92 |
+
- `PARLER_SEED` (default: `0`)
|
| 93 |
+
- `PARLER_TORCH_COMPILE=1` (best-effort)
|
| 94 |
+
- `PARLER_DYNAMIC_INT8=1` (best-effort)
|
| 95 |
+
|
| 96 |
+
### Chunking / joining
|
| 97 |
+
- `CHUNK_MAX_CHARS` (default: 260)
|
| 98 |
+
- `CHUNK_MAX_WORDS` (default: 40)
|
| 99 |
+
- `CHUNK_MAX_SENTENCES` (default: 8) — caps the number of generated chunks per request; text beyond the cap is dropped to avoid CPU timeouts
|
| 100 |
+
- `JOIN_SILENCE_MS` (default: 60)
|
| 101 |
+
|
| 102 |
+
### Piper
|
| 103 |
+
Bring your own Piper `.onnx` voices:
|
| 104 |
+
- Put voice files in `/data/piper` (auto-scanned) **OR**
|
| 105 |
+
- Set `PIPER_VOICES_JSON='{"voice_id":"/data/piper/voice.onnx"}'`
|
| 106 |
+
- Optionally set `PIPER_VOICES_DIR` (default: `/data/piper`)
|
| 107 |
+
|
| 108 |
+
---
|
| 109 |
+
|
| 110 |
+
## Notes on “streaming”
|
| 111 |
+
XTTS and Parler streaming here is implemented by:
|
| 112 |
+
1) **Sentence chunking** (fast + stable on CPU)
|
| 113 |
+
2) Returning each chunk as its own **WAV** event over SSE
|
| 114 |
+
|
| 115 |
+
This avoids needing the full WAV length upfront and prevents long-run timeouts on free Spaces CPU.
|
__pycache__/app.cpython-311.pyc
ADDED
|
Binary file (38.9 kB). View file
|
|
|
app.py
ADDED
|
@@ -0,0 +1,705 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
HF Spaces (Docker) CPU-only TTS API
|
| 3 |
+
- Separate endpoints per service: XTTS v2, Parler-TTS mini multilingual, Piper.
|
| 4 |
+
- CPU-friendly defaults for 2 vCPU / 16 GB RAM:
|
| 5 |
+
- Sentence chunking (default ON)
|
| 6 |
+
- Streaming via SSE (each chunk returned as standalone WAV)
|
| 7 |
+
- Optional torch.compile, optional dynamic int8 quantization hooks
|
| 8 |
+
Notes:
|
| 9 |
+
- This repo is intentionally API-only (no Gradio UI).
|
| 10 |
+
- HF Spaces expects a web server on port 7860.
|
| 11 |
+
"""
|
| 12 |
+
from __future__ import annotations
|
| 13 |
+
|
| 14 |
+
import asyncio
|
| 15 |
+
import base64
|
| 16 |
+
import io
|
| 17 |
+
import json
|
| 18 |
+
import os
|
| 19 |
+
import re
|
| 20 |
+
import tempfile
|
| 21 |
+
import threading
|
| 22 |
+
import time
|
| 23 |
+
from dataclasses import dataclass
|
| 24 |
+
from functools import lru_cache
|
| 25 |
+
from typing import Dict, Generator, Iterable, List, Optional, Tuple
|
| 26 |
+
|
| 27 |
+
import numpy as np
|
| 28 |
+
import soundfile as sf
|
| 29 |
+
import torch
|
| 30 |
+
from fastapi import FastAPI, File, Form, HTTPException, UploadFile
|
| 31 |
+
from fastapi.responses import Response, StreamingResponse
|
| 32 |
+
from pydantic import BaseModel, Field
|
| 33 |
+
|
| 34 |
+
# --- Optional deps (import lazily where possible) ---
|
| 35 |
+
# XTTS (Coqui TTS)
|
| 36 |
+
from TTS.api import TTS
|
| 37 |
+
|
| 38 |
+
# Parler-TTS (transformers)
|
| 39 |
+
from transformers import AutoTokenizer, set_seed
|
| 40 |
+
try:
|
| 41 |
+
from parler_tts import ParlerTTSForConditionalGeneration
|
| 42 |
+
except Exception:
|
| 43 |
+
ParlerTTSForConditionalGeneration = None # type: ignore
|
| 44 |
+
|
| 45 |
+
# Piper fallback
|
| 46 |
+
try:
|
| 47 |
+
from piper.voice import PiperVoice
|
| 48 |
+
except Exception:
|
| 49 |
+
PiperVoice = None # type: ignore
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
# -----------------------
|
| 53 |
+
# Settings / knobs
|
| 54 |
+
# -----------------------
|
| 55 |
+
def _env_bool(name: str, default: bool = False) -> bool:
|
| 56 |
+
v = os.getenv(name)
|
| 57 |
+
if v is None:
|
| 58 |
+
return default
|
| 59 |
+
return v.strip().lower() in {"1", "true", "yes", "y", "on"}
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
@dataclass(frozen=True)
class Settings:
    """Immutable runtime configuration, resolved from environment variables
    once at import time (env changes after startup have no effect)."""
    # XTTS v2
    xtts_model_name: str = os.getenv("XTTS_MODEL_NAME", "tts_models/multilingual/multi-dataset/xtts_v2")
    xtts_default_language: str = os.getenv("XTTS_DEFAULT_LANGUAGE", "pl")
    xtts_torch_compile: bool = _env_bool("XTTS_TORCH_COMPILE", False)
    xtts_dynamic_int8: bool = _env_bool("XTTS_DYNAMIC_INT8", False)

    # Parler
    parler_model_name: str = os.getenv("PARLER_MODEL_NAME", "parler-tts/parler-tts-mini-multilingual-v1.1")
    parler_default_description: str = os.getenv(
        "PARLER_DEFAULT_DESCRIPTION",
        # Keep this neutral; users can override per-request.
        "A clear, natural, studio-recorded voice speaking Polish with steady pacing.",
    )
    parler_seed: int = int(os.getenv("PARLER_SEED", "0"))
    parler_torch_compile: bool = _env_bool("PARLER_TORCH_COMPILE", False)
    parler_dynamic_int8: bool = _env_bool("PARLER_DYNAMIC_INT8", False)

    # Piper
    # Provide a JSON mapping voice_id -> model_path; or mount voices into /data/piper and auto-scan.
    piper_voices_json: str = os.getenv("PIPER_VOICES_JSON", "")
    piper_voices_dir: str = os.getenv("PIPER_VOICES_DIR", "/data/piper")

    # Chunking / streaming defaults
    chunk_max_chars: int = int(os.getenv("CHUNK_MAX_CHARS", "260"))
    chunk_max_words: int = int(os.getenv("CHUNK_MAX_WORDS", "40"))
    chunk_max_sentences: int = int(os.getenv("CHUNK_MAX_SENTENCES", "8"))  # cap chunks to avoid CPU timeouts
    join_silence_ms: int = int(os.getenv("JOIN_SILENCE_MS", "60"))

    # Runtime
    # NOTE(review): the Dockerfile exports OMP_NUM_THREADS=1, so this fallback
    # of "2" only applies when running outside that image — confirm intended.
    num_threads: int = int(os.getenv("OMP_NUM_THREADS", "2"))  # set via Dockerfile already
    request_timeout_s: int = int(os.getenv("REQUEST_TIMEOUT_S", "240"))


S = Settings()

# Conservative CPU threading.
torch.set_num_threads(S.num_threads)
torch.set_num_interop_threads(max(1, S.num_threads // 2))
|
| 102 |
+
|
| 103 |
+
# -----------------------
|
| 104 |
+
# Utilities
|
| 105 |
+
# -----------------------
|
| 106 |
+
# Splits after sentence-ish punctuation (. ! ? : ;) or on newline runs.
_SENT_SPLIT_RE = re.compile(r"(?<=[\.\!\?\:\;])\s+|\n+")
_WS_RE = re.compile(r"\s+")

def normalize_text(text: str) -> str:
    """Collapse every whitespace run to a single space and trim both ends."""
    return _WS_RE.sub(" ", text).strip()
|
| 113 |
+
|
| 114 |
+
def split_text_into_chunks(
    text: str,
    max_chars: int = S.chunk_max_chars,
    max_words: int = S.chunk_max_words,
    max_sentences: int = S.chunk_max_sentences,
) -> List[str]:
    """
    Chunking strategy:
    1) Sentence split (cheap, deterministic).
    2) Merge sentences into chunks constrained by max_chars and max_words.

    Notes:
    - Despite its name, ``max_sentences`` caps the number of *chunks* produced;
      once the cap is reached the remaining sentences are silently dropped
      (deliberate truncation to avoid CPU timeouts on free Spaces).
    - A single sentence longer than ``max_chars``/``max_words`` is still kept
      whole — limits apply when merging, not when splitting.
    - Defaults are bound from ``S`` at import time.
    """
    text = normalize_text(text)
    if not text:
        return []

    # Sentence list with empties removed (regex may produce blank splits).
    sents = [s.strip() for s in _SENT_SPLIT_RE.split(text) if s.strip()]
    chunks: List[str] = []
    cur: List[str] = []
    cur_chars = 0
    cur_words = 0

    def flush():
        # Emit the accumulated sentences as one chunk and reset the counters.
        nonlocal cur, cur_chars, cur_words
        if cur:
            chunks.append(" ".join(cur).strip())
        cur = []
        cur_chars = 0
        cur_words = 0

    for sent in sents:
        w = sent.split()
        sent_words = len(w)
        sent_chars = len(sent)
        # Flush *before* appending, so limits bound the chunk being built.
        if (cur_chars + sent_chars > max_chars) or (cur_words + sent_words > max_words):
            flush()
        cur.append(sent)
        cur_chars += sent_chars + 1  # +1 accounts for the joining space
        cur_words += sent_words
        # Chunk-count cap: stop early and drop the rest of the text.
        if max_sentences and len(chunks) + (1 if cur else 0) >= max_sentences:
            flush()
            break

    flush()
    return chunks
|
| 158 |
+
|
| 159 |
+
def wav_bytes_from_audio(audio: np.ndarray, sr: int) -> bytes:
    """Encode a waveform as an in-memory 16-bit PCM WAV file and return its bytes."""
    waveform = np.asarray(audio, dtype=np.float32)
    out = io.BytesIO()
    sf.write(out, waveform, sr, format="WAV", subtype="PCM_16")
    return out.getvalue()
|
| 164 |
+
|
| 165 |
+
def concat_audio(chunks: List[np.ndarray], sr: int, silence_ms: int = S.join_silence_ms) -> np.ndarray:
    """Join waveform chunks into one float32 array, inserting *silence_ms* of
    silence between consecutive chunks (never after the last one).

    An empty list yields a single-sample zero buffer; a single chunk is
    returned as-is (cast to float32).
    """
    if not chunks:
        return np.zeros((1,), dtype=np.float32)
    if len(chunks) == 1:
        return np.asarray(chunks[0], dtype=np.float32)

    gap = None
    if silence_ms > 0:
        gap = np.zeros((int(sr * (silence_ms / 1000.0)),), dtype=np.float32)

    pieces: List[np.ndarray] = []
    last = len(chunks) - 1
    for idx, part in enumerate(chunks):
        pieces.append(np.asarray(part, dtype=np.float32))
        if gap is not None and idx != last:
            pieces.append(gap)
    return np.concatenate(pieces, axis=0)
|
| 178 |
+
|
| 179 |
+
def b64encode_bytes(b: bytes) -> str:
    """Return *b* as an ASCII base64 string (used for SSE audio payloads)."""
    encoded = base64.b64encode(b)
    return encoded.decode("ascii")
|
| 181 |
+
|
| 182 |
+
def safe_filename(prefix: str = "audio", ext: str = ".wav") -> str:
    """Build a timestamped file name like ``audio_1700000000000.wav``.

    Millisecond resolution makes collisions unlikely but not impossible for
    concurrent callers.
    """
    millis = int(time.time() * 1000)
    return "".join((prefix, "_", str(millis), ext))
|
| 184 |
+
|
| 185 |
+
def _filter_kwargs(fn, kwargs: Dict) -> Dict:
|
| 186 |
+
"""
|
| 187 |
+
Accepts a kwargs dict, returns a subset that the callable supports.
|
| 188 |
+
This keeps the API stable even if underlying libs differ by version.
|
| 189 |
+
"""
|
| 190 |
+
import inspect
|
| 191 |
+
try:
|
| 192 |
+
sig = inspect.signature(fn)
|
| 193 |
+
except Exception:
|
| 194 |
+
return kwargs
|
| 195 |
+
accepted = set(sig.parameters.keys())
|
| 196 |
+
return {k: v for k, v in kwargs.items() if k in accepted}
|
| 197 |
+
|
| 198 |
+
# -----------------------
|
| 199 |
+
# Model manager (lazy + locked)
|
| 200 |
+
# -----------------------
|
| 201 |
+
class _Locks:
    # Process-wide locks, shared via class attributes (the class is never
    # instantiated). Load locks guard lazy construction; *_infer locks
    # serialize inference on the single shared model instance.
    xtts = threading.Lock()
    xtts_infer = threading.Lock()  # Protect XTTS inference
    parler = threading.Lock()
    parler_infer = threading.Lock()  # Protect Parler inference
    piper = threading.Lock()
|
| 207 |
+
|
| 208 |
+
class ModelManager:
    """Lazy, lock-guarded loader/cache for the three TTS backends.

    Models are constructed on first use (inside the corresponding _Locks
    lock) and cached for the lifetime of the process. compile/quantize
    steps are best-effort: any failure falls back to the plain fp32 model.
    """

    def __init__(self) -> None:
        self._xtts: Optional[TTS] = None        # cached Coqui XTTS wrapper
        self._parler = None                     # cached Parler model
        self._parler_tok = None                 # cached Parler tokenizer
        self._piper_voices: Dict[str, str] = {} # voice_id -> .onnx path registry
        self._piper_loaded: Dict[str, "PiperVoice"] = {}  # type: ignore

    def _maybe_torch_compile(self, module: torch.nn.Module) -> torch.nn.Module:
        """Best-effort torch.compile; returns the module unchanged on failure
        or when running a torch build without compile support."""
        if not hasattr(torch, "compile"):
            return module
        try:
            return torch.compile(module)  # type: ignore
        except Exception:
            return module

    def _maybe_dynamic_int8(self, module: torch.nn.Module) -> torch.nn.Module:
        """
        Dynamic int8 quantization is most useful for Linear-heavy CPU graphs.
        This is a best-effort hook: if it fails, we keep fp32.
        """
        try:
            from torch.ao.quantization import quantize_dynamic
            return quantize_dynamic(module, {torch.nn.Linear}, dtype=torch.qint8)
        except Exception:
            return module

    def get_xtts(self) -> TTS:
        """Return the (lazily created) XTTS instance.

        The load lock is held for the whole first-time download/construction,
        so concurrent first callers block until the model is ready.
        """
        with _Locks.xtts:
            if self._xtts is None:
                tts = TTS(model_name=S.xtts_model_name, progress_bar=False, gpu=False)
                # Best-effort: compile/quantize inner model if accessible.
                try:
                    # Some versions expose synthesizer.tts_model
                    inner = getattr(getattr(tts, "synthesizer", None), "tts_model", None)
                    if isinstance(inner, torch.nn.Module):
                        # Quantize first, then compile; each step re-assigns the
                        # (possibly wrapped) module back onto the synthesizer.
                        if S.xtts_dynamic_int8:
                            inner = self._maybe_dynamic_int8(inner)
                            tts.synthesizer.tts_model = inner
                        if S.xtts_torch_compile:
                            inner = self._maybe_torch_compile(inner)
                            tts.synthesizer.tts_model = inner
                except Exception:
                    pass
                self._xtts = tts
            return self._xtts

    def get_parler(self):
        """Return (model, tokenizer) for Parler-TTS, loading them on first use.

        Raises RuntimeError when the parler_tts package is unavailable.
        """
        with _Locks.parler:
            if ParlerTTSForConditionalGeneration is None:
                raise RuntimeError("parler_tts is not installed or failed to import.")
            if self._parler is None or self._parler_tok is None:
                tok = AutoTokenizer.from_pretrained(S.parler_model_name)
                model = ParlerTTSForConditionalGeneration.from_pretrained(S.parler_model_name).to("cpu")
                model.eval()
                # Best-effort compile/quantize
                if isinstance(model, torch.nn.Module):
                    if S.parler_dynamic_int8:
                        model = self._maybe_dynamic_int8(model)
                    if S.parler_torch_compile:
                        model = self._maybe_torch_compile(model)
                self._parler = model
                self._parler_tok = tok
            return self._parler, self._parler_tok

    def _load_piper_registry(self) -> Dict[str, str]:
        """Build voice_id -> model path map from PIPER_VOICES_JSON plus an
        auto-scan of the voices directory. JSON entries win over scanned ones
        (setdefault). All failures are swallowed: registry is best-effort."""
        reg: Dict[str, str] = {}
        if S.piper_voices_json:
            try:
                reg.update(json.loads(S.piper_voices_json))
            except Exception:
                pass
        # Auto-scan dir for *.onnx
        try:
            if os.path.isdir(S.piper_voices_dir):
                for fn in os.listdir(S.piper_voices_dir):
                    if fn.endswith(".onnx"):
                        voice_id = os.path.splitext(fn)[0]
                        reg.setdefault(voice_id, os.path.join(S.piper_voices_dir, fn))
        except Exception:
            pass
        return reg

    def list_piper_voices(self) -> Dict[str, str]:
        """Return a copy of the voice registry, populating it on first call.

        NOTE: the registry is cached after the first non-empty scan; voices
        added to the directory later are only picked up if the first scan
        found nothing.
        """
        with _Locks.piper:
            if not self._piper_voices:
                self._piper_voices = self._load_piper_registry()
            return dict(self._piper_voices)

    def get_piper(self, voice_id: str):
        """Return a loaded PiperVoice for *voice_id*, loading and caching it.

        Raises RuntimeError when piper-tts is missing and KeyError for an
        unknown voice_id.
        """
        if PiperVoice is None:
            raise RuntimeError("piper-tts is not installed or failed to import.")
        with _Locks.piper:
            if not self._piper_voices:
                self._piper_voices = self._load_piper_registry()
            if voice_id not in self._piper_voices:
                raise KeyError(f"Unknown Piper voice_id '{voice_id}'.")
            if voice_id not in self._piper_loaded:
                self._piper_loaded[voice_id] = PiperVoice.load(self._piper_voices[voice_id])
            return self._piper_loaded[voice_id]


# Single process-wide instance shared by all endpoints.
MM = ModelManager()
|
| 311 |
+
|
| 312 |
+
# -----------------------
|
| 313 |
+
# Request models
|
| 314 |
+
# -----------------------
|
| 315 |
+
class XttsJsonRequest(BaseModel):
    """JSON body for XTTS endpoints (the multipart form mirrors these fields)."""
    text: str = Field(..., min_length=1)
    language: str = Field(default=S.xtts_default_language)
    # speed/prosody knobs (best-effort; filtered by signature)
    speed: float = Field(default=1.0, ge=0.5, le=2.0)
    temperature: float = Field(default=0.7, ge=0.0, le=1.5)
    length_penalty: float = Field(default=1.0, ge=0.5, le=2.0)
    # chunking knobs
    chunking: bool = True
    chunk_max_chars: int = Field(default=S.chunk_max_chars, ge=80, le=2000)
    chunk_max_words: int = Field(default=S.chunk_max_words, ge=10, le=300)
    # 0 disables the chunk-count cap (see split_text_into_chunks)
    chunk_max_sentences: int = Field(default=S.chunk_max_sentences, ge=0, le=100)
    join_silence_ms: int = Field(default=S.join_silence_ms, ge=0, le=500)
|
| 328 |
+
|
| 329 |
+
class ParlerJsonRequest(BaseModel):
    """JSON body for Parler endpoints; `description` is the style prompt."""
    text: str = Field(..., min_length=1)
    description: str = Field(default=S.parler_default_description)
    # seed == 0 means "do not reseed" (see _parler_synthesize_chunks)
    seed: int = Field(default=S.parler_seed, ge=0, le=2**31-1)
    temperature: float = Field(default=0.7, ge=0.0, le=1.5)
    max_new_tokens: int = Field(default=512, ge=64, le=2048)
    chunking: bool = True
    chunk_max_chars: int = Field(default=S.chunk_max_chars, ge=80, le=2000)
    chunk_max_words: int = Field(default=S.chunk_max_words, ge=10, le=300)
    chunk_max_sentences: int = Field(default=S.chunk_max_sentences, ge=0, le=100)
    join_silence_ms: int = Field(default=S.join_silence_ms, ge=0, le=500)
|
| 340 |
+
|
| 341 |
+
class PiperJsonRequest(BaseModel):
    """JSON body for Piper synthesis; voice_id must exist in the registry."""
    text: str = Field(..., min_length=1)
    voice_id: str = Field(..., min_length=1)
    # Optional tuning (depends on Piper voice config support)
    length_scale: float = Field(default=1.0, ge=0.5, le=2.0)
    noise_scale: float = Field(default=0.667, ge=0.0, le=2.0)
    noise_w: float = Field(default=0.8, ge=0.0, le=2.0)
|
| 348 |
+
|
| 349 |
+
# -----------------------
|
| 350 |
+
# Inference functions
|
| 351 |
+
# -----------------------
|
| 352 |
+
def _xtts_synthesize_chunks(
    text: str,
    language: str,
    speaker_wav_path: Optional[str],
    speed: float,
    temperature: float,
    length_penalty: float,
    chunking: bool,
    chunk_max_chars: int,
    chunk_max_words: int,
    chunk_max_sentences: int,
) -> Tuple[int, List[np.ndarray]]:
    """
    Returns (sample_rate, [audio_chunks...]) where each chunk is a waveform.

    Raises ValueError when the text normalizes to empty. Unsupported kwargs
    (per the installed Coqui TTS version) are silently dropped via
    _filter_kwargs. Inference for all chunks happens under a single lock,
    so one long request serializes other XTTS requests.
    """
    tts = MM.get_xtts()
    segments = [normalize_text(text)] if not chunking else split_text_into_chunks(text, chunk_max_chars, chunk_max_words, max_sentences=chunk_max_sentences)
    if not segments:
        raise ValueError("Empty text after normalization.")
    chunks: List[np.ndarray] = []

    # Protect inference with lock (PyTorch models are not thread-safe for concurrent inference)
    with _Locks.xtts_infer:
        for seg in segments:
            kwargs = {
                "text": seg,
                "speaker_wav": speaker_wav_path,
                "language": language,
                "speed": speed,
                "temperature": temperature,
                "length_penalty": length_penalty,
            }
            kwargs = _filter_kwargs(tts.tts, kwargs)
            audio = tts.tts(**kwargs)  # numpy array
            chunks.append(np.asarray(audio, dtype=np.float32))

    # Best-effort sample rate extraction
    sr = getattr(getattr(tts, "synthesizer", None), "output_sample_rate", None)
    if not isinstance(sr, int):
        # XTTS commonly uses 24000
        sr = 24000
    return sr, chunks
|
| 394 |
+
|
| 395 |
+
def _parler_synthesize_chunks(
    text: str,
    description: str,
    seed: int,
    temperature: float,
    max_new_tokens: int,
    chunking: bool,
    chunk_max_chars: int,
    chunk_max_words: int,
    chunk_max_sentences: int,
) -> Tuple[int, List[np.ndarray]]:
    """
    Synthesize *text* with Parler-TTS, optionally split into sentence chunks.

    Returns (sample_rate, [waveform, ...]) with one float32 waveform per chunk.
    Raises ValueError when the text normalizes to empty and RuntimeError when
    generate() does not expose audio_values (library version mismatch).
    NOTE: seed == 0 (the default) leaves the RNG state untouched.
    """
    model, tok = MM.get_parler()
    if seed:
        set_seed(seed)

    segments = [normalize_text(text)] if not chunking else split_text_into_chunks(text, chunk_max_chars, chunk_max_words, max_sentences=chunk_max_sentences)
    if not segments:
        raise ValueError("Empty text after normalization.")

    chunks: List[np.ndarray] = []
    sr = getattr(model.config, "sampling_rate", 24000)

    # PERF: the style description is identical for every segment — tokenize it
    # once here instead of re-encoding it on every loop iteration.
    input_ids = tok(description, return_tensors="pt").input_ids

    # Protect inference with lock (PyTorch models are not thread-safe for concurrent inference)
    with _Locks.parler_infer:
        for seg in segments:
            prompt_ids = tok(seg, return_tensors="pt").input_ids
            gen_kwargs = {
                "input_ids": input_ids,
                "prompt_input_ids": prompt_ids,
                "temperature": temperature,
                "max_new_tokens": max_new_tokens,
            }
            # Drop kwargs this parler_tts version doesn't accept.
            gen_kwargs = _filter_kwargs(model.generate, gen_kwargs)
            out = model.generate(**gen_kwargs)
            # transformers audio generation outputs: audio_values
            audio = getattr(out, "audio_values", None)
            if audio is None:
                raise RuntimeError("Parler-TTS generate() did not return audio_values. Check transformers/parler_tts versions.")
            wav = audio.cpu().numpy().squeeze()
            chunks.append(np.asarray(wav, dtype=np.float32))

    return int(sr), chunks
|
| 438 |
+
|
| 439 |
+
def _piper_synthesize(text: str, voice_id: str, length_scale: float, noise_scale: float, noise_w: float) -> Tuple[int, np.ndarray]:
    """Synthesize *text* with a cached Piper voice.

    Returns (sample_rate, float32 waveform scaled to [-1, 1)).
    Raises KeyError for unknown voice_id, RuntimeError if piper-tts is missing.
    NOTE(review): assumes PiperVoice.synthesize() returns int16 samples
    directly; some piper-tts releases stream raw bytes instead — confirm
    against the pinned piper-tts version in requirements.txt.
    """
    voice = MM.get_piper(voice_id)
    audio = voice.synthesize(text, length_scale=length_scale, noise_scale=noise_scale, noise_w=noise_w)
    # PiperVoice returns int16 pcm at 22050/16000 depending on voice
    sr = getattr(voice, "sample_rate", 22050)
    audio = np.asarray(audio, dtype=np.float32) / 32768.0
    return int(sr), audio
|
| 446 |
+
|
| 447 |
+
# -----------------------
|
| 448 |
+
# Warmup & monitoring
|
| 449 |
+
# -----------------------
|
| 450 |
+
import logging
|
| 451 |
+
|
| 452 |
+
logger = logging.getLogger(__name__)

# Track warmup status
_warmup_complete = {"parler": False, "xtts": False}


def _warmup_parler_sync() -> float:
    """Blocking warmup body: load Parler-TTS and run one tiny generate() so
    the model graph is warm. Returns elapsed seconds."""
    start = time.time()
    model, tok = MM.get_parler()
    # Run tiny inference to compile model graph
    input_ids = tok("A neutral voice", return_tensors="pt").input_ids
    prompt_ids = tok("Test", return_tensors="pt").input_ids
    _ = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids, max_new_tokens=50)
    return time.time() - start


async def _warmup_task():
    """Pre-load Parler-TTS model on startup (non-blocking background task)."""
    try:
        logger.info("🔥 Warming up Parler-TTS model...")
        # BUGFIX: the original ran the blocking model load + generate() directly
        # in this coroutine, freezing the event loop (and every request) for the
        # whole warmup despite being advertised as non-blocking. Run the
        # blocking work in a worker thread instead.
        elapsed = await asyncio.to_thread(_warmup_parler_sync)
        _warmup_complete["parler"] = True
        logger.info(f"✅ Parler-TTS ready! (warmup took {elapsed:.1f}s)")
    except Exception as e:
        logger.error(f"❌ Warmup failed: {e}", exc_info=True)
        logger.warning("⚠️ First request will be slow (~60s)")
|
| 480 |
+
|
| 481 |
+
# -----------------------
|
| 482 |
+
# FastAPI app
|
| 483 |
+
# -----------------------
|
| 484 |
+
app = FastAPI(title="TTS API (XTTS v2 / Parler / Piper)", version="1.1.0")


@app.on_event("startup")
async def warmup():
    """Launch warmup in background to not block Space readiness check."""
    # BUGFIX: asyncio holds only a weak reference to tasks; discarding the
    # create_task() result lets the warmup task be garbage-collected mid-run
    # (per the asyncio docs). Keep a strong reference on app.state.
    app.state.warmup_task = asyncio.create_task(_warmup_task())
|
| 491 |
+
|
| 492 |
+
|
| 493 |
+
@app.get("/health")
def health():
    """Liveness/readiness probe.

    Reports warmup progress, which models are currently cached in memory,
    which backends are importable, and the chunking defaults in effect.
    """
    models_loaded = {
        "xtts": MM._xtts is not None,
        "parler": MM._parler is not None,
        "piper": len(MM._piper_loaded) > 0,
    }
    services = {
        "xtts": True,
        "parler": ParlerTTSForConditionalGeneration is not None,
        "piper": PiperVoice is not None,
    }
    defaults = {
        "chunk_max_chars": S.chunk_max_chars,
        "chunk_max_words": S.chunk_max_words,
        "join_silence_ms": S.join_silence_ms,
    }
    return {
        "status": "ok",
        "warmup_complete": _warmup_complete,
        "models_loaded": models_loaded,
        "services": services,
        "defaults": defaults,
    }
|
| 514 |
+
|
| 515 |
+
|
| 516 |
+
@app.get("/v1/piper/voices")
def list_piper_voices():
    """Enumerate the Piper voices known to the model manager."""
    try:
        voices = MM.list_piper_voices()
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    return {"voices": voices}
|
| 522 |
+
|
| 523 |
+
|
| 524 |
+
# -------- XTTS endpoints (separate service) --------
@app.post("/v1/xtts/synthesize")
async def xtts_synthesize(
    # multipart for true voice cloning upload
    text: str = Form(...),
    language: str = Form(S.xtts_default_language),
    speed: float = Form(1.0),
    temperature: float = Form(0.7),
    length_penalty: float = Form(1.0),
    chunking: bool = Form(True),
    chunk_max_chars: int = Form(S.chunk_max_chars),
    chunk_max_words: int = Form(S.chunk_max_words),
    chunk_max_sentences: int = Form(S.chunk_max_sentences),
    join_silence_ms: int = Form(S.join_silence_ms),
    speaker_wav: Optional[UploadFile] = File(None),
):
    """
    Returns a single WAV file.
    If speaker_wav is provided, XTTS does voice cloning (zero-shot, best with 6-10s clean Polish reference).
    """
    speaker_path = None
    try:
        if speaker_wav is not None:
            # Persist the uploaded reference clip: XTTS needs a file path.
            ext = os.path.splitext(speaker_wav.filename or "ref.wav")[1] or ".wav"
            ref_bytes = await speaker_wav.read()
            with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp:
                tmp.write(ref_bytes)
                speaker_path = tmp.name

        def work():
            # Heavy synthesis; executed off the event loop via asyncio.to_thread.
            sr, chunks = _xtts_synthesize_chunks(
                text=text,
                language=language,
                speaker_wav_path=speaker_path,
                speed=float(speed),
                temperature=float(temperature),
                length_penalty=float(length_penalty),
                chunking=bool(chunking),
                chunk_max_chars=int(chunk_max_chars),
                chunk_max_words=int(chunk_max_words),
                chunk_max_sentences=int(chunk_max_sentences),
            )
            return sr, concat_audio(chunks, sr, silence_ms=int(join_silence_ms))

        sr, audio = await asyncio.to_thread(work)
        return Response(
            content=wav_bytes_from_audio(audio, sr),
            media_type="audio/wav",
            headers={"X-Sample-Rate": str(sr)},
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    finally:
        # Best-effort removal of the temp reference clip.
        if speaker_path and os.path.exists(speaker_path):
            try:
                os.remove(speaker_path)
            except Exception:
                pass
|
| 579 |
+
|
| 580 |
+
|
| 581 |
+
@app.post("/v1/xtts/stream")
async def xtts_stream(req: XttsJsonRequest, speaker_wav: Optional[UploadFile] = File(None)):
    """
    Streams chunk-by-chunk via Server-Sent Events (SSE).
    Each SSE event contains base64-encoded standalone WAV for that chunk.

    Fix: the reference-clip temp file was previously deleted in a ``finally``
    clause that ran as soon as the StreamingResponse object was returned —
    i.e. BEFORE the generator (which reads the file during synthesis) had
    executed, so voice cloning always saw a missing file. Cleanup now happens
    in the generator's own ``finally``, after streaming completes or fails.
    """
    speaker_path = None
    if speaker_wav is not None:
        try:
            with tempfile.NamedTemporaryFile(delete=False, suffix=os.path.splitext(speaker_wav.filename or "ref.wav")[1] or ".wav") as tmp:
                tmp.write(await speaker_wav.read())
                speaker_path = tmp.name
        except Exception:
            # If the upload failed mid-write, don't leak a partial temp file.
            if speaker_path and os.path.exists(speaker_path):
                try:
                    os.remove(speaker_path)
                except Exception:
                    pass
            raise

    def gen() -> Generator[bytes, None, None]:
        try:
            sr, chunks = _xtts_synthesize_chunks(
                text=req.text,
                language=req.language,
                speaker_wav_path=speaker_path,
                speed=req.speed,
                temperature=req.temperature,
                length_penalty=req.length_penalty,
                chunking=req.chunking,
                chunk_max_chars=req.chunk_max_chars,
                chunk_max_words=req.chunk_max_words,
                chunk_max_sentences=req.chunk_max_sentences,
            )
            for i, ch in enumerate(chunks):
                wav = wav_bytes_from_audio(ch, sr)
                payload = {"index": i, "sample_rate": sr, "wav_b64": b64encode_bytes(wav), "is_last": False}
                yield f"data: {json.dumps(payload)}\n\n".encode("utf-8")
            yield f"data: {json.dumps({'is_last': True})}\n\n".encode("utf-8")
        except Exception as e:
            # Errors are reported in-band as a terminal SSE event.
            err = {"error": str(e), "is_last": True}
            yield f"data: {json.dumps(err)}\n\n".encode("utf-8")
        finally:
            # Remove the cloned-voice reference only after streaming finished.
            if speaker_path and os.path.exists(speaker_path):
                try:
                    os.remove(speaker_path)
                except Exception:
                    pass

    return StreamingResponse(gen(), media_type="text/event-stream")
|
| 625 |
+
|
| 626 |
+
|
| 627 |
+
# -------- Parler endpoints (separate service) --------
@app.post("/v1/parler/synthesize")
async def parler_synthesize(req: ParlerJsonRequest):
    """
    Returns a single WAV file. Style/emotion are controlled via the `description` field.
    """
    try:
        def work():
            # Synthesize all chunks, then join with the requested silence gap.
            sr, pieces = _parler_synthesize_chunks(
                text=req.text,
                description=req.description,
                seed=req.seed,
                temperature=req.temperature,
                max_new_tokens=req.max_new_tokens,
                chunking=req.chunking,
                chunk_max_chars=req.chunk_max_chars,
                chunk_max_words=req.chunk_max_words,
                chunk_max_sentences=req.chunk_max_sentences,
            )
            return sr, concat_audio(pieces, sr, silence_ms=req.join_silence_ms)

        sr, audio = await asyncio.to_thread(work)
        wav = wav_bytes_from_audio(audio, sr)
        headers = {"X-Sample-Rate": str(sr)}
        return Response(content=wav, media_type="audio/wav", headers=headers)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
|
| 654 |
+
|
| 655 |
+
|
| 656 |
+
@app.post("/v1/parler/stream")
async def parler_stream(req: ParlerJsonRequest):
    """
    Streams chunk-by-chunk via SSE (each chunk is a standalone WAV).
    """
    def sse(obj) -> bytes:
        # One SSE "data:" frame per JSON payload.
        return f"data: {json.dumps(obj)}\n\n".encode("utf-8")

    def gen() -> Generator[bytes, None, None]:
        try:
            sr, chunks = _parler_synthesize_chunks(
                text=req.text,
                description=req.description,
                seed=req.seed,
                temperature=req.temperature,
                max_new_tokens=req.max_new_tokens,
                chunking=req.chunking,
                chunk_max_chars=req.chunk_max_chars,
                chunk_max_words=req.chunk_max_words,
                chunk_max_sentences=req.chunk_max_sentences,
            )
            for i, ch in enumerate(chunks):
                wav = wav_bytes_from_audio(ch, sr)
                yield sse({"index": i, "sample_rate": sr, "wav_b64": b64encode_bytes(wav), "is_last": False})
            yield sse({'is_last': True})
        except Exception as e:
            # Errors are reported in-band as a terminal SSE event.
            yield sse({"error": str(e), "is_last": True})

    return StreamingResponse(gen(), media_type="text/event-stream")
|
| 684 |
+
|
| 685 |
+
|
| 686 |
+
# -------- Piper endpoints (separate service) --------
@app.post("/v1/piper/synthesize")
async def piper_synthesize(req: PiperJsonRequest):
    """
    Returns a single WAV file from a specified Piper voice.
    Provide the voice .onnx files via:
      - PIPER_VOICES_JSON='{"pl_PL-gosia-medium": "/data/piper/pl_PL-gosia-medium.onnx", ...}'
      - or mount them into /data/piper and call GET /v1/piper/voices.
    """
    try:
        # Blocking synthesis runs in a worker thread.
        sr, audio = await asyncio.to_thread(
            _piper_synthesize, req.text, req.voice_id, req.length_scale, req.noise_scale, req.noise_w
        )
        wav = wav_bytes_from_audio(audio, sr)
        response = Response(
            content=wav,
            media_type="audio/wav",
            headers={"X-Sample-Rate": str(sr), "X-Voice-Id": req.voice_id},
        )
    except KeyError as e:
        # Unknown voice_id → 404.
        raise HTTPException(status_code=404, detail=str(e))
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    return response
|
requirements.txt
ADDED
|
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Core API
|
| 2 |
+
fastapi>=0.110
|
| 3 |
+
uvicorn[standard]>=0.27
|
| 4 |
+
pydantic>=2.6
|
| 5 |
+
python-multipart>=0.0.9
|
| 6 |
+
|
| 7 |
+
# Audio + numerics
|
| 8 |
+
numpy>=1.24
|
| 9 |
+
scipy>=1.10
|
| 10 |
+
soundfile>=0.12
|
| 11 |
+
|
| 12 |
+
# XTTS v2 (Coqui)
|
| 13 |
+
TTS>=0.22.0
|
| 14 |
+
|
| 15 |
+
# Parler-TTS library (install from git - validated by working Space)
|
| 16 |
+
git+https://github.com/huggingface/parler-tts.git@main
|
| 17 |
+
|
| 18 |
+
# Transformers - use range for flexibility (working Space uses unpinned)
|
| 19 |
+
transformers>=4.39,<4.50
|
| 20 |
+
|
| 21 |
+
# Core dependencies
|
| 22 |
+
accelerate>=0.27
|
| 23 |
+
safetensors>=0.4
|
| 24 |
+
|
| 25 |
+
# CRITICAL: Required by Parler tokenizer
|
| 26 |
+
sentencepiece>=0.1.99
|
| 27 |
+
|
| 28 |
+
# Optional: Piper fallback (leave installed, or remove if you truly do not want it)
|
| 29 |
+
piper-tts>=1.2.0
|