---
license: apache-2.0
language:
- en
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- styletts2
- kokoro
- onnx
- expressive
- voicepack
---

# susurro: expressive multi-register TTS

Kokoro-class **StyleTTS2** text-to-speech for **3 voices × 6 registers**, where expressive register (neutral / breathless / playful / urgent / tender / whisper) lives in the **style space** and is selected via a *voicepack* rather than baked into the text. Trained from scratch.

- **Architecture:** StyleTTS2 (Kokoro-weight-compatible), 178-token misaki IPA vocabulary
- **Sample rate:** 24 kHz mono · **G2P:** misaki[en] (English)
- **Inference:** voicepack path, which predicts duration/F0/energy from the prosodic style and decodes with the acoustic style (no diffusion sampler required)
- **Two runtimes:** a self-contained **ONNX** path (onnxruntime + misaki, no PyTorch) and a **raw PyTorch** path (bundled StyleTTS2 code).

```
voice_a │ voice_b │ voice_c  ×  neutral · breathless · playful · urgent · tender · whisper
└────────────────── 18 voicepacks ──────────────────┘
```

## Files

| File | What |
|---|---|
| `susurro.onnx` | single ONNX graph: `(input_ids, ref_s) → 24 kHz audio` (phoneme token ids and voicepack are inputs) |
| `susurro.pth` | raw inference weights (`{'net': …}`; training scaffolding stripped) |
| `voicepacks/<voice>__<register>.pt` | 256-d style vector: `[0:128]` acoustic, `[128:256]` prosodic |
| `voicepacks.npz` | all 18 voicepacks as numpy arrays (for the torch-free ONNX path) |
| `infer_onnx.py` | dependency-light inference: onnxruntime + misaki only |
| `infer.py` | raw PyTorch inference (uses bundled `styletts2/`) |
| `export_onnx.py`, `onnx_stft.py` | reproduce `susurro.onnx` from `susurro.pth` |
| `config.yml`, `kokoro_symbols.py` | model config + the 178-token phoneme map |
| `styletts2/` | bundled StyleTTS2 model code + PLBERT/ASR/JDC assets (raw path) |
| `samples/` | rendered demo clips |

---

## Quickstart: ONNX (recommended, no PyTorch)

The ONNX graph is fully self-contained; you only
need onnxruntime, misaki (G2P), and numpy.

```bash
pip install -r requirements-onnx.txt
python infer_onnx.py \
  --voice voice_a --register tender \
  --text "Hey, I wasn't expecting you tonight." \
  --out hello.wav
```

In Python:

```python
import numpy as np, onnxruntime as ort
from misaki import en
from kokoro_symbols import TextCleaner

sess = ort.InferenceSession("susurro.onnx", providers=["CPUExecutionProvider"])
g2p, clean = en.G2P(trf=False, british=False, fallback=None), TextCleaner()

# Text -> IPA phonemes -> token ids, wrapped with the BOS/EOS id 0.
ipa = g2p("The keys are on the table by the door.")[0].replace("ʏ", "y")
input_ids = np.array([[0, *clean(ipa), 0]], dtype=np.int64)   # BOS/EOS = 0

# Voicepacks are keyed "<voice>__<register>"; the graph expects shape [1, 256].
ref_s = np.load("voicepacks.npz")["voice_c__whisper"].reshape(1, 256).astype(np.float32)
audio = sess.run(None, {"input_ids": input_ids, "ref_s": ref_s})[0]   # float32, 24 kHz
```

**Inputs:** `input_ids [1, T] int64` (phoneme token ids wrapped with `0`), `ref_s [1, 256] float32` (a voicepack).
**Output:** `audio [N] float32` at 24 kHz. The token axis and audio length are dynamic.

## Quickstart: raw PyTorch

The repo bundles the StyleTTS2 model code and the PLBERT/ASR/JDC utility-net assets under `styletts2/`, so a plain clone runs without fetching anything else.

```bash
pip install -r requirements-raw.txt
python infer.py \
  --voicepack voicepacks/voice_a__tender.pt \
  --text "Hey, I wasn't expecting you tonight." \
  --out hello.wav
```

Runs on CPU or CUDA (auto-detected; override with `--device cpu|cuda`). `transformers` is pinned to 4.x in `requirements-raw.txt` because the bundled PLBERT loader targets `AlbertModel` as it was at train time.

## Voices & registers

`voice_a`, `voice_b`, `voice_c` × `{neutral, breathless, playful, urgent, tender, whisper}`. Pick any combination by name.

**whisper** and **urgent** are the most acoustically distinct; **breathless / neutral / playful / tender** cluster more tightly in style space (a subtle-register limit inherited from the synthetic source; see Limitations).

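
To see how tightly the registers cluster for a given voice, one option is to compare the voicepack vectors directly. The following is a minimal sketch, not the evaluation used for this card. It assumes the keys in `voicepacks.npz` follow the `<voice>__<register>` pattern seen in the quickstart (`voice_c__whisper`), and the helper names `split_style`, `cosine`, and `register_similarity` are hypothetical, introduced here for illustration only.

```python
from pathlib import Path

import numpy as np

REGISTERS = ["neutral", "breathless", "playful", "urgent", "tender", "whisper"]


def split_style(ref_s):
    """Split a 256-d voicepack into its (acoustic, prosodic) 128-d halves."""
    v = np.asarray(ref_s, dtype=np.float32).reshape(-1)
    if v.shape[0] != 256:
        raise ValueError(f"expected 256 values, got {v.shape[0]}")
    return v[:128], v[128:]


def cosine(a, b):
    """Cosine similarity between two vectors (flattened to 1-D)."""
    a = np.asarray(a, dtype=np.float64).reshape(-1)
    b = np.asarray(b, dtype=np.float64).reshape(-1)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def register_similarity(packs, voice, registers=REGISTERS):
    """Pairwise cosine similarity between one voice's register voicepacks.

    `packs` is any mapping from "<voice>__<register>" (assumed key format)
    to a 256-value array. Returns a len(registers) x len(registers) matrix.
    """
    vecs = [np.asarray(packs[f"{voice}__{r}"]).reshape(-1) for r in registers]
    n = len(vecs)
    sim = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = cosine(vecs[i], vecs[j])
    return sim


if __name__ == "__main__" and Path("voicepacks.npz").exists():
    packs = np.load("voicepacks.npz")
    with np.printoptions(precision=3, suppress=True):
        print(REGISTERS)
        print(register_similarity(packs, "voice_a"))
```

Off-diagonal values close to 1 indicate registers that sit near each other in style space. To look at only the acoustic or only the prosodic half, pass each vector through `split_style` first and compare the halves with `cosine`.
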

---

## Training data

| Source | Hours | License | Role |
|---|---|---|---|
| LibriTTS-R (train-clean-100, 247 spk) | 44.2 | **CC BY 4.0** | real-speech base: duration/F0 robustness |
| Synthetic data (3 target voices) | 24.5 | - | the 3 voices + 6 registers |
| **Mixed total** | **70.3** | - | 250 speakers, reference-based multispeaker |

Holdouts sealed pre-training: `eval_text`, `eval_xreg`, `calibration` (synthetic only).

## Evaluation

Scored against the ground-truth Higgs ceiling (CER 0.004 / UTMOS 4.25); best checkpoint selected by eval (not by max epoch, since stage 2 is non-monotonic).

| Metric | susurro | GT ceiling | Notes |
|---|---|---|---|
| CER (faster-whisper, eval_text) | **0.011** | 0.004 | intelligibility round-trip; near ceiling |
| UTMOS | **4.32** | 4.25 | no-reference naturalness; above the synthetic-data ceiling |
| register separation | see note | see note | report per-register centroid cosine + ears (silhouette is speaker-confounded) |

Winner checkpoint: **`epoch_2nd_00024`** (selected over epochs 18–24).

## Intended use & limitations

- **Use:** expressive English narration/dialogue for the 3 provided voices.
- **Not:** voice cloning of arbitrary speakers; non-English text (English G2P only).
- **Limitations:** synthetic-voice timbre is bounded by the source quality. Register strength is uneven: **whisper and urgent are clearly distinct; breathless, neutral, playful, tender are subtle** (close in style space, matching the source). Intelligibility/naturalness are strong across all registers and voices.

## Reproducing the ONNX

```bash
pip install -r requirements-raw.txt onnx onnxruntime
python export_onnx.py   # susurro.pth -> susurro.onnx, prints ONNX-vs-PyTorch parity
```

## Licensing

- **Weights (`susurro.pth`, `susurro.onnx`, voicepacks):** **Apache-2.0** (from-scratch model). See `LICENSE`.
- **Bundled `styletts2/` model code:** **MIT** (StyleTTS2, © 2023 Aaron (Yinghao) Li). See `styletts2/LICENSE`.
- **Bundled utility nets:** PLBERT / Kokoro lineage (**Apache-2.0**, hexgrad); ASR & JDC (StyleTTS2 MIT).
- **Training data attribution:** LibriTTS-R, CC BY 4.0 (Koizumi et al., 2023). misaki[en] G2P.

## Citation

```bibtex
@software{susurro_2026,
  title  = {susurro: expressive multi-register TTS},
  author = {Aimeri},
  year   = {2026},
  note   = {Kokoro-inspired StyleTTS2, trained on LibriTTS-R (CC BY 4.0) + synthetic registers}
}
```