susurro: expressive multi-register TTS

Kokoro-class StyleTTS2 text-to-speech for 3 voices × 6 registers, where the expressive register (neutral / breathless / playful / urgent / tender / whisper) lives in the style space and is selected via a voicepack rather than baked into the text. Trained from scratch.

  • Architecture: StyleTTS2 (Kokoro-weight-compatible), 178-token misaki IPA vocabulary
  • Sample rate: 24 kHz mono · G2P: misaki[en] (English)
  • Inference: voicepack path, which predicts duration/F0/energy from the prosodic style and decodes with the acoustic style (no diffusion sampler required)
  • Two runtimes: a self-contained ONNX path (onnxruntime + misaki, no PyTorch) and a raw PyTorch path (bundled StyleTTS2 code).
voice_a │ voice_b │ voice_c    ×    neutral · breathless · playful · urgent · tender · whisper
                                    └────────────────── 18 voicepacks ──────────────────┘
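The 3 × 6 grid maps onto 18 voicepack names. A minimal sketch, assuming the `<voice>__<register>` naming that this repository uses for voicepack files and for keys such as `voice_c__whisper`; the helper name `voicepack_keys` is hypothetical and not part of the repository:

```python
from itertools import product

VOICES = ("voice_a", "voice_b", "voice_c")
REGISTERS = ("neutral", "breathless", "playful", "urgent", "tender", "whisper")

def voicepack_keys(voices=VOICES, registers=REGISTERS):
    """Return every '<voice>__<register>' name (hypothetical helper)."""
    return [f"{v}__{r}" for v, r in product(voices, registers)]

keys = voicepack_keys()
print(len(keys))  # 18
print(keys[0])    # voice_a__neutral
```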

Files

File                               What
susurro.onnx                       single ONNX graph: (input_ids, ref_s) → 24 kHz audio (text→tokens & voicepack are inputs)
susurro.pth                        raw inference weights ({'net': …}; training scaffolding stripped)
voicepacks/<voice>__<register>.pt  256-d style vector: [0:128] acoustic, [128:256] prosodic
voicepacks.npz                     all 18 voicepacks as numpy arrays (the ONNX path, torch-free)
infer_onnx.py                      dependency-light inference: onnxruntime + misaki only
infer.py                           raw PyTorch inference (uses bundled styletts2/)
export_onnx.py, onnx_stft.py       reproduce susurro.onnx from susurro.pth
config.yml, kokoro_symbols.py      model config + the 178-token phoneme map
styletts2/                         bundled StyleTTS2 model code + PLBERT/ASR/JDC assets (raw path)
samples/                           rendered demo clips
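The voicepack layout listed above ([0:128] acoustic, [128:256] prosodic) can be made explicit with a small slicing helper. This is a sketch only: `split_voicepack` is a hypothetical name, and the demo vector is a stand-in rather than a real voicepack.

```python
import numpy as np

def split_voicepack(ref_s):
    """Split a 256-d style vector into (acoustic, prosodic) halves."""
    v = np.asarray(ref_s, dtype=np.float32).reshape(-1)
    if v.shape[0] != 256:
        raise ValueError(f"expected 256 values, got {v.shape[0]}")
    return v[:128], v[128:]

# Stand-in vector; a real one would come from voicepacks.npz,
# e.g. np.load("voicepacks.npz")["voice_c__whisper"].
demo = np.arange(256, dtype=np.float32)
acoustic, prosodic = split_voicepack(demo)
print(acoustic.shape, prosodic.shape)  # (128,) (128,)
```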

Quickstart: ONNX (recommended, no PyTorch)

The ONNX graph is fully self-contained; you only need onnxruntime, misaki (G2P), and numpy.

pip install -r requirements-onnx.txt
python infer_onnx.py \
  --voice voice_a --register tender \
  --text "Hey, I wasn't expecting you tonight." \
  --out hello.wav

In Python:

import numpy as np, onnxruntime as ort
from misaki import en
from kokoro_symbols import TextCleaner

sess = ort.InferenceSession("susurro.onnx", providers=["CPUExecutionProvider"])
g2p, clean = en.G2P(trf=False, british=False, fallback=None), TextCleaner()

ipa = g2p("The keys are on the table by the door.")[0].replace("ʏ", "y")
input_ids = np.array([[0, *clean(ipa), 0]], dtype=np.int64)          # BOS/EOS = 0
ref_s = np.load("voicepacks.npz")["voice_c__whisper"].reshape(1, 256).astype(np.float32)

audio = sess.run(None, {"input_ids": input_ids, "ref_s": ref_s})[0]  # float32, 24 kHz

Inputs: input_ids [1, T] int64 (phoneme token ids wrapped with 0), ref_s [1, 256] float32 (a voicepack). Output: audio [N] float32 at 24 kHz. The token axis and audio length are dynamic.
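The token wrapping and the 24 kHz output can be handled with the standard library alone. The sketch below is not the code in `infer_onnx.py`: `wrap_token_ids` and `save_wav` are hypothetical helper names, and storing the float output as 16-bit PCM clipped to [-1, 1] is an assumption.

```python
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 24_000  # output rate stated for the model

def wrap_token_ids(token_ids):
    """Wrap phoneme token ids with the 0 boundary token, as a [1, T] nested list."""
    return [[0, *map(int, token_ids), 0]]

def save_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples as 16-bit mono PCM; return the duration in seconds."""
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]  # assumed range
    pcm = struct.pack(f"<{len(clipped)}h",
                      *(int(round(s * 32767)) for s in clipped))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return len(clipped) / sample_rate

# Demo with half a second of silence as a stand-in for model output.
demo_path = os.path.join(tempfile.gettempdir(), "susurro_demo.wav")
duration = save_wav(demo_path, [0.0] * 12_000)
print(wrap_token_ids([5, 6]))  # [[0, 5, 6, 0]]
print(duration)                # 0.5
```

With a real session, `save_wav("hello.wav", audio)` would take the array returned by `sess.run`, since iterating a 1-D float32 array yields one sample at a time.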

Quickstart: raw PyTorch

The repository bundles the StyleTTS2 model code and the PLBERT/ASR/JDC utility-net assets under styletts2/, so a plain clone runs without fetching anything else.

pip install -r requirements-raw.txt
python infer.py \
  --voicepack voicepacks/voice_a__tender.pt \
  --text "Hey, I wasn't expecting you tonight." \
  --out hello.wav

Runs on CPU or CUDA (auto-detected; --device cpu|cuda). transformers is pinned to 4.x in requirements-raw.txt because the bundled PLBERT loader targets AlbertModel as it was at train time.

Voices & registers

voice_a, voice_b, voice_c × {neutral, breathless, playful, urgent, tender, whisper}. Pick any combination by name. whisper and urgent are the most acoustically distinct; breathless / neutral / playful / tender cluster more tightly in style space (a subtle-register limit inherited from the synthetic source; see Limitations).
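One way to see how tightly registers cluster is to average the style vectors per register across voices and compare the centroids by cosine similarity. This is a hedged sketch: the helper names are hypothetical, the demo uses random stand-in vectors (not real voicepacks), and computing centroids from the voicepack vectors themselves is an assumption rather than the documented evaluation procedure.

```python
import numpy as np

def register_centroids(packs):
    """Average style vectors per register; keys look like '<voice>__<register>'."""
    grouped = {}
    for key, vec in packs.items():
        register = key.split("__", 1)[1]
        flat = np.asarray(vec, dtype=np.float64).reshape(-1)
        grouped.setdefault(register, []).append(flat)
    return {r: np.mean(vs, axis=0) for r, vs in grouped.items()}

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins so the example runs without the model files.
rng = np.random.default_rng(0)
voices = ("voice_a", "voice_b", "voice_c")
registers = ("neutral", "breathless", "playful", "urgent", "tender", "whisper")
packs = {f"{v}__{r}": rng.normal(size=256) for v in voices for r in registers}

centroids = register_centroids(packs)
sim = cosine(centroids["whisper"], centroids["neutral"])
print(sorted(centroids))
```

With the real files, `packs` could be built as `npz = np.load("voicepacks.npz"); packs = {k: npz[k] for k in npz.files}`.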


Training data

Source                                 Hours  License    Role
LibriTTS-R (train-clean-100, 247 spk)  44.2   CC BY 4.0  real-speech base: duration/F0 robustness
Synthetic data (3 target voices)       24.5   -          the 3 voices + 6 registers
Mixed total                            70.3   -          250 speakers, reference-based multispeaker

Holdouts sealed pre-training: eval_text, eval_xreg, calibration (synthetic only).

Evaluation

Scored against the ground-truth Higgs ceiling (CER 0.004 / UTMOS 4.25). The best checkpoint was selected by eval, not by max epoch, because stage 2 is non-monotonic.

Metric                           susurro   GT ceiling  Notes
CER (faster-whisper, eval_text)  0.011     0.004       intelligibility round-trip; near ceiling
UTMOS                            4.32      4.25        no-reference naturalness; above the synthetic-data ceiling
register separation              see note  see note    report per-register centroid cosine + ears (silhouette is speaker-confounded)
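For readers unfamiliar with CER, the metric is the character-level edit distance between a reference text and a transcript, divided by the reference length. The sketch below covers only that arithmetic; it does not run faster-whisper, and the normalization step (lowercasing, collapsing whitespace) is an assumption, not the documented scoring recipe.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, start=1):
        cur = [i]
        for j, hc in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rc != hc)))  # substitution or match
        prev = cur
    return prev[-1]

def normalize(text):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def cer(reference, hypothesis):
    """Character error rate of `hypothesis` against `reference`."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

print(cer("The keys are on the table.", "the keys are on the table."))  # 0.0
```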

Winner checkpoint: epoch_2nd_00024 (selected over epochs 18–24).

Intended use & limitations

  • Use: expressive English narration/dialogue for the 3 provided voices.
  • Not: voice cloning of arbitrary speakers; non-English text (English G2P only).
  • Limitations: synthetic-voice timbre is bounded by the source quality. Register strength is uneven: whisper and urgent are clearly distinct; breathless, neutral, playful, tender are subtle (close in style space, matching the source). Intelligibility/naturalness are strong across all registers and voices.

Reproducing the ONNX

pip install -r requirements-raw.txt onnx onnxruntime
python export_onnx.py      # susurro.pth -> susurro.onnx, prints ONNX-vs-PyTorch parity
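What `export_onnx.py` prints for parity is not specified here. As an illustration of what such a check might measure, the sketch below compares two waveforms sample by sample; `parity_report` is a hypothetical helper, and the choice of max/mean absolute difference over the overlapping samples is an assumption.

```python
import numpy as np

def parity_report(reference, candidate):
    """Compare two waveforms over their overlapping samples."""
    ref = np.asarray(reference, dtype=np.float64).reshape(-1)
    cand = np.asarray(candidate, dtype=np.float64).reshape(-1)
    n = min(ref.size, cand.size)  # lengths may differ slightly between runtimes
    diff = np.abs(ref[:n] - cand[:n])
    return {
        "n_compared": int(n),
        "length_delta": int(cand.size - ref.size),
        "max_abs_diff": float(diff.max()) if n else 0.0,
        "mean_abs_diff": float(diff.mean()) if n else 0.0,
    }

# Stand-in signals: a ramp and the same ramp with a small constant offset.
a = np.linspace(-1.0, 1.0, 1000)
b = a + 1e-4
report = parity_report(a, b)
print(report["n_compared"], report["length_delta"])  # 1000 0
```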

Licensing

  • Weights (susurro.pth, susurro.onnx, voicepacks): Apache-2.0 (from-scratch model). See LICENSE.
  • Bundled styletts2/ model code: MIT (StyleTTS2, © 2023 Aaron (Yinghao) Li). See styletts2/LICENSE.
  • Bundled utility nets: PLBERT / Kokoro lineage (Apache-2.0, hexgrad); ASR & JDC (StyleTTS2 MIT).
  • Training data attribution: LibriTTS-R, CC BY 4.0 (Koizumi et al., 2023). misaki[en] G2P.

Citation

@software{susurro_2026,
  title  = {susurro: expressive multi-register TTS},
  author = {Aimeri},
  year   = {2026},
  note   = {Kokoro-inspired StyleTTS2, trained on LibriTTS-R (CC BY 4.0) + synthetic registers}
}