susurro: expressive multi-register TTS
Kokoro-class StyleTTS2 text-to-speech for 3 voices × 6 registers, where expressive register (neutral / breathless / playful / urgent / tender / whisper) lives in the style space and is selected via a voicepack, not baked into the text. Trained from scratch.
- Architecture: StyleTTS2 (Kokoro-weight-compatible), 178-token misaki IPA vocabulary
- Sample rate: 24 kHz mono · G2P: misaki[en] (English)
- Inference: voicepack path, which predicts duration/F0/energy from the prosodic style and decodes with the acoustic style (no diffusion sampler required)
- Two runtimes: a self-contained ONNX path (onnxruntime + misaki, no PyTorch) and a raw PyTorch path (bundled StyleTTS2 code).
voice_a · voice_b · voice_c × neutral · breathless · playful · urgent · tender · whisper
------------------- 18 voicepacks -------------------
Files
| File | What |
|---|---|
| susurro.onnx | single ONNX graph: (input_ids, ref_s) → 24 kHz audio (text→tokens & voicepack are inputs) |
| susurro.pth | raw inference weights ({'net': …}; training scaffolding stripped) |
| voicepacks/<voice>__<register>.pt | 256-d style vector: [0:128] acoustic, [128:256] prosodic |
| voicepacks.npz | all 18 voicepacks as numpy arrays (the ONNX path, torch-free) |
| infer_onnx.py | dependency-light inference: onnxruntime + misaki only |
| infer.py | raw PyTorch inference (uses bundled styletts2/) |
| export_onnx.py, onnx_stft.py | reproduce susurro.onnx from susurro.pth |
| config.yml, kokoro_symbols.py | model config + the 178-token phoneme map |
| styletts2/ | bundled StyleTTS2 model code + PLBERT/ASR/JDC assets (raw path) |
| samples/ | rendered demo clips |
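The voicepack layout listed in the table can be illustrated with a short sketch. It assumes a voicepack is available as a flat 256-element array; `split_style` is a hypothetical helper for illustration and is not part of the repository.

```python
import numpy as np

ACOUSTIC = slice(0, 128)    # half used by the decoder, per the table
PROSODIC = slice(128, 256)  # half used for duration/F0/energy prediction


def split_style(ref_s):
    """Split a 256-d voicepack into its (acoustic, prosodic) halves.

    Hypothetical helper; accepts shape (256,) or (1, 256).
    """
    vec = np.asarray(ref_s, dtype=np.float32).reshape(-1)
    if vec.shape[0] != 256:
        raise ValueError(f"expected 256 values, got {vec.shape[0]}")
    return vec[ACOUSTIC], vec[PROSODIC]


# Stand-in vector; a real one would be loaded from voicepacks.npz.
dummy = np.arange(256, dtype=np.float32).reshape(1, 256)
acoustic, prosodic = split_style(dummy)
print(acoustic.shape, prosodic.shape)
```

Keeping the two slices as named constants makes it harder to swap the halves by accident when inspecting or comparing voicepacks.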
Quickstart: ONNX (recommended, no PyTorch)
The ONNX graph is fully self-contained; you only need onnxruntime, misaki (G2P), and numpy.
pip install -r requirements-onnx.txt
python infer_onnx.py \
--voice voice_a --register tender \
--text "Hey, I wasn't expecting you tonight." \
--out hello.wav
In Python:
import numpy as np, onnxruntime as ort
from misaki import en
from kokoro_symbols import TextCleaner
sess = ort.InferenceSession("susurro.onnx", providers=["CPUExecutionProvider"])
g2p, clean = en.G2P(trf=False, british=False, fallback=None), TextCleaner()
ipa = g2p("The keys are on the table by the door.")[0].replace("Κ", "y")
input_ids = np.array([[0, *clean(ipa), 0]], dtype=np.int64) # BOS/EOS = 0
ref_s = np.load("voicepacks.npz")["voice_c__whisper"].reshape(1, 256).astype(np.float32)
audio = sess.run(None, {"input_ids": input_ids, "ref_s": ref_s})[0] # float32, 24 kHz
Inputs: input_ids [1, T] int64 (phoneme token ids wrapped with 0), ref_s [1, 256]
(a voicepack). Output: audio [N] float32 at 24 kHz. The token axis and audio length are
dynamic.
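Since the graph returns a raw float32 waveform, a caller still has to write it to disk. The sketch below does that with the standard library only. It assumes the samples lie roughly in [-1, 1] (not stated above), and `write_wav` is a hypothetical helper name, not a function from infer_onnx.py.

```python
import os
import sys
import tempfile
import wave
from array import array

SAMPLE_RATE = 24_000  # Hz, the output rate stated above


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples as 16-bit PCM and return the duration in seconds.

    Assumption: samples are roughly in [-1, 1]; values outside are clipped.
    """
    pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV stores samples little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)    # mono
        wav_file.setsampwidth(2)    # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm) / sample_rate


# Stand-in for the model output: half a second of silence.
demo_path = os.path.join(tempfile.gettempdir(), "susurro_demo.wav")
duration = write_wav(demo_path, [0.0] * 12_000)
print(f"{duration:.2f} s written to {demo_path}")
```

Because the audio length is dynamic, the duration is derived from the number of samples rather than assumed in advance.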
Quickstart: raw PyTorch
The repository bundles the StyleTTS2 model code and the PLBERT/ASR/JDC utility-net assets
under styletts2/, so a plain clone runs without fetching anything else.
pip install -r requirements-raw.txt
python infer.py \
--voicepack voicepacks/voice_a__tender.pt \
--text "Hey, I wasn't expecting you tonight." \
--out hello.wav
Runs on CPU or CUDA (auto-detected; --device cpu|cuda). transformers is pinned to 4.x in
requirements-raw.txt because the bundled PLBERT loader targets AlbertModel as it was at
train time.
Voices & registers
voice_a, voice_b, voice_c × {neutral, breathless, playful, urgent, tender, whisper}.
Pick any combination by name. whisper and urgent are the most acoustically distinct;
breathless / neutral / playful / tender cluster more tightly in style space (a
subtle-register limit inherited from the synthetic source; see Limitations).
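The naming scheme can be made concrete with a small sketch that enumerates every voice and register combination. The `<voice>__<register>` key format follows the file names and the voicepacks.npz key used earlier; `voicepack_key` itself is a hypothetical helper.

```python
from itertools import product

VOICES = ("voice_a", "voice_b", "voice_c")
REGISTERS = ("neutral", "breathless", "playful", "urgent", "tender", "whisper")


def voicepack_key(voice, register):
    """Build the '<voice>__<register>' name, rejecting unknown values."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if register not in REGISTERS:
        raise ValueError(f"unknown register: {register!r}")
    return f"{voice}__{register}"


# Every combination of the 3 voices and 6 registers.
ALL_KEYS = [voicepack_key(v, r) for v, r in product(VOICES, REGISTERS)]
print(len(ALL_KEYS), ALL_KEYS[0], ALL_KEYS[-1])
```

Validating names up front gives a readable error instead of a missing-key failure when a voicepack is looked up later.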
Training data
| Source | Hours | License | Role |
|---|---|---|---|
| LibriTTS-R (train-clean-100, 247 speakers) | 44.2 | CC BY 4.0 | real-speech base: duration/F0 robustness |
| Synthetic data (3 target voices) | 24.5 | - | the 3 voices + 6 registers |
| Mixed total | 70.3 | - | 250 speakers, reference-based multispeaker |
Holdouts sealed pre-training: eval_text, eval_xreg, calibration (synthetic only).
Evaluation
Scored against the ground-truth Higgs ceiling (CER 0.004 / UTMOS 4.25). The best checkpoint was selected by eval rather than by max epoch, because stage 2 is non-monotonic.
| Metric | susurro | GT ceiling | Notes |
|---|---|---|---|
| CER (faster-whisper, eval_text) | 0.011 | 0.004 | intelligibility round-trip; near ceiling |
| UTMOS | 4.32 | 4.25 | no-reference naturalness; above the synthetic-data ceiling |
| register separation | see note | see note | report per-register centroid cosine + ears (silhouette is speaker-confounded) |
Winner checkpoint: epoch_2nd_00024 (selected over epochs 18-24).
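For readers unfamiliar with the CER metric in the table, a minimal sketch follows. It computes character-level edit distance divided by reference length. The text normalization (lowercasing, collapsing whitespace) is an assumption; the normalization actually used for the reported numbers is not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    ref = " ".join(reference.lower().split())   # assumed normalization
    hyp = " ".join(hypothesis.lower().split())
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Reference text vs a hypothetical ASR transcript with one wrong character.
print(cer("The keys are on the table.", "The keys are on the cable."))
```

In the round-trip setting, the reference is the input text and the hypothesis is the ASR transcript of the synthesized audio.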
Intended use & limitations
- Use: expressive English narration/dialogue for the 3 provided voices.
- Not: voice cloning of arbitrary speakers; non-English text (English G2P only).
- Limitations: synthetic-voice timbre is bounded by the source quality. Register strength is uneven: whisper and urgent are clearly distinct; breathless, neutral, playful, tender are subtle (close in style space, matching the source). Intelligibility/naturalness are strong across all registers and voices.
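One way to quantify how close registers sit in style space is the per-register centroid cosine mentioned in the evaluation notes. The sketch below is an illustrative version under stated assumptions: it averages style vectors per register across voices and compares the centroids. The function name and the toy vectors are hypothetical, and this is not the project's evaluation code.

```python
import numpy as np


def register_centroid_cosine(vectors, labels):
    """Return (register names, cosine-similarity matrix of per-register centroids).

    vectors: array of shape (N, D), one style vector per voicepack.
    labels:  N register names, one per row of `vectors`.
    """
    vectors = np.asarray(vectors, dtype=np.float64)
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    # Mean style vector per register, averaged over voices.
    centroids = np.stack([vectors[labels == name].mean(axis=0) for name in names])
    norms = np.linalg.norm(centroids, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("a centroid has zero norm")
    unit = centroids / norms
    return names, unit @ unit.T


# Toy 2-d stand-ins: two registers, two voices each.
toy_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
toy_labels = ["urgent", "urgent", "whisper", "whisper"]
names, similarity = register_centroid_cosine(toy_vectors, toy_labels)
print(names)
print(similarity)
```

Off-diagonal values near 1 would indicate registers that are hard to tell apart in style space; lower values indicate clearer separation.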
Reproducing the ONNX
pip install -r requirements-raw.txt onnx onnxruntime
python export_onnx.py # susurro.pth -> susurro.onnx, prints ONNX-vs-PyTorch parity
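A parity check between the two runtimes can be sketched as a comparison of the two waveforms for the same input. What export_onnx.py actually prints is not specified here, so the metrics and the `parity_report` name below are assumptions chosen for illustration.

```python
import numpy as np


def parity_report(onnx_audio, torch_audio):
    """Compare two waveforms over their overlapping samples (hypothetical helper)."""
    a = np.asarray(onnx_audio, dtype=np.float64).reshape(-1)
    b = np.asarray(torch_audio, dtype=np.float64).reshape(-1)
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("both waveforms must be non-empty")
    diff = a[:n] - b[:n]
    return {
        "max_abs_diff": float(np.max(np.abs(diff))),
        "rms_diff": float(np.sqrt(np.mean(diff ** 2))),
        "length_delta": len(a) - len(b),  # nonzero if the outputs differ in length
    }


# Stand-in waveforms; real ones would come from the ONNX and PyTorch paths.
report = parity_report([0.0, 0.5, -0.5], [0.0, 0.25, -0.5])
print(report)
```

Reporting the length difference separately keeps a small duration mismatch from being hidden by the sample-wise comparison.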
Licensing
- Weights (susurro.pth, susurro.onnx, voicepacks): Apache-2.0 (from-scratch model). See LICENSE.
- Bundled styletts2/ model code: MIT (StyleTTS2, © 2023 Aaron (Yinghao) Li). See styletts2/LICENSE.
- Bundled utility nets: PLBERT / Kokoro lineage (Apache-2.0, hexgrad); ASR & JDC (StyleTTS2 MIT).
- Training data attribution: LibriTTS-R, CC BY 4.0 (Koizumi et al., 2023). misaki[en] G2P.
Citation
@software{susurro_2026,
title = {susurro: expressive multi-register TTS},
author = {Aimeri},
year = {2026},
note = {Kokoro-inspired StyleTTS2, trained on LibriTTS-R (CC BY 4.0) + synthetic registers}
}