susurro — expressive multi-register TTS

Kokoro-class StyleTTS2 text-to-speech for 3 voices × 6 registers, where expressive register (neutral / breathless / playful / urgent / tender / whisper) lives in the style space and is selected via a voicepack — not baked into the text. Trained from scratch.

Architecture: StyleTTS2 (Kokoro-weight-compatible), 178-token misaki IPA vocabulary
Sample rate: 24 kHz mono · G2P: misaki[en] (English)
Inference: voicepack path — predict duration/F0/energy from the prosodic style, decode with the acoustic style (no diffusion sampler required)
Two runtimes: a self-contained ONNX path (onnxruntime + misaki, no PyTorch) and a raw PyTorch path (bundled StyleTTS2 code).

voice_a │ voice_b │ voice_c    ×    neutral · breathless · playful · urgent · tender · whisper
                                    └────────────────── 18 voicepacks ──────────────────┘

Files

File	What
`susurro.onnx`	single ONNX graph: `(input_ids, ref_s) → 24 kHz audio` (text→tokens & voicepack are inputs)
`susurro.pth`	raw inference weights (`{'net': …}`; training scaffolding stripped)
`voicepacks/<voice>__<register>.pt`	256-d style vector — `[0:128]` acoustic, `[128:256]` prosodic
`voicepacks.npz`	all 18 voicepacks as numpy arrays (the ONNX path, torch-free)
`infer_onnx.py`	dependency-light inference: onnxruntime + misaki only
`infer.py`	raw PyTorch inference (uses bundled `styletts2/`)
`export_onnx.py`, `onnx_stft.py`	reproduce `susurro.onnx` from `susurro.pth`
`config.yml`, `kokoro_symbols.py`	model config + the 178-token phoneme map
`styletts2/`	bundled StyleTTS2 model code + PLBERT/ASR/JDC assets (raw path)
`samples/`	rendered demo clips
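The 256-d voicepack layout listed above can be made concrete with a small helper. This is a minimal sketch, assuming a voicepack has already been loaded as a float array of shape (256,) or (1, 256); `split_voicepack` is a hypothetical name and not part of the repository, and the demo vector is a stand-in rather than a real voicepack.

```python
import numpy as np

STYLE_DIM = 256               # full voicepack width, per the Files table
ACOUSTIC = slice(0, 128)      # [0:128]   acoustic style (used when decoding)
PROSODIC = slice(128, 256)    # [128:256] prosodic style (duration/F0/energy)


def split_voicepack(ref_s):
    """Return the (acoustic, prosodic) halves of a voicepack.

    Hypothetical helper: accepts shape (256,) or (1, 256).
    """
    flat = np.asarray(ref_s, dtype=np.float32).reshape(-1)
    if flat.shape[0] != STYLE_DIM:
        raise ValueError(f"expected {STYLE_DIM} values, got {flat.shape[0]}")
    return flat[ACOUSTIC], flat[PROSODIC]


# Stand-in vector (0, 1, ..., 255) so the two halves are easy to tell apart.
demo = np.arange(STYLE_DIM, dtype=np.float32).reshape(1, STYLE_DIM)
acoustic, prosodic = split_voicepack(demo)
print(acoustic.shape, prosodic.shape)
```

With a real pack, `demo` would be replaced by an entry of `voicepacks.npz`, for example `np.load("voicepacks.npz")["voice_c__whisper"]`.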

Quickstart — ONNX (recommended, no PyTorch)

The ONNX graph is fully self-contained; you only need onnxruntime, misaki (G2P), and numpy.

pip install -r requirements-onnx.txt
python infer_onnx.py \
  --voice voice_a --register tender \
  --text "Hey, I wasn't expecting you tonight." \
  --out hello.wav

In Python:

import numpy as np, onnxruntime as ort
from misaki import en
from kokoro_symbols import TextCleaner

sess = ort.InferenceSession("susurro.onnx", providers=["CPUExecutionProvider"])
g2p, clean = en.G2P(trf=False, british=False, fallback=None), TextCleaner()

ipa = g2p("The keys are on the table by the door.")[0].replace("ʏ", "y")
input_ids = np.array([[0, *clean(ipa), 0]], dtype=np.int64)          # BOS/EOS = 0
ref_s = np.load("voicepacks.npz")["voice_c__whisper"].reshape(1, 256).astype(np.float32)

audio = sess.run(None, {"input_ids": input_ids, "ref_s": ref_s})[0]  # float32, 24 kHz

Inputs: input_ids [1, T] int64 (phoneme token ids wrapped with 0) and ref_s [1, 256] float32 (a voicepack). Output: audio [N] float32 at 24 kHz. The token axis and audio length are dynamic.
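The input and output shapes described above suggest two small utilities: one that wraps phoneme ids with the 0 boundary token, and one that saves the returned audio. The sketch below is illustrative only. `wrap_token_ids` and `save_wav` are hypothetical names, the range check assumes ids run from 0 to 177, and the writer assumes the audio lies in [-1, 1]. The bundled `infer_onnx.py` may handle both steps differently.

```python
import os
import tempfile
import wave

import numpy as np

SAMPLE_RATE = 24_000   # from the model card
VOCAB_SIZE = 178       # misaki IPA vocabulary size, from the model card


def wrap_token_ids(token_ids):
    """Wrap phoneme ids with the 0 boundary token; return shape [1, T] int64."""
    ids = np.asarray([0, *token_ids, 0], dtype=np.int64)
    # Assumption: valid ids are 0 .. VOCAB_SIZE - 1.
    if ids.min() < 0 or ids.max() >= VOCAB_SIZE:
        raise ValueError("token id outside the 178-token vocabulary")
    return ids[None, :]


def save_wav(path, audio, sample_rate=SAMPLE_RATE):
    """Write float audio (assumed in [-1, 1]) as 16-bit mono PCM."""
    pcm = np.clip(np.asarray(audio, dtype=np.float32).reshape(-1), -1.0, 1.0)
    pcm16 = (pcm * 32767.0).astype("<i2")   # little-endian int16
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm16.tobytes())


# Demo with a half-second 220 Hz tone standing in for model output.
tone = 0.1 * np.sin(2 * np.pi * 220.0 * np.arange(SAMPLE_RATE // 2) / SAMPLE_RATE)
out_path = os.path.join(tempfile.gettempdir(), "susurro_demo.wav")
save_wav(out_path, tone)
```

In practice `tone` would be the `audio` array returned by `sess.run(...)`.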

Quickstart — raw PyTorch

The repository bundles the StyleTTS2 model code and the PLBERT/ASR/JDC utility-net assets under styletts2/, so a plain clone runs without fetching anything else.

pip install -r requirements-raw.txt
python infer.py \
  --voicepack voicepacks/voice_a__tender.pt \
  --text "Hey, I wasn't expecting you tonight." \
  --out hello.wav

Runs on CPU or CUDA (auto-detected; --device cpu|cuda). transformers is pinned to 4.x in requirements-raw.txt because the bundled PLBERT loader targets AlbertModel as it was at train time.

Voices & registers

voice_a, voice_b, voice_c × {neutral, breathless, playful, urgent, tender, whisper}. Pick any combination by name. whisper and urgent are the most acoustically distinct; breathless / neutral / playful / tender cluster more tightly in style space (a subtle-register limit inherited from the synthetic source; see Limitations).
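Since every voicepack is addressed as `<voice>__<register>`, the 18 names can be generated and validated programmatically. This is a small sketch. `voicepack_key` is a hypothetical helper, and the double-underscore pattern is taken from the file names and the `voice_c__whisper` key used earlier.

```python
from itertools import product

VOICES = ("voice_a", "voice_b", "voice_c")
REGISTERS = ("neutral", "breathless", "playful", "urgent", "tender", "whisper")


def voicepack_key(voice, register):
    """Build a '<voice>__<register>' key, rejecting unknown names early."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if register not in REGISTERS:
        raise ValueError(f"unknown register: {register!r}")
    return f"{voice}__{register}"


# All 3 x 6 combinations, in voice-major order.
ALL_KEYS = [voicepack_key(v, r) for v, r in product(VOICES, REGISTERS)]
print(len(ALL_KEYS), ALL_KEYS[0], ALL_KEYS[-1])
```

The same key should index `voicepacks.npz`, and `voicepacks/<key>.pt` gives the matching file for the raw PyTorch path.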

Training data

Source	Hours	License	Role
LibriTTS-R (train-clean-100, 247 spk)	44.2	CC BY 4.0	real-speech base — duration/F0 robustness
Synthetic data (3 target voices)	24.5	-	the 3 voices + 6 registers
Mixed total	70.3	-	250 speakers, reference-based multispeaker

Holdouts sealed pre-training: eval_text, eval_xreg, calibration (synthetic only).

Evaluation

Scored vs the ground-truth Higgs ceiling (CER 0.004 / UTMOS 4.25); best checkpoint selected by eval (not by max epoch — stage 2 is non-monotonic).

Metric	susurro	GT ceiling	Notes
CER (faster-whisper, eval_text)	0.011	0.004	intelligibility round-trip; near ceiling
UTMOS	4.32	4.25	no-reference naturalness; above the synthetic-data ceiling
register separation	see note	see note	report per-register centroid cosine plus listening checks (silhouette is speaker-confounded)
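For readers unfamiliar with the CER row: character error rate is commonly defined as the Levenshtein edit distance between the reference text and the ASR transcript, divided by the reference length. The sketch below shows that standard definition. The lowercase and whitespace normalization is an assumption, and the evaluation script actually used for the numbers above is not shown in this README.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert/delete/substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("The keys are on the table.", "the keys are on the table."))
```

In a round-trip setup, `hypothesis` would be the transcript of the synthesized audio and `reference` the input text.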

Winner checkpoint: epoch_2nd_00024 (selected from among epochs 18–24).
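The per-register centroid cosine mentioned in the table can be sketched as follows: average the style vectors for each register, then compare the centroids by cosine similarity. This is a hedged illustration, not the project's evaluation code. The function names are hypothetical and the 2-d toy vectors are made up purely to show the mechanics, so they are not model outputs.

```python
import numpy as np


def register_centroids(styles, registers):
    """Mean style vector per register label.

    styles: array-like [N, D]; registers: N labels, one per row.
    """
    styles = np.asarray(styles, dtype=np.float64)
    labels = np.asarray(registers)
    return {r: styles[labels == r].mean(axis=0) for r in sorted(set(registers))}


def centroid_cosine(centroids):
    """Return (names, matrix) where matrix[i, j] is the cosine similarity
    between centroids i and j. Assumes no centroid is the zero vector."""
    names = list(centroids)
    mat = np.stack([centroids[n] for n in names])
    unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return names, unit @ unit.T


# Toy data only: two near-parallel registers and one distinct register.
toy_styles = [[1.0, 0.0], [0.9, 0.1],
              [0.0, 1.0], [0.1, 0.9],
              [0.0, 2.0], [0.2, 1.8]]
toy_labels = ["whisper", "whisper", "neutral", "neutral", "tender", "tender"]

names, sim = centroid_cosine(register_centroids(toy_styles, toy_labels))
print(names)
print(np.round(sim, 3))
```

Values near 1 off the diagonal indicate registers that sit close together in style space. Because speaker identity also moves the style vector, computing this per voice, or after subtracting each voice's mean, may be worth considering.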

Intended use & limitations

Use: expressive English narration/dialogue for the 3 provided voices.
Not: voice cloning of arbitrary speakers; non-English text (English G2P only).
Limitations: synthetic-voice timbre is bounded by the source quality. Register strength is uneven — whisper and urgent are clearly distinct; breathless, neutral, playful, tender are subtle (close in style space, matching the source). Intelligibility/naturalness are strong across all registers and voices.

Reproducing the ONNX

pip install -r requirements-raw.txt onnx onnxruntime
python export_onnx.py      # susurro.pth -> susurro.onnx, prints ONNX-vs-PyTorch parity
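A parity check of this kind usually compares the two waveforms sample by sample. The sketch below shows one plausible report (max absolute error, RMS error, SNR). `parity_report` is a hypothetical helper, the metrics that `export_onnx.py` actually prints are not documented here, and the demo signals are synthetic stand-ins for the PyTorch and ONNX outputs.

```python
import numpy as np


def parity_report(reference, candidate):
    """Compare two waveforms over their common length."""
    ref = np.asarray(reference, dtype=np.float64).reshape(-1)
    cand = np.asarray(candidate, dtype=np.float64).reshape(-1)
    n = min(ref.size, cand.size)          # tolerate a small length mismatch
    diff = ref[:n] - cand[:n]
    noise = float(np.sum(diff ** 2))
    signal = float(np.sum(ref[:n] ** 2))
    snr_db = float("inf") if noise == 0.0 else float(10.0 * np.log10(signal / noise))
    return {
        "max_abs_err": float(np.max(np.abs(diff))),
        "rms_err": float(np.sqrt(noise / n)),
        "snr_db": snr_db,
        "length_delta": int(ref.size - cand.size),
    }


# Synthetic stand-ins: a 1 s tone and a copy with a tiny offset, 10 samples shorter.
t = np.arange(24_000) / 24_000
ref_audio = 0.5 * np.sin(2 * np.pi * 220.0 * t)
cand_audio = (ref_audio + 1e-4)[:-10]
report = parity_report(ref_audio, cand_audio)
print(report)
```

What counts as acceptable parity is a judgment call; float32 graphs rarely match bit for bit, so a threshold on `max_abs_err` or `snr_db` is more useful than an exact comparison.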

Licensing

Weights (susurro.pth, susurro.onnx, voicepacks): Apache-2.0 (from-scratch model). See LICENSE.
Bundled utility nets: PLBERT / Kokoro lineage (Apache-2.0, hexgrad); ASR & JDC (StyleTTS2 MIT).
Training data attribution: LibriTTS-R, CC BY 4.0 (Koizumi et al., 2023). G2P: misaki[en].

Citation

@software{susurro_2026,
  title  = {susurro: expressive multi-register TTS},
  author = {Aimeri},
  year   = {2026},
  note   = {Kokoro-inspired StyleTTS2, trained on LibriTTS-R (CC BY 4.0) + synthetic registers}
}
