---
license: apache-2.0
language:
- en
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- styletts2
- kokoro
- onnx
- expressive
- voicepack
---

# susurro — expressive multi-register TTS

Kokoro-class **StyleTTS2** text-to-speech for **3 voices × 6 registers**, where expressive
register (neutral / breathless / playful / urgent / tender / whisper) lives in the **style
space** and is selected via a *voicepack* — not baked into the text. Trained from scratch.

- **Architecture:** StyleTTS2 (Kokoro-weight-compatible), 178-token misaki IPA vocabulary
- **Sample rate:** 24 kHz mono · **G2P:** misaki[en] (English)
- **Inference:** voicepack path — predict duration/F0/energy from the prosodic style, decode
  with the acoustic style (no diffusion sampler required)
- **Two runtimes:** a self-contained **ONNX** path (onnxruntime + misaki, no PyTorch) and a
  **raw PyTorch** path (bundled StyleTTS2 code).

```
voice_a │ voice_b │ voice_c    ×    neutral · breathless · playful · urgent · tender · whisper
                                    └────────────────── 18 voicepacks ──────────────────┘
```

## Files

| File | What |
|---|---|
| `susurro.onnx` | single ONNX graph: `(input_ids, ref_s) → 24 kHz audio` (tokenization happens outside the graph; token ids and a voicepack are the inputs) |
| `susurro.pth` | raw inference weights (`{'net': …}`; training scaffolding stripped) |
| `voicepacks/<voice>__<register>.pt` | 256-d style vector — `[0:128]` acoustic, `[128:256]` prosodic |
| `voicepacks.npz` | all 18 voicepacks as numpy arrays (used by the ONNX path; loads without torch) |
| `infer_onnx.py` | dependency-light inference: onnxruntime + misaki only |
| `infer.py` | raw PyTorch inference (uses bundled `styletts2/`) |
| `export_onnx.py`, `onnx_stft.py` | reproduce `susurro.onnx` from `susurro.pth` |
| `config.yml`, `kokoro_symbols.py` | model config + the 178-token phoneme map |
| `styletts2/` | bundled StyleTTS2 model code + PLBERT/ASR/JDC assets (raw path) |
| `samples/` | rendered demo clips |
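
The voicepack layout in the table above can be illustrated with a minimal sketch. The helper name `split_style` is hypothetical (it is not part of the repo), and the demo vector is a stand-in rather than a real voicepack:

```python
import numpy as np

def split_style(ref_s):
    """Split a 256-d voicepack into its (acoustic, prosodic) halves.

    Layout taken from the file table: [0:128] acoustic, [128:256] prosodic.
    """
    ref_s = np.asarray(ref_s, dtype=np.float32).reshape(-1)
    if ref_s.shape[0] != 256:
        raise ValueError(f"expected 256 values, got {ref_s.shape[0]}")
    return ref_s[:128], ref_s[128:]

# Stand-in vector for illustration; a real one would come from voicepacks.npz.
demo = np.arange(256, dtype=np.float32)
acoustic, prosodic = split_style(demo)
```

With a real voicepack, the same call would presumably take `np.load("voicepacks.npz")["voice_a__tender"]` as input.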

---

## Quickstart — ONNX (recommended, no PyTorch)

The ONNX graph is fully self-contained; you only need onnxruntime, misaki (G2P), and numpy.

```bash
pip install -r requirements-onnx.txt
python infer_onnx.py \
  --voice voice_a --register tender \
  --text "Hey, I wasn't expecting you tonight." \
  --out hello.wav
```

In Python:

```python
import numpy as np
import onnxruntime as ort
from misaki import en
from kokoro_symbols import TextCleaner

sess = ort.InferenceSession("susurro.onnx", providers=["CPUExecutionProvider"])
g2p, clean = en.G2P(trf=False, british=False, fallback=None), TextCleaner()

ipa = g2p("The keys are on the table by the door.")[0].replace("ʏ", "y")
input_ids = np.array([[0, *clean(ipa), 0]], dtype=np.int64)          # BOS/EOS = 0
ref_s = np.load("voicepacks.npz")["voice_c__whisper"].reshape(1, 256).astype(np.float32)

audio = sess.run(None, {"input_ids": input_ids, "ref_s": ref_s})[0]  # float32, 24 kHz
```

**Inputs:** `input_ids [1, T] int64` (phoneme token ids wrapped with `0`), `ref_s [1, 256]`
(a voicepack). **Output:** `audio [N] float32` at 24 kHz. The token axis and audio length are
dynamic.
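
The Python snippet above stops at a float32 array. A minimal sketch of writing such an array to a 16-bit mono WAV with only the standard library and numpy follows. `save_wav` is a hypothetical helper, and the sketch assumes the samples lie roughly in [-1, 1] (the model card only states float32), so it clips before converting:

```python
import os
import tempfile
import wave

import numpy as np

def save_wav(path, audio, sample_rate=24000):
    """Write float audio as 16-bit PCM mono WAV; returns duration in seconds."""
    audio = np.asarray(audio, dtype=np.float32).reshape(-1)
    # Assumption: samples are nominally in [-1, 1]; clip to avoid int16 wraparound.
    pcm = (np.clip(audio, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)          # mono
        wf.setsampwidth(2)          # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return len(pcm) / sample_rate

# Stand-in signal: half a second of a 220 Hz tone instead of model output.
tone = (0.2 * np.sin(2 * np.pi * 220 * np.arange(12000) / 24000)).astype(np.float32)
out_path = os.path.join(tempfile.gettempdir(), "susurro_demo.wav")
seconds = save_wav(out_path, tone)
```

Replacing `tone` with the `audio` array returned by `sess.run` should work the same way, under the range assumption stated above.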

## Quickstart — raw PyTorch

The repo bundles the StyleTTS2 model code and the PLBERT/ASR/JDC utility-net assets under
`styletts2/`, so a plain clone runs without fetching anything else.

```bash
pip install -r requirements-raw.txt
python infer.py \
  --voicepack voicepacks/voice_a__tender.pt \
  --text "Hey, I wasn't expecting you tonight." \
  --out hello.wav
```

`infer.py` runs on CPU or CUDA (auto-detected; set `--device cpu|cuda` to choose explicitly).
`transformers` is pinned to 4.x in `requirements-raw.txt` because the bundled PLBERT loader
targets `AlbertModel` as it behaved at training time.

## Voices & registers

`voice_a`, `voice_b`, `voice_c` × `{neutral, breathless, playful, urgent, tender, whisper}`.
Pick any combination by name. **whisper** and **urgent** are the most acoustically distinct;
**breathless / neutral / playful / tender** cluster more tightly in style space, a limit on
subtle registers inherited from the synthetic source (see Intended use & limitations).
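
One way to inspect how closely registers sit in style space is a per-register centroid cosine matrix. The sketch below is an assumption about how such a check could be done, not the evaluation code used for this model: it averages each register's 256-d vector over the three voices and compares the centroids. The function names are hypothetical, and random vectors stand in for the real voicepacks:

```python
import numpy as np

VOICES = ("voice_a", "voice_b", "voice_c")
REGISTERS = ("neutral", "breathless", "playful", "urgent", "tender", "whisper")

def register_centroids(packs):
    """Mean style vector per register, averaged over the three voices."""
    return {
        reg: np.mean(
            [np.asarray(packs[f"{v}__{reg}"], dtype=np.float32).reshape(-1) for v in VOICES],
            axis=0,
        )
        for reg in REGISTERS
    }

def cosine_matrix(centroids):
    """6x6 cosine similarity between register centroids, in REGISTERS order."""
    mat = np.stack([centroids[r] for r in REGISTERS])
    unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return unit @ unit.T

# Random stand-ins keyed like the real voicepacks ("<voice>__<register>").
rng = np.random.default_rng(0)
fake_packs = {
    f"{v}__{r}": rng.standard_normal(256).astype(np.float32)
    for v in VOICES
    for r in REGISTERS
}
sim = cosine_matrix(register_centroids(fake_packs))
```

To run this on the shipped vectors, `fake_packs` could be replaced by `np.load("voicepacks.npz")`, which supports the same key lookup. Averaging over voices mixes speaker identity into each centroid, so the matrix is a rough guide and listening remains the better test.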

---

## Training data

| Source | Hours | License | Role |
|---|---|---|---|
| LibriTTS-R (train-clean-100, 247 spk) | 44.2 | **CC BY 4.0** | real-speech base — duration/F0 robustness |
| Synthetic data (3 target voices) | 24.5 | - | the 3 voices + 6 registers |
| **Mixed total** | **70.3** | - | 250 speakers, reference-based multispeaker |

Holdouts sealed pre-training: `eval_text`, `eval_xreg`, `calibration` (synthetic only).

## Evaluation

Scores are compared against the ground-truth Higgs ceiling (CER 0.004 / UTMOS 4.25). The best
checkpoint was selected by eval rather than by highest epoch, because stage 2 is non-monotonic.

| Metric | susurro | GT ceiling | Notes |
|---|---|---|---|
| CER (faster-whisper, eval_text) | **0.011** | 0.004 | intelligibility round-trip; near ceiling |
| UTMOS | **4.32** | 4.25 | no-reference naturalness; above the synthetic-data ceiling |
| register separation | see note | see note | report per-register centroid cosine + ears (silhouette is speaker-confounded) |

Winner checkpoint: **`epoch_2nd_00024`** (selected over epochs 18–24).
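
For readers unfamiliar with the CER metric in the table, here is a minimal sketch of character error rate as Levenshtein distance divided by reference length. It is an illustration only: the text normalization applied before scoring (casing, punctuation) is not stated in this card, and `char_error_rate` is a hypothetical helper, not the evaluation script:

```python
def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)

example = char_error_rate("kitten", "sitting")  # 3 edits over 6 reference chars
```

In a round-trip setup, `reference` would be the input text and `hypothesis` the ASR transcript of the synthesized audio.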

## Intended use & limitations

- **Use:** expressive English narration/dialogue for the 3 provided voices.
- **Not:** voice cloning of arbitrary speakers; non-English text (English G2P only).
- **Limitations:** synthetic-voice timbre is bounded by the source quality. Register strength is
  uneven — **whisper and urgent are clearly distinct; breathless, neutral, playful, tender are
  subtle** (close in style space, matching the source). Intelligibility/naturalness are strong
  across all registers and voices.

## Reproducing the ONNX

```bash
pip install -r requirements-raw.txt onnx onnxruntime
python export_onnx.py      # susurro.pth -> susurro.onnx, prints ONNX-vs-PyTorch parity
```
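
The exact parity metric that `export_onnx.py` prints is not documented here. As a rough illustration of what an ONNX-vs-PyTorch comparison might compute, the sketch below reports length difference, maximum absolute difference, and RMSE over the overlapping samples. `parity_report` and the stand-in signals are hypothetical:

```python
import numpy as np

def parity_report(reference, candidate):
    """Compare two waveforms sample by sample over their common length."""
    ref = np.asarray(reference, dtype=np.float64).reshape(-1)
    cand = np.asarray(candidate, dtype=np.float64).reshape(-1)
    n = min(ref.size, cand.size)  # assumption: compare the overlapping prefix only
    diff = ref[:n] - cand[:n]
    return {
        "length_delta": int(cand.size - ref.size),
        "max_abs_diff": float(np.max(np.abs(diff))) if n else 0.0,
        "rmse": float(np.sqrt(np.mean(diff ** 2))) if n else 0.0,
    }

# Stand-ins: a tone and the same tone with a small constant offset.
t = np.arange(2400) / 24000.0
torch_like = np.sin(2 * np.pi * 220 * t)
onnx_like = torch_like + 1e-4
report = parity_report(torch_like, onnx_like)
```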


## Licensing

- **Weights (`susurro.pth`, `susurro.onnx`, voicepacks):** **Apache-2.0** (from-scratch model). See `LICENSE`.
- **Bundled `styletts2/` model code:** **MIT** — StyleTTS2, © 2023 Aaron (Yinghao) Li. See `styletts2/LICENSE`.
- **Bundled utility nets:** PLBERT / Kokoro lineage (**Apache-2.0**, hexgrad); ASR & JDC (StyleTTS2 MIT).
- **Training data attribution:** LibriTTS-R — CC BY 4.0 (Koizumi et al., 2023). misaki[en] G2P.

## Citation

```bibtex
@software{susurro_2026,
  title  = {susurro: expressive multi-register TTS},
  author = {Aimeri},
  year   = {2026},
  note   = {Kokoro-inspired StyleTTS2, trained on LibriTTS-R (CC BY 4.0) + synthetic registers}
}
```