PrimeTTS — tiny bilingual zh‑TW + English TTS (8 kHz, CPU)
A 4.63M‑parameter Mandarin (Taiwan) + English text‑to‑speech model that runs entirely on CPU and emits 8 kHz audio — sized for G.711 telephony and on‑device (Jetson‑class) use. One model, one voice: Chinese, English, and code‑mix through a single frontend (no language routing).
🔊 Live demo: https://huggingface.co/spaces/Luigi/PrimeTTS-vs-Inflect-Nano-v1 · 🧩 Base: owensong/Inflect-Nano-v1 (fine‑tune, same frozen architecture)
| Parameters | 4.63M (3.47M acoustic + 1.17M vocoder) |
| Sample rate | 8 kHz (telephony‑band) |
| Runtime | onnxruntime, CPU‑only, torch‑free at inference |
| Languages | zh‑TW (Traditional) + English + code‑mix |
| Architecture | FastSpeech‑style (no attention) + Snake‑HiFiGAN — frozen, no NAS |
| License | Apache‑2.0 |
Quickstart (inference, CPU)
pip install onnxruntime numpy soundfile g2pw g2p_en cn2an
huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS
# from inside the PrimeTTS dir (uses the bundled frontend + scripts)
import sys; sys.path.insert(0, "scripts")
import json, numpy as np, onnxruntime as ort, soundfile as sf
import frontend_bopomofo as F
from synth_from_text import host_regulate # numpy length‑regulator
meta = json.load(open("meta.json"))
enc = ort.InferenceSession("acoustic_encoder.onnx", providers=["CPUExecutionProvider"])
dec = ort.InferenceSession("acoustic_decoder.onnx", providers=["CPUExecutionProvider"])
voc = ort.InferenceSession("vocoder.onnx", providers=["CPUExecutionProvider"])
o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.") # text -> phone/tone/lang ids
ph, tn, lg = (np.array([o[k]], np.int64) for k in ("phone_ids", "tone_ids", "lang_ids"))
cond, dur, pitch = enc.run(None, {"phone": ph, "tone": tn,
                                  "lang": lg, "speaker": np.zeros(1, np.int64)})
reg = host_regulate(cond, dur, pitch, meta["abs_frame_bins"], meta["max_frames"])
mel = dec.run(None, {k: reg[k] for k in
                     ["frames","frame_meta","local_ctx_raw","abs_pos","pitch_frame","frame_mask"]})[0]
wav = voc.run(None, {"mel": mel.astype(np.float32)})[0].reshape(-1)
sf.write("out.wav", wav, meta["sample_rate"])
The whole pipeline (acoustic_encoder.onnx → numpy length‑regulator → acoustic_decoder.onnx → vocoder.onnx) is
torch‑free and runs as‑is on a Jetson Nano CPU. See scripts/synth_from_text.py for the full runtime.
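To make the length‑regulation step concrete, here is a minimal sketch of a generic FastSpeech‑style regulator. It is not the bundled host_regulate (which also prepares pitch, positions, and masks for the decoder); the function name, its arguments, and the max_frames truncation are assumptions for illustration, and plain Python lists are used instead of numpy to keep it dependency‑free.

```python
def length_regulate(phone_vecs, durations, max_frames=None):
    """Repeat each per-phone vector by its frame count (hypothetical helper).

    phone_vecs: list of per-phone feature vectors (lists of floats)
    durations:  one frame count per phone (floats are rounded)
    max_frames: optional cap on the output length (assumed behavior)
    """
    if len(phone_vecs) != len(durations):
        raise ValueError("need exactly one duration per phone")
    frames = []
    for vec, dur in zip(phone_vecs, durations):
        # Round to whole frames; negative predictions clamp to zero.
        n = max(0, int(round(dur)))
        frames.extend([vec] * n)
    if max_frames is not None:
        frames = frames[:max_frames]
    return frames


# Three phones with durations 2, 0, 3 expand to 5 frames.
phones = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frames = length_regulate(phones, [2, 0, 3])
print(len(frames))  # prints 5
```

A phone with duration 0 simply disappears from the frame sequence, which is why wrong durations damage the mel so directly.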
Why it works — the two levers
Inflect‑Nano‑v1's 4.63M architecture is not capacity‑limited for this task (the original English checkpoint already scores ~0.05 CER). Our first retrains were still unintelligible — not because of size, but because of two fixable things. Both fixes keep the architecture frozen:
| Lever | What went wrong / the fix | Held‑out Mandarin CER |
|---|---|---|
| 1. Phone‑level alignment | Crude char/letter‑CTC → wrong per‑phone durations → over‑smoothed, garbled mel. Replace with true phone‑level forced alignment (align_durations_v4.py). | 0.88 → 0.40 |
| 2. Diverse training text | A narrow corpus (~234 Han chars) leaves ~39% of held‑out characters unseen, and the model can't pronounce what it never saw. Expand coverage (select_diverse_text.py). | 0.40 → 0.06 |
Applied to both languages in one single‑voice corpus, the same recipe yields a genuinely bilingual model (zh ≈ 0.07, English ≈ 0.04–0.12 WER) — no routing. Everything else is Inflect‑Nano‑v1's defaults.
Takeaway for your own fine‑tune: alignment quality and vocabulary coverage dominate. Get those two right and a tiny frozen model is enough.
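Before training, it is cheap to check how much of your held‑out text the training corpus actually covers. The sketch below counts distinct Han characters (types, not tokens) in the basic CJK block; both function names are hypothetical, and whether the ~39% figure above was measured per type or per token is not stated here.

```python
def han_chars(text):
    """Return the set of CJK Unified Ideographs (basic block) found in text."""
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}


def unseen_rate(train_lines, heldout_lines):
    """Fraction of distinct held-out Han characters never seen in training."""
    seen = set()
    for line in train_lines:
        seen |= han_chars(line)
    heldout = set()
    for line in heldout_lines:
        heldout |= han_chars(line)
    if not heldout:
        return 0.0
    return len(heldout - seen) / len(heldout)


train = ["您好歡迎", "謝謝您"]
heldout = ["歡迎光臨", "您好"]
# 光 and 臨 are the only held-out characters missing from training: 2 of 6.
print(round(unseen_rate(train, heldout), 3))  # prints 0.333
```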
Architecture
- Acoustic: MicroFastSpeech (~3.47M), with depthwise Conv‑FFN, no attention, external durations + length regulator, frame‑pitch, BiGRU, postnet.
- Vocoder: Snake‑HiFiGAN (~1.17M), 8 kHz variant snake_8k (sr 8000, n_fft 512, hop 128, 80 mels).
- Frontend: g2pw (Taiwan bopomofo + polyphone disambiguation) + g2p_en (arpabet), merged into one phone sequence with per‑phone language ids → handles zh, en, and code‑mix in a single pass.
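The snake_8k numbers fix the time and frequency resolution of everything downstream. A short worked calculation, assuming the analysis window equals n_fft and that the vocoder emits exactly one hop of samples per mel frame (typical for HiFi‑GAN‑style vocoders, but an assumption here):

```python
SAMPLE_RATE = 8000  # Hz
N_FFT = 512         # FFT size in samples (window length assumed equal)
HOP = 128           # samples between successive mel frames
N_MELS = 80

frames_per_second = SAMPLE_RATE / HOP    # 62.5 mel frames per second
hop_ms = 1000 * HOP / SAMPLE_RATE        # 16.0 ms per frame
window_ms = 1000 * N_FFT / SAMPLE_RATE   # 64.0 ms analysis window
nyquist_hz = SAMPLE_RATE / 2             # 4000.0 Hz upper band edge
linear_bins = N_FFT // 2 + 1             # 257 linear bins feeding 80 mels


def samples_for_frames(n_frames, hop=HOP):
    """Waveform length for n_frames mel frames, assuming hop-sized upsampling."""
    return n_frames * hop


print(frames_per_second, hop_ms, window_ms, nyquist_hz, linear_bins)
print(samples_for_frames(125) / SAMPLE_RATE)  # 125 frames -> 2.0 seconds
```

Every per‑phone duration in the pipeline is therefore a count of 16 ms frames.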
Train / fine‑tune your own (voice or language)
The pipeline has five stages (data → align → train acoustic → train vocoder → export), covered by numbered steps 1 to 6 below, followed by an evaluation step. Repo layout:
acoustic_encoder.onnx acoustic_decoder.onnx vocoder.onnx meta.json symbol_table.json ← deployable weights
acoustic_zh_v2_35k.pt ← checkpoint (resume/fine‑tune)
scripts/ frontend, aligner, corpus‑gen, diverse‑text, train, export, eval
inflect_nano/ the trainer (acoustic.py + vocoder.py), forked from Inflect‑Nano‑v1 (LICENSE included)
Prerequisites: Python 3.12 and a GPU for training; pip install torch torchaudio transformers onnxruntime soundfile librosa g2pw g2p_en cn2an opencc. You also need a single‑speaker teacher TTS (or clean recordings) to supply the audio. Put the trainer on your path: PYTHONPATH=. python -m inflect_nano.acoustic ….
1 · Diverse text → train_zh.tsv, train_en.tsv
python scripts/select_diverse_text.py --lang zh --n 6000 --out train_zh.tsv
python scripts/select_diverse_text.py --lang en --n 6000 --out train_en.tsv
Tatoeba → OpenCC s2twp (zh) → greedy char/word‑coverage selection. Coverage is the #1 driver of
held‑out quality — don't skimp here.
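Greedy coverage selection can be sketched as follows. This is an illustration of the general technique, not the code in select_diverse_text.py: the function name, the stop‑when‑saturated rule, and the default of using characters as coverage units are all assumptions (a real run asking for 6000 lines would likely keep filling after coverage saturates).

```python
def greedy_select(sentences, n, units=set):
    """Pick up to n sentences, each time taking the one adding the most unseen units.

    units maps a sentence to its coverage units: characters by default,
    or e.g. lambda s: set(s.lower().split()) for English words.
    """
    covered = set()
    remaining = list(sentences)
    chosen = []
    while remaining and len(chosen) < n:
        best = max(remaining, key=lambda s: len(units(s) - covered))
        if not units(best) - covered:
            break  # no remaining sentence adds new coverage
        chosen.append(best)
        covered |= units(best)
        remaining.remove(best)
    return chosen, covered


pool = ["今天天氣很好", "今天很好", "我喜歡喝茶", "天氣好"]
chosen, covered = greedy_select(pool, 2)
print(chosen, len(covered))
```

Here the second and fourth sentences are never picked, because every character they contain is already covered by the first.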
2 · Teacher corpus → corpus/{*.wav, manifest.jsonl}
python scripts/gen_breezy_corpus.py --corpus train_zh.tsv --out-dir corpus --cer-thresh 0.30
Synthesizes each line in one voice, keeping a clip only if an ASR transcript matches the text (here BreezyVoice + Breeze‑ASR‑25, t2s‑normalized). Any clean single‑speaker source works — for native code‑mix, use the same voice for zh and en.
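The ASR gate boils down to a character error rate (CER) check against a threshold. A minimal sketch of that idea, assuming a plain Levenshtein distance and an inclusive threshold; the helper names are hypothetical, and the text normalization the real script applies (t2s conversion, punctuation handling) is left out:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    ref, hyp = ref.replace(" ", ""), hyp.replace(" ", "")
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def keep_clip(text, asr_text, cer_thresh=0.30):
    """Keep a synthesized clip only if the ASR transcript is close enough."""
    return cer(text, asr_text) <= cer_thresh


print(cer("歡迎使用", "歡迎試用"))        # prints 0.25 (one substitution in four characters)
print(keep_clip("歡迎使用", "歡迎試用"))  # prints True
print(keep_clip("歡迎使用", "歡使"))      # prints False (CER 0.5)
```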
3 · Phone‑level alignment → align.jsonl ⭐ the key step
python scripts/align_durations_v4.py --manifest corpus/manifest.jsonl --out align.jsonl
Per‑phone durations from the real audio via espeak phoneme‑CTC + torchaudio.forced_align.
Skipping or approximating this is what makes tiny TTS sound garbled.
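Once an aligner has produced per‑phone time spans, they still have to become integer mel‑frame counts. The sketch below shows one way to do that without letting rounding error accumulate; it is not taken from align_durations_v4.py. The function name is hypothetical, and it assumes contiguous spans (silences carried as their own phones) and the 8 kHz / hop 128 frame rate.

```python
def spans_to_durations(spans, sample_rate=8000, hop=128):
    """Convert per-phone (start_s, end_s) spans into integer mel-frame counts.

    Boundaries are rounded on the cumulative timeline, so the durations
    always sum to the rounded end of the last phone.
    """
    fps = sample_rate / hop  # 62.5 frames per second at 8 kHz, hop 128
    durations = []
    prev_frame = int(round(spans[0][0] * fps)) if spans else 0
    for _start, end in spans:
        # Never let a boundary move backwards after rounding.
        end_frame = max(prev_frame, int(round(end * fps)))
        durations.append(end_frame - prev_frame)
        prev_frame = end_frame
    return durations


spans = [(0.00, 0.10), (0.10, 0.25), (0.25, 0.40)]
print(spans_to_durations(spans))  # prints [6, 10, 9]
```

Rounding each span independently would give 6 + 9 + 9 = 24 frames for a 25‑frame clip; rounding boundaries keeps the total consistent with the mel length.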
4 · Train the acoustic model → acoustic_8k/…pt
PYTHONPATH=. python -m inflect_nano.acoustic --durations-jsonl align.jsonl \
--out-dir acoustic_8k --vocoder-variant snake_8k --sample-rate 8000 \
--steps 60000 --batch-size 16 --vocoder-checkpoint <vocoder.pt> --vocoder-mel-weight 1.0
Mix languages in one corpus, single speaker (see scripts/run_bilingual.sh).
5 · Train the 8 kHz vocoder → vocoder_8k/…pt
PYTHONPATH=. python -m inflect_nano.vocoder --train-jsonl voc_rows.jsonl \
--out-dir vocoder_8k --variant snake_8k --steps 40000 --stft-weight 2.5
Train on the same diverse audio. Higher --stft-weight → crisper waveform (see scripts/run_voc_retrain.sh).
6 · Export to ONNX → the deployable weights
python scripts/export_8k.py --acoustic-ckpt acoustic_8k/…pt --vocoder-ckpt vocoder_8k/…pt --out-dir onnx/
7 · Evaluate
python scripts/synth_from_text.py --onnx-dir onnx --out-dir syn --texts eval.jsonl
python scripts/assess_big.py --synth-dir syn # offline X‑ASR CER/WER
Use ≥30 held‑out sentences — small eval sets are too noisy to trust.
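To see how noisy a small eval set is, you can bootstrap a confidence interval over per‑sentence results. This is a generic statistical sketch, not part of assess_big.py: the function names are hypothetical, and it assumes you have per‑sentence (edit errors, reference length) pairs and want a corpus‑level CER (total errors over total reference characters).

```python
import random


def corpus_cer(pairs):
    """pairs: list of (edit_errors, reference_length), one per sentence."""
    total_len = sum(n for _, n in pairs)
    return sum(e for e, _ in pairs) / total_len if total_len else 0.0


def bootstrap_interval(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for corpus CER, resampling sentences."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        stats.append(corpus_cer(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Made-up per-sentence results, for illustration only.
small_set = [(0, 12), (3, 10), (1, 15), (0, 8), (2, 11)]
print(corpus_cer(small_set), bootstrap_interval(small_set))
```

Running the same function on 5 sentences and then on 30 or more shows how much the interval narrows as the set grows.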
Note on the published weights: these are the zh‑focused checkpoint (strong Mandarin). The more balanced bilingual checkpoint (zh ≈ 0.07 / English ≈ 0.04 WER) replaces them as training completes; the demo always loads whatever is here.
Credits & licenses
- Base model / trainer: owensong/Inflect-Nano-v1 (Apache‑2.0; see inflect_nano/LICENSE.inflect-nano)
- Teacher / gate ASR: BreezyVoice · Breeze-ASR-25 (MediaTek Research)
- Aligner: facebook/wav2vec2-lv-60-espeak-cv-ft + torchaudio.forced_align
- Frontend: g2pw (Taiwan readings) + g2p_en · Eval ASR: sherpa‑onnx X‑ASR (zh‑en Zipformer)
- Text: Tatoeba (CC‑BY 2.0 FR)
This repository: Apache‑2.0.