PrimeTTS — tiny bilingual zh‑TW + English TTS (8 kHz, CPU)

A 4.63M‑parameter Mandarin (Taiwan) + English text‑to‑speech model that runs entirely on CPU and emits 8 kHz audio — sized for G.711 telephony and on‑device (Jetson‑class) use. One model, one voice: Chinese, English, and code‑mix through a single frontend (no language routing).

🔊 Live demo: https://huggingface.co/spaces/Luigi/PrimeTTS-vs-Inflect-Nano-v1 · 🧩 Base: owensong/Inflect-Nano-v1 (fine‑tune, same frozen architecture)

| Property | Value |
| --- | --- |
| Parameters | 4.63M (3.47M acoustic + 1.17M vocoder) |
| Sample rate | 8 kHz (telephony‑band) |
| Runtime | onnxruntime, CPU‑only, torch‑free at inference |
| Languages | zh‑TW (Traditional) + English + code‑mix |
| Architecture | FastSpeech‑style (no attention) + Snake‑HiFiGAN, frozen, no NAS |
| License | Apache‑2.0 |

Quickstart (inference, CPU)

pip install onnxruntime numpy soundfile g2pw g2p_en cn2an
huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS
cd PrimeTTS

# Python: run from inside the PrimeTTS dir (uses the bundled frontend + scripts)
import sys; sys.path.insert(0, "scripts")
import json, numpy as np, onnxruntime as ort, soundfile as sf
import frontend_bopomofo as F
from synth_from_text import host_regulate          # numpy length‑regulator

meta = json.load(open("meta.json"))
enc = ort.InferenceSession("acoustic_encoder.onnx", providers=["CPUExecutionProvider"])
dec = ort.InferenceSession("acoustic_decoder.onnx", providers=["CPUExecutionProvider"])
voc = ort.InferenceSession("vocoder.onnx",          providers=["CPUExecutionProvider"])

o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.")   # text -> phone/tone/lang ids
ph, tn, lg = (np.array([o[k]], np.int64) for k in ("phone_ids", "tone_ids", "lang_ids"))
cond, dur, pitch = enc.run(None, {"phone": ph, "tone": tn,
                                  "lang": lg, "speaker": np.zeros(1, np.int64)})
reg = host_regulate(cond, dur, pitch, meta["abs_frame_bins"], meta["max_frames"])
mel = dec.run(None, {k: reg[k] for k in
      ["frames","frame_meta","local_ctx_raw","abs_pos","pitch_frame","frame_mask"]})[0]
wav = voc.run(None, {"mel": mel.astype(np.float32)})[0].reshape(-1)
sf.write("out.wav", wav, meta["sample_rate"])

The whole pipeline (acoustic_encoder.onnx → numpy length‑regulator → acoustic_decoder.onnx → vocoder.onnx) is torch‑free and runs as‑is on a Jetson Nano CPU. See scripts/synth_from_text.py for the full runtime.
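For readers new to the length‑regulator step, here is a minimal numpy sketch of the general idea: each phone‑level vector is repeated for its number of frames, then padded to a fixed length. The function name `length_regulate` and its inputs are illustrative assumptions. This is not the repo's `host_regulate`, which also builds the extra decoder inputs (frame_meta, abs_pos, and so on).

```python
import numpy as np

def length_regulate(phone_feats, durations, max_frames):
    """Repeat each phone vector durations[i] times, then pad or crop to max_frames.

    phone_feats: (n_phones, dim) float array
    durations:   (n_phones,) integer frame counts (negatives are clamped to 0)
    Returns (frames, mask): frames is (max_frames, dim); mask is 1.0 on real frames.
    """
    durations = np.maximum(np.asarray(durations, dtype=np.int64), 0)
    expanded = np.repeat(phone_feats, durations, axis=0)[:max_frames]
    n_real = expanded.shape[0]
    frames = np.zeros((max_frames, phone_feats.shape[1]), dtype=phone_feats.dtype)
    frames[:n_real] = expanded
    mask = np.zeros(max_frames, dtype=np.float32)
    mask[:n_real] = 1.0
    return frames, mask

# Three phones with 2, 0, and 3 frames: the zero-duration phone is skipped.
feats = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], dtype=np.float32)
frames, mask = length_regulate(feats, np.array([2, 0, 3]), max_frames=8)
```

The fixed `max_frames` padding plus a mask is one common way to keep ONNX input shapes static; whether the bundled runtime does exactly this is not confirmed here.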


Why it works — the two levers

Inflect‑Nano‑v1's 4.63M architecture is not capacity‑limited for this task (the original English checkpoint already scores ~0.05 CER). Our first retrains were still unintelligible, and the cause was not model size but two fixable problems: duration alignment and text coverage. Both fixes keep the architecture frozen:

| Lever | What went wrong / the fix | Held‑out Mandarin CER |
| --- | --- | --- |
| 1. Phone‑level alignment | Crude char/letter‑CTC → wrong per‑phone durations → over‑smoothed, garbled mel. Replace with true phone‑level forced alignment (align_durations_v4.py). | 0.88 → 0.40 |
| 2. Diverse training text | A narrow corpus (~234 Han chars) leaves ~39% of held‑out characters unseen, and the model can't pronounce what it never saw. Expand coverage (select_diverse_text.py). | 0.40 → 0.06 |

Applied to both languages in one single‑voice corpus, the same recipe yields a genuinely bilingual model (zh ≈ 0.07 CER, English ≈ 0.04–0.12 WER) with no language routing. Everything else is Inflect‑Nano‑v1's defaults.

Takeaway for your own fine‑tune: alignment quality and vocabulary coverage dominate. Get those two right and a tiny frozen model is enough.
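To check the coverage lever on your own data, a rough sketch like the following can estimate how many held‑out Han characters never appear in the training text. The function name `unseen_char_rate` is hypothetical, and counting per character occurrence (rather than per distinct character) is an assumption; the source does not say which way the ~39% figure was computed.

```python
def unseen_char_rate(train_texts, heldout_texts):
    """Fraction of held-out Han character occurrences never seen in training text."""
    def han_chars(texts):
        # Keep only characters in the main CJK Unified Ideographs block.
        return [ch for t in texts for ch in t if "\u4e00" <= ch <= "\u9fff"]

    seen = set(han_chars(train_texts))
    held = han_chars(heldout_texts)
    if not held:
        return 0.0
    return sum(ch not in seen for ch in held) / len(held)

# Toy example: 3 of the 7 held-out characters are missing from training.
train = ["您好", "歡迎使用"]
heldout = ["您好嗎", "歡迎光臨"]
rate = unseen_char_rate(train, heldout)
```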


Architecture

  • AcousticMicroFastSpeech (~3.47M): depthwise Conv‑FFN, no attention, external durations + length regulator, frame‑pitch, BiGRU, postnet.
  • Vocoder — Snake‑HiFiGAN (~1.17M), 8 kHz variant snake_8k (sr 8000, n_fft 512, hop 128, 80 mels).
  • Frontend: g2pw (Taiwan bopomofo + polyphone disambiguation) + g2p_en (arpabet), merged into one phone sequence with per‑phone language ids → handles zh, en, and code‑mix in a single pass.
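The single‑pass frontend idea can be sketched as follows: split the text into Han and Latin runs, convert each run, and keep a parallel list of language ids. This is only an illustration. The ids `LANG_ZH`/`LANG_EN` and the helper names are assumptions, and the per‑character / per‑word units stand in for the real g2pw and g2p_en phone output.

```python
import re

# Hypothetical language ids; the real values are defined by the repo's frontend.
LANG_ZH, LANG_EN = 0, 1

# A run is either consecutive Han characters or one Latin word.
_RUN = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z']+")

def split_runs(text):
    """Return [(lang_id, run_text), ...], skipping punctuation and spaces."""
    runs = []
    for m in _RUN.finditer(text):
        run = m.group(0)
        lang = LANG_ZH if "\u4e00" <= run[0] <= "\u9fff" else LANG_EN
        runs.append((lang, run))
    return runs

def merge_units(runs):
    """Flatten runs into one unit sequence plus a parallel language-id list.

    Stand-in for G2P: each Han character or lowercased English word is one unit.
    """
    units, lang_ids = [], []
    for lang, run in runs:
        pieces = list(run) if lang == LANG_ZH else [run.lower()]
        units.extend(pieces)
        lang_ids.extend([lang] * len(pieces))
    return units, lang_ids

units, lang_ids = merge_units(split_runs("您好,歡迎使用 PrimeTTS。Thank you"))
```

Because the language id travels with every unit, the acoustic model sees one mixed sequence and no routing decision is needed at inference time.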

Train / fine‑tune your own (voice or language)

The pipeline has 5 stages (data → align → train acoustic → train vocoder → export), covered by steps 1–6 below, followed by an evaluation step. Repo layout:

acoustic_encoder.onnx  acoustic_decoder.onnx  vocoder.onnx  meta.json  symbol_table.json   ← deployable weights
acoustic_zh_v2_35k.pt                                                                       ← checkpoint (resume/fine‑tune)
scripts/        frontend, aligner, corpus‑gen, diverse‑text, train, export, eval
inflect_nano/   the trainer (acoustic.py + vocoder.py), forked from Inflect‑Nano‑v1 (LICENSE included)

Prerequisites: Python 3.12 and a GPU for training, plus pip install torch torchaudio transformers onnxruntime soundfile librosa g2pw g2p_en cn2an opencc. You also need a single‑speaker teacher TTS (or clean recordings) for the audio. Put the trainer on your path: PYTHONPATH=. python -m inflect_nano.acoustic …

1 · Diverse text → train_zh.tsv, train_en.tsv

python scripts/select_diverse_text.py --lang zh --n 6000 --out train_zh.tsv
python scripts/select_diverse_text.py --lang en --n 6000 --out train_en.tsv

Tatoeba → OpenCC s2twp (zh) → greedy char/word‑coverage selection. Coverage is the #1 driver of held‑out quality — don't skimp here.
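The greedy coverage idea can be sketched in a few lines: repeatedly take the sentence that adds the most characters not yet covered. This is an illustrative sketch, not the logic of select_diverse_text.py; the function name `greedy_select` is hypothetical, and it treats every character (including punctuation) as a coverage unit.

```python
def greedy_select(sentences, n):
    """Pick up to n sentences, each time taking the one that adds the most unseen characters."""
    covered, chosen = set(), []
    pool = list(dict.fromkeys(sentences))  # de-duplicate while keeping order
    while pool and len(chosen) < n:
        best = max(pool, key=lambda s: len(set(s) - covered))
        if not set(best) - covered:
            break  # no remaining sentence adds a new character
        chosen.append(best)
        covered |= set(best)
        pool.remove(best)
    return chosen, covered

sentences = ["今天天氣很好", "天氣很好", "我喜歡喝茶", "今天我喝茶"]
chosen, covered = greedy_select(sentences, n=2)
```

This version rescans the pool on every pick, which is fine for a few thousand sentences but would need a smarter index for a much larger candidate set.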

2 · Teacher corpus → corpus/{*.wav, manifest.jsonl}

python scripts/gen_breezy_corpus.py --corpus train_zh.tsv --out-dir corpus --cer-thresh 0.30

Synthesizes each line in one voice, keeping a clip only if an ASR transcript matches the text (here BreezyVoice + Breeze‑ASR‑25, t2s‑normalized). Any clean single‑speaker source works — for native code‑mix, use the same voice for zh and en.
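The ASR gate can be pictured as a character error rate (CER) check against the 0.30 threshold. The sketch below assumes CER is Levenshtein distance over reference length and that a clip is kept when CER is at or below the threshold; the helper names are hypothetical, and gen_breezy_corpus.py may normalize text or compare differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length (spaces ignored)."""
    ref, hyp = ref.replace(" ", ""), hyp.replace(" ", "")
    return edit_distance(ref, hyp) / max(len(ref), 1)

def keep_clip(text, asr_transcript, thresh=0.30):
    """Keep a synthesized clip only if the ASR transcript is close enough to the text."""
    return cer(text, asr_transcript) <= thresh

good = keep_clip("歡迎使用語音服務", "歡迎使用語音服物")  # 1 error in 8 characters
bad = keep_clip("歡迎使用語音服務", "歡迎光臨")            # 6 errors in 8 characters
```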

3 · Phone‑level alignment → align.jsonl (the key step)

python scripts/align_durations_v4.py --manifest corpus/manifest.jsonl --out align.jsonl

Per‑phone durations are taken from the real audio via an espeak phoneme‑CTC model + torchaudio.functional.forced_align. Skipping or approximating this is what makes tiny TTS sound garbled.
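Forced alignment yields phone boundaries in time, while the acoustic model needs integer frame counts. One way to convert, sketched below, is to round the cumulative boundaries rather than each duration, so rounding errors do not pile up. The function name `spans_to_frames` is hypothetical and the rounding policy is an assumption; align_durations_v4.py may do this differently. The sample rate and hop come from the snake_8k settings (sr 8000, hop 128).

```python
import numpy as np

SAMPLE_RATE, HOP = 8000, 128  # snake_8k: 62.5 mel frames per second

def spans_to_frames(end_times_s):
    """Convert cumulative phone end times (seconds) to per-phone frame counts.

    Boundaries are rounded to the nearest frame first, then differenced, so the
    counts always sum to the rounded total length.
    """
    ends = np.asarray(end_times_s, dtype=np.float64)
    boundaries = np.rint(ends * SAMPLE_RATE / HOP).astype(np.int64)
    return np.diff(boundaries, prepend=0)

# Four phones ending at 0.10 s, 0.24 s, 0.30 s, and 0.50 s.
durs = spans_to_frames([0.10, 0.24, 0.30, 0.50])
```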

4 · Train the acoustic model → acoustic_8k/…pt

PYTHONPATH=. python -m inflect_nano.acoustic --durations-jsonl align.jsonl \
  --out-dir acoustic_8k --vocoder-variant snake_8k --sample-rate 8000 \
  --steps 60000 --batch-size 16 --vocoder-checkpoint <vocoder.pt> --vocoder-mel-weight 1.0

Mix languages in one corpus, single speaker (see scripts/run_bilingual.sh).

5 · Train the 8 kHz vocoder → vocoder_8k/…pt

PYTHONPATH=. python -m inflect_nano.vocoder --train-jsonl voc_rows.jsonl \
  --out-dir vocoder_8k --variant snake_8k --steps 40000 --stft-weight 2.5

Train on the same diverse audio. Higher --stft-weight → crisper waveform (see scripts/run_voc_retrain.sh).

6 · Export to ONNX → the deployable weights

python scripts/export_8k.py --acoustic-ckpt acoustic_8k/…pt --vocoder-ckpt vocoder_8k/…pt --out-dir onnx/

7 · Evaluate

python scripts/synth_from_text.py --onnx-dir onnx --out-dir syn --texts eval.jsonl
python scripts/assess_big.py --synth-dir syn        # offline X‑ASR CER/WER

Use ≥30 held‑out sentences — small eval sets are too noisy to trust.
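To see why small eval sets are noisy, a percentile bootstrap over per‑sentence scores gives a rough confidence interval for the mean. The sketch below uses made‑up CER values purely for illustration; `bootstrap_ci` is a hypothetical helper and not part of assess_big.py.

```python
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-sentence scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement n_boot times and sort the resulting means.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Made-up per-sentence CER values, for illustration only.
small = [0.0, 0.0, 0.25, 0.0, 0.1]   # 5 sentences
large = small * 8                     # same distribution, 40 sentences
lo_s, hi_s = bootstrap_ci(small)
lo_l, hi_l = bootstrap_ci(large)
```

With the same underlying score distribution, the 40‑sentence interval comes out much narrower than the 5‑sentence one, which is the practical reason for the ≥30 rule of thumb.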

Note on the published weights: these are the zh‑focused checkpoint (strong Mandarin). The more balanced bilingual checkpoint (zh ≈ 0.07 / English ≈ 0.04 WER) will replace them once training completes; the demo always loads whatever is published here.


Credits & licenses

  • Base model / trainer: owensong/Inflect-Nano-v1 (Apache‑2.0; see inflect_nano/LICENSE.inflect-nano)
  • Teacher / gate ASR: BreezyVoice · Breeze-ASR-25 (MediaTek Research)
  • Aligner: facebook/wav2vec2-lv-60-espeak-cv-ft + torchaudio.functional.forced_align
  • Frontend: g2pw (Taiwan readings) + g2p_en · Eval ASR: sherpa‑onnx X‑ASR (zh‑en Zipformer)
  • Text: Tatoeba (CC‑BY 2.0 FR)

This repository: Apache‑2.0.
