---
license: apache-2.0
language:
- zh
- en
tags:
- text-to-speech
- tts
- onnx
- gguf
- on-device
- jetson
- telephony
- vits
- mb-istft-vits
- multi-speaker
- mandarin
- taiwanese-mandarin
base_model: owensong/Inflect-Nano-v1
base_model_relation: finetune
library_name: onnxruntime
pipeline_tag: text-to-speech
---

# PrimeTTS — on-device zh-TW + English TTS

Taiwan-Mandarin + English text-to-speech built for on-device use (contact-centre, GPS, transit): one frontend handles Chinese, English, and **code-mix** with no language routing, and reads **entities** correctly — phone numbers, emails, addresses, prices, dates, temperatures, %, serials.

**Two models to know:**

| | **PrimeTTS v2.1** — flagship | **PrimeTTS v1** — leanest CPU |
|---|---|---|
| Folder | [`v21_mbistft_16k/`](./v21_mbistft_16k) | [`v1b_16k/`](./v1b_16k) · [`v1b_8k/`](./v1b_8k) |
| Architecture | MB-iSTFT-VITS (end-to-end, multi-speaker) | FastSpeech + Snake-HiFiGAN (+ pitch refiner) |
| Params | 37.9M | ~5.0M (16 kHz) / 4.09M (8 kHz) |
| Voices | **3 selectable** — Xinran ♀, Anchen ♂, Bowen ♂ | 1 — young ♀ zh-TW |
| Sample rate | 16 kHz | 16 kHz / 8 kHz |
| Held-out CER | **0.059** (zh/mix/en, 3-voice avg) | 0.11–0.15 (zh) |
| Best on | Jetson Nano **GPU** (also any CPU) | pure **CPU** — Nano **RTF 0.35** (8 kHz, 1 thread) |

Pick **v2.1** for multiple voices; pick **v1** when the budget is CPU-only and tight.

**The full family** (all MB-iSTFT-VITS except v1; all 16 kHz; single Xinran voice unless noted):

| model | folder | params (deploy) | CER | use when |
|---|---|---|---|---|
| **v2** | `v2_mbistft_16k/` | 34.7M (17.5M) | **0.027** | you want the cleanest single Xinran voice |
| **v2.1** | `v21_mbistft_16k/` | 37.9M (~18M) | 0.059 | you want a choice of 3 voices |
| **V2 Lite** | `v2lite_mbistft_16k/` | 24.8M (17.5M) | 0.041 | a lighter, still-good single voice for tighter GPU budgets |
| **v1** | `v1b_16k/`,`v1b_8k/` | ~5M | 0.11–0.15 | pure-CPU, real-time on a Jetson Nano |

(`v3_4.6M/` and the top-level `*.onnx` are legacy 24 kHz variants, kept for provenance.) V2 Lite uses the exact same ONNX I/O + frontend as v2 — it's a drop-in, smaller replacement.

> 🔊 **Live demo:** https://huggingface.co/spaces/Luigi/PrimeTTS-vs-Inflect-Nano-v1 — pick a model, pick a voice, type text.

---

## PrimeTTS v2.1 (`v21_mbistft_16k/`)

End-to-end **MB-iSTFT-VITS** (VAE + normalizing flow + adversarial multi-band iSTFT head; conv-only, no LSTM) with **3 selectable Taiwan-Mandarin voices**, chosen by an integer `sid` input (0 = Xinran ♀, 1 = Anchen ♂, 2 = Bowen ♂). 37.9M generator params, 16 kHz, `gin_channels=256` speaker conditioning.

**Quality** (36 held-out zh / code-mix / en sentences, X-ASR normalized CER):

| voice (`sid`) | CER | note |
|---|---|---|
| Xinran ♀ (0) | **0.059** | flagship voice, cleanest teacher |
| Anchen ♂ (1) | 0.069 | slight accent |
| Bowen ♂ (2) | 0.066 | slight accent |

### On-device deployment (measured, Jetson Nano gen-1 / Tegra X1)

Same runtime profile as the single-voice v2 (identical architecture). RTF = compute-time ÷ audio-time (lower is faster; < 1.0 = real-time).

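As a quick illustration of the RTF definition above, here is a minimal sketch of how one might measure it around any synthesis call. The helper names (`real_time_factor`, `timed_rtf`) and the example timing are hypothetical and not part of this repo; the measured numbers for this model are the ones in the table that follows.

```python
import time

def real_time_factor(compute_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = compute-time / audio-time; a value below 1.0 is faster than real time."""
    audio_seconds = num_samples / sample_rate
    return compute_seconds / audio_seconds

def timed_rtf(synthesize, sample_rate: int) -> float:
    """Time one call to `synthesize()` (hypothetical: any callable returning a 1-D waveform)."""
    start = time.perf_counter()
    wav = synthesize()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(wav), sample_rate)

# Illustrative numbers only: 2.0 s of 16 kHz audio produced in 1.04 s of compute.
rtf = real_time_factor(1.04, 32000, 16000)
print(f"RTF {rtf:.2f} ({1 / rtf:.1f}x real time)")
```

When benchmarking on a device, it usually helps to discard the first call (session warm-up) and average over several sentences of different lengths.
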
| Tier | Runtime | Precision | RTF | Notes |
|---|---|---|---|---|
| **GPU** | RapidSpeech.cpp ggml-CUDA, 1 CPU thread | fp32 | **0.42** (2.4× RT) | launch-bound floor on Maxwell (sm_53, no CUDA-graph replay) |
| **CPU** *(default)* | onnxruntime, 4 threads | fp32 | **0.52** (1.9× RT) | full quality, 117 MB |
| **CPU** | onnxruntime, 2 threads | fp32 | **0.77** (1.3× RT) | fewer cores, leaves headroom |

Both CPU tiers are full-fidelity and need no GPU. On this ARMv8.0 Cortex-A57, **fp32 is the fast format**: **int8 is not a speed lever** (static-int8 shifts the voice; dynamic-int8 preserves it but runs *slower* than fp32 — no dot-product / no FP16 arithmetic on this core), fp16 casts to fp32 (no speedup), and XNNPACK ≈ MLAS. The only on-device speed lever is a smaller/faster **architecture**, not quantization.

### Files

```
v21_mbistft_16k/primetts_v21_3voice.onnx    3-voice fp32 (117 MB) — full quality, all runtimes
```

### Quickstart

```bash
pip install onnxruntime numpy soundfile g2pw g2p_en cn2an
huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS
```

```python
import sys; sys.path.insert(0, "PrimeTTS/scripts")
import numpy as np, onnxruntime as ort, soundfile as sf
import frontend_bopomofo as F            # g2pw bopomofo + g2p_en, one pass

sess = ort.InferenceSession("PrimeTTS/v21_mbistft_16k/primetts_v21_3voice.onnx",
                            providers=["CPUExecutionProvider"])
o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.")
# Interleave a blank (id 0) around every symbol: [0, x1, 0, x2, 0, ...]
blank = lambda s: np.array([[0] + [v for x in s for v in (x, 0)]], np.int64)  # add_blank=true
sid = 0   # 0 Xinran ♀ · 1 Anchen ♂ · 2 Bowen ♂
wav = sess.run(None, {
    "x": blank(o["phone_ids"]), "tone": blank(o["tone_ids"]), "lang": blank(o["lang_ids"]),
    "x_lengths": np.array([2*len(o["phone_ids"])+1], np.int64),
    "sid": np.array([sid], np.int64),
    "noise_scale": np.array([0.667], np.float32),
    "length_scale": np.array([1.0], np.float32)})[0].reshape(-1)
sf.write("out.wav", wav, 16000)
```

---

## PrimeTTS v1 (`v1b_16k/`, `v1b_8k/`) — tiny, CPU-only

FastSpeech-style
acoustic (**no attention**: depthwise gated Conv-FFN + external durations + length regulator + BiGRU + postnet) with a **97K-param frame-pitch refiner** that turns per-phoneme pitch into the per-frame F0 contour = Mandarin tones (ablating it costs +18% relative zh-CER), and a **Snake-HiFiGAN** vocoder. Torch-free ONNX; runs real-time on a Jetson Nano CPU. One young-female zh-TW voice across zh / en / code-mix.

| | flagship `v1b_16k/` | leanest `v1b_8k/` |
|---|---|---|
| Params | **~5.0M** (3.56M acoustic + 1.43M vocoder) | **4.09M** (+ 0.53M vocoder) |
| Sample rate | 16 kHz (0–8 kHz band) | 8 kHz (telephone band) |
| Jetson Nano RTF | (heavier) | **0.35** (1 thread) |
| `.gguf` (ggml) | — | `inflect_combined_v1b.gguf` |

Pipeline is `encoder → numpy length-regulator → decoder → vocoder`:

```python
import sys; sys.path.insert(0, "PrimeTTS/scripts")
import json, numpy as np, onnxruntime as ort, soundfile as sf
import frontend_bopomofo as F
from synth_from_text import host_regulate

D = "PrimeTTS/v1b_16k"     # or v1b_8k for the leanest Nano RTF
meta = json.load(open(f"{D}/meta.json"))
enc = ort.InferenceSession(f"{D}/acoustic_encoder.onnx", providers=["CPUExecutionProvider"])
dec = ort.InferenceSession(f"{D}/acoustic_decoder.onnx", providers=["CPUExecutionProvider"])
voc = ort.InferenceSession(f"{D}/vocoder.onnx", providers=["CPUExecutionProvider"])

o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.")
ph, tn, lg = (np.array([o[k]], np.int64) for k in ("phone_ids", "tone_ids", "lang_ids"))
cond, dur, pitch = enc.run(None, {"phone": ph, "tone": tn, "lang": lg,
                                  "speaker": np.zeros(1, np.int64)})
reg = host_regulate(cond, dur, pitch, meta["abs_frame_bins"], meta["max_frames"])
mel = dec.run(None, {k: reg[k] for k in
      ["frames","frame_meta","local_ctx_raw","abs_pos","pitch_frame","frame_mask"]})[0]
wav = voc.run(None, {"mel": mel.astype(np.float32)})[0].reshape(-1)
sf.write("out.wav", wav, meta["sample_rate"])
```

`scripts/synth_long.py` adds punctuation auto-chunking
for long text.

---

## Shared frontend

`g2pw` (Taiwan bopomofo + polyphone disambiguation) + `g2p_en` (arpabet) merge into one phone sequence with per-phone **language ids** — zh / en / code-mix in a single pass, **88-symbol table**. Entity normalization (`scripts/text_norm.py`) reads numbers / dates / prices / emails / addresses / serials and spells acronyms (VIP → V-I-P), applied identically in training and inference. Both model families consume the *same* `frontend_bopomofo.text_to_ids()` output (phone / tone / lang ids).

---

## Reproduce from this repo

Everything needed to rebuild both models is here: the frontend, entity normalizer, aligner, corpus-gen and text-selection scripts, the eval sets + scorer, the export scripts, and the v1 trainer.

```
scripts/        frontend_bopomofo.py · text_norm.py · align_durations_v4.py · build_corpus_v3.py
                gen_codemix*.py · gen_entity_texts.py · select_diverse_text.py · asr_filter.py
                synth_from_text.py · synth_long.py · export_8k.py · export_onnx_primetts_v21.py
                xasr_offline.py · assess_big.py · rebuild_voice.sh · symbol_table.json
data/           codemix_v2.txt · entity_texts.jsonl · voxcpm_texts.jsonl   (corpus text sources)
                eval_big.jsonl · eval_entity.jsonl                          (held-out eval sets)
inflect_nano/   the v1 trainer (acoustic.py + vocoder.py), forked from Inflect-Nano-v1
configs/        zhtw_mb_istft_16k_v21b.json   (v2.1 3-voice training config)
```

**Common recipe (both models):** `teacher corpus → ASR/CER gate → phone-level align → train → export`. The three levers that matter for a tiny model: **phone-level alignment** (espeak phoneme-CTC + `torchaudio.forced_align` — sub-syllable boundaries separate speech from fluent babble), **broad coverage + diverse code-mix**, and **the teacher** (a student's language is only as good as its teacher's).

**v1** (`inflect_nano/` trainer, all in-repo):

1. Generate corpus text — `scripts/gen_codemix_v2.py`, `gen_entity_texts.py`, `select_diverse_text.py`.
2. Synthesize with the teacher (VoxCPM2 cloning a CC0 zh-TW reference), gate with `asr_filter.py`.
3. Align — `scripts/align_durations_v4.py`. Train acoustic + vocoder (`inflect_nano/`). Export — `scripts/export_8k.py`.
4. One-shot: **`scripts/rebuild_voice.sh`** (swap in your own ~10 s reference clip).

**v2.1** (MB-iSTFT-VITS; trainer is the upstream repo — see credits):

1. Synthesize the corpus with a **VibeVoice-Large** teacher across the 3 zh-capable voices.
2. **CER-gate the teacher audio** (X-ASR normalized CER < 0.05) — *not* voice-similarity — so only intelligible clips train the model. (This is the single most important QC step; ungated multi-voice teacher audio is the main failure mode.)
3. Train the 3-speaker MB-iSTFT-VITS (`configs/zhtw_mb_istft_16k_v21b.json`, `n_speakers=3`, `gin_channels=256`), warm-started from the single-voice v2 with **fresh** speaker-conditioning layers.
4. Export to ONNX — **`scripts/export_onnx_primetts_v21.py`** (opset 17, `dynamo=False`; the tiny gen-head iSTFT `n_fft=16, hop=4` is replaced by an exact irFFT + overlap-add matrix, verified vs `torch.istft`).
5. Score — `scripts/xasr_offline.py` + `assess_big.py` on `eval_big.jsonl`.

---

## Findings & lessons (what building tiny on-device zh/en TTS actually taught us)

Transferable lessons from taking this from a babbling 5M model to a shippable family. Full analysis in [`docs/zh-en-tts-arch-survey-2026.md`](./docs) and [`docs/streaming-arch-design.md`](./docs).

- **A tiny model's quality is bounded by its *inputs*, not its parameter count.** Held-out Mandarin CER fell **0.88 → 0.06 at a fixed ~5M** purely from **phone-level forced alignment** + broad character coverage — no architecture change. Sub-syllable (not character) boundaries are the difference between intelligible speech and fluent babble.
**Gate on resynth CER, not on how balanced the duration histogram looks.**
- **CER-gate the *teacher* audio, never voice-similarity alone.** Our first multi-speaker attempt trained on teacher clips filtered only for the right *voice*; four of the "voices" were speakers that can't actually pronounce Mandarin (teacher CER 0.45–0.79), and the student faithfully learned garbled speech. Filtering on intelligibility (teacher X-ASR CER < 0.05) fixed it.
- **Deterministic (FastSpeech-class) models mean-regress prosody; distributional (VITS/flow) models don't.** This is the wall that caps a tiny deterministic model at "intelligible but flat" — and why the flagship is a VITS, not a bigger FastSpeech.
- **On a launch-bound GPU (Maxwell sm_53, no CUDA-graph replay), RTF is set by kernel *count*, not FLOPs.** A smaller VITS is a smaller download but **not faster** (~0.42 RTF floor regardless of params). The lever for *speed* is an architecture with fewer, larger kernels (flow-matching + Vocos measured ~0.18) — a different axis from *size*.
- **On an ARMv8.0 CPU (Cortex-A57): fp32 is the fast format.** No int8 dot-product and no fp16 arithmetic, so int8 either breaks the voice (static) or runs *slower* than fp32 (dynamic), fp16 casts to fp32, and XNNPACK ≈ MLAS. The only CPU speed lever is a smaller/faster architecture — quantization is a *download-size* option.
- **"Lighter" and "faster" are different goals.** VITS deploy size is dominated by flow + decoder + encoder, which don't shrink with `hidden_channels`; below ~17M deploy, quality craters. **V2 Lite (17.5M) is the practical quality floor** for this arch — there is no free "smaller *and* still good."

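The teacher CER gate described in the lessons above can be sketched in a few lines. This is a minimal illustration, assuming each clip already carries its prompt text and an ASR transcript; the `normalize` rule, the `cer_gate` helper, and the dictionary keys are hypothetical stand-ins, not the logic of `scripts/asr_filter.py` or the X-ASR normalization used for the reported numbers.

```python
def normalize(text: str) -> str:
    """Assumed normalization: drop punctuation and whitespace, lowercase Latin letters."""
    return "".join(ch.lower() for ch in text if ch.isalnum())

def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance, computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate = edit distance / reference length, after normalization."""
    ref, hyp = normalize(ref), normalize(hyp)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)

def cer_gate(clips, max_cer: float = 0.05):
    """Keep only clips whose ASR transcript stays under the CER threshold."""
    return [c for c in clips if cer(c["text"], c["asr"]) < max_cer]

# Hypothetical teacher clips: prompt text plus the ASR transcript of the synthesized audio.
clips = [
    {"id": "a", "text": "您好,歡迎使用。", "asr": "您好歡迎使用"},
    {"id": "b", "text": "請撥打客服專線", "asr": "請波大課服專線"},
]
kept = cer_gate(clips)
print([c["id"] for c in kept])
```

Clip `b` has three substituted characters out of seven, so it falls well above the 0.05 threshold and is dropped, while clip `a` differs only in punctuation and passes.
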
## Credits & licenses

- **v2.1 architecture:** [MB-iSTFT-VITS](https://github.com/MasayaKawamura/MB-iSTFT-VITS) (Kawamura et al., Apache-2.0) · Jetson-Nano ggml-CUDA runtime: [RapidSpeech.cpp](https://github.com/vieenrose/RapidSpeech.cpp) (`mbistft-vits` arch)
- **v2.1 teacher:** VibeVoice-Large (Microsoft, **MIT**), 3 zh-capable presets (via the MIT [community repo](https://github.com/vibevoice-community/VibeVoice)) — synthesized / AI-generated voices; mark as such in products
- **v1 base / trainer:** [`owensong/Inflect-Nano-v1`](https://huggingface.co/owensong/Inflect-Nano-v1) (Apache-2.0) · **v1 teacher:** [`openbmb/VoxCPM2`](https://huggingface.co/openbmb/VoxCPM2) · **v1 reference voice:** [Mozilla Common Voice zh-TW](https://commonvoice.mozilla.org/datasets) (**CC0 / public domain**)
- **Gate ASR:** Breeze-ASR-25 (MediaTek Research) · Whisper · **Aligner:** `facebook/wav2vec2-lv-60-espeak-cv-ft` + `torchaudio.forced_align` · **Eval:** sherpa-onnx X-ASR

This repository: **Apache-2.0**.