	---
	license: apache-2.0
	language:
	- zh
	- en
	tags:
	- text-to-speech
	- tts
	- onnx
	- gguf
	- on-device
	- jetson
	- telephony
	- vits
	- mb-istft-vits
	- multi-speaker
	- mandarin
	- taiwanese-mandarin
	base_model: owensong/Inflect-Nano-v1
	base_model_relation: finetune
	library_name: onnxruntime
	pipeline_tag: text-to-speech
	---

	# PrimeTTS — on-device zh-TW + English TTS

	Taiwan-Mandarin + English text-to-speech built for on-device use (contact-centre, GPS, transit): one
	frontend handles Chinese, English, and code-mix with no language routing, and reads entities
	correctly — phone numbers, emails, addresses, prices, dates, temperatures, %, serials.

	Two models to know:

	\| \| PrimeTTS v2.1 — flagship \| PrimeTTS v1 — leanest CPU \|
	\|---\|---\|---\|
	\| Folder \| [`v21_mbistft_16k/`](./v21_mbistft_16k) \| [`v1b_16k/`](./v1b_16k) · [`v1b_8k/`](./v1b_8k) \|
	\| Architecture \| MB-iSTFT-VITS (end-to-end, multi-speaker) \| FastSpeech + Snake-HiFiGAN (+ pitch refiner) \|
	\| Params \| 37.9M \| ~5.0M (16 kHz) / 4.09M (8 kHz) \|
	\| Voices \| 3 selectable — Xinran ♀, Anchen ♂, Bowen ♂ \| 1 — young ♀ zh-TW \|
	\| Sample rate \| 16 kHz \| 16 kHz / 8 kHz \|
	\| Held-out CER \| 0.059 (zh/mix/en, Xinran; 0.066–0.069 for the male voices) \| 0.11–0.15 (zh) \|
	\| Best on \| Jetson Nano GPU (also any CPU) \| pure CPU — Nano RTF 0.35 (8 kHz, 1 thread) \|

	Pick v2.1 for multiple voices; pick v1 when the budget is CPU-only and tight.

	The full family (MB-iSTFT-VITS at 16 kHz with the single Xinran voice unless noted; v1 is the exception: FastSpeech-style, 16 or 8 kHz):

	\| model \| folder \| params (deploy) \| CER \| use when \|
	\|---\|---\|---\|---\|---\|
	\| v2 \| `v2_mbistft_16k/` \| 34.7M (17.5M) \| 0.027 \| you want the cleanest single Xinran voice \|
	\| v2.1 \| `v21_mbistft_16k/` \| 37.9M (~18M) \| 0.059 \| you want a choice of 3 voices \|
	\| V2 Lite \| `v2lite_mbistft_16k/` \| 24.8M (17.5M) \| 0.041 \| a lighter, still-good single voice for tighter GPU budgets \|
	\| v1 \| `v1b_16k/`,`v1b_8k/` \| ~5M \| 0.11–0.15 \| pure-CPU, real-time on a Jetson Nano \|

	(`v3_4.6M/` and the top-level `*.onnx` are legacy 24 kHz variants, kept for provenance.) V2 Lite uses the exact
	same ONNX I/O + frontend as v2 — it's a drop-in, smaller replacement.

	> 🔊 Live demo: https://huggingface.co/spaces/Luigi/PrimeTTS-vs-Inflect-Nano-v1 — pick a model, pick a voice, type text.

	---

	## PrimeTTS v2.1 (`v21_mbistft_16k/`)

	End-to-end MB-iSTFT-VITS (VAE + normalizing flow + adversarial multi-band iSTFT head; conv-only, no LSTM)
	with 3 selectable Taiwan-Mandarin voices, chosen by an integer `sid` input (0 = Xinran ♀, 1 = Anchen ♂,
	2 = Bowen ♂). 37.9M generator params, 16 kHz, `gin_channels=256` speaker conditioning.

	Quality (36 held-out zh / code-mix / en sentences, X-ASR normalized CER):

	\| voice (`sid`) \| CER \| note \|
	\|---\|---\|---\|
	\| Xinran ♀ (0) \| 0.059 \| flagship voice, cleanest teacher \|
	\| Anchen ♂ (1) \| 0.069 \| slight accent \|
	\| Bowen ♂ (2) \| 0.066 \| slight accent \|

	### On-device deployment (measured, Jetson Nano gen-1 / Tegra X1)

	Same runtime profile as the single-voice v2 (identical architecture). RTF = compute-time ÷ audio-time
	(lower is faster; < 1.0 = real-time).

	\| Tier \| Runtime \| Precision \| RTF \| Notes \|
	\|---\|---\|---\|---\|---\|
	\| GPU \| RapidSpeech.cpp ggml-CUDA, 1 CPU thread \| fp32 \| 0.42 (2.4× RT) \| launch-bound floor on Maxwell (sm_53, no CUDA-graph replay) \|
	\| CPU (default) \| onnxruntime, 4 threads \| fp32 \| 0.52 (1.9× RT) \| full quality, 117 MB \|
	\| CPU \| onnxruntime, 2 threads \| fp32 \| 0.77 (1.3× RT) \| fewer cores, leaves headroom \|

	All tiers are full-fidelity, and the CPU tiers need no GPU. On this ARMv8.0 Cortex-A57, fp32 is the fast format.
	int8 is not a speed lever: static int8 shifts the voice, and dynamic int8 preserves it but runs slower than
	fp32, because this core has no int8 dot-product and no FP16 arithmetic. fp16 is cast to fp32 (no speedup), and
	XNNPACK ≈ MLAS. The only on-device speed lever is a smaller/faster architecture, not quantization.
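
The RTF definition above maps to a few lines of timing code. This is a minimal sketch, assuming `synthesize` is any zero-argument callable that returns the generated samples; `measure_rtf`, `warmup`, and `runs` are illustrative names, not part of this repo.

```python
import time

def rtf(compute_seconds, audio_seconds):
    """Real-time factor: compute time divided by audio duration (< 1.0 is real-time)."""
    return compute_seconds / audio_seconds

def measure_rtf(synthesize, sample_rate, warmup=1, runs=3):
    """Time a zero-argument synthesis callable and return its average RTF."""
    for _ in range(warmup):
        synthesize()  # warm-up runs are not timed
    total_compute = 0.0
    total_audio = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        wav = synthesize()
        total_compute += time.perf_counter() - start
        total_audio += len(wav) / sample_rate  # seconds of audio produced
    return rtf(total_compute, total_audio)
```

The "× RT" figures in the table are the reciprocal of the RTF: 1 / 0.42 ≈ 2.4, 1 / 0.52 ≈ 1.9, 1 / 0.77 ≈ 1.3.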

	### Files

	```
	v21_mbistft_16k/primetts_v21_3voice.onnx 3-voice fp32 (117 MB) — full quality, all runtimes
	```

	### Quickstart

	```bash
	pip install onnxruntime numpy soundfile g2pw g2p_en cn2an
	huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS
	```
	```python
	import sys; sys.path.insert(0, "PrimeTTS/scripts")
	import numpy as np, onnxruntime as ort, soundfile as sf
	import frontend_bopomofo as F  # g2pw bopomofo + g2p_en, one pass

	sess = ort.InferenceSession(
	    "PrimeTTS/v21_mbistft_16k/primetts_v21_3voice.onnx",
	    providers=["CPUExecutionProvider"])
	o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.")
	# add_blank=true: interleave id 0 between symbols -> [0, x1, 0, x2, ..., 0], shape (1, 2n+1)
	blank = lambda s: np.array([[0] + [v for x in s for v in (x, 0)]], np.int64)
	sid = 0  # 0 Xinran ♀ · 1 Anchen ♂ · 2 Bowen ♂
	wav = sess.run(None, {
	    "x": blank(o["phone_ids"]), "tone": blank(o["tone_ids"]), "lang": blank(o["lang_ids"]),
	    "x_lengths": np.array([2 * len(o["phone_ids"]) + 1], np.int64),
	    "sid": np.array([sid], np.int64),
	    "noise_scale": np.array([0.667], np.float32),
	    "length_scale": np.array([1.0], np.float32)})[0].reshape(-1)
	sf.write("out.wav", wav, 16000)  # 16 kHz mono
	```
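
The `blank` lambda in the quickstart interleaves a blank id between symbols (`add_blank=true`). Written out as a plain function it looks like the sketch below; `intersperse_blank` is a name chosen here for illustration, and the blank id of 0 is taken from the quickstart.

```python
def intersperse_blank(ids, blank_id=0):
    """Return [blank, id_1, blank, id_2, ..., id_n, blank]."""
    out = [blank_id]
    for symbol in ids:
        out.append(symbol)
        out.append(blank_id)
    return out

phones = [5, 7, 9]  # made-up ids, for illustration only
padded = intersperse_blank(phones)
```

For `n` input symbols the result has `2n + 1` entries, which is why the quickstart passes `2*len(o["phone_ids"])+1` as `x_lengths`.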

	---

	## PrimeTTS v1 (`v1b_16k/`, `v1b_8k/`) — tiny, CPU-only

	FastSpeech-style acoustic model (no attention: depthwise gated Conv-FFN + external durations + length regulator
	+ BiGRU + postnet) with a 97K-param frame-pitch refiner that turns per-phoneme pitch into the per-frame F0
	contour, which is what carries the Mandarin tones (ablating it costs +18% relative zh-CER), and a Snake-HiFiGAN
	vocoder. Torch-free ONNX; runs real-time on a Jetson Nano CPU. One young-female zh-TW voice across zh / en / code-mix.

	\| \| flagship `v1b_16k/` \| leanest `v1b_8k/` \|
	\|---\|---\|---\|
	\| Params \| ~5.0M (3.56M acoustic + 1.43M vocoder) \| 4.09M (+ 0.53M vocoder) \|
	\| Sample rate \| 16 kHz (0–8 kHz band) \| 8 kHz (telephone band) \|
	\| Jetson Nano RTF \| (heavier) \| 0.35 (1 thread) \|
	\| `.gguf` (ggml) \| — \| `inflect_combined_v1b.gguf` \|

	Pipeline is `encoder → numpy length-regulator → decoder → vocoder`:

	```python
	import sys; sys.path.insert(0, "PrimeTTS/scripts")
	import json, numpy as np, onnxruntime as ort, soundfile as sf
	import frontend_bopomofo as F
	from synth_from_text import host_regulate

	D = "PrimeTTS/v1b_16k"  # or v1b_8k for the leanest Nano RTF
	with open(f"{D}/meta.json") as f:
	    meta = json.load(f)
	cpu = ["CPUExecutionProvider"]
	enc = ort.InferenceSession(f"{D}/acoustic_encoder.onnx", providers=cpu)
	dec = ort.InferenceSession(f"{D}/acoustic_decoder.onnx", providers=cpu)
	voc = ort.InferenceSession(f"{D}/vocoder.onnx", providers=cpu)

	# 1. Frontend: text -> phone / tone / language ids, each shaped (1, n_phones)
	o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.")
	ph, tn, lg = (np.array([o[k]], np.int64) for k in ("phone_ids", "tone_ids", "lang_ids"))
	# 2. Encoder: per-phone conditioning, durations, and pitch
	cond, dur, pitch = enc.run(None, {"phone": ph, "tone": tn, "lang": lg, "speaker": np.zeros(1, np.int64)})
	# 3. Host-side length regulator (numpy): expand per-phone features into the decoder inputs
	reg = host_regulate(cond, dur, pitch, meta["abs_frame_bins"], meta["max_frames"])
	# 4. Decoder -> mel, then vocoder -> waveform
	dec_inputs = ["frames", "frame_meta", "local_ctx_raw", "abs_pos", "pitch_frame", "frame_mask"]
	mel = dec.run(None, {k: reg[k] for k in dec_inputs})[0]
	wav = voc.run(None, {"mel": mel.astype(np.float32)})[0].reshape(-1)
	sf.write("out.wav", wav, meta["sample_rate"])
	```
	`scripts/synth_long.py` adds punctuation auto-chunking for long text.
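
For readers new to FastSpeech-style models, the length-regulator stage in the pipeline above expands each per-phone vector to as many frames as its duration. The sketch below shows only that core idea. It is not the repo's `host_regulate`, which also builds the other decoder inputs (`frame_meta`, `abs_pos`, and so on), and the `max_frames` truncation shown here is an assumed behavior.

```python
def length_regulate(phone_states, durations, max_frames=None):
    """Repeat each per-phone state `duration` times to get a per-frame sequence."""
    if len(phone_states) != len(durations):
        raise ValueError("need exactly one duration per phone")
    frames = []
    for state, duration in zip(phone_states, durations):
        repeats = max(0, int(round(duration)))  # negative durations are clamped to 0
        frames.extend([state] * repeats)
    if max_frames is not None:
        frames = frames[:max_frames]  # assumed: truncate over-long utterances
    return frames
```

With NumPy arrays and integer durations, `np.repeat(states, durations, axis=0)` performs the same expansion in one call.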

	---

	## Shared frontend

	`g2pw` (Taiwan bopomofo + polyphone disambiguation) + `g2p_en` (arpabet) merge into one phone sequence with
	per-phone language ids — zh / en / code-mix in a single pass, 88-symbol table. Entity normalization
	(`scripts/text_norm.py`) reads numbers / dates / prices / emails / addresses / serials and spells acronyms
	(VIP → V-I-P), applied identically in training and inference. Both model families consume the same
	`frontend_bopomofo.text_to_ids()` output (phone / tone / lang ids).
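
As an illustration of one normalization rule mentioned above (VIP → V-I-P), here is a hypothetical sketch. It is not the code in `scripts/text_norm.py`; the function name, the regular expression, and the "two or more capitals" rule are assumptions made for this example.

```python
import re

# Assumed rule: a run of 2+ capital letters not attached to other Latin letters.
# Lookarounds are used instead of \b because CJK characters count as word
# characters in Python regexes, so "VIP會員" has no \b after the "P".
_ACRONYM = re.compile(r"(?<![A-Za-z])[A-Z]{2,}(?![A-Za-z])")

def spell_acronyms(text):
    """Rewrite standalone all-caps tokens letter by letter, e.g. VIP -> V-I-P."""
    return _ACRONYM.sub(lambda m: "-".join(m.group()), text)
```

Mixed-case words such as `PrimeTTS` are left untouched by this rule, because the capital run is attached to lowercase letters.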

	---

	## Reproduce from this repo

	Everything needed to rebuild both models is here, except the v2.1 trainer (the upstream MB-iSTFT-VITS repo, see
	credits): the frontend, entity normalizer, aligner, corpus-gen and text-selection scripts, the eval sets + scorer,
	the export scripts, and the v1 trainer.

	```
	scripts/        frontend_bopomofo.py · text_norm.py · align_durations_v4.py · build_corpus_v3.py
	                gen_codemix*.py · gen_entity_texts.py · select_diverse_text.py · asr_filter.py
	                synth_from_text.py · synth_long.py · export_8k.py · export_onnx_primetts_v21.py
	                xasr_offline.py · assess_big.py · rebuild_voice.sh · symbol_table.json
	data/           codemix_v2.txt · entity_texts.jsonl · voxcpm_texts.jsonl   (corpus text sources)
	                eval_big.jsonl · eval_entity.jsonl                         (held-out eval sets)
	inflect_nano/   the v1 trainer (acoustic.py + vocoder.py), forked from Inflect-Nano-v1
	configs/        zhtw_mb_istft_16k_v21b.json   (v2.1 3-voice training config)
	```

	Common recipe (both models): `teacher corpus → ASR/CER gate → phone-level align → train → export`.
	The three levers that matter for a tiny model: **phone-level alignment** (espeak phoneme-CTC +
	`torchaudio.forced_align`; sub-syllable boundaries separate speech from fluent babble), **broad coverage +
	diverse code-mix**, and **the teacher** (a student's language is only as good as its teacher's).
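
The ASR/CER gate in the recipe above can be sketched as follows. This is a minimal illustration, not `scripts/asr_filter.py`: the normalization (lowercase, drop whitespace and punctuation) is an assumption, the function names are invented for this example, and the 0.05 default mirrors the threshold quoted for v2.1 in this README.

```python
import unicodedata

def normalize(text):
    """Assumed normalization: lowercase, then drop whitespace and punctuation."""
    return [c for c in text.lower()
            if not c.isspace() and not unicodedata.category(c).startswith("P")]

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def cer(reference, transcript):
    """Character error rate of an ASR transcript against the reference text."""
    ref, hyp = normalize(reference), normalize(transcript)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)

def gate(clips, threshold=0.05):
    """Keep ids of clips whose CER is below the threshold.

    clips: iterable of (clip_id, reference_text, asr_transcript).
    """
    return [cid for cid, ref, hyp in clips if cer(ref, hyp) < threshold]
```

The gate depends only on the transcript and the reference text, so it filters on intelligibility and ignores how closely a clip matches the target voice.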

	v1 (`inflect_nano/` trainer, all in-repo):
	1. Generate corpus text — `scripts/gen_codemix_v2.py`, `gen_entity_texts.py`, `select_diverse_text.py`.
	2. Synthesize with the teacher (VoxCPM2 cloning a CC0 zh-TW reference), gate with `asr_filter.py`.
	3. Align — `scripts/align_durations_v4.py`. Train acoustic + vocoder (`inflect_nano/`). Export — `scripts/export_8k.py`.
	4. One-shot: `scripts/rebuild_voice.sh` (swap in your own ~10 s reference clip).

	v2.1 (MB-iSTFT-VITS; trainer is the upstream repo — see credits):
	1. Synthesize the corpus with a VibeVoice-Large teacher across the 3 zh-capable voices.
	2. CER-gate the teacher audio (X-ASR normalized CER < 0.05) — not voice-similarity — so only intelligible
	clips train the model. (This is the single most important QC step; ungated multi-voice teacher audio is the
	main failure mode.)
	3. Train the 3-speaker MB-iSTFT-VITS (`configs/zhtw_mb_istft_16k_v21b.json`, `n_speakers=3`, `gin_channels=256`),
	warm-started from the single-voice v2 with fresh speaker-conditioning layers.
	4. Export to ONNX — `scripts/export_onnx_primetts_v21.py` (opset 17, `dynamo=False`; the tiny gen-head
	iSTFT `n_fft=16, hop=4` is replaced by an exact irFFT + overlap-add matrix, verified vs `torch.istft`).
	5. Score — `scripts/xasr_offline.py` + `assess_big.py` on `eval_big.jsonl`.
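
The export trick in step 4 relies on the inverse real FFT being a linear map, so each frame can be produced by a fixed matrix product and the frames then summed at hop-spaced offsets. Below is a minimal pure-Python sketch of that idea using the quoted `n_fft=16, hop=4`. It is not the code in `scripts/export_onnx_primetts_v21.py`: the function names are invented here, the synthesis window defaults to rectangular, and no window-envelope normalization is applied, so it illustrates the linear algebra and is not a numerical match for `torch.istft`.

```python
import math

N_FFT, HOP = 16, 4  # values quoted in the export step above

def irfft_matrices(n_fft):
    """Return (C, S) with frame[n] = sum_k C[n][k] * re[k] + S[n][k] * im[k]."""
    n_bins = n_fft // 2 + 1  # one-sided spectrum
    C = [[0.0] * n_bins for _ in range(n_fft)]
    S = [[0.0] * n_bins for _ in range(n_fft)]
    for n in range(n_fft):
        for k in range(n_bins):
            # DC and Nyquist occur once in the full spectrum, other bins twice.
            weight = 1.0 if k in (0, n_fft // 2) else 2.0
            angle = 2.0 * math.pi * k * n / n_fft
            C[n][k] = weight * math.cos(angle) / n_fft
            S[n][k] = -weight * math.sin(angle) / n_fft
    return C, S

def istft_matmul(re, im, n_fft=N_FFT, hop=HOP, window=None):
    """Inverse STFT via matrix products plus overlap-add.

    re, im: lists of frames, each holding n_fft // 2 + 1 spectrum values.
    """
    C, S = irfft_matrices(n_fft)
    window = window or [1.0] * n_fft  # rectangular unless a window is given
    n_bins = n_fft // 2 + 1
    out = [0.0] * ((len(re) - 1) * hop + n_fft)
    for t in range(len(re)):
        for n in range(n_fft):
            sample = sum(C[n][k] * re[t][k] + S[n][k] * im[t][k] for k in range(n_bins))
            out[t * hop + n] += window[n] * sample  # overlap-add at the frame offset
    return out
```

Because `C` and `S` depend only on `n_fft`, they can be precomputed once and stored as constants, which leaves only multiplies and adds at inference time.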

	---

	## Findings & lessons (what building tiny on-device zh/en TTS actually taught us)

	Transferable lessons from taking this from a babbling 5M model to a shippable family. Full analysis in
	[`docs/zh-en-tts-arch-survey-2026.md`](./docs) and [`docs/streaming-arch-design.md`](./docs).

	- *A tiny model's quality is bounded by its inputs, not its parameter count.* Held-out Mandarin CER fell
	0.88 → 0.06 at a fixed ~5M purely from phone-level forced alignment + broad character coverage —
	no architecture change. Sub-syllable (not character) boundaries are the difference between intelligible
	speech and fluent babble. Gate on resynth CER, not on how balanced the duration histogram looks.
	- *CER-gate the teacher audio, never voice-similarity alone.* Our first multi-speaker attempt trained on
	teacher clips filtered only for the right voice; four of the "voices" were speakers that can't actually
	pronounce Mandarin (teacher CER 0.45–0.79), and the student faithfully learned garbled speech. Filtering on
	intelligibility (teacher X-ASR CER < 0.05) fixed it.
	- Deterministic (FastSpeech-class) models mean-regress prosody; distributional (VITS/flow) models don't.
	This is the wall that caps a tiny deterministic model at "intelligible but flat" — and why the flagship is
	a VITS, not a bigger FastSpeech.
	- *On a launch-bound GPU (Maxwell sm_53, no CUDA-graph replay), RTF is set by kernel count, not FLOPs.* A
	smaller VITS is a smaller download but not faster (~0.42 RTF floor regardless of params). The lever for
	speed is an architecture with fewer, larger kernels (flow-matching + Vocos measured ~0.18) — a different
	axis from size.
	- On an ARMv8.0 CPU (Cortex-A57): fp32 is the fast format. No int8 dot-product and no fp16 arithmetic, so
	int8 either breaks the voice (static) or runs slower than fp32 (dynamic), fp16 casts to fp32, and XNNPACK
	≈ MLAS. The only CPU speed lever is a smaller/faster architecture — quantization is a download-size option.
	- "Lighter" and "faster" are different goals. VITS deploy size is dominated by flow + decoder + encoder,
	which don't shrink with `hidden_channels`; below ~17M deploy, quality craters. **V2 Lite (17.5M) is the
	practical quality floor** for this arch — there is no free "smaller and still good."
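
The launch-bound lesson above can be illustrated with a toy cost model. This is a sketch only: the function name and every number in it (per-launch overhead, throughput, kernel counts, FLOPs) are hypothetical placeholders chosen for illustration, not measurements from this repo.

```python
def estimated_ms(n_kernels, gflop, launch_overhead_ms=0.01, gflop_per_ms=0.2):
    """Toy model: total time = per-kernel launch cost + arithmetic time.

    All defaults are hypothetical placeholders, not measured values.
    """
    return n_kernels * launch_overhead_ms + gflop / gflop_per_ms

# Hypothetical baseline: many small kernels, modest arithmetic.
base = estimated_ms(n_kernels=4000, gflop=1.0)
# Same kernel count, half the arithmetic (a smaller model of the same shape).
smaller = estimated_ms(n_kernels=4000, gflop=0.5)
# Same arithmetic, a quarter of the kernels (fewer, larger kernels).
fewer = estimated_ms(n_kernels=1000, gflop=1.0)

print(f"baseline {base:.1f} ms | half the FLOPs {smaller:.1f} ms | quarter the kernels {fewer:.1f} ms")
```

Under these made-up numbers, halving the arithmetic barely moves the total, while cutting the kernel count does, which is the qualitative behavior the bullet describes.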

	## Credits & licenses

	- v2.1 architecture: [MB-iSTFT-VITS](https://github.com/MasayaKawamura/MB-iSTFT-VITS) (Kawamura et al., Apache-2.0) ·
	Jetson-Nano ggml-CUDA runtime: [RapidSpeech.cpp](https://github.com/vieenrose/RapidSpeech.cpp) (`mbistft-vits` arch)
	- v2.1 teacher: VibeVoice-Large (Microsoft, MIT), 3 zh-capable presets (via the MIT
	[community repo](https://github.com/vibevoice-community/VibeVoice)) — synthesized / AI-generated voices; mark as such in products
	- v1 base / trainer: [`owensong/Inflect-Nano-v1`](https://huggingface.co/owensong/Inflect-Nano-v1) (Apache-2.0) ·
	v1 teacher: [`openbmb/VoxCPM2`](https://huggingface.co/openbmb/VoxCPM2) ·
	v1 reference voice: [Mozilla Common Voice zh-TW](https://commonvoice.mozilla.org/datasets) (CC0 / public domain)
	- Gate ASR: Breeze-ASR-25 (MediaTek Research) · Whisper · Aligner:
	`facebook/wav2vec2-lv-60-espeak-cv-ft` + `torchaudio.forced_align` · Eval: sherpa-onnx X-ASR

	This repository: Apache-2.0.