	---
	license: mit
	tags:
	- text-to-speech
	- tts
	- litert
	- tflite
	- on-device
	- matcha-tts
	- hifigan
	language:
	- en
	library_name: litert
	pipeline_tag: text-to-speech
	---

	# Matcha-TTS — LiteRT (on-device, FFT-free, GPU)

	On-device English text-to-speech for Android, run through the LiteRT `CompiledModel` API. This is
	the FFT-free TTS lane: [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS) pairs a
	conditional flow-matching (CFM) acoustic model with a HiFi-GAN time-domain vocoder, so
	there is no FFT/iSTFT anywhere in the synthesis path. Output is 22.05 kHz audio in the LJSpeech voice.

	Converted from the official `matcha_ljspeech` + `hifigan_T2_v1` checkpoints with
	[litert-torch](https://github.com/google-ai-edge/litert) and re-authored to be ML-Drift-GPU-clean.
	Per-graph correlation between the TFLite and PyTorch outputs is 1.000000, and end-to-end waveform
	correlation is ≥0.99. Weights are stored in fp16.

	## Files

	| File | Size | In → Out | Delegate (Pixel 8a) |
	|---|---|---|---|
	| `matcha_textenc_fp16.tflite` | 15 MB | emb[1,256,192] + mask[1,1,256] → mu[1,80,256], logw[1,1,256] | GPU |
	| `matcha_decoder_fp16.tflite` | 23 MB | x,mu[1,80,512] + t_sin[1,160] + mask[1,1,512] → v[1,80,512] | CPU¹ |
	| `matcha_vocoder_fp16.tflite` | 29 MB | mel[1,80,512] → wav[1,1,131072] | GPU |
	| `dp_g2p_matcha_fp16.tflite` | 26 MB | text[1,96] (char ids) → logits[1,96,64] (IPA) | CPU |
	| `emb.bin` | 0.1 MB | phoneme embedding table (178×192 f32, host lookup) | host |
	| `g2p_dict.txt.gz` | 1.8 MB | 275k-entry espeak-IPA dictionary (primary G2P) | host |
	| `config.json`, `g2p_meta.json` | n/a | symbols, shapes, mel stats, G2P tokenizer tables | host |

	¹ The CFM decoder runs on the `CompiledModel` CPU delegate. It converts GPU-clean and is
	correct on CPU, but the Mali ML Drift GPU delegate **mis-fuses the decoder's transformer blocks
	at large activation magnitude**: the same block is correct as a standalone GPU graph (corr 0.984)
	but collapses to corr 0.006 once fused, which points to a graph-fusion bug rather than a bad op.
	The text encoder and vocoder run on the GPU. Because the GPU vocoder dominates wall time, keeping
	the decoder on the CPU still leaves the pipeline realtime (RTF ~0.8).
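
As a rough check on the realtime figure, the real-time factor can be computed as synthesis time divided by the duration of the audio produced. The sketch below assumes that definition. The function name is hypothetical; only the 22.05 kHz sample rate and the 131072-sample vocoder output come from this card.

```python
SAMPLE_RATE = 22050        # Hz, from the model card
VOCODER_SAMPLES = 131072   # fixed vocoder output length, from the Files table


def real_time_factor(synthesis_seconds: float, num_samples: int,
                     sample_rate: int = SAMPLE_RATE) -> float:
    """Return synthesis time / audio duration (below 1.0 is faster than realtime)."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# A full fixed-shape output buffer holds a little under 6 s of audio.
full_buffer_seconds = VOCODER_SAMPLES / SAMPLE_RATE
```

Under this definition, an RTF of about 0.8 means a clip is synthesized in roughly 80% of its own playback time.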

	## Pipeline (host orchestration)

	```
	text --G2P(CPU dict+neural)--> phoneme ids
	--host: embed + intersperse + pad--> text_encoder(GPU) -> mu, logw
	--host: durations + length-regulator--> mu_y[1,80,T]
	--host: Euler ODE loop (N steps)--> decoder(CPU) x N -> v
	--host: denormalize--> vocoder(GPU) -> waveform
	```
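
The two host-side steps in the middle of the diagram, the length regulator and the Euler loop, can be sketched in plain Python. This is an illustration under stated assumptions, not the sample app's code: `regulate_length`, `euler_solve`, and the `decoder` callable are hypothetical names, the duration rule `ceil(exp(logw) * length_scale)` is assumed, and tensors are written time-major (`[frames][channels]`) for readability. The real decoder graph also takes a `t_sin[1,160]` time embedding; here the scalar `t` is passed and the embedding is left to the decoder wrapper.

```python
import math


def regulate_length(mu, logw, mask, max_frames=512, length_scale=1.0):
    """Expand per-phoneme vectors to per-frame vectors using predicted durations.

    mu:   one [channels] vector per phoneme slot.
    logw: predicted log-duration per phoneme slot.
    mask: 1.0 for real phonemes, 0.0 for padding.
    Returns (mu_y, frame_mask), both padded to max_frames.
    """
    channels = len(mu[0])
    frames = []
    for vec, lw, m in zip(mu, logw, mask):
        if m == 0.0:
            continue  # padded slots contribute no frames
        # Assumption: each phoneme lasts ceil(exp(logw) * length_scale) frames.
        duration = math.ceil(math.exp(lw) * length_scale)
        frames.extend(list(vec) for _ in range(duration))
    frames = frames[:max_frames]  # drop anything past the fixed mel length
    n_real = len(frames)
    frames += [[0.0] * channels for _ in range(max_frames - n_real)]
    frame_mask = [1.0] * n_real + [0.0] * (max_frames - n_real)
    return frames, frame_mask


def euler_solve(decoder, x0, mu_y, frame_mask, n_steps):
    """Integrate dx/dt = v(x, mu_y, t) from t=0 to t=1 with fixed-step Euler.

    decoder is a hypothetical callable (x, mu_y, t, frame_mask) -> velocity
    with the same shape as x; x0 is the initial noise sample.
    """
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    x = [list(row) for row in x0]
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        v = decoder(x, mu_y, t, frame_mask)  # one decoder invocation per step
        x = [[xi + dt * vi for xi, vi in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x
```

Each Euler step costs one decoder call, so `n_steps` trades synthesis time directly against how finely the ODE is integrated.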

	All graphs use fixed shapes: 256 phonemes and 512 mel frames (131072 samples ≈ 5.9 s at 22.05 kHz).
	A runtime float mask makes padded positions a no-op, so one compiled graph handles any input up to
	those limits.
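
The host-side "embed + intersperse + pad" step that feeds these fixed shapes might look like the sketch below. It rests on several assumptions that this card does not confirm: `emb.bin` is read as a little-endian, row-major float32 table of shape 178×192, the blank symbol inserted between phonemes has id 0, and the result is laid out time-major to match `emb[1,256,192]`. All function names are hypothetical.

```python
import struct

VOCAB, DIM, MAX_PHONEMES = 178, 192, 256
BLANK_ID = 0  # assumption: the blank/pad symbol is id 0


def load_embedding_table(raw: bytes):
    """Decode raw bytes into VOCAB rows of DIM floats (little-endian assumed)."""
    expected = VOCAB * DIM * 4
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    flat = struct.unpack(f"<{VOCAB * DIM}f", raw)
    return [list(flat[i * DIM:(i + 1) * DIM]) for i in range(VOCAB)]


def intersperse(ids, blank=BLANK_ID):
    """Place a blank before, between, and after the phoneme ids."""
    out = [blank] * (2 * len(ids) + 1)
    out[1::2] = ids
    return out


def embed_and_pad(ids, table):
    """Return (emb, mask): emb is [MAX_PHONEMES][DIM], mask is [MAX_PHONEMES]."""
    seq = intersperse(ids)
    if len(seq) > MAX_PHONEMES:
        raise ValueError("input too long for the fixed 256-slot graph")
    pad = MAX_PHONEMES - len(seq)
    emb = [list(table[i]) for i in seq] + [[0.0] * DIM for _ in range(pad)]
    mask = [1.0] * len(seq) + [0.0] * pad  # 1.0 marks real positions
    return emb, mask
```

The mask is what lets the text encoder ignore the zero-filled tail, so the same compiled graph serves both short and long inputs.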

	## G2P (espeak-free)

	Matcha-LJSpeech is trained on espeak en-us IPA, but espeak is GPL-licensed. The replacement used here
	is a 275k-entry espeak-IPA dictionary (from [OpenPhonemizer](https://github.com/NeuralVox/OpenPhonemizer),
	Clear BSD) as the primary lookup, with [DeepPhonemizer](https://github.com/as-ideas/DeepPhonemizer) (MIT)
	running on LiteRT CPU for out-of-dictionary words. The output IPA maps 1:1 onto the keithito 178-symbol set.
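
The dictionary-first, neural-fallback order described above can be sketched as follows. The names `phonemize`, `lexicon`, and `neural_g2p` are hypothetical: `lexicon` stands for the dictionary already loaded into a word-to-IPA mapping (the on-disk format of `g2p_dict.txt.gz` is not specified here), and `neural_g2p` stands for a wrapper around the DeepPhonemizer graph. The word splitting is deliberately naive and drops punctuation.

```python
import re


def phonemize(text, lexicon, neural_g2p):
    """Convert text to IPA word by word: dictionary first, neural model on a miss."""
    words = re.findall(r"[a-z']+", text.lower())
    ipa_words = []
    for word in words:
        ipa = lexicon.get(word)
        if ipa is None:
            # Out-of-dictionary word: fall back to the neural G2P callable.
            ipa = neural_g2p(word)
        ipa_words.append(ipa)
    return " ".join(ipa_words)
```

With this ordering, in-dictionary words never invoke the neural model, so its cost is only paid for rare or novel words.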

	## Sample

	See the LiteRT `compiled_model_api/text_to_speech` sample (Matcha-TTS) in
	[google-ai-edge/litert-samples](https://github.com/google-ai-edge/litert-samples) for the full
	Android app and the conversion scripts.

	## License

	Model: MIT (Matcha-TTS / HiFi-GAN). G2P: Clear BSD dictionary (OpenPhonemizer) + MIT neural model (DeepPhonemizer).