---
license: mit
tags:
- text-to-speech
- tts
- litert
- tflite
- on-device
- matcha-tts
- hifigan
language:
- en
library_name: litert
pipeline_tag: text-to-speech
---

# Matcha-TTS: LiteRT (on-device, FFT-free, GPU)

On-device English text-to-speech for Android via LiteRT `CompiledModel`. This is the
**FFT-free** TTS lane: [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS) pairs a
conditional flow-matching (CFM) acoustic model with a **HiFi-GAN time-domain vocoder**, so
there is **no FFT/iSTFT anywhere** in the synthesis path. Output is 22.05 kHz audio in the LJSpeech voice.

Converted from the official `matcha_ljspeech` + `hifigan_T2_v1` checkpoints with
[litert-torch](https://github.com/google-ai-edge/litert), re-authored to be ML-Drift-GPU-clean
(per-graph tflite-vs-torch corr **1.000000**; end-to-end waveform corr ≥0.99). Weights are fp16.
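
The card does not say how "corr" is computed. A minimal sketch, assuming it means the Pearson correlation between the flattened TFLite output and the flattened PyTorch reference output (the function name `pearson_corr` is hypothetical):

```python
import math


def pearson_corr(a, b):
    """Pearson correlation between two equal-length sequences of floats.

    Assumption: this is the 'corr' metric quoted in the card; the card
    itself does not define it.
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Centre both signals, then take the normalised dot product.
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if den == 0.0:
        raise ValueError("correlation is undefined for a constant signal")
    return num / den
```

A value near 1 means the converted graph tracks the reference up to scale and offset, so a check like this is usually paired with an absolute-error check when exact magnitudes matter.
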

## Files

| File | Size | In → Out | Delegate (Pixel 8a) |
|---|---|---|---|
| `matcha_textenc_fp16.tflite` | 15 MB | emb[1,256,192] + mask[1,1,256] → mu[1,80,256], logw[1,1,256] | GPU |
| `matcha_decoder_fp16.tflite` | 23 MB | x,mu[1,80,512] + t_sin[1,160] + mask[1,1,512] → v[1,80,512] | CPU¹ |
| `matcha_vocoder_fp16.tflite` | 29 MB | mel[1,80,512] → wav[1,1,131072] | GPU |
| `dp_g2p_matcha_fp16.tflite` | 26 MB | text[1,96] (char ids) → logits[1,96,64] (IPA) | CPU |
| `emb.bin` | 0.1 MB | phoneme embedding table (178×192 f32, host lookup) | host |
| `g2p_dict.txt.gz` | 1.8 MB | 275k-entry espeak-IPA dictionary (primary G2P) | host |
| `config.json`, `g2p_meta.json` | n/a | symbols, shapes, mel stats, G2P tokenizer tables | host |

¹ The CFM decoder runs on the **CompiledModel CPU** delegate. It converts GPU-clean and is
correct on CPU, but the Mali ML Drift GPU delegate **mis-fuses the decoder's transformer blocks
at large activation magnitude**: the same block is correct as a standalone GPU graph (corr 0.984)
but collapses to corr 0.006 when fused, which points to a graph-fusion bug rather than a bad op.
The text encoder and vocoder run on the GPU; the GPU vocoder dominates wall time, so the pipeline
stays **realtime (RTF ~0.8)**.
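
The card quotes an RTF without defining it. A minimal sketch, assuming the usual definition (synthesis wall time divided by the duration of the audio produced) and the 22.05 kHz sample rate stated above; `real_time_factor` is a hypothetical helper, and the card does not say whether its figure is measured against the padded window or the trimmed speech:

```python
SAMPLE_RATE = 22050  # Hz, from the card


def real_time_factor(synthesis_seconds, num_samples, sample_rate=SAMPLE_RATE):
    """RTF = time spent synthesising / duration of the audio produced.

    Values below 1.0 mean synthesis is faster than playback.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds
```

For example, producing two seconds of audio (44100 samples) in one second of wall time gives an RTF of 0.5.
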

## Pipeline (host orchestration)

```
text --G2P(CPU dict+neural)--> phoneme ids
     --host: embed + intersperse + pad--> text_encoder(GPU) -> mu, logw
     --host: durations + length-regulator--> mu_y[1,80,T]
     --host: Euler ODE loop (N steps)--> decoder(CPU) x N -> v
     --host: denormalize--> vocoder(GPU) -> waveform
```
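
The host-side steps in the diagram can be sketched in plain Python. This is an illustration under stated assumptions, not the sample app's code: all function names are hypothetical, the blank id of 0 and the `ceil(exp(logw))` duration rule follow the upstream Matcha-TTS convention as best understood here, and `decoder` stands in for a call into the compiled decoder graph (which takes a sinusoidal embedding `t_sin` rather than the raw time `t`).

```python
import math


def intersperse(ids, blank_id=0):
    """Put blank_id between and around phoneme ids: [a, b] -> [0, a, 0, b, 0]."""
    out = [blank_id] * (2 * len(ids) + 1)
    out[1::2] = ids
    return out


def durations_from_logw(logw, mask, length_scale=1.0):
    """Frames per phoneme: ceil(exp(logw)) scaled by length_scale, 0 where masked.

    Assumption: fractional results after scaling are truncated to whole frames.
    """
    return [int(math.ceil(math.exp(lw) * m) * length_scale)
            for lw, m in zip(logw, mask)]


def length_regulate(mu, durations, max_frames):
    """Repeat each phoneme column of mu by its duration, then pad to max_frames.

    mu is [channels][phonemes]; returns ([channels][max_frames], frame_mask).
    """
    n_frames = min(sum(durations), max_frames)
    mu_y = []
    for channel in mu:
        row = []
        for value, d in zip(channel, durations):
            row.extend([value] * d)
        row = row[:max_frames]
        row += [0.0] * (max_frames - len(row))
        mu_y.append(row)
    frame_mask = [1.0] * n_frames + [0.0] * (max_frames - n_frames)
    return mu_y, frame_mask


def euler_solve(decoder, x, mu_y, mask, n_steps):
    """Integrate dx/dt = decoder(x, mu_y, t, mask) from t=0 to t=1 with Euler steps.

    x is the starting point (noise in flow matching), shaped like mu_y.
    """
    dt = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        v = decoder(x, mu_y, t, mask)  # one decoder invocation per step
        x = [[xi + dt * vi for xi, vi in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
        t += dt
    return x


def denormalize(mel, mel_mean, mel_std):
    """Undo training-time mel normalisation before the vocoder."""
    return [[value * mel_std + mel_mean for value in row] for row in mel]
```

Each Euler step costs one decoder call, so the step count N trades quality against the CPU time spent in the decoder.
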

All graphs use fixed shapes (256 phonemes, 512 mel frames ≈ 5.9 s). A runtime float mask makes
padded positions a no-op, so one compiled graph handles any input length up to those limits.
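
The arithmetic behind the fixed window, and the pad-plus-mask idea, can be sketched as follows. The constants come from the Files table; the 256 samples-per-frame figure is derived from those shapes rather than stated in the card, and `pad_with_mask` and `trim_waveform` are hypothetical helpers (trimming the padded tail is an assumption about what a host would do, not something the card describes).

```python
MAX_PHONEMES = 256        # text encoder input length
MAX_FRAMES = 512          # decoder / vocoder mel length
VOCODER_SAMPLES = 131072  # vocoder output length
SAMPLE_RATE = 22050       # Hz

# Derived from the shapes above: 131072 / 512 = 256 samples per mel frame.
SAMPLES_PER_FRAME = VOCODER_SAMPLES // MAX_FRAMES
# Longest utterance one pass can produce: 131072 / 22050, roughly 5.9 s.
MAX_SECONDS = VOCODER_SAMPLES / SAMPLE_RATE


def pad_with_mask(ids, max_len=MAX_PHONEMES, pad_id=0):
    """Pad ids to the fixed length and build the matching float mask."""
    if len(ids) > max_len:
        raise ValueError(f"input has {len(ids)} tokens, limit is {max_len}")
    n_pad = max_len - len(ids)
    padded = list(ids) + [pad_id] * n_pad
    mask = [1.0] * len(ids) + [0.0] * n_pad
    return padded, mask


def trim_waveform(wav, n_valid_frames):
    """Keep only the samples that correspond to unmasked mel frames."""
    return wav[: n_valid_frames * SAMPLES_PER_FRAME]
```

Text that needs more than 256 tokens or 512 frames does not fit in one pass; how the sample app handles that case is not described here.
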

## G2P (espeak-free)

Matcha-LJSpeech is trained on espeak en-us IPA, but espeak is GPL. The espeak-free replacement uses a
275k-entry espeak-IPA dictionary (from [OpenPhonemizer](https://github.com/NeuralVox/OpenPhonemizer),
Clear BSD) as the primary lookup, with [DeepPhonemizer](https://github.com/as-ideas/DeepPhonemizer) (MIT)
running on LiteRT CPU for out-of-dictionary words. Output IPA maps 1:1 onto the keithito 178-symbol set.
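
The dictionary-first, neural-fallback flow can be sketched as below. Everything here is illustrative: `make_g2p` is a hypothetical helper, the word-splitting regex is a simplification, `neural_fallback` stands in for a call into the DeepPhonemizer graph, and the on-disk format of `g2p_dict.txt.gz` is not described in the card, so loading it is left out.

```python
import re


def make_g2p(dictionary, neural_fallback):
    """Build a word-level G2P function: dictionary hit first, model otherwise.

    dictionary: mapping from lowercase word to IPA string.
    neural_fallback: callable taking a word and returning an IPA string.
    """
    def g2p(text):
        # Simplified tokenisation: lowercase words, keeping apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        phonemes = []
        for word in words:
            if word in dictionary:
                phonemes.append(dictionary[word])      # primary path
            else:
                phonemes.append(neural_fallback(word))  # out-of-dictionary
        return phonemes
    return g2p


# Toy entry for illustration only, not copied from the real dictionary.
toy_dict = {"hello": "həlˈoʊ"}
g2p = make_g2p(toy_dict, neural_fallback=lambda w: f"<model:{w}>")
```

With this toy setup, `g2p("Hello, zorp!")` returns the dictionary entry for "hello" and the placeholder fallback output for "zorp", so the model is only invoked for words the dictionary misses.
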

## Sample

See the LiteRT `compiled_model_api/text_to_speech` sample (Matcha-TTS) in
[google-ai-edge/litert-samples](https://github.com/google-ai-edge/litert-samples) for the full
Android app and the conversion scripts.

## License

Model: MIT (Matcha-TTS / HiFi-GAN). G2P dict: Clear BSD (OpenPhonemizer) + MIT (DeepPhonemizer).