	# ThreadCast — Chrome Extension Neural Models Mirror

	Hugging Face transformers.js–format mirror of the on-device neural TTS models used by the ThreadCast Chrome extension. The Android counterpart lives in two siblings: [`../android/`](../android/README.md) (local dev staging — sherpa-onnx upstream artifacts) and [`../mobile-android/`](../mobile-android/README.md) (production zips downloaded by the Android app at runtime). See the [parent README](../README.md) for repository-wide context, branding, and license summary.

	> If you're an extension user, you don't need anything here — the extension downloads what it needs automatically the first time you select a Neural engine. This page is for transparency, contributors, and forks.

	---

	## Layout

	```
	extension/
	├── neural-28m/                          # Piper voices for the CPU (Lite) engine
	│   └── en/en_US/<voice>/medium/
	│       ├── en_US-<voice>-medium.onnx
	│       └── en_US-<voice>-medium.onnx.json
	├── neural-melo-en/                      # MeloTTS for the GPU Lite engine (mobile surfaces same engine as "Plus")
	│   ├── model.onnx                       # fp32 — production default
	│   ├── lexicon.txt                      # enriched CMUdict-style lexicon
	│   ├── tokens.txt                       # phoneme → ID map
	│   └── LICENSE
	└── neural-82m/                          # Kokoro model + voices for the GPU (Studio) engine
	    ├── onnx/
	    │   ├── model.onnx                   # fp32 — production default
	    │   └── model_fp16.onnx              # fp16 — experimental, blocked by upstream bugs
	    ├── tokenizer.json
	    ├── tokenizer_config.json
	    ├── config.json
	    └── voices/                          # 11 speaker embeddings
	        └── af_bella.bin … bm_daniel.bin
	```

	> Naming note: `neural-28m` / `neural-82m` encode the parameter count in their folder name (CPU and GPU tiers, respectively). `neural-melo-en` breaks that convention — MeloTTS at ~52 M params would naturally be `neural-52m`, but the folder + file naming aligns with the local staging tree at [`AI Neural Models/android/neural-melo-en/`](../android/) and the mobile production bundle `threadcast-melo-en-v2.zip`. Same engine, same file, two surfaces. Tier identifier in docs / engine tables remains `neural-52m`.

	---

	## Engine tiers at a glance

	\| Tier \| Subtree \| Architecture \| Params \| Runtime \| First-use download \| Extension UI label \|
	\|---\|---\|---\|---\|---\|---\|---\|
	\| Lite (CPU) \| `neural-28m/` \| Piper VITS \| ~28 M \| WASM single-thread \| ~63 MB per voice + ~10 MB shared espeak \| Neural · CPU \|
	\| GPU Lite \| `neural-melo-en/` \| MeloTTS VITS2 + BERT prosody assist \| ~52 M \| WebGPU (WASM fallback) \| ~177 MB single bundle (5 EN accents) \| Neural · GPU Lite \|
	\| Studio (GPU) \| `neural-82m/` \| Kokoro StyleTTS2 \| ~82 M \| WebGPU \| ~325 MB single bundle (11 voices) \| Neural · GPU \|

	GPU Lite sits between CPU and GPU on every axis: download size, VRAM, hardware floor, and output quality. It is designed for users whose hardware supports WebGPU but can't comfortably run the 82 M Studio model. It is the same engine as the mobile app's "Local AI Plus" tier; the extension surfaces it under a tier name that matches the CPU/GPU framing users already know.

	---

	## CPU tier — `neural-28m` — Piper (VITS · 28 M params · WASM)

	Five English voices, ~63 MB per voice. One voice loaded at a time. Single-thread WASM inference inside an MV3 offscreen document. Real-time on a modern laptop.

	\| Voice ID \| Speaker \| Notes \|
	\| ------------------------- \| ------------ \| --------------------------- \|
	\| `en_US-amy-medium` \| Amy \| Female · warm narrator \|
	\| `en_US-lessac-medium` \| Lessac \| Female · neutral, news-anchor \|
	\| `en_US-ryan-medium` \| Ryan \| Male · clear, newsreader \|
	\| `en_US-hfc_female-medium` \| HFC Female \| Female · crisp, modern \|
	\| `en_US-hfc_male-medium` \| HFC Male \| Male · crisp, modern \|

	Each voice ships as two files (`.onnx` + `.onnx.json`) under `neural-28m/en/en_US/<voice>/medium/`.
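
As a minimal sketch of that layout, the helper below derives both repo-relative paths from a voice ID. The name `piperVoicePaths` and the assumption that every voice ID has the shape `<locale>-<voice>-<quality>` are illustrative; neither is taken from the extension's source.

```typescript
// Hypothetical helper: derive the two repo-relative file paths for a Piper voice.
// Assumes IDs look like "en_US-amy-medium" (locale, voice, quality joined by "-").
interface PiperVoicePaths {
  model: string;  // the .onnx weights
  config: string; // the .onnx.json sidecar
}

function piperVoicePaths(voiceId: string): PiperVoicePaths {
  const parts = voiceId.split("-");
  if (parts.length !== 3) {
    throw new Error("Unexpected Piper voice ID: " + voiceId);
  }
  const locale = parts[0];               // e.g. "en_US"
  const voice = parts[1];                // e.g. "amy" or "hfc_female"
  const quality = parts[2];              // e.g. "medium"
  const language = locale.split("_")[0]; // e.g. "en"
  const dir =
    "neural-28m/" + language + "/" + locale + "/" + voice + "/" + quality + "/";
  return {
    model: dir + voiceId + ".onnx",
    config: dir + voiceId + ".onnx.json",
  };
}
```

For example, `piperVoicePaths("en_US-hfc_female-medium")` yields paths under `neural-28m/en/en_US/hfc_female/medium/`.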

	Upstream: [`diffusionstudio/piper-voices`](https://huggingface.co/diffusionstudio/piper-voices) → curated subset mirrored here.

	---

	## GPU Lite tier — `neural-melo-en` — MeloTTS English (VITS2 + BERT · ~52 M · WebGPU)

	A single ~171 MB model serves all 5 English accents via speaker-ID lookup at synth time. BERT prosody assist is baked into the ONNX graph, so no separate BERT input or model is needed. Inference is WebGPU-accelerated; on adapters without WebGPU support, ORT-Web falls back to single-thread WASM (slow but functional). MIT license.

	### Files

	\| File \| Size \| Purpose \|
	\|---\|---\|---\|
	\| `model.onnx` \| ~171 MB \| fp32 ONNX export — production default; same file the Android app ships via [`mobile-android/v1/threadcast-melo-en-v2.zip`](../mobile-android/v1/) \|
	\| `lexicon.txt` \| ~6 MB \| Enriched CMUdict-style lexicon (~250 k+ entries: base 129 k + CMUdict latest + g2p_en + Aquila-Resolve neural G2P + curated Reddit/tech/brand/modern-English terms + punctuation silence rules — including em-dash → short pause) \|
	\| `tokens.txt` \| ~1 KB \| Phoneme → integer-ID map (~219 entries, case-sensitive) \|
	\| `LICENSE` \| small \| MIT, retained from upstream \|

	No `espeak-ng-data/` here — MeloTTS embeds phonemization end-to-end via the CMUdict lexicon. Out-of-vocabulary tokens fall back to letter-by-letter spelling using single-letter lexicon entries.
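
The out-of-vocabulary fallback can be sketched roughly as follows. This assumes the lexicon has already been parsed into a word-to-phonemes map; the `Lexicon` type, the function name `phonemesFor`, and the choice to skip characters that have no single-letter entry are all assumptions for illustration, and the on-disk format of `lexicon.txt` is not reproduced here.

```typescript
// Hypothetical in-memory lexicon: lowercase word -> phoneme symbols.
type Lexicon = Record<string, string[]>;

function hasEntry(lexicon: Lexicon, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(lexicon, key);
}

function phonemesFor(word: string, lexicon: Lexicon): string[] {
  const key = word.toLowerCase();
  if (hasEntry(lexicon, key)) {
    return lexicon[key];
  }
  // Out-of-vocabulary: spell the word letter by letter,
  // using the single-letter entries of the same lexicon.
  let spelled: string[] = [];
  const letters = key.split("");
  for (let i = 0; i < letters.length; i++) {
    if (hasEntry(lexicon, letters[i])) {
      spelled = spelled.concat(lexicon[letters[i]]);
    }
    // Characters without a single-letter entry are skipped in this sketch.
  }
  return spelled;
}
```

With a toy lexicon that contains `g`, `p`, and `u` but not `gpu`, `phonemesFor("GPU", lexicon)` returns the concatenated phonemes of the three letters.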

	### Voices (5 EN accents — speaker IDs 0..4)

	\| `sid` \| Voice ID \| Name \| Accent \|
	\|---\|---\|---\|---\|
	\| 0 \| `default` \| Sarah \| Female · neutral, default \|
	\| 1 \| `en-us` \| Alice \| Female · American \|
	\| 2 \| `en-india` \| Priya \| Female · Indian English \|
	\| 3 \| `en-uk` \| Charlotte \| Female · British \|
	\| 4 \| `en-au` \| Olivia \| Female · Australian \|

	All speakers are female today; accent diversity is the differentiator. To synthesize a specific accent, pass the corresponding `sid` in the model's `sid` input tensor.

	### Model input contract

	Standard sherpa-onnx Melo VITS2 ONNX signature:

	```
	x              int64  (1, T)  — phoneme IDs (from lexicon lookup via tokens.txt)
	x_lengths      int64  (1,)    — T
	tones          int64  (1, T)  — tone IDs (mostly 7–10 for English), parallel to x
	sid            int64  (1,)    — speaker ID (0..4)
	noise_scale    float  (1,)    — 0.667 default
	noise_scale_w  float  (1,)    — 0.8 default
	length_scale   float  (1,)    — 1.0 / speed
	```

	Output: `y` float32 (1, 1, N) at 44 100 Hz mono.
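
To make the contract concrete, here is a small sketch that assembles the seven inputs for one utterance and converts the output length to seconds. Plain number arrays stand in for tensors so the snippet runs anywhere; a real runtime would wrap `x`, `x_lengths`, `tones`, and `sid` in int64 tensors. The names `MELO_SPEAKER_IDS`, `buildMeloFeeds`, and `meloDurationSeconds` are hypothetical.

```typescript
// Voice ID -> speaker ID, copied from the voices table above.
const MELO_SPEAKER_IDS: Record<string, number> = {
  "default": 0,
  "en-us": 1,
  "en-india": 2,
  "en-uk": 3,
  "en-au": 4,
};

interface MeloFeeds {
  x: number[];             // shape (1, T)
  x_lengths: number[];     // shape (1,)
  tones: number[];         // shape (1, T)
  sid: number[];           // shape (1,)
  noise_scale: number[];   // shape (1,)
  noise_scale_w: number[]; // shape (1,)
  length_scale: number[];  // shape (1,)
}

function buildMeloFeeds(
  phonemeIds: number[],
  toneIds: number[],
  voiceId: string,
  speed: number
): MeloFeeds {
  if (phonemeIds.length !== toneIds.length) {
    throw new Error("tones must be parallel to x (same length)");
  }
  if (!(speed > 0)) {
    throw new Error("speed must be positive");
  }
  if (!Object.prototype.hasOwnProperty.call(MELO_SPEAKER_IDS, voiceId)) {
    throw new Error("Unknown Melo voice ID: " + voiceId);
  }
  return {
    x: phonemeIds,
    x_lengths: [phonemeIds.length],
    tones: toneIds,
    sid: [MELO_SPEAKER_IDS[voiceId]],
    noise_scale: [0.667],
    noise_scale_w: [0.8],
    length_scale: [1.0 / speed], // speed > 1 shortens the audio
  };
}

// Duration of the output waveform y: N samples at 44 100 Hz mono.
function meloDurationSeconds(sampleCount: number): number {
  return sampleCount / 44100;
}
```

The phoneme and tone IDs passed in are whatever the lexicon and `tokens.txt` lookup produced; this sketch does not perform that lookup.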

	Upstream: [`csukuangfj/sherpa-onnx-vits-melo-tts-en`](https://huggingface.co/csukuangfj/sherpa-onnx-vits-melo-tts-en) (sherpa-onnx's MeloTTS English export). Original model: [`myshell-ai/MeloTTS-English`](https://huggingface.co/myshell-ai/MeloTTS-English) (PyTorch, MIT).

	### Why fp32 (not fp16)?

	The architecture and weights are the same as mobile's Plus tier, except that mobile ships them as fp16 for the ARM NEON SIMD speed win on-device. The browser story is different:

	- ORT-Web WebGPU's fp16 path depends on the optional `shader-f16` feature, which some WebGPU adapters don't expose. On those, fp16 runs at fp32 speed anyway.
	- ORT-Web WASM has no native fp16 kernels. An fp16 model gets up-cast at load time, so it would shrink the download but gain nothing on inference speed.
	- Audio-quality A/B between fp16 and fp32 hasn't been run on a WebGPU listening setup yet. Vocoder-family models have documented fp16 sensitivity (subnormal weights can clamp on conversion → audible artifacts on sibilants), and a per-platform listening test was deferred.

	Net: fp32 is the safer browser choice. If a WebGPU + headphones A/B later validates fp16, the engine config flips with no other changes (the fp16 file already exists at [`AI Neural Models/android/neural-melo-en/model.fp16.onnx`](../android/) for upload when the time comes).
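
The speed half of that reasoning can be expressed as a small predicate. This is a sketch, not the extension's engine config: `fp16CouldSpeedUp` and `MeloBackend` are hypothetical names, and the feature list is passed in as a plain array so the snippet runs outside a browser (in a browser it would come from the WebGPU adapter's `features` set). It says nothing about audio quality, which is the part still awaiting a listening test.

```typescript
type MeloBackend = "webgpu" | "wasm";

// True only when fp16 weights could actually run faster than fp32:
// WebGPU needs the optional "shader-f16" feature, and WASM has no
// native fp16 kernels, so it never benefits.
function fp16CouldSpeedUp(
  backend: MeloBackend,
  adapterFeatures: string[]
): boolean {
  if (backend !== "webgpu") {
    return false;
  }
  return adapterFeatures.indexOf("shader-f16") !== -1;
}
```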

	---

	## GPU tier — `neural-82m` — Kokoro 82 M (ONNX · WebGPU)

	A single Kokoro model unlocks 11 distinct voices at once via 11 small speaker-embedding files. WebGPU-accelerated inference, ~10× real-time on a modern GPU.

	### Model file

	\| File \| Precision \| Size \| Status \|
	\| --------------------------------- \| --------- \| -------- \| ------ \|
	\| `neural-82m/onnx/model.onnx` \| fp32 \| ~325 MB \| ✅ Production default — stable on every WebGPU runtime \|
	\| `neural-82m/onnx/model_fp16.onnx` \| fp16 \| ~165 MB \| ⚠️ Reserved for future use — blocked today by upstream `onnxruntime-web` fp16 bugs ([microsoft/onnxruntime#23403](https://github.com/microsoft/onnxruntime/issues/23403), [#26732](https://github.com/microsoft/onnxruntime/issues/26732)) \|

	The fp16 file is staged here so once the upstream JS stack lands fp16+WebGPU fixes, ThreadCast can flip the default to fp16 with a single config change — halving the download and roughly doubling per-segment speed on capable GPUs.

	### Tokenizer + config

	`tokenizer.json`, `tokenizer_config.json`, `config.json` — small files used by [`@huggingface/transformers`](https://www.npmjs.com/package/@huggingface/transformers) (transformers.js) when loading the model.

	### Voices (`neural-82m/voices/*.bin`, ~520 KB each)

	\| Voice ID \| Name \| Accent \| Gender \|
	\| -------------- \| --------- \| --------- \| ------ \|
	\| `af_bella` \| Bella \| American \| Female \|
	\| `af_sarah` \| Sarah \| American \| Female \|
	\| `af_nova` \| Nova \| American \| Female \|
	\| `af_sky` \| Sky \| American \| Female \|
	\| `am_adam` \| Adam \| American \| Male \|
	\| `am_michael` \| Michael \| American \| Male \|
	\| `am_echo` \| Echo \| American \| Male \|
	\| `bf_emma` \| Emma \| British \| Female \|
	\| `bf_isabella` \| Isabella \| British \| Female \|
	\| `bm_george` \| George \| British \| Male \|
	\| `bm_daniel` \| Daniel \| British \| Male \|

	Voice IDs encode locale and gender: first letter = accent (`a` = American, `b` = British), second letter = gender (`f` = female, `m` = male).
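
A minimal sketch of decoding that convention, assuming every ID matches the pattern of the 11 voices listed above (two prefix letters, an underscore, then a lowercase name). The function name `parseKokoroVoiceId` is illustrative.

```typescript
interface KokoroVoice {
  accent: "American" | "British";
  gender: "Female" | "Male";
  name: string;
}

function parseKokoroVoiceId(voiceId: string): KokoroVoice {
  // First letter: accent (a/b). Second letter: gender (f/m).
  const match = /^([ab])([fm])_([a-z]+)$/.exec(voiceId);
  if (match === null) {
    throw new Error("Unexpected Kokoro voice ID: " + voiceId);
  }
  const accent = match[1] === "a" ? "American" : "British";
  const gender = match[2] === "f" ? "Female" : "Male";
  const raw = match[3];
  // Capitalize the speaker name for display, e.g. "bella" -> "Bella".
  const name = raw.charAt(0).toUpperCase() + raw.slice(1);
  return { accent: accent, gender: gender, name: name };
}
```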

	Upstream: model from [`onnx-community/Kokoro-82M-v1.0-ONNX-timestamped`](https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX-timestamped); voice embeddings from [`onnx-community/Kokoro-82M-v1.0-ONNX`](https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX).

	---

	## How the extension uses these files

	The ThreadCast extension fetches model files lazily, only when the user selects a Neural engine and presses Test/Play. Files are cached in the browser's Cache API and reused across sessions, so the user pays the download cost exactly once per profile.
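
The fetch-once behaviour follows a cache-first pattern, sketched below under simplifying assumptions. The `ModelStore` interface is a hypothetical stand-in shaped loosely like the browser Cache API (`match` / `put`), but it stores raw bytes rather than `Request`/`Response` pairs, and the `download` callback stands in for the network fetch. None of these names come from the extension's source.

```typescript
// Hypothetical byte store, loosely modelled on the Cache API's match/put.
interface ModelStore {
  match(url: string): Promise<Uint8Array | undefined>;
  put(url: string, bytes: Uint8Array): Promise<void>;
}

// Return the cached bytes when present; otherwise download once,
// store the result, and return it.
async function getModelFile(
  url: string,
  store: ModelStore,
  download: (url: string) => Promise<Uint8Array>
): Promise<Uint8Array> {
  const cached = await store.match(url);
  if (cached !== undefined) {
    return cached; // later sessions: no network cost
  }
  const bytes = await download(url);
  await store.put(url, bytes);
  return bytes;
}
```

Because the store is injected, the same function can be exercised with an in-memory map in tests and with a persistent cache in the browser.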

	\| Engine \| Files fetched on first use \|
	\| --------- \| --------------------------------------------------------------- \|
	\| System voices \| None — uses OS / browser TTS \|
	\| Neural · CPU \| The selected voice's `.onnx` + `.onnx.json` (~63 MB total) \|
	\| Neural · GPU Lite \| `neural-melo-en/{model.onnx, lexicon.txt, tokens.txt}` (~177 MB total — all 5 EN accents in one bundle) \|
	\| Neural · GPU \| `onnx/model.onnx` + tokenizer (~326 MB) + 11 voice `.bin` (~5.7 MB) \|
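
For a rough sense of the totals, the figures in the table can be summed per engine. The sizes below are the approximate values quoted above, and the names `FIRST_USE_MB` and `firstUseDownloadMb` are illustrative only.

```typescript
// Approximate first-use download sizes in MB, copied from the table above.
const FIRST_USE_MB: Record<string, number[]> = {
  "System voices": [],
  "Neural · CPU": [63],         // one voice: .onnx + .onnx.json
  "Neural · GPU Lite": [177],   // model + lexicon + tokens
  "Neural · GPU": [326, 5.7],   // model + tokenizer, then 11 voice .bin files
};

function firstUseDownloadMb(engine: string): number {
  if (!Object.prototype.hasOwnProperty.call(FIRST_USE_MB, engine)) {
    throw new Error("Unknown engine: " + engine);
  }
  const parts = FIRST_USE_MB[engine];
  let total = 0;
  for (let i = 0; i < parts.length; i++) {
    total += parts[i];
  }
  return total;
}
```

Summed this way, the Neural · GPU engine comes to roughly 332 MB on first use, of which the voice embeddings are under 2%.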

	The WASM runtimes (ONNX Runtime, Piper phonemizer) are bundled inside the extension package itself — not served from this repo — to comply with Manifest V3 CSP and avoid CDN dependencies.

	---

	## License

	Per-project licenses retained from upstream — see the [parent README](../README.md#license) for the consolidated summary.