ThreadCast — Chrome Extension Neural Models Mirror

Hugging Face transformers.js–format mirror of the on-device neural TTS models used by the ThreadCast Chrome extension. The Android counterpart lives in two siblings: ../android/ (local dev staging — sherpa-onnx upstream artifacts) and ../mobile-android/ (production zips downloaded by the Android app at runtime). See the parent README for repository-wide context, branding, and license summary.

If you're an extension user, you don't need anything here — the extension downloads what it needs automatically the first time you select a Neural engine. This page is for transparency, contributors, and forks.

Layout

extension/
├── neural-28m/                    # Piper voices for the CPU (Lite) engine
│   └── en/en_US/<voice>/medium/
│       ├── en_US-<voice>-medium.onnx
│       └── en_US-<voice>-medium.onnx.json
├── neural-melo-en/                # MeloTTS for the GPU Lite engine (mobile surfaces same engine as "Plus")
│   ├── model.onnx                  # fp32 — production default
│   ├── lexicon.txt                 # enriched CMUdict-style lexicon
│   ├── tokens.txt                  # phoneme → ID map
│   └── LICENSE
└── neural-82m/                    # Kokoro model + voices for the GPU (Studio) engine
    ├── onnx/
    │   ├── model.onnx              # fp32 — production default
    │   └── model_fp16.onnx         # fp16 — experimental, blocked by upstream bugs
    ├── tokenizer.json
    ├── tokenizer_config.json
    ├── config.json
    └── voices/                     # 11 speaker embeddings
        └── af_bella.bin … bm_daniel.bin

Naming note: neural-28m / neural-82m encode the parameter count in their folder name (CPU and GPU tiers, respectively). neural-melo-en breaks that convention — MeloTTS at ~52 M params would naturally be neural-52m, but the folder + file naming aligns with the local staging tree at AI Neural Models/android/neural-melo-en/ and the mobile production bundle threadcast-melo-en-v2.zip. Same engine, same file, two surfaces. Tier identifier in docs / engine tables remains neural-52m.

Engine tiers at a glance

Tier	Subtree	Architecture	Params	Runtime	First-use download	Extension UI label
Lite (CPU)	`neural-28m/`	Piper VITS	~28 M	WASM single-thread	~63 MB per voice + ~10 MB shared espeak	Neural · CPU
GPU Lite	`neural-melo-en/`	MeloTTS VITS2 + BERT prosody assist	~52 M	WebGPU (WASM fallback)	~177 MB single bundle (5 EN accents)	Neural · GPU Lite
Studio (GPU)	`neural-82m/`	Kokoro StyleTTS2	~82 M	WebGPU	~325 MB single bundle (11 voices)	Neural · GPU

GPU Lite sits between CPU and GPU on every axis: download size, VRAM, hardware floor, and output quality. It is designed for users whose hardware supports WebGPU but can't comfortably run the 82 M Studio model. It is the same engine as the mobile app's "Local AI Plus" tier; the extension surfaces it under a tier name that matches the CPU/GPU framing users already know.

CPU tier — `neural-28m` — Piper (VITS · 28 M params · WASM)

Five English voices, ~63 MB per voice. One voice loaded at a time. Single-thread WASM inference inside an MV3 offscreen document. Real-time on a modern laptop.

Voice ID	Speaker	Notes
`en_US-amy-medium`	Amy	Female · warm narrator
`en_US-lessac-medium`	Lessac	Female · neutral, news-anchor
`en_US-ryan-medium`	Ryan	Male · clear, newsreader
`en_US-hfc_female-medium`	HFC Female	Female · crisp, modern
`en_US-hfc_male-medium`	HFC Male	Male · crisp, modern

Each voice ships as two files (*.onnx + *.onnx.json) under neural-28m/en/en_US/<voice>/medium/.
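
As a rough illustration of that layout, the two repo-relative paths can be derived from a voice ID. The helper name `piperVoiceFiles` and the accepted character set are assumptions made for this sketch, not code from the extension.

```typescript
// Hypothetical helper: derive the repo-relative paths of a Piper voice.
interface PiperVoiceFiles {
  model: string;  // *.onnx
  config: string; // *.onnx.json
}

function piperVoiceFiles(voiceId: string): PiperVoiceFiles {
  // Voice IDs look like "en_US-<voice>-medium"; <voice> may contain underscores.
  const match = /^en_US-([A-Za-z0-9_]+)-medium$/.exec(voiceId);
  if (!match) {
    throw new Error(`Unexpected Piper voice ID: ${voiceId}`);
  }
  const dir = `neural-28m/en/en_US/${match[1]}/medium`;
  return {
    model: `${dir}/${voiceId}.onnx`,
    config: `${dir}/${voiceId}.onnx.json`,
  };
}
```

For example, `piperVoiceFiles("en_US-amy-medium").model` resolves to `neural-28m/en/en_US/amy/medium/en_US-amy-medium.onnx`.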

Upstream: diffusionstudio/piper-voices → curated subset mirrored here.

GPU Lite tier — `neural-melo-en` — MeloTTS English (VITS2 + BERT · ~52 M · WebGPU)

A single ~171 MB model serves all 5 English accents via speaker-ID lookup at synth time. BERT prosody assist is baked into the ONNX graph, so there is no separate BERT input or model. Inference is WebGPU-accelerated; where WebGPU is unavailable, ORT-Web falls back to single-thread WASM (slow but functional). MIT license.
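
One way to make that fallback explicit is to probe for a WebGPU adapter before choosing ONNX Runtime Web execution providers. This is a minimal sketch, not the extension's actual code: `pickExecutionProviders` and `GpuLike` are hypothetical names, and the `"webgpu"` / `"wasm"` strings are the provider names ORT-Web accepts.

```typescript
// Minimal shape of navigator.gpu that this sketch needs.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}

// Returns an ordered provider preference list for the session options.
async function pickExecutionProviders(gpu?: GpuLike): Promise<string[]> {
  // No WebGPU entry point at all: WASM only.
  if (!gpu) return ["wasm"];
  // requestAdapter() resolves to null when no suitable adapter exists.
  const adapter = await gpu.requestAdapter().catch(() => null);
  return adapter ? ["webgpu", "wasm"] : ["wasm"];
}
```

In a browser context the argument would be `navigator.gpu`, which is undefined where WebGPU is not implemented.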

Files

File	Size	Purpose
`model.onnx`	~171 MB	fp32 ONNX export — production default; same file the Android app ships via `mobile-android/v1/threadcast-melo-en-v2.zip`
`lexicon.txt`	~6 MB	Enriched CMUdict-style lexicon (~250 k+ entries: base 129 k + CMUdict latest + g2p_en + Aquila-Resolve neural G2P + curated Reddit/tech/brand/modern-English terms + punctuation silence rules — including em-dash → short pause)
`tokens.txt`	~1 KB	Phoneme → integer-ID map (~219 entries, case-sensitive)
`LICENSE`	small	MIT, retained from upstream

No espeak-ng-data/ here — MeloTTS embeds phonemization end-to-end via the CMUdict lexicon. Out-of-vocabulary tokens fall back to letter-by-letter spelling using single-letter lexicon entries.
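
The out-of-vocabulary fallback can be sketched as follows. The sketch assumes the lexicon has already been parsed into a map from lowercase word to phoneme tokens; that in-memory shape, the lowercasing, and the name `lookupPhonemes` are assumptions for illustration, and the phoneme symbols in the usage note are made up.

```typescript
// Assumed in-memory form of lexicon.txt: lowercase word -> phoneme tokens.
type Lexicon = Map<string, string[]>;

function lookupPhonemes(word: string, lexicon: Lexicon): string[] {
  const key = word.toLowerCase();
  const direct = lexicon.get(key);
  if (direct) return direct;

  // Out-of-vocabulary: spell the word letter by letter using the
  // single-letter entries. Characters without an entry are skipped.
  const spelled: string[] = [];
  for (const ch of key) {
    const letter = lexicon.get(ch);
    if (letter) spelled.push(...letter);
  }
  return spelled;
}
```

With entries for `o` and `k` but none for `ok`, `lookupPhonemes("OK", lexicon)` returns the phonemes for "o" followed by those for "k".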

Voices (5 EN accents — speaker IDs 0..4)

`sid`	Voice ID	Name	Accent
0	`default`	Sarah	Female · neutral, default
1	`en-us`	Alice	Female · American
2	`en-india`	Priya	Female · Indian English
3	`en-uk`	Charlotte	Female · British
4	`en-au`	Olivia	Female · Australian

All speakers are female today; accent diversity is the differentiator. To synthesize a specific accent, pass the corresponding speaker ID as the model's sid input.

Model input contract

Standard sherpa-onnx Melo VITS2 ONNX signature:

x          int64 (1, T)  — phoneme IDs (from lexicon lookup via tokens.txt)
x_lengths  int64 (1,)    — T
tones      int64 (1, T)  — tone IDs (mostly 7–10 for English), parallel to x
sid        int64 (1,)    — speaker ID (0..4)
noise_scale       float (1,) — 0.667 default
noise_scale_w     float (1,) — 0.8 default
length_scale      float (1,) — 1.0 / speed
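
A hedged sketch of assembling those inputs is shown here. It produces plain typed arrays plus shapes rather than runtime tensors, so it runs without any library; `buildMeloFeeds`, `TensorSpec`, and `MELO_SID` are hypothetical names, and the code needs an ES2020 target for `BigInt64Array`.

```typescript
// Voice ID -> speaker ID, copied from the voices table.
const MELO_SID: Record<string, number> = {
  "default": 0, "en-us": 1, "en-india": 2, "en-uk": 3, "en-au": 4,
};

interface TensorSpec<T> { data: T; dims: number[]; }

interface MeloFeeds {
  x: TensorSpec<BigInt64Array>;
  x_lengths: TensorSpec<BigInt64Array>;
  tones: TensorSpec<BigInt64Array>;
  sid: TensorSpec<BigInt64Array>;
  noise_scale: TensorSpec<Float32Array>;
  noise_scale_w: TensorSpec<Float32Array>;
  length_scale: TensorSpec<Float32Array>;
}

function buildMeloFeeds(
  phonemeIds: number[], toneIds: number[], voiceId: string, speed = 1.0,
): MeloFeeds {
  if (phonemeIds.length !== toneIds.length) {
    throw new Error("tones must be parallel to x");
  }
  const sid = MELO_SID[voiceId];
  if (sid === undefined) throw new Error(`Unknown voice: ${voiceId}`);
  if (!(speed > 0)) throw new Error("speed must be positive");

  const i64 = (values: number[]): BigInt64Array =>
    BigInt64Array.from(values.map((v) => BigInt(v)));
  const f32 = (value: number): TensorSpec<Float32Array> =>
    ({ data: Float32Array.of(value), dims: [1] });
  const T = phonemeIds.length;

  return {
    x: { data: i64(phonemeIds), dims: [1, T] },
    x_lengths: { data: i64([T]), dims: [1] },
    tones: { data: i64(toneIds), dims: [1, T] },
    sid: { data: i64([sid]), dims: [1] },
    noise_scale: f32(0.667),
    noise_scale_w: f32(0.8),
    length_scale: f32(1.0 / speed), // faster speech means a smaller scale
  };
}
```

With onnxruntime-web, each entry would typically be wrapped as `new ort.Tensor("int64" | "float32", data, dims)` before calling `session.run`.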

Output: y float32 (1, 1, N) at 44 100 Hz mono.
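
For orientation, the output sample count translates into playback duration by dividing by the sample rate. The sketch below also converts to 16-bit PCM, assuming the float samples are nominally in [-1, 1]; that range and both helper names are assumptions for illustration.

```typescript
const MELO_SAMPLE_RATE_HZ = 44100;

// Duration of N mono samples at the model's output rate.
function durationSeconds(sampleCount: number): number {
  return sampleCount / MELO_SAMPLE_RATE_HZ;
}

// Convert float samples (assumed range [-1, 1]) to signed 16-bit PCM,
// clamping anything outside that range.
function toPcm16(samples: Float32Array): Int16Array {
  return Int16Array.from(samples, (s) =>
    Math.round(Math.max(-1, Math.min(1, s)) * 32767));
}
```

An output with N = 88 200 therefore corresponds to two seconds of audio.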

Upstream: csukuangfj/sherpa-onnx-vits-melo-tts-en (sherpa-onnx's MeloTTS English export). Original model: myshell-ai/MeloTTS-English (PyTorch, MIT).

Why fp32 (not fp16)?

Same architecture, same weights, same file as mobile's Plus tier — except mobile ships fp16 for the ARM NEON SIMD speed win on-device. The browser story is different:

ORT-Web WebGPU's fp16 path depends on the optional shader-f16 extension, which some WebGPU adapters don't expose. On those, fp16 runs at fp32 speed anyway.
ORT-Web WASM has no native fp16 kernels: an fp16 model gets up-cast at load time, so it saves download size but gains nothing on inference speed.
Audio-quality A/B between fp16 and fp32 hasn't been run on a WebGPU listening setup yet. Vocoder-family models have documented fp16 sensitivity (subnormal weights can clamp on conversion → audible artifacts on sibilants), and a per-platform listening test was deferred.
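
The first point can be checked at runtime, since WebGPU adapters advertise optional features through a set-like `features` object. A minimal sketch follows; `nativeFp16Available` and `AdapterLike` are hypothetical names. It only answers the speed question and says nothing about audio quality, so it does not replace the listening test.

```typescript
// Minimal shape of a GPUAdapter that this sketch needs.
interface AdapterLike {
  features: { has(name: string): boolean };
}

// True only when the adapter exposes the optional "shader-f16" feature,
// i.e. when fp16 shaders can actually run natively.
function nativeFp16Available(adapter: AdapterLike | null): boolean {
  return adapter !== null && adapter.features.has("shader-f16");
}
```

In a browser the argument would be the result of `await navigator.gpu.requestAdapter()`.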

Net: fp32 is the safer browser choice. If a WebGPU + headphones A/B later validates fp16, the engine config flips with no other changes (the fp16 file already exists at AI Neural Models/android/neural-melo-en/model.fp16.onnx for upload when the time comes).

GPU tier — `neural-82m` — Kokoro 82 M (ONNX · WebGPU)

A single Kokoro model unlocks 11 distinct voices at once via 11 small speaker-embedding files. WebGPU-accelerated inference, ~10× real-time on a modern GPU.

Model file

File	Precision	Size	Status
`neural-82m/onnx/model.onnx`	fp32	~325 MB	✅ Production default — stable on every WebGPU runtime
`neural-82m/onnx/model_fp16.onnx`	fp16	~165 MB	⚠️ Reserved for future use — blocked today by upstream `onnxruntime-web` fp16 bugs (microsoft/onnxruntime#23403, #26732)

The fp16 file is staged here so once the upstream JS stack lands fp16+WebGPU fixes, ThreadCast can flip the default to fp16 with a single config change — halving the download and roughly doubling per-segment speed on capable GPUs.

Tokenizer + config

tokenizer.json, tokenizer_config.json, config.json — small files used by @huggingface/transformers (transformers.js) when loading the model.

Voices (`neural-82m/voices/*.bin`, ~520 KB each)

Voice ID	Name	Accent	Gender
`af_bella`	Bella	American	Female
`af_sarah`	Sarah	American	Female
`af_nova`	Nova	American	Female
`af_sky`	Sky	American	Female
`am_adam`	Adam	American	Male
`am_michael`	Michael	American	Male
`am_echo`	Echo	American	Male
`bf_emma`	Emma	British	Female
`bf_isabella`	Isabella	British	Female
`bm_george`	George	British	Male
`bm_daniel`	Daniel	British	Male

Voice IDs encode locale and gender: first letter = accent (a = American, b = British), second letter = gender (f = female, m = male).
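
That naming scheme can be decoded mechanically. The sketch below is illustrative only: `parseKokoroVoiceId` is a hypothetical helper, and the regular expression covers just the two accents and two genders listed in the table.

```typescript
const KOKORO_ACCENTS = { a: "American", b: "British" } as const;
const KOKORO_GENDERS = { f: "Female", m: "Male" } as const;

interface KokoroVoice {
  accent: string;
  gender: string;
  name: string;
  file: string; // repo-relative path of the speaker embedding
}

function parseKokoroVoiceId(voiceId: string): KokoroVoice {
  // First letter = accent, second letter = gender, then "_<name>".
  const m = /^([ab])([fm])_([a-z]+)$/.exec(voiceId);
  if (!m) throw new Error(`Unexpected Kokoro voice ID: ${voiceId}`);
  return {
    accent: KOKORO_ACCENTS[m[1] as keyof typeof KOKORO_ACCENTS],
    gender: KOKORO_GENDERS[m[2] as keyof typeof KOKORO_GENDERS],
    name: m[3] ?? "",
    file: `neural-82m/voices/${voiceId}.bin`,
  };
}
```

For example, `parseKokoroVoiceId("bm_george")` yields a British male voice whose embedding lives at `neural-82m/voices/bm_george.bin`.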

Upstream: model from onnx-community/Kokoro-82M-v1.0-ONNX-timestamped; voice embeddings from onnx-community/Kokoro-82M-v1.0-ONNX.

How the extension uses these files

The ThreadCast extension fetches model files lazily, only when the user selects a Neural engine and presses Test/Play. Files are cached in the browser's Cache API and reused across sessions, so the user pays the download cost exactly once per profile.
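
The download-once behaviour follows the usual cache-first pattern. Below is a minimal sketch of that pattern with the cache and fetch function injected so it runs outside a browser; `fetchWithCache`, `CacheLike`, and `ResponseLike` are hypothetical names, not the extension's code.

```typescript
// Minimal shapes of Response and Cache that this sketch needs.
interface ResponseLike {
  ok: boolean;
  clone(): ResponseLike;
}

interface CacheLike {
  match(url: string): Promise<ResponseLike | undefined>;
  put(url: string, response: ResponseLike): Promise<void>;
}

async function fetchWithCache(
  url: string,
  cache: CacheLike,
  fetchFn: (url: string) => Promise<ResponseLike>,
): Promise<ResponseLike> {
  // Cache hit: no network traffic at all.
  const hit = await cache.match(url);
  if (hit) return hit;

  const response = await fetchFn(url);
  if (!response.ok) throw new Error(`Download failed: ${url}`);

  // Store a clone, because a response body can only be consumed once.
  await cache.put(url, response.clone());
  return response;
}
```

In the browser, `cache` would come from `await caches.open(name)` (the cache name is up to the caller) and `fetchFn` would be the global `fetch`.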

Engine	Files fetched on first use
System voices	None — uses OS / browser TTS
Neural · CPU	The selected voice's `.onnx` + `.onnx.json` (~63 MB total)
Neural · GPU Lite	`neural-melo-en/{model.onnx, lexicon.txt, tokens.txt}` (~177 MB total — all 5 EN accents in one bundle)
Neural · GPU	`onnx/model.onnx` + tokenizer (~326 MB) + 11 voice `.bin` (~5.7 MB)

The WASM runtimes (ONNX Runtime, Piper phonemizer) are bundled inside the extension package itself — not served from this repo — to comply with Manifest V3 CSP and avoid CDN dependencies.

License

Per-project licenses retained from upstream — see the parent README for the consolidated summary.