ThreadCast: Chrome Extension Neural Models Mirror

Hugging Face mirror, in transformers.js format, of the on-device neural TTS models used by the ThreadCast Chrome extension. The Android counterpart lives in two sibling directories: ../android/ (local dev staging of sherpa-onnx upstream artifacts) and ../mobile-android/ (production zips downloaded by the Android app at runtime). See the parent README for repository-wide context, branding, and license summary.

If you're an extension user, you don't need anything here: the extension downloads what it needs automatically the first time you select a Neural engine. This page is for transparency, contributors, and forks.


Layout

extension/
├── neural-28m/                    # Piper voices for the CPU (Lite) engine
│   └── en/en_US/<voice>/medium/
│       ├── en_US-<voice>-medium.onnx
│       └── en_US-<voice>-medium.onnx.json
├── neural-melo-en/                # MeloTTS for the GPU Lite engine (mobile surfaces same engine as "Plus")
│   ├── model.onnx                  # fp32, production default
│   ├── lexicon.txt                 # enriched CMUdict-style lexicon
│   ├── tokens.txt                  # phoneme → ID map
│   └── LICENSE
└── neural-82m/                    # Kokoro model + voices for the GPU (Studio) engine
    ├── onnx/
    │   ├── model.onnx              # fp32, production default
    │   └── model_fp16.onnx         # fp16, experimental, blocked by upstream bugs
    ├── tokenizer.json
    ├── tokenizer_config.json
    ├── config.json
    └── voices/                     # 11 speaker embeddings
        └── af_bella.bin … bm_daniel.bin

Naming note: neural-28m and neural-82m encode the parameter count in their folder names (CPU and GPU tiers, respectively). neural-melo-en breaks that convention. MeloTTS at ~52 M params would naturally be neural-52m, but the folder and file naming aligns with the local staging tree at AI Neural Models/android/neural-melo-en/ and the mobile production bundle threadcast-melo-en-v2.zip. Same engine, same file, two surfaces. The tier identifier in docs and engine tables remains neural-52m.


Engine tiers at a glance

| Tier | Subtree | Architecture | Params | Runtime | First-use download | Extension UI label |
| --- | --- | --- | --- | --- | --- | --- |
| Lite (CPU) | neural-28m/ | Piper VITS | ~28 M | WASM single-thread | ~63 MB per voice + ~10 MB shared espeak | Neural · CPU |
| GPU Lite | neural-melo-en/ | MeloTTS VITS2 + BERT prosody assist | ~52 M | WebGPU (WASM fallback) | ~177 MB single bundle (5 EN accents) | Neural · GPU Lite |
| Studio (GPU) | neural-82m/ | Kokoro StyleTTS2 | ~82 M | WebGPU | ~325 MB single bundle (11 voices) | Neural · GPU |

GPU Lite sits between CPU and GPU on every axis: download size, VRAM, hardware floor, and output quality. It is designed for users whose hardware supports WebGPU but can't comfortably run the 82 M Studio model. It is the same engine as the mobile app's "Local AI Plus" tier; the extension surfaces it under a tier name that matches the CPU/GPU framing users already know.
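
As a rough illustration of how the tier table and the naming note fit together, the sketch below models each tier as a record and resolves a tier identifier to its folder in this mirror. The type and function names (EngineTier, subtreeForTier) are hypothetical and not taken from the extension's source; only the values come from the table above.

```typescript
// Hypothetical record shape; values are copied from the tier table above.
interface EngineTier {
  tierId: string;        // identifier used in docs and engine tables
  subtree: string;       // folder inside extension/ in this mirror
  approxParamsM: number; // approximate parameter count, in millions
  runtime: string;
}

const ENGINE_TIERS: EngineTier[] = [
  { tierId: "neural-28m", subtree: "neural-28m/", approxParamsM: 28, runtime: "WASM single-thread" },
  // The tier ID stays neural-52m even though the folder is neural-melo-en/.
  { tierId: "neural-52m", subtree: "neural-melo-en/", approxParamsM: 52, runtime: "WebGPU (WASM fallback)" },
  { tierId: "neural-82m", subtree: "neural-82m/", approxParamsM: 82, runtime: "WebGPU" },
];

// Returns the mirror folder for a tier ID, or undefined for an unknown ID.
function subtreeForTier(tierId: string): string | undefined {
  for (const tier of ENGINE_TIERS) {
    if (tier.tierId === tierId) return tier.subtree;
  }
  return undefined;
}
```

Keeping the identifier and the folder as separate fields avoids deriving one from the other, which is exactly where the neural-52m / neural-melo-en mismatch would otherwise cause a wrong path.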


CPU tier · neural-28m · Piper (VITS · 28 M params · WASM)

Five English voices, ~63 MB per voice. One voice loaded at a time. Single-thread WASM inference inside an MV3 offscreen document. Real-time on a modern laptop.

| Voice ID | Speaker | Notes |
| --- | --- | --- |
| en_US-amy-medium | Amy | Female · warm narrator |
| en_US-lessac-medium | Lessac | Female · neutral, news-anchor |
| en_US-ryan-medium | Ryan | Male · clear, newsreader |
| en_US-hfc_female-medium | HFC Female | Female · crisp, modern |
| en_US-hfc_male-medium | HFC Male | Male · crisp, modern |

Each voice ships as two files (*.onnx + *.onnx.json) under neural-28m/en/en_US/<voice>/medium/.
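
The path convention above can be sketched as a small helper that expands a voice ID into its two file paths. The function name piperVoiceFiles is hypothetical, and the sketch assumes every voice ID has the form <locale>-<voice>-<quality>, as the five IDs in the table do.

```typescript
// Expands a Piper voice ID such as "en_US-amy-medium" into the two
// mirror-relative paths (.onnx and .onnx.json) described above.
function piperVoiceFiles(voiceId: string): string[] {
  const parts = voiceId.split("-");
  if (parts.length !== 3) {
    throw new Error("unexpected voice ID: " + voiceId);
  }
  const locale = parts[0];            // e.g. "en_US"
  const voice = parts[1];             // e.g. "amy" or "hfc_female"
  const quality = parts[2];           // e.g. "medium"
  const lang = locale.split("_")[0];  // e.g. "en"
  const dir = "neural-28m/" + lang + "/" + locale + "/" + voice + "/" + quality + "/";
  return [dir + voiceId + ".onnx", dir + voiceId + ".onnx.json"];
}
```

For example, piperVoiceFiles("en_US-hfc_female-medium") yields paths under neural-28m/en/en_US/hfc_female/medium/; the underscore inside the voice name is preserved because the ID is split on hyphens only.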

Upstream: diffusionstudio/piper-voices → curated subset mirrored here.


GPU Lite tier · neural-melo-en · MeloTTS English (VITS2 + BERT · ~52 M · WebGPU)

A single ~171 MB model serves all 5 English accents via speaker-ID lookup at synth time. BERT prosody assist is baked into the ONNX graph, so no separate BERT input or model is needed. Inference is WebGPU-accelerated; on adapters without WebGPU support, ORT-Web falls back to single-thread WASM (slow but functional). MIT license.

Files

| File | Size | Purpose |
| --- | --- | --- |
| model.onnx | ~171 MB | fp32 ONNX export, production default; same file the Android app ships via mobile-android/v1/threadcast-melo-en-v2.zip |
| lexicon.txt | ~6 MB | Enriched CMUdict-style lexicon (~250 k+ entries: base 129 k + CMUdict latest + g2p_en + Aquila-Resolve neural G2P + curated Reddit/tech/brand/modern-English terms + punctuation silence rules, including em-dash → short pause) |
| tokens.txt | ~1 KB | Phoneme → integer-ID map (~219 entries, case-sensitive) |
| LICENSE | small | MIT, retained from upstream |

No espeak-ng-data/ here: MeloTTS embeds phonemization end-to-end via the CMUdict lexicon. Out-of-vocabulary tokens fall back to letter-by-letter spelling using single-letter lexicon entries.
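
The out-of-vocabulary fallback described above can be sketched as follows. This is a minimal illustration, not the extension's code: it assumes the lexicon has already been parsed into an in-memory word-to-phonemes map keyed by lowercase words, it ignores the tone IDs that accompany each phoneme, and the names Lexicon and phonemesFor are hypothetical.

```typescript
// Assumed in-memory form of lexicon.txt: lowercase word -> phoneme symbols.
type Lexicon = { [word: string]: string[] };

function hasEntry(lexicon: Lexicon, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(lexicon, key);
}

// Looks a word up in the lexicon; if it is missing, spells it out letter by
// letter using the single-letter entries. Letters without an entry are skipped.
function phonemesFor(word: string, lexicon: Lexicon): string[] {
  const key = word.toLowerCase();
  if (hasEntry(lexicon, key)) return lexicon[key].slice();
  const spelled: string[] = [];
  const letters = key.split("");
  for (const letter of letters) {
    if (hasEntry(lexicon, letter)) {
      for (const phoneme of lexicon[letter]) spelled.push(phoneme);
    }
  }
  return spelled;
}
```

With a toy lexicon containing only "hi", "a", and "b", the word "hi" resolves directly, while "ab" is absent and is spelled out as the phonemes for "a" followed by those for "b".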

Voices (5 EN accents β€” speaker IDs 0..4)

| sid | Voice ID | Name | Accent |
| --- | --- | --- | --- |
| 0 | default | Sarah | Female · neutral, default |
| 1 | en-us | Alice | Female · American |
| 2 | en-india | Priya | Female · Indian English |
| 3 | en-uk | Charlotte | Female · British |
| 4 | en-au | Olivia | Female · Australian |

All speakers are female today; accent diversity is the differentiator. To synthesize a specific accent, pass the corresponding sid in the model's sid input tensor.
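
A minimal sketch of the voice-to-speaker-ID lookup, using the IDs from the table above. The names MELO_SPEAKER_IDS and meloSidForVoice are hypothetical, and falling back to sid 0 for an unknown voice ID is an assumption made for illustration, not documented behavior.

```typescript
// Voice ID -> speaker ID, copied from the voices table above.
const MELO_SPEAKER_IDS: { [voiceId: string]: number } = {
  "default": 0,
  "en-us": 1,
  "en-india": 2,
  "en-uk": 3,
  "en-au": 4,
};

// Returns the sid for a voice ID. Unknown IDs map to 0 (the "default" voice);
// that fallback is an assumption of this sketch.
function meloSidForVoice(voiceId: string): number {
  if (!Object.prototype.hasOwnProperty.call(MELO_SPEAKER_IDS, voiceId)) return 0;
  return MELO_SPEAKER_IDS[voiceId];
}
```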

Model input contract

Standard sherpa-onnx Melo VITS2 ONNX signature:

x              int64 (1, T)   phoneme IDs (from lexicon lookup via tokens.txt)
x_lengths      int64 (1,)     T
tones          int64 (1, T)   tone IDs (mostly 7–10 for English), parallel to x
sid            int64 (1,)     speaker ID (0..4)
noise_scale    float (1,)     0.667 default
noise_scale_w  float (1,)     0.8 default
length_scale   float (1,)     1.0 / speed

Output: y float32 (1, 1, N) at 44 100 Hz mono.
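
To make the input contract concrete, the sketch below assembles the seven inputs as plain descriptors (type, dims, data) rather than runtime tensors, so it runs without any library. The names MeloFeedEntry and buildMeloFeeds are hypothetical; the input names, shapes, and defaults are the ones listed above.

```typescript
// Library-free description of one model input.
interface MeloFeedEntry {
  type: "int64" | "float32";
  dims: number[];
  data: number[];
}
type MeloFeeds = { [inputName: string]: MeloFeedEntry };

// Builds all seven inputs for one utterance of T phonemes.
function buildMeloFeeds(
  phonemeIds: number[],
  toneIds: number[],
  sid: number,
  speed: number
): MeloFeeds {
  if (phonemeIds.length !== toneIds.length) {
    throw new Error("tones must be parallel to x");
  }
  if (Math.floor(sid) !== sid || sid < 0 || sid > 4) {
    throw new Error("sid must be an integer in 0..4");
  }
  if (!(speed > 0)) {
    throw new Error("speed must be positive");
  }
  const T = phonemeIds.length;
  return {
    x: { type: "int64", dims: [1, T], data: phonemeIds.slice() },
    x_lengths: { type: "int64", dims: [1], data: [T] },
    tones: { type: "int64", dims: [1, T], data: toneIds.slice() },
    sid: { type: "int64", dims: [1], data: [sid] },
    noise_scale: { type: "float32", dims: [1], data: [0.667] },
    noise_scale_w: { type: "float32", dims: [1], data: [0.8] },
    // Faster speech means a smaller length scale: 1.0 / speed.
    length_scale: { type: "float32", dims: [1], data: [1.0 / speed] },
  };
}
```

In a real onnxruntime-web session each descriptor would still need converting to a tensor of the matching element type (int64 inputs are typically backed by a BigInt64Array); that step is left out here to keep the sketch self-contained.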

Upstream: csukuangfj/sherpa-onnx-vits-melo-tts-en (sherpa-onnx's MeloTTS English export). Original model: myshell-ai/MeloTTS-English (PyTorch, MIT).

Why fp32 (not fp16)?

Same architecture, same weights, same file as mobile's Plus tier, except that mobile ships fp16 for the ARM NEON SIMD speed win on-device. The browser story is different:

  • ORT-Web WebGPU's fp16 path depends on the optional shader-f16 feature, which some WebGPU adapters don't expose. On those, fp16 runs at fp32 speed anyway.
  • ORT-Web WASM has no native fp16 kernels. fp16 weights get up-cast at load time, so fp16 would shrink the download but bring no inference speed-up.
  • An audio-quality A/B between fp16 and fp32 hasn't been run on a WebGPU listening setup yet. Vocoder-family models have documented fp16 sensitivity (subnormal weights can clamp on conversion → audible artifacts on sibilants), and a per-platform listening test was deferred.

Net: fp32 is the safer browser choice. If a WebGPU + headphones A/B later validates fp16, the engine config flips with no other changes (the fp16 file already exists at AI Neural Models/android/neural-melo-en/model.fp16.onnx for upload when the time comes).
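
The reasoning above can be condensed into a small decision function. This is a sketch under stated assumptions, not the extension's actual config logic: the names FeatureSetLike and chooseMeloPrecision are hypothetical, and the fp16ListeningTestPassed flag stands in for the deferred A/B result. In a browser, the feature set would come from the features property of the adapter returned by navigator.gpu.requestAdapter().

```typescript
// Minimal view of a WebGPU adapter's feature set (only has() is needed).
interface FeatureSetLike {
  has(feature: string): boolean;
}

type MeloPrecision = "fp32" | "fp16";

// Picks a model precision for the GPU Lite engine.
// - WASM has no native fp16 kernels, so fp16 brings no speed-up there.
// - On WebGPU, fp16 only pays off when the adapter exposes "shader-f16".
// - Even then, fp16 stays off until a listening test has validated it.
function chooseMeloPrecision(
  backend: "webgpu" | "wasm",
  features: FeatureSetLike | null,
  fp16ListeningTestPassed: boolean
): MeloPrecision {
  if (backend === "wasm") return "fp32";
  if (features === null || !features.has("shader-f16")) return "fp32";
  return fp16ListeningTestPassed ? "fp16" : "fp32";
}
```

With the listening test still pending, every path returns fp32, which matches the current production default; flipping the flag is the only change needed once fp16 is validated.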


GPU tier · neural-82m · Kokoro 82 M (ONNX · WebGPU)

A single Kokoro model unlocks 11 distinct voices at once via 11 small speaker-embedding files. WebGPU-accelerated inference, ~10× real-time on a modern GPU.

Model file

| File | Precision | Size | Status |
| --- | --- | --- | --- |
| neural-82m/onnx/model.onnx | fp32 | ~325 MB | ✅ Production default, stable on every WebGPU runtime |
| neural-82m/onnx/model_fp16.onnx | fp16 | ~165 MB | ⚠️ Reserved for future use; blocked today by upstream onnxruntime-web fp16 bugs (microsoft/onnxruntime#23403, #26732) |

The fp16 file is staged here so that once the upstream JS stack lands fp16+WebGPU fixes, ThreadCast can flip the default to fp16 with a single config change, halving the download and roughly doubling per-segment speed on capable GPUs.

Tokenizer + config

tokenizer.json, tokenizer_config.json, config.json: small files used by @huggingface/transformers (transformers.js) when loading the model.

Voices (neural-82m/voices/*.bin, ~520 KB each)

| Voice ID | Name | Accent | Gender |
| --- | --- | --- | --- |
| af_bella | Bella | American | Female |
| af_sarah | Sarah | American | Female |
| af_nova | Nova | American | Female |
| af_sky | Sky | American | Female |
| am_adam | Adam | American | Male |
| am_michael | Michael | American | Male |
| am_echo | Echo | American | Male |
| bf_emma | Emma | British | Female |
| bf_isabella | Isabella | British | Female |
| bm_george | George | British | Male |
| bm_daniel | Daniel | British | Male |

Voice IDs encode locale and gender: first letter = accent (a = American, b = British), second letter = gender (f = female, m = male).
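
The encoding rule above is simple enough to sketch as a parser. The function name parseKokoroVoiceId and the returned field names are hypothetical; the sketch only accepts the two accent letters and two gender letters that the rule describes.

```typescript
interface KokoroVoiceInfo {
  accent: string; // "American" or "British"
  gender: string; // "Female" or "Male"
  speaker: string; // lowercase name part, e.g. "bella"
}

// Decodes IDs such as "af_bella": first letter = accent, second = gender.
function parseKokoroVoiceId(voiceId: string): KokoroVoiceInfo {
  const match = /^([ab])([fm])_([a-z]+)$/.exec(voiceId);
  if (match === null) {
    throw new Error("unrecognized Kokoro voice ID: " + voiceId);
  }
  return {
    accent: match[1] === "a" ? "American" : "British",
    gender: match[2] === "f" ? "Female" : "Male",
    speaker: match[3],
  };
}
```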

Upstream: model from onnx-community/Kokoro-82M-v1.0-ONNX-timestamped; voice embeddings from onnx-community/Kokoro-82M-v1.0-ONNX.


How the extension uses these files

The ThreadCast extension fetches model files lazily, only when the user selects a Neural engine and presses Test/Play. Files are cached in the browser's Cache API and reused across sessions, so the user pays the download cost exactly once per profile.
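
The lazy, download-once behavior described above follows a cache-first pattern, sketched below. This is not the extension's code: CacheLike and loadCacheFirst are hypothetical names, and the cache and downloader are passed in so the sketch runs outside a browser. CacheLike mirrors the match and put methods of the browser Cache interface.

```typescript
// The subset of the browser Cache interface this sketch relies on.
interface CacheLike<T> {
  match(url: string): Promise<T | undefined>;
  put(url: string, value: T): Promise<void>;
}

// Returns the cached entry when present; otherwise downloads it once,
// stores it, and returns it. Later calls for the same URL skip the network.
async function loadCacheFirst<T>(
  url: string,
  cache: CacheLike<T>,
  download: (url: string) => Promise<T>
): Promise<T> {
  const hit = await cache.match(url);
  if (hit !== undefined) return hit;
  const fresh = await download(url);
  await cache.put(url, fresh);
  return fresh;
}
```

In a browser, the cache would typically come from caches.open(...) and the downloader would wrap fetch. With real Response objects, a clone should be stored (response.clone()), because putting a response into the cache consumes its body.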

| Engine | Files fetched on first use |
| --- | --- |
| System voices | None (uses OS / browser TTS) |
| Neural · CPU | The selected voice's .onnx + .onnx.json (~63 MB total) |
| Neural · GPU Lite | neural-melo-en/{model.onnx, lexicon.txt, tokens.txt} (~177 MB total, all 5 EN accents in one bundle) |
| Neural · GPU | onnx/model.onnx + tokenizer (326 MB) + 11 voice .bin (5.7 MB) |
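
As a quick sanity check on the totals in the table, the sketch below sums the approximate per-file sizes quoted elsewhere on this page. The names and the engine keys ("gpu-lite", "gpu") are hypothetical, and all sizes are the rounded figures from this page rather than measured byte counts.

```typescript
interface FirstUseFile {
  path: string;
  approxMB: number;
}

const KOKORO_VOICE_COUNT = 11;
const KOKORO_VOICE_KB = 520; // approximate size of each voices/*.bin file

const FIRST_USE_FILES: { [engine: string]: FirstUseFile[] } = {
  "gpu-lite": [
    { path: "neural-melo-en/model.onnx", approxMB: 171 },
    { path: "neural-melo-en/lexicon.txt", approxMB: 6 },
    { path: "neural-melo-en/tokens.txt", approxMB: 0.001 },
  ],
  "gpu": [
    { path: "neural-82m/onnx/model.onnx + tokenizer files", approxMB: 326 },
    { path: "neural-82m/voices/*.bin", approxMB: (KOKORO_VOICE_COUNT * KOKORO_VOICE_KB) / 1000 },
  ],
};

// Sums the approximate first-use download for an engine key (0 if unknown).
function firstUseTotalMB(engine: string): number {
  const files = FIRST_USE_FILES[engine] || [];
  let total = 0;
  for (const file of files) total += file.approxMB;
  return total;
}
```

The GPU Lite files add up to roughly 177 MB, and 11 voice files at about 520 KB each come to about 5.7 MB, consistent with the figures in the table.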

The WASM runtimes (ONNX Runtime, Piper phonemizer) are bundled inside the extension package itself, not served from this repo, to comply with Manifest V3 CSP and avoid CDN dependencies.


License

Per-project licenses are retained from upstream; see the parent README for the consolidated summary.