# ThreadCast: Chrome Extension Neural Models Mirror

Hugging Face transformers.js-format mirror of the on-device neural TTS models used by the ThreadCast Chrome extension. The Android counterpart lives in two siblings: [`../android/`](../android/README.md) (local dev staging: sherpa-onnx upstream artifacts) and [`../mobile-android/`](../mobile-android/README.md) (production zips downloaded by the Android app at runtime). See the [parent README](../README.md) for repository-wide context, branding, and license summary.

> If you're an extension user, you don't need anything here: the extension downloads what it needs automatically the first time you select a Neural engine. This page is for transparency, contributors, and forks.

---

## Layout
```
extension/
├── neural-28m/                  # Piper voices for the CPU (Lite) engine
│   └── en/en_US/<voice>/medium/
│       ├── en_US-<voice>-medium.onnx
│       └── en_US-<voice>-medium.onnx.json
├── neural-melo-en/              # MeloTTS for the GPU Lite engine (mobile surfaces same engine as "Plus")
│   ├── model.onnx               # fp32, production default
│   ├── lexicon.txt              # enriched CMUdict-style lexicon
│   ├── tokens.txt               # phoneme-to-ID map
│   └── LICENSE
└── neural-82m/                  # Kokoro model + voices for the GPU (Studio) engine
    ├── onnx/
    │   ├── model.onnx           # fp32, production default
    │   └── model_fp16.onnx      # fp16, experimental, blocked by upstream bugs
    ├── tokenizer.json
    ├── tokenizer_config.json
    ├── config.json
    └── voices/                  # 11 speaker embeddings
        └── af_bella.bin … bm_daniel.bin
```
> **Naming note:** `neural-28m` / `neural-82m` encode the parameter count in their folder name (CPU and GPU tiers, respectively). `neural-melo-en` breaks that convention: MeloTTS at ~52 M params would naturally be `neural-52m`, but the folder + file naming aligns with the local staging tree at [`AI Neural Models/android/neural-melo-en/`](../android/) and the mobile production bundle `threadcast-melo-en-v2.zip`. Same engine, same file, two surfaces. The tier identifier in docs / engine tables remains `neural-52m`.

---

## Engine tiers at a glance
| Tier | Subtree | Architecture | Params | Runtime | First-use download | Extension UI label |
|---|---|---|---|---|---|---|
| **Lite (CPU)** | `neural-28m/` | Piper VITS | ~28 M | WASM single-thread | ~63 MB per voice + ~10 MB shared espeak | Neural · CPU |
| **GPU Lite** | `neural-melo-en/` | MeloTTS VITS2 + BERT prosody assist | ~52 M | WebGPU (WASM fallback) | ~177 MB single bundle (5 EN accents) | Neural · GPU Lite |
| **Studio (GPU)** | `neural-82m/` | Kokoro StyleTTS2 | ~82 M | WebGPU | ~325 MB single bundle (11 voices) | Neural · GPU |

GPU Lite sits between CPU and GPU on every axis: download size, VRAM, hardware floor, output quality. It is designed for users whose hardware supports WebGPU but can't comfortably run the 82 M Studio model. **Same engine as the mobile app's "Local AI Plus" tier**: the extension just surfaces it with a tier name that aligns with the existing CPU/GPU framing users already know.

---
## CPU tier - `neural-28m` - Piper (VITS · 28 M params · WASM)

Five English voices, ~63 MB per voice. One voice is loaded at a time. Single-thread WASM inference inside an MV3 offscreen document. Real-time on a modern laptop.

| Voice ID | Speaker | Notes |
| --- | --- | --- |
| `en_US-amy-medium` | Amy | Female · warm narrator |
| `en_US-lessac-medium` | Lessac | Female · neutral, news-anchor |
| `en_US-ryan-medium` | Ryan | Male · clear, newsreader |
| `en_US-hfc_female-medium` | HFC Female | Female · crisp, modern |
| `en_US-hfc_male-medium` | HFC Male | Male · crisp, modern |

Each voice ships as **two files** (`*.onnx` + `*.onnx.json`) under `neural-28m/en/en_US/<voice>/medium/`.
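As a minimal sketch of the layout above, the two files for a voice can be derived from its voice ID. The helper name `piper_voice_files` is hypothetical and not part of the extension; it assumes every voice ID follows the `<locale>-<voice>-<quality>` pattern seen in the table.

```python
from pathlib import PurePosixPath

# Hypothetical helper: maps a voice ID from the table above to its two
# repo-relative files. Assumes the "<locale>-<voice>-<quality>" ID pattern.
def piper_voice_files(voice_id: str, root: str = "neural-28m") -> tuple[str, str]:
    locale, voice, quality = voice_id.split("-")   # e.g. "en_US", "amy", "medium"
    lang = locale.split("_")[0]                    # "en_US" -> "en"
    base = PurePosixPath(root) / lang / locale / voice / quality
    model = base / f"{voice_id}.onnx"
    return str(model), str(model) + ".json"

model_path, config_path = piper_voice_files("en_US-hfc_female-medium")
print(model_path)   # neural-28m/en/en_US/hfc_female/medium/en_US-hfc_female-medium.onnx
print(config_path)  # neural-28m/en/en_US/hfc_female/medium/en_US-hfc_female-medium.onnx.json
```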
**Upstream:** [`diffusionstudio/piper-voices`](https://huggingface.co/diffusionstudio/piper-voices), curated subset mirrored here.

---

## GPU Lite tier - `neural-melo-en` - MeloTTS English (VITS2 + BERT · ~52 M · WebGPU)

A single ~171 MB model serves **all 5 English accents** via speaker-ID lookup at synth time. BERT prosody assist is baked into the ONNX graph, so no separate BERT input or model is needed. Inference is WebGPU-accelerated; on adapters without WebGPU support, ORT-Web falls back to single-thread WASM (slow but functional). MIT license.
### Files

| File | Size | Purpose |
|---|---|---|
| `model.onnx` | ~171 MB | fp32 ONNX export, production default; same file the Android app ships via [`mobile-android/v1/threadcast-melo-en-v2.zip`](../mobile-android/v1/) |
| `lexicon.txt` | ~6 MB | Enriched CMUdict-style lexicon (~250 k+ entries: base 129 k + CMUdict latest + g2p_en + Aquila-Resolve neural G2P + curated Reddit/tech/brand/modern-English terms + punctuation silence rules, including mapping an em-dash to a short pause) |
| `tokens.txt` | ~1 KB | Phoneme-to-integer-ID map (~219 entries, case-sensitive) |
| `LICENSE` | small | MIT, retained from upstream |

**No `espeak-ng-data/` here.** MeloTTS embeds phonemization end-to-end via the CMUdict lexicon. Out-of-vocabulary tokens fall back to letter-by-letter spelling using single-letter lexicon entries.
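The out-of-vocabulary fallback described above can be sketched roughly as follows. The toy lexicon, its phoneme strings, the lowercase word keys, and the function name `phonemize_word` are all illustrative assumptions; they do not reproduce the real `lexicon.txt` format or the extension's code.

```python
# Toy lexicon (assumed contents, for illustration only): word -> phonemes.
# Single-letter entries are what make letter-by-letter spelling possible.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "g": ["JH", "IY"],
    "p": ["P", "IY"],
    "u": ["Y", "UW"],
}

def phonemize_word(word: str, lexicon: dict[str, list[str]]) -> list[str]:
    key = word.lower()                  # assumption: word keys are lowercase
    if key in lexicon:                  # whole-word hit
        return list(lexicon[key])
    phonemes: list[str] = []
    for letter in key:                  # OOV: spell it out letter by letter
        phonemes.extend(lexicon.get(letter, []))  # letters without an entry are skipped
    return phonemes

print(phonemize_word("hello", TOY_LEXICON))  # ['HH', 'AH', 'L', 'OW']
print(phonemize_word("GPU", TOY_LEXICON))    # ['JH', 'IY', 'P', 'IY', 'Y', 'UW']
```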
### Voices (5 EN accents, speaker IDs 0..4)

| `sid` | Voice ID | Name | Accent |
|---|---|---|---|
| 0 | `default` | Sarah | Female · neutral, default |
| 1 | `en-us` | Alice | Female · American |
| 2 | `en-india` | Priya | Female · Indian English |
| 3 | `en-uk` | Charlotte | Female · British |
| 4 | `en-au` | Olivia | Female · Australian |

All speakers are female today; accent diversity is the differentiator. To synthesize a specific accent, pass the corresponding `sid` to the model's input tensor.
### Model input contract

Standard sherpa-onnx Melo VITS2 ONNX signature:

```
x              int64  (1, T)  - phoneme IDs (from lexicon lookup via tokens.txt)
x_lengths      int64  (1,)    - T
tones          int64  (1, T)  - tone IDs (mostly 7-10 for English), parallel to x
sid            int64  (1,)    - speaker ID (0..4)
noise_scale    float  (1,)    - 0.667 default
noise_scale_w  float  (1,)    - 0.8 default
length_scale   float  (1,)    - 1.0 / speed
```

Output: `y` float32 (1, 1, N) at **44 100 Hz** mono.
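A minimal sketch of assembling that input feed, assuming phoneme and tone IDs have already been looked up. The function name `build_melo_feed` and the sample ID values are hypothetical. Plain nested lists stand in for tensors here; converting them to int64 / float32 arrays is left to whichever ONNX runtime consumes the feed.

```python
# Voice IDs and speaker IDs from the Voices table.
SPEAKER_IDS = {"default": 0, "en-us": 1, "en-india": 2, "en-uk": 3, "en-au": 4}

def build_melo_feed(phoneme_ids, tone_ids, voice_id="default", speed=1.0):
    if len(phoneme_ids) != len(tone_ids):
        raise ValueError("tones must be parallel to x")
    if speed <= 0:
        raise ValueError("speed must be positive")
    return {
        "x": [list(phoneme_ids)],         # shape (1, T), int64
        "x_lengths": [len(phoneme_ids)],  # shape (1,), int64
        "tones": [list(tone_ids)],        # shape (1, T), int64
        "sid": [SPEAKER_IDS[voice_id]],   # shape (1,), int64
        "noise_scale": [0.667],           # shape (1,), float
        "noise_scale_w": [0.8],           # shape (1,), float
        "length_scale": [1.0 / speed],    # shape (1,), float
    }

# Placeholder IDs, not real phoneme or tone values.
feed = build_melo_feed([12, 45, 7], [7, 8, 7], voice_id="en-uk", speed=1.25)
print(feed["sid"], feed["length_scale"])  # [3] [0.8]
```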
**Upstream:** [`csukuangfj/sherpa-onnx-vits-melo-tts-en`](https://huggingface.co/csukuangfj/sherpa-onnx-vits-melo-tts-en) (sherpa-onnx's MeloTTS English export). Original model: [`myshell-ai/MeloTTS-English`](https://huggingface.co/myshell-ai/MeloTTS-English) (PyTorch, MIT).

### Why fp32 (not fp16)?

Same architecture, same weights, same file as mobile's Plus tier, except that mobile ships fp16 for the ARM NEON SIMD speed win on-device. The browser story is different:

- **ORT-Web WebGPU's fp16 path** depends on the optional `shader-f16` extension, which a chunk of WebGPU adapters don't expose. On those, fp16 runs at fp32 speed anyway.
- **ORT-Web WASM** has no native fp16 kernels. An fp16 model gets up-cast at load time, so it saves download size but gains nothing on inference speed.
- **Audio-quality A/B** between fp16 and fp32 hasn't been run on a WebGPU listening setup yet. Vocoder-family models have documented fp16 sensitivity (subnormal weights can clamp on conversion, producing audible artifacts on sibilants), and a per-platform listening test was deferred.

Net: fp32 is the safer browser choice. If a WebGPU + headphones A/B later validates fp16, the engine config flips with no other changes (the fp16 file already exists at [`AI Neural Models/android/neural-melo-en/model.fp16.onnx`](../android/) for upload when the time comes).
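The speed half of that argument can be written as a small predicate. This is only a sketch of the reasoning in the bullets above, not the extension's engine selection logic: the backend labels `"webgpu"` and `"wasm"` and the function name `fp16_is_faster` are hypothetical, and audio quality is deliberately left out.

```python
def fp16_is_faster(backend: str, has_shader_f16: bool = False) -> bool:
    """Return True only where the bullets above expect fp16 to beat fp32 on speed."""
    if backend == "wasm":
        return False            # no native fp16 kernels: the model is up-cast at load
    if backend == "webgpu":
        return has_shader_f16   # without shader-f16, fp16 runs at fp32 speed
    raise ValueError(f"unknown backend: {backend}")

print(fp16_is_faster("webgpu", has_shader_f16=True))   # True
print(fp16_is_faster("webgpu", has_shader_f16=False))  # False
print(fp16_is_faster("wasm"))                          # False
```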
---

## GPU tier - `neural-82m` - Kokoro 82 M (ONNX · WebGPU)

A single Kokoro model unlocks **11 distinct voices** at once via 11 small speaker-embedding files. WebGPU-accelerated inference, ~10× real-time on a modern GPU.
### Model file

| File | Precision | Size | Status |
| --- | --- | --- | --- |
| `neural-82m/onnx/model.onnx` | fp32 | ~325 MB | ✅ Production default, stable on every WebGPU runtime |
| `neural-82m/onnx/model_fp16.onnx` | fp16 | ~165 MB | ⚠️ Reserved for future use, blocked today by upstream `onnxruntime-web` fp16 bugs ([microsoft/onnxruntime#23403](https://github.com/microsoft/onnxruntime/issues/23403), [#26732](https://github.com/microsoft/onnxruntime/issues/26732)) |

The fp16 file is staged here so that once the upstream JS stack lands fp16+WebGPU fixes, ThreadCast can flip the default to fp16 with a single config change, halving the download and roughly doubling per-segment speed on capable GPUs.
### Tokenizer + config

`tokenizer.json`, `tokenizer_config.json`, `config.json`: small files used by [`@huggingface/transformers`](https://www.npmjs.com/package/@huggingface/transformers) (transformers.js) when loading the model.

### Voices (`neural-82m/voices/*.bin`, ~520 KB each)

| Voice ID | Name | Accent | Gender |
| --- | --- | --- | --- |
| `af_bella` | Bella | American | Female |
| `af_sarah` | Sarah | American | Female |
| `af_nova` | Nova | American | Female |
| `af_sky` | Sky | American | Female |
| `am_adam` | Adam | American | Male |
| `am_michael` | Michael | American | Male |
| `am_echo` | Echo | American | Male |
| `bf_emma` | Emma | British | Female |
| `bf_isabella` | Isabella | British | Female |
| `bm_george` | George | British | Male |
| `bm_daniel` | Daniel | British | Male |

Voice IDs encode locale and gender: first letter = accent (`a` = American, `b` = British), second letter = gender (`f` = female, `m` = male).
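The prefix convention above can be decoded mechanically. A minimal sketch, assuming every voice ID has the `<accent><gender>_<name>` shape shown in the table; `parse_kokoro_voice` is a hypothetical helper name, not an API of the extension or of transformers.js.

```python
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "Female", "m": "Male"}

def parse_kokoro_voice(voice_id: str) -> dict[str, str]:
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID: {voice_id!r}")
    return {
        "name": name.capitalize(),
        "accent": ACCENTS[prefix[0]],   # first letter = accent
        "gender": GENDERS[prefix[1]],   # second letter = gender
    }

print(parse_kokoro_voice("bm_george"))
# {'name': 'George', 'accent': 'British', 'gender': 'Male'}
```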
**Upstream:** model from [`onnx-community/Kokoro-82M-v1.0-ONNX-timestamped`](https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX-timestamped); voice embeddings from [`onnx-community/Kokoro-82M-v1.0-ONNX`](https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX).

---

## How the extension uses these files

The ThreadCast extension fetches model files **lazily**, only when the user selects a Neural engine and presses Test/Play. Files are cached in the browser's Cache API and reused across sessions, so the user pays the download cost exactly once per profile.

| Engine | Files fetched on first use |
| --- | --- |
| System voices | None (uses OS / browser TTS) |
| Neural · CPU | The selected voice's `.onnx` + `.onnx.json` (~63 MB total) |
| Neural · GPU Lite | `neural-melo-en/{model.onnx, lexicon.txt, tokens.txt}` (~177 MB total, all 5 EN accents in one bundle) |
| Neural · GPU | `onnx/model.onnx` + tokenizer (~326 MB) + 11 voice `.bin` (~5.7 MB) |

The WASM runtimes (ONNX Runtime, Piper phonemizer) are **bundled inside the extension package** itself, not served from this repo, to comply with Manifest V3 CSP and avoid CDN dependencies.
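The first-use totals in the table above follow from the per-file sizes quoted elsewhere in this README. A quick arithmetic check, using those approximate figures and decimal megabytes (the variable names are illustrative only):

```python
# Approximate sizes taken from the tables in this README.
MELO_MODEL_MB = 171
MELO_LEXICON_MB = 6
KOKORO_MODEL_AND_TOKENIZER_MB = 326
KOKORO_VOICE_KB = 520
KOKORO_VOICE_COUNT = 11

gpu_lite_total_mb = MELO_MODEL_MB + MELO_LEXICON_MB      # tokens.txt (~1 KB) is negligible
voices_mb = KOKORO_VOICE_COUNT * KOKORO_VOICE_KB / 1000  # decimal MB
gpu_total_mb = KOKORO_MODEL_AND_TOKENIZER_MB + voices_mb

print(gpu_lite_total_mb)       # 177
print(round(voices_mb, 1))     # 5.7
print(round(gpu_total_mb, 1))  # 331.7
```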
---

## License

Per-project licenses are retained from upstream; see the [parent README](../README.md#license) for the consolidated summary.