ilnmtlbnm committed on
Commit 520862d · verified · 1 Parent(s): 74d4422

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +14 -10
  2. pocket-tts-q8_0.gguf +2 -2
README.md CHANGED
@@ -28,31 +28,35 @@ Q8_0 quantized version of [kyutai/pocket-tts-without-voice-cloning](https://hugg
  | | Original | GGUF Q8_0 |
  |---|---|---|
  | **File** | `tts_b6369a24.safetensors` | `pocket-tts-q8_0.gguf` |
- | **Size** | 236 MB (BF16) | 178 MB |
  | **Format** | safetensors | GGUF |
- | **Reduction** | — | 24% |

  ## Quantization

  Per-block Q8_0 quantization (block size 32): 2-byte f16 scale + 32 int8 values per block.

- **64 tensors quantized** (82% of params) — all linear/projection weights in the transformer backbone, flow matching network, and mimi codec transformers.
-
- **149 tensors kept as F32** — norms, biases, embeddings, SEANet convolutions, quantizer, and resampling convolutions.

- Validation SQNR: >43 dB on all non-zero tensors.

- ## Runtime note

- Weights are **dequantized to F32 at load time** and matmuls run through candle's optimized `gemm` kernels. This is because candle's `QMatMul` for quantized tensors currently uses a naive triple loop that is ~1.7x slower than `gemm`'s SIMD-tiled F32 matmul on WASM. The GGUF Q8_0 format still saves 25% on download size vs F32 safetensors.

- Once candle ships an optimized quantized matmul kernel (tiled, cache-blocked), we can keep weights quantized at runtime for additional memory bandwidth savings on mobile.

  ## Files

  | File | Size | Description |
  |------|------|-------------|
- | `pocket-tts-q8_0.gguf` | 178 MB | Model weights (Q8_0 + F32) |
  | `tokenizer.model` | 58 KB | SentencePiece unigram tokenizer |

  Voice embeddings are unchanged — use them from the [original repo](https://huggingface.co/kyutai/pocket-tts-without-voice-cloning/tree/main/embeddings_v2).
 
  | | Original | GGUF Q8_0 |
  |---|---|---|
  | **File** | `tts_b6369a24.safetensors` | `pocket-tts-q8_0.gguf` |
+ | **Size** | 236 MB (BF16) | 128 MB |
  | **Format** | safetensors | GGUF |
+ | **Reduction** | — | 46% |
+
+ ## What's included
+
+ This GGUF contains the **TTS decoder pipeline only**: the transformer backbone, flow matching network, mimi decoder + decoder transformer, and the DummyQuantizer output projection.
+
+ The mimi **encoder** (SEANet encoder, encoder transformer, downsample conv) is **excluded** — TTS only needs the decoder path. This saves ~52 MB (28%) compared to a full-model GGUF.
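The arithmetic behind the new Size and Reduction rows can be sanity-checked: a Q8_0 block stores 34 bytes for 32 weights, i.e. 8.5 bits per weight versus 16 bits for BF16, and dropping the encoder accounts for the rest of the shrink. A quick check (pure arithmetic, only the sizes quoted in this README):

```python
# Q8_0 block layout: 2-byte f16 scale + 32 int8 values = 34 bytes per 32 weights.
BLOCK = 32
block_bytes = 2 + BLOCK                     # 34
bits_per_weight = block_bytes * 8 / BLOCK
assert bits_per_weight == 8.5               # vs 16 bits for BF16

# Per-tensor shrink for the quantized weights alone: 1 - 8.5/16.
q8_vs_bf16 = 1 - bits_per_weight / 16
print(f"Q8_0 vs BF16: {q8_vs_bf16:.1%} smaller")  # 46.9%

# Headline file sizes from the table: 236 MB BF16 safetensors -> 128 MB GGUF.
# (This combines quantization with dropping the mimi encoder and keeping some F32 tensors.)
reduction = 1 - 128 / 236
print(f"overall reduction: {reduction:.0%}")       # 46%
```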
 
  ## Quantization

  Per-block Q8_0 quantization (block size 32): 2-byte f16 scale + 32 int8 values per block.

+ **56 tensors quantized** — all linear/projection weights in the transformer backbone, flow matching network, and mimi decoder transformer.
+
+ **114 tensors kept as F32** — norms, biases, embeddings, SEANet decoder convolutions, quantizer, and resampling convolutions.
+
+ Validation SQNR: >40 dB on all tensors.
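The block format and the SQNR check above can be emulated in a few lines of NumPy. This is a reference-style sketch for intuition, not candle's actual kernel; `quantize_q8_0`, `dequantize_q8_0`, and `sqnr_db` are illustrative names:

```python
import numpy as np

BLOCK = 32  # Q8_0 block size

def quantize_q8_0(x: np.ndarray):
    """Quantize a 1-D f32 array (length divisible by 32) to Q8_0:
    per block, one f16 scale plus 32 int8 values."""
    blocks = x.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = (amax / 127.0).astype(np.float16)   # the 2-byte scale per block
    d = scale.astype(np.float32)
    d[d == 0] = 1.0                             # guard all-zero blocks
    q = np.clip(np.rint(blocks / d), -127, 127).astype(np.int8)
    return scale, q

def dequantize_q8_0(scale, q):
    return (scale.astype(np.float32) * q).reshape(-1)

def sqnr_db(x, x_hat):
    noise = x - x_hat
    return 10 * np.log10((x ** 2).sum() / (noise ** 2).sum())

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
s, q = quantize_q8_0(w)
print(f"SQNR: {sqnr_db(w, dequantize_q8_0(s, q)):.1f} dB")  # comfortably above 40 dB
```

For roughly Gaussian weights, 8-bit blocks of 32 land in the mid-40s dB, consistent with the >40 dB validation figure.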
 
+ ## Runtime
+
+ Weights stay quantized as Q8_0 at runtime. Matmuls use a [tiled WASM SIMD128 quantized matmul kernel](https://github.com/ilnmtlbnm/candle/tree/quantized-matmul-wasm-simd-opt) (fork of candle) — achieving ~2x realtime on desktop (M-series Mac, Chrome).
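For intuition about what the quantized matmul computes, here is a scalar NumPy emulation of one Q8_0 weight row applied to an f32 activation vector: each 32-wide block contributes `scale * dot(int8_weights, activations)`. The linked fork tiles this over rows and columns with WASM SIMD128; the sketch below (illustrative names, not candle's API) only shows the per-block structure:

```python
import numpy as np

BLOCK = 32

def q8_0_row_dot(scales, qweights, x):
    """Dot product of a Q8_0-encoded weight row with an f32 vector.
    scales: (n_blocks,) f16; qweights: (n_blocks, 32) int8; x: (n_blocks*32,) f32."""
    acc = np.float32(0.0)
    xb = x.reshape(-1, BLOCK)
    for b in range(qweights.shape[0]):
        # One scale multiply per block; the inner dot runs over raw int8 weights.
        acc += np.float32(scales[b]) * np.dot(qweights[b].astype(np.float32), xb[b])
    return acc

# Cross-check against dequantize-then-dot on random data.
rng = np.random.default_rng(1)
n = 256
scales = rng.uniform(0.01, 0.1, n // BLOCK).astype(np.float16)
qw = rng.integers(-127, 128, (n // BLOCK, BLOCK), dtype=np.int8)
x = rng.standard_normal(n).astype(np.float32)
dense = (scales.astype(np.float32)[:, None] * qw).reshape(-1)
assert np.allclose(q8_0_row_dot(scales, qw, x), np.dot(dense, x), rtol=1e-4, atol=1e-3)
```

Keeping the weights in this form at runtime means the kernel streams ~8.5 bits per weight from memory instead of 32, which is where the bandwidth savings on mobile come from.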
 
  ## Files

  | File | Size | Description |
  |------|------|-------------|
+ | `pocket-tts-q8_0.gguf` | 128 MB | Model weights (Q8_0 + F32, decoder only) |
  | `tokenizer.model` | 58 KB | SentencePiece unigram tokenizer |

  Voice embeddings are unchanged — use them from the [original repo](https://huggingface.co/kyutai/pocket-tts-without-voice-cloning/tree/main/embeddings_v2).
pocket-tts-q8_0.gguf CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:df46502b4a62abd4836ae10ca26ca9ef5de7215790d02b32f1e91555e6a5fe75
- size 186331968
+ oid sha256:6861029e5a99fd082ce95854721b1f4a5097189a625a5fafa133c84c399ba304
+ size 134356064