---
license: mit
language:
- ar
- da
- de
- el
- en
- es
- fi
- fr
- he
- hi
- it
- ja
- ko
- ms
- nl
- no
- pl
- pt
- ru
- sv
- sw
- tr
- zh
base_model:
- ResembleAI/chatterbox
pipeline_tag: text-to-speech
tags:
- tts
- text-to-speech
- chatterbox
- flow-matching
- hifi-gan
- gguf
- crispasr
library_name: ggml
---
# Chatterbox TTS: GGUF (ggml-quantised)
GGUF / ggml conversion of [`ResembleAI/chatterbox`](https://huggingface.co/ResembleAI/chatterbox) for use with **[CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR)**.
Chatterbox is a full TTS pipeline: character tokenizer → T3 (30-layer Llama AR, 520M) → speech tokens → S3Gen (Conformer encoder + UNet1D CFM denoiser, 10 Euler steps) → HiFTGenerator vocoder (conv chains + Snake activations + iSTFT) → 24 kHz WAV. Distributed under the **MIT license**.
Two GGUF files are needed: the **T3 model** (text β†’ speech tokens) and the **S3Gen model** (speech tokens β†’ audio).
## Files
| File | Quant | Size | Notes |
|---|---|---:|---|
| `chatterbox-t3-f16.gguf` | F16 | 1.1 GB | T3 AR model, reference quality |
| `chatterbox-t3-q8_0.gguf` | Q8_0 | 630 MB | T3 AR model, recommended |
| `chatterbox-t3-q4_k.gguf` | Q4_K | 374 MB | T3 AR model, smallest |
| `chatterbox-s3gen-f16.gguf` | F16 | 574 MB | S3Gen + vocoder, reference quality |
| `chatterbox-s3gen-q8_0.gguf` | Q8_0 | 358 MB | S3Gen + vocoder, recommended |
| `chatterbox-s3gen-q4_k.gguf` | Q4_K | 248 MB | S3Gen + vocoder, smallest |
Note: vocoder weights (conv_pre, resblocks, conv_post, source fusion) are kept at F32 in all quant levels for audio quality. Quantization applies to the Conformer encoder, UNet decoder, and T3 Llama layers.
The T3 GGUF files include the BPE `tokenizer.ggml.tokens` and `tokenizer.ggml.merges` arrays. Uploads made before 2026-05-08 lacked the `merges` key, so the CrispASR loader fell back to a char-level tokenizer that dropped uppercase letters and spaces. If a downstream ASR roundtrip check degrades against these files, re-pull them.
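One way to tell whether a local T3 file predates the fix is to look for the `merges` key in its metadata. The sketch below assumes the `gguf` Python package (`pip install gguf`) and its `GGUFReader` class; the helper names `missing_tokenizer_keys` and `check_t3_file` are illustrative and not part of CrispASR.

```python
# Keys the T3 GGUF is expected to carry, per the note above.
REQUIRED_TOKENIZER_KEYS = ("tokenizer.ggml.tokens", "tokenizer.ggml.merges")


def missing_tokenizer_keys(field_names):
    """Return the required tokenizer keys absent from an iterable of GGUF key names."""
    present = set(field_names)
    return [key for key in REQUIRED_TOKENIZER_KEYS if key not in present]


def check_t3_file(path):
    """Read GGUF metadata from `path` and list any missing tokenizer keys."""
    # Imported lazily so the pure helper above works without the gguf package.
    from gguf import GGUFReader

    reader = GGUFReader(path)
    return missing_tokenizer_keys(reader.fields.keys())
```

Calling `check_t3_file("chatterbox-t3-q8_0.gguf")` on a current download should return an empty list; a result containing `tokenizer.ggml.merges` suggests an older upload that needs re-pulling.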
## Quick start
```bash
# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
cmake --build build -j --target chatterbox
# 2. Pull both model files
huggingface-cli download cstr/chatterbox-GGUF chatterbox-t3-q8_0.gguf --local-dir .
huggingface-cli download cstr/chatterbox-GGUF chatterbox-s3gen-q8_0.gguf --local-dir .
# 3. Synthesise (C API / test binary; CLI adapter in progress)
# See tests/test_voc_wav.cpp for vocoder-only usage
```
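The two model files can also be fetched from Python. This is a minimal sketch using `hf_hub_download` from `huggingface_hub`; the helper names `model_filenames` and `download_pair` are hypothetical, and using the same quant level for both files is a simplification of this sketch rather than a stated requirement.

```python
REPO_ID = "cstr/chatterbox-GGUF"
QUANTS = ("f16", "q8_0", "q4_k")  # quant suffixes from the Files table


def model_filenames(quant="q8_0"):
    """Return the (T3, S3Gen) file names for one quant level."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quant {quant!r}; expected one of {QUANTS}")
    return (f"chatterbox-t3-{quant}.gguf", f"chatterbox-s3gen-{quant}.gguf")


def download_pair(quant="q8_0", local_dir="."):
    """Download both files for `quant` and return their local paths."""
    # Imported lazily so model_filenames() works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in model_filenames(quant)
    ]
```

`download_pair("q4_k")` would fetch the smallest pair into the current directory.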
## Architecture
```
Text → Character tokenizer (704 tokens)
→ T3 Llama AR (30 layers, 1024D, 16 heads, RoPE, SwiGLU, CFG)
→ 25 Hz speech tokens (6561 codebook)
→ Conformer encoder (6 pre + 4 post upsample, 512D, 8 heads)
→ 80-channel mel spectrogram
→ UNet1D CFM denoiser (1 down + 12 mid + 1 up, 256 ch, 10 Euler steps)
→ HiFTGenerator vocoder (3× ConvTranspose1d + 9 ResBlocks + Snake + iSTFT)
→ 24 kHz mono WAV
```
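To illustrate what "10 Euler steps" means for the CFM denoiser, here is a toy fixed-step Euler integrator over a flow from t=0 to t=1. It is only a sketch: the velocity field is a hand-written stand-in for the UNet1D, the uniform time grid is an assumption, and the actual schedule and CFG handling in CrispASR are not described here.

```python
def euler_integrate(velocity, x0, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps  # assumption: uniform time grid
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup: a straight-line flow from a start point to a fixed target.
start = [0.0, 1.0, -2.0]
target = [1.0, -1.0, 0.5]


def toy_velocity(x, t):
    """Constant velocity pointing from `start` to `target` (stand-in for the UNet)."""
    return [b - a for a, b in zip(start, target)]


result = euler_integrate(toy_velocity, start, steps=10)
```

Because this toy velocity is constant, the ten steps land on `target` up to floating-point error; a learned velocity field depends on `x` and `t`, so the step count trades speed against integration accuracy.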
## Quality verification
ASR roundtrip on Python reference mel (no source fusion, deterministic):
| Metric | Value |
|---|---|
| ASR output (moonshine-base) | **"Hello world"** (correct) |
| Per-stage cosine vs Python ref | **1.000** (conv_pre through rb_2) |
| Waveform cosine vs torch.istft | **0.93** |
| STFT range | [-0.82, 2.0] (ref [-1.1, 1.7]) |
All quantization levels (F16/Q8_0/Q4_K) produce ASR-identical output on the reference mel.
The `crispasr-diff chatterbox …` harness reports `[PASS] t3_cond_emb cos≈1.000` and `[PASS] t3_prefill_emb[0] cos≈1.000` against the F16 reference for all three quant levels.
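The cosine figures above compare an intermediate tensor with its reference after flattening both to vectors. A minimal sketch of that metric in plain Python follows; the function name is illustrative, and this is not the `crispasr-diff` implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A value near 1.000 means the two tensors point in the same direction. The metric ignores overall scale, so a magnitude comparison such as the STFT range row in the table is a useful companion check.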
## Conversion
```bash
python models/convert-chatterbox-to-gguf.py \
--input ResembleAI/chatterbox \
--output-dir .
```
Requires `pip install gguf safetensors torch huggingface_hub`.
## Related models
- [`cstr/lahgtna-chatterbox-v1-GGUF`](https://huggingface.co/cstr/lahgtna-chatterbox-v1-GGUF): Arabic T3 variant (MIT, shares S3Gen)
- [`cstr/orpheus-3b-base-GGUF`](https://huggingface.co/cstr/orpheus-3b-base-GGUF): Llama-3.2 + SNAC TTS
- [`cstr/qwen3-tts-0.6b-customvoice-GGUF`](https://huggingface.co/cstr/qwen3-tts-0.6b-customvoice-GGUF): Qwen3-TTS with fixed speakers
## License
MIT, same as the upstream [ResembleAI/chatterbox](https://huggingface.co/ResembleAI/chatterbox).