| license: mit | |
| language: | |
| - ar | |
| - da | |
| - de | |
| - el | |
| - en | |
| - es | |
| - fi | |
| - fr | |
| - he | |
| - hi | |
| - it | |
| - ja | |
| - ko | |
| - ms | |
| - nl | |
| - no | |
| - pl | |
| - pt | |
| - ru | |
| - sv | |
| - sw | |
| - tr | |
| - zh | |
| base_model: | |
| - ResembleAI/chatterbox | |
| pipeline_tag: text-to-speech | |
| tags: | |
| - tts | |
| - text-to-speech | |
| - chatterbox | |
| - flow-matching | |
| - hifi-gan | |
| - gguf | |
| - crispasr | |
| library_name: ggml | |
| # Chatterbox TTS - GGUF (ggml-quantised) | |
| GGUF / ggml conversion of [`ResembleAI/chatterbox`](https://huggingface.co/ResembleAI/chatterbox) for use with **[CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR)**. | |
| Chatterbox is a full TTS pipeline: character tokenizer → T3 (30-layer Llama AR, 520M) → speech tokens → S3Gen (Conformer encoder + UNet1D CFM denoiser, 10 Euler steps) → HiFTGenerator vocoder (conv chains + Snake activations + iSTFT) → 24 kHz WAV. Distributed under the **MIT license**. | |
| Two GGUF files are needed: the **T3 model** (text → speech tokens) and the **S3Gen model** (speech tokens → audio). | |
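The "10 Euler steps" in the S3Gen stage mean that the flow-matching ODE is integrated with a fixed-step forward Euler solver. The following is a minimal sketch of that idea, not the CrispASR implementation: it assumes a uniform time grid on [0, 1] and uses a toy constant velocity field in place of the UNet1D denoiser. The names `euler_integrate` and `toy_velocity` are hypothetical, and the real pipeline may use a different step schedule.

```python
from typing import Callable, List

Vector = List[float]


def euler_integrate(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    n_steps: int = 10,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        v = velocity(x, t)
        # One Euler update: move along the velocity for a step of size dt.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the learned velocity field (assumption, not the real UNet1D):
# a constant field pointing from the noise sample to a fixed target.
NOISE = [0.5, -1.0, 2.0]
TARGET = [1.0, 0.0, -1.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [b - a for a, b in zip(NOISE, TARGET)]


result = euler_integrate(toy_velocity, NOISE, n_steps=10)
```

With a constant velocity field the Euler solution lands on the target up to floating-point error. A learned field is not constant, so the number of steps trades speed against integration accuracy.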
| ## Files | |
| | File | Quant | Size | Notes | | |
| |---|---|---:|---| | |
| | `chatterbox-t3-f16.gguf` | F16 | 1.1 GB | T3 AR model, reference quality | |
| | `chatterbox-t3-q8_0.gguf` | Q8_0 | 630 MB | T3 AR model, recommended | |
| | `chatterbox-t3-q4_k.gguf` | Q4_K | 374 MB | T3 AR model, smallest | |
| | `chatterbox-s3gen-f16.gguf` | F16 | 574 MB | S3Gen + vocoder, reference quality | |
| | `chatterbox-s3gen-q8_0.gguf` | Q8_0 | 358 MB | S3Gen + vocoder, recommended | |
| | `chatterbox-s3gen-q4_k.gguf` | Q4_K | 248 MB | S3Gen + vocoder, smallest | |
| Note: vocoder weights (conv_pre, resblocks, conv_post, source fusion) are kept at F32 in all quant levels for audio quality. Quantization applies to the Conformer encoder, UNet decoder, and T3 Llama layers. | |
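Because both a T3 file and an S3Gen file are required, the download size for a given quant level is the sum of the two rows. A small sketch of that arithmetic, using the sizes from the table (1.1 GB is rounded to 1100 MB here, which is an approximation); `SIZES_MB` and `total_download_mb` are illustrative names:

```python
# File sizes in MB, copied from the table above (1.1 GB taken as 1100 MB).
SIZES_MB = {
    "f16": {"t3": 1100, "s3gen": 574},
    "q8_0": {"t3": 630, "s3gen": 358},
    "q4_k": {"t3": 374, "s3gen": 248},
}


def total_download_mb(quant: str) -> int:
    """Combined size of the T3 and S3Gen files for one quant level."""
    files = SIZES_MB[quant]
    return files["t3"] + files["s3gen"]


totals = {quant: total_download_mb(quant) for quant in SIZES_MB}
```

With these figures the Q8_0 pair comes to about 988 MB and the Q4_K pair to about 622 MB.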
| The T3 GGUF files include the BPE `tokenizer.ggml.tokens` + `tokenizer.ggml.merges` arrays. Earlier (pre-2026-05-08) uploads were missing the `merges` key, causing the CrispASR loader to fall back to a char-level tokenizer that dropped uppercase letters and spaces; if you see ASR roundtrip degradation against these files in a downstream check, re-pull them. | |
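To check whether a local T3 file already carries the `tokenizer.ggml.merges` key, the GGUF metadata header can be scanned without loading any tensors. The sketch below uses only the standard library and assumes the little-endian GGUF v2/v3 layout (magic, version, tensor count, key/value count, then typed key/value pairs). The function names are hypothetical; this is not part of CrispASR.

```python
import io
import struct
from typing import List

# Byte widths of the fixed-size GGUF metadata value types
# (uint8, int8, uint16, int16, uint32, int32, float32, bool, uint64, int64, float64).
_FIXED_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_TYPE_STRING = 8
_TYPE_ARRAY = 9


def _read_u32(f: io.BufferedIOBase) -> int:
    return struct.unpack("<I", f.read(4))[0]


def _read_u64(f: io.BufferedIOBase) -> int:
    return struct.unpack("<Q", f.read(8))[0]


def _read_string(f: io.BufferedIOBase) -> str:
    length = _read_u64(f)
    return f.read(length).decode("utf-8")


def _skip_value(f: io.BufferedIOBase, value_type: int) -> None:
    """Advance the stream past one metadata value of the given type."""
    if value_type in _FIXED_SIZES:
        f.seek(_FIXED_SIZES[value_type], 1)
    elif value_type == _TYPE_STRING:
        f.seek(_read_u64(f), 1)
    elif value_type == _TYPE_ARRAY:
        item_type = _read_u32(f)
        count = _read_u64(f)
        if item_type in _FIXED_SIZES:
            f.seek(_FIXED_SIZES[item_type] * count, 1)
        else:
            for _ in range(count):
                _skip_value(f, item_type)
    else:
        raise ValueError(f"unknown GGUF value type {value_type}")


def gguf_metadata_keys(f: io.BufferedIOBase) -> List[str]:
    """Return the metadata key names of a GGUF stream, in file order."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read_u32(f)
    if version < 2:
        raise ValueError("GGUF v1 uses 32-bit counts and is not handled here")
    _read_u64(f)  # tensor count, not needed for listing keys
    kv_count = _read_u64(f)
    keys = []
    for _ in range(kv_count):
        keys.append(_read_string(f))
        _skip_value(f, _read_u32(f))
    return keys


def has_bpe_merges(path: str) -> bool:
    """True if the file carries the BPE merges array described above."""
    with open(path, "rb") as f:
        return "tokenizer.ggml.merges" in gguf_metadata_keys(f)
```

Calling `has_bpe_merges("chatterbox-t3-q8_0.gguf")` on a local copy indicates whether it predates the fix and should be downloaded again.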
| ## Quick start | |
| ```bash | |
| # 1. Build CrispASR | |
| git clone https://github.com/CrispStrobe/CrispASR | |
| cd CrispASR | |
| cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF | |
| cmake --build build -j --target chatterbox | |
| # 2. Pull both model files | |
| huggingface-cli download cstr/chatterbox-GGUF chatterbox-t3-q8_0.gguf --local-dir . | |
| huggingface-cli download cstr/chatterbox-GGUF chatterbox-s3gen-q8_0.gguf --local-dir . | |
| # 3. Synthesise (C API / test binary; CLI adapter in progress) | |
| # See tests/test_voc_wav.cpp for vocoder-only usage | |
| ``` | |
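Step 2 can also be done from Python with `huggingface_hub`. The sketch below builds the two file names for a quant level from the naming pattern in the Files table and fetches them with `hf_hub_download`. `model_files` and `download_pair` are illustrative helpers, not part of any library.

```python
from typing import List

REPO_ID = "cstr/chatterbox-GGUF"
QUANTS = ("f16", "q8_0", "q4_k")


def model_files(quant: str = "q8_0") -> List[str]:
    """File names of the T3 and S3Gen models for one quant level."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quant level {quant!r}; expected one of {QUANTS}")
    return [f"chatterbox-t3-{quant}.gguf", f"chatterbox-s3gen-{quant}.gguf"]


def download_pair(quant: str = "q8_0", local_dir: str = ".") -> List[str]:
    """Download both files and return their local paths."""
    # Imported lazily so model_files() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in model_files(quant)
    ]
```

Calling `download_pair("q8_0")` fetches the same two files as the `huggingface-cli` commands above.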
| ## Architecture | |
| ``` | |
| Text → Character tokenizer (704 tokens) | |
| → T3 Llama AR (30 layers, 1024D, 16 heads, RoPE, SwiGLU, CFG) | |
| → 25 Hz speech tokens (6561 codebook) | |
| → Conformer encoder (6 pre + 4 post upsample, 512D, 8 heads) | |
| → 80-channel mel spectrogram | |
| → UNet1D CFM denoiser (1 down + 12 mid + 1 up, 256 ch, 10 Euler steps) | |
| → HiFTGenerator vocoder (3× ConvTranspose1d + 9 ResBlocks + Snake + iSTFT) | |
| → 24 kHz mono WAV | |
| ``` | |
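The two rates in the diagram fix the relation between token count and audio length: at 25 speech tokens per second and 24 kHz output, each token corresponds to 960 samples. A short sketch of that arithmetic follows. It assumes the output length is exactly proportional to the number of generated tokens and ignores any padding, trimming, or prompt tokens the real pipeline may add. The helper names are illustrative.

```python
TOKEN_RATE_HZ = 25       # speech tokens per second (from the diagram above)
SAMPLE_RATE_HZ = 24_000  # output WAV sample rate

# Audio samples produced per speech token: 24000 / 25 = 960.
SAMPLES_PER_TOKEN = SAMPLE_RATE_HZ // TOKEN_RATE_HZ


def tokens_to_seconds(n_tokens: int) -> float:
    """Approximate audio duration for a number of speech tokens."""
    return n_tokens / TOKEN_RATE_HZ


def tokens_to_samples(n_tokens: int) -> int:
    """Approximate number of 24 kHz samples for a number of speech tokens."""
    return n_tokens * SAMPLES_PER_TOKEN
```

For example, 250 speech tokens correspond to about 10 seconds, or 240,000 samples.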
| ## Quality verification | |
| ASR roundtrip on Python reference mel (no source fusion, deterministic): | |
| | Metric | Value | | |
| |---|---| | |
| | ASR output (moonshine-base) | **"Hello world"** (correct) | | |
| | Per-stage cosine vs Python ref | **1.000** (conv_pre through rb_2) | | |
| | Waveform cosine vs torch.istft | **0.93** | | |
| | STFT range | [-0.82, 2.0] (ref [-1.1, 1.7]) | | |
| All quantization levels (F16/Q8_0/Q4_K) produce ASR-identical output on the reference mel. | |
| The `crispasr-diff chatterbox …` harness reports `[PASS] t3_cond_emb cos≈1.000`, `[PASS] t3_prefill_emb[0] cos≈1.000` against the F16 reference for all three quant levels. | |
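The cosine figures above compare an intermediate tensor from the ggml build against the same tensor from the reference. How `crispasr-diff` computes them is not shown here. The sketch below is the standard cosine-similarity definition over flattened tensors, with `cosine_similarity` as an illustrative name.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equally sized, flattened tensors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Cosine similarity ignores overall scale, so a value of 1.000 does not rule out a gain difference between the two outputs. Reporting a value range next to it, as the STFT row does, covers that case.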
| ## Conversion | |
| ```bash | |
| python models/convert-chatterbox-to-gguf.py \ | |
| --input ResembleAI/chatterbox \ | |
| --output-dir . | |
| ``` | |
| Requires `pip install gguf safetensors torch huggingface_hub`. | |
| ## Related models | |
| - [`cstr/lahgtna-chatterbox-v1-GGUF`](https://huggingface.co/cstr/lahgtna-chatterbox-v1-GGUF): Arabic T3 variant (MIT, shares S3Gen) | |
| - [`cstr/orpheus-3b-base-GGUF`](https://huggingface.co/cstr/orpheus-3b-base-GGUF): Llama-3.2 + SNAC TTS | |
| - [`cstr/qwen3-tts-0.6b-customvoice-GGUF`](https://huggingface.co/cstr/qwen3-tts-0.6b-customvoice-GGUF): Qwen3-TTS with fixed speakers | |
| ## License | |
| MIT, same as the upstream [ResembleAI/chatterbox](https://huggingface.co/ResembleAI/chatterbox). | |