Chatterbox-TTS GGUF

GGUF conversions of Resemble AI's Chatterbox-TTS.

The Chatterbox stack is split into two halves so each can run inside the runtime that owns it:

  • LLM-part (T3): text + conditioning to discrete speech token IDs, run by stock llama.cpp (llama arch).
  • Codec-part (S3G / S3T): discrete speech tokens to/from 24 kHz PCM, run by codec.cpp.

Only payload speech token IDs (0..6560) cross the boundary between the two runtimes.
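That boundary filter can be sketched in a few lines. The token values come from this card; the payload_ids helper is illustrative, not part of either runtime:

```python
# Only payload speech token IDs (0..6560) cross the runtime boundary;
# control tokens are consumed on the T3 side. Values from the model card.
PAYLOAD_MAX = 6560          # codebook size 6561 -> IDs 0..6560
START_SPEECH_TOKEN = 6561   # BOS, never forwarded to the codec
STOP_SPEECH_TOKEN = 6562    # EOS, terminates generation

def payload_ids(t3_output):
    """Strip control tokens from a T3 sample stream; stop at EOS."""
    out = []
    for tok in t3_output:
        if tok == STOP_SPEECH_TOKEN:
            break
        if tok == START_SPEECH_TOKEN:
            continue
        if 0 <= tok <= PAYLOAD_MAX:
            out.append(tok)
    return out

print(payload_ids([6561, 12, 407, 6560, 6562, 99]))  # -> [12, 407, 6560]
```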

Files

LLM-part (T3)

t3-<quant>.gguf is a stock llama arch GGUF holding only T3's 30-layer Llama backbone (hidden 1024, MLP 4096, 16 heads, RoPE θ=10000) plus the speech-side embedding/head (speech_emb → token_embd, speech_head → output). The vocabulary contains 8194 entries (<|speech_0|> … <|speech_8193|>), every one marked as a CONTROL (special) token so llama.cpp's sampler / stop / grammar paths recognise each speech ID. BOS = 6561 (start_speech_token), EOS = 6562 (stop_speech_token); add_bos / add_eos default to false.
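Because the vocabulary follows a fixed naming scheme, the ID-to-piece mapping is purely mechanical. The helpers below are an illustrative sketch of that mapping, not part of any shipped tooling:

```python
# T3 GGUF vocabulary: 8194 entries named <|speech_0|> .. <|speech_8193|>,
# all flagged CONTROL. Helper names here are hypothetical.
SPEECH_VOCAB_SIZE = 8194

def id_to_piece(tok_id):
    assert 0 <= tok_id < SPEECH_VOCAB_SIZE
    return f"<|speech_{tok_id}|>"

def piece_to_id(piece):
    assert piece.startswith("<|speech_") and piece.endswith("|>")
    return int(piece[len("<|speech_"):-2])

print(id_to_piece(6561))               # -> <|speech_6561|> (start_speech_token)
print(piece_to_id("<|speech_6562|>"))  # -> 6562 (stop_speech_token)
```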

File              Size
t3-f32.gguf       1984 MB
t3-f16.gguf        992 MB
t3-q8_0.gguf       527 MB
t3-q5_k_m.gguf     352 MB
t3-q4_k_m.gguf     299 MB

t3-extras.gguf (chatterbox_t3 arch, ~24 MB) carries the T3 wrapper tensors that aren't part of the Llama backbone: the text embedding/head, the learned positional embeddings, and the conditioning encoder (perceiver / speaker / emotion fc). The host application reads these through the gguf API to build the prompt prefix:

  • t3.text_emb.weight, t3.text_head.weight
  • t3.text_pos_emb.weight, t3.speech_pos_emb.weight
  • t3.cond_enc.* (perceiver attention/ff, spkr_enc, emotion_adv_fc)

The same file also exposes T3 metadata (chatterbox_t3.start_speech_token, speech_cond_prompt_len, etc.) needed to drive the LLM.

Codec-part (S3G decoder)

codec[-<quant>].gguf (chatterbox_s3g arch): Chatterbox's S3Gen flow-matching decoder + HiFi-GAN vocoder, with the default conditioning (conds.pt) baked in.

File              Size
codec-f32.gguf     535 MB
codec-f16.gguf     268 MB
codec-q8_0.gguf    196 MB
codec-q5_k_m.gguf  167 MB
codec-q4_k_m.gguf  158 MB

Codec-part (S3T tokenizer)

s3t.gguf (chatterbox_s3t arch, F16, 236 MB): S3 audio tokenizer (24 kHz mel front-end, 25 Hz token rate, codebook 6561). Only needed for voice-cloning / reference-token paths.

Boundary contract

  • token rate: 25 Hz, n_q = 1
  • payload codebook size: 6561 (IDs 0..6560)
  • start_speech_token = 6561, stop_speech_token = 6562
  • speech vocabulary in T3: 8194 (speech_tokens_dict_size)
  • encode sample rate: 16 kHz, decode sample rate: 24 kHz
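A few useful quantities fall straight out of the contract arithmetic: at a 25 Hz token rate, one token covers 24000/25 = 960 decoded samples and 16000/25 = 640 encoded samples. A small illustration (not shipped code):

```python
# Boundary-contract arithmetic from the model card.
TOKEN_RATE_HZ = 25
ENCODE_SR = 16_000   # tokenizer input sample rate
DECODE_SR = 24_000   # vocoder output sample rate

samples_per_token_decode = DECODE_SR // TOKEN_RATE_HZ  # 960
samples_per_token_encode = ENCODE_SR // TOKEN_RATE_HZ  # 640

def audio_seconds(n_tokens):
    """Speech duration represented by a payload token sequence."""
    return n_tokens / TOKEN_RATE_HZ

def n_pcm_samples(n_tokens):
    """24 kHz PCM samples produced when decoding n_tokens tokens."""
    return n_tokens * samples_per_token_decode

print(audio_seconds(250))   # 10.0 seconds of speech
print(n_pcm_samples(250))   # 240000 samples at 24 kHz
```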

Notes

  • Loading the LLM-part: stock llama.cpp (≥ b8095) loads t3-*.gguf directly. Because the model is no_vocab, the host application is responsible for: tokenizing text via tokenizer.json, looking up t3.text_emb for the text-side prefix, building the conditioning prefix from t3.cond_enc.*, adding the matching learned positional embeddings, and feeding the assembled embedding sequence to llama_decode. After the prefix is consumed, generation proceeds autoregressively over speech tokens (which is what the speech-side embedding/head in t3-*.gguf are for).
  • Quantization uses llama-quantize for the T3 backbone and codec.cpp's converter (with a fix that recognises .weight names) for the codec-part.
  • Source weights: t3_cfg.safetensors and s3gen.safetensors from ResembleAI/chatterbox. The multilingual T3 variants (t3_23lang, t3_mtl23ls_v2/v3) are not converted here.
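The prefix assembly from the first note can be sketched end-to-end. Everything below is illustrative: toy hidden size, placeholder tensor contents, and a hypothetical build_prefix helper; in a real host the tensors come from t3-extras.gguf and the result is fed to llama_decode as embeddings.

```python
# Toy-sized sketch of host-side prefix assembly (real hidden size is 1024).
# Names in comments match t3-extras.gguf tensors; values are placeholders.
HIDDEN = 4

def row(seed):
    return [float(seed + j) for j in range(HIDDEN)]

text_emb     = [row(i) for i in range(10)]        # t3.text_emb.weight
text_pos_emb = [row(100 + i) for i in range(8)]   # t3.text_pos_emb.weight
cond_prefix  = [row(200 + i) for i in range(3)]   # output of t3.cond_enc.*

def build_prefix(text_ids):
    """[conditioning | text embedding + learned positional embedding]."""
    txt = [
        [e + p for e, p in zip(text_emb[tid], text_pos_emb[pos])]
        for pos, tid in enumerate(text_ids)
    ]
    return cond_prefix + txt  # embedding sequence handed to llama_decode

prefix = build_prefix([5, 9, 2])
print(len(prefix), len(prefix[0]))  # 6 4
```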