Chatterbox-TTS GGUF

GGUF conversions of Resemble AI's Chatterbox-TTS.

The Chatterbox stack is split into two halves so each can run inside the runtime that owns it:

  • LLM-part (T3): text + conditioning to discrete speech token IDs, run by stock llama.cpp (llama arch).
  • Codec-part (S3G / S3T): discrete speech tokens to/from 24 kHz PCM, run by codec.cpp.

Only payload speech token IDs (0..6560) cross the boundary between the two runtimes.
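That boundary filter can be sketched in a few lines. The token values come from this card; the payload_ids helper is illustrative, not part of either runtime:

```python
# Only payload speech token IDs (0..6560) cross the runtime boundary;
# control tokens are consumed on the T3 side. Values from the model card.
PAYLOAD_MAX = 6560          # codebook size 6561 -> IDs 0..6560
START_SPEECH_TOKEN = 6561   # BOS, never forwarded to the codec
STOP_SPEECH_TOKEN = 6562    # EOS, terminates generation

def payload_ids(t3_output):
    """Strip control tokens from a T3 sample stream; stop at EOS."""
    out = []
    for tok in t3_output:
        if tok == STOP_SPEECH_TOKEN:
            break
        if tok == START_SPEECH_TOKEN:
            continue
        if 0 <= tok <= PAYLOAD_MAX:
            out.append(tok)
    return out

print(payload_ids([6561, 12, 407, 6560, 6562, 99]))  # -> [12, 407, 6560]
```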

Files

LLM-part (T3)

t3-<quant>.gguf is a stock llama arch GGUF holding only T3's 30-layer Llama backbone (hidden 1024, MLP 4096, 16 heads, RoPE θ=10000) plus the speech-side embedding/head (speech_emb → token_embd, speech_head → output). The vocabulary contains 8194 entries (<|speech_0|> … <|speech_8193|>), every one marked as a CONTROL (special) token so llama.cpp's sampler / stop / grammar paths recognise each speech ID. BOS = 6561 (start_speech_token), EOS = 6562 (stop_speech_token); add_bos / add_eos default to false.
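Because the vocabulary follows a fixed naming scheme, the ID-to-piece mapping is purely mechanical. The helpers below are an illustrative sketch of that mapping, not part of any shipped tooling:

```python
# T3 GGUF vocabulary: 8194 entries named <|speech_0|> .. <|speech_8193|>,
# all flagged CONTROL. Helper names here are hypothetical.
SPEECH_VOCAB_SIZE = 8194

def id_to_piece(tok_id):
    assert 0 <= tok_id < SPEECH_VOCAB_SIZE
    return f"<|speech_{tok_id}|>"

def piece_to_id(piece):
    assert piece.startswith("<|speech_") and piece.endswith("|>")
    return int(piece[len("<|speech_"):-2])

print(id_to_piece(6561))               # -> <|speech_6561|> (start_speech_token)
print(piece_to_id("<|speech_6562|>"))  # -> 6562 (stop_speech_token)
```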

File              Size
t3-f32.gguf       1984 MB
t3-f16.gguf        992 MB
t3-q8_0.gguf       527 MB
t3-q5_k_m.gguf     352 MB
t3-q4_k_m.gguf     299 MB

t3-extras.gguf (chatterbox_t3 arch, ~24 MB) carries the T3 wrapper tensors that aren't part of the Llama backbone: the text embedding/head, the learned positional embeddings, and the conditioning encoder (perceiver / speaker / emotion fc). The host application reads these through the gguf API to build the prompt prefix:

  • t3.text_emb.weight, t3.text_head.weight
  • t3.text_pos_emb.weight, t3.speech_pos_emb.weight
  • t3.cond_enc.* (perceiver attention/ff, spkr_enc, emotion_adv_fc)

The same file also exposes T3 metadata (chatterbox_t3.start_speech_token, speech_cond_prompt_len, etc.) needed to drive the LLM.

Codec-part (S3G decoder)

codec[-<quant>].gguf (chatterbox_s3g arch): Chatterbox's S3Gen flow-matching decoder + HiFi-GAN vocoder, with the default conditioning (conds.pt) baked in.

File              Size
codec-f32.gguf     535 MB
codec-f16.gguf     268 MB
codec-q8_0.gguf    196 MB
codec-q5_k_m.gguf  167 MB
codec-q4_k_m.gguf  158 MB

Codec-part (S3T tokenizer)

s3t.gguf (chatterbox_s3t arch, F16, 236 MB): S3 audio tokenizer (24 kHz mel front-end, 25 Hz token rate, codebook 6561). Only needed for voice-cloning / reference-token paths.

Boundary contract

  • token rate: 25 Hz, n_q = 1
  • payload codebook size: 6561 (IDs 0..6560)
  • start_speech_token = 6561, stop_speech_token = 6562
  • speech vocabulary in T3: 8194 (speech_tokens_dict_size)
  • encode sample rate: 16 kHz, decode sample rate: 24 kHz
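A few useful quantities fall straight out of the contract arithmetic: at a 25 Hz token rate, one token covers 24000/25 = 960 decoded samples and 16000/25 = 640 encoded samples. A small illustration (not shipped code):

```python
# Boundary-contract arithmetic from the model card.
TOKEN_RATE_HZ = 25
ENCODE_SR = 16_000   # tokenizer input sample rate
DECODE_SR = 24_000   # vocoder output sample rate

samples_per_token_decode = DECODE_SR // TOKEN_RATE_HZ  # 960
samples_per_token_encode = ENCODE_SR // TOKEN_RATE_HZ  # 640

def audio_seconds(n_tokens):
    """Speech duration represented by a payload token sequence."""
    return n_tokens / TOKEN_RATE_HZ

def n_pcm_samples(n_tokens):
    """24 kHz PCM samples produced when decoding n_tokens tokens."""
    return n_tokens * samples_per_token_decode

print(audio_seconds(250))   # 10.0 seconds of speech
print(n_pcm_samples(250))   # 240000 samples at 24 kHz
```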

Notes

  • Loading the LLM-part: stock llama.cpp (≥ b8095) loads t3-*.gguf directly. Because the model is no_vocab, the host application is responsible for: tokenizing text via tokenizer.json, looking up t3.text_emb for the text-side prefix, building the conditioning prefix from t3.cond_enc.*, adding the matching learned positional embeddings, and feeding the assembled embedding sequence to llama_decode. After the prefix is consumed, generation proceeds autoregressively over speech tokens (which is what the speech-side embedding/head in t3-*.gguf are for).
  • Quantization uses llama-quantize for the T3 backbone and codec.cpp's converter (with a fix that recognises .weight names) for the codec-part.
  • Source weights: t3_cfg.safetensors and s3gen.safetensors from ResembleAI/chatterbox. The multilingual T3 variants (t3_23lang, t3_mtl23ls_v2/v3) are not converted here.
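The prefix assembly from the first note can be sketched end-to-end. Everything below is illustrative: toy hidden size, placeholder tensor contents, and a hypothetical build_prefix helper; in a real host the tensors come from t3-extras.gguf and the result is fed to llama_decode as embeddings.

```python
# Toy-sized sketch of host-side prefix assembly (real hidden size is 1024).
# Names in comments match t3-extras.gguf tensors; values are placeholders.
HIDDEN = 4

def row(seed):
    return [float(seed + j) for j in range(HIDDEN)]

text_emb     = [row(i) for i in range(10)]        # t3.text_emb.weight
text_pos_emb = [row(100 + i) for i in range(8)]   # t3.text_pos_emb.weight
cond_prefix  = [row(200 + i) for i in range(3)]   # output of t3.cond_enc.*

def build_prefix(text_ids):
    """[conditioning | text embedding + learned positional embedding]."""
    txt = [
        [e + p for e, p in zip(text_emb[tid], text_pos_emb[pos])]
        for pos, tid in enumerate(text_ids)
    ]
    return cond_prefix + txt  # embedding sequence handed to llama_decode

prefix = build_prefix([5, 9, 2])
print(len(prefix), len(prefix[0]))  # 6 4
```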