Chatterbox-TTS GGUF
GGUF conversions of Resemble AI's Chatterbox-TTS.
The Chatterbox stack is split into two halves so each can run inside the runtime that owns it:

- LLM-part (T3) → text + conditioning to discrete speech token IDs, run by stock llama.cpp (llama arch).
- Codec-part (S3G / S3T) → discrete speech tokens to/from 24 kHz PCM, run by codec.cpp.

Only payload speech token IDs (0..6560) cross the boundary between the two runtimes.
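The hand-off above can be sketched as two stages that exchange nothing but integer token IDs. This is a minimal illustration, not the real llama.cpp / codec.cpp API; all function names here are hypothetical stand-ins:

```python
# Boundary constants from the model card.
START_SPEECH_TOKEN = 6561  # BOS emitted by T3
STOP_SPEECH_TOKEN = 6562   # EOS emitted by T3
PAYLOAD_CODEBOOK = 6561    # payload IDs are 0..6560

def t3_generate(text: str) -> list[int]:
    """Stand-in for the llama.cpp side: text -> speech token IDs."""
    # A real run would emit BOS, then payload tokens, then EOS.
    return [START_SPEECH_TOKEN, 100, 200, 300, STOP_SPEECH_TOKEN]

def extract_payload(tokens: list[int]) -> list[int]:
    """Only payload IDs (0..6560) may cross to codec.cpp."""
    payload = [t for t in tokens
               if t not in (START_SPEECH_TOKEN, STOP_SPEECH_TOKEN)]
    assert all(0 <= t < PAYLOAD_CODEBOOK for t in payload)
    return payload

def s3g_decode(payload: list[int], sr: int = 24_000) -> int:
    """Stand-in for codec.cpp: 25 Hz tokens -> number of PCM samples."""
    return len(payload) * (sr // 25)  # 960 samples per token at 24 kHz
```

Stripping BOS/EOS before crossing the boundary is the host's job; codec.cpp only ever sees the 0..6560 payload range.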
Files
LLM-part (T3)
t3-<quant>.gguf is a stock llama-arch GGUF holding only T3's
30-layer Llama backbone (hidden 1024, MLP 4096, 16 heads, RoPE θ = 10000)
plus the speech-side embedding/head (speech_emb → token_embd,
speech_head → output). The vocabulary contains 8194 entries
(<|speech_0|> … <|speech_8193|>), every one marked as a CONTROL
(special) token so llama.cpp's sampler / stop / grammar paths recognise
each speech ID. BOS = 6561 (start_speech_token), EOS = 6562
(stop_speech_token); add_bos / add_eos default to false.
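A quick sanity sketch of that vocabulary layout, reconstructed from the numbers above (the `<|speech_N|>` naming follows the card; nothing here reads the actual GGUF):

```python
# speech_tokens_dict_size: 6561 payload IDs + BOS + EOS + reserved tail.
SPEECH_VOCAB_SIZE = 8194

# Every entry is a CONTROL (special) token named <|speech_N|>.
vocab = [f"<|speech_{i}|>" for i in range(SPEECH_VOCAB_SIZE)]

def speech_token_name(tok_id: int) -> str:
    """Map a token ID to its control-token name in t3-*.gguf."""
    assert 0 <= tok_id < SPEECH_VOCAB_SIZE
    return f"<|speech_{tok_id}|>"

BOS = 6561  # start_speech_token
EOS = 6562  # stop_speech_token
```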
| File | Size |
|---|---|
| t3-f32.gguf | 1984 MB |
| t3-f16.gguf | 992 MB |
| t3-q8_0.gguf | 527 MB |
| t3-q5_k_m.gguf | 352 MB |
| t3-q4_k_m.gguf | 299 MB |
t3-extras.gguf (chatterbox_t3 arch, ~24 MB) carries the T3 wrapper
tensors that aren't part of the Llama backbone: text embedding/head,
learned positional embeddings, and the conditioning encoder
(perceiver / speaker / emotion fc). The host application reads these
through the gguf API to build the prompt prefix:

- t3.text_emb.weight, t3.text_head.weight
- t3.text_pos_emb.weight, t3.speech_pos_emb.weight
- t3.cond_enc.* (perceiver attention/ff, spkr_enc, emotion_adv_fc)

The same file also exposes T3 metadata (chatterbox_t3.start_speech_token,
speech_cond_prompt_len, etc.) needed to drive the LLM.
Codec-part – S3G decoder
codec[-<quant>].gguf (chatterbox_s3g arch) – Chatterbox S3Gen
flow-matching decoder + HiFi-GAN vocoder, with built-in conditioning
(conds.pt) baked in.
| File | Size |
|---|---|
| codec-f32.gguf | 535 MB |
| codec-f16.gguf | 268 MB |
| codec-q8_0.gguf | 196 MB |
| codec-q5_k_m.gguf | 167 MB |
| codec-q4_k_m.gguf | 158 MB |
Codec-part – S3T tokenizer
s3t.gguf (chatterbox_s3t arch, F16, 236 MB) – S3 audio tokenizer
(24 kHz mel front-end, 25 Hz token rate, codebook 6561). Only needed
for voice-cloning / reference-token paths.
Boundary contract
- token rate: 25 Hz, n_q = 1
- payload codebook size: 6561 (IDs 0..6560)
- start_speech_token = 6561, stop_speech_token = 6562
- speech vocabulary in T3: 8194 (speech_tokens_dict_size)
- encode sample rate: 16 kHz, decode sample rate: 24 kHz
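The contract pins down fixed samples-per-token ratios that a host can encode as constants. A small sketch (the constant and helper names are mine, not part of either runtime):

```python
# Boundary contract from the model card.
TOKEN_RATE_HZ = 25
ENCODE_SR = 16_000   # S3T input sample rate
DECODE_SR = 24_000   # S3G output sample rate
PAYLOAD_CODEBOOK = 6561

SAMPLES_PER_TOKEN_ENC = ENCODE_SR // TOKEN_RATE_HZ  # 640 at 16 kHz
SAMPLES_PER_TOKEN_DEC = DECODE_SR // TOKEN_RATE_HZ  # 960 at 24 kHz

def tokens_for_seconds(sec: float) -> int:
    """Speech tokens T3 must emit for `sec` seconds of audio."""
    return round(sec * TOKEN_RATE_HZ)

def is_payload(tok: int) -> bool:
    """True iff `tok` may cross the llama.cpp -> codec.cpp boundary."""
    return 0 <= tok < PAYLOAD_CODEBOOK
```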
Notes
- Loading the LLM-part: stock llama.cpp (≥ b8095) loads t3-*.gguf
  directly. Because the model is no_vocab, the host application is
  responsible for: tokenizing text via tokenizer.json, looking up
  t3.text_emb for the text-side prefix, building the conditioning
  prefix from t3.cond_enc.*, adding the matching learned positional
  embeddings, and feeding the assembled embedding sequence to
  llama_decode. After the prefix is consumed, generation proceeds
  autoregressively over speech tokens (which is what the speech-side
  embedding/head in t3-*.gguf are for).
- Quantization uses llama-quantize for the T3 backbone and codec.cpp's
  converter (with a fix that recognises .weight names) for the
  codec-part.
- Source weights: t3_cfg.safetensors and s3gen.safetensors from
  ResembleAI/chatterbox. The multilingual T3 variants (t3_23lang,
  t3_mtl23ls_v2/v3) are not converted here.
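The host-side prefix assembly from the first note can be sketched in terms of tensor shapes. This is a rough illustration only: the hidden size (1024) comes from the card, while the conditioning length, text vocabulary size, and the zero-filled stand-in tensors are all illustrative placeholders for data actually read via the gguf API:

```python
import numpy as np

HIDDEN = 1024  # T3 backbone hidden size (from the card)

# Illustrative stand-ins for tensors in t3-extras.gguf; the row counts
# here are made up for the sketch.
text_emb = np.zeros((2048, HIDDEN), dtype=np.float32)      # t3.text_emb.weight
text_pos_emb = np.zeros((2048, HIDDEN), dtype=np.float32)  # t3.text_pos_emb.weight

def cond_prefix(n_cond: int) -> np.ndarray:
    """Stand-in for the conditioning encoder output (t3.cond_enc.*)."""
    return np.zeros((n_cond, HIDDEN), dtype=np.float32)

def build_prefix(text_ids: list[int], n_cond: int) -> np.ndarray:
    """Conditioning prefix, then text embeddings + learned positions."""
    text_part = text_emb[text_ids] + text_pos_emb[: len(text_ids)]
    return np.concatenate([cond_prefix(n_cond), text_part], axis=0)

# The resulting (n_cond + n_text, 1024) matrix is fed to llama_decode
# as an embedding sequence instead of token IDs.
prefix = build_prefix([1, 2, 3], n_cond=34)
```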