
	---
	license: mit
	language:
	- ar
	- da
	- de
	- el
	- en
	- es
	- fi
	- fr
	- he
	- hi
	- it
	- ja
	- ko
	- ms
	- nl
	- no
	- pl
	- pt
	- ru
	- sv
	- sw
	- tr
	- zh
	base_model:
	- ResembleAI/chatterbox
	pipeline_tag: text-to-speech
	tags:
	- tts
	- text-to-speech
	- chatterbox
	- flow-matching
	- hifi-gan
	- gguf
	- crispasr
	library_name: ggml
	---

	# Chatterbox TTS — GGUF (ggml-quantised)

	GGUF / ggml conversion of [`ResembleAI/chatterbox`](https://huggingface.co/ResembleAI/chatterbox) for use with [CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR).

	Chatterbox is a full TTS pipeline: BPE text tokenizer → T3 (30-layer Llama AR, 520M parameters) → speech tokens → S3Gen (Conformer encoder + UNet1D CFM denoiser, 10 Euler steps) → HiFTGenerator vocoder (conv chains + Snake activations + iSTFT) → 24 kHz WAV. It is distributed under the MIT license.

	Two GGUF files are needed: the T3 model (text → speech tokens) and the S3Gen model (speech tokens → audio).

	## Files

	| File | Quant | Size | Notes |
	|---|---|---:|---|
	| `chatterbox-t3-f16.gguf` | F16 | 1.1 GB | T3 AR model, reference quality |
	| `chatterbox-t3-q8_0.gguf` | Q8_0 | 630 MB | T3 AR model, recommended |
	| `chatterbox-t3-q4_k.gguf` | Q4_K | 374 MB | T3 AR model, smallest |
	| `chatterbox-s3gen-f16.gguf` | F16 | 574 MB | S3Gen + vocoder, reference quality |
	| `chatterbox-s3gen-q8_0.gguf` | Q8_0 | 358 MB | S3Gen + vocoder, recommended |
	| `chatterbox-s3gen-q4_k.gguf` | Q4_K | 248 MB | S3Gen + vocoder, smallest |

	Note: vocoder weights (conv_pre, resblocks, conv_post, source fusion) are kept at F32 in all quant levels for audio quality. Quantization applies to the Conformer encoder, UNet decoder, and T3 Llama layers.
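
Because one T3 file and one S3Gen file are always needed together, the download size for a quant level is the sum of two rows of the Files table. The small sketch below does that arithmetic; the sizes are copied from the table, `total_mb` is a hypothetical helper name, and 1.1 GB is assumed to mean 1100 MB.

```python
# Sizes in MB, copied from the Files table (1.1 GB is assumed to be 1100 MB).
SIZES_MB = {
    "f16": {"t3": 1100, "s3gen": 574},
    "q8_0": {"t3": 630, "s3gen": 358},
    "q4_k": {"t3": 374, "s3gen": 248},
}


def total_mb(quant: str) -> int:
    """Combined size in MB of the T3 and S3Gen files at one quant level."""
    sizes = SIZES_MB[quant]
    return sizes["t3"] + sizes["s3gen"]


if __name__ == "__main__":
    for quant in SIZES_MB:
        print(f"{quant}: about {total_mb(quant)} MB for both files")
```

The figures are only as precise as the rounded sizes in the table.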

	The T3 GGUF files include the BPE tokenizer as `tokenizer.ggml.tokens` and `tokenizer.ggml.merges` arrays. Uploads from before 2026-05-08 were missing the `merges` key, so the CrispASR loader fell back to a character-level tokenizer that dropped uppercase letters and spaces. If a downstream ASR roundtrip check degrades with these files, re-download them.
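
To check whether a local copy predates the tokenizer fix, the metadata keys and tensor types can be read with the `gguf` Python package. The following is a minimal sketch, not part of CrispASR: the helper names are hypothetical, and it assumes `GGUFReader` from `gguf` exposes `fields` and `tensors` as in current releases of that package.

```python
from collections import Counter

# Metadata keys the T3 files are expected to carry, per the note above.
REQUIRED_TOKENIZER_KEYS = ("tokenizer.ggml.tokens", "tokenizer.ggml.merges")


def missing_tokenizer_keys(field_names):
    """Return the required tokenizer keys that are absent from the metadata."""
    present = set(field_names)
    return [key for key in REQUIRED_TOKENIZER_KEYS if key not in present]


def count_tensor_types(type_names):
    """Tally tensors by storage type name, for example F32 or Q8_0."""
    return dict(Counter(type_names))


def inspect_gguf(path):
    """Read one GGUF file; return (missing tokenizer keys, tensor type counts)."""
    from gguf import GGUFReader  # lazy import; needs `pip install gguf`

    reader = GGUFReader(path)
    missing = missing_tokenizer_keys(reader.fields.keys())
    types = count_tensor_types(t.tensor_type.name for t in reader.tensors)
    return missing, types


# Example call (requires a downloaded file):
#   missing, types = inspect_gguf("chatterbox-t3-q8_0.gguf")
```

An empty `missing` list for a T3 file suggests the copy includes the merges array. The type counts for an S3Gen file give a rough view of how many tensors stay at F32.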

	## Quick start

	```bash
	# 1. Build CrispASR
	git clone https://github.com/CrispStrobe/CrispASR
	cd CrispASR
	cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
	cmake --build build -j --target chatterbox

	# 2. Pull both model files
	huggingface-cli download cstr/chatterbox-GGUF chatterbox-t3-q8_0.gguf --local-dir .
	huggingface-cli download cstr/chatterbox-GGUF chatterbox-s3gen-q8_0.gguf --local-dir .

	# 3. Synthesise (C API / test binary — CLI adapter in progress)
	# See tests/test_voc_wav.cpp for vocoder-only usage
	```
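
Step 2 can also be done from Python with `huggingface_hub`. This is a sketch under the assumption that the file names follow the pattern in the Files table; `model_files` and `download_models` are hypothetical helper names, while `hf_hub_download` is the library's own function.

```python
REPO_ID = "cstr/chatterbox-GGUF"  # repository named in the quick start
QUANT_LEVELS = ("f16", "q8_0", "q4_k")


def model_files(quant: str = "q8_0"):
    """File names of the T3 and S3Gen models for one quant level."""
    if quant not in QUANT_LEVELS:
        raise ValueError(f"unknown quant level: {quant}")
    return [f"chatterbox-t3-{quant}.gguf", f"chatterbox-s3gen-{quant}.gguf"]


def download_models(quant: str = "q8_0", local_dir: str = "."):
    """Download both files into local_dir and return their local paths."""
    from huggingface_hub import hf_hub_download  # needs `pip install huggingface_hub`

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in model_files(quant)
    ]


if __name__ == "__main__":
    print(model_files("q8_0"))
```

Calling `download_models()` fetches the recommended Q8_0 pair into the current directory.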

	## Architecture

	```
	Text → BPE tokenizer (704 tokens)
	→ T3 Llama AR (30 layers, 1024D, 16 heads, RoPE, SwiGLU, CFG)
	→ 25 Hz speech tokens (6561 codebook)
	→ Conformer encoder (6 pre + 4 post upsample, 512D, 8 heads)
	→ 80-channel mel spectrogram
	→ UNet1D CFM denoiser (1 down + 12 mid + 1 up, 256 ch, 10 Euler steps)
	→ HiFTGenerator vocoder (3× ConvTranspose1d + 9 ResBlocks + Snake + iSTFT)
	→ 24 kHz mono WAV
	```
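
The "10 Euler steps" in the diagram refer to fixed-step Euler integration of an ODE that moves a sample from noise toward a mel spectrogram. The sketch below is a generic illustration of that technique with a toy constant velocity field. `euler_integrate` and `toy_velocity` are hypothetical names, the uniform time grid is an assumption, and nothing here is taken from the CrispASR or upstream code, where a learned network supplies the velocity.

```python
def euler_integrate(velocity, x0, n_steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    dt = 1.0 / n_steps
    x = list(x0)
    for step in range(n_steps):
        t = step * dt
        v = velocity(x, t)
        # One Euler update: move along the current velocity for a time dt.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the denoiser: a constant velocity pointing from the
# starting noise to a fixed target (a straight-line flow).
noise = [0.5, -1.0, 2.0]
target = [1.0, 0.0, -1.0]


def toy_velocity(x, t):
    return [b - a for a, b in zip(noise, target)]


result = euler_integrate(toy_velocity, noise)

if __name__ == "__main__":
    print(result)  # lands on the target up to floating-point error
```

With a constant velocity the Euler result is exact up to rounding. A learned, state-dependent velocity field would make the step count a trade-off between speed and accuracy.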

	## Quality verification

	ASR roundtrip using the Python reference mel as vocoder input (no source fusion, deterministic):

	| Metric | Value |
	|---|---|
	| ASR output (moonshine-base) | "Hello world" (correct) |
	| Per-stage cosine vs Python ref | 1.000 (conv_pre through rb_2) |
	| Waveform cosine vs torch.istft | 0.93 |
	| STFT range | [-0.82, 2.0] (ref [-1.1, 1.7]) |

	All quantization levels (F16/Q8_0/Q4_K) produce ASR-identical output on the reference mel.

	The `crispasr-diff chatterbox …` harness reports `[PASS] t3_cond_emb cos≈1.000` and `[PASS] t3_prefill_emb[0] cos≈1.000` against the F16 reference for all three quant levels.
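
The cosine figures above compare two tensors flattened to vectors: 1.000 means they point in the same direction, and lower values mean they diverge. As a minimal illustration of the metric itself (not the `crispasr-diff` implementation, which is not shown here), with `cosine_similarity` as a hypothetical helper name:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    reference = [0.10, -0.25, 0.40]
    candidate = [0.11, -0.24, 0.41]  # made-up values, for illustration only
    print(f"cos = {cosine_similarity(reference, candidate):.3f}")
```

Cosine similarity ignores overall scale, so a uniformly louder or quieter waveform still scores 1.0. This is one reason the table also reports the STFT value range.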

	## Conversion

	```bash
	python models/convert-chatterbox-to-gguf.py \
	  --input ResembleAI/chatterbox \
	  --output-dir .
	```

	The script needs the `gguf`, `safetensors`, `torch`, and `huggingface_hub` packages: `pip install gguf safetensors torch huggingface_hub`.
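
Before running the converter, it can help to confirm that those packages are importable in the active environment. The check below is a small sketch using only the standard library; `missing_packages` is a hypothetical helper and is not part of the conversion script.

```python
import importlib.util

# Packages listed as requirements for the conversion script.
REQUIRED = ("gguf", "safetensors", "torch", "huggingface_hub")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing: " + " ".join(missing))
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All conversion dependencies are importable.")
```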

	## Related models

	- [`cstr/lahgtna-chatterbox-v1-GGUF`](https://huggingface.co/cstr/lahgtna-chatterbox-v1-GGUF) — Arabic T3 variant (MIT, shares S3Gen)
	- [`cstr/orpheus-3b-base-GGUF`](https://huggingface.co/cstr/orpheus-3b-base-GGUF) — Llama-3.2 + SNAC TTS
	- [`cstr/qwen3-tts-0.6b-customvoice-GGUF`](https://huggingface.co/cstr/qwen3-tts-0.6b-customvoice-GGUF) — Qwen3-TTS with fixed speakers

	## License

	MIT — same as the upstream [ResembleAI/chatterbox](https://huggingface.co/ResembleAI/chatterbox).