CSM-1B GGUF

GGUF conversion of Sesame's CSM-1B, a zero-shot conversational TTS model: Llama-3.2-1B backbone + Mimi 32-codebook audio codec + a 4-layer residual-depth-AR decoder that emits 32 RVQ codebook codes per AR step.

The stack splits across two runtimes, each owning the part it's good at:

  • Backbone (LLM-part): Llama-3.2-1B with CSM's embed_text_tokens mapped onto the standard model.embed_tokens.weight slot, so plain llama.cpp tokenizes + embeds text natively. Runs in llama.cpp (embeddings=true mode: the hidden state is read, not logits).
  • Codec + codec_lm (Audio-part): Mimi codec (24 kHz mono, 12.5 Hz frame rate) bundled with the residual_depth_ar codec_lm adaptor (32 audio embed tables + c0 head + 4-layer depth decoder + 31 codebooks_head slices). Runs in codec.cpp.
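
The codec figures above fix the timing of the whole pipeline. As a rough worked example (the constant and function names here are illustrative and not part of codec.cpp or llama.rn), a 12.5 Hz frame rate at 24 kHz works out to 1920 samples and 32 codes per frame:

```typescript
// Figures taken from the model description above.
const SAMPLE_RATE_HZ = 24000 // Mimi output sample rate (mono)
const FRAME_RATE_HZ = 12.5   // Mimi frames per second
const CODEBOOKS = 32         // RVQ codes emitted per frame

// PCM samples produced by decoding one frame: 24000 / 12.5 = 1920.
const samplesPerFrame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

// Codes that must be generated per second of audio: 12.5 * 32 = 400.
const codesPerSecond = FRAME_RATE_HZ * CODEBOOKS

// Audio duration (seconds) covered by a number of frames.
function framesToSeconds(frames: number): number {
  return frames / FRAME_RATE_HZ
}

// Frames needed to cover a target duration, rounded up.
function secondsToFrames(seconds: number): number {
  return Math.ceil(seconds * FRAME_RATE_HZ)
}
```

For instance, framesToSeconds(500) is 40, so a 500-frame budget caps generation at about 40 seconds of audio.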

Inference shape: backbone hidden state → codec_lm_step_* state machine (1 c0 head + 31 depth-AR steps per frame → 32 codes) → codec_lm_compose_audio_embd → fed back into backbone as the next position's input embedding. Stop on codes[0] == 0 at step > 0 (training-time audio-EOS marker).
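
The loop described above can be sketched in a few lines. This is a minimal illustration of the control flow only: the CsmRuntime interface and its method names are hypothetical stand-ins for the llama.cpp and codec.cpp calls, the prompt is assumed to be already evaluated, and checking for EOS right after the c0 head (before the depth steps) is a simplifying assumption.

```typescript
// Hypothetical callbacks standing in for the two runtimes. The real entry
// points (codec_lm_step_*, codec_lm_compose_audio_embd) are C APIs.
interface CsmRuntime {
  // Backbone forward pass for one position: input embedding -> hidden state.
  backbone(inputEmbd: Float32Array): Float32Array
  // c0 head: hidden state -> code for codebook 0.
  sampleC0(hidden: Float32Array): number
  // One depth-AR step: codes so far -> code for the next codebook.
  sampleDepth(hidden: Float32Array, codesSoFar: number[]): number
  // Combine a frame's codes into the next backbone input embedding.
  composeAudioEmbd(codes: number[]): Float32Array
}

const N_CODEBOOKS = 32

// promptHidden: backbone hidden state after the text prompt was evaluated.
function generateFrames(
  rt: CsmRuntime,
  promptHidden: Float32Array,
  maxFrames: number,
): number[][] {
  const frames: number[][] = []
  let hidden = promptHidden
  for (let step = 0; step < maxFrames; step++) {
    const c0 = rt.sampleC0(hidden)
    // Audio-EOS: a codebook-0 code of 0 on any frame after the first.
    if (c0 === 0 && step > 0) break
    const codes: number[] = [c0]
    // 31 depth-AR steps fill codebooks 1..31.
    while (codes.length < N_CODEBOOKS) {
      codes.push(rt.sampleDepth(hidden, codes))
    }
    frames.push(codes)
    // Feed the composed audio embedding back in as the next position.
    hidden = rt.backbone(rt.composeAudioEmbd(codes))
  }
  return frames
}
```

Each returned frame holds 32 codes, which is the shape the Mimi decoder consumes.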

Files

Backbone (llama arch, vocab 128256, hidden 2048, 28 layers)

csm-1b-<quant>.gguf

File                  Size
csm-1b-f32.gguf       4.61 GB
csm-1b-f16.gguf       2.31 GB
csm-1b-bf16.gguf      2.31 GB
csm-1b-q8_0.gguf      1.23 GB
csm-1b-q6_k.gguf      974 MB
csm-1b-q5_1.gguf      909 MB
csm-1b-q5_k_m.gguf    869 MB
csm-1b-q5_k_s.gguf    851 MB
csm-1b-q5_0.gguf      851 MB
csm-1b-q4_1.gguf      793 MB
csm-1b-q4_k_m.gguf    770 MB
csm-1b-q4_k_s.gguf    739 MB
csm-1b-q4_0.gguf      735 MB
csm-1b-q3_k_l.gguf    698 MB
csm-1b-q3_k_m.gguf    659 MB
csm-1b-q3_k_s.gguf    612 MB
csm-1b-q2_k.gguf      554 MB

Codec + codec_lm (Mimi, 24 kHz mono, 32 RVQ codebooks × 2051; residual_depth_ar codec_lm)

codec[-<quant>].gguf

File                  Size
codec-f32.gguf        1.11 GB
codec-f16.gguf        871 MB
codec-q8_0.gguf       803 MB
codec-q5_k_m.gguf     776 MB
codec-q4_k_m.gguf     767 MB

Mimi is mostly small conv kernels whose row sizes don't meet the K-quant block-size requirements, so Q4_K_M and Q5_K_M save little over Q8_0 (767 MB and 776 MB versus 803 MB). When trimming disk + RAM, shrink the backbone quant and pair it with codec-q8_0.gguf.
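
To compare pairings, the combined download size can be added up from the two file tables. The helper below is a small sketch: pairFiles and the two lookup tables are hypothetical names, only a subset of quants is listed, and 1.23 GB is taken as 1230 MB on the assumption that the sizes use decimal units.

```typescript
// Sizes in MB, copied from the two file tables above (subset of quants).
const BACKBONE_MB: Record<string, number> = {
  q8_0: 1230, // listed as 1.23 GB; assumes 1 GB = 1000 MB
  q6_k: 974,
  q5_k_m: 869,
  q4_k_m: 770,
  q3_k_m: 659,
  q2_k: 554,
}
const CODEC_MB: Record<string, number> = {
  f16: 871,
  q8_0: 803,
  q5_k_m: 776,
  q4_k_m: 767,
}

// File names follow the patterns csm-1b-<quant>.gguf and codec-<quant>.gguf.
function pairFiles(backboneQuant: string, codecQuant: string) {
  const b = BACKBONE_MB[backboneQuant]
  const c = CODEC_MB[codecQuant]
  if (b === undefined || c === undefined) {
    throw new Error(`unknown quant: ${backboneQuant} / ${codecQuant}`)
  }
  return {
    model: `csm-1b-${backboneQuant}.gguf`,
    vocoder: `codec-${codecQuant}.gguf`,
    totalMB: b + c,
  }
}
```

pairFiles('q4_k_m', 'q8_0') totals 1573 MB, against 1537 MB with codec-q4_k_m.gguf, so dropping the codec from Q8_0 to Q4_K_M saves 36 MB.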

Usage with llama.rn

llama.rn's TTS layer auto-detects this model via the codec.gguf's codec.lm.* metadata and routes through the codec_lm AR path:

import { initLlama } from 'llama.rn'

const ctx = await initLlama({
  model: 'csm-1b-q4_k_m.gguf',
  vocoder: { path: 'codec-q4_k_m.gguf' },
  n_ctx: 4096,
})

const fmt = await ctx.getFormattedAudioCompletion({
  prompt: 'Hello, world!',
  // CSM is zero-shot: `speaker: { id: 0 }` or `{ id: 1 }` picks one of the
  // two trained speakers. Omit to default to speaker 0.
})

// fmt.flow === 'codec_lm_ar' for CSM.
const { codes } = await ctx.generateAudioCodes({
  prompt: fmt.prompt,
  maxFrames: 500,
  temperature: 0.9,
  topP: 0.95,
  topK: 50,
})

const pcm = await ctx.decodeAudioTokens(codes)
// pcm is Float32-PCM at 24 kHz; feed it into your audio player of choice.
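
Many players expect a file rather than raw floats. As one possible way to hand the result to a player, the sketch below wraps mono float PCM in a 16-bit WAV container. encodeWav is a hypothetical helper, not part of llama.rn, and it assumes pcm is an array-like of samples in [-1, 1].

```typescript
// Encode mono float PCM (samples in [-1, 1]) as a 16-bit PCM WAV file.
function encodeWav(pcm: ArrayLike<number>, sampleRate = 24000): Uint8Array {
  const bytesPerSample = 2
  const dataSize = pcm.length * bytesPerSample
  const buf = new ArrayBuffer(44 + dataSize)
  const view = new DataView(buf)
  const ascii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i))
  }

  ascii(0, 'RIFF')
  view.setUint32(4, 36 + dataSize, true)                // RIFF chunk size
  ascii(8, 'WAVE')
  ascii(12, 'fmt ')
  view.setUint32(16, 16, true)                          // fmt chunk size
  view.setUint16(20, 1, true)                           // format 1 = integer PCM
  view.setUint16(22, 1, true)                           // channels: mono
  view.setUint32(24, sampleRate, true)                  // sample rate
  view.setUint32(28, sampleRate * bytesPerSample, true) // byte rate
  view.setUint16(32, bytesPerSample, true)              // block align
  view.setUint16(34, 16, true)                          // bits per sample
  ascii(36, 'data')
  view.setUint32(40, dataSize, true)                    // data chunk size

  for (let i = 0; i < pcm.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, pcm[i] ?? 0))
    view.setInt16(44 + i * 2, Math.round(s < 0 ? s * 32768 : s * 32767), true)
  }
  return new Uint8Array(buf)
}
```

encodeWav(pcm) returns the bytes of a .wav file; how they are written to disk or played depends on the host app and is left out here.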

For users running CLI / parity tests against the HF reference, see examples/tts.py --model csm in codec.cpp.

Notes

  • Zero-shot: no speaker config / reference audio is needed. The model was trained on two speakers (IDs 0 and 1); the prompt format is <|begin_of_text|>[<speaker>]<text><|end_of_text|>.
  • Voice control: speaker timbre comes from the speaker tag ([0] vs [1]); finer control isn't exposed by CSM.
  • License: CSM is released under Apache-2.0 by Sesame AI. See the upstream model card for full terms.
  • Tokenizer hash patch: CSM's bundled Llama-3 tokenizer hits an unrecognised BPE pre-tokenizer hash in older convert_hf_to_gguf.py versions; codec.cpp's convert-backbone-to-gguf.py prep_csm injects a runtime patch mapping unknown hashes to llama-bpe. This is safe because the regex family is identical, and the tokenizer isn't used at codec_lm-driven inference anyway.
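
The prompt format from the first note can be written out as a one-line template. buildCsmPrompt is a hypothetical helper for illustration only; llama.rn's getFormattedAudioCompletion is what produces the prompt in the usage example.

```typescript
// The model was trained on two speakers, so only tags [0] and [1] exist.
type SpeakerId = 0 | 1

// Format: <|begin_of_text|>[<speaker>]<text><|end_of_text|>
function buildCsmPrompt(text: string, speaker: SpeakerId = 0): string {
  return `<|begin_of_text|>[${speaker}]${text}<|end_of_text|>`
}
```

buildCsmPrompt('Hello, world!', 1) yields <|begin_of_text|>[1]Hello, world!<|end_of_text|>.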

Sources

  • Original model: sesame/csm-1b
  • Conversion tooling: mybigday/codec.cpp (scripts: convert-backbone-to-gguf.py prep_csm + convert-to-gguf.py with the auto-dispatched CsmConverter)
  • Inference runtime: mybigday/llama.rn (codec_lm AR path lands in cpp/rn-tts.cpp::generateAudioCodes)