MOSS-TTS-Realtime GGUF

End-to-end GGUF conversion of OpenMOSS-Team/MOSS-TTS-Realtime, runnable as a backbone + codec_lm + codec stack via stock llama.cpp and codec.cpp.

The model is a Qwen3-2B language backbone + 4-layer MossTTSRealtimeLocalTransformer depth decoder + 17-channel emission (cb-0 = a text token sampled from the Qwen3 backbone's lm_head; cb-1..16 = 16 RVQ audio codebooks of 1027 entries each). codec.cpp's residual_depth_ar codec_lm runtime handles the depth decoder + audio embed tables + per-AR-step state machine; the backbone runs in llama.cpp as a stock qwen3 arch.
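
As a rough illustration of the 17-channel layout described above, the following sketch checks that one emitted frame has a text token in cb-0 and 16 in-range audio codes in cb-1..16. The helper `check_frame` and the constant names are hypothetical and not part of codec.cpp; the text-token bound assumes cb-0 ranges over the backbone vocabulary size listed under Files.

```python
# Constants taken from this card; the helper below is a hypothetical
# illustration and is not part of codec.cpp or llama.cpp.
NUM_CHANNELS = 17            # cb-0 (text) + cb-1..16 (audio)
AUDIO_CODEBOOK_SIZE = 1027   # entries per RVQ codebook
TEXT_VOCAB_SIZE = 151936     # assumed bound for cb-0 (backbone vocab)


def check_frame(frame):
    """Return True if `frame` looks like one valid 17-channel emission."""
    if len(frame) != NUM_CHANNELS:
        return False
    text_tok, audio_codes = frame[0], frame[1:]
    if not 0 <= text_tok < TEXT_VOCAB_SIZE:
        return False
    # Every audio channel must index into its 1027-entry codebook.
    return all(0 <= code < AUDIO_CODEBOOK_SIZE for code in audio_codes)
```

A check like this can be useful as a guard before passing codes to the decoder, since an out-of-range index usually points to a channel-ordering mistake.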

This replaces the earlier codec-only release: the LLM part is now wired through codec.cpp's codec_lm infrastructure (see tts_remaining_plan.md in codec.cpp).

Files

Backbone (Qwen3 language_model, stock qwen3 arch: 28 layers, hidden 2048, vocab 151936)

moss-tts-realtime-<quant>.gguf: produced by codec.cpp's convert-backbone-to-gguf.py prep_moss_tts_realtime, which unwraps language_model.* into the standard Qwen3 layout. tie_word_embeddings=true is inferred from the absence of a standalone lm_head tensor.
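
The unwrapping and the tied-embedding inference can be sketched as follows. This is not the actual prep_moss_tts_realtime code: the function names and the example tensor names are hypothetical, and the sketch assumes that an lm_head tensor, if present, would sit under the same language_model. prefix.

```python
PREFIX = "language_model."  # wrapper prefix named in this card


def unwrap_backbone(tensor_names):
    """Keep only backbone tensors and strip the wrapper prefix from them."""
    return [n[len(PREFIX):] for n in tensor_names if n.startswith(PREFIX)]


def infer_tie_word_embeddings(unwrapped_names):
    """Assume tied embeddings when no standalone lm_head tensor exists."""
    return not any(n.startswith("lm_head.") for n in unwrapped_names)


# Hypothetical checkpoint listing, for illustration only.
example_names = [
    "language_model.embed_tokens.weight",
    "language_model.layers.0.self_attn.q_proj.weight",
    "local_transformer.layers.0.mlp.up_proj.weight",
]
backbone_names = unwrap_backbone(example_names)
tied = infer_tie_word_embeddings(backbone_names)
```

In this example the depth-decoder tensor is dropped from the backbone listing, and `tied` is True because no lm_head tensor survives the unwrap.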

File                            Size
moss-tts-realtime-f32.gguf      4.5 GB
moss-tts-realtime-f16.gguf      3.3 GB
moss-tts-realtime-bf16.gguf     3.3 GB
moss-tts-realtime-q8_0.gguf     1.8 GB
moss-tts-realtime-q6_k.gguf     1.4 GB
moss-tts-realtime-q5_1.gguf     1.3 GB
moss-tts-realtime-q5_k_m.gguf   1.2 GB
moss-tts-realtime-q5_k_s.gguf   1.2 GB
moss-tts-realtime-q5_0.gguf     1.2 GB
moss-tts-realtime-q4_1.gguf     1.1 GB
moss-tts-realtime-q4_k_m.gguf   1.1 GB
moss-tts-realtime-q4_k_s.gguf   1009 MB
moss-tts-realtime-q4_0.gguf     1004 MB
moss-tts-realtime-q3_k_l.gguf   955 MB
moss-tts-realtime-q3_k_m.gguf   894 MB
moss-tts-realtime-q3_k_s.gguf   825 MB
moss-tts-realtime-q2_k.gguf     740 MB
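
For choosing among the backbone files, a small helper can pick the largest quantization that fits a given size budget. This is only a sketch: `pick_quant` and `backbone_filename` are hypothetical names, the sizes are copied from the listing with GB entries converted at 1 GB = 1000 MB, and file size is used as a crude proxy for quality. It does not account for the codec file, the KV cache, or other runtime memory.

```python
# Backbone file sizes from the listing, in MB (approximate; GB entries
# converted assuming 1 GB = 1000 MB).
BACKBONE_SIZES_MB = {
    "f32": 4500, "f16": 3300, "bf16": 3300, "q8_0": 1800, "q6_k": 1400,
    "q5_1": 1300, "q5_k_m": 1200, "q5_k_s": 1200, "q5_0": 1200,
    "q4_1": 1100, "q4_k_m": 1100, "q4_k_s": 1009, "q4_0": 1004,
    "q3_k_l": 955, "q3_k_m": 894, "q3_k_s": 825, "q2_k": 740,
}


def pick_quant(budget_mb):
    """Return the largest listed quant whose file fits in `budget_mb`, or None."""
    fitting = {q: s for q, s in BACKBONE_SIZES_MB.items() if s <= budget_mb}
    if not fitting:
        return None
    # On equal sizes, max() keeps the entry listed first.
    return max(fitting, key=fitting.get)


def backbone_filename(quant):
    """Build the file name following the moss-tts-realtime-<quant>.gguf pattern."""
    return f"moss-tts-realtime-{quant}.gguf"
```

For example, `pick_quant(2000)` selects the q8_0 file, while a budget below 740 MB returns None because even q2_k does not fit.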

Codec + codec_lm (MOSS-Audio-Tokenizer 16 RVQ × 1027 codebooks + residual_depth_ar adaptor)

codec[-<quant>].gguf: full MOSS-Audio-Tokenizer (1.6B, 24 kHz mono) bundled with the codec_lm adaptor: audio embed tables for the 16 RVQ heads, the 17th text_embd (~308 MB at F16), 4-layer depth decoder, and 16 codebooks_head slices. Produced by codec.cpp's convert-to-gguf.py --model-type moss_audio --lm-source OpenMOSS-Team/MOSS-TTS-Realtime.

File                 Size
codec-f32.gguf       6.1 GB
codec-f16.gguf       3.9 GB
codec-q8_0.gguf      2.4 GB
codec-q5_k_m.gguf    1.8 GB
codec-q4_k_m.gguf    1.6 GB

Inference shape (text-modality codec_lm AR)

backbone (Qwen3, embeddings=true) hidden state h
    → caller samples text token via llama_get_logits_ith → text_tok
    → codec_lm_state_set_text_context(state, text_tok)
    → codec_lm_step_begin(state, h)
    → for cb in 0..16: codec_lm_step_logits → sample → codec_lm_step_push_code
    → codec_lm_step_finish → codes[17]
    → codec_lm_compose_audio_embd(codes) → next-step embedding
    → feed to backbone via b.embd; loop until codes[0] == EOS_text
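
The call order above can be mimicked with a small Python mock. This is not a binding to codec.cpp or llama.cpp: `StubCodecLM`, `make_toy_backbone`, and `generate` are hypothetical stand-ins whose method names only echo the C functions, and the logits are random. The sketch assumes that the inner loop covers the 16 audio codebooks, that step_finish returns all 17 codes with the text token at index 0, and that the EOS frame itself is not kept.

```python
import random

NUM_AUDIO_CODEBOOKS = 16
AUDIO_CODEBOOK_SIZE = 1027


class StubCodecLM:
    """Stand-in that mimics the call order only; NOT the codec.cpp API."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.codes = []

    def set_text_context(self, text_tok):
        self.codes = [text_tok]  # cb-0 is the text token

    def step_begin(self, hidden):
        self.hidden = hidden

    def step_logits(self):
        # Random scores in place of the depth decoder's real logits.
        return [self.rng.random() for _ in range(AUDIO_CODEBOOK_SIZE)]

    def step_push_code(self, code):
        self.codes.append(code)

    def step_finish(self):
        return list(self.codes)

    def compose_audio_embd(self, codes):
        return [float(sum(codes))]  # placeholder for the next-step embedding


def argmax(xs):
    """Greedy sampling: index of the largest score."""
    return max(range(len(xs)), key=xs.__getitem__)


def make_toy_backbone(eos_text, eos_after):
    """Toy backbone: favors token 1 for `eos_after` steps, then EOS."""
    calls = {"n": 0}

    def backbone(embd):
        calls["n"] += 1
        logits = [0.0, 0.0, 0.0]
        logits[eos_text if calls["n"] > eos_after else 1] = 1.0
        return [0.0], logits  # (hidden state, text logits)

    return backbone


def generate(backbone, sample_text, lm, first_embd, eos_text, max_steps=64):
    """Run the per-step loop and collect 17-code frames until text EOS."""
    frames, embd = [], first_embd
    for _ in range(max_steps):
        hidden, text_logits = backbone(embd)
        text_tok = sample_text(text_logits)
        lm.set_text_context(text_tok)
        lm.step_begin(hidden)
        for _cb in range(NUM_AUDIO_CODEBOOKS):
            lm.step_push_code(argmax(lm.step_logits()))
        codes = lm.step_finish()
        if codes[0] == eos_text:
            break
        frames.append(codes)
        embd = lm.compose_audio_embd(codes)
    return frames


frames = generate(make_toy_backbone(2, 3), argmax, StubCodecLM(), [0.0], eos_text=2)
```

The `max_steps` cap is a design choice of the sketch: it bounds generation if the backbone never emits the text EOS token.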

After generation, slice cb-0 out of the (T × 17) code matrix (it's a text token, not a codec value) and feed cb-1..16 into codec_decode to get 24 kHz mono PCM.
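
A minimal sketch of that slicing step, using plain Python lists for the code matrix. The helper names are hypothetical, and the layout codec_decode expects is not stated here, so the codebook-major transpose is shown only as an option.

```python
def split_codes(frames):
    """Split a T x 17 code matrix into text tokens (cb-0) and audio codes (cb-1..16)."""
    text_tokens = [frame[0] for frame in frames]
    audio_codes = [frame[1:] for frame in frames]  # T x 16, time-major
    return text_tokens, audio_codes


def to_codebook_major(audio_codes):
    """Transpose T x 16 into 16 x T, in case the decoder wants one row per codebook."""
    return [list(column) for column in zip(*audio_codes)]


# Two toy frames: text tokens 5 and 6, audio codes 0..15 and 16..31.
toy_frames = [[5] + list(range(16)), [6] + list(range(16, 32))]
text_tokens, audio_codes = split_codes(toy_frames)
by_codebook = to_codebook_major(audio_codes)
```

Here `text_tokens` holds the two cb-0 values, and `audio_codes` keeps 16 values per frame, ready to be passed on in whichever layout the decoder requires.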

Sources
