MOSS-TTS-Realtime GGUF
End-to-end GGUF conversion of OpenMOSS-Team/MOSS-TTS-Realtime, runnable as a backbone + codec_lm + codec stack via stock llama.cpp and codec.cpp.
The model is a Qwen3-2B language backbone + 4-layer MossTTSRealtimeLocalTransformer depth decoder + 17-channel emission (cb-0 = a text token sampled from the Qwen3 backbone's lm_head; cb-1..16 = 16 RVQ audio codebooks of 1027 entries each). codec.cpp's residual_depth_ar codec_lm runtime handles the depth decoder + audio embed tables + per-AR-step state machine; the backbone runs in llama.cpp as a stock qwen3 arch.
This replaces the earlier codec-only release: the LLM part is now wired through codec.cpp's codec_lm infrastructure (see tts_remaining_plan.md in codec.cpp).
Files
Backbone (Qwen3 language_model, stock qwen3 arch: 28 layers, hidden 2048, vocab 151936)
moss-tts-realtime-<quant>.gguf: produced by codec.cpp's convert-backbone-to-gguf.py prep_moss_tts_realtime, which unwraps language_model.* into the standard Qwen3 layout. tie_word_embeddings=true is inferred from the absence of a standalone lm_head tensor.
| File | Size |
|---|---|
| moss-tts-realtime-f32.gguf | 4.5 GB |
| moss-tts-realtime-f16.gguf | 3.3 GB |
| moss-tts-realtime-bf16.gguf | 3.3 GB |
| moss-tts-realtime-q8_0.gguf | 1.8 GB |
| moss-tts-realtime-q6_k.gguf | 1.4 GB |
| moss-tts-realtime-q5_1.gguf | 1.3 GB |
| moss-tts-realtime-q5_k_m.gguf | 1.2 GB |
| moss-tts-realtime-q5_k_s.gguf | 1.2 GB |
| moss-tts-realtime-q5_0.gguf | 1.2 GB |
| moss-tts-realtime-q4_1.gguf | 1.1 GB |
| moss-tts-realtime-q4_k_m.gguf | 1.1 GB |
| moss-tts-realtime-q4_k_s.gguf | 1009 MB |
| moss-tts-realtime-q4_0.gguf | 1004 MB |
| moss-tts-realtime-q3_k_l.gguf | 955 MB |
| moss-tts-realtime-q3_k_m.gguf | 894 MB |
| moss-tts-realtime-q3_k_s.gguf | 825 MB |
| moss-tts-realtime-q2_k.gguf | 740 MB |
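The prefix unwrapping and tied-embedding inference described for the backbone can be sketched in a few lines of Python. This only illustrates the idea and is not the actual prep_moss_tts_realtime code: the function names are hypothetical, the tensor names in the toy checkpoint are made up, and `lm_head.weight` is an assumed name based on common Hugging Face conventions.

```python
PREFIX = "language_model."  # wrapper prefix named in the description above


def unwrap_backbone(tensors):
    """Keep only language_model.* tensors and strip the wrapper prefix."""
    return {
        name[len(PREFIX):]: value
        for name, value in tensors.items()
        if name.startswith(PREFIX)
    }


def infer_tied_embeddings(tensors, lm_head_name="lm_head.weight"):
    """Treat embeddings as tied when no standalone lm_head tensor exists.

    `lm_head_name` is an assumption, not confirmed by the converter.
    """
    return lm_head_name not in tensors


# Toy checkpoint: names are illustrative, values stand in for real tensors.
checkpoint = {
    "language_model.embed_tokens.weight": "E",
    "language_model.layers.0.self_attn.q_proj.weight": "Q",
    "local_transformer.layers.0.weight": "D",  # hypothetical non-backbone tensor
}
backbone = unwrap_backbone(checkpoint)
tied = infer_tied_embeddings(backbone)
```

In this toy case no standalone head tensor survives the unwrap, so `tied` comes out true, mirroring the tie_word_embeddings=true behaviour stated for the real checkpoint.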
Codec + codec_lm (MOSS-Audio-Tokenizer 16 RVQ × 1027 codebooks + residual_depth_ar adaptor)
codec[-<quant>].gguf: full MOSS-Audio-Tokenizer (1.6B, 24 kHz mono) bundled with the codec_lm adaptor: audio embed tables for the 16 RVQ heads, the 17th text_embd (~308 MB at F16), 4-layer depth decoder, and 16 codebooks_head slices. Produced by codec.cpp's convert-to-gguf.py --model-type moss_audio --lm-source OpenMOSS-Team/MOSS-TTS-Realtime.
| File | Size |
|---|---|
| codec-f32.gguf | 6.1 GB |
| codec-f16.gguf | 3.9 GB |
| codec-q8_0.gguf | 2.4 GB |
| codec-q5_k_m.gguf | 1.8 GB |
| codec-q4_k_m.gguf | 1.6 GB |
Inference shape (text-modality codec_lm AR)
backbone (Qwen3, embeddings=true) hidden state h
→ caller samples text token via llama_get_logits_ith → text_tok
→ codec_lm_state_set_text_context(state, text_tok)
→ codec_lm_step_begin(state, h)
→ for cb in 0..16: codec_lm_step_logits → sample → codec_lm_step_push_code
→ codec_lm_step_finish → codes[17]
→ codec_lm_compose_audio_embd(codes) → next-step embedding
→ feed to backbone via b.embd; loop until codes[0] == EOS_text
After generation, slice cb-0 out of the (T × 17) code matrix (it's a text token, not a codec value) and feed cb-1..16 into codec_decode to get 24 kHz mono PCM.
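The post-generation slicing step can be sketched as follows, using plain Python lists for the (T × 17) code matrix. The function name `split_codes` is hypothetical, and the range check is an added sanity check that only enforces the 1027-entry codebook size stated earlier; it does not know whether some of those entries are reserved. The resulting audio codes would then go to codec_decode, whose exact signature is not shown here.

```python
N_CHANNELS = 17        # cb-0 text token + 16 RVQ audio codebooks
CODEBOOK_SIZE = 1027   # entries per audio codebook


def split_codes(codes):
    """Split a (T x 17) code matrix into text tokens and (T x 16) audio codes."""
    text_tokens, audio_codes = [], []
    for t, row in enumerate(codes):
        if len(row) != N_CHANNELS:
            raise ValueError(f"step {t}: expected {N_CHANNELS} codes, got {len(row)}")
        audio = list(row[1:])  # cb-1..16 only; cb-0 is not a codec value
        if any(not 0 <= c < CODEBOOK_SIZE for c in audio):
            raise ValueError(f"step {t}: audio code out of range")
        text_tokens.append(row[0])
        audio_codes.append(audio)
    return text_tokens, audio_codes


# Two toy AR steps; the values are made up.
codes = [
    [1234] + list(range(16)),
    [5678] + [1026] * 16,
]
text_tokens, audio_codes = split_codes(codes)
```

Keeping the text tokens alongside the audio codes, rather than discarding them, makes it easy to inspect what the backbone emitted on cb-0 at each step when debugging.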
Sources
- Upstream model: OpenMOSS-Team/MOSS-TTS-Realtime
- Audio codec: OpenMOSS-Team/MOSS-Audio-Tokenizer
- Conversion tooling: mybigday/codec.cpp
- Inference runtime: mybigday/llama.rn (codec_lm AR text-modality path in cpp/rn-tts.cpp)