ViBES-Audio — a distilled 0.5B GLM-4-Voice
ViBES-Audio is a lightweight (~0.5B) speech-language model distilled from GLM-4-Voice-9B. It keeps the teacher's exact speech-in / speech-out pipeline and interleaved text+audio streaming — only the language model is shrunk and distilled. It is drop-in compatible with the official GLM-4-Voice serving code: the frozen speech tokenizer (Whisper-VQ, 12.5 Hz) and decoder (CosyVoice flow-matching + HiFi-GAN) are reused unchanged.
It is the speech/text backbone of ViBES (our speech-language-behavior model), a lightweight, low-latency alternative to the GLM-4-Voice-9B base.
The motion experts are released separately: ViBES-Face.
Model
- Architecture: same ChatGLMForConditionalGeneration family as the teacher, scaled down: hidden 1024 · 24 layers · FFN 3456 · 8 attention heads / 2 KV groups (head_dim 128) · RMSNorm · RoPE · SwiGLU · GQA · vocab 168960 (identical to the teacher). Tied embeddings → ~0.49B unique trainable params (stored untied on disk, ~0.66B, so the official server loads it unchanged).
- Frozen & reused unchanged: the GLM-4-Voice speech tokenizer and decoder.
- Distillation: white-box top-K (K=64) logit knowledge distillation — student and teacher share the exact 168960-token vocabulary, so logits are directly comparable. The student's embedding is SVD-warm-started from the teacher's 4096-dim embedding (preserves token geometry), then trained on teacher-generated interleaved data and refined on-policy.
- Modes: speech→speech (S2S), speech→text (S2T), text→speech (T2S), text→text (T2T) — all via the same interleaved generation, selected at decode time.
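The parameter figures in the architecture bullet can be roughly reproduced from the listed dimensions. The sketch below is a back-of-the-envelope count, not the model's own accounting: it assumes fused QKV and fused gate/up projections and ignores biases and norm weights (both negligible at this scale). The function name `count_params` is illustrative only.

```python
def count_params(hidden=1024, layers=24, ffn=3456, heads=8,
                 kv_groups=2, head_dim=128, vocab=168960):
    """Approximate weight count from the dimensions quoted in this card.

    Assumption: biases and RMSNorm weights are ignored.
    """
    embed = vocab * hidden                                  # one embedding matrix
    qkv = hidden * (heads * head_dim + 2 * kv_groups * head_dim)  # Q plus grouped K and V
    attn_out = (heads * head_dim) * hidden                  # attention output projection
    mlp = hidden * (2 * ffn) + ffn * hidden                 # SwiGLU: gate + up, then down
    blocks = layers * (qkv + attn_out + mlp)
    tied = blocks + embed          # input embedding shared with the output head
    untied = blocks + 2 * embed    # separate output head, as stored on disk
    return tied, untied


tied, untied = count_params()
print(f"tied: {tied / 1e9:.2f}B, untied: {untied / 1e9:.2f}B")
```

Under these assumptions the count comes to about 0.49B with tied embeddings and about 0.66B untied, matching the figures above. The vocabulary embedding alone accounts for roughly 0.17B parameters.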
Usage
ViBES-Audio only replaces the LLM; pair it with the official GLM-4-Voice tokenizer + decoder.
import torch
from transformers import AutoModel, AutoTokenizer
M = "JuzeZhang/ViBES-Audio"
tok = AutoTokenizer.from_pretrained(M, trust_remote_code=True)
model = AutoModel.from_pretrained(M, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16).to("cuda").eval()
SYS = ("User will provide you with a text instruction. Do it step by step. First, think about the "
       "instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens.")
prompt = f"<|system|>\n{SYS}<|user|>\nTell me a joke.<|assistant|>streaming_transcription\n"
enc = tok([prompt], return_tensors="pt").to("cuda")
out = model.generate(**enc, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.8)
gen = out[0, enc["input_ids"].shape[1]:].tolist()
# split the interleaved stream into text vs. speech tokens
audio_offset = tok.convert_tokens_to_ids("<|audio_0|>") # 152353
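# text tokens sit below the audio offset; the three hard-coded IDs are special
# (end-of-sequence / role) tokens that should not appear in the transcript
text_ids = [t for t in gen if t < audio_offset and t not in (151329, 151336, 151338)]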
audio_ids = [t - audio_offset for t in gen if t >= audio_offset]
print(tok.decode(text_ids)) # the spoken transcript
# feed `audio_ids` to the GLM-4-Voice decoder (CosyVoice flow + HiFi-GAN) to synthesize the wav
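As a quick sanity check before synthesis, the same split can be run on a synthetic stream and the speech length estimated from the 12.5 Hz token rate quoted above. This is a minimal sketch: the helper names `split_interleaved` and `audio_seconds` are hypothetical, the token IDs in the example stream are made up, and no model or decoder is involved.

```python
TOKEN_RATE_HZ = 12.5  # speech-tokenizer rate quoted in this card


def split_interleaved(gen, audio_offset, special_ids=()):
    """Separate an interleaved token stream into text IDs and 0-based speech codes."""
    text_ids = [t for t in gen if t < audio_offset and t not in special_ids]
    audio_ids = [t - audio_offset for t in gen if t >= audio_offset]
    return text_ids, audio_ids


def audio_seconds(n_audio_tokens, rate_hz=TOKEN_RATE_HZ):
    """Approximate duration of the synthesized speech for a given token count."""
    return n_audio_tokens / rate_hz


# Synthetic stream (made-up IDs): 13 text tokens, 26 speech tokens, one special token.
offset = 152353
stream = list(range(100, 113)) + [offset + i for i in range(26)] + [151336]
text_ids, audio_ids = split_interleaved(
    stream, offset, special_ids=(151329, 151336, 151338)
)
print(len(text_ids), len(audio_ids), audio_seconds(len(audio_ids)))
```

At 12.5 tokens per second, one 26-token audio chunk covers roughly 2.08 s of speech, so a 512-token generation budget bounds the reply to a few tens of seconds at most.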
For full speech-in / speech-out and a streaming web UI, use it with the official GLM-4-Voice repo (web_demo.py / model_server.py): point --model-path at this checkpoint and keep the official --tokenizer-path / --flow-path.
Capabilities & limitations (honest)
- Strong: the teacher's voice/prosody and conversational style; common facts and everyday conversational queries; low latency (~15× smaller than the 9B teacher → fast time-to-first-token).
- Weak: multi-step reasoning (e.g. arithmetic) and novel / abstract / long-tail queries — it stays fluent but can ramble or hallucinate. This is the expected ceiling of a 0.5B model distilled from a 9B teacher.
License & attribution
Distilled from GLM-4-Voice-9B (© THUDM); please also observe the GLM-4-Voice license/terms. Architecture and the frozen tokenizer/decoder are from GLM-4-Voice. If you use this model, please cite ViBES and GLM-4-Voice.
Base model: zai-org/glm-4-voice-9b