metadata
license: apache-2.0
language:
  - en
tags:
  - speech
  - text-to-speech
  - speech-to-speech
  - glm-4-voice
  - knowledge-distillation
  - vibes
pipeline_tag: text-to-speech
base_model: THUDM/glm-4-voice-9b

ViBES-Audio — a distilled 0.5B GLM-4-Voice

ViBES-Audio is a lightweight (~0.5B) speech-language model distilled from GLM-4-Voice-9B. It keeps the teacher's exact speech-in / speech-out pipeline and interleaved text+audio streaming — only the language model is shrunk and distilled. It is drop-in compatible with the official GLM-4-Voice serving code: the frozen speech tokenizer (Whisper-VQ, 12.5 Hz) and decoder (CosyVoice flow-matching + HiFi-GAN) are reused unchanged.

It is the speech/text backbone of ViBES (our speech-language-behavior model) — a lightweight, low-latency alternative to the GLM-4-Voice-9B base. The motion experts are released separately: ViBES-Face.

Model

  • Architecture: same ChatGLMForConditionalGeneration family as the teacher, scaled down — hidden 1024 · 24 layers · FFN 3456 · 8 attention heads / 2 KV groups (head_dim 128) · RMSNorm · RoPE · SwiGLU · GQA · vocab 168960 (identical to the teacher). Tied embeddings → ~0.49B unique trainable params (stored untied on disk, ~0.66B, so the official server loads it unchanged).
  • Frozen & reused unchanged: the GLM-4-Voice speech tokenizer and decoder.
  • Distillation: white-box top-K (K=64) logit knowledge distillation — student and teacher share the exact 168960-token vocabulary, so logits are directly comparable. The student's embedding is SVD-warm-started from the teacher's 4096-dim embedding (preserves token geometry), then trained on teacher-generated interleaved data and refined on-policy.
  • Modes: speech→speech (S2S), speech→text (S2T), text→speech (T2S), text→text (T2T) — all via the same interleaved generation, selected at decode time.
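
The parameter figures in the Architecture bullet can be checked with a quick back-of-the-envelope count. This is a sketch based only on the dimensions listed above; it assumes bias-free linear layers, which may not match the checkpoint exactly, and the variable names are illustrative.

```python
# Rough parameter count from the dimensions listed in the Architecture bullet.
# Assumption: linear layers carry no bias terms (biases would add well under 0.1M).
hidden, layers, ffn = 1024, 24, 3456
heads, kv_groups, head_dim = 8, 2, 128
vocab = 168960

embed = vocab * hidden                       # one embedding matrix
q_proj = hidden * heads * head_dim           # query projection
kv_proj = 2 * hidden * kv_groups * head_dim  # key + value projections (GQA: 2 KV groups)
o_proj = heads * head_dim * hidden           # attention output projection
mlp = 3 * hidden * ffn                       # SwiGLU: gate, up and down projections
norms = 2 * hidden                           # two RMSNorm weight vectors per layer

per_layer = q_proj + kv_proj + o_proj + mlp + norms
backbone = layers * per_layer + hidden       # plus the final RMSNorm

tied = backbone + embed          # input and output embeddings share one matrix
untied = backbone + 2 * embed    # separate input and output matrices, as stored on disk

print(f"tied:   {tied / 1e9:.2f}B")
print(f"untied: {untied / 1e9:.2f}B")
```

Under these assumptions the embedding matrix alone is about 0.17B parameters, which is why untying it moves the on-disk size from roughly 0.49B to roughly 0.66B.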

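To make the top-K logit distillation idea concrete, here is a dependency-free sketch of one common formulation: the KL divergence between teacher and student, both renormalised over the teacher's K highest-scoring tokens at a single position. The card only states "top-K (K=64) logit knowledge distillation", so the renormalisation, the `temperature` parameter, and the function name `topk_kd_loss` are assumptions made for illustration, not the training code used for this model.

```python
import math

def topk_kd_loss(student_logits, teacher_logits, k=64, temperature=1.0):
    """KL(teacher || student) restricted to the teacher's top-k tokens (one position)."""
    # Token ids of the k largest teacher logits.
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]

    def log_softmax(values):
        scaled = [v / temperature for v in values]
        m = max(scaled)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in scaled))
        return [v - log_z for v in scaled]

    # Both distributions are renormalised over the same k token ids, which is
    # only meaningful because student and teacher share one vocabulary.
    t_logp = log_softmax([teacher_logits[i] for i in top])
    s_logp = log_softmax([student_logits[i] for i in top])
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))

# Toy example with a 6-token vocabulary and k=3.
teacher = [2.0, 1.0, 0.5, -1.0, -3.0, 0.0]
student = [0.0, 1.0, 2.0, 0.0, 0.0, 0.0]
loss_same = topk_kd_loss(teacher, teacher, k=3)
loss_diff = topk_kd_loss(student, teacher, k=3)
```

In a real training loop this would be computed with batched tensor operations over every sequence position; the list-based version here is only meant to show which terms enter the loss.
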
Usage

ViBES-Audio only replaces the LLM; pair it with the official GLM-4-Voice tokenizer + decoder.

import torch
from transformers import AutoModel, AutoTokenizer

M = "JuzeZhang/ViBES-Audio"
tok = AutoTokenizer.from_pretrained(M, trust_remote_code=True)
model = AutoModel.from_pretrained(M, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16).to("cuda").eval()

SYS = ("User will provide you with a text instruction. Do it step by step. First, think about the "
       "instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens.")
prompt = f"<|system|>\n{SYS}<|user|>\nTell me a joke.<|assistant|>streaming_transcription\n"

enc = tok([prompt], return_tensors="pt").to("cuda")
out = model.generate(**enc, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.8)
gen = out[0, enc["input_ids"].shape[1]:].tolist()

# split the interleaved stream into text vs. speech tokens
audio_offset = tok.convert_tokens_to_ids("<|audio_0|>")   # 152353
stop_ids = (151329, 151336, 151338)   # GLM-4 stop tokens: <|endoftext|>, <|user|>, <|observation|>
text_ids = [t for t in gen if t < audio_offset and t not in stop_ids]
audio_ids = [t - audio_offset for t in gen if t >= audio_offset]
print(tok.decode(text_ids))         # the spoken transcript
# feed `audio_ids` to the GLM-4-Voice decoder (CosyVoice flow + HiFi-GAN) to synthesize the wav
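
When the output is consumed chunk by chunk rather than all at once, it can help to keep the interleaving intact instead of flattening it into two lists. The sketch below groups a generated id stream into consecutive text and audio runs. The helper name `split_runs` and the toy ids are hypothetical; in practice `audio_offset` and the stop ids would come from the tokenizer as in the snippet above.

```python
from itertools import groupby

def split_runs(token_ids, audio_offset, stop_ids=()):
    """Group a generated id stream into consecutive ("text" | "audio", ids) runs."""
    kept = [t for t in token_ids if t not in stop_ids]
    runs = []
    # groupby yields one group per maximal stretch of ids on the same side of the offset.
    for is_audio, group in groupby(kept, key=lambda t: t >= audio_offset):
        ids = list(group)
        if is_audio:
            # Shift audio ids back to codebook indices for the speech decoder.
            runs.append(("audio", [t - audio_offset for t in ids]))
        else:
            runs.append(("text", ids))
    return runs

# Toy stream: audio ids start at 100, id 99 plays the role of a stop token.
demo = [5, 6, 7, 100, 101, 8, 9, 105, 99]
runs = split_runs(demo, audio_offset=100, stop_ids=(99,))
for kind, ids in runs:
    print(kind, ids)
```

With the system prompt above, the runs would be expected to alternate between roughly 13 text tokens and 26 audio tokens. If the generated audio tokens run at the tokenizer's 12.5 Hz (an assumption, since the card states that rate only for the tokenizer), one 26-token run corresponds to about 2 seconds of speech.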

For full speech-in / speech-out and a streaming web UI, use it with the official GLM-4-Voice repo (web_demo.py / model_server.py) — point --model-path at this checkpoint and keep the official --tokenizer-path / --flow-path.

Capabilities & limitations (honest)

  • Strong: the teacher's voice/prosody and conversational style; common facts and everyday conversational queries; low latency (~15× smaller than the 9B → fast time-to-first-token).
  • Weak: multi-step reasoning (e.g. arithmetic) and novel / abstract / long-tail queries — it stays fluent but can ramble or hallucinate. This is the expected ceiling of a 0.5B model distilled from a 9B teacher.

License & attribution

Distilled from GLM-4-Voice-9B (© THUDM); please also observe the GLM-4-Voice license/terms. Architecture and the frozen tokenizer/decoder are from GLM-4-Voice. If you use this model, please cite ViBES and GLM-4-Voice.