card: drop the 9B/0.5B variant table (lives in the GitHub repo)

8c1c080 verified about 19 hours ago

4.45 kB

	---
	license: apache-2.0
	language:
	- en
	tags:
	- speech
	- text-to-speech
	- speech-to-speech
	- glm-4-voice
	- knowledge-distillation
	- vibes
	pipeline_tag: text-to-speech
	base_model: THUDM/glm-4-voice-9b
	---

	# ViBES-Audio — a distilled 0.5B GLM-4-Voice

	ViBES-Audio is a lightweight (~0.5B) speech-language model **distilled from
	[GLM-4-Voice-9B](https://huggingface.co/THUDM/glm-4-voice-9b)*. It keeps the teacher's exact*
	speech-in / speech-out pipeline and interleaved text+audio streaming — only the language model is
	shrunk and distilled. It is drop-in compatible with the official GLM-4-Voice serving code: the
	frozen speech tokenizer (Whisper-VQ, 12.5 Hz) and decoder (CosyVoice flow-matching + HiFi-GAN)
	are reused unchanged.

	It is the speech/text backbone of [ViBES](https://github.com/Juzezhang/ViBES) (our
	speech-language-behavior model) — a lightweight, low-latency alternative to the GLM-4-Voice-9B base.
	The motion experts are released separately: [`ViBES-Face`](https://huggingface.co/JuzeZhang/ViBES-Face).

	## Model

	- Architecture: same `ChatGLMForConditionalGeneration` family as the teacher, scaled down —
	hidden 1024 · 24 layers · FFN 3456 · 8 attention heads / 2 KV groups (head_dim 128) · RMSNorm ·
	RoPE · SwiGLU · GQA · vocab 168960 (identical to the teacher). Tied embeddings → **~0.49B
	unique trainable params** (stored untied on disk, ~0.66B, so the official server loads it unchanged).
	- Frozen & reused unchanged: the GLM-4-Voice speech tokenizer and decoder.
	- Distillation: white-box top-K (K=64) logit knowledge distillation — student and teacher
	share the exact 168960-token vocabulary, so logits are directly comparable. The student's
	embedding is SVD-warm-started from the teacher's 4096-dim embedding (preserves token geometry),
	then trained on teacher-generated interleaved data and refined on-policy.
	- Modes: speech→speech (S2S), speech→text (S2T), text→speech (T2S), text→text (T2T) — all via the
	same interleaved generation, selected at decode time.

	## Usage

	`ViBES-Audio` only replaces the LLM; pair it with the official GLM-4-Voice tokenizer + decoder.

	```python
	import torch
	from transformers import AutoModel, AutoTokenizer

	M = "JuzeZhang/ViBES-Audio"
	tok = AutoTokenizer.from_pretrained(M, trust_remote_code=True)
	model = AutoModel.from_pretrained(M, trust_remote_code=True,
	torch_dtype=torch.bfloat16).to("cuda").eval()

	SYS = ("User will provide you with a text instruction. Do it step by step. First, think about the "
	"instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens.")
	prompt = f"<\|system\|>\n{SYS}<\|user\|>\nTell me a joke.<\|assistant\|>streaming_transcription\n"

	enc = tok([prompt], return_tensors="pt").to("cuda")
	out = model.generate(**enc, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.8)
	gen = out[0, enc["input_ids"].shape[1]:].tolist()

	# split the interleaved stream into text vs. speech tokens
	audio_offset = tok.convert_tokens_to_ids("<\|audio_0\|>") # 152353
	text_ids = [t for t in gen if t < audio_offset and t not in (151329, 151336, 151338)]
	audio_ids = [t - audio_offset for t in gen if t >= audio_offset]
	print(tok.decode(text_ids)) # the spoken transcript
	# feed `audio_ids` to the GLM-4-Voice decoder (CosyVoice flow + HiFi-GAN) to synthesize the wav
	```

	For full speech-in / speech-out and a streaming web UI, use it with the official
	[GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice) repo (`web_demo.py` / `model_server.py`) — point
	`--model-path` at this checkpoint and keep the official `--tokenizer-path` / `--flow-path`.

	## Capabilities & limitations (honest)

	- Strong: the teacher's voice/prosody and conversational style; common facts and everyday
	conversational queries; low latency (~15× smaller than the 9B → fast time-to-first-token).
	- Weak: multi-step reasoning (e.g. arithmetic) and novel / abstract / long-tail queries — it stays
	fluent but can ramble or hallucinate. This is the expected ceiling of a 0.5B model distilled from a
	9B teacher.

	## License & attribution

	Distilled from GLM-4-Voice-9B (© THUDM); please also observe the
	[GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice) license/terms. Architecture and the frozen
	tokenizer/decoder are from GLM-4-Voice. If you use this model, please cite ViBES and GLM-4-Voice.