---
license: apache-2.0
language:
- en
tags:
- speech
- text-to-speech
- speech-to-speech
- glm-4-voice
- knowledge-distillation
- vibes
pipeline_tag: text-to-speech
base_model: THUDM/glm-4-voice-9b
---
# ViBES-Audio — a distilled 0.5B GLM-4-Voice
**ViBES-Audio** is a lightweight (~0.5B) speech-language model **distilled from
[GLM-4-Voice-9B](https://huggingface.co/THUDM/glm-4-voice-9b)**. It keeps the teacher's *exact*
speech-in / speech-out pipeline and interleaved text+audio streaming — only the language model is
shrunk and distilled. It is **drop-in compatible** with the official GLM-4-Voice serving code: the
frozen speech **tokenizer** (Whisper-VQ, 12.5 Hz) and **decoder** (CosyVoice flow-matching + HiFi-GAN)
are reused unchanged.
It is the **speech/text backbone** of [ViBES](https://github.com/Juzezhang/ViBES) (our
speech-language-behavior model) — a lightweight, low-latency alternative to the GLM-4-Voice-9B base.
The motion experts are released separately: [`ViBES-Face`](https://huggingface.co/JuzeZhang/ViBES-Face).
## Model
- **Architecture:** same `ChatGLMForConditionalGeneration` family as the teacher, scaled down —
hidden 1024 · 24 layers · FFN 3456 · 8 attention heads / 2 KV groups (head_dim 128) · RMSNorm ·
RoPE · SwiGLU · GQA · vocab **168960** (identical to the teacher). Tied embeddings → **~0.49B
unique trainable params** (stored untied on disk, ~0.66B, so the official server loads it unchanged).
- **Frozen & reused unchanged:** the GLM-4-Voice speech tokenizer and decoder.
- **Distillation:** white-box **top-K (K=64) logit knowledge distillation**. Student and teacher
share the exact 168960-token vocabulary, so their logits are directly comparable. The student's
embedding is **SVD-warm-started** from the teacher's 4096-dim embedding (which preserves token
geometry); the student is then trained on teacher-generated interleaved data and refined on-policy.
- **Modes:** speech→speech (S2S), speech→text (S2T), text→speech (T2S), text→text (T2T) — all via the
same interleaved generation, selected at decode time.
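
The ~0.49B (tied) and ~0.66B (untied) figures can be roughly checked from the dimensions listed above. The sketch below assumes a standard GQA attention block (Q, K, V and output projections) and a three-matrix SwiGLU FFN, and it ignores biases and RMSNorm weights, which are comparatively small. It is an approximation, not the checkpoint's exact accounting.

```python
# Rough parameter count from the dimensions in the Model section.
# Assumptions (not stated in the card): Q/K/V/O projections with GQA,
# a three-matrix SwiGLU FFN, and no biases or norm weights counted.
hidden, layers, ffn = 1024, 24, 3456
heads, kv_groups, head_dim = 8, 2, 128
vocab = 168960

embed = vocab * hidden                        # one embedding (or output) matrix
q_proj = hidden * heads * head_dim            # query projection
kv_proj = 2 * hidden * kv_groups * head_dim   # key + value projections (GQA)
o_proj = heads * head_dim * hidden            # attention output projection
attn = q_proj + kv_proj + o_proj
mlp = 3 * hidden * ffn                        # gate, up and down matrices
blocks = layers * (attn + mlp)

tied = embed + blocks        # input embedding shared with the output head
untied = 2 * embed + blocks  # separate copies, as stored on disk

print(f"tied:   {tied / 1e9:.2f}B")    # tied:   0.49B
print(f"untied: {untied / 1e9:.2f}B")  # untied: 0.66B
```

Under these assumptions the embedding matrix alone accounts for about 0.17B parameters, which is why untying it moves the total from ~0.49B to ~0.66B.
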
## Usage
`ViBES-Audio` only replaces the LLM; pair it with the official GLM-4-Voice tokenizer + decoder.
```python
import torch
from transformers import AutoModel, AutoTokenizer

M = "JuzeZhang/ViBES-Audio"
tok = AutoTokenizer.from_pretrained(M, trust_remote_code=True)
model = AutoModel.from_pretrained(M, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16).to("cuda").eval()

# system prompt asking for interleaved text/audio output
SYS = ("User will provide you with a text instruction. Do it step by step. First, think about the "
       "instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens.")
prompt = f"<|system|>\n{SYS}<|user|>\nTell me a joke.<|assistant|>streaming_transcription\n"

enc = tok([prompt], return_tensors="pt").to("cuda")
out = model.generate(**enc, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.8)
gen = out[0, enc["input_ids"].shape[1]:].tolist()  # keep only the newly generated ids

# split the interleaved stream into text vs. speech tokens
audio_offset = tok.convert_tokens_to_ids("<|audio_0|>")  # 152353
# text ids sit below the audio range; the three hard-coded special-token ids are dropped
text_ids = [t for t in gen if t < audio_offset and t not in (151329, 151336, 151338)]
# shift audio ids down so they start at 0
audio_ids = [t - audio_offset for t in gen if t >= audio_offset]

print(tok.decode(text_ids))  # the spoken transcript
# feed `audio_ids` to the GLM-4-Voice decoder (CosyVoice flow + HiFi-GAN) to synthesize the wav
```
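
For streaming, it can help to keep the text and audio chunks in generation order instead of flattening them into two lists. The following is a minimal sketch, not part of the official serving code: `split_interleaved` is a hypothetical helper, the toy ids are made up, and the offset `152353` is taken from the comment in the snippet above.

```python
from itertools import groupby

AUDIO_OFFSET = 152353  # id of <|audio_0|>, per the comment in the snippet above


def split_interleaved(token_ids, audio_offset=AUDIO_OFFSET):
    """Group a generated stream into consecutive ("text" | "audio", ids) runs.

    Audio ids are shifted down by `audio_offset` so they start at 0.
    Hypothetical helper for illustration only.
    """
    runs = []
    for is_audio, group in groupby(token_ids, key=lambda t: t >= audio_offset):
        ids = list(group)
        if is_audio:
            runs.append(("audio", [t - audio_offset for t in ids]))
        else:
            runs.append(("text", ids))
    return runs


# Toy stream with made-up ids: 2 text, 3 audio, 1 text, 2 audio.
toy = [100, 101, AUDIO_OFFSET + 5, AUDIO_OFFSET + 6, AUDIO_OFFSET + 7,
       102, AUDIO_OFFSET + 1, AUDIO_OFFSET + 2]
runs = split_interleaved(toy)
for kind, ids in runs:
    print(kind, ids)
```

Each `"audio"` run could then be handed to the decoder as soon as it is complete, while the `"text"` runs are decoded with the tokenizer. Special tokens are not filtered here, so apply the same exclusion as in the snippet above if needed.
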
For full speech-in / speech-out and a streaming web UI, use it with the official
[GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice) repo (`web_demo.py` / `model_server.py`) — point
`--model-path` at this checkpoint and keep the official `--tokenizer-path` / `--flow-path`.
## Capabilities & limitations (honest)
- **Strong:** the teacher's voice/prosody and conversational style; common facts and everyday
conversational queries; **low latency** (~15× smaller than the 9B → fast time-to-first-token).
- **Weak:** multi-step reasoning (e.g. arithmetic) and novel / abstract / long-tail queries — it stays
fluent but can ramble or hallucinate. This is the expected ceiling of a 0.5B model distilled from a
9B teacher.
## License & attribution
Distilled from **GLM-4-Voice-9B** (© THUDM); please also observe the
[GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice) license/terms. Architecture and the frozen
tokenizer/decoder are from GLM-4-Voice. If you use this model, please cite ViBES and GLM-4-Voice.