| --- |
| license: apache-2.0 |
| language: |
| - en |
| tags: |
| - speech |
| - text-to-speech |
| - speech-to-speech |
| - glm-4-voice |
| - knowledge-distillation |
| - vibes |
| pipeline_tag: text-to-speech |
| base_model: THUDM/glm-4-voice-9b |
| --- |
| |
| # ViBES-Audio — a distilled 0.5B GLM-4-Voice |
|
|
| **ViBES-Audio** is a lightweight (~0.5B) speech-language model **distilled from |
| [GLM-4-Voice-9B](https://huggingface.co/THUDM/glm-4-voice-9b)**. It keeps the teacher's *exact* |
| speech-in / speech-out pipeline and interleaved text+audio streaming — only the language model is |
| shrunk and distilled. It is **drop-in compatible** with the official GLM-4-Voice serving code: the |
| frozen speech **tokenizer** (Whisper-VQ, 12.5 Hz) and **decoder** (CosyVoice flow-matching + HiFi-GAN) |
| are reused unchanged. |
|
|
| It is the **speech/text backbone** of [ViBES](https://github.com/Juzezhang/ViBES) (our |
| speech-language-behavior model) — a lightweight, low-latency alternative to the GLM-4-Voice-9B base. |
| The motion experts are released separately: [`ViBES-Face`](https://huggingface.co/JuzeZhang/ViBES-Face). |
|
|
| ## Model |
|
|
| - **Architecture:** same `ChatGLMForConditionalGeneration` family as the teacher, scaled down — |
| hidden 1024 · 24 layers · FFN 3456 · 8 attention heads / 2 KV groups (head_dim 128) · RMSNorm · |
| RoPE · SwiGLU · GQA · vocab **168960** (identical to the teacher). Tied embeddings → **~0.49B |
| unique trainable params** (stored untied on disk, ~0.66B, so the official server loads it unchanged). |
| - **Frozen & reused unchanged:** the GLM-4-Voice speech tokenizer and decoder. |
| - **Distillation:** white-box **top-K (K=64) logit knowledge distillation** — student and teacher |
| share the exact 168960-token vocabulary, so logits are directly comparable. The student's |
| embedding is **SVD-warm-started** from the teacher's 4096-dim embedding (preserves token geometry), |
| then trained on teacher-generated interleaved data and refined on-policy. |
| - **Modes:** speech→speech (S2S), speech→text (S2T), text→speech (T2S), text→text (T2T) — all via the |
| same interleaved generation, selected at decode time. |
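
The parameter counts quoted above can be reproduced from the listed dimensions. The sketch below is a back-of-the-envelope tally, not the model's actual config code: it assumes a fused QKV projection with GQA, a three-matrix SwiGLU FFN, two RMSNorm weights per layer plus a final norm, and no bias terms. The function name `count_params` is hypothetical.

```python
def count_params(vocab=168960, hidden=1024, layers=24, ffn=3456,
                 kv_groups=2, head_dim=128, tied=True):
    """Approximate parameter count from the dimensions listed above.

    Assumptions (not confirmed by the model card): fused QKV projection with
    GQA, SwiGLU FFN (gate, up, down), two RMSNorms per layer plus a final
    norm, and no bias terms.
    """
    embed = vocab * hidden                              # input embedding table
    qkv = hidden * (hidden + 2 * kv_groups * head_dim)  # Q plus grouped K and V
    attn_out = hidden * hidden                          # attention output projection
    mlp = 3 * hidden * ffn                              # SwiGLU: gate, up, down
    norms = 2 * hidden                                  # two RMSNorm weights per layer
    per_layer = qkv + attn_out + mlp + norms
    body = layers * per_layer + hidden                  # + final RMSNorm
    # Tied: the output head shares the embedding table; untied: it is a second copy.
    return body + (embed if tied else 2 * embed)


tied = count_params(tied=True)
untied = count_params(tied=False)
print(f"tied:   {tied / 1e9:.2f}B")
print(f"untied: {untied / 1e9:.2f}B")
```

Under these assumptions the tally comes to roughly 0.49B with tied embeddings and 0.66B untied, in line with the figures above. The gap between the two is exactly one extra 168960 × 1024 embedding matrix.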
| |
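Because student and teacher share a vocabulary, a top-K distillation loss can compare their logits index by index. The sketch below shows one common formulation for a single token position: keep the teacher's K largest logits, renormalize both distributions over that support, and take the KL divergence from teacher to student. The exact loss, temperature, and weighting used to train ViBES-Audio are not specified in this card, so `topk_kd_loss` and its parameters are illustrative assumptions.

```python
import math


def _log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


def topk_kd_loss(student_logits, teacher_logits, k=64, temperature=1.0):
    """KL(teacher || student) restricted to the teacher's top-k token ids.

    Hypothetical helper for one token position; both logit lists must be
    indexed by the same vocabulary.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must share a vocabulary")
    k = min(k, len(teacher_logits))
    # Indices of the teacher's k largest logits.
    top = sorted(range(len(teacher_logits)),
                 key=teacher_logits.__getitem__, reverse=True)[:k]
    t_logp = _log_softmax([teacher_logits[i] / temperature for i in top])
    s_logp = _log_softmax([student_logits[i] / temperature for i in top])
    # KL divergence: sum_i p_t(i) * (log p_t(i) - log p_s(i)).
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


teacher = [4.0, 2.0, 0.5, -1.0, -3.0]
print(topk_kd_loss(teacher, teacher, k=3))                    # identical logits
print(topk_kd_loss([0.0, 0.0, 0.0, 0.0, 0.0], teacher, k=3))  # uniform student
```

Renormalizing the student over the same K indices is a simplification. An implementation may instead normalize the student over the full vocabulary so that probability mass outside the top-K is also penalized.
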
| ## Usage |
| |
| `ViBES-Audio` only replaces the LLM; pair it with the official GLM-4-Voice tokenizer + decoder. |
| |
| ```python |
| import torch |
| from transformers import AutoModel, AutoTokenizer |
| |
| M = "JuzeZhang/ViBES-Audio" |
| tok = AutoTokenizer.from_pretrained(M, trust_remote_code=True) |
| model = AutoModel.from_pretrained( |
|     M, trust_remote_code=True, torch_dtype=torch.bfloat16 |
| ).to("cuda").eval() |
|
|
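| # System prompt: keep the wording exactly as written; the interleaved output format presumably depends on it. |
| SYS = ("User will provide you with a text instruction. Do it step by step. First, think about the " |
|        "instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens.") |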
| prompt = f"<|system|>\n{SYS}<|user|>\nTell me a joke.<|assistant|>streaming_transcription\n" |
| |
| enc = tok([prompt], return_tensors="pt").to("cuda") |
| out = model.generate(**enc, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.8) |
| gen = out[0, enc["input_ids"].shape[1]:].tolist() |
| |
| # Split the interleaved stream into text tokens and speech tokens. |
| audio_offset = tok.convert_tokens_to_ids("<|audio_0|>")  # 152353 |
| skip_ids = (151329, 151336, 151338)  # special-token ids excluded from the transcript |
| text_ids = [t for t in gen if t < audio_offset and t not in skip_ids] |
| audio_ids = [t - audio_offset for t in gen if t >= audio_offset]  # shift to codebook indices |
| print(tok.decode(text_ids))  # the spoken transcript |
| # Feed `audio_ids` to the GLM-4-Voice decoder (CosyVoice flow + HiFi-GAN) to synthesize the wav. |
| ``` |
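
For streaming, it can help to view the generated ids as alternating runs of text and audio rather than two flat lists. The sketch below groups a token stream into such runs and estimates audio duration from the 12.5 Hz tokenizer rate mentioned earlier. `split_runs` and `audio_seconds` are hypothetical helper names, the toy ids are made up, and the offset 152353 is taken from the comment in the snippet above.

```python
from itertools import groupby

AUDIO_OFFSET = 152353      # id of "<|audio_0|>", per the snippet above
TOKEN_RATE_HZ = 12.5       # speech tokenizer rate stated in this card


def split_runs(token_ids, audio_offset=AUDIO_OFFSET):
    """Group a generated id stream into consecutive ("text" | "audio", ids) runs.

    Audio ids are shifted so they index the speech codebook from 0.
    """
    runs = []
    for is_audio, group in groupby(token_ids, key=lambda t: t >= audio_offset):
        ids = list(group)
        if is_audio:
            runs.append(("audio", [t - audio_offset for t in ids]))
        else:
            runs.append(("text", ids))
    return runs


def audio_seconds(n_audio_tokens, rate_hz=TOKEN_RATE_HZ):
    """Approximate duration of n speech tokens at the given token rate."""
    return n_audio_tokens / rate_hz


# Toy stream: 2 text ids, 3 audio ids, 1 text id (ids are made up).
stream = [100, 101, AUDIO_OFFSET + 5, AUDIO_OFFSET + 6, AUDIO_OFFSET + 7, 102]
for kind, ids in split_runs(stream):
    print(kind, ids)
print(audio_seconds(26))   # duration of one 26-token audio chunk, in seconds
```

Each audio run can be handed to the decoder as soon as it is complete, without waiting for the whole response. Special-token ids below the offset still land in the text runs, so they need the same filtering as in the snippet above.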
| |
| For full speech-in / speech-out and a streaming web UI, use it with the official |
| [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice) repo (`web_demo.py` / `model_server.py`) — point |
| `--model-path` at this checkpoint and keep the official `--tokenizer-path` / `--flow-path`. |
| |
| ## Capabilities & limitations (honest) |
| |
| - **Strong:** retains the teacher's voice/prosody and conversational style; handles common facts and |
| everyday conversational queries; **low latency** (~15× smaller than the 9B teacher, so |
| time-to-first-token is fast). |
| - **Weak:** multi-step reasoning (e.g. arithmetic) and novel / abstract / long-tail queries — it stays |
| fluent but can ramble or hallucinate. This is the expected ceiling of a 0.5B model distilled from a |
| 9B teacher. |
| |
| ## License & attribution |
| |
| Distilled from **GLM-4-Voice-9B** (© THUDM); please also observe the |
| [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice) license/terms. The architecture and the frozen |
| tokenizer/decoder come from GLM-4-Voice. If you use this model, please cite both ViBES and GLM-4-Voice. |
| |