---
license: apache-2.0
language:
- en
tags:
- speech
- text-to-speech
- speech-to-speech
- glm-4-voice
- knowledge-distillation
- vibes
pipeline_tag: text-to-speech
base_model: THUDM/glm-4-voice-9b
---

# ViBES-Audio: a distilled 0.5B GLM-4-Voice

**ViBES-Audio** is a lightweight (~0.5B) speech-language model **distilled from [GLM-4-Voice-9B](https://huggingface.co/THUDM/glm-4-voice-9b)**. It keeps the teacher's *exact* speech-in / speech-out pipeline and interleaved text+audio streaming; only the language model is shrunk and distilled. It is **drop-in compatible** with the official GLM-4-Voice serving code: the frozen speech **tokenizer** (Whisper-VQ, 12.5 Hz) and **decoder** (CosyVoice flow-matching + HiFi-GAN) are reused unchanged.

It is the **speech/text backbone** of [ViBES](https://github.com/Juzezhang/ViBES) (our speech-language-behavior model), a lightweight, low-latency alternative to the GLM-4-Voice-9B base. The motion experts are released separately: [`ViBES-Face`](https://huggingface.co/JuzeZhang/ViBES-Face).

## Model

- **Architecture:** same `ChatGLMForConditionalGeneration` family as the teacher, scaled down: hidden 1024 · 24 layers · FFN 3456 · 8 attention heads / 2 KV groups (head_dim 128) · RMSNorm · RoPE · SwiGLU · GQA · vocab **168960** (identical to the teacher). Tied embeddings → **~0.49B unique trainable params** (stored untied on disk, ~0.66B, so the official server loads it unchanged).
- **Frozen & reused unchanged:** the GLM-4-Voice speech tokenizer and decoder.
- **Distillation:** white-box **top-K (K=64) logit knowledge distillation**: student and teacher share the exact 168960-token vocabulary, so logits are directly comparable. The student's embedding is **SVD-warm-started** from the teacher's 4096-dim embedding (preserves token geometry), then trained on teacher-generated interleaved data and refined on-policy.
- **Modes:** speech→speech (S2S), speech→text (S2T), text→speech (T2S), text→text (T2T), all via the same interleaved generation, selected at decode time.

## Usage

`ViBES-Audio` only replaces the LLM; pair it with the official GLM-4-Voice tokenizer + decoder.

```python
import torch
from transformers import AutoModel, AutoTokenizer

M = "JuzeZhang/ViBES-Audio"
tok = AutoTokenizer.from_pretrained(M, trust_remote_code=True)
model = AutoModel.from_pretrained(M, trust_remote_code=True, torch_dtype=torch.bfloat16).to("cuda").eval()

SYS = ("User will provide you with a text instruction. Do it step by step. First, think about the "
       "instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens.")
prompt = f"<|system|>\n{SYS}<|user|>\nTell me a joke.<|assistant|>streaming_transcription\n"
enc = tok([prompt], return_tensors="pt").to("cuda")
out = model.generate(**enc, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.8)
gen = out[0, enc["input_ids"].shape[1]:].tolist()

# split the interleaved stream into text vs. speech tokens
audio_offset = tok.convert_tokens_to_ids("<|audio_0|>")  # 152353
text_ids = [t for t in gen if t < audio_offset and t not in (151329, 151336, 151338)]
audio_ids = [t - audio_offset for t in gen if t >= audio_offset]
print(tok.decode(text_ids))  # the spoken transcript
# feed `audio_ids` to the GLM-4-Voice decoder (CosyVoice flow + HiFi-GAN) to synthesize the wav
```

For full speech-in / speech-out and a streaming web UI, use it with the official [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice) repo (`web_demo.py` / `model_server.py`): point `--model-path` at this checkpoint and keep the official `--tokenizer-path` / `--flow-path`.

## Capabilities & limitations (honest)

- **Strong:** the teacher's voice/prosody and conversational style; common facts and everyday conversational queries; **low latency** (~15× smaller than the 9B → fast time-to-first-token).
- **Weak:** multi-step reasoning (e.g. arithmetic) and novel / abstract / long-tail queries; it stays fluent but can ramble or hallucinate. This is the expected ceiling of a 0.5B model distilled from a 9B teacher.

## License & attribution

Distilled from **GLM-4-Voice-9B** (© THUDM); please also observe the [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice) license/terms. Architecture and the frozen tokenizer/decoder are from GLM-4-Voice. If you use this model, please cite ViBES and GLM-4-Voice.
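
## Parameter-count check

The ~0.49B (tied) and ~0.66B (untied) figures in the Model section can be roughly reproduced from the listed dimensions. The sketch below is only a back-of-the-envelope check, not the checkpoint's actual layout: it assumes a ChatGLM-style block (fused QKV projection with a bias, bias-free output and MLP projections, SwiGLU with a doubled up-projection, two RMSNorms per layer plus one final norm). The helper `count_params` is hypothetical and exists only for this illustration.

```python
# Rough parameter count from the dimensions in the Model section.
# ASSUMPTION: ChatGLM-style layer layout (fused QKV with bias, no other
# biases, SwiGLU MLP, RMSNorm). The real checkpoint may differ slightly.

HIDDEN = 1024      # hidden size
LAYERS = 24        # transformer layers
FFN = 3456         # feed-forward width
HEADS = 8          # attention (query) heads
KV_GROUPS = 2      # key/value groups (GQA)
HEAD_DIM = 128     # per-head dimension
VOCAB = 168960     # vocabulary size, shared with the teacher


def count_params(tied: bool) -> int:
    """Hypothetical helper: total parameters with tied or untied embeddings."""
    embed = VOCAB * HIDDEN
    q_out = HEADS * HEAD_DIM                # query projection width
    kv_out = 2 * KV_GROUPS * HEAD_DIM       # key + value projection width
    qkv = HIDDEN * (q_out + kv_out) + (q_out + kv_out)  # weight + bias
    attn_out = q_out * HIDDEN               # attention output projection
    mlp = HIDDEN * 2 * FFN + FFN * HIDDEN   # SwiGLU gate/up + down projection
    norms = 2 * HIDDEN                      # two RMSNorm weights per layer
    per_layer = qkv + attn_out + mlp + norms
    total = embed + LAYERS * per_layer + HIDDEN  # + final RMSNorm
    if not tied:
        total += VOCAB * HIDDEN             # separate output (LM head) matrix
    return total


tied = count_params(tied=True)
untied = count_params(tied=False)
print(f"tied:   {tied / 1e9:.2f}B")    # tied:   0.49B
print(f"untied: {untied / 1e9:.2f}B")  # untied: 0.66B
```

Under these assumptions the embedding matrix alone is about 0.17B parameters, which accounts for the gap between the tied and untied figures.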