GGUF + pure-C++ runtime in CrispASR — GLM-ASR-Nano
We've built a complete C++ runtime for GLM-ASR-Nano in CrispASR. One binary, one GGUF — no Python, no transformers.
src/glm_asr.cpp follows the architecture closely:
- Whisper encoder (1280d, 32L, partial RoPE, LayerNorm with bias) — the partial-RoPE part bit me first; it's not full RoPE on all heads.
- 4-frame-stack projector (5120 → 4096, GELU → 4096 → 2048): frames are stacked then projected, not pooled.
- Llama LLM (2048d, 28L, GQA 16/4, SwiGLU, RMSNorm).
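To make the stacking step concrete, here is a minimal sketch; it is not the code from src/glm_asr.cpp. `stack_frames` is a hypothetical helper that concatenates every 4 consecutive encoder frames, so four 1280-d frames become one 5120-d projector input. Dropping leftover frames at the end is an assumption made for the sketch, not something the runtime is confirmed to do.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: stack `stack` consecutive frames into one wider vector.
// Input is row-major [n_frames, dim]; output is row-major [n_frames / stack, dim * stack].
// Assumption: trailing frames that do not fill a whole group are dropped.
std::vector<float> stack_frames(const std::vector<float>& frames,
                                std::size_t n_frames, std::size_t dim,
                                std::size_t stack = 4) {
    const std::size_t n_out = n_frames / stack;
    std::vector<float> out(n_out * dim * stack);
    for (std::size_t t = 0; t < n_out; ++t) {
        for (std::size_t s = 0; s < stack; ++s) {
            for (std::size_t d = 0; d < dim; ++d) {
                // Frame (t * stack + s) lands in slot s of output row t.
                out[t * dim * stack + s * dim + d] = frames[(t * stack + s) * dim + d];
            }
        }
    }
    return out;
}
```

With a row-major layout this amounts to a reshape: no values are averaged or discarded inside a group, which is the difference from pooling. The sequence gets 4x shorter and each row gets 4x wider.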
It's structurally a sibling of our voxtral.cpp (3B) runtime — same building blocks, different sizes — so we share core/attention.h (Llama-style self-attention with NEOX RoPE + GQA + flash-attn) and core/ffn.h (SwiGLU) with Voxtral / Qwen3 / Granite. KV-cached prefill+decode, native flash attention.
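For readers who haven't met these building blocks, a small sketch of the two ideas named above; it is an illustration, not the contents of core/attention.h. `kv_head_for` and `rope_neox` are hypothetical names. The sketch assumes consecutive query heads share one KV head, the common RoPE base of 10000, and that `n_rot` is even and no larger than the head size.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// GQA: which KV head a given query head reads from.
// Assumes n_q_heads is a multiple of n_kv_heads (16/4 here, so 4 query heads per KV head).
constexpr std::size_t kv_head_for(std::size_t q_head,
                                  std::size_t n_q_heads = 16,
                                  std::size_t n_kv_heads = 4) {
    return q_head / (n_q_heads / n_kv_heads);
}

// NEOX-style RoPE on one head vector, in place.
// Element i is paired with element i + n_rot/2 (not with its neighbour i + 1).
// Dims at index >= n_rot are left untouched, which gives the partial-RoPE case.
void rope_neox(std::vector<float>& x, std::size_t pos, std::size_t n_rot,
               float base = 10000.0f) {
    const std::size_t half = n_rot / 2;
    for (std::size_t i = 0; i < half; ++i) {
        const float theta = static_cast<float>(pos) *
            std::pow(base, -2.0f * static_cast<float>(i) / static_cast<float>(n_rot));
        const float c = std::cos(theta);
        const float s = std::sin(theta);
        const float a = x[i];
        const float b = x[i + half];
        x[i]        = a * c - b * s;
        x[i + half] = a * s + b * c;
    }
}
```

Passing `n_rot` equal to the head size gives full RoPE; a smaller `n_rot` rotates only the leading dims.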
Q4_K / Q5_0 / Q8_0 / F16 quantisation. 17 languages including Mandarin, English, Cantonese — uses the LLM's native multilingual capacity.
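As a rough illustration of what a format like Q8_0 does, here is a simplified sketch of block quantisation: 32 values share one scale and each value is stored as a signed 8-bit integer. `BlockQ8`, `quantize_q8` and `dequantize_q8` are hypothetical names, and the scale is kept as a `float` for readability (ggml stores it as fp16), so this is not byte-compatible with a real GGUF tensor.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

constexpr int kBlock = 32;  // values sharing one scale

struct BlockQ8 {
    float scale;                        // simplification: float instead of fp16
    std::array<std::int8_t, kBlock> q;  // quantised values in [-127, 127]
};

// Quantise one block of kBlock floats: scale = max|x| / 127, q = round(x / scale).
BlockQ8 quantize_q8(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < kBlock; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    BlockQ8 b{};
    b.scale = amax / 127.0f;
    const float inv = (b.scale != 0.0f) ? 1.0f / b.scale : 0.0f;  // all-zero block stays zero
    for (int i = 0; i < kBlock; ++i) {
        b.q[i] = static_cast<std::int8_t>(std::lround(x[i] * inv));
    }
    return b;
}

// Recover an approximation of the i-th original value.
float dequantize_q8(const BlockQ8& b, int i) {
    return b.scale * static_cast<float>(b.q[i]);
}
```

The round-trip error per value is at most about half the block's scale. The lower-bit formats trade more of that error for a smaller file.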
Pre-quantised GGUFs (MIT): cstr/glm-asr-nano-GGUF
git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8
./build/bin/crispasr --backend glm-asr -m glm-asr-nano-q4_k.gguf -f audio.wav
Word timestamps via forced alignment (-am qwen3-forced-aligner.gguf); temperature sampling, best-of-N, streaming, VAD, diarisation, all output formats wired.