GGUF + pure-C++ runtime in CrispASR — GLM-ASR-Nano

#13
by cstr - opened

We've built a complete C++ runtime for GLM-ASR-Nano in CrispASR. One binary, one GGUF — no Python, no transformers.

src/glm_asr.cpp follows the architecture closely:

  • Whisper encoder (1280d, 32L, partial RoPE, LayerNorm with bias) — the partial-RoPE part bit me first; it's not full RoPE on all heads.
  • 4-frame-stack projector (5120 → 4096, GELU → 4096 → 2048) — frames are stacked then projected, not pooled.
  • Llama LLM (2048d, 28L, GQA 16/4, SwiGLU, RMSNorm).
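The projector's 5120 input width is consistent with four 1280-d encoder frames laid side by side (4 × 1280 = 5120). A minimal sketch of that stacking step, assuming row-major frames and that trailing frames which do not fill a group are dropped; stack_frames is a hypothetical name, not a function from src/glm_asr.cpp:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stack `stack` consecutive frames into one wider frame.
// Input:  row-major [n_frames x dim]
// Output: row-major [n_frames / stack x dim * stack]
// Assumption: leftover frames that do not fill a whole group are dropped.
std::vector<float> stack_frames(const std::vector<float>& frames,
                                std::size_t n_frames, std::size_t dim,
                                std::size_t stack) {
    assert(stack > 0 && frames.size() == n_frames * dim);
    const std::size_t n_out = n_frames / stack;
    // In row-major layout a group of `stack` frames is already contiguous,
    // so stacking amounts to keeping the first n_out * stack * dim values
    // and reading them with a wider row stride.
    return std::vector<float>(frames.begin(),
                              frames.begin() + n_out * stack * dim);
}
```

Unlike pooling, no values are averaged away: every encoder feature reaches the first projector layer, only at a quarter of the frame rate.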

It's structurally a sibling of our voxtral.cpp (3B) runtime: same building blocks, different sizes. That lets it share core/attention.h (Llama-style self-attention with NEOX RoPE + GQA + flash-attn) and core/ffn.h (SwiGLU) with Voxtral / Qwen3 / Granite. Prefill and decode are KV-cached, with native flash attention.
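For readers new to those two terms, here is a rough sketch of GQA head mapping and NEOX-style rotation. It is not the code in core/attention.h; kv_head_for, rope_neox, the n_rot parameter and the base of 10000 are assumptions made for illustration:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Grouped-query attention: several query heads share one KV head.
// With 16 query heads and 4 KV heads, each KV head serves 4 query heads.
std::size_t kv_head_for(std::size_t q_head, std::size_t n_q_heads,
                        std::size_t n_kv_heads) {
    assert(n_kv_heads > 0 && n_q_heads % n_kv_heads == 0 && q_head < n_q_heads);
    return q_head / (n_q_heads / n_kv_heads);
}

// NEOX-style RoPE on one head vector, in place.
// Element i is paired with element i + n_rot / 2 (not with its neighbour).
// Only the first n_rot dimensions rotate; the rest pass through unchanged,
// which gives a partial RoPE whenever n_rot < head.size().
void rope_neox(std::vector<float>& head, std::size_t n_rot, std::size_t pos,
               float base = 10000.0f) {
    assert(n_rot % 2 == 0 && n_rot <= head.size());
    const std::size_t half = n_rot / 2;
    for (std::size_t i = 0; i < half; ++i) {
        const float theta = static_cast<float>(pos) *
            std::pow(base, -2.0f * static_cast<float>(i) /
                               static_cast<float>(n_rot));
        const float c = std::cos(theta);
        const float s = std::sin(theta);
        const float x0 = head[i];
        const float x1 = head[i + half];
        head[i]        = x0 * c - x1 * s;
        head[i + half] = x0 * s + x1 * c;
    }
}
```

The rotation preserves the length of each pair, so it changes only the relative phase between positions, and a position of 0 leaves the vector untouched.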

Supports Q4_K / Q5_0 / Q8_0 / F16 quantisation. Covers 17 languages, including Mandarin, English and Cantonese, by relying on the LLM's native multilingual capacity.
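As a rough illustration of what the simplest of these formats does, here is a Q8_0-style block quantiser: 32 values per block, one scale per block, int8 payload. This is a sketch rather than ggml's implementation; BlockQ8, quantize_q8 and dequantize_q8 are hypothetical names, and the scale is kept as a float here where ggml stores it as fp16:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlock = 32;  // values per block, as in Q8_0

struct BlockQ8 {
    float scale;             // simplification: ggml keeps this as fp16
    std::int8_t q[kBlock];   // quantised values in [-127, 127]
};

std::vector<BlockQ8> quantize_q8(const std::vector<float>& x) {
    assert(x.size() % kBlock == 0);
    std::vector<BlockQ8> out(x.size() / kBlock);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float* src = x.data() + b * kBlock;
        float amax = 0.0f;  // largest magnitude in this block
        for (std::size_t i = 0; i < kBlock; ++i)
            amax = std::max(amax, std::fabs(src[i]));
        const float d = amax / 127.0f;
        const float inv = (d != 0.0f) ? 1.0f / d : 0.0f;
        out[b].scale = d;
        for (std::size_t i = 0; i < kBlock; ++i)
            out[b].q[i] = static_cast<std::int8_t>(std::lround(src[i] * inv));
    }
    return out;
}

std::vector<float> dequantize_q8(const std::vector<BlockQ8>& blocks) {
    std::vector<float> x(blocks.size() * kBlock);
    for (std::size_t b = 0; b < blocks.size(); ++b)
        for (std::size_t i = 0; i < kBlock; ++i)
            x[b * kBlock + i] = blocks[b].scale * static_cast<float>(blocks[b].q[i]);
    return x;
}
```

The round-trip error per value is bounded by half the block scale, which is why a single outlier in a block costs precision for its 31 neighbours.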

Pre-quantised GGUFs (MIT): cstr/glm-asr-nano-GGUF

git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8
./build/bin/crispasr --backend glm-asr -m glm-asr-nano-q4_k.gguf -f audio.wav

Word timestamps come from forced alignment (-am qwen3-forced-aligner.gguf). Temperature sampling, best-of-N, streaming, VAD, diarisation and all output formats are wired up.
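For anyone unfamiliar with temperature sampling, the idea can be sketched as follows. This is not CrispASR's sampler; softmax_with_temperature and sample_token are hypothetical helpers written only to show the technique:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Turn logits into probabilities after dividing by the temperature.
// Lower temperature sharpens the distribution towards the top token;
// higher temperature flattens it.
std::vector<float> softmax_with_temperature(const std::vector<float>& logits,
                                            float temperature) {
    assert(!logits.empty() && temperature > 0.0f);
    // Subtract the maximum logit first for numerical stability.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;
    return probs;
}

// Draw one token index according to the given probabilities.
std::size_t sample_token(const std::vector<float>& probs, std::mt19937& rng) {
    std::discrete_distribution<std::size_t> dist(probs.begin(), probs.end());
    return dist(rng);
}
```

Best-of-N builds on the same step: decode N candidates with sampling enabled and keep the one that scores best under whatever criterion the runtime uses.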
