
GGUF + pure-C++ runtime in CrispASR — OmniASR-LLM-1B

by cstr - opened

We've added OmniASR-LLM-1B to CrispASR's omniasr backend. Same src/omniasr.cpp runtime as the CTC variants — dispatched by GGUF metadata to the LLM decode path when the LLaMA decoder weights are present.

Architecture: same 48-layer encoder (d=1280) as CTC-1B + a 12-layer LLaMA decoder (d=4096, SwiGLU, RoPE) + enc_proj projector. Autoregressive — KV-cached decode with flash attention, native punctuation/capitalisation from the LM (unlike the CTC variants which need --punc-model).

Smoke test: 8.5 GB .pt → 4.55 GB F16 GGUF (918 tensors). The JFK sample on Q4_K transcribes as:

"fellow americas ask not what your country can do for you"

(Cosmetic differences from the CTC reference are expected — autoregressive LMs pick different but valid punctuation/spelling.)

Pre-quantised GGUFs (Apache-2.0): cstr/omniasr-llm-1b-GGUF

./build/bin/crispasr --backend omniasr -m omniasr-llm-1b-q4_k.gguf -f audio.wav -osrt

CTC siblings (faster, no native punctuation): CTC-300M, CTC-1B. Smaller LLM variant: cstr/omniasr-llm-300m-v2-GGUF. Dynamic language selection (1693 FLORES-200 codes).
