	---
	license: apache-2.0
	language:
	- zh
	- en
	library_name: gguf
	tags:
	- automatic-speech-recognition
	- asr
	- fun-asr
	- funasr
	- qwen3
	- llama.cpp
	- ggml
	- cpu
	- chinese
	pipeline_tag: automatic-speech-recognition
	---

	# Fun-ASR-Nano · GGUF (FunASR llama.cpp runtime)

	GGUF build of Fun-ASR-Nano (SenseVoice SAN-M encoder + adaptor + Qwen3-0.6B LLM decoder) for the zero-Python, CPU/edge [FunASR llama.cpp runtime](https://github.com/FunAudioLLM/Fun-ASR/tree/main/runtime/llama.cpp). It is the accuracy-leading option (it uses an LLM decoder) and runs from a single C++ binary.

	## LLM quantization (pick by size vs accuracy)

	The Fun-ASR-Nano LLM (Qwen3-0.6B) ships in three quantization tiers, all within 0.1 percentage points of CER of each other (micro-averaged CER over 184 files). Pair any of them with `funasr-encoder-f16.gguf` (470 MB).

	| LLM file | size | CER ↓ | speed | notes |
	|---|---|---|---|---|
	| `qwen3-0.6b-q4km.gguf` | 484 MB | 8.35% | 6.1× | smallest |
	| `qwen3-0.6b-q5km.gguf` | 551 MB | 8.25% | 5.7× | best accuracy |
	| `qwen3-0.6b-q8_0.gguf` | 805 MB | 8.30% | 6.0× | |

	Recommended: `qwen3-0.6b-q4km.gguf` (Q4_K_M, smallest) or `qwen3-0.6b-q5km.gguf` (Q5_K_M, lowest CER).
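
The size side of this trade-off is plain arithmetic, since every tier is downloaded together with the 470 MB encoder. A minimal sketch that totals the sizes listed in the table; `total_mb` is an illustrative helper name, not part of the runtime, and the VAD model is left out because its size is not listed here:

```bash
#!/usr/bin/env bash
# Sizes in MB, copied from the quantization table.
ENCODER_MB=470

# total_mb (hypothetical helper): encoder size plus one LLM tier, in MB.
total_mb() {
  echo $(( ENCODER_MB + $1 ))
}

for tier in q4km:484 q5km:551 q8_0:805; do
  name=${tier%%:*}   # text before the colon
  size=${tier##*:}   # text after the colon
  printf '%-5s %4d MB total\n' "$name" "$(total_mb "$size")"
done
```

This gives roughly 954 MB for q4km, 1021 MB for q5km, and 1275 MB for q8_0, so the smallest tier saves about 320 MB over the largest for a 0.05 point CER difference.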

	## Get it running (no Python, no build)

	These weights target the [FunASR llama.cpp runtime](https://github.com/modelscope/FunASR/tree/main/runtime/llama.cpp), a whisper.cpp-style, self-contained binary for CPU and edge devices. Grab a prebuilt binary, then fetch this model and run it:

	- Prebuilt binaries (Linux / macOS / Windows) → [GitHub Releases](https://github.com/modelscope/FunASR/releases) (tag `runtime-llamacpp-v*`)
	- One-page quickstart & benchmarks → [funasr.com/llama-cpp](https://www.funasr.com/llama-cpp.html)

	```bash
	bash download-funasr-model.sh nano ./gguf
	llama-funasr-cli --enc ./gguf/funasr-encoder-f16.gguf -m ./gguf/qwen3-0.6b-q8_0.gguf --vad ./gguf/fsmn-vad.gguf -a audio.wav
	```
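
To process a folder of recordings, the same command can be wrapped in a loop. The following is a sketch under stated assumptions rather than a documented workflow: it uses only the flags shown above, the `./audio` directory and the `transcribe_one` helper are hypothetical, and it assumes the CLI prints the transcript to stdout.

```bash
#!/usr/bin/env bash
# Sketch: run the command shown above once per .wav file in a directory.
MODEL_DIR=${MODEL_DIR:-./gguf}      # where download-funasr-model.sh put the files
AUDIO_DIR=${AUDIO_DIR:-./audio}     # hypothetical input directory
LLM=${LLM:-qwen3-0.6b-q4km.gguf}    # any of the three LLM tiers
CLI=${CLI:-llama-funasr-cli}        # set CLI=echo for a dry run

# transcribe_one (hypothetical helper): wraps the flags from the command above.
transcribe_one() {
  "$CLI" --enc "$MODEL_DIR/funasr-encoder-f16.gguf" \
         -m "$MODEL_DIR/$LLM" \
         --vad "$MODEL_DIR/fsmn-vad.gguf" \
         -a "$1"
}

for wav in "$AUDIO_DIR"/*.wav; do
  [ -e "$wav" ] || continue         # the glob matched nothing
  # Assumption: the transcript is written to stdout.
  transcribe_one "$wav" > "${wav%.wav}.txt"
done
```

Running it with `CLI=echo` prints each command line into the `.txt` files instead of executing it, which is a cheap way to check the paths before a long batch.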

	## Files

	| file | size | notes |
	|---|---|---|
	| `funasr-encoder-f16.gguf` | 470 MB | audio encoder + adaptor (f16) |
	| `qwen3-0.6b-q4km.gguf` | 484 MB | LLM decoder, smallest (Q4_K_M) |
	| `qwen3-0.6b-q5km.gguf` | 551 MB | LLM decoder, best accuracy (Q5_K_M) |
	| `qwen3-0.6b-q8_0.gguf` | 805 MB | LLM decoder, largest (Q8_0); used in the commands on this page |

	## Usage (needs both the encoder and the LLM gguf)
	```bash
	llama-funasr-cli --enc funasr-encoder-f16.gguf -m qwen3-0.6b-q8_0.gguf -a audio.wav --vad fsmn-vad.gguf
	```
	On CPU, this Q8_0 configuration reaches 8.30% CER on the 184-clip Mandarin benchmark, versus 22–31% for whisper.cpp.

	## Links
	- 🧩 Runtime & build: [Fun-ASR · runtime/llama.cpp](https://github.com/FunAudioLLM/Fun-ASR/tree/main/runtime/llama.cpp) — ⭐ Star [Fun-ASR](https://github.com/FunAudioLLM/Fun-ASR)!
	- Source model: [FunAudioLLM/Fun-ASR-Nano-2512](https://huggingface.co/FunAudioLLM/Fun-ASR-Nano-2512)