lilfugu-onnx

ONNX (CPU) version of lilfugu for Linux / Windows / macOS. Runs without CUDA or MLX. See the main model card for details.

Layout

The encoder is kept in FP32 and the decoder uses INT8 dynamic quantization, following the split used in Daumee/Qwen3-ASR-0.6B-ONNX-CPU. The FP32 decoder variants are included alongside for reference.

File	Format	Size
`encoder.onnx`	FP32	1.2 GiB
`decoder_init.int8.onnx` + `.data`	INT8 dynamic	1.9 GiB
`decoder_step.int8.onnx` + `.data`	INT8 dynamic	1.6 GiB
`embed_tokens.bin`	FP32	1.2 GiB
`decoder_init.onnx` + `.data`	FP32	7.6 GiB
`decoder_step.onnx` + `.data`	FP32	6.4 GiB

INT8 distribution set (encoder + *.int8.onnx* + embed_tokens.bin + tokenizer/config): 5.9 GiB.
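
To fetch only the INT8 set, one option is to skip the FP32 decoder files at download time. The sketch below uses `snapshot_download` from `huggingface_hub` with `ignore_patterns`. The `.data` file names (`decoder_init.onnx.data`, `decoder_step.onnx.data`) and the helper names are assumptions inferred from the table above, so check the repo file list before relying on them.

```python
from fnmatch import fnmatch

# FP32 decoder files to skip. The ".data" names are assumed from the table above.
FP32_DECODER_PATTERNS = [
    "decoder_init.onnx",
    "decoder_init.onnx.data",
    "decoder_step.onnx",
    "decoder_step.onnx.data",
]

def is_skipped(filename):
    """Return True if a repo file would be excluded by the patterns above."""
    return any(fnmatch(filename, pattern) for pattern in FP32_DECODER_PATTERNS)

def download_int8_set(local_dir="lilfugu-onnx"):
    """Download everything except the FP32 decoder variants."""
    # Imported here so is_skipped() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="holotherapper/lilfugu-onnx",
        local_dir=local_dir,
        ignore_patterns=FP32_DECODER_PATTERNS,
    )
```

`is_skipped` is only there to preview which names the patterns exclude; the INT8 files such as `decoder_init.int8.onnx` do not match and are kept.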

Usage

pip install -U "huggingface_hub[cli]" onnxruntime numpy soundfile librosa transformers torch
hf download holotherapper/lilfugu-onnx --local-dir lilfugu-onnx
python3 lilfugu-onnx/inference.py audio.wav

inference.py resamples to 16 kHz mono, runs the encoder, then greedy-decodes through decoder_init / decoder_step and returns a transcript. Options: --variant fp32, --max-new-tokens, --json.
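
The init/step split amounts to a standard greedy loop: one call over the whole prompt to get the first logits and a cache, then one call per generated token. The sketch below shows only that control flow. `init_fn`, `step_fn`, `eos_id`, and the cache object are hypothetical stand-ins, not the actual tensor names or signatures used by inference.py.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def greedy_decode(init_fn, step_fn, prompt_ids, eos_id, max_new_tokens):
    """Greedy decoding over a hypothetical init/step pair.

    init_fn(prompt_ids)      -> (logits, cache)  # one pass over the prompt
    step_fn(token_id, cache) -> (logits, cache)  # one pass per new token
    """
    generated = []
    logits, cache = init_fn(prompt_ids)
    for _ in range(max_new_tokens):
        token = argmax(logits)          # greedy: always take the top token
        if token == eos_id:
            break                       # stop at end-of-sequence
        generated.append(token)
        logits, cache = step_fn(token, cache)
    return generated
```

In a real run, `init_fn` and `step_fn` would wrap the two ONNX sessions and the cache would carry the key/value tensors between calls; those details depend on the exported graphs and are not shown here.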

The package follows the same split / file naming convention as andrewleech/qwen3-asr-onnx (encoder.onnx, decoder_init*, decoder_step*, embed_tokens.bin), so tooling built around that layout should work with little or no change.
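
Since tooling depends on this naming, a quick check that a local directory has the expected files can save a confusing load error. This is a minimal sketch: `missing_files` and the `"int8"` variant label are hypothetical names, and the check covers only the four model files named above, not the `.data` sidecars or tokenizer/config files.

```python
from pathlib import Path

def missing_files(model_dir, variant="int8"):
    """List expected model files that are absent from model_dir."""
    # INT8 decoders carry an extra ".int8" before the extension.
    suffix = ".int8.onnx" if variant == "int8" else ".onnx"
    expected = [
        "encoder.onnx",
        "decoder_init" + suffix,
        "decoder_step" + suffix,
        "embed_tokens.bin",
    ]
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means the layout matches for that variant; anything else names the files to download.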

Benchmarks

ADLIB-DevTerm (247 cases):

Model	CER	Term Accuracy (Exact)	Composite
lilfugu	26.3%	51.6%	0.6272
lilfugu-onnx (this)	29.6%	45.9%	0.5794
Qwen3-ASR-1.7B (base)	41.1%	24.6%	0.4203

Composite = 0.4 × (1 - CER) + 0.6 × Term Accuracy, where Term Accuracy counts both exact and flexible matches, so Composite cannot be reproduced from the exact-match column alone. Benchmark: ADLIB. Scores above were measured on Apple Silicon. CPU architecture and onnxruntime build can shift token-level results, so validate on your target hardware if benchmark parity matters.
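
As a worked check of the formula, the sketch below plugs the lilfugu row into it. Using the exact-match column gives a lower score than the reported Composite, which is consistent with Composite using the broader (exact plus flexible) term accuracy. The implied term accuracy is a back-calculation from the table, not a reported figure, and both function names are illustrative.

```python
def composite(cer, term_accuracy):
    """Composite = 0.4 * (1 - CER) + 0.6 * Term Accuracy, inputs as fractions."""
    return 0.4 * (1.0 - cer) + 0.6 * term_accuracy

def implied_term_accuracy(cer, composite_score):
    """Solve the same formula for Term Accuracy."""
    return (composite_score - 0.4 * (1.0 - cer)) / 0.6

# lilfugu row: CER 26.3%, exact term accuracy 51.6%, reported Composite 0.6272
exact_only = composite(0.263, 0.516)
implied = implied_term_accuracy(0.263, 0.6272)

print(f"{exact_only:.4f}")  # 0.6044, below the reported 0.6272
print(f"{implied:.3f}")     # 0.554, term accuracy implied by the reported score
```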

INT8 dynamic quantization costs roughly 0.05 on Composite relative to the MLX build (0.6272 vs 0.5794). Switch to the FP32 decoder variants (--variant fp32) if that drop matters for your use case.

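Peak memory is about 13 GB (INT8) or about 22 GB (FP32). Plan for 24 GB or more on a desktop that also runs other applications; a 16 GB machine can run the INT8 variant only if it is dedicated to inference.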

Variants

Repo	Format	Target
lilfugu	MLX bf16	Apple Silicon
lilfugu-8bit	MLX 8bit	Apple Silicon
lilfugu-transformers	safetensors fp16	CUDA / Linux
lilfugu-transformers-8bit	bitsandbytes int8	CUDA
lilfugu-lora	MLX LoRA adapter	—
lilfugu-onnx (this)	ONNX FP32 + INT8	CPU (Linux / Windows / macOS)

License

Apache 2.0

Base model: Qwen/Qwen3-ASR-1.7B