lilfugu-onnx

ONNX (CPU) version of lilfugu for Linux / Windows / macOS. Runs without CUDA or MLX. See the main model card for details.

Layout

Encoder FP32 + decoder INT8 dynamic quantization, following the split used in Daumee/Qwen3-ASR-0.6B-ONNX-CPU. The FP32 decoder variants are included alongside for reference.

| File | Format | Size |
| --- | --- | --- |
| encoder.onnx | FP32 | 1.2 GiB |
| decoder_init.int8.onnx + .data | INT8 dynamic | 1.9 GiB |
| decoder_step.int8.onnx + .data | INT8 dynamic | 1.6 GiB |
| embed_tokens.bin | FP32 | 1.2 GiB |
| decoder_init.onnx + .data | FP32 | 7.6 GiB |
| decoder_step.onnx + .data | FP32 | 6.4 GiB |

INT8 distribution set (encoder + *.int8.onnx* + embed_tokens.bin + tokenizer/config): 5.9 GiB.
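
To fetch only the INT8 distribution set rather than the whole repository, one option is to pass glob patterns to `snapshot_download` from `huggingface_hub`. The sketch below is illustrative only: the helper names are hypothetical, and the patterns for the tokenizer/config files and the script (`*.json`, `*.txt`, `inference.py`) are assumptions about the repository contents, so check them against the actual file listing.

```python
from fnmatch import fnmatch

# The first three patterns come from the file table above.
# "*.json", "*.txt" and "inference.py" are ASSUMED names for the
# tokenizer/config files and the inference script.
INT8_PATTERNS = [
    "encoder.onnx",
    "*.int8.onnx*",      # INT8 decoders plus their external .data files
    "embed_tokens.bin",
    "*.json",
    "*.txt",
    "inference.py",
]

def select_int8(filenames):
    """Return the filenames matched by at least one INT8 pattern."""
    return [f for f in filenames if any(fnmatch(f, p) for p in INT8_PATTERNS)]

def download_int8(local_dir="lilfugu-onnx"):
    """Download only the INT8 set (needs network access and huggingface_hub)."""
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id="holotherapper/lilfugu-onnx",
        local_dir=local_dir,
        allow_patterns=INT8_PATTERNS,
    )
```

`select_int8` makes it possible to preview which files the patterns keep before downloading anything; the FP32 decoders (`decoder_init.onnx`, `decoder_step.onnx`) are not matched by `*.int8.onnx*`.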

Usage

pip install -U "huggingface_hub[cli]" onnxruntime numpy soundfile librosa transformers torch
hf download holotherapper/lilfugu-onnx --local-dir lilfugu-onnx
python3 lilfugu-onnx/inference.py audio.wav

inference.py resamples to 16 kHz mono, runs the encoder, then greedy-decodes through decoder_init / decoder_step and returns a transcript. Options: --variant fp32, --max-new-tokens, --json.
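
The greedy decoding step described above can be sketched as follows. This is not the code in inference.py: `init_fn` and `step_fn` are hypothetical stand-ins for the decoder_init and decoder_step sessions, and the state object, the end-of-sequence id, and the toy "model" are assumptions made so the loop is runnable on its own.

```python
def argmax(values):
    """Index of the largest value in a sequence of scores."""
    return max(range(len(values)), key=values.__getitem__)

def greedy_decode(init_fn, step_fn, eos_id, max_new_tokens=256):
    """Greedy decoding: pick the highest-scoring token at every step.

    init_fn() -> (logits, state)              # stand-in for decoder_init
    step_fn(token, state) -> (logits, state)  # stand-in for decoder_step
    """
    tokens = []
    logits, state = init_fn()
    for _ in range(max_new_tokens):
        token = argmax(logits)
        if token == eos_id:
            break  # stop at end-of-sequence without emitting it
        tokens.append(token)
        logits, state = step_fn(token, state)
    return tokens

# Toy stand-ins: a scripted "model" that emits tokens 3, 1, then EOS (id 0).
SCRIPT = [3, 1, 0]

def one_hot(index, size=4):
    return [1.0 if i == index else 0.0 for i in range(size)]

def toy_init():
    return one_hot(SCRIPT[0]), 1

def toy_step(token, position):
    return one_hot(SCRIPT[position]), position + 1

print(greedy_decode(toy_init, toy_step, eos_id=0))  # [3, 1]
```

In a real run the state would hold the decoder's key/value cache and `--max-new-tokens` would bound the loop in the same way `max_new_tokens` does here.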

The package follows the same split / file naming convention as andrewleech/qwen3-asr-onnx (encoder.onnx, decoder_init*, decoder_step*, embed_tokens.bin), so tooling built around that layout should work with little or no change.
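
Tooling that relies on this layout may want to verify a download before loading anything. A minimal sketch, assuming the file names from the table above and that the INT8 decoders differ from the FP32 ones only by the `.int8` infix; the function names are hypothetical, and the external `.data` files, tokenizer, and config are not checked here.

```python
from pathlib import Path

def expected_files(variant="int8"):
    """Core model files for a variant ("int8" or "fp32")."""
    suffix = ".int8.onnx" if variant == "int8" else ".onnx"
    return [
        "encoder.onnx",
        "embed_tokens.bin",
        "decoder_init" + suffix,
        "decoder_step" + suffix,
    ]

def missing_files(model_dir, variant="int8"):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected_files(variant) if not (root / name).is_file()]
```

Calling `missing_files("lilfugu-onnx")` after the download returns an empty list when the four core INT8 files are present.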

Benchmarks

ADLIB-DevTerm (247 cases):

| Model | CER | Term Accuracy (Exact) | Composite |
| --- | --- | --- | --- |
| lilfugu | 26.3% | 51.6% | 0.6272 |
| lilfugu-onnx (this) | 29.6% | 45.9% | 0.5794 |
| Qwen3-ASR-1.7B (base) | 41.1% | 24.6% | 0.4203 |

Composite = 0.4 × (1 - CER) + 0.6 × Term Accuracy, where Term Accuracy counts both exact and flexible matches (the table column reports exact matches only, so the Composite values cannot be recomputed from it directly). Benchmark: ADLIB. The scores above were measured on Apple Silicon. CPU architecture and the onnxruntime build can shift token-level results, so validate on your target hardware if benchmark parity matters.
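
As a worked check of the formula, the sketch below computes the composite and also inverts it. The function names are hypothetical. Plugging in the exact-match column for lilfugu gives about 0.6044, lower than the reported 0.6272, which is consistent with the composite using the more lenient exact-plus-flexible term accuracy; inverting the formula suggests that figure is roughly 55.4% for lilfugu, though this is derived from the published numbers rather than stated by the benchmark.

```python
def composite(cer, term_accuracy):
    """Composite = 0.4 * (1 - CER) + 0.6 * Term Accuracy (both as fractions)."""
    return 0.4 * (1.0 - cer) + 0.6 * term_accuracy

def implied_term_accuracy(composite_score, cer):
    """Solve the composite formula for Term Accuracy."""
    return (composite_score - 0.4 * (1.0 - cer)) / 0.6

# lilfugu row: CER 26.3%, exact term accuracy 51.6%, reported composite 0.6272
print(round(composite(0.263, 0.516), 4))               # 0.6044
print(round(implied_term_accuracy(0.6272, 0.263), 3))  # 0.554
```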

INT8 dynamic quantization costs roughly 0.05 on Composite vs the MLX build. Switch to the FP32 decoder variants (--variant fp32) if that drop matters for your use case.

Peak memory is about 13 GB (INT8) or 22 GB (FP32). Plan for 24 GB+ of RAM on a desktop; a 16 GB machine only works if it is dedicated to inference.
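
A rough way to turn these figures into a pre-flight check is sketched below. The helper and its `headroom_gb` parameter are hypothetical, and the peaks are the approximate values quoted above, not measurements on your hardware.

```python
# Approximate peak memory per variant, in GB, as quoted above.
PEAK_GB = {"int8": 13, "fp32": 22}

def variants_that_fit(available_gb, headroom_gb=0.0):
    """Return the variants whose peak memory (plus headroom) fits in available_gb."""
    return [
        variant
        for variant, peak in PEAK_GB.items()
        if peak + headroom_gb <= available_gb
    ]

print(variants_that_fit(16))  # ['int8']
print(variants_that_fit(24))  # ['int8', 'fp32']
```

Setting `headroom_gb` to the memory other applications need gives a more conservative answer on a shared desktop.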

Variants

| Repo | Format | Target |
| --- | --- | --- |
| lilfugu | MLX bf16 | Apple Silicon |
| lilfugu-8bit | MLX 8bit | Apple Silicon |
| lilfugu-transformers | safetensors fp16 | CUDA / Linux |
| lilfugu-transformers-8bit | bitsandbytes int8 | CUDA |
| lilfugu-lora | MLX LoRA adapter | — |
| lilfugu-onnx (this) | ONNX FP32 + INT8 | CPU (Linux / Windows / macOS) |

License

Apache 2.0
