lilfugu-onnx
ONNX (CPU) version of lilfugu for Linux / Windows / macOS. Runs without CUDA or MLX. See the main model card for details.
Layout
The encoder is FP32 and the decoder uses INT8 dynamic quantization, following the split used in Daumee/Qwen3-ASR-0.6B-ONNX-CPU. The FP32 decoder variants are included alongside for reference.
| File | Format | Size |
|---|---|---|
| encoder.onnx | FP32 | 1.2 GiB |
| decoder_init.int8.onnx + .data | INT8 dynamic | 1.9 GiB |
| decoder_step.int8.onnx + .data | INT8 dynamic | 1.6 GiB |
| embed_tokens.bin | FP32 | 1.2 GiB |
| decoder_init.onnx + .data | FP32 | 7.6 GiB |
| decoder_step.onnx + .data | FP32 | 6.4 GiB |
INT8 distribution set (encoder + *.int8.onnx* + embed_tokens.bin + tokenizer/config): 5.9 GiB.
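If only the INT8 set is needed, filtering the download by filename pattern avoids pulling the 14 GiB of FP32 decoder files. The sketch below uses `snapshot_download` from `huggingface_hub` with `allow_patterns`. The pattern list is an assumption inferred from the table above, and the `*.json`, `*.txt`, and `inference.py` entries in particular are guesses at the tokenizer/config and script file names, so check the repo file list before relying on it.

```python
from fnmatch import fnmatch

# Assumed patterns for the INT8 distribution set. The tokenizer/config
# globs are guesses, not a verified listing of the repo.
INT8_PATTERNS = [
    "encoder.onnx",
    "*.int8.onnx*",      # INT8 decoder_init / decoder_step graphs plus their .data files
    "embed_tokens.bin",
    "*.json",            # assumed: tokenizer and config files
    "*.txt",             # assumed: vocab / merges files, if present
    "inference.py",      # assumed: script lives at the repo root
]


def is_int8_file(name: str) -> bool:
    """Return True if a repo filename matches the assumed INT8 set."""
    return any(fnmatch(name, pattern) for pattern in INT8_PATTERNS)


def download_int8(local_dir: str = "lilfugu-onnx") -> str:
    """Download only the matching files (needs network access and huggingface_hub)."""
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="holotherapper/lilfugu-onnx",
        local_dir=local_dir,
        allow_patterns=INT8_PATTERNS,
    )
```

Calling `download_int8()` returns the local directory path. `is_int8_file` is only there to sanity-check the patterns against a file listing before downloading.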
Usage
pip install -U "huggingface_hub[cli]" onnxruntime numpy soundfile librosa transformers torch
hf download holotherapper/lilfugu-onnx --local-dir lilfugu-onnx
python3 lilfugu-onnx/inference.py audio.wav
inference.py resamples to 16 kHz mono, runs the encoder, then greedy-decodes through decoder_init / decoder_step and returns a transcript. Options: --variant fp32, --max-new-tokens, --json.
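The greedy decoding stage described above can be sketched as a small loop. This is not the code in inference.py: the tensor names and cache layout of decoder_init / decoder_step are not documented here, so the two decoder calls are left as hypothetical callables (`init_fn`, `step_fn`) that would wrap `onnxruntime.InferenceSession.run` in a real setup. The sketch also assumes each call returns a 1-D logits vector for the last position.

```python
from typing import Callable, List, Tuple

import numpy as np

# Hypothetical signatures: the state is whatever the decoder carries
# between steps (for example a KV cache); its layout is not specified here.
State = object
InitFn = Callable[[np.ndarray, List[int]], Tuple[np.ndarray, State]]
StepFn = Callable[[int, State], Tuple[np.ndarray, State]]


def greedy_decode(
    audio_features: np.ndarray,
    prompt_ids: List[int],
    init_fn: InitFn,
    step_fn: StepFn,
    eos_id: int,
    max_new_tokens: int = 256,
) -> List[int]:
    """Pick the argmax token at every step until EOS or the token budget."""
    # One pass over the encoder output and the prompt (decoder_init role).
    logits, state = init_fn(audio_features, prompt_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(logits))
        if next_id == eos_id:
            break
        generated.append(next_id)
        # One token in, next-token logits out (decoder_step role).
        logits, state = step_fn(next_id, state)
    return generated


# Toy stand-ins so the loop can be exercised without the ONNX files.
def toy_init(features: np.ndarray, prompt_ids: List[int]) -> Tuple[np.ndarray, State]:
    return np.array([0.0, 1.0, 0.0, 0.0]), 1  # favours token 1


def toy_step(token_id: int, state: State) -> Tuple[np.ndarray, State]:
    logits = np.zeros(4)
    logits[min(token_id + 1, 3)] = 1.0  # next token is previous + 1, capped at 3
    return logits, state


tokens = greedy_decode(np.zeros((1, 8)), [0], toy_init, toy_step, eos_id=3)
print(tokens)
```

With the toy functions the loop emits tokens 1 and 2, then stops when token 3 (the stand-in EOS) becomes the argmax. The `max_new_tokens` argument plays the same role as the `--max-new-tokens` option.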
The package follows the same split / file naming convention as andrewleech/qwen3-asr-onnx (encoder.onnx, decoder_init*, decoder_step*, embed_tokens.bin), so tooling built around that layout should work with little or no change.
Benchmarks
ADLIB-DevTerm (247 cases):
| Model | CER | Term Accuracy (Exact) | Composite |
|---|---|---|---|
| lilfugu | 26.3% | 51.6% | 0.6272 |
| lilfugu-onnx (this) | 29.6% | 45.9% | 0.5794 |
| Qwen3-ASR-1.7B (base) | 41.1% | 24.6% | 0.4203 |
Composite = 0.4 × (1 - CER) + 0.6 × Term Accuracy, where Term Accuracy counts both exact and flexible matches and is therefore not the same number as the Exact column above. Benchmark: ADLIB. Scores above were measured on Apple Silicon. CPU architecture and onnxruntime build can shift token-level results, so validate on your target hardware if benchmark parity matters.
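As a worked check of the Composite formula, the snippet below computes the score and also inverts it to recover the Term Accuracy implied by each row. The implied values are derived here from the rounded CER and Composite figures in the table; they are not published numbers.

```python
def composite(cer: float, term_accuracy: float) -> float:
    """Composite = 0.4 * (1 - CER) + 0.6 * Term Accuracy, all as fractions."""
    return 0.4 * (1.0 - cer) + 0.6 * term_accuracy


def implied_term_accuracy(composite_score: float, cer: float) -> float:
    """Solve the Composite formula for Term Accuracy."""
    return (composite_score - 0.4 * (1.0 - cer)) / 0.6


# (CER, Composite) pairs copied from the benchmark table.
rows = {
    "lilfugu": (0.263, 0.6272),
    "lilfugu-onnx": (0.296, 0.5794),
    "Qwen3-ASR-1.7B": (0.411, 0.4203),
}

for name, (cer, score) in rows.items():
    ta = implied_term_accuracy(score, cer)
    print(f"{name}: implied term accuracy {ta:.3f}")
```

For lilfugu this gives an implied Term Accuracy of about 0.554, compared with 51.6% in the Exact column, which is consistent with the formula using the combined exact-plus-flexible figure.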
INT8 dynamic quantization costs roughly 0.05 on Composite vs the MLX build (0.5794 vs 0.6272). Switch to the FP32 decoder variants (--variant fp32) if that drop matters for your use case.
Peak memory is roughly 13 GB (INT8) or 22 GB (FP32). Plan for 24 GB or more on a desktop; 16 GB is enough only for the INT8 variant on a machine dedicated to inference.
Variants
| Repo | Format | Target |
|---|---|---|
| lilfugu | MLX bf16 | Apple Silicon |
| lilfugu-8bit | MLX 8bit | Apple Silicon |
| lilfugu-transformers | safetensors fp16 | CUDA / Linux |
| lilfugu-transformers-8bit | bitsandbytes int8 | CUDA |
| lilfugu-lora | MLX LoRA adapter | — |
| lilfugu-onnx (this) | ONNX FP32 + INT8 | CPU (Linux / Windows / macOS) |
License
Apache 2.0
Base model: Qwen/Qwen3-ASR-1.7B