| language: | |
| - ja | |
| license: apache-2.0 | |
| base_model: Qwen/Qwen3-ASR-1.7B | |
| library_name: mlx | |
| tags: | |
| - automatic-speech-recognition | |
| - speech-to-text | |
| - japanese | |
| - programming | |
| - mlx | |
| - asr | |
| - stt | |
| - qwen3_asr | |
| pipeline_tag: automatic-speech-recognition | |
| # lilfugu | |
| A Japanese ASR model fine-tuned for software development. | |
| Based on [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B). Designed to produce clean, usable transcriptions for developers: not just programming term recognition, but also Arabic numerals (e.g. `3000`, not `三千`), consistent punctuation, and overall higher-quality Japanese output. | |
| ## What's improved over the base model | |
| - **Programming terms in English**: `useEffect`, `Docker`, `Vercel`, `Prisma`, `Tailwind CSS`, etc., rather than katakana | |
| - **Arabic numerals**: `3000番ポート`, `200ms`, `8GB`, rather than kanji numerals | |
| - **Punctuation and formatting**: cleaner, more consistent output | |
| - **General Japanese quality**: improvements not fully captured by existing benchmarks (JSUT, etc.) due to their normalization | |
| ## Benchmarks | |
| ### [ADLIB](https://github.com/holotherapper/adlib) (DevTerm, 247 test cases) | |
| | Model | CER | Term Accuracy (Exact) | Composite | | |
| |---|---|---|---| | |
| | **lilfugu** | **26.3%** | **51.6%** | **0.6272** | | |
| | Qwen3-ASR-1.7B (base) | 41.1% | 24.6% | 0.4203 | | |
| | Whisper large-v3-turbo | 41.9% | 20.2% | 0.3935 | | |
| | kotoba-whisper-v2.0 | 61.1% | 7.0% | 0.2256 | | |
| | SenseVoice Small | 56.8% | 0.0% | 0.2090 | | |
| Composite = 0.4 × (1 - CER) + 0.6 × Term Accuracy, where Term Accuracy counts both exact and flexible matches (so it can exceed the Exact column above) | |
| Benchmark: [ADLIB](https://github.com/holotherapper/adlib), a language-aware ASR benchmark for Japanese | |
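As a worked example of the composite formula above, the sketch below recomputes the score and backs out the term accuracy implied by a table row. The function names `composite` and `implied_term_accuracy` are illustrative and not part of ADLIB; the weights (0.4 and 0.6) and the lilfugu row (CER 26.3%, Composite 0.6272) are taken from this card.

```python
CER_WEIGHT = 0.4   # weight on (1 - CER), from the formula in this card
TERM_WEIGHT = 0.6  # weight on term accuracy, from the formula in this card


def composite(cer: float, term_accuracy: float) -> float:
    """Composite score; both inputs are fractions in [0, 1]."""
    return CER_WEIGHT * (1.0 - cer) + TERM_WEIGHT * term_accuracy


def implied_term_accuracy(cer: float, composite_score: float) -> float:
    """Invert the formula to recover the term accuracy behind a score."""
    return (composite_score - CER_WEIGHT * (1.0 - cer)) / TERM_WEIGHT


if __name__ == "__main__":
    # lilfugu row: CER 26.3%, Composite 0.6272
    print(f"{implied_term_accuracy(0.263, 0.6272):.3f}")  # prints 0.554
```

The implied value (about 55.4%) is higher than the 51.6% in the Exact column, which is consistent with the composite counting flexible matches as well.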
| ### [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) basic5000 (General Japanese, 300 samples) | |
| | Model | CER | | |
| |---|---| | |
| | Qwen3-ASR-1.7B (base) | 10.7% | | |
| | **lilfugu** | **10.8%** | | |
| | Whisper large-v3-turbo | 12.0% | | |
| | kotoba-whisper-v2.0 | 15.7% | | |
| | SenseVoice Small | 16.2% | | |
| Dataset: [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) | |
| Note: Existing Japanese ASR benchmarks are not designed to evaluate Japanese output quality, because they normalize numbers, punctuation, and whitespace before scoring. Treat these scores as a rough reference only. | |
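To illustrate why normalization can hide formatting differences, here is a minimal sketch of character error rate (CER) computed with and without a normalization step. The `normalize` function is a hypothetical stand-in that only strips whitespace and punctuation; it is not the pipeline used by JSUT or any particular toolkit, and it does not attempt numeral conversion. The example sentences are made up for illustration.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def normalize(text: str) -> str:
    """Hypothetical normalization: drop whitespace and Unicode punctuation."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


reference = "ポートは3000番です。"
hypothesis = "ポートは 3000番です"  # stray space, missing full stop

print(cer(reference, hypothesis))                        # non-zero
print(cer(normalize(reference), normalize(hypothesis)))  # 0.0
```

After normalization the two strings are identical, so a benchmark scored this way cannot distinguish output with clean punctuation from output without it.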
| ## Variants | |
| | Repository | Size | Format | | |
| |---|---|---| | |
| | [lilfugu](https://huggingface.co/holotherapper/lilfugu) (this) | 4.1 GB | MLX bfloat16 | | |
| | [lilfugu-8bit](https://huggingface.co/holotherapper/lilfugu-8bit) | 2.8 GB | MLX 8bit quantized | | |
| | [lilfugu-transformers](https://huggingface.co/holotherapper/lilfugu-transformers) | 4.1 GB | safetensors fp16 (CUDA/Linux) | | |
| | [lilfugu-transformers-8bit](https://huggingface.co/holotherapper/lilfugu-transformers-8bit) | 2.2 GB | bitsandbytes int8 (CUDA/Linux) | | |
| | [lilfugu-lora](https://huggingface.co/holotherapper/lilfugu-lora) | ~49 MB | LoRA adapter | | |
| See also [lilfugu-experimental](https://huggingface.co/holotherapper/lilfugu-experimental), which has higher term accuracy but may over-convert in some cases. | |
| ## Usage | |
| ### MLX (Apple Silicon) | |
| ```bash | |
| pip install -U mlx-audio | |
| ``` | |
| ```python | |
| from mlx_audio.stt import load | |
| model = load("holotherapper/lilfugu") | |
| result = model.generate("audio.wav", language="Japanese") | |
| print(result.text) | |
| ``` | |
| For the 8bit version: | |
| ```python | |
| from mlx_audio.stt import load | |
| model = load("holotherapper/lilfugu-8bit") | |
| ``` | |
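When transcribing several files, one option is to load the model once and reuse it for each file. The helper `transcribe_files` below is a hypothetical convenience wrapper, not part of mlx-audio. It accepts any callable that maps a path to text, so the MLX call from this section is passed in rather than assumed.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def transcribe_files(
    paths: Iterable[str],
    transcribe: Callable[[str], str],
) -> Dict[str, str]:
    """Run `transcribe` on each path and return {file name: text}."""
    results: Dict[str, str] = {}
    for path in paths:
        results[Path(path).name] = transcribe(str(path))
    return results


# Usage with the MLX model from this section (needs mlx-audio on Apple Silicon):
#
#   from mlx_audio.stt import load
#   model = load("holotherapper/lilfugu")
#   texts = transcribe_files(
#       ["a.wav", "b.wav"],
#       lambda p: model.generate(p, language="Japanese").text,
#   )
```

Results are keyed by file name, so inputs from different directories that share a name would overwrite each other; key by full path instead if that matters.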
| ### CUDA / Linux | |
| ```python | |
| from qwen_asr import Qwen3ASRModel | |
| model = Qwen3ASRModel.from_pretrained("holotherapper/lilfugu-transformers") | |
| result = model.transcribe("audio.wav") | |
| ``` | |
| ### LoRA adapter (custom scale tuning) | |
| ```python | |
| from mlx_tune.stt import FastSTTModel | |
| from mlx_lm.tuner.lora import LoRALinear | |
| model, _ = FastSTTModel.from_pretrained("mlx-community/Qwen3-ASR-1.7B-bf16") | |
| model.load_adapter("holotherapper/lilfugu-lora") | |
| # Adjust scale (0.0-1.0). Higher = stronger term conversion. | |
| for _, module in model.model.named_modules(): | |
|     if isinstance(module, LoRALinear): | |
|         module.scale = 1.0 | |
| text = model.transcribe("audio.wav", language="ja") | |
| ``` | |
| ## License | |
| Apache 2.0 (following [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B)) | |