---
language:
- ja
license: apache-2.0
base_model: Qwen/Qwen3-ASR-1.7B
library_name: mlx
tags:
- automatic-speech-recognition
- speech-to-text
- japanese
- programming
- mlx
- asr
- stt
- qwen3_asr
pipeline_tag: automatic-speech-recognition
---

# lilfugu

A Japanese ASR model fine-tuned for software development. Based on [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B).

Designed to produce clean, usable transcriptions for developers: not just programming term recognition, but also proper Arabic numerals (e.g. `3000`, not `三千`), consistent punctuation, and overall higher-quality Japanese output.

## What's improved over the base model

- **Programming terms in English**: `useEffect`, `Docker`, `Vercel`, `Prisma`, `Tailwind CSS`, etc., rather than katakana
- **Arabic numerals**: `3000番ポート`, `200ms`, `8GB`, rather than kanji numerals
- **Punctuation and formatting**: cleaner, more consistent output
- **General Japanese quality**: improvements not fully captured by existing benchmarks (JSUT, etc.) due to their normalization

## Benchmarks

### [ADLIB](https://github.com/holotherapper/adlib) (DevTerm, 247 test cases)

| Model | CER | Term Accuracy (Exact) | Composite |
|---|---|---|---|
| **lilfugu** | **26.3%** | **51.6%** | **0.6272** |
| Qwen3-ASR-1.7B (base) | 41.1% | 24.6% | 0.4203 |
| Whisper large-v3-turbo | 41.9% | 20.2% | 0.3935 |
| kotoba-whisper-v2.0 | 61.1% | 7.0% | 0.2256 |
| SenseVoice Small | 56.8% | 0.0% | 0.2090 |

Composite = 0.4 × (1 - CER) + 0.6 × Term Accuracy, where Term Accuracy includes both exact and flexible matches.

Benchmark: [ADLIB](https://github.com/holotherapper/adlib), a language-aware ASR benchmark for Japanese

### [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) basic5000 (General Japanese, 300 samples)

| Model | CER |
|---|---|
| Qwen3-ASR-1.7B (base) | 10.7% |
| **lilfugu** | **10.8%** |
| Whisper large-v3-turbo | 12.0% |
| kotoba-whisper-v2.0 | 15.7% |
| SenseVoice Small | 16.2% |

Dataset: [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut)

Note: Existing Japanese ASR benchmarks are not designed to properly evaluate Japanese language quality: they normalize numbers, punctuation, and whitespace before scoring. These scores should be taken as a rough reference only.
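
The composite formula above can be checked with a few lines of Python. This is a minimal sketch, not ADLIB's own scoring code; the function name `composite_score` is hypothetical. The ADLIB table reports exact-match term accuracy only, whereas the composite also counts flexible matches, so plugging in the Exact column gives a value at or below the reported composite rather than reproducing it.

```python
def composite_score(cer: float, term_accuracy: float) -> float:
    """Composite = 0.4 * (1 - CER) + 0.6 * term accuracy.

    Both inputs are fractions in [0, 1] (e.g. 26.3% -> 0.263).
    """
    return 0.4 * (1.0 - cer) + 0.6 * term_accuracy


# lilfugu row, using the exact-match column only (flexible matches excluded),
# so the result is lower than the reported composite of 0.6272.
exact_only = composite_score(cer=0.263, term_accuracy=0.516)
print(f"{exact_only:.4f}")  # 0.6044
```

The gap between 0.6044 and the reported 0.6272 is the contribution of flexible matches, which the table does not list separately.
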
## Variants

| Repository | Size | Format |
|---|---|---|
| [lilfugu](https://huggingface.co/holotherapper/lilfugu) (this) | 4.1 GB | MLX bfloat16 |
| [lilfugu-8bit](https://huggingface.co/holotherapper/lilfugu-8bit) | 2.8 GB | MLX 8bit quantized |
| [lilfugu-transformers](https://huggingface.co/holotherapper/lilfugu-transformers) | 4.1 GB | safetensors fp16 (CUDA/Linux) |
| [lilfugu-transformers-8bit](https://huggingface.co/holotherapper/lilfugu-transformers-8bit) | 2.2 GB | bitsandbytes int8 (CUDA/Linux) |
| [lilfugu-lora](https://huggingface.co/holotherapper/lilfugu-lora) | ~49 MB | LoRA adapter |

See also: [lilfugu-experimental](https://huggingface.co/holotherapper/lilfugu-experimental), which has higher term accuracy but may over-convert in some cases.

## Usage

### MLX (Apple Silicon)

```bash
pip install -U mlx-audio
```

```python
from mlx_audio.stt import load

model = load("holotherapper/lilfugu")
result = model.generate("audio.wav", language="Japanese")
print(result.text)
```

For the 8bit version:

```python
model = load("holotherapper/lilfugu-8bit")
```

### CUDA / Linux

```python
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained("holotherapper/lilfugu-transformers")
result = model.transcribe("audio.wav")
```

### LoRA adapter (custom scale tuning)

```python
from mlx_tune.stt import FastSTTModel
from mlx_lm.tuner.lora import LoRALinear

model, _ = FastSTTModel.from_pretrained("mlx-community/Qwen3-ASR-1.7B-bf16")
model.load_adapter("holotherapper/lilfugu-lora")

# Adjust scale (0.0-1.0). Higher = stronger term conversion.
for _, module in model.model.named_modules():
    if isinstance(module, LoRALinear):
        module.scale = 1.0

text = model.transcribe("audio.wav", language="ja")
```

## License

Apache 2.0 (following [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B))