metadata
language:
  - ja
license: apache-2.0
base_model: Qwen/Qwen3-ASR-1.7B
library_name: mlx
tags:
  - automatic-speech-recognition
  - speech-to-text
  - japanese
  - programming
  - mlx
  - asr
  - stt
  - qwen3_asr
pipeline_tag: automatic-speech-recognition

lilfugu

A Japanese ASR model fine-tuned for software development.

Based on Qwen3-ASR-1.7B. Designed to produce clean, usable transcriptions for developers: beyond recognizing programming terms, it writes numbers as Arabic numerals (e.g. 3000, not 三千), punctuates consistently, and produces higher-quality Japanese output overall.

What's improved over the base model

  • Programming terms in English: useEffect, Docker, Vercel, Prisma, Tailwind CSS, etc. — not katakana
  • Arabic numerals: 3000番ポート, 200ms, 8GB — not kanji numerals
  • Punctuation and formatting: cleaner, more consistent output
  • General Japanese quality: improvements that existing benchmarks (JSUT, etc.) do not fully capture, because they normalize numbers, punctuation, and whitespace before scoring

Benchmarks

ADLIB (DevTerm, 247 test cases)

| Model | CER | Term Accuracy (Exact) | Composite |
|---|---|---|---|
| lilfugu | 26.3% | 51.6% | 0.6272 |
| Qwen3-ASR-1.7B (base) | 41.1% | 24.6% | 0.4203 |
| Whisper large-v3-turbo | 41.9% | 20.2% | 0.3935 |
| kotoba-whisper-v2.0 | 61.1% | 7.0% | 0.2256 |
| SenseVoice Small | 56.8% | 0.0% | 0.2090 |

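Composite = 0.4 × (1 - CER) + 0.6 × Term Accuracy, where Term Accuracy counts both exact and flexible matches. The table column reports exact matches only, so each Composite value is higher than what the CER and Term Accuracy (Exact) columns alone would give.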
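
The weighting can be checked with a few lines of Python. This is only a sketch of the formula as stated above; the function name `composite_score` is made up for illustration and is not part of ADLIB. Given the exact-only term accuracy from the table, it returns a slightly lower value than the reported Composite, which is consistent with the reported score also crediting flexible matches.

```python
def composite_score(cer: float, term_accuracy: float) -> float:
    """Composite = 0.4 * (1 - CER) + 0.6 * Term Accuracy, inputs as fractions in [0, 1]."""
    return 0.4 * (1.0 - cer) + 0.6 * term_accuracy


# lilfugu row, using the exact-only term accuracy shown in the table.
exact_only = composite_score(cer=0.263, term_accuracy=0.516)
print(f"{exact_only:.4f}")  # 0.6044, below the reported 0.6272
```

The CER term carries 40% of the weight and term accuracy 60%, so a model that gets developer terms right is rewarded more than one that only lowers its character error rate.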

Benchmark: ADLIB — Language-aware ASR benchmark for Japanese

JSUT basic5000 (General Japanese, 300 samples)

| Model | CER |
|---|---|
| Qwen3-ASR-1.7B (base) | 10.7% |
| lilfugu | 10.8% |
| Whisper large-v3-turbo | 12.0% |
| kotoba-whisper-v2.0 | 15.7% |
| SenseVoice Small | 16.2% |

Dataset: JSUT

Note: Existing Japanese ASR benchmarks are not designed to properly evaluate Japanese language quality — they normalize numbers, punctuation, and whitespace before scoring. These scores should be taken as a rough reference only.
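
To make the effect of normalization concrete, here is a minimal sketch of character error rate (CER) computed with and without a normalization step. The `normalize` function is a hypothetical stand-in that only drops punctuation and whitespace; it is not the normalizer used by JSUT-based evaluations, and number normalization is left out entirely.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def normalize(text: str) -> str:
    """Hypothetical normalizer: drop whitespace and Unicode punctuation."""
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


ref = "ポートは3000番です。"
hyp = "ポートは3000番です"  # same words, missing the final 。

print(f"{cer(ref, hyp):.3f}")                        # 0.083
print(f"{cer(normalize(ref), normalize(hyp)):.3f}")  # 0.000

# Numeral style alone moves raw CER a lot: kanji reference vs Arabic output.
print(f"{cer('三千番ポート', '3000番ポート'):.3f}")    # 0.667
```

Once punctuation is normalized away, the missing 。 no longer counts as an error, so a benchmark that scores this way cannot distinguish a well-punctuated transcript from an unpunctuated one. The same reasoning applies to numeral style when numbers are normalized.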

Variants

| Repository | Size | Format |
|---|---|---|
| lilfugu (this) | 4.1 GB | MLX bfloat16 |
| lilfugu-8bit | 2.8 GB | MLX 8bit quantized |
| lilfugu-transformers | 4.1 GB | safetensors fp16 (CUDA/Linux) |
| lilfugu-transformers-8bit | 2.2 GB | bitsandbytes int8 (CUDA/Linux) |
| lilfugu-lora | ~49 MB | LoRA adapter |

See also: lilfugu-experimental — higher term accuracy, but may over-convert in some cases.

Usage

MLX (Apple Silicon)

pip install -U mlx-audio

from mlx_audio.stt import load

model = load("holotherapper/lilfugu")
result = model.generate("audio.wav", language="Japanese")
print(result.text)

For the 8bit version:

model = load("holotherapper/lilfugu-8bit")
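
To transcribe several recordings in one go, the single-file call can be wrapped in a small loop. The sketch below relies only on the `load` and `generate` calls shown above; `transcribe_folder` and the `recordings` path are hypothetical names introduced for illustration, and the helper assumes the audio files use a `.wav` extension.

```python
from pathlib import Path


def transcribe_folder(model, folder, language="Japanese"):
    """Return {file name: transcript} for every .wav file directly inside folder."""
    transcripts = {}
    for wav_path in sorted(Path(folder).glob("*.wav")):
        # Same call as the single-file example: generate(path, language=...).text
        result = model.generate(str(wav_path), language=language)
        transcripts[wav_path.name] = result.text
    return transcripts


# Usage (requires mlx-audio on Apple Silicon; "recordings" is a placeholder path):
#   from mlx_audio.stt import load
#   model = load("holotherapper/lilfugu")
#   for name, text in transcribe_folder(model, "recordings").items():
#       print(f"{name}\t{text}")
```

Loading the model once and reusing it across files avoids paying the load cost for each recording.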

CUDA / Linux

from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained("holotherapper/lilfugu-transformers")
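results = model.transcribe("audio.wav")
# Assumption based on the upstream qwen-asr examples: transcribe returns one result per input audio.
print(results[0].text)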

LoRA adapter (custom scale tuning)

from mlx_tune.stt import FastSTTModel
from mlx_lm.tuner.lora import LoRALinear

model, _ = FastSTTModel.from_pretrained("mlx-community/Qwen3-ASR-1.7B-bf16")
model.load_adapter("holotherapper/lilfugu-lora")

# Adjust scale (0.0-1.0). Higher = stronger term conversion.
for _, module in model.model.named_modules():
    if isinstance(module, LoRALinear):
        module.scale = 1.0

text = model.transcribe("audio.wav", language="ja")

License

Apache 2.0 (following Qwen3-ASR-1.7B)