language:
  - ja
license: apache-2.0
base_model: Qwen/Qwen3-ASR-1.7B
library_name: mlx
tags:
  - automatic-speech-recognition
  - speech-to-text
  - japanese
  - programming
  - mlx
  - asr
  - stt
  - qwen3_asr
pipeline_tag: automatic-speech-recognition

lilfugu

A Japanese ASR model fine-tuned for software development.

Based on Qwen3-ASR-1.7B. The goal is clean, directly usable transcriptions for developers: beyond recognizing programming terms, it writes numbers as Arabic numerals (e.g. 3000, not 三千), punctuates consistently, and produces higher-quality Japanese output overall.

What's improved over the base model

Programming terms in English: useEffect, Docker, Vercel, Prisma, Tailwind CSS, etc. are written in their original spelling, not katakana
Arabic numerals: 3000番ポート, 200ms, 8GB, not kanji numerals
Punctuation and formatting: cleaner, more consistent output
General Japanese quality: improvements that existing benchmarks (JSUT, etc.) do not fully capture, because they normalize text before scoring

Benchmarks

ADLIB (DevTerm, 247 test cases)

Model	CER	Term Accuracy (Exact)	Composite
lilfugu	26.3%	51.6%	0.6272
Qwen3-ASR-1.7B (base)	41.1%	24.6%	0.4203
Whisper large-v3-turbo	41.9%	20.2%	0.3935
kotoba-whisper-v2.0	61.1%	7.0%	0.2256
SenseVoice Small	56.8%	0.0%	0.2090

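Composite = 0.4 × (1 - CER) + 0.6 × Term Accuracy, where Term Accuracy counts both exact and flexible matches. Plugging the Exact column into the formula therefore gives a lower value than the Composite column.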
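As a worked example of the formula, the sketch below recomputes the composite for the lilfugu row using only the exact-match column, then backs out the term accuracy implied by the reported composite. The function names are hypothetical, and the implied figure is arithmetic on the rounded table values, not a number published with the benchmark.

```python
def composite_score(cer: float, term_accuracy: float) -> float:
    """Composite = 0.4 * (1 - CER) + 0.6 * Term Accuracy, inputs as fractions in [0, 1]."""
    return 0.4 * (1.0 - cer) + 0.6 * term_accuracy


def implied_term_accuracy(composite: float, cer: float) -> float:
    """Solve the composite formula for Term Accuracy (exact + flexible matches)."""
    return (composite - 0.4 * (1.0 - cer)) / 0.6


# lilfugu row: CER 26.3%, exact-match term accuracy 51.6%, reported composite 0.6272.
exact_only = composite_score(0.263, 0.516)
implied = implied_term_accuracy(0.6272, 0.263)

print(f"composite from exact matches only: {exact_only:.4f}")  # 0.6044
print(f"term accuracy implied by 0.6272:   {implied:.3f}")     # 0.554
```

The gap between 0.6044 and the reported 0.6272 is the contribution of flexible matches.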

Benchmark: ADLIB — Language-aware ASR benchmark for Japanese

JSUT basic5000 (General Japanese, 300 samples)

Model	CER
Qwen3-ASR-1.7B (base)	10.7%
lilfugu	10.8%
Whisper large-v3-turbo	12.0%
kotoba-whisper-v2.0	15.7%
SenseVoice Small	16.2%

Dataset: JSUT

Note: Existing Japanese ASR benchmarks normalize numbers, punctuation, and whitespace before scoring, so they are not designed to evaluate the formatting quality of Japanese output. On JSUT, lilfugu is on par with the base model (10.8% vs. 10.7% CER). Treat these scores as a rough reference only.
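To illustrate why normalization hides formatting differences, here is a minimal sketch of character error rate (CER) computed with and without a simple normalization step. The `normalize` function is an assumption for illustration: it only strips punctuation and whitespace, does not convert numerals, and is not the normalization used by JSUT-based evaluations or ADLIB.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Edit distance divided by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


def normalize(text: str) -> str:
    """Hypothetical normalizer: drop punctuation (P*) and separators (Z*)."""
    return "".join(ch for ch in text if unicodedata.category(ch)[0] not in ("P", "Z"))


reference = "ポートは3000番です。"
hypothesis = "ポートは3000番です"  # same words, missing the final 。

print(f"raw CER:        {cer(reference, hypothesis):.3f}")                        # 0.083
print(f"normalized CER: {cer(normalize(reference), normalize(hypothesis)):.3f}")  # 0.000
```

After normalization the missing 。 no longer counts as an error, which is the kind of difference a normalized benchmark cannot reward or penalize.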

Variants

Repository	Size	Format
lilfugu (this)	4.1 GB	MLX bfloat16
lilfugu-8bit	2.8 GB	MLX 8bit quantized
lilfugu-transformers	4.1 GB	safetensors fp16 (CUDA/Linux)
lilfugu-transformers-8bit	2.2 GB	bitsandbytes int8 (CUDA/Linux)
lilfugu-lora	~49 MB	LoRA adapter

See also: lilfugu-experimental — higher term accuracy, but may over-convert in some cases.

Usage

MLX (Apple Silicon)

pip install -U mlx-audio

from mlx_audio.stt import load

model = load("holotherapper/lilfugu")
result = model.generate("audio.wav", language="Japanese")
print(result.text)

For the 8-bit version, change only the repository ID:

model = load("holotherapper/lilfugu-8bit")

CUDA / Linux

from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained("holotherapper/lilfugu-transformers")
result = model.transcribe("audio.wav")

LoRA adapter (custom scale tuning)

from mlx_tune.stt import FastSTTModel
from mlx_lm.tuner.lora import LoRALinear

model, _ = FastSTTModel.from_pretrained("mlx-community/Qwen3-ASR-1.7B-bf16")
model.load_adapter("holotherapper/lilfugu-lora")

# Adjust scale (0.0-1.0). Higher = stronger term conversion.
for _, module in model.model.named_modules():
    if isinstance(module, LoRALinear):
        module.scale = 1.0

text = model.transcribe("audio.wav", language="ja")
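For intuition about what the scale does, the sketch below uses the usual LoRA formulation, in which a layer's output is the base output plus `scale` times a low-rank update. This is a toy in plain Python with made-up matrices; the helper names are hypothetical, and it is an assumption that `LoRALinear` applies its `scale` in exactly this way.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, base_w, lora_a, lora_b, scale):
    """y = W x + scale * B (A x), the usual LoRA forward pass."""
    base_out = matvec(base_w, x)
    low_rank = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base_out, low_rank)]


# Toy values (hypothetical): 2x2 base weight and a rank-1 adapter.
base_w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 1.0]]      # projects down to rank 1
lora_b = [[0.5], [-0.5]]   # projects back up to 2 dimensions
x = [2.0, 4.0]

for scale in (0.0, 0.5, 1.0):
    print(scale, lora_forward(x, base_w, lora_a, lora_b, scale))
# 0.0 [2.0, 4.0]   (base model output only)
# 0.5 [3.5, 2.5]
# 1.0 [5.0, 1.0]   (full adapter effect)
```

Under this formulation, a scale of 0.0 reproduces the base model and intermediate values interpolate linearly toward the full adapter.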

License

Apache 2.0, following the license of the base model, Qwen3-ASR-1.7B.