	---
	language:
	- ja
	license: apache-2.0
	base_model: Qwen/Qwen3-ASR-1.7B
	library_name: mlx
	tags:
	- automatic-speech-recognition
	- speech-to-text
	- japanese
	- programming
	- mlx
	- asr
	- stt
	- qwen3_asr
	pipeline_tag: automatic-speech-recognition
	---

	# lilfugu

	A Japanese ASR model fine-tuned for software development.

	Based on [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B). It is designed to produce clean, usable transcriptions for developers: beyond recognizing programming terms, it writes numbers as Arabic numerals (e.g. `3000`, not `三千`), punctuates consistently, and produces higher-quality Japanese overall.

	## What's improved over the base model

	- Programming terms in English: `useEffect`, `Docker`, `Vercel`, `Prisma`, `Tailwind CSS`, etc. — not katakana
	- Arabic numerals: `3000番ポート`, `200ms`, `8GB` — not kanji numerals
	- Punctuation and formatting: cleaner, more consistent output
	- General Japanese quality: improvements not fully captured by existing benchmarks (JSUT, etc.) due to their normalization

	## Benchmarks

	### [ADLIB](https://github.com/holotherapper/adlib) (DevTerm, 247 test cases)

	| Model | CER | Term Accuracy (Exact) | Composite |
	|---|---|---|---|
	| lilfugu | 26.3% | 51.6% | 0.6272 |
	| Qwen3-ASR-1.7B (base) | 41.1% | 24.6% | 0.4203 |
	| Whisper large-v3-turbo | 41.9% | 20.2% | 0.3935 |
	| kotoba-whisper-v2.0 | 61.1% | 7.0% | 0.2256 |
	| SenseVoice Small | 56.8% | 0.0% | 0.2090 |

	Composite = 0.4 × (1 - CER) + 0.6 × Term Accuracy, where Term Accuracy counts both exact and flexible matches (so it can be higher than the Exact column in the table).

	Benchmark: [ADLIB](https://github.com/holotherapper/adlib) — Language-aware ASR benchmark for Japanese
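
The composite formula above can be sketched in a few lines of Python. This is an illustration only, not ADLIB's own scoring code: the function names `composite_score` and `implied_term_accuracy` are hypothetical, and both inputs are assumed to be fractions (0.263, not 26.3).

```python
def composite_score(cer: float, term_accuracy: float) -> float:
    """Composite = 0.4 * (1 - CER) + 0.6 * Term Accuracy (inputs as fractions)."""
    return 0.4 * (1.0 - cer) + 0.6 * term_accuracy


def implied_term_accuracy(cer: float, composite: float) -> float:
    """Invert the formula to recover the term accuracy behind a composite score."""
    return (composite - 0.4 * (1.0 - cer)) / 0.6


# lilfugu row from the table: CER 26.3%, Composite 0.6272.
# The recovered value is the exact-plus-flexible term accuracy (hypothetical helper).
lilfugu_term_accuracy = implied_term_accuracy(0.263, 0.6272)
print(f"{lilfugu_term_accuracy:.3f}")
```

Back-computing from the published lilfugu row gives a term accuracy of about 0.554, above the 51.6% exact-match figure, which is consistent with the composite including flexible matches. The result is approximate because the table values are rounded.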

	### [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) basic5000 (General Japanese, 300 samples)

	| Model | CER |
	|---|---|
	| Qwen3-ASR-1.7B (base) | 10.7% |
	| lilfugu | 10.8% |
	| Whisper large-v3-turbo | 12.0% |
	| kotoba-whisper-v2.0 | 15.7% |
	| SenseVoice Small | 16.2% |

	Dataset: [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut)

	Note: Existing Japanese ASR benchmarks are not designed to properly evaluate Japanese language quality — they normalize numbers, punctuation, and whitespace before scoring. These scores should be taken as a rough reference only.
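
To make the effect of normalization concrete, here is a minimal sketch of character error rate (CER) computed with and without a simple normalization step. It is not the scoring code of JSUT-based evaluations or any specific benchmark: the `normalize` function is a hypothetical stand-in that only strips punctuation and whitespace (number normalization is omitted), and the example strings are made up.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def normalize(text: str) -> str:
    """Hypothetical normalization: drop punctuation and whitespace only."""
    return "".join(
        ch for ch in text
        if not ch.isspace()
        and not unicodedata.category(ch).startswith(("P", "Z"))
    )


ref = "ポートは3000番です。"  # made-up reference with final punctuation
hyp = "ポートは3000番です"    # made-up hypothesis missing the "。"

raw_cer = cer(ref, hyp)
normalized_cer = cer(normalize(ref), normalize(hyp))
print(raw_cer, normalized_cer)
```

In this sketch the missing `。` counts as an error in the raw score but disappears once both strings are normalized, so a model with better punctuation gets no credit for it under normalized scoring.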

	## Variants

	| Repository | Size | Format |
	|---|---|---|
	| [lilfugu](https://huggingface.co/holotherapper/lilfugu) (this) | 4.1 GB | MLX bfloat16 |
	| [lilfugu-8bit](https://huggingface.co/holotherapper/lilfugu-8bit) | 2.8 GB | MLX 8bit quantized |
	| [lilfugu-transformers](https://huggingface.co/holotherapper/lilfugu-transformers) | 4.1 GB | safetensors fp16 (CUDA/Linux) |
	| [lilfugu-transformers-8bit](https://huggingface.co/holotherapper/lilfugu-transformers-8bit) | 2.2 GB | bitsandbytes int8 (CUDA/Linux) |
	| [lilfugu-lora](https://huggingface.co/holotherapper/lilfugu-lora) | ~49 MB | LoRA adapter |

	See also: [lilfugu-experimental](https://huggingface.co/holotherapper/lilfugu-experimental) — higher term accuracy, but may over-convert in some cases.

	## Usage

	### MLX (Apple Silicon)

	```bash
	pip install -U mlx-audio
	```

	```python
	from mlx_audio.stt import load

	model = load("holotherapper/lilfugu")
	result = model.generate("audio.wav", language="Japanese")
	print(result.text)
	```

	For the 8-bit version, load the quantized repository instead:
	```python
	from mlx_audio.stt import load

	model = load("holotherapper/lilfugu-8bit")
	```

	### CUDA / Linux

	```python
	from qwen_asr import Qwen3ASRModel

	model = Qwen3ASRModel.from_pretrained("holotherapper/lilfugu-transformers")
	result = model.transcribe("audio.wav")
	```

	### LoRA adapter (custom scale tuning)

	```python
	from mlx_tune.stt import FastSTTModel
	from mlx_lm.tuner.lora import LoRALinear

	model, _ = FastSTTModel.from_pretrained("mlx-community/Qwen3-ASR-1.7B-bf16")
	model.load_adapter("holotherapper/lilfugu-lora")

	# Adjust scale (0.0-1.0). Higher = stronger term conversion.
	for _, module in model.model.named_modules():
	    if isinstance(module, LoRALinear):
	        module.scale = 1.0

	text = model.transcribe("audio.wav", language="ja")
	```

	## License

	Apache 2.0 (following [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B))