# Milo-ASR: Danish ASR Model
Milo-ASR is a Danish speech-to-text model based on Qwen3-ASR-1.7B, fine-tuned to understand Danish, covering both read-aloud speech and conversations/podcasts.

The model is trained on CoRal v2 plus Danish podcast data, so it performs well across domains; most other models are only good at one or the other.
## Results
### CoRal v2 (read-aloud speech, 10,370 samples)
| Model | WER | CER |
|---|---|---|
| hviske-v2 (Whisper v2) | 17.40% | 7.96% |
| hviske-v3 (Whisper v3) | 21.62% | 9.22% |
| Milo-ASR | 23.24% | 11.17% |
| Whisper v3 Turbo | 40.35% | 15.51% |
| Qwen3-ASR base | 46.28% | 19.78% |
### Podcast (conversations, 500 samples)
| Model | WER | CER |
|---|---|---|
| Milo-ASR | 21.82% | 15.64% |
| hviske-v2 (Whisper v2) | 50.67% | 38.31% |
| Whisper v3 Turbo | 67.03% | 45.98% |
| Qwen3-ASR base | 67.52% | 47.71% |
| hviske-v3 (Whisper v3) | 67.65% | 50.12% |
Milo-ASR is the only model that handles both domains well. On podcasts its WER is 2.3x lower than that of the next-best model (hviske-v2).
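The WER and CER columns above are edit-distance metrics: the number of word-level (or character-level) edits needed to turn the hypothesis into the reference, divided by the reference length. A minimal, self-contained sketch of how such scores are computed (pure Python; this is an illustration, not the evaluation pipeline actually used for these tables):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (r != h)))    # substitution
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits / reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def cer(ref, hyp):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(ref, hyp) / len(ref)

print(wer("det var en god dag", "det var en dag"))  # 1 deletion / 5 words = 0.2
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why the weakest models on the podcast set approach 70%.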
## Quick Start
```bash
pip install qwen-asr transformers torch
```
```python
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "pluttodk/Milo-ASR",
    dtype="bfloat16",
    device_map="cuda:0",
)

results = model.transcribe(
    audio="path/to/danish_audio.wav",
    language="Danish",
)
print(results[0].text)
```
### Batch Transcription
```python
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = model.transcribe(audio=audio_files, language="Danish")
for r in results:
    print(r.text)
```
### Timestamps
```python
model = Qwen3ASRModel.from_pretrained(
    "pluttodk/Milo-ASR",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    dtype="bfloat16",
    device_map="cuda:0",
)

results = model.transcribe(
    audio="path/to/audio.wav",
    language="Danish",
    return_time_stamps=True,
)
for item in results[0].time_stamps.items:
    print(f"{item.start_time:.2f}s - {item.end_time:.2f}s: {item.text}")
```
### Streaming (vLLM)
```python
model = Qwen3ASRModel.LLM(
    model="pluttodk/Milo-ASR",
    gpu_memory_utilization=0.8,
)

state = model.init_streaming_state(language="Danish", chunk_size_sec=2.0)
for audio_chunk in audio_stream():
    state = model.streaming_transcribe(audio_chunk, state)
    print(state.text)
state = model.finish_streaming_transcribe(state)
```
## Training Details

The model was fine-tuned in two stages:

- Stage 1: Qwen3-ASR-1.7B fine-tuned on CoRal v2 (~250K samples, 3 epochs, lr=2e-5)
- Stage 2: continued from the stage 1 checkpoint on podcast + Azure podcast data (~141K samples, 8 epochs, lr=1e-5, cosine schedule)
| Parameter | Stage 2 |
|---|---|
| Learning rate | 1e-5 |
| Batch size | 8 (x4 grad acc = 32 effective) |
| Epochs | 8 |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Precision | bfloat16 |
| Training steps | 35,560 |
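The reported step count is consistent with the figures above: an effective batch of 8 x 4 = 32 over roughly 141K samples for 8 epochs comes out to about 35K optimizer steps, in line with the 35,560 reported. A quick sanity check (the dataset size is only approximate, so the result is close to but not exactly the reported number):

```python
# Stage 2 training-budget sanity check (sample count ~141K is approximate)
samples = 141_000
epochs = 8
per_device_batch = 8
grad_accum = 4

effective_batch = per_device_batch * grad_accum        # 32
steps_per_epoch = samples // effective_batch
total_steps = steps_per_epoch * epochs
print(effective_batch, total_steps)  # 32, ~35K (card reports 35,560)
```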
## Inherited from Qwen3-ASR

- Streaming/real-time via vLLM
- Song detection (background music)
- Word-level timestamps
- 30+ languages (Danish-optimized)
- Up to 20 min of audio per request
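Because requests are capped at 20 minutes, longer recordings have to be split client-side before transcription. A hedged sketch of a fixed-window split with a small overlap so boundary words are not cut in half (the helper names and the 2-second overlap are illustrative choices, not part of the qwen-asr API):

```python
import wave

def chunk_boundaries(duration_sec, max_sec=20 * 60, overlap_sec=2.0):
    """Split a recording into windows of at most max_sec seconds,
    with a small overlap between consecutive windows."""
    if duration_sec <= max_sec:
        return [(0.0, duration_sec)]
    bounds, start = [], 0.0
    step = max_sec - overlap_sec
    while start < duration_sec:
        bounds.append((start, min(start + max_sec, duration_sec)))
        start += step
    return bounds

def wav_duration(path):
    """Duration in seconds of a WAV file, via the stdlib wave module."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

print(chunk_boundaries(3000))  # 50 min -> three overlapping <=20-min windows
```

Each window can then be passed to `model.transcribe` separately, deduplicating text in the overlap region afterwards.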
## Citation
```bibtex
@misc{Milo-ASR,
  author = {Rønnelund, Mathias Oliver Valdbjørn},
  title = {Milo-ASR: Danish ASR Model based on Qwen3-ASR},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/pluttodk/Milo-ASR}
}
```
## Acknowledgements
- Qwen Team for Qwen3-ASR
- Alexandra Institute for CoRal v2