Milo-ASR: Danish ASR Model

Milo-ASR is a Danish speech-to-text model based on Qwen3-ASR-1.7B, fine-tuned to understand Danish — both read-aloud speech and conversations/podcasts.

The model was trained on CoRal v2 plus Danish podcast data, so it performs well across both domains; most other models do well on only one or the other.

Results

CoRal v2 (read-aloud speech, 10,370 samples)

| Model                  | WER    | CER    |
|------------------------|--------|--------|
| hviske-v2 (Whisper v2) | 17.40% | 7.96%  |
| hviske-v3 (Whisper v3) | 21.62% | 9.22%  |
| Milo-ASR               | 23.24% | 11.17% |
| Whisper v3 Turbo       | 40.35% | 15.51% |
| Qwen3-ASR base         | 46.28% | 19.78% |

Podcast (conversations, 500 samples)

| Model                  | WER    | CER    |
|------------------------|--------|--------|
| Milo-ASR               | 21.82% | 15.64% |
| hviske-v2 (Whisper v2) | 50.67% | 38.31% |
| Whisper v3 Turbo       | 67.03% | 45.98% |
| Qwen3-ASR base         | 67.52% | 47.71% |
| hviske-v3 (Whisper v3) | 67.65% | 50.12% |

Milo-ASR is the only model that handles both domains well. On podcasts, its WER is 2.3× lower than that of the next-best model (hviske-v2).
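WER and CER are edit-distance rates over words and characters, respectively. For readers who want to reproduce the metric on their own transcripts, here is a minimal, dependency-free WER sketch (the benchmark above may use a library such as jiwer with additional text normalization, so treat this as illustrative):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if tokens match)
            ))
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

print(wer("det er en god dag", "det er en go dag"))  # 1 substitution / 5 words -> 0.2
```

CER is the same computation applied to character sequences instead of word lists.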

Plots

(Plot images: CoRal v2 WER · Podcast WER · CoRal vs Podcast · Speed)

Quick Start

pip install qwen-asr transformers torch

from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "pluttodk/Milo-ASR",
    dtype="bfloat16",
    device_map="cuda:0",
)

results = model.transcribe(
    audio="path/to/danish_audio.wav",
    language="Danish",
)

print(results[0].text)

Batch Transcription

audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = model.transcribe(audio=audio_files, language="Danish")

for r in results:
    print(r.text)

Timestamps

model = Qwen3ASRModel.from_pretrained(
    "pluttodk/Milo-ASR",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    dtype="bfloat16",
    device_map="cuda:0",
)

results = model.transcribe(
    audio="path/to/audio.wav",
    language="Danish",
    return_time_stamps=True,
)

for item in results[0].time_stamps.items:
    print(f"{item.start_time:.2f}s - {item.end_time:.2f}s: {item.text}")

Streaming (vLLM)

model = Qwen3ASRModel.LLM(
    model="pluttodk/Milo-ASR",
    gpu_memory_utilization=0.8,
)

state = model.init_streaming_state(language="Danish", chunk_size_sec=2.0)

for audio_chunk in audio_stream():
    state = model.streaming_transcribe(audio_chunk, state)
    print(state.text)

state = model.finish_streaming_transcribe(state)
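The `audio_stream()` in the snippet above is a placeholder for whatever produces your audio chunks. One minimal sketch, assuming a 16-bit PCM WAV file and that the model accepts raw frame chunks (check the qwen-asr documentation for the exact chunk format it expects):

```python
import wave

def audio_stream(path, chunk_sec=2.0):
    """Yield successive ~chunk_sec slices of raw PCM frames from a WAV file."""
    with wave.open(path, "rb") as wf:
        frames_per_chunk = int(wf.getframerate() * chunk_sec)
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:
                break
            yield data
```

Keeping `chunk_sec` equal to the `chunk_size_sec` passed to `init_streaming_state` keeps the feed and the decoder in step.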

Training Details

The model was fine-tuned in two stages:

  1. Stage 1: Qwen3-ASR-1.7B fine-tuned on CoRal v2 (~250K samples, 3 epochs, lr=2e-5)
  2. Stage 2: Continued from the Stage 1 checkpoint on podcast + Azure podcast data (~141K samples, 8 epochs, lr=1e-5, cosine schedule)

| Parameter      | Stage 2                        |
|----------------|--------------------------------|
| Learning rate  | 1e-5                           |
| Batch size     | 8 (×4 grad acc = 32 effective) |
| Epochs         | 8                              |
| LR scheduler   | Cosine                         |
| Warmup ratio   | 0.1                            |
| Weight decay   | 0.01                           |
| Precision      | bfloat16                       |
| Training steps | 35,560                         |
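The Stage 2 numbers are internally consistent; a quick sanity check relates the step count to the effective batch size and epoch count:

```python
# Sanity check: Stage 2 training steps vs. effective batch size and epochs.
per_device_batch = 8
grad_accum = 4
effective_batch = per_device_batch * grad_accum  # 32, as in the table
epochs = 8
steps = 35_560

# Number of samples implied by the reported step count.
implied_samples = steps * effective_batch // epochs
print(implied_samples)  # 142240, matching the "~141K samples" figure
```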

Features inherited from Qwen3-ASR

  • Streaming/real-time via vLLM
  • Song detection (background music)
  • Word-level timestamps
  • 30+ languages (Danish optimized)
  • Up to 20 min of audio per request

Citation

@misc{Milo-ASR,
  author = {Rønnelund, Mathias Oliver Valdbjørn},
  title = {Milo-ASR: Danish ASR Model based on Qwen3-ASR},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/pluttodk/Milo-ASR}
}

Acknowledgements
