---
language:
- da
- en
license: apache-2.0
base_model: ibm-granite/granite-speech-4.1-2b-plus
datasets:
- CoRal-project/coral-v3
tags:
- automatic-speech-recognition
- speech-translation
- danish
- speech-llm
pipeline_tag: automatic-speech-recognition
---

# Milo-ASR-v2

**A promptable Danish speech model.** Unlike a plain transcriber, you tell
Milo-ASR-v2 *what to do* with the audio: transcribe it, translate it to
English, or return structured JSON. A single model handles all three tasks,
and the instruction you give it selects which one.

It is a fine-tune of
[`ibm-granite/granite-speech-4.1-2b-plus`](https://huggingface.co/ibm-granite/granite-speech-4.1-2b-plus)
on the Danish [CoRal-v3](https://huggingface.co/datasets/CoRal-project/coral-v3)
corpus.

> If you only need the lowest possible Danish transcription error rate, a
> dedicated Whisper-style Danish ASR model will be more accurate. Milo-ASR-v2 is
> for when you want **one model that follows instructions** about the audio,
> plus an easy path to **speaker-attributed transcripts** (see the speaker
> diarization section below).

## How it works

```mermaid
flowchart LR
    A([Danish audio]) --> M{{Milo-ASR-v2}}
    P([Your instruction]) --> M
    M --> T([Danish transcript])
    M --> E([English translation])
    M --> J([JSON output])
```

The same audio yields a transcript, an English translation, or JSON,
**selected by the instruction**.

## What it can do

All outputs below are **real model outputs**, not hand-written:

| Instruction | Output |
|---|---|
| `transskriber talen til dansk tekst.` | `Mest beværgelsesværdige er områderne med faciliteter for sportsgrene indenfor de alpine discipliner samt downhill mountainbike i år` |
| `Transcribe the speech and translate it into English.` | `Most popular is the areas with facilities for sports activities ... downhill mountain bike` |
| `Transcribe and return JSON like {"transcript": "..."}.` | `{"transcription": "Mest beværgelsesværdige er områderne med ..."}` |

These outputs are reproduced verbatim, errors included: in the JSON row, for
instance, the model answered with the key `transcription` rather than the
requested `transcript`.

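Because the key in a JSON reply can differ from the one requested (the JSON row in the table above returns `transcription` where `transcript` was asked for), downstream code may want to parse replies defensively. The following is a minimal sketch, not part of the model's API: `extract_transcript` and its `keys` parameter are hypothetical names introduced here for illustration, and the list of accepted keys is an assumption based only on that one example.

```python
import json


def extract_transcript(raw, keys=("transcript", "transcription")):
    """Return the transcript string from a JSON-style model reply, or None.

    Hypothetical helper: accepts any of the given key names and tolerates
    extra text before or after the JSON object.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None  # no JSON object in the reply
    try:
        payload = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None  # malformed or truncated JSON
    if not isinstance(payload, dict):
        return None
    for key in keys:
        value = payload.get(key)
        if isinstance(value, str):
            return value
    return None
```

Returning `None` rather than raising lets the caller decide whether to retry with a different instruction or fall back to a plain transcription prompt.
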
## Quickstart

```python
import librosa
import soundfile as sf
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

MODEL = "pluttodk/Milo-ASR-v2"
processor = AutoProcessor.from_pretrained(MODEL)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL, dtype=torch.bfloat16, device_map="auto").eval()

# System prompt sent with every request.
SYSTEM = ("Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\n"
          "You are Granite, developed by IBM. You are a helpful AI assistant")


def run(wav_path, instruction, max_new_tokens=200):
    # Load the clip, downmix to mono, and resample to 16 kHz.
    audio, sr = sf.read(wav_path, dtype="float32", always_2d=False)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    if sr != 16000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    audio = torch.from_numpy(audio).unsqueeze(0)

    # The <|audio|> placeholder marks where the audio goes in the user turn.
    chat = [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"<|audio|> {instruction}"}]
    text = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = processor(text, audio, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=5,
                         repetition_penalty=1.1, no_repeat_ngram_size=4, early_stopping=True)
    # Decode only the newly generated tokens, skipping the prompt.
    return processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)


print(run("clip.wav", "transskriber talen til dansk tekst."))                     # Danish transcript
print(run("clip.wav", "Transcribe the speech and translate it into English."))    # English translation
print(run("clip.wav", 'Transcribe and return JSON like {"transcript": "..."}.'))  # JSON
```

## Speaker diarization: "who said what"

Milo-ASR-v2's own inline speaker tags are not reliable. For dependable
**speaker-attributed transcripts**, run a small diarization front-end (voice
activity detection → speaker embeddings → clustering) and let Milo-ASR-v2
transcribe each speaker turn:

```mermaid
flowchart LR
    A([Audio]) --> V[Voice activity detection]
    V --> S[Speaker embeddings + clustering]
    S --> G[Speaker turns]
    G --> R{{Milo-ASR-v2 per turn}}
    R --> O(["Speaker 1: ... Speaker 2: ..."])
```

A complete example using only open tools
(`pip install torch transformers silero-vad scikit-learn librosa soundfile`);
the ReDimNet speaker embedder is fetched through `torch.hub` when the script runs:

```python
import librosa
import numpy as np
import soundfile as sf
import torch
from silero_vad import get_speech_timestamps, load_silero_vad
from sklearn.cluster import AgglomerativeClustering
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

SR = 16000
SYSTEM = ("Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\n"
          "You are Granite, developed by IBM. You are a helpful AI assistant")
NUM_SPEAKERS = 2  # set the known count, or None to auto-estimate
DISTANCE_THRESHOLD = 0.7  # only used when NUM_SPEAKERS is None; untuned placeholder value

# load mono 16 kHz audio
wav, sr = sf.read("meeting.wav", dtype="float32", always_2d=False)
if wav.ndim > 1:
    wav = wav.mean(axis=1)
if sr != SR:
    wav = librosa.resample(wav, orig_sr=sr, target_sr=SR)

# 1) voice activity detection -> speech segments (start/end in seconds)
vad = load_silero_vad()
segments = get_speech_timestamps(torch.from_numpy(wav), vad, sampling_rate=SR, return_seconds=True)

# 2) speaker embedding per segment (ReDimNet) -> clustering
embedder = torch.hub.load("IDRnD/ReDimNet", "ReDimNet",
                          model_name="b2", train_type="ptn", dataset="vox2", trust_repo=True).eval()


def embed(a):
    with torch.no_grad():
        e = embedder(torch.from_numpy(a).float().unsqueeze(0))[0].cpu().numpy()
    return e / (np.linalg.norm(e) + 1e-9)  # L2-normalise the embedding


emb = np.stack([embed(wav[int(s["start"]*SR):int(s["end"]*SR)]) for s in segments])
# scikit-learn needs exactly one of n_clusters / distance_threshold to be set,
# so the threshold is only passed when the speaker count is unknown.
labels = AgglomerativeClustering(
    n_clusters=NUM_SPEAKERS,
    distance_threshold=DISTANCE_THRESHOLD if NUM_SPEAKERS is None else None,
    metric="cosine", linkage="average").fit_predict(emb)

# 3) transcribe each turn with Milo-ASR-v2
processor = AutoProcessor.from_pretrained("pluttodk/Milo-ASR-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "pluttodk/Milo-ASR-v2", dtype=torch.bfloat16, device_map="auto").eval()


def transcribe(a):
    chat = [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": "<|audio|> transskriber talen til dansk tekst."}]
    text = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inp = processor(text, torch.from_numpy(a).float().unsqueeze(0), return_tensors="pt").to(model.device)
    out = model.generate(**inp, max_new_tokens=200, num_beams=5, early_stopping=True)
    return processor.tokenizer.decode(out[0, inp["input_ids"].shape[-1]:], skip_special_tokens=True)


# 4) print speaker-attributed transcript in time order
speaker_no = {}  # cluster label -> speaker number, in order of first appearance
for seg, lab in sorted(zip(segments, labels), key=lambda x: x[0]["start"]):
    n = speaker_no.setdefault(int(lab), len(speaker_no) + 1)
    turn = wav[int(seg["start"]*SR):int(seg["end"]*SR)]
    print(f"[Speaker {n}]: {transcribe(turn)}")
```

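The recipe above transcribes every VAD segment on its own, so one speaker talking across several short pauses produces several output lines. One optional refinement, which is not part of the original recipe, is to merge consecutive segments that share a cluster label into a single turn before transcription. The sketch below assumes segments shaped like the `start`/`end` dictionaries used above; `merge_turns` and `max_gap` are hypothetical names, and the 0.5 s default is an illustrative value rather than a tuned one.

```python
def merge_turns(segments, labels, max_gap=0.5):
    """Merge time-ordered segments of the same speaker into longer turns.

    Two neighbouring segments are merged when they carry the same label and
    the silence between them is at most `max_gap` seconds.
    """
    turns = []
    for seg, lab in sorted(zip(segments, labels), key=lambda x: x[0]["start"]):
        lab = int(lab)
        last = turns[-1] if turns else None
        if last and last["speaker"] == lab and seg["start"] - last["end"] <= max_gap:
            last["end"] = max(last["end"], seg["end"])  # extend the current turn
        else:
            turns.append({"start": seg["start"], "end": seg["end"], "speaker": lab})
    return turns


# Toy input: two speakers, four VAD segments.
example_segments = [{"start": 0.0, "end": 1.2}, {"start": 1.4, "end": 2.0},
                    {"start": 2.1, "end": 3.0}, {"start": 5.0, "end": 6.0}]
example_labels = [0, 0, 1, 1]
example_turns = merge_turns(example_segments, example_labels)
```

Each resulting turn can then be sliced from the waveform and transcribed in place of the raw segment. Longer turns give the model more context, but they can exceed what `max_new_tokens=200` covers, so that limit may need raising.
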
## Performance

Danish ASR on the full CoRal-v3 **test** set (strict normalisation):

| Split | WER | CER |
|---|---:|---:|
| read-aloud | 18.1% | 8.5% |
| conversation | 33.2% | 19.9% |
| weighted | 25.3% | 14.0% |

On clean read-aloud validation data it reaches ~16.8% WER / 6.9% CER. A dedicated
Whisper-style Danish ASR model is more accurate on raw transcription; pick
Milo-ASR-v2 for its promptable, multi-task behaviour.

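For readers who want to score their own clips, WER and CER are the word-level and character-level edit distance divided by the reference length. The sketch below is a plain-Python illustration of those definitions, not the evaluation script behind the table. The `normalise` function (lowercasing and stripping punctuation) is an assumption, since this card does not specify the exact "strict normalisation" it used.

```python
import re


def normalise(text):
    """Illustrative normalisation (an assumption): lowercase, drop punctuation."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; the reference must contain at least one word."""
    ref_words = normalise(reference).split()
    hyp_words = normalise(hypothesis).split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; the reference must be non-empty after normalisation."""
    ref_chars, hyp_chars = normalise(reference), normalise(hypothesis)
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Corpus-level figures are typically obtained by summing edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates. Scores are only comparable when the same normalisation is applied to both reference and hypothesis.
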
## Limitations

- **Higher word error rate than a dedicated Whisper Danish ASR model.** See the
  Performance section above.
- **English translation is serviceable, not publication-grade.**
- **Inline speaker tags are unreliable.** Use the diarization recipe above for
  speaker-attributed output.
- Trained on read-aloud and conversational Danish; very noisy or far-field audio
  is out of distribution.

## How it was trained

Milo-ASR-v2 was adapted to Danish from IBM Granite-Speech in two stages: first a
Danish ASR fine-tune on CoRal-v3, then a mixed-task instruction fine-tune
(transcribe / translate / structured output) so it follows instructions while
keeping its transcription quality.

## License

Apache-2.0. Base model: IBM Granite-Speech-4.1-2B-Plus (Apache-2.0). Training
data: CoRal-v3.