	---
	language:
	- da
	- en
	license: apache-2.0
	base_model: ibm-granite/granite-speech-4.1-2b-plus
	datasets:
	- CoRal-project/coral-v3
	tags:
	- automatic-speech-recognition
	- speech-translation
	- danish
	- speech-llm
	pipeline_tag: automatic-speech-recognition
	---

	# Milo-ASR-v2

	A promptable Danish speech model. Unlike a plain transcriber, you tell
	Milo-ASR-v2 what to do with the audio — transcribe it, translate it to
	English, or return structured JSON — all from a single model, chosen by the
	instruction you give it.

	It's a fine-tune of
	[`ibm-granite/granite-speech-4.1-2b-plus`](https://huggingface.co/ibm-granite/granite-speech-4.1-2b-plus)
	on the Danish [CoRal-v3](https://huggingface.co/datasets/CoRal-project/coral-v3)
	corpus.

	> If you only need the lowest possible Danish transcription error rate, a
	> dedicated Whisper-style Danish ASR model will be more accurate. Milo-ASR-v2 is
	> for when you want one model that follows instructions about the audio —
	> and an easy path to speaker-attributed transcripts (see below).

	## How it works

	```mermaid
	flowchart LR
	A([Danish audio]) --> M{{Milo-ASR-v2}}
	P([Your instruction]) --> M
	M --> T([Danish transcript])
	M --> E([English translation])
	M --> J([JSON output])
	```

	The same audio yields a transcript, an English translation, or JSON —
	selected by the instruction.

	## What it can do

	All outputs below are real model outputs, not hand-written:

	| Instruction | Output |
	|---|---|
	| `transskriber talen til dansk tekst.` | `Mest beværgelsesværdige er områderne med faciliteter for sportsgrene indenfor de alpine discipliner samt downhill mountainbike i år` |
	| `Transcribe the speech and translate it into English.` | `Most popular is the areas with facilities for sports activities ... downhill mountain bike` |
	| `Transcribe and return JSON like {"transcript": "..."}.` | `{"transcription": "Mest beværgelsesværdige er områderne med ..."}` |
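
In the last row of the table, the instruction asks for the key `transcript` but the model answered with `transcription`, so downstream code may want to parse the JSON output tolerantly. The following is a minimal sketch, assuming the output contains a single JSON object, possibly with stray text around it. The helper name `extract_transcript` and the `ACCEPTED_KEYS` list are illustrative and not part of the model's API.

```python
import json
import re

# Keys seen in the table above; this list is an assumption, extend it as needed.
ACCEPTED_KEYS = ("transcript", "transcription")

def extract_transcript(model_output):
    """Return the transcript string from a JSON-style model output, or None."""
    # Take the outermost {...} span in case the model adds text around the JSON.
    match = re.search(r"\{.*\}", model_output, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Accept the first known key that holds a string value.
    for key in ACCEPTED_KEYS:
        value = data.get(key)
        if isinstance(value, str):
            return value
    return None
```

Returning `None` instead of raising lets the caller decide whether to retry with a different instruction or fall back to plain transcription.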

	## Quickstart

	```python
	import torch, soundfile as sf, librosa
	from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

	MODEL = "pluttodk/Milo-ASR-v2"
	processor = AutoProcessor.from_pretrained(MODEL)
	model = AutoModelForSpeechSeq2Seq.from_pretrained(
	    MODEL, dtype=torch.bfloat16, device_map="auto").eval()

	SYSTEM = ("Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\n"
	          "You are Granite, developed by IBM. You are a helpful AI assistant")

	def run(wav_path, instruction, max_new_tokens=200):
	    # Load the audio, downmix to mono, and resample to 16 kHz.
	    audio, sr = sf.read(wav_path, dtype="float32", always_2d=False)
	    if audio.ndim > 1:
	        audio = audio.mean(axis=1)
	    if sr != 16000:
	        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
	    audio = torch.from_numpy(audio).unsqueeze(0)
	    # The <|audio|> placeholder marks where the audio goes in the prompt.
	    chat = [{"role": "system", "content": SYSTEM},
	            {"role": "user", "content": f"<|audio|> {instruction}"}]
	    text = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
	    inputs = processor(text, audio, return_tensors="pt").to(model.device)
	    out = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=5,
	                         repetition_penalty=1.1, no_repeat_ngram_size=4, early_stopping=True)
	    # Decode only the newly generated tokens, skipping the prompt.
	    return processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

	print(run("clip.wav", "transskriber talen til dansk tekst."))                     # Danish transcript
	print(run("clip.wav", "Transcribe the speech and translate it into English."))    # English translation
	print(run("clip.wav", 'Transcribe and return JSON like {"transcript": "..."}.'))  # JSON
	```

	## Speaker diarization — "who said what"

	Milo-ASR-v2's own inline speaker tags are not reliable. For dependable
	speaker-attributed transcripts, run a small diarization front-end (voice
	activity detection → speaker embeddings → clustering) and let Milo-ASR-v2
	transcribe each speaker turn:

	```mermaid
	flowchart LR
	A([Audio]) --> V[Voice activity detection]
	V --> S[Speaker embeddings + clustering]
	S --> G[Speaker turns]
	G --> R{{Milo-ASR-v2 per turn}}
	R --> O(["Speaker 1: ... Speaker 2: ..."])
	```

	A complete, self-contained example using only open tools
	(`pip install torch transformers silero-vad scikit-learn librosa soundfile`):

	```python
	import torch, numpy as np, librosa, soundfile as sf
	from sklearn.cluster import AgglomerativeClustering
	from silero_vad import load_silero_vad, get_speech_timestamps
	from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

	SR = 16000
	SYSTEM = ("Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\n"
	          "You are Granite, developed by IBM. You are a helpful AI assistant")
	# Set the known speaker count. With None, AgglomerativeClustering also
	# needs a distance_threshold to estimate the number of clusters.
	NUM_SPEAKERS = 2

	# load mono 16 kHz audio
	wav, sr = sf.read("meeting.wav", dtype="float32", always_2d=False)
	if wav.ndim > 1:
	    wav = wav.mean(axis=1)
	if sr != SR:
	    wav = librosa.resample(wav, orig_sr=sr, target_sr=SR)

	# 1) voice activity detection -> speech segments (start/end in seconds)
	vad = load_silero_vad()
	segments = get_speech_timestamps(torch.from_numpy(wav), vad, sampling_rate=SR, return_seconds=True)

	# 2) speaker embedding per segment (ReDimNet) -> clustering
	embedder = torch.hub.load("IDRnD/ReDimNet", "ReDimNet",
	                          model_name="b2", train_type="ptn", dataset="vox2", trust_repo=True).eval()

	def embed(a):
	    with torch.no_grad():
	        e = embedder(torch.from_numpy(a).float().unsqueeze(0))[0].cpu().numpy()
	    return e / (np.linalg.norm(e) + 1e-9)  # L2-normalise for cosine clustering

	emb = np.stack([embed(wav[int(s["start"] * SR):int(s["end"] * SR)]) for s in segments])
	labels = AgglomerativeClustering(n_clusters=NUM_SPEAKERS, metric="cosine",
	                                 linkage="average").fit_predict(emb)

	# 3) transcribe each turn with Milo-ASR-v2
	processor = AutoProcessor.from_pretrained("pluttodk/Milo-ASR-v2")
	model = AutoModelForSpeechSeq2Seq.from_pretrained(
	    "pluttodk/Milo-ASR-v2", dtype=torch.bfloat16, device_map="auto").eval()

	def transcribe(a):
	    chat = [{"role": "system", "content": SYSTEM},
	            {"role": "user", "content": "<|audio|> transskriber talen til dansk tekst."}]
	    text = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
	    inp = processor(text, torch.from_numpy(a).float().unsqueeze(0), return_tensors="pt").to(model.device)
	    out = model.generate(**inp, max_new_tokens=200, num_beams=5, early_stopping=True)
	    return processor.tokenizer.decode(out[0, inp["input_ids"].shape[-1]:], skip_special_tokens=True)

	# 4) print speaker-attributed transcript in time order
	speaker_no = {}  # cluster label -> speaker number in order of first appearance
	for seg, lab in sorted(zip(segments, labels), key=lambda x: x[0]["start"]):
	    n = speaker_no.setdefault(int(lab), len(speaker_no) + 1)
	    turn = wav[int(seg["start"] * SR):int(seg["end"] * SR)]
	    print(f"[Speaker {n}]: {transcribe(turn)}")
	```
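
The recipe above transcribes every VAD segment separately, so one speaker talking across several segments produces several short lines. An optional refinement is to merge adjacent segments from the same speaker into one turn before transcription. The sketch below assumes `segments` and `labels` have the shapes produced above (dicts with `start`/`end` in seconds, and one integer label per segment). The helper `merge_turns` and the `max_gap` default of 0.5 s are illustrative choices, not part of the original recipe.

```python
def merge_turns(segments, labels, max_gap=0.5):
    """Merge time-adjacent segments with the same speaker label into turns.

    max_gap is the longest pause (in seconds) allowed inside one turn;
    0.5 is an assumed default and should be tuned on real recordings.
    """
    turns = []
    for seg, lab in sorted(zip(segments, labels), key=lambda x: x[0]["start"]):
        lab = int(lab)
        last = turns[-1] if turns else None
        if last and last["speaker"] == lab and seg["start"] - last["end"] <= max_gap:
            # Same speaker after a short pause: extend the current turn.
            last["end"] = max(last["end"], seg["end"])
        else:
            # New speaker or long pause: start a new turn.
            turns.append({"start": seg["start"], "end": seg["end"], "speaker": lab})
    return turns
```

Each returned turn can be sliced as `wav[int(t["start"] * SR):int(t["end"] * SR)]` and passed to `transcribe`. Merged turns are longer than single segments, so the `max_new_tokens=200` limit may need to be raised.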

	## Performance

	Danish ASR on the full CoRal-v3 test set (strict normalisation):

	| Split | WER | CER |
	|---|---:|---:|
	| read-aloud | 18.1% | 8.5% |
	| conversation | 33.2% | 19.9% |
	| weighted | 25.3% | 14.0% |

	On clean read-aloud validation it reaches ~16.8% WER / 6.9% CER. A dedicated
	Whisper-style Danish ASR model is more accurate on raw transcription; pick
	Milo-ASR-v2 for its promptable, multi-task behaviour.
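
For readers who want to measure WER and CER on their own audio, here is a minimal sketch of both metrics as edit distance divided by reference length. The `normalise` function is an assumption (lowercasing, removing punctuation, collapsing whitespace). It is not necessarily the "strict normalisation" used for the numbers above, so scores from this sketch may not be directly comparable.

```python
import re

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def normalise(text):
    # Assumed normalisation: lowercase, drop punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def wer(reference, hypothesis):
    """Word error rate of one utterance (reference must be non-empty)."""
    ref, hyp = normalise(reference).split(), normalise(hypothesis).split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    """Character error rate of one utterance (reference must be non-empty)."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    return edit_distance(ref, hyp) / len(ref)
```

These functions score a single utterance. A corpus-level figure is usually computed by summing edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.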

	## Limitations

	- Higher word error rate than a dedicated Whisper-style Danish ASR model (see Performance).
	- English translation is serviceable, not publication-grade.
	- Inline speaker tags are unreliable — use the diarization recipe above for
	speaker-attributed output.
	- Trained on read-aloud and conversational Danish; very noisy or far-field audio
	is out of distribution.

	## How it was trained

	Milo-ASR-v2 was adapted to Danish from IBM Granite-Speech in two stages: first a
	Danish ASR fine-tune on CoRal-v3, then a mixed-task instruction fine-tune
	(transcribe / translate / structured output) so it follows instructions while
	keeping its transcription quality.

	## License

	Apache-2.0. Base model: IBM Granite-Speech-4.1-2B-Plus (Apache-2.0). Training
	data: CoRal-v3.