VibeVoice-ASR — Darija (Moroccan Arabic) LoRA

Fine-tuned PEFT LoRA adapter for microsoft/VibeVoice-ASR, targeting Darija (Moroccan Arabic / ISO 639-3: ary) — a low-resource dialect with rich code-switching between Arabic, French, Amazigh, and Spanish that is largely absent from the base model's training data.

This adapter was developed as a community contribution to the VibeVoice project to demonstrate fine-tuning on a new low-resource language.

Training details

Setting	Value
Base model	microsoft/VibeVoice-ASR (Qwen2.5-7B backbone)
Method	QLoRA 4-bit (nf4), rank 16, alpha 32
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Epochs	5 (staged curriculum)
Effective batch size	8 (per_device=1, grad_accum=8)
Learning rate	1e-4, cosine schedule with 10% warmup
Train samples	17,984
Val samples	997
GPU	NVIDIA Quadro RTX 5000 (16 GB VRAM)
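
The batch-size and learning-rate rows can be worked through numerically. The sketch below reads the schedule as linear warmup over the first 10% of optimizer steps followed by cosine decay. It assumes each epoch passes over all 17,984 training samples once and that the rate decays to zero, neither of which the card states. `lr_at` and the constant names are illustrative and do not come from the training script.

```python
import math

TRAIN_SAMPLES = 17_984
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 8
EPOCHS = 5
PEAK_LR = 1e-4
WARMUP_RATIO = 0.10

# One optimizer step consumes per_device * grad_accum samples.
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM
# Assumption: each epoch sees every training sample exactly once.
STEPS_PER_EPOCH = math.ceil(TRAIN_SAMPLES / EFFECTIVE_BATCH)
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS
WARMUP_STEPS = int(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(EFFECTIVE_BATCH, STEPS_PER_EPOCH, TOTAL_STEPS, WARMUP_STEPS)
    for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, lr_at(step))
```

Under these assumptions the run has 2,248 optimizer steps per epoch. The real step count may differ if the staged curriculum resamples or subsets the data per stage.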

Staged training curriculum

Stage	Arabic weight	Darija weight	Epochs
1 — Arabic transfer	70%	30%	2
2 — Darija focus	30%	70%	2
3 — Darija refinement	0%	100%	1
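
One way to realise these stage weights is to treat them as sampling probabilities when drawing training examples. The sketch below does that with the standard library only. The card does not say whether the weights were applied as sampling probabilities, loss weights, or fixed dataset proportions, so this reading is an assumption, and `STAGES` and `sample_stage` are hypothetical names.

```python
import random

# (stage name, Arabic weight, Darija weight, epochs), copied from the table above.
STAGES = [
    ("arabic_transfer", 0.7, 0.3, 2),
    ("darija_focus", 0.3, 0.7, 2),
    ("darija_refinement", 0.0, 1.0, 1),
]


def sample_stage(arabic, darija, arabic_weight, n, seed=0):
    """Draw n examples, picking the Arabic pool with probability arabic_weight."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        pool = arabic if rng.random() < arabic_weight else darija
        batch.append(rng.choice(pool))
    return batch


if __name__ == "__main__":
    arabic = [("ar", i) for i in range(100)]   # placeholder examples
    darija = [("ary", i) for i in range(100)]  # placeholder examples
    for name, w_ar, _w_ary, epochs in STAGES:
        drawn = sample_stage(arabic, darija, w_ar, n=1000)
        share = sum(1 for lang, _ in drawn if lang == "ar") / len(drawn)
        print(f"{name}: epochs={epochs}, Arabic share={share:.2f}")
```

Sampling by probability keeps the mix approximately at the target ratio while still shuffling both sources together within a stage.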

Qualitative observations

The base microsoft/VibeVoice-ASR model struggles with Darija in several ways:

Dialect correction: it "fixes" Darija words toward Modern Standard Arabic (MSA), so words such as labas, bezzaf, kifash, and mezyan get substituted or distorted
Code-switching: frequent French insertions in Darija speech are mishandled
Phonetics: Moroccan emphatic consonants and vowel patterns differ from Gulf/Levantine Arabic in the base training data

After fine-tuning, the model handles dialectal vocabulary, code-switching with French, and Moroccan phonetic patterns noticeably better in qualitative comparisons against the base model.

Formal WER/CER evaluation on the 997-sample validation set is in progress — numbers will be added here soon.
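
For reference, WER and CER are both edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The sketch below is a plain-Python illustration of that definition, not the evaluation script used for this adapter. The function names and toy strings are illustrative, and no text normalization is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; spaces are counted as characters here."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    ref = "labas bezzaf"   # toy reference, not from the validation set
    hyp = "labas bezaf"    # toy hypothesis with one misspelled word
    print(f"WER={wer(ref, hyp):.3f} CER={cer(ref, hyp):.3f}")
```

Because Darija is written in both Arabic and Latin script, any reported WER/CER depends heavily on how references and hypotheses are normalized before scoring.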

Usage

import torch
from peft import PeftModel
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor

BASE  = "microsoft/VibeVoice-ASR"
LORA  = "Mohcinimohamed/vibevoice-asr-darija-lora"

processor = VibeVoiceASRProcessor.from_pretrained(
    BASE, language_model_pretrained_name="Qwen/Qwen2.5-7B"
)

model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="cuda"
)
model = PeftModel.from_pretrained(model, LORA)
model.eval()

# Transcribe a Darija audio file
inputs = processor(
    audio="darija_speech.wav",
    return_tensors="pt",
    add_generation_prompt=True,
    context_info="Moroccan Darija, labas, bezzaf, dyal, kifash, mezyan",
)
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,  # greedy decoding; temperature is unused when sampling is off
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Strip prompt tokens
generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
text = processor.decode(generated_ids, skip_special_tokens=True)
print(text)

About Darija

Moroccan Darija (العربية المغربية) is spoken by ~36 million people in Morocco. It is a highly mixed variety with:

Arabic (Maghrebi) phonological base
French, Spanish, and Amazigh (Tamazight) loanwords
Active code-switching, especially in educated/urban speech
No standardized orthography — both Arabic script and Latin (Franco-Arab) are used

Standard Arabic ASR systems perform poorly on Darija. This adapter is a step toward dedicated Moroccan speech technology.

Citation

If you use this adapter, please also cite the original VibeVoice paper:

@misc{vibevoice2025,
  title  = {VibeVoice: Real-time Voice Interaction with Multimodal LLMs},
  author = {Microsoft Research},
  year   = {2025},
  url    = {https://github.com/microsoft/VibeVoice}
}

Author

Mohcinimohamed — community contribution to microsoft/VibeVoice
