VibeVoice-ASR: Darija (Moroccan Arabic) LoRA

Fine-tuned PEFT LoRA adapter for microsoft/VibeVoice-ASR, targeting Darija (Moroccan Arabic, ISO 639-3: ary), a low-resource dialect with rich code-switching between Arabic, French, Amazigh, and Spanish that is largely absent from the base model's training data.

This adapter was developed as a community contribution to the VibeVoice project to demonstrate fine-tuning on a new low-resource language.


Training details

Setting                 Value
Base model              microsoft/VibeVoice-ASR (Qwen2.5-7B backbone)
Method                  QLoRA 4-bit (nf4), rank 16, alpha 32
Target modules          q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Epochs                  5 (staged curriculum)
Effective batch size    8 (per_device=1, grad_accum=8)
Learning rate           1e-4, cosine schedule with 10% warmup
Train samples           17,984
Val samples             997
GPU                     NVIDIA Quadro RTX 5000 (16 GB VRAM)
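
As a rough check on the numbers in the table, the sketch below derives the effective batch size, the optimizer step count, and the warmup length, and evaluates a learning-rate curve. It assumes that every epoch visits all 17,984 training samples and that the schedule is a linear warmup followed by cosine decay to zero. Neither detail is confirmed by the table, and the staged curriculum may change the per-epoch sample count. All names (`lr_at`, `steps_per_epoch`, and so on) are illustrative, not taken from the training script.

```python
import math

# Values copied from the training table.
TRAIN_SAMPLES = 17_984
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 8
EPOCHS = 5
PEAK_LR = 1e-4
WARMUP_PERCENT = 10

# One optimizer step consumes per_device * grad_accum samples (single GPU).
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM                  # 8
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)     # 2248
total_steps = steps_per_epoch * EPOCHS                           # 11240
warmup_steps = total_steps * WARMUP_PERCENT // 100               # 1124


def lr_at(step):
    """Learning rate at an optimizer step.

    Assumption: linear warmup to PEAK_LR, then cosine decay to zero.
    """
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))
```

Under these assumptions the learning rate peaks at step 1,124 and reaches zero at step 11,240.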

Staged training curriculum

Stage                   Arabic weight   Darija weight   Epochs
1: Arabic transfer      70%             30%             2
2: Darija focus         30%             70%             2
3: Darija refinement    0%              100%            1
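
One way to read the curriculum is as a per-sample mixing ratio between an Arabic corpus and a Darija corpus. The sketch below takes that reading. It is an assumption, since the weights could equally describe loss weighting or dataset proportions, and `STAGES`, `sample_epoch`, and `build_schedule` are hypothetical names rather than the author's code.

```python
import random

# Stage weights and epoch counts copied from the curriculum table.
STAGES = [
    {"name": "Arabic transfer", "arabic": 0.7, "darija": 0.3, "epochs": 2},
    {"name": "Darija focus", "arabic": 0.3, "darija": 0.7, "epochs": 2},
    {"name": "Darija refinement", "arabic": 0.0, "darija": 1.0, "epochs": 1},
]


def sample_epoch(stage, arabic_pool, darija_pool, n, rng):
    """Draw n items, choosing the source corpus per item by the stage weight."""
    items = []
    for _ in range(n):
        # rng.random() is in [0, 1), so a Darija weight of 1.0 always picks Darija.
        pool = darija_pool if rng.random() < stage["darija"] else arabic_pool
        items.append(rng.choice(pool))
    return items


def build_schedule(arabic_pool, darija_pool, n_per_epoch, seed=0):
    """Return one sampled list of items per epoch, in curriculum order."""
    rng = random.Random(seed)
    schedule = []
    for stage in STAGES:
        for _ in range(stage["epochs"]):
            schedule.append(
                sample_epoch(stage, arabic_pool, darija_pool, n_per_epoch, rng)
            )
    return schedule
```

Calling `build_schedule(arabic_items, darija_items, n_per_epoch=1000)` returns five lists, the last of which contains only Darija items.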

Qualitative observations

The base microsoft/VibeVoice-ASR model struggles with Darija in several ways:

  • Dialect correction: it "fixes" Darija words toward Modern Standard Arabic (MSA); for example, labas, bezzaf, kifash, and mezyan get substituted or distorted
  • Code-switching: frequent French insertions in Darija speech are mishandled
  • Phonetics: Moroccan emphatic consonants and vowel patterns differ from Gulf/Levantine Arabic in the base training data

After fine-tuning, the model handles dialectal vocabulary, code-switching with French, and Moroccan phonetic patterns noticeably better than the base model. These are qualitative impressions only.

Formal WER/CER evaluation on the 997-sample validation set is in progress; numbers will be added here soon.
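
For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and CER are commonly computed from Levenshtein edit distance. It is not the evaluation script used for this adapter. It applies no text normalization, which matters for Darija because the same word can be spelled several ways, and it assumes a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

With the reference "labas kifash mezyan" and the hypothesis "labas kifach mezyan", one of three words is wrong, so the WER is 1/3, while only one of 19 characters differs, so the CER is 1/19.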


Usage

import torch
from peft import PeftModel
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor

BASE  = "microsoft/VibeVoice-ASR"
LORA  = "Mohcinimohamed/vibevoice-asr-darija-lora"

processor = VibeVoiceASRProcessor.from_pretrained(
    BASE, language_model_pretrained_name="Qwen/Qwen2.5-7B"
)

model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="cuda"
)
model = PeftModel.from_pretrained(model, LORA)
model.eval()

# Transcribe a Darija audio file
inputs = processor(
    audio="darija_speech.wav",
    return_tensors="pt",
    add_generation_prompt=True,
    context_info="Moroccan Darija, labas, bezzaf, dyal, kifash, mezyan",
)
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,  # greedy decoding; temperature is not used when sampling is off
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Strip prompt tokens
generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
text = processor.decode(generated_ids, skip_special_tokens=True)
print(text)

About Darija

Moroccan Darija (العربية المغربية) is spoken by ~36 million people in Morocco. It is a highly mixed variety with:

  • Arabic (Maghrebi) phonological base
  • French, Spanish, and Amazigh (Tamazight) loanwords
  • Active code-switching, especially in educated/urban speech
  • No standardized orthography: both Arabic script and Latin (Franco-Arab) are used

Standard Arabic ASR systems perform poorly on Darija. This adapter is a step toward dedicated Moroccan speech technology.


Citation

If you use this adapter, please also cite the original VibeVoice paper:

@misc{vibevoice2025,
  title  = {VibeVoice: Real-time Voice Interaction with Multimodal LLMs},
  author = {Microsoft Research},
  year   = {2025},
  url    = {https://github.com/microsoft/VibeVoice}
}

Author

Mohcinimohamed, community contribution to microsoft/VibeVoice
