VibeVoice-ASR - Darija (Moroccan Arabic) LoRA
Fine-tuned PEFT LoRA adapter for microsoft/VibeVoice-ASR, targeting Darija (Moroccan Arabic, ISO 639-3: ary), a low-resource dialect with rich code-switching between Arabic, French, Amazigh, and Spanish that is largely absent from the base model's training data.
This adapter was developed as a community contribution to the VibeVoice project to demonstrate fine-tuning on a new low-resource language.
Training details
| Setting | Value |
|---|---|
| Base model | microsoft/VibeVoice-ASR (Qwen2.5-7B backbone) |
| Method | QLoRA 4-bit (nf4), rank 16, alpha 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 5 (staged curriculum) |
| Effective batch size | 8 (per_device=1, grad_accum=8) |
| Learning rate | 1e-4, cosine schedule with 10% warmup |
| Train samples | 17,984 |
| Val samples | 997 |
| GPU | NVIDIA Quadro RTX 5000 (16 GB VRAM) |
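To make the schedule in the table concrete, here is a minimal sketch of the step counts and learning-rate curve those settings imply. It assumes a single GPU, that each epoch visits all 17,984 training samples once (the staged curriculum may change this), and that "cosine warmup (10%)" means linear warmup over the first 10% of optimizer steps followed by cosine decay to zero. The helper name `lr_at` is hypothetical and this is not the author's training script.

```python
import math

# Values taken from the training table above.
TRAIN_SAMPLES = 17_984
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 8
EPOCHS = 5
PEAK_LR = 1e-4
WARMUP_RATIO = 0.10

# Assumption: one GPU, so effective batch = per-device batch * accumulation steps.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
# Assumption: every epoch visits all training samples exactly once.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("effective batch:", effective_batch)
    print("steps per epoch:", steps_per_epoch)
    print("total steps:", total_steps, "warmup steps:", warmup_steps)
    for step in (0, warmup_steps, total_steps // 2, total_steps):
        print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the run has 2,248 optimizer steps per epoch and 11,240 in total, with the peak rate of 1e-4 reached at step 1,124.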
Staged training curriculum
| Stage | Arabic weight | Darija weight | Epochs |
|---|---|---|---|
| 1 - Arabic transfer | 70% | 30% | 2 |
| 2 - Darija focus | 30% | 70% | 2 |
| 3 - Darija refinement | 0% | 100% | 1 |
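The card does not say how the stage weights were applied, so the following is only one plausible reading: each training example in a stage is drawn from the Arabic pool or the Darija pool with the listed probabilities. The `STAGES` structure and the `mix_epoch` helper are hypothetical names introduced for illustration, and the pools are stand-in tuples rather than real audio samples.

```python
import random

# Stage weights and epoch counts copied from the curriculum table above.
STAGES = [
    {"name": "Arabic transfer", "arabic": 0.7, "darija": 0.3, "epochs": 2},
    {"name": "Darija focus", "arabic": 0.3, "darija": 0.7, "epochs": 2},
    {"name": "Darija refinement", "arabic": 0.0, "darija": 1.0, "epochs": 1},
]


def mix_epoch(arabic, darija, stage, n, rng):
    """Draw n examples, picking the source pool per example by stage weight.

    Assumption: sampling with replacement; the real pipeline may differ.
    """
    pools = [arabic, darija]
    weights = [stage["arabic"], stage["darija"]]
    chosen_pools = rng.choices(pools, weights=weights, k=n)
    return [rng.choice(pool) for pool in chosen_pools]


if __name__ == "__main__":
    rng = random.Random(0)
    arabic_pool = [("ar", i) for i in range(100)]    # placeholder Arabic samples
    darija_pool = [("ary", i) for i in range(100)]   # placeholder Darija samples
    for stage in STAGES:
        batch = mix_epoch(arabic_pool, darija_pool, stage, 1000, rng)
        share = sum(1 for lang, _ in batch if lang == "ary") / len(batch)
        print(f"{stage['name']}: {stage['epochs']} epoch(s), Darija share {share:.2f}")
```

Per-example sampling keeps every batch mixed, which is one way to avoid abrupt shifts inside a stage; concatenating fixed-size subsets of each pool would be an equally valid interpretation of the table.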
Qualitative observations
The base microsoft/VibeVoice-ASR model struggles with Darija in several ways:
- Dialect correction: it "fixes" Darija words toward Modern Standard Arabic (MSA), e.g. labas, bezzaf, kifash, and mezyan get substituted or distorted
- Code-switching: frequent French insertions in Darija speech are mishandled
- Phonetics: Moroccan emphatic consonants and vowel patterns differ from Gulf/Levantine Arabic in the base training data
After fine-tuning, the model handles dialectal vocabulary, code-switching with French, and Moroccan phonetic patterns noticeably better in qualitative comparison.
Formal WER/CER evaluation on the 997-sample validation set is in progress; numbers will be added here soon.
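For readers who want to run their own comparison in the meantime, WER and CER can be computed with a plain Levenshtein edit distance, as in the sketch below. This is not the author's evaluation script. It assumes whitespace tokenization and applies no text normalization, which matters for Darija because the same word may appear in Arabic or Latin script.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "labas bezzaf mezyan"
    hyp = "labas bezaf mezyan"   # one misspelled word
    print(f"WER = {wer(ref, hyp):.3f}")
    print(f"CER = {cer(ref, hyp):.3f}")
```

Corpus-level scores are usually computed by summing edits and reference lengths over all utterances rather than averaging per-utterance rates, so long and short clips are weighted by length.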
Usage
import torch
from peft import PeftModel
from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor

BASE = "microsoft/VibeVoice-ASR"
LORA = "Mohcinimohamed/vibevoice-asr-darija-lora"

processor = VibeVoiceASRProcessor.from_pretrained(
    BASE, language_model_pretrained_name="Qwen/Qwen2.5-7B"
)
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="cuda"
)
model = PeftModel.from_pretrained(model, LORA)
model.eval()

# Transcribe a Darija audio file
inputs = processor(
    audio="darija_speech.wav",
    return_tensors="pt",
    add_generation_prompt=True,
    context_info="Moroccan Darija, labas, bezzaf, dyal, kifash, mezyan",
)
# Move tensor inputs to the GPU; leave non-tensor entries untouched
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.0,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Strip prompt tokens
generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
text = processor.decode(generated_ids, skip_special_tokens=True)
print(text)
About Darija
Moroccan Darija (العربية المغربية) is spoken by ~36 million people in Morocco. It is a highly mixed variety with:
- Arabic (Maghrebi) phonological base
- French, Spanish, and Amazigh (Tamazight) loanwords
- Active code-switching, especially in educated/urban speech
- No standardized orthography: both Arabic script and Latin (Franco-Arab) are used
Standard Arabic ASR systems perform poorly on Darija. This adapter is a step toward dedicated Moroccan speech technology.
Citation
If you use this adapter, please also cite the original VibeVoice paper:
@misc{vibevoice2025,
title = {VibeVoice: Real-time Voice Interaction with Multimodal LLMs},
author = {Microsoft Research},
year = {2025},
url = {https://github.com/microsoft/VibeVoice}
}
Author
Mohcinimohamed - community contribution to microsoft/VibeVoice