VibeVoice - Egyptian Arabic (Stage 1)

Multi-speaker Egyptian Arabic fine-tune built on VibeVoice-Large (7B). Trained on 36.4 hours of undiacritized Egyptian Arabic across 42 speakers. The LoRA (rank 32) has been merged into the base weights — this is a standalone checkpoint.
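Merging a trained LoRA into the base weights follows the standard low-rank update formulation. A minimal numpy sketch, using this release's rank and alpha but an illustrative hidden size (the real projection dimensions are much larger):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 32                    # hidden size (illustrative) and LoRA rank from this release
alpha = 64                       # LoRA alpha from this release

W = rng.standard_normal((d, d))  # frozen base projection weight
A = rng.standard_normal((r, d))  # LoRA down-projection
B = np.zeros((d, r))             # LoRA up-projection (zero-initialized before training)

# Merging folds the low-rank update into the base weight, so no adapter
# is needed at inference time:
W_merged = W + (alpha / r) * (B @ A)

# With B still zero-initialized, the merge is an exact no-op.
assert np.allclose(W_merged, W)
```

After training, B is no longer zero and the merged matrix differs from the base; the checkpoint then ships as plain standalone weights, which is why no separate adapter files are published here.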

This is a research release. The model demonstrates that VibeVoice's continuous architecture can learn Arabic phonology, but has known limitations. See Known Issues.

Audio Samples

Three Egyptian Arabic prompts comparing the base VibeVoice-Large (English/Mandarin only) against this Stage 1 fine-tune.

EGY_01: ازيك يا باشا عامل ايه انا من زمان مشفتكش والله وحشتني
(Translation: "How are you, boss? How's it going? I haven't seen you in so long; I've really missed you.")

Audio: Base Model vs. Stage 1 (Egyptian Arabic)

EGY_02: تخيل امبارح كنت ماشي في شارع جامعة الدول ولقيت عربية واقفة نص الطريق
(Translation: "Imagine, yesterday I was walking down Gameat El Dewal Street and found a car stopped in the middle of the road.")

Audio: Base Model vs. Stage 1 (Egyptian Arabic)

EGY_03: لو سمحت الطريق لميدان التحرير منين هل اخد المترو ولا الاتوبيس احسن
(Translation: "Excuse me, which way to Tahrir Square? Should I take the metro, or is the bus better?")

Audio: Base Model vs. Stage 1 (Egyptian Arabic)

Longer Samples

VibeVoice generates audio as a continuous signal rather than stitching together discrete audio codes. This means quality stays consistent even on longer text — unlike many TTS models that degrade as output length increases. The 32K token context window supports generating several minutes of speech in a single pass.

PARA_01: انا فاكر اول مرة نزلت وسط البلد لوحدي كان عندي يمكن اتناشر سنة وكنت خايف جدا من الزحمة والدوشة بس لما مشيت في شارع طلعت حرب ولقيت الناس كلها ماشية في حالها حسيت اني كبرت فجاة وقلت لنفسي دي القاهرة اللي ابويا كان بيحكيلي عنها وانا صغير
(Translation: "I remember the first time I went downtown on my own. I was maybe twelve, and I was really scared of the crowds and the noise, but when I walked down Talaat Harb Street and saw everyone going about their business, I suddenly felt grown up and told myself: this is the Cairo my father used to tell me about when I was little.")

Audio: Base Model vs. Stage 1 (Egyptian Arabic)

Quick Start

Requirements

  • GPU: A100-80GB recommended; runs in bfloat16 with ~18 GB VRAM minimum
  • Python: 3.11+
  • PyTorch: 2.5.1+ with CUDA 12.4
  • flash-attn: optional but recommended for speed (falls back to SDPA if absent)
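The flash-attn fallback can be selected at load time rather than hard-coded. A minimal sketch; the commented `attn_implementation` argument follows the Hugging Face transformers convention and is an assumption about how the VibeVoice loader is invoked:

```python
import importlib.util

def pick_attn_backend() -> str:
    """Prefer FlashAttention 2 when the flash-attn package is importable,
    otherwise fall back to PyTorch's built-in SDPA kernels."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

attn = pick_attn_backend()
print(attn)  # "sdpa" unless flash-attn is installed
# e.g. model = AutoModel.from_pretrained(path, attn_implementation=attn)
```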

Install

pip install torch==2.5.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install git+https://github.com/vibevoice-community/VibeVoice.git@493b186
pip install flash-attn  # optional

Usage

This repo includes infer.py, a zero-shot inference runner that wraps VibeVoice's upstream demo script. It requires a local clone of the VibeVoice community repo.

git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice && git checkout 493b186 && cd ..

python infer.py \
  --model_path ./ \
  --vibevoice_repo ./VibeVoice \
  --ref_dir ./reference_clips \
  --device cuda

See python infer.py --help for all options (reference audio, speaker names, CFG scale, etc.).

Training Details

The base VibeVoice-Large was trained on English and Mandarin Chinese only — it has no understanding of Arabic. Stage 1 teaches it Egyptian Arabic phonology across 42 speakers from podcast interviews and conversational recordings. All text was kept undiacritized: no Egyptian Arabic diacritizer exists (every tool is MSA-only), and MSA diacritics encode pronunciations that don't match Egyptian speech. The model learns pronunciation directly from the audio.
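Keeping a corpus undiacritized amounts to stripping the Arabic combining marks (tashkeel). A minimal sketch using Unicode combining-mark categories, which covers fatha, damma, kasra, shadda, sukun, and the tanwin marks in one pass:

```python
import unicodedata

def strip_tashkeel(text: str) -> str:
    """Remove Arabic diacritics by dropping combining marks
    (Unicode general category Mn, 'Mark, nonspacing')."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# MSA-diacritized "kataba" (he wrote) reduces to the bare abjad skeleton:
print(strip_tashkeel("كَتَبَ"))  # كتب
```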

  • Data: 18,171 clips, 36.4 h, 42 speakers
  • LoRA: rank 32, alpha 64, on q/k/v/o/gate/up/down projections
  • Fully trained: DDPM diffusion head + acoustic/semantic connectors
  • Frozen: σ-VAE encoder (340M) + σ-VAE decoder (340M)
  • Learning rate: 2.5e-5, cosine decay
  • Effective batch size: 66
  • Voice prompt dropout: 20%
  • Epochs: 5
  • Hardware: 1x A100-80GB (Modal)
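The cosine decay above follows the usual schedule shape; a small sketch, where only the 2.5e-5 peak is from this run and the zero floor and step count are illustrative:

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 2.5e-5, min_lr: float = 0.0) -> float:
    """Cosine-decay the learning rate from base_lr at step 0
    down to min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(0, 1000))     # 2.5e-05 (peak)
print(cosine_lr(1000, 1000))  # 0.0 (fully decayed)
```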

Why VibeVoice for Arabic?

Most open-source TTS models (Orpheus, XTTS v2, Dia, Sesame CSM) use frozen discrete audio codecs trained on English-dominant data. If the codec never learned to represent pharyngeals (ع, ح) or emphatics (ط, ض, ص), no amount of LLM fine-tuning can fix that — we confirmed this experimentally when Orpheus produced complete gibberish on Arabic input.

VibeVoice's signal path is fully continuous: text → Qwen2.5 LLM → DDPM diffusion head → σ-VAE latent vectors → 24kHz waveform. No codebooks, no quantization, no discrete audio tokens. The frozen σ-VAE is a reconstruction constraint (synthesizing frequencies), not a representation constraint (choosing from a fixed vocabulary). We confirmed this with a round-trip encode/decode test on Egyptian Arabic reference clips: PESQ 2.71 mean, all target phonemes preserved.

Known Issues

This is an early research checkpoint with known limitations:

  • MSA tilt — some outputs lean toward Modern Standard Arabic pronunciation rather than Egyptian dialect
  • Prosody — intonation and rhythm can sound unnatural or robotic in some sentences
  • Phoneme errors — كشري ("koshari", the dish) consistently mispronounced as جيش ("geish", "army"); some pharyngeals (ح) and ج occasionally unclear
  • Speed variation — some outputs are slightly sped up compared to natural speech
  • No diacritization — text must be undiacritized; MSA diacritics will produce incorrect pronunciations
  • Abjad ambiguity — without a dedicated Egyptian Arabic phonemizer, the model must infer vowel placement from context, which it sometimes gets wrong
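To make the abjad problem concrete: a single undiacritized skeleton licenses several pronunciations, and nothing in the orthography disambiguates them. An illustrative demo using a standard textbook example (the readings are well-known facts about Arabic, not model output):

```python
# One consonantal skeleton, three valid readings; a TTS model must
# pick the vowels from context alone, just as a human reader does.
readings = {
    "كتب": [
        ("kataba", "he wrote"),
        ("kutiba", "it was written"),
        ("kutub", "books"),
    ],
}

for word, options in readings.items():
    print(word, "->", ", ".join(f"{translit} ({gloss})" for translit, gloss in options))
```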

Lessons Learned

Our experiments (LoRA at both r=32 and r=128) confirmed that LoRA alone cannot rewire dialect phonology. The binding constraint is the Abjad problem: undiacritized Arabic text forces the Qwen2.5 LLM to guess vowel placement through its English-trained acoustic space, and no amount of low-rank adaptation fixes that. A future iteration would require either a full fine-tune with significantly more Egyptian Arabic data, or an IPA transcription pipeline that bypasses abjad ambiguity entirely.

License

MIT — inherited from VibeVoice-Large. Please don't use this model for voice impersonation without speaker consent or for generating disinformation. The original VibeVoice release includes audible AI-generated disclaimers and imperceptible audio watermarks in generated output.

Acknowledgments

Training and inference were run on A100-80GB GPUs provided by Modal.

Model size
9B parameters (BF16, safetensors)