# VibeVoice - Egyptian Arabic (Stage 1)
Multi-speaker Egyptian Arabic fine-tune built on VibeVoice-Large (7B). Trained on 36.4 hours of undiacritized Egyptian Arabic across 42 speakers. The LoRA (rank 32) has been merged into the base weights — this is a standalone checkpoint.
This is a research release. The model demonstrates that VibeVoice's continuous architecture can learn Arabic phonology, but has known limitations. See Known Issues.
## Audio Samples
Three Egyptian Arabic prompts comparing the base VibeVoice-Large (English/Mandarin only) against this Stage 1 fine-tune.
**EGY_01:** ازيك يا باشا عامل ايه انا من زمان مشفتكش والله وحشتني (*"How are you, pasha? How's it going? I haven't seen you in ages, I swear I've missed you."*)
| Base Model | Stage 1 (Egyptian Arabic) |
|---|---|
**EGY_02:** تخيل امبارح كنت ماشي في شارع جامعة الدول ولقيت عربية واقفة نص الطريق (*"Imagine, yesterday I was walking down Gameat El Dowal Street and found a car stopped in the middle of the road."*)
| Base Model | Stage 1 (Egyptian Arabic) |
|---|---|
**EGY_03:** لو سمحت الطريق لميدان التحرير منين هل اخد المترو ولا الاتوبيس احسن (*"Excuse me, how do I get to Tahrir Square? Should I take the metro, or is the bus better?"*)
| Base Model | Stage 1 (Egyptian Arabic) |
|---|---|
### Longer Samples
VibeVoice generates audio as a continuous signal rather than stitching together discrete audio codes. This means quality stays consistent even on longer text — unlike many TTS models that degrade as output length increases. The 32K token context window supports generating several minutes of speech in a single pass.
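As a rough sketch of what the 32K window buys: assuming VibeVoice's acoustic tokenizer produces about 7.5 frames per second (the compression rate reported for the base model, not measured on this checkpoint) and that each frame costs one context token, the audio budget can be estimated as follows:

```python
def max_audio_minutes(context_tokens: int = 32_000,
                      text_tokens: int = 2_000,
                      frames_per_second: float = 7.5) -> float:
    """Rough upper bound on audio length that fits in one pass.

    Assumes one acoustic frame consumes one context token and a
    ~7.5 Hz frame rate (as reported for VibeVoice's tokenizer);
    treat the result as an estimate, not a guarantee.
    """
    audio_budget = context_tokens - text_tokens
    return audio_budget / frames_per_second / 60.0

# In practice, text tokens and generation stability keep outputs well
# below this ceiling, but several minutes fit comfortably in one pass.
print(f"~{max_audio_minutes():.0f} minutes of audio budget")
```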
**PARA_01:** انا فاكر اول مرة نزلت وسط البلد لوحدي كان عندي يمكن اتناشر سنة وكنت خايف جدا من الزحمة والدوشة بس لما مشيت في شارع طلعت حرب ولقيت الناس كلها ماشية في حالها حسيت اني كبرت فجاة وقلت لنفسي دي القاهرة اللي ابويا كان بيحكيلي عنها وانا صغير (*"I remember the first time I went downtown on my own. I was maybe twelve, and I was terrified of the crowds and the noise. But when I walked down Talaat Harb Street and saw everyone going about their business, I suddenly felt grown up, and I told myself: this is the Cairo my father used to tell me about when I was little."*)
| Base Model | Stage 1 (Egyptian Arabic) |
|---|---|
## Quick Start
### Requirements
| Requirement | Details |
|---|---|
| GPU | A100-80GB recommended; runs in bfloat16 with ~18 GB VRAM minimum |
| Python | 3.11+ |
| PyTorch | 2.5.1+ with CUDA 12.4 |
| flash-attn | Optional; recommended for speed (falls back to SDPA if absent) |
### Install
```bash
pip install torch==2.5.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install git+https://github.com/vibevoice-community/VibeVoice.git@493b186
pip install flash-attn  # optional
```
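If `flash-attn` is not installed, attention falls back to PyTorch's SDPA kernels. A minimal sketch of selecting the implementation at load time (the helper function is ours; `attn_implementation` is the standard Transformers loading argument):

```python
def pick_attn_implementation() -> str:
    """Prefer FlashAttention 2 when the wheel is installed,
    otherwise fall back to PyTorch's built-in SDPA kernels."""
    try:
        import flash_attn  # noqa: F401  (availability check only)
        return "flash_attention_2"
    except ImportError:
        return "sdpa"

# Pass the result as `attn_implementation=...` when loading the model.
print(pick_attn_implementation())
```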
### Usage
This repo includes `infer.py`, a zero-shot inference runner that wraps VibeVoice's upstream demo script. It requires a local clone of the VibeVoice community repo:
```bash
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice && git checkout 493b186 && cd ..

python infer.py \
    --model_path ./ \
    --vibevoice_repo ./VibeVoice \
    --ref_dir ./reference_clips \
    --device cuda
```
See `python infer.py --help` for all options (reference audio, speaker names, CFG scale, etc.).
## Training Details
The base VibeVoice-Large was trained on English and Mandarin Chinese only — it has no understanding of Arabic. Stage 1 teaches it Egyptian Arabic phonology across 42 speakers from podcast interviews and conversational recordings. All text was kept undiacritized: no Egyptian Arabic diacritizer exists (every tool is MSA-only), and MSA diacritics encode pronunciations that don't match Egyptian speech. The model learns pronunciation directly from the audio.
| Setting | Value |
|---|---|
| Data | 18,171 clips, 36.4 h, 42 speakers |
| LoRA | rank 32, alpha 64, on q/k/v/o/gate/up/down projections |
| Fully trained | DDPM diffusion head + acoustic/semantic connectors |
| Frozen | σ-VAE encoder |
| LR | 2.5e-5, cosine decay |
| Batch size | 66 effective |
| Voice prompt drop | 20% |
| Epochs | 5 |
| Hardware | 1x A100-80GB (Modal) |
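The LoRA row above maps onto standard PEFT hyperparameters. A sketch of the configuration as a plain dict using PEFT's `LoraConfig` field names (illustrative only; the exact training script is not included in this repo, and the module names follow the usual Qwen2.5 projection naming):

```python
# LoRA hyperparameters from the table above, in PEFT LoraConfig field names.
lora_config = {
    "r": 32,             # rank of the low-rank update matrices
    "lora_alpha": 64,    # scaling factor: effective scale = alpha / r = 2.0
    "target_modules": [  # Qwen2.5 attention + MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Note: after training, the adapter was merged into the base weights,
# so inference needs no PEFT at all.
print(lora_config["lora_alpha"] / lora_config["r"])
```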
## Why VibeVoice for Arabic?
Most open-source TTS models (Orpheus, XTTS v2, Dia, Sesame CSM) use frozen discrete audio codecs trained on English-dominant data. If the codec never learned to represent pharyngeals (ع, ح) or emphatics (ط, ض, ص), no amount of LLM fine-tuning can fix that — we confirmed this experimentally when Orpheus produced complete gibberish on Arabic input.
VibeVoice's signal path is fully continuous: text → Qwen2.5 LLM → DDPM diffusion head → σ-VAE latent vectors → 24kHz waveform. No codebooks, no quantization, no discrete audio tokens. The frozen σ-VAE is a reconstruction constraint (synthesizing frequencies), not a representation constraint (choosing from a fixed vocabulary). We confirmed this with a round-trip encode/decode test on Egyptian Arabic reference clips: PESQ 2.71 mean, all target phonemes preserved.
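The round-trip check is easy to reproduce in spirit. The sketch below uses a toy smoothing "codec" and a plain SNR metric as stand-ins for the real σ-VAE encode/decode and PESQ, just to show the shape of the test (the actual evaluation used the VibeVoice encoder/decoder and PESQ scores):

```python
import math

def snr_db(reference: list[float], degraded: list[float]) -> float:
    """Signal-to-noise ratio in dB between a reference waveform and
    its round-tripped version. Stands in for PESQ in this sketch."""
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, degraded))
    if noise == 0.0:
        return float("inf")
    return 10.0 * math.log10(signal / noise)

def toy_codec(wave: list[float]) -> list[float]:
    """Placeholder for sigma-VAE encode->decode: here, mild smoothing."""
    out = [wave[0]]
    for prev, cur in zip(wave, wave[1:]):
        out.append(0.9 * cur + 0.1 * prev)
    return out

# 100 ms of a 220 Hz sine at 24 kHz, matching the model's output rate.
wave = [math.sin(2 * math.pi * 220 * t / 24_000) for t in range(2_400)]
print(f"round-trip SNR: {snr_db(wave, toy_codec(wave)):.1f} dB")
```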
## Known Issues
This is an early research checkpoint with known limitations:
- MSA tilt — some outputs lean toward Modern Standard Arabic pronunciation rather than Egyptian dialect
- Prosody — intonation and rhythm can sound unnatural or robotic in some sentences
- Phoneme errors — كشري ("koshari") is consistently mispronounced as جيش ("geesh"); some pharyngeals (ح) and ج occasionally unclear
- Speed variation — some outputs are slightly sped up compared to natural speech
- No diacritization — text must be undiacritized; MSA diacritics will produce incorrect pronunciations
- Abjad ambiguity — without a dedicated Egyptian Arabic phonemizer, the model must infer vowel placement from context, which it sometimes gets wrong
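Because the model was trained on undiacritized text, any MSA diacritics should be stripped before inference. A small stdlib helper (our own; not part of `infer.py`) that removes the common harakat range U+064B..U+0652:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritical marks (tanween, harakat, shadda, sukun)
    so input matches the undiacritized training distribution."""
    return "".join(
        ch for ch in text
        if not (unicodedata.category(ch) == "Mn" and "\u064B" <= ch <= "\u0652")
    )

print(strip_diacritics("كَتَبَ"))  # -> "كتب"
```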
## Lessons Learned
Our experiments (LoRA at both r=32 and r=128) confirmed that LoRA alone cannot rewire dialect phonology. The binding constraint is the Abjad problem: undiacritized Arabic text forces the Qwen2.5 LLM to guess vowel placement through its English-trained acoustic space, and no amount of low-rank adaptation fixes that. A future iteration would require either a full fine-tune with significantly more Egyptian Arabic data, or an IPA transcription pipeline that bypasses abjad ambiguity entirely.
## License
MIT — inherited from VibeVoice-Large. Please don't use this model for voice impersonation without speaker consent or for generating disinformation. The original VibeVoice release includes audible AI-generated disclaimers and imperceptible audio watermarks in generated output.
## Acknowledgments
Training and inference were run on A100-80GB GPUs provided by Modal.
**Base model:** aoi-ot/VibeVoice-Large