Configuration Parsing Warning:Invalid JSON for config file config.json
- Arabic TTS: Improving XTTS-v2 for Arabic Speech Synthesis
- Table of Contents
- Project Overview
- Where This Project Sits in AI
- Architecture
- Setup & Installation
- Phase 1: Baseline Evaluation
- Phase 2: Compatibility Fixes
- Phase 3: Arabic Text Preprocessing
- Phase 4: Inference Pipeline Improvements
- Phase 5: Data Collection & Preparation
- Phase 6: Fine-Tuning
- Phase 7: Evaluation & Comparison
- Phase 8: Studio-Quality Audio Upsampling
- Resource Usage
- Results
- Usage
- Known Limitations & Future Improvements
- Industry Use Cases
- License
- Citation
- Table of Contents
Arabic TTS: Improving XTTS-v2 for Arabic Speech Synthesis
A systematic project to improve Coqui XTTS-v2 for high-quality Arabic text-to-speech generation. This repository documents the full journey from baseline evaluation to fine-tuned deployment, with reproducible benchmarks at every stage.
Table of Contents
- Project Overview
- Where This Project Sits in AI
- Architecture
- Setup & Installation
- Phase 1: Baseline Evaluation
- Phase 2: Compatibility Fixes
- Phase 3: Arabic Text Preprocessing
- Phase 4: Inference Pipeline Improvements
- Phase 5: Data Collection & Preparation
- Phase 6: Fine-Tuning
- Phase 7: Evaluation & Comparison
- Phase 8: Studio-Quality Audio Upsampling
- Resource Usage
- Results
- Known Limitations & Future Improvements
- Industry Use Cases
- Usage
- License
Project Overview
XTTS-v2 is a powerful multilingual TTS model supporting 17 languages out of the box — a remarkable achievement by the Coqui team. This project builds on that foundation to push Arabic output quality further by:
- Expanding Arabic training data — supplementing the base model with additional high-quality Arabic speech data
- Adding Arabic text preprocessing — introducing diacritization (tashkeel) and hamza normalization to help the model with Arabic-specific orthography
- Optimizing the tokenizer for Arabic — adapting the shared multilingual tokenizer to better handle Arabic text patterns
- Tuning inference for Arabic prosody — adjusting generation parameters to better match Arabic speech rhythms and intonation
- Building a post-processing pipeline — adding pause control, silence management, and output refinement tailored to Arabic
This project addresses each area systematically, with before/after comparisons at every stage.
Goals
- Natural-sounding Arabic speech — produce voices that sound authentically human, not robotic or accented
- Multi-dialect support — generate speech across different Arabic dialects and accents (e.g. Sudanese, Egyptian, Gulf, Levantine, North African, MSA) by fine-tuning on dialect-specific speakers
- Document a reproducible pipeline from base model to production-quality Arabic TTS
- Provide quantitative benchmarks (MOS scores, RTF, SNR) at each improvement stage
- Release fine-tuned checkpoints and preprocessing tools for the Arabic TTS community
Where This Project Sits in AI
Figure 1: The relationship between AI, Machine Learning, Deep Learning, and Data Science — and where each part of this project applies.
This project spans all four domains. Here is where each phase of our work falls:
| Domain | What we did in this project | Phase |
|---|---|---|
| Artificial Intelligence | End-to-end system that converts Arabic text to natural-sounding speech — a core AI application | All phases |
| Deep Learning | Fine-tuned a 370M-parameter GPT-2 transformer + HiFiGAN vocoder on Arabic speech data | Phase 6 (Training) |
| Machine Learning | Speaker clustering with ECAPA-TDNN embeddings + Agglomerative Clustering to identify speakers in unlabeled data | Phase 5 (Data Preparation) |
| Data Science | Dataset evaluation, quality filtering, statistical analysis, SNR measurement, Whisper-based alignment verification, visualization of results | Phases 5, 7 (Data & Evaluation) |
In detail:
- Deep Learning powers the core model — the GPT-2 backbone generates audio codes autoregressively, the HiFiGAN vocoder converts them to waveforms, and the Perceiver Resampler handles speaker conditioning. Fine-tuning these neural networks on Arabic data is a Deep Learning task
- Machine Learning solved our speaker identification problem — with 103K unlabeled clips, we used ECAPA-TDNN (a neural feature extractor) combined with Agglomerative Clustering (a classical ML algorithm) to find and extract 5,000 clips from the same speaker
- Data Science drove every decision about data quality — from evaluating 6 candidate datasets, to measuring SNR and silence ratios, to building the 3-layer sanity check pipeline, to generating comparison charts and metrics
- AI is the umbrella — the final system takes Arabic text as input and produces human-like speech as output, combining all three disciplines into a practical application
Architecture
XTTS-v2 Model Components
Figure 2: XTTS-v2 architecture — text and reference audio are processed through separate pathways, combined in the GPT-2 transformer, and decoded to a 24kHz waveform.
Key Specifications
| Component | Details |
|---|---|
| GPT Backbone | 30 layers, 1024 hidden dim, 16 heads, ~350M params |
| Tokenizer | VoiceBPE, 6,681 tokens, 17 languages |
| Audio Codec | Discrete VAE, 1,026 codebook entries |
| Vocoder | HiFiGAN, [8,8,2,2] upsampling, ResBlock1 |
| Speaker Encoder | ResNet50 + SE blocks, 512-dim output |
| Conditioning | Perceiver Resampler, 2 layers, 32 latents |
| Output | 24,000 Hz WAV |
| Input Sample Rate | 22,050 Hz |
| Max Text Tokens | 402 |
| Max Audio Tokens | 605 |
| Character Limit | 166 per chunk (Arabic) |
| Total Parameters | ~370M |
Setup & Installation
Hardware
This project was developed on an NVIDIA DGX Spark — a desktop AI workstation with a Grace Blackwell architecture. If you're using different hardware, you may need to adjust CUDA versions, PyTorch builds, and batch sizes accordingly.
| Spec | DGX Spark (this project) |
|---|---|
| GPU | NVIDIA GB10 (Blackwell) |
| GPU Memory | 121.7 GB (unified) |
| CPU Architecture | ARM (aarch64) |
| System RAM | 128 GB |
| OS | Ubuntu 24.04 LTS |
| CUDA | 13.0 |
| NVIDIA Driver | 580.142 |
Running on different hardware?
Consumer GPUs (RTX 3090, 4090, etc.):
- You'll need PyTorch built for your CUDA version (e.g.,
cu118,cu121,cu124instead ofcu130)- With 24GB VRAM, reduce
batch_sizeto 2 andgrad_accumulation_stepsto 1 during fine-tuning- Inference works fine on 8GB+ VRAM
Data Center GPUs (A100, H100):
- Should work out of the box with matching CUDA/PyTorch versions
- Can increase batch sizes for faster training
x86 vs ARM (aarch64):
- DGX Spark uses ARM. If you're on x86 (most desktops/servers), install the x86 PyTorch build — the code is the same, only the binary packages differ
pip install TTSwill automatically pull the correct architectureCPU only:
- Inference is possible but very slow (~10x slower than GPU)
- Fine-tuning on CPU is not practical
Key: Match your PyTorch version to your CUDA version. Check compatibility at pytorch.org/get-started.
Software Environment
| Package | Version | Purpose |
|---|---|---|
| Python | 3.10.20 | Runtime (via Miniconda) |
| PyTorch | 2.11.0+cu130 | Deep learning framework — match to your CUDA version |
| CUDA Toolkit | 13.0 | Match to your GPU driver |
| Coqui TTS | 0.22.0 | Core XTTS-v2 framework |
| Transformers | 4.44.2 | Pinned — v5.x breaks Coqui TTS (see Phase 2) |
| SpeechBrain | 1.1.0 | ECAPA-TDNN speaker embeddings |
| OpenAI Whisper | latest | Alignment verification (large-v3 model) |
| AudioSR | 0.0.7 | Neural audio upsampling (24→48 kHz) |
| Mishkal | 0.4.1 | Arabic tashkeel — optional, for formal MSA mode |
| PyArabic | 0.6.15 | Arabic text utilities |
| datasets | 4.8.4 | HuggingFace dataset loading |
| torchcodec | 0.11.1 | Audio decoding for HuggingFace datasets |
| matplotlib | — | Chart generation |
| umap-learn | — | Dimensionality reduction for speaker cluster visualization |
Installation Steps
# 1. Create conda environment
conda create -n new-arabic-tts python=3.10 -y
conda activate new-arabic-tts
# 2. Install Coqui TTS
pip install TTS
# 3. Pin transformers for compatibility (see Phase 2)
pip install transformers==4.44.2
# 4. Download XTTS-v2 base model
echo "y" | python -c "
from TTS.utils.manage import ModelManager
mm = ModelManager()
mm.download_model('tts_models/multilingual/multi-dataset/xtts_v2')
"
# 5. Copy model to project directory
mkdir -p models/base
cp -r ~/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2/* models/base/
Project Structure
New Arabic TTS/
├── README.md # This file
├── CHANGELOG.md # Detailed change log
├── configs/ # Configuration files
├── data/
│ └── egyptian/ # Phase 5: Prepared training data
│ ├── metadata_train.csv # Training split (cleaned)
│ ├── metadata_eval.csv # Evaluation split (cleaned)
│ ├── wavs/ # 22,050 Hz mono WAV files
│ └── rejected/ # Clips that failed sanity check
├── docs/
│ ├── DATA_FORMAT.md # Data format guide for fine-tuning
│ ├── images/
│ │ ├── speaker_clusters_umap.png # Phase 5: UMAP speaker clusters
│ │ ├── cluster_sizes.png # Phase 5: Cluster size chart
│ │ ├── similarity_distribution.png # Phase 5: Speaker consistency cutoff
│ │ ├── duration_distribution.png # Phase 5: Duration histogram
│ │ ├── comparison_waveforms.png # Phase 7: Waveform comparison
│ │ ├── comparison_spectrogram.png # Phase 7: Spectrogram comparison
│ │ ├── comparison_metrics.png # Phase 7: Metrics bar chart
│ │ └── training_loss.png # Phase 7: Training loss curve
│ └── benchmarks/
│ ├── baseline.json # Phase 1: Baseline metrics
│ ├── dataset_preparation.json # Phase 5: Pipeline report
│ ├── sanity_check.json # Phase 5: Data quality report
│ ├── training_summary.json # Phase 6: Training results
│ └── evaluation.json # Phase 7: Comparison metrics
├── models/
│ └── base/ # XTTS-v2 base model (1.8GB)
│ ├── config.json
│ ├── model.pth
│ ├── speakers_xtts.pth # 58 built-in speakers
│ └── vocab.json
├── outputs/
│ ├── original model/ # Phase 1: Raw base model outputs
│ │ └── Original_Model_test.wav # Baseline benchmark (standard demo text)
│ ├── improved model/ # Phase 3+4: Text preprocessing + post-processing
│ │ ├── Improved_Model_test.wav # Same text, improved pipeline
│ │ └── Improved_Model_test_stats.json # Per-chunk generation stats
│ └── finetuned model/ # Phase 6: After fine-tuning
│ ├── Finetuned_Model_test.wav # Fine-tuned output (same demo text)
│ └── Finetuned_Model_test_stats.json # Generation stats
└── scripts/
├── arabic_preprocessor.py # Phase 3: Arabic text preprocessing
├── baseline_test.py # Phase 1: Baseline benchmark generator
├── infer.py # Phase 4: Improved inference pipeline
├── patch_tts.py # Phase 2: PyTorch compatibility patch
├── prepare_dataset.py # Phase 5: Dataset download, clustering, export
├── refine_cluster.py # Phase 5: Refined top-N speaker selection
├── sanity_check.py # Phase 5: 3-layer data quality verification
├── train.py # Phase 6: XTTS-v2 fine-tuning
├── evaluate.py # Phase 7: Metrics and comparison charts
└── upsample.py # Phase 8: AudioSR upsampling (24→48kHz)
Phase 1: Baseline Evaluation
Objective: Establish baseline Arabic output quality using the unmodified XTTS-v2 model.
1.1 Model Download
Downloaded XTTS-v2 base model (1.8GB) from Coqui's model hub. The model includes:
model.pth- Full checkpoint (~370M parameters)speakers_xtts.pth- 58 pre-computed speaker embeddingsvocab.json- BPE vocabulary (6,681 tokens)config.json- Model and training configuration
1.2 Built-in Speakers
XTTS-v2 ships with 58 speaker embeddings. No Arabic-native speakers are included - all speakers are from English/European datasets. We selected "Gilberto Mathias" for baseline testing as a speaker with a name suggesting potential Arabic phonetic compatibility.
Full speaker list (58 speakers)
Aaron Dreschner, Abrahan Mack, Adde Michal, Alexandra Hisakawa, Alison Dietlinde, Alma Maria, Ana Florence, Andrew Chipper, Annmarie Nele, Asya Anara, Gilberto Mathias, Baldur Sanjin, Barbora MacLean, Brenda Stern, Camilla Holmstrom, Chandra MacFarland, Claribel Dervla, Craig Gutsy, Daisy Studious, Damien Black, Damjan Chapman, Dionisio Schuyler, Eugenio Mataraci, Ferran Simen, Filip Traverse, Gilberto Mathias, Gitta Nikolina, Gracie Wise, Henriette Usha, Ige Behringer, Ilkin Urbano, Kazuhiko Atallah, Kumar Dahl, Lidiya Szekeres, Lilya Stainthorpe, Ludvig Milivoj, Luis Moray, Maja Ruoho, Marcos Rudaski, Narelle Moon, Nova Hogarth, Rosemary Okafor, Royston Min, Sofia Hellen, Suad Qasim, Szofi Granger, Tammie Ema, Tammy Grit, Tanja Adelina, Torcull Diarmuid, Uta Obando, Viktor Eka, Viktor Menelaos, Vjollca Johnnie, Wulf Carlevaro, Xavier Hayasaka, Zacharie Aimilios, Zofija Kendrick
1.3 Baseline Test
Standard demo text — this text is used consistently across all phases for comparison. It is also the default demo text when the model is published on HuggingFace:
مرحباً بكم في عالم التكنولوجيا الحديثة.
تُعدّ اللغة العربية من أقدم اللغات وأكثرها انتشاراً في العالم،
إذ يتحدث بها أكثر من أربعمائة مليون شخص حول العالم.
Generation parameters (baseline):
| Parameter | Value |
|---|---|
| Temperature | 0.55 |
| Top-p | 0.85 |
| Repetition Penalty | 2.5 |
| Speaker | Gilberto Mathias (built-in) |
| Sentence Pause | 0.4s |
Output file: outputs/original model/Original_Model_test.wav — 19.33 seconds
Benchmark data: docs/benchmarks/baseline.json
Listen to this file to hear the unmodified XTTS-v2 base model generating Arabic. This is the starting point — all subsequent improvements are measured against this output.
1.4 Baseline Observations
Note: Detailed quality metrics (MOS, SNR, intelligibility) will be added after the evaluation pipeline is built in Phase 7.
Known issues with baseline output:
- Pronunciation errors on undiacritized words
- Non-native Arabic accent (English-trained speaker embedding)
- Inconsistent prosody and pacing
- No pause control between sentences (fixed 0.35s gaps)
- Potential rambling/over-generation on longer sentences
Phase 2: Compatibility Fixes
Before any improvements could begin, two compatibility issues had to be resolved.
2.1 Transformers Version Conflict
Problem: Coqui TTS 0.22.0 imports BeamSearchScorer from HuggingFace Transformers, which was removed in Transformers v5.x.
ImportError: cannot import name 'BeamSearchScorer' from 'transformers'
Fix: Pin Transformers to a compatible version.
pip install transformers==4.44.2
2.2 PyTorch weights_only Default Change
Problem: PyTorch 2.11 changed the default of torch.load() from weights_only=False to weights_only=True. XTTS-v2 checkpoints contain non-tensor objects (config classes) that fail to load under the new default.
_pickle.UnpicklingError: Weights only load failed.
Unsupported global: GLOBAL TTS.tts.configs.xtts_config.XttsConfig
Fix: Patched TTS/utils/io.py to explicitly pass weights_only=False.
# File: <env>/lib/python3.10/site-packages/TTS/utils/io.py
# Lines 51, 54: Added weights_only=False to both torch.load() calls
# Before:
return torch.load(f, map_location=map_location, **kwargs)
# After:
return torch.load(f, map_location=map_location, weights_only=False, **kwargs)
Note: This patch is applied to the installed package. If the
new-arabic-ttsconda environment is recreated, this patch must be re-applied. A script to automate this is provided atscripts/patch_tts.py.
2.3 Coqui Trainer Git Branch Detection
Problem: The Coqui Trainer class tries to detect the current git branch for logging. If the project directory is a git repo with no commits, git branch returns no output and the trainer crashes with StopIteration.
StopIteration (in trainer/generic_utils.py get_git_branch)
Fix: Initialize the git repo with at least one commit before running training.
git add . && git commit -m "Initial commit"
2.4 AudioSR Dependency Conflicts
Problem: Installing audiosr downgrades librosa (0.10→0.9.2) and transformers (4.44→4.30), breaking Coqui TTS compatibility.
Fix: After installing AudioSR, restore the required versions:
pip install audiosr
pip install "librosa>=0.10.0" "transformers==4.44.2" # restore after audiosr
Phase 3: Arabic Text Preprocessing
Objective: Build an Arabic-aware text preprocessing pipeline that improves pronunciation without forcing a specific dialect.
Key design decision: Tashkeel (diacritization) is off by default. Adding diacritics forces formal MSA pronunciation, which conflicts with the project's goal of natural dialect-specific speech. Instead, the model learns pronunciation from the speaker's training data. Tashkeel is available as an optional flag (--tashkeel) for users who want formal Arabic output.
Script: scripts/arabic_preprocessor.py
3.1 Hamza Normalization
Corrects common hamza placement errors using a 40+ word correction map. This helps the tokenizer produce consistent token sequences regardless of how the input was typed.
Examples:
| Input | Corrected |
|---|---|
| ان | أن |
| الى | إلى |
| الالات | الآلات |
| اصبحت | أصبحت |
| اكثر | أكثر |
| الان | الآن |
| مسوول | مسؤول |
3.2 Number and Symbol Expansion
Converts numerals and symbols to Arabic words so the model doesn't have to guess pronunciation of non-text characters.
Examples:
| Input | Expanded |
|---|---|
70% |
سبعون بالمئة |
500$ |
خمسمائة دولار |
& |
و |
2026 |
ألفان و ستة و عشرون |
3.3 Tashkeel (Optional — MSA Mode)
When enabled via --tashkeel, the Mishkal engine adds full diacritical marks. This produces formal MSA pronunciation.
Example:
Input: الذكاء الاصطناعي يتطور بسرعة كبيرة
Tashkeel: الذَّكَاءُ الْاِصْطِنَاعِيُّ يَتَطَوَّرُ بِسُرْعَةِ كَبِيرَةِ
3.4 Processing Pipeline
Default (dialect mode): clean → hamza → numbers → symbols → inference
Optional (formal MSA mode): clean → hamza → numbers → symbols → tashkeel → inference
Phase 4: Inference Pipeline Improvements
Objective: Build a production-quality inference pipeline with smart chunking, post-processing, and reproducible output.
Script: scripts/infer.py
4.1 Smart Text Chunking
The base XTTS-v2 has a 166-character limit per inference call. Our chunker handles this with a 3-tier strategy:
- Sentence splitting — split at Arabic punctuation (
.؟!؛) - Long sentence splitting — split at commas/semicolons if over 160 chars, then by word boundary as last resort
- Short chunk merging — merge chunks under 30 chars with their neighbors to avoid choppy output
4.2 Post-Processing Pipeline
Each generated chunk goes through:
| Step | What it does |
|---|---|
| Pause compression | Caps internal silence to 350ms — removes unnatural long pauses while preserving natural breathing rhythm |
| Trailing silence trim | Removes dead air at the end, keeps 150ms tail for natural sentence endings |
| Rambling detection | If output exceeds 1.5x expected duration, trims at nearest silence point |
4.3 Deterministic Generation
Each chunk gets a unique seed (base_seed + chunk_index), so the same input always produces the same output. Default base seed: 12345.
4.4 Configurable Pause Control
| Pause Type | Duration | Purpose |
|---|---|---|
| Sentence pause | 0.4s | Between sentences |
| Paragraph pause | 0.70s | At end of full text |
4.5 First Improved Output
Generated the same AI test text through the improved pipeline:
Output file: outputs/improved model/Improved_Model_test.wav — 18.99 seconds
Stats file: outputs/improved model/Improved_Model_test_stats.json
Listen to this file and compare with
outputs/original model/Original_Model_test.wavto hear the effect of text preprocessing and post-processing. The voice is the same — the improvements are in how text is handled and how audio is cleaned.
| Metric | Baseline | Improved | Change |
|---|---|---|---|
| Total Duration | 19.33s | 18.99s | -1.8% (tighter, less dead air) |
| Generation Time | 6.41s | 6.86s | — |
| Chunks | 4 (manual) | 2 (smart chunking) | Fewer splits, more natural |
| Text Preprocessing | None | Hamza correction | Added |
| Post-Processing | None | Pause compress + trim + taper | Added |
What improved and what didn't (yet)
Improved in this phase (pipeline & text):
- Numbers are spoken correctly (
70%→سبعون بالمئةinstead of raw symbols)- Hamza correction ensures consistent tokenization (
اكثر→أكثر)- Cleaner audio: internal pauses capped at 150ms, natural sentence endings with taper
- Reproducible output via deterministic seeding
- Per-sentence stats tracked in JSON
Not improved yet (requires fine-tuning in Phase 6):
- Voice is still a non-Arabic built-in speaker (Gilberto Mathias)
- Pronunciation still reflects the base model's limited Arabic training
- Accent and prosody are not natural Arabic
The voice quality transformation happens in Phase 6 when we fine-tune on real Arabic speaker data. Phases 3–4 ensure the pipeline is clean and correct so that fine-tuning produces the best possible results.
4.6 Output Reference Guide
All output files are organized by phase. Each file captures a specific stage of improvement so you can listen and compare the progression:
| File | Phase | What it demonstrates |
|---|---|---|
outputs/original model/Original_Model_test.wav |
Phase 1 | Standard demo text, raw base model — baseline for all comparisons |
outputs/improved model/Improved_Model_test.wav |
Phase 3+4 | Same demo text, with hamza correction, smart chunking, pause compression, sentence taper |
outputs/improved model/Improved_Model_test_stats.json |
Phase 4 | Per-chunk generation stats (duration, timing, params) |
docs/benchmarks/baseline.json |
Phase 1 | Per-sentence metrics for baseline |
outputs/finetuned model/Finetuned_Model_test.wav |
Phase 6 | Same demo text, fine-tuned on Egyptian Arabic — voice quality transformation |
outputs/finetuned model/Finetuned_Model_test_stats.json |
Phase 6 | Per-chunk generation stats for fine-tuned output |
outputs/original model/Original_Model_test_48kHz.wav |
Phase 8 | Baseline upsampled to 48kHz/24-bit |
outputs/improved model/Improved_Model_test_48kHz.wav |
Phase 8 | Improved pipeline upsampled to 48kHz/24-bit |
outputs/finetuned model/Finetuned_Model_test_48kHz.wav |
Phase 8 | Fine-tuned model upsampled to 48kHz/24-bit — best quality |
How to compare: Play
Original_Model_test.wavandImproved_Model_test.wavside by side — all files use the same standard demo text. After Phase 6, play the fine-tuned version to hear the voice quality transformation. This progression tells the full story of the project.
Phase 5: Data Collection & Preparation
Status: In Progress
For full details on the data format, see docs/DATA_FORMAT.md.
5.1 Data Format Summary
XTTS-v2 fine-tuning requires:
- Audio: WAV files, 22,050 Hz mono (auto-resampled), 0.5–11.6s per clip
- Metadata: Pipe-delimited CSV with header:
audio_file|text|speaker_name - Split: Separate
metadata_train.csvandmetadata_eval.csv
5.2 Dataset Evaluation
We evaluated all publicly available Arabic speech datasets for suitability as XTTS-v2 fine-tuning data. The ideal dataset has: single-speaker or speaker-labeled clips, 2–11s duration, clean audio, accurate transcripts, and 22kHz+ sample rate.
Fully Free & Ready to Download
| Dataset | Dialect | Size | Sample Rate | Speakers | Transcripts | Our Verdict |
|---|---|---|---|---|---|---|
| MAdel121/arabic-egy-cleaned | Egyptian | 72h / 103K clips | 16 kHz | Unknown (~85% male, no IDs) | Yes (normalized, no diacritics) | Best option |
| Kaggle Egyptian Audio | Egyptian | Unknown | Unknown | Unknown | Unknown | Needs investigation |
| MagicHub ASR-EgArbCSC | Egyptian | 5.5h / 9 conversations | 16 kHz | 4 speakers (2 pairs) | Yes (UTF-8 TXT) | Too small, conversational overlap |
Free with Registration
| Dataset | Dialect | Size | Sample Rate | Speakers | Transcripts | Our Verdict |
|---|---|---|---|---|---|---|
| MGB-3 (via ARBML/klaam) | Egyptian | 15h+ | Unknown | Multiple (YouTube) | Yes (4 annotators per sentence) | Good quality, registration required |
| Mozilla Common Voice | Mixed MSA | ~300h (Arabic) | 48 kHz | Crowdsourced | Yes | Large but mixed accents, no speaker consistency |
Other Arabic Speech Resources
| Dataset | Dialect | Size | Notes |
|---|---|---|---|
| OpenSLR #46 | Tunisian MSA | Unknown | Tunisian Modern Standard Arabic |
| OpenSLR #108 MediaSpeech | Multi (includes Arabic) | Unknown | Media speech from broadcasts |
| OpenSLR #132 | Quranic Arabic | Unknown | Quran recitation — very formal, not conversational |
| ADI-5 | Multi-dialect | 50h+ | Egyptian, Levantine, Gulf, North African, MSA |
Evaluation Summary
Figure 4: Dataset evaluation results — green (chosen), yellow (viable alternatives), red (rejected).
5.3 Chosen Approach
Primary dataset: MAdel121/arabic-egy-cleaned — starting with a subset for initial fine-tuning experiments.
Why: It's the largest freely available Egyptian Arabic dataset, already cleaned, with normalized transcripts that match our dialect-first approach (no diacritics). The 16kHz sample rate is a compromise, but XTTS-v2 handles resampling automatically.
Future: As the project expands to more dialects (Sudanese, Gulf, Levantine), we will collect dialect-specific data from single speakers — either from YouTube (using yt-dlp + Whisper for transcription) or from dedicated recording sessions.
5.4 Data Preparation Pipeline
The source dataset has 103K clips with no speaker labels. To fine-tune XTTS-v2 for a single consistent voice, we need clips from the same speaker. We built an automated speaker clustering pipeline to solve this.
Curated with care: The heavy lifting of speaker identification, quality filtering, and format conversion has been done for you. If you're using this project as a starting point, the exported data in
data/egyptian/is ready for training — no additional preparation needed.
Script: scripts/prepare_dataset.py
Source dataset: MAdel121/arabic-egy-cleaned (HuggingFace)
Pipeline Overview
Figure 3: Data preparation pipeline — from raw HuggingFace dataset to XTTS-v2 training-ready format.
Step-by-Step Details
Step 1 — Download
| Detail | Value |
|---|---|
| Source | MAdel121/arabic-egy-cleaned |
| Method | datasets.load_dataset() (HuggingFace datasets library) |
| Size | 16.5 GB, 103,603 clips |
| Cache | ~/.cache/huggingface/datasets/ (auto-cached, no re-download on rerun) |
| Columns | audio (WAV 16kHz), text (normalized Arabic), duration (float), dataset_source (A-D) |
Step 2 — Quality Filter
Removes clips unsuitable for XTTS-v2 fine-tuning before any expensive processing:
| Filter | Criteria | Reason |
|---|---|---|
| Duration | 2–11 seconds | XTTS-v2 max is 11.6s; clips < 2s lack enough context for speaker embedding |
| Text length | 10–200 characters | XTTS-v2 tokenizer max is 200; very short text = poor audio-text alignment |
| Input | Output |
|---|---|
| 103,603 clips (full dataset) | ~82,000 clips (estimated after filtering) |
Step 3 — Speaker Embedding Extraction
Extracts a numerical "voice fingerprint" for each clip so we can identify who is speaking.
| Detail | Value |
|---|---|
| Model | ECAPA-TDNN (SpeechBrain) |
| Source | speechbrain/spkrec-ecapa-voxceleb (HuggingFace) |
| Embedding dim | 192 |
| Trained on | VoxCeleb (7,000+ speakers, multi-lingual) |
| Input | Each clip's audio (resampled to 16kHz for the encoder) |
| Output | One 192-dim vector per clip |
| Speed | ~200 clips/second on GPU |
Why ECAPA-TDNN over alternatives?
| Model | Dim | Trained On | Why we chose/skipped |
|---|---|---|---|
| ResNet (built-in XTTS) | 512 | English/European | Skipped — trained mainly on English speakers, less accurate for Arabic |
| ECAPA-TDNN (SpeechBrain) | 192 | VoxCeleb (7,000+ speakers) | Chosen — robust across languages, fast, no auth token needed |
| Pyannote | 512 | Multi-lingual | Good accuracy but requires HuggingFace auth token and license acceptance |
| WavLM + X-vector | 512 | Self-supervised | State of the art but significantly slower |
Step 4 — Speaker Clustering
Groups embeddings so that each cluster represents clips from the same speaker.
| Detail | Value |
|---|---|
| Method | Agglomerative Clustering (sklearn) |
| Similarity metric | Cosine similarity (embeddings L2-normalized) |
| Cluster count | Auto-selected from range 10–30 using silhouette score |
| Output | Cluster label per clip + silhouette score |
Step 5 — Best Cluster Selection
Picks the cluster most likely to be a single consistent speaker.
| Detail | Value |
|---|---|
| Minimum cluster size | 500 clips (configurable) |
| Selection metric | Highest mean cosine similarity to cluster centroid |
| Speaker ID assigned | egyptian_male_01 |
| Output stats | Cluster size, total hours, mean similarity, mean duration |
Step 6 — Export to XTTS-v2 Format
Converts the selected cluster into the exact format XTTS-v2 expects for fine-tuning.
| Detail | Value |
|---|---|
| Audio resampling | 16,000 Hz → 22,050 Hz (via torchaudio.functional.resample) |
| Audio format | WAV, mono, float32 |
| CSV format | Pipe-delimited | with header: audio_file|text|speaker_name |
| Train/eval split | 90% train / 10% eval (random, seed=42) |
| Output file | Description |
|---|---|
data/egyptian/metadata_train.csv |
Training metadata |
data/egyptian/metadata_eval.csv |
Evaluation metadata |
data/egyptian/wavs/clip_XXXXXX.wav |
Resampled audio files |
docs/benchmarks/dataset_preparation.json |
Full pipeline report (JSON) |
Generated Visualizations
The pipeline automatically generates analysis charts saved to docs/images/:
| Chart | File | What it shows |
|---|---|---|
| Speaker clusters | docs/images/speaker_clusters_umap.png |
UMAP projection of all embeddings, selected speaker in green |
| Cluster sizes | docs/images/cluster_sizes.png |
Bar chart of all clusters, selected in green, threshold line |
| Duration distribution | docs/images/duration_distribution.png |
Histogram of all clips vs selected speaker clips |
5.5 Reproducibility
To rerun the full pipeline from scratch:
conda activate new-arabic-tts
cd "New Arabic TTS"
python scripts/prepare_dataset.py
The dataset will be cached after the first download. Subsequent runs skip the download and go straight to filtering. All random operations use seed=42 for reproducibility.
5.6 Dependencies Added in This Phase
pip install datasets speechbrain umap-learn matplotlib torchcodec openai-whisper
| Package | Version | Purpose |
|---|---|---|
datasets |
4.8.4 | HuggingFace dataset download and loading |
speechbrain |
1.1.0 | ECAPA-TDNN speaker embedding model |
umap-learn |
— | UMAP dimensionality reduction for visualization |
matplotlib |
— | Chart generation |
torchcodec |
0.11.1 | Audio decoding for HuggingFace datasets (required by datasets library) |
openai-whisper |
— | Alignment verification in sanity check (Layer 3) |
Why OpenAI Whisper instead of faster-whisper? We initially tried
faster-whisper(CTranslate2-based) but it lacks CUDA support on ARM (aarch64) platforms like DGX Spark. OpenAI's original Whisper uses PyTorch directly and works with any CUDA-capable GPU.
5.7 Data Sanity Check
Why clean data matters for TTS
TTS fine-tuning is fundamentally different from ASR (speech recognition). An ASR model can tolerate noisy data because it's learning to understand speech — a few bad transcripts won't ruin comprehension. But a TTS model learns to reproduce exactly what it hears:
- Misaligned text (text doesn't match audio) → the model learns wrong pronunciations and associates incorrect words with sounds
- Noisy audio → the model learns to generate background noise, hiss, or hum as part of normal speech
- Silence or dead air → the model learns to insert random pauses mid-sentence
- Corrupted or garbled text → the model learns to mispronounce words consistently
A few bad clips in a 5,000-clip dataset might go unnoticed. But hundreds of bad clips will measurably degrade every output the fine-tuned model produces. The quality of your training data directly determines the quality ceiling of your model.
Script: scripts/sanity_check.py
We apply a three-layer automated verification pipeline to catch problems before they reach the training stage.
Layer 1 — Text Quality
Fast checks on the transcript text to catch obviously bad entries.
| Check | Threshold | What it catches |
|---|---|---|
| Arabic character ratio | Min 60% | Non-Arabic text, garbled encoding |
| Latin character count | Max 3 | English words mixed into Arabic transcripts |
| Repeated words | Max 3 consecutive | Stuttered or looping transcriptions |
| Characters per second | 3–25 chars/s | Text that doesn't match audio duration (too much or too little text for the clip length) |
Layer 2 — Audio Quality
Signal analysis on the audio waveform.
| Check | Threshold | What it catches |
|---|---|---|
| Signal-to-noise ratio | Min 10 dB | Noisy recordings, background music |
| Silence ratio | Max 50% | Clips that are mostly dead air |
| Clipping ratio | Max 1% | Distorted audio from volume overflow |
Layer 3 — Alignment Verification (Whisper)
The most important layer. Uses OpenAI Whisper (large-v3) to independently transcribe each clip, then compares the Whisper transcript to the metadata text.
| Detail | Value |
|---|---|
| Model | Whisper large-v3 (OpenAI, PyTorch-based) |
| Comparison | Normalized text similarity (SequenceMatcher) |
| Threshold | Min 40% similarity |
| What it catches | Misaligned text — clips where the transcript doesn't match what's actually spoken |
Text normalization before comparison:
- Remove diacritics
- Normalize alef variants (أ/إ/آ → ا)
- Normalize taa marbuta (ة → ه)
- Remove punctuation
Output Files
| File | Description |
|---|---|
data/egyptian/metadata_train.csv |
Cleaned training metadata (bad clips removed) |
data/egyptian/metadata_eval.csv |
Cleaned evaluation metadata (bad clips removed) |
data/egyptian/rejected/ |
Rejected WAV files (moved here for manual review) |
docs/benchmarks/sanity_check.json |
Full report with per-clip rejection reasons |
How to run
conda activate new-arabic-tts
python scripts/sanity_check.py
Dependencies
pip install openai-whisper
Note: We use
openai-whisper(PyTorch-based) instead offaster-whisper(CTranslate2-based) because CTranslate2 lacks CUDA support on ARM platforms (DGX Spark). If you're on x86 with CUDA,faster-whisperwill also work and is faster.
Phase 6: Fine-Tuning
Status: In Progress
Script: scripts/train.py
6.1 Training Configuration
| Parameter | Value | Rationale |
|---|---|---|
| Epochs | 4 | Conservative — 5h of data, avoid overfitting |
| Batch size | 4 | Fits comfortably in DGX Spark GPU memory |
| Gradient accumulation | 2 | Effective batch size = 8 |
| Learning rate | 5e-6 | Standard for XTTS-v2 fine-tuning |
| Optimizer | AdamW | With betas [0.9, 0.96], weight decay 1e-2 |
| LR scheduler | MultiStepLR | Decays at 50K, 150K, 300K steps |
| Precision | fp32 | Mixed precision causes NaN losses with XTTS GPT |
| Save frequency | Every 1,000 steps | Keep top 3 checkpoints |
| Base model | models/base/model.pth |
XTTS-v2 pretrained weights |
6.2 Training Data
| Property | Value |
|---|---|
| Dataset | Cleaned Egyptian Arabic (post sanity check) |
| Train clips | 4,328 |
| Eval clips | 480 |
| Total hours | ~4.9 hours |
| Speaker | egyptian_male_01 |
| Sample rate | 22,050 Hz |
6.3 Training Process
conda activate new-arabic-tts
python scripts/train.py
Output saved to models/finetuned/run/training/ with:
- Best model checkpoint
- TensorBoard logs
- Training configuration
6.4 Training Results
| Metric | Value |
|---|---|
| Training time | 30 minutes |
| Final loss | 3.60 (down from 4.47 at start) |
| Text CE loss | 0.029 |
| Mel CE loss | 3.57 |
| Best checkpoint | best_model_4328.pth (final step) |
| Checkpoints saved | 1000, 2000, 3000, 4328 (best) |
| Output directory | models/finetuned/run/training/GPT_XTTS_AR_FT-April-20-2026_02+14PM-81b7d55/ |
6.5 Fine-Tuned Output
Output file: outputs/finetuned model/Finetuned_Model_test.wav — 16.76 seconds
Stats file: outputs/finetuned model/Finetuned_Model_test_stats.json
Listen to this file and compare with the baseline and improved outputs. The voice now comes from a real Arabic speaker (Egyptian male) rather than a built-in English-trained speaker. This is the core quality transformation of the project.
6.6 Full Comparison (Same Demo Text)
All three outputs use the same standard demo text for a fair comparison:
| Metric | Phase 1: Original | Phase 3+4: Improved | Phase 6: Fine-tuned |
|---|---|---|---|
| File | Original_Model_test.wav |
Improved_Model_test.wav |
Finetuned_Model_test.wav |
| Duration | 19.33s | 18.99s | 16.76s |
| Speaker | Gilberto Mathias (built-in) | Gilberto Mathias (built-in) | egyptian_male_01 (trained) |
| Voice | Non-Arabic accent | Non-Arabic accent | Arabic speaker |
| Preprocessing | None | Hamza + smart chunking | Hamza |
| Model | Base XTTS-v2 | Base XTTS-v2 | Fine-tuned (4 epochs) |
The fine-tuned output is shorter (16.76s vs 19.33s) because the Arabic-trained speaker speaks at a more natural pace compared to the English-trained baseline speaker.
Phase 7: Evaluation & Comparison
Status: Complete
Script: scripts/evaluate.py
7.1 Evaluation Metrics
All three outputs were analyzed using the same standard demo text for a fair comparison.
| Metric | Phase 1: Baseline | Phase 3+4: Improved | Phase 6: Fine-tuned |
|---|---|---|---|
| Duration | 19.33s | 18.99s | 16.76s |
| SNR | 76.1 dB | 61.0 dB | 43.0 dB |
| Silence ratio | 26.8% | 28.8% | 22.6% |
| Speaker | Built-in (non-Arabic) | Built-in (non-Arabic) | Egyptian Arabic (trained) |
Note on SNR: The baseline has higher SNR because the built-in speaker produces cleaner but less natural audio. The fine-tuned model has lower SNR because it reproduces the natural characteristics of real speech (breathing, vocal texture), which register as "noise" in a simple SNR measurement. This is expected and desirable — real human speech is not perfectly clean.
7.2 Generated Charts
All charts are saved to docs/images/ and generated automatically by scripts/evaluate.py.
Waveform Comparison
Side-by-side waveforms of the same text across all three stages. The fine-tuned model shows more natural amplitude variation and pacing.
Spectrogram Comparison
Frequency content across time. The fine-tuned model shows richer harmonic structure typical of natural Arabic speech.
Metrics Comparison
Bar chart comparing duration, SNR, speaking duration, and silence ratio across all stages.
Training Loss Curve
Loss curve during fine-tuning. Steady decrease from 4.47 to 3.60 over 4 epochs (4,328 steps, 30 minutes).
7.3 Audio Samples
Listen to all three outputs in order to hear the progression:
| Stage | File | Description |
|---|---|---|
| 1. Baseline | outputs/original model/Original_Model_test.wav |
Raw base model, non-Arabic speaker |
| 2. Improved | outputs/improved model/Improved_Model_test.wav |
Same model, better text handling |
| 3. Fine-tuned | outputs/finetuned model/Finetuned_Model_test.wav |
Arabic speaker, natural pronunciation |
7.4 How to Reproduce
conda activate new-arabic-tts
python scripts/evaluate.py
This regenerates all charts and metrics from the existing output WAV files.
Phase 8: Studio-Quality Audio Upsampling
Status: Complete
Why not change the model's internal sample rate?
XTTS-v2's sample rate (22,050 Hz internal → 24,000 Hz output) is deeply embedded across all components — the DVAE codec, GPT backbone, HiFiGAN vocoder, and mel spectrogram extraction are all trained and calibrated for these rates. Changing the output sample rate would require retraining the entire model from scratch (~350M parameters), which is not feasible.
Instead, we use a neural audio upsampler as a post-processing step after the model generates speech. This is the same approach used in professional TTS production pipelines.
8.1 Approach
XTTS-v2 (24 kHz / 16-bit) → Neural Upsampler → 48 kHz or 96 kHz / 24-bit
The upsampler reconstructs the high-frequency detail (4–24 kHz) that the base model cannot produce, resulting in studio-quality output without modifying the core model.
Audio Quality Tiers
| Tier | Sample Rate | Bit Depth | Quality | Use Case |
|---|---|---|---|---|
| Default | 24 kHz | 16-bit | Good | Fast generation, previews |
| High | 48 kHz | 24-bit | Professional | Podcasts, voiceover, video |
| Studio | 96 kHz | 24-bit | Studio master | Music production, archival |
8.2 Chosen Upsampler: AudioSR
We evaluated three options and chose AudioSR:
| Tool | Output | Speed | Why we chose/skipped |
|---|---|---|---|
| AudioSR | 48 kHz | GPU-accelerated | Chosen — purpose-built for speech, diffusion-based, high quality |
| Vocos | 48 kHz | Fast | Skipped — requires custom training for best results |
| HiFi-GAN Universal | 48 kHz | Fast | Skipped — less natural than diffusion-based approach |
Script: scripts/upsample.py
8.3 Upsampling Process
All three official outputs are upsampled for comparison:
conda activate new-arabic-tts
python scripts/upsample.py --all
| Input (24kHz / 16-bit) | Output (48kHz / 24-bit) |
|---|---|
Original_Model_test.wav |
Original_Model_test_48kHz.wav |
Improved_Model_test.wav |
Improved_Model_test_48kHz.wav |
Finetuned_Model_test.wav |
Finetuned_Model_test_48kHz.wav |
8.4 AudioSR Parameters
| Parameter | Value | Purpose |
|---|---|---|
| Model | speech |
Optimized for speech (vs basic for general audio) |
| DDIM steps | 50 | Diffusion denoising steps (quality vs speed tradeoff) |
| Guidance scale | 3.5 | Controls fidelity to input |
| Seed | 42 | Reproducible output |
| Output format | WAV PCM_24 | 24-bit depth for professional quality |
AudioSR reconstructs the high-frequency content (8–24 kHz) that the base XTTS-v2 model cannot produce at its native 24kHz output. The result is noticeably crisper consonants, more natural breathiness, and professional-grade audio suitable for podcasts, voiceover, and video production.
8.5 Upsampling Results
| Input File | Output File | Processing Time |
|---|---|---|
Original_Model_test.wav (24kHz) |
Original_Model_test_48kHz.wav (48kHz/24-bit) |
60.8s |
Improved_Model_test.wav (24kHz) |
Improved_Model_test_48kHz.wav (48kHz/24-bit) |
17.1s |
Finetuned_Model_test.wav (24kHz) |
Finetuned_Model_test_48kHz.wav (48kHz/24-bit) |
17.1s |
The best output from this project is
Finetuned_Model_test_48kHz.wav— fine-tuned Arabic voice at professional 48kHz/24-bit quality.
Resource Usage
Every phase was tracked for compute resources. This helps you estimate hardware requirements if running on different GPUs.
Per-Phase Resource Breakdown
| Phase | Task | GPU VRAM | System RAM | Time | Notes |
|---|---|---|---|---|---|
| 1. Baseline | Model loading | ~4 GB | ~2 GB | 10s | XTTS-v2 base model (370M params) |
| 1. Baseline | Inference (4 sentences) | ~5 GB | ~2 GB | 6.4s | 24kHz output, built-in speaker |
| 2. Compatibility | Patch only | — | — | <1s | No GPU needed |
| 3. Preprocessing | Text processing | — | ~200 MB | <1s | CPU only (no GPU) |
| 3. Preprocessing | Mishkal tashkeel (optional) | — | ~500 MB | <1s | CPU only |
| 4. Inference | Full pipeline (chunking + post) | ~5 GB | ~2 GB | 6.9s | Same model, added post-processing |
| 5. Data download | HuggingFace dataset | — | ~36 GB | 27 min | 103K clips decoded into memory |
| 5. Embeddings | ECAPA-TDNN (53K clips) | ~1 GB | ~2.5 GB | 10.4 min | 85 clips/sec on GPU |
| 5. Clustering | Agglomerative + UMAP | — | ~4 GB | 2 min | CPU-bound (sklearn) |
| 5. Export | Resample + write WAVs | — | ~1 GB | 3 min | 5,000 clips to 22,050 Hz |
| 5. Sanity check | Whisper large-v3 (5K clips) | ~10 GB | ~2 GB | 93 min | ~1 clip/sec, alignment verification |
| 6. Training | XTTS-v2 fine-tuning (4 epochs) | ~18 GB | ~4 GB | 30 min | 4,328 steps, batch 4, fp32 |
| 6. Training | Peak during backprop | ~22 GB | ~4 GB | — | Gradient accumulation x2 |
| 7. Evaluation | Chart generation | — | ~500 MB | <30s | CPU only (matplotlib) |
| 8. Upsampling | AudioSR (per file) | ~6 GB | ~3 GB | 17-60s/file | 50 DDIM steps, diffusion model |
GPU Memory Summary
┌──────────────────────────────────────────────────────────────┐
│ DGX Spark GPU: 121.7 GB unified memory │
│ │
│ Peak usage by task: │
│ ██░░░░░░░░░░░░░░░░░░ Training (22 GB) 18% of total │
│ █░░░░░░░░░░░░░░░░░░░ Whisper check (10 GB) 8% of total │
│ █░░░░░░░░░░░░░░░░░░░ AudioSR (6 GB) 5% of total │
│ ░░░░░░░░░░░░░░░░░░░░ Inference (5 GB) 4% of total │
│ │
│ Minimum GPU for this project: 8 GB (inference only) │
│ Minimum GPU for full pipeline: 24 GB (training + Whisper) │
└──────────────────────────────────────────────────────────────┘
For Different Hardware
| GPU | VRAM | Can run |
|---|---|---|
| RTX 3060 | 12 GB | Inference only. Training needs batch_size=1 |
| RTX 3090 / 4090 | 24 GB | Full pipeline. Reduce batch_size to 2 for training |
| A100 / H100 | 40-80 GB | Full pipeline at higher batch sizes |
| DGX Spark (this project) | 121.7 GB | Full pipeline, comfortable headroom |
Total Pipeline Cost
| Resource | Value |
|---|---|
| Total GPU time | ~2.5 hours (all phases) |
| Total wall clock | ~4 hours (including downloads and waits) |
| Peak GPU VRAM | 22 GB (during training) |
| Peak system RAM | 36 GB (during dataset loading) |
| Disk space (project) | ~25 GB (model + data + outputs) |
| Disk space (cache) | ~16 GB (HuggingFace dataset cache) |
Results
| Metric | Phase 1: Baseline | Phase 3+4: Improved | Phase 6: Fine-tuned |
|---|---|---|---|
| Output duration | 19.33s | 18.99s | 16.76s |
| Speaker | Built-in (non-Arabic) | Built-in (non-Arabic) | Egyptian Arabic (trained) |
| Arabic naturalness | Low | Low (same model) | Significantly improved |
| Text handling | Raw (numbers, symbols as-is) | Hamza + numbers + symbols | Hamza |
| Post-processing | None | Pause compress + trim + taper | None (raw for comparison) |
| Training loss | — | — | 3.60 (from 4.47) |
| Training time | — | — | 30 minutes |
Usage
CLI Usage
# Basic generation (dialect mode — no tashkeel)
python scripts/infer.py --text "مرحباً بكم في العالم العربي" --output outputs/test.wav
# From a text file
python scripts/infer.py --text-file input.txt --output outputs/test.wav
# Formal MSA mode (with tashkeel)
python scripts/infer.py --text "مرحباً بكم" --output outputs/msa.wav --tashkeel
# Custom parameters
python scripts/infer.py --text "مرحباً" --output outputs/test.wav \
--speaker "Gilberto Mathias" --temperature 0.55 --top-p 0.85 --rep-penalty 2.5 --pause 0.4
Python API
from scripts.infer import load_model, load_speaker, generate
from scripts.arabic_preprocessor import ArabicPreprocessor
import soundfile as sf
model = load_model("models/base")
gpt_cond_latent, speaker_embedding = load_speaker("models/base", "Gilberto Mathias")
# Dialect mode (default)
preprocessor = ArabicPreprocessor(enable_tashkeel=False)
result = generate(
model=model,
text="الذكاء الاصطناعي يتطور بسرعة كبيرة",
gpt_cond_latent=gpt_cond_latent,
speaker_embedding=speaker_embedding,
preprocessor=preprocessor,
)
sf.write("output.wav", result["wav"], result["sr"])
print(result["stats"])
Known Limitations & Future Improvements
Current Limitations
The current model produces noticeably improved Arabic speech compared to the base XTTS-v2, but it is not yet at the level of a professional human voice actor. The primary bottleneck is data quality, not model architecture.
Why data is the limiting factor:
The training data used in this release comes from MAdel121/arabic-egy-cleaned, an open-source Egyptian Arabic speech dataset originally built for ASR (speech recognition), not TTS. While we applied extensive cleaning — speaker clustering, quality filtering, and Whisper-based alignment verification — the data has inherent limitations:
- No speaker labels in the source — our speaker clustering identifies the most consistent voice, but it's not guaranteed to be a single speaker across all 4,328 training clips
- 16 kHz source audio — upsampled to 22,050 Hz for training, but the high-frequency detail above 8 kHz was never captured in the original recordings
- ASR-grade transcripts — the text was generated by speech recognition models, not hand-verified. Our Whisper cross-check caught 188 misaligned clips (3.8%), but subtle errors may remain
- Single dialect — currently trained on Egyptian Arabic only. Sudanese, Gulf, Levantine, and other dialects require their own training data
Data Collection Pipeline (Next Phase)
To achieve production-quality Arabic TTS, we need purpose-built training data. The next phase will use a custom data collection pipeline:
Figure 5: Future data collection pipeline — from YouTube to training-ready data. Step 6 (Manual Review) is the most critical step for TTS data quality.
Manual cleaning is crucial. No automated pipeline — no matter how sophisticated — can replace a human listener verifying that text matches audio. The difference between a good TTS model and a great one often comes down to the last 5% of data quality that only manual review can catch.
Acknowledgments — Training Data
We gratefully acknowledge the following open-source data providers whose work made this project possible:
- MAdel121 — Creator of the arabic-egy-cleaned dataset (72h Egyptian Arabic, cleaned and normalized). This dataset served as our primary training data source
- Mozilla Common Voice — Crowdsourced multilingual speech data including Arabic contributions
- Coqui AI — Creators of XTTS-v2 and the Coqui TTS framework that this project builds upon
- SpeechBrain — ECAPA-TDNN speaker verification model used for speaker clustering
- OpenAI Whisper — Speech recognition model used for data alignment verification
Roadmap
| Priority | Improvement | Impact |
|---|---|---|
| High | Collect clean, single-speaker Sudanese Arabic data (10-20 hours) | Enables the primary dialect goal of this project |
| High | Manual review of training data with native Arabic speakers | Significant quality improvement from cleaner data |
| Medium | Expand to Gulf, Levantine, and North African dialects | Broader Arabic dialect coverage |
| Medium | Record purpose-built TTS data in a controlled environment | Studio-quality source audio (48kHz+) eliminates upsampling artifacts |
| Low | Train a custom Arabic tokenizer with larger vocabulary | Better text-to-token mapping for Arabic script |
| Low | Fine-tune the HiFiGAN vocoder on Arabic speech | More natural waveform generation for Arabic phonetics |
Industry Use Cases
When this project reaches production quality with clean, manually verified training data across multiple Arabic dialects, it enables a wide range of applications:
Media & Entertainment
| Use Case | Description |
|---|---|
| Arabic podcast production | Generate full episodes or narration in natural Arabic voices, reducing production time from days to minutes |
| Video dubbing & localization | Dub English/foreign content into Arabic dialects — Egyptian for entertainment, MSA for news, Gulf for regional content |
| Audiobook generation | Convert Arabic books and educational material to spoken audio at scale |
| Social media content | Generate voiceovers for Arabic YouTube, TikTok, and Instagram content |
Business & Enterprise
| Use Case | Description |
|---|---|
| Arabic call center IVR | Natural-sounding Arabic voice prompts and automated responses for customer service |
| Corporate training | Generate Arabic e-learning narration for employee training across MENA region |
| Accessibility | Screen readers and assistive technology with natural Arabic speech for visually impaired users |
| Document-to-speech | Convert Arabic reports, emails, and documents to audio for on-the-go consumption |
Technology & AI
| Use Case | Description |
|---|---|
| Arabic voice assistants | Power Siri/Alexa-like assistants that speak natural Arabic across dialects |
| Real-time translation | Text-to-speech component for Arabic in live translation systems |
| Gaming | Generate Arabic NPC dialogue and narration for games targeting the MENA market |
| Robotics | Arabic speech output for service robots in airports, malls, and hospitals across the Arab world |
Education & Government
| Use Case | Description |
|---|---|
| Language learning | Generate example pronunciations in specific Arabic dialects for language learners |
| Public announcements | Natural Arabic announcements for airports, transit systems, and government services |
| News automation | Generate Arabic news bulletins from text feeds |
Market Opportunity
The Arabic-speaking world represents over 400 million native speakers across 25+ countries, yet Arabic TTS technology significantly lags behind English and European languages. High-quality, dialect-aware Arabic TTS addresses an underserved market with growing demand for Arabic digital content, accessibility tools, and AI-powered services.
Future Releases
This is an active project. Improved versions of the model will be released as we collect cleaner training data, expand dialect coverage, and refine the pipeline. Future releases will include:
- v2.0 — Retrained on manually verified, studio-quality Arabic speech data
- Dialect-specific checkpoints — Separate models for Sudanese, Egyptian, Gulf, Levantine, and MSA
- Pre-built speaker profiles — Ready-to-use voice embeddings for common use cases
Follow this repository for updates. Releases will be published on both GitHub and HuggingFace.
License
- XTTS-v2 Base Model: Coqui Public Model License (CPML)
- This Project: MIT License
Citation
@misc{arabic-tts-xtts-v2,
title={Improving XTTS-v2 for Arabic Text-to-Speech},
author={Moe Eldouma},
year={2026},
url={https://github.com/eldoumamoe-glitch/arabic-tts}
}
- Downloads last month
- 426
Model tree for Moeeldouma/arabic-tts-xtts-v2
Base model
coqui/XTTS-v2
