---
license: apache-2.0
language:
- ar
library_name: coqui
pipeline_tag: text-to-speech
tags:
- tts
- text-to-speech
- speech-synthesis
- arabic
- egyptian-arabic
- xtts
- voice-cloning
datasets:
- KickItLikeShika/NileTTS
base_model: coqui/XTTS-v2
---

# Nile-XTTS Model 🇪🇬

**Paper:** https://arxiv.org/abs/2602.15675

**Nile-XTTS** is a fine-tuned version of [XTTS v2](https://huggingface.co/coqui/XTTS-v2) optimized for **Egyptian Arabic (اللهجة المصرية)** text-to-speech synthesis, with zero-shot voice-cloning capabilities.

## Model Description

This model was fine-tuned on the [NileTTS dataset](https://huggingface.co/datasets/KickItLikeShika/NileTTS), which comprises 38 hours of Egyptian Arabic speech spanning medical, sales, and general conversation domains.

### Key Features

- **Egyptian Arabic optimized**: Trained specifically on the Egyptian dialect, not MSA or Gulf Arabic
- **Zero-shot voice cloning**: Clone any voice from a reference clip as short as 6 seconds
- **Improved intelligibility**: 29.9% relative reduction in WER compared to the base XTTS v2
- **Better pronunciation**: 49.4% relative reduction in CER for Egyptian Arabic

### Performance

| Metric | XTTS v2 (Baseline) | Nile-XTTS-v2 (Ours) | Improvement |
|--------|--------------------|---------------------|-------------|
| WER | 26.8% | **18.8%** | 29.9% (relative) |
| CER | 8.1% | **4.1%** | 49.4% (relative) |
| Speaker Similarity | 0.713 | **0.755** | +5.9% |

## Usage

[**Interactive Demo**](https://github.com/KickItLikeShika/NileTTS/blob/main/playground.ipynb)

### Installation

```bash
pip install TTS
```

### Usage (Direct Model Loading)

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# load the config and model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    use_deepspeed=False,
)
model.cuda()
model.eval()

# compute speaker conditioning latents from the reference audio
gpt_cond_latent, \
speaker_embedding = model.get_conditioning_latents(
    audio_path="reference.wav",
    gpt_cond_len=6,
    max_ref_length=30,
    sound_norm_refs=False,
)

# synthesize speech ("Hello, how are you today?" in Egyptian Arabic)
out = model.inference(
    text="مرحبا، إزيك النهارده؟",
    language="ar",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

# save the generated waveform (XTTS outputs 24 kHz audio)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```

## Training Details

- **Base model**: XTTS v2
- **Training data**: NileTTS dataset (38 hours, 2 speakers)
- **Epochs**: 8 (with early stopping)
- **Learning rate**: 5e-6

## Limitations

- Only 2 speaker voices in the training data
- Optimized for Egyptian Arabic; may not perform as well on other Arabic dialects
- Zero-shot cloning quality depends on the quality of the reference audio

## Citation

If you use this model, please cite: [TO BE ADDED]

## License

This model is released under the Apache 2.0 license, following the original XTTS v2 license.

## Acknowledgements

- [Coqui TTS](https://github.com/coqui-ai/TTS) for the XTTS v2 base model
- The NileTTS team for the dataset creation
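As a sanity check, the relative improvements reported in the Performance table above can be recomputed directly from the raw metrics. The sketch below uses only the numbers copied from that table; the dictionary names are mine:

```python
# metrics from the Performance table
baseline = {"wer": 26.8, "cer": 8.1, "sim": 0.713}   # XTTS v2
finetuned = {"wer": 18.8, "cer": 4.1, "sim": 0.755}  # Nile-XTTS

# WER/CER improvements are relative *reductions* (lower is better);
# speaker similarity is a relative *increase* (higher is better)
wer_reduction = (baseline["wer"] - finetuned["wer"]) / baseline["wer"] * 100
cer_reduction = (baseline["cer"] - finetuned["cer"]) / baseline["cer"] * 100
sim_gain = (finetuned["sim"] - baseline["sim"]) / baseline["sim"] * 100

print(f"WER reduction: {wer_reduction:.1f}%")    # 29.9%
print(f"CER reduction: {cer_reduction:.1f}%")    # 49.4%
print(f"Similarity gain: +{sim_gain:.1f}%")      # +5.9%
```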