---
license: apache-2.0
language:
- ar
library_name: coqui
pipeline_tag: text-to-speech
tags:
- tts
- text-to-speech
- speech-synthesis
- arabic
- egyptian-arabic
- xtts
- voice-cloning
datasets:
- KickItLikeShika/NileTTS
base_model: coqui/XTTS-v2
---

# Nile-XTTS Model 🇪🇬

**Paper:** https://arxiv.org/abs/2602.15675

**Nile-XTTS** is a fine-tuned version of [XTTS v2](https://huggingface.co/coqui/XTTS-v2) optimized for **Egyptian Arabic (اللهجة المصرية)** text-to-speech synthesis, with zero-shot voice-cloning capabilities.

## Model Description

This model was fine-tuned on the [NileTTS dataset](https://huggingface.co/datasets/KickItLikeShika/NileTTS), which comprises 38 hours of Egyptian Arabic speech spanning medical, sales, and general conversation domains.

### Key Features

- **Egyptian Arabic optimized**: Trained specifically on the Egyptian dialect, not MSA or Gulf Arabic
- **Zero-shot voice cloning**: Clone any voice from a reference clip as short as 6 seconds
- **Improved intelligibility**: 29.9% relative reduction in WER compared to the base XTTS v2
- **Better pronunciation**: 49.4% relative reduction in CER for Egyptian Arabic

### Performance

| Metric | XTTS v2 (Baseline) | Nile-XTTS-v2 (Ours) | Improvement |
|--------|--------------------|---------------------|-------------|
| WER | 26.8% | **18.8%** | 29.9% (relative) |
| CER | 8.1% | **4.1%** | 49.4% (relative) |
| Speaker Similarity | 0.713 | **0.755** | +5.9% |

## Usage

[**Interactive Demo**](https://github.com/KickItLikeShika/NileTTS/blob/main/playground.ipynb)

### Installation

```bash
pip install TTS
```

### Usage (Direct Model Loading)

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# load the config and model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    use_deepspeed=False,
)
model.cuda()
model.eval()

# compute speaker conditioning latents from the reference audio
gpt_cond_latent, \
speaker_embedding = model.get_conditioning_latents(
    audio_path="reference.wav",
    gpt_cond_len=6,
    max_ref_length=30,
    sound_norm_refs=False,
)

# synthesize speech ("Hello, how are you today?" in Egyptian Arabic)
out = model.inference(
    text="مرحبا، إزيك النهارده؟",
    language="ar",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

# save the generated waveform (XTTS outputs 24 kHz audio)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```

## Training Details

- **Base model**: XTTS v2
- **Training data**: NileTTS dataset (38 hours, 2 speakers)
- **Epochs**: 8 (with early stopping)
- **Learning rate**: 5e-6

## Limitations

- Only 2 speaker voices in the training data
- Optimized for Egyptian Arabic; may not perform as well on other Arabic dialects
- Zero-shot cloning quality depends on the quality of the reference audio

## Citation

If you use this model, please cite: [TO BE ADDED]

## License

This model is released under the Apache 2.0 license, following the original XTTS v2 license.

## Acknowledgements

- [Coqui TTS](https://github.com/coqui-ai/TTS) for the XTTS v2 base model
- The NileTTS team for the dataset creation
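As a sanity check, the relative improvements reported in the Performance table above can be recomputed directly from the raw metrics. The sketch below uses only the numbers copied from that table; the dictionary names are mine:

```python
# metrics from the Performance table
baseline = {"wer": 26.8, "cer": 8.1, "sim": 0.713}   # XTTS v2
finetuned = {"wer": 18.8, "cer": 4.1, "sim": 0.755}  # Nile-XTTS

# WER/CER improvements are relative *reductions* (lower is better);
# speaker similarity is a relative *increase* (higher is better)
wer_reduction = (baseline["wer"] - finetuned["wer"]) / baseline["wer"] * 100
cer_reduction = (baseline["cer"] - finetuned["cer"]) / baseline["cer"] * 100
sim_gain = (finetuned["sim"] - baseline["sim"]) / baseline["sim"] * 100

print(f"WER reduction: {wer_reduction:.1f}%")    # 29.9%
print(f"CER reduction: {cer_reduction:.1f}%")    # 49.4%
print(f"Similarity gain: +{sim_gain:.1f}%")      # +5.9%
```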