---
license: apache-2.0
language:
- ar
library_name: coqui
pipeline_tag: text-to-speech
tags:
- tts
- text-to-speech
- speech-synthesis
- arabic
- egyptian-arabic
- xtts
- voice-cloning
datasets:
- KickItLikeShika/NileTTS
base_model: coqui/XTTS-v2
---

# Nile-XTTS Model 🇪🇬

**Paper:** https://arxiv.org/abs/2602.15675

**Nile-XTTS** is a fine-tuned version of [XTTS v2](https://huggingface.co/coqui/XTTS-v2) optimized for **Egyptian Arabic (the Egyptian dialect, اللهجة المصرية)** text-to-speech synthesis with zero-shot voice cloning capabilities.

## Model Description

This model was fine-tuned on the [NileTTS dataset](https://huggingface.co/datasets/KickItLikeShika/NileTTS), comprising 38 hours of Egyptian Arabic speech across medical, sales, and general conversation domains.

### Key Features

- **Egyptian Arabic optimized**: trained specifically on the Egyptian dialect, not MSA or Gulf Arabic
- **Zero-shot voice cloning**: clone any voice from a reference clip of roughly 6 seconds
- **Improved intelligibility**: 29.9% relative reduction in WER compared to base XTTS v2
- **Better pronunciation**: 49.4% relative reduction in CER for Egyptian Arabic
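
Since conditioning only needs a few seconds of audio, longer reference recordings can be trimmed before cloning. The helper below is a minimal sketch of ours (the function name is an assumption, not part of any API), operating on a mono sample list for simplicity:

```python
def trim_reference(samples: list[float], sample_rate: int, seconds: float = 6.0) -> list[float]:
    """Keep at most `seconds` of audio from a mono sample list."""
    return samples[: int(sample_rate * seconds)]

# 10 s of silence at 24 kHz -> trimmed to the first 6 s (144000 samples)
clip = trim_reference([0.0] * 240_000, sample_rate=24_000)
print(len(clip))  # 144000
```

With `torchaudio.load`, the same slice applies along the sample dimension of the returned tensor.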

### Performance

| Metric | XTTS v2 (Baseline) | Nile-XTTS-v2 (Ours) | Improvement |
|--------|--------------------|---------------------|-------------|
| WER | 26.8% | **18.8%** | 29.9% |
| CER | 8.1% | **4.1%** | 49.4% |
| Speaker Similarity | 0.713 | **0.755** | +5.9% |
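
The "Improvement" column reports relative changes, which can be re-derived from the absolute scores. A quick sanity check (not part of the evaluation code):

```python
def relative_change(baseline: float, ours: float) -> float:
    """Relative reduction from baseline to ours, in percent."""
    return (baseline - ours) / baseline * 100

print(round(relative_change(26.8, 18.8), 1))     # WER reduction: 29.9
print(round(relative_change(8.1, 4.1), 1))       # CER reduction: 49.4
print(round(-relative_change(0.713, 0.755), 1))  # speaker similarity gain: 5.9
```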

## Usage

[**Interactive Demo**](https://github.com/KickItLikeShika/NileTTS/blob/main/playground.ipynb)

### Installation

```bash
pip install TTS
```

### Usage (Direct Model Loading)

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the fine-tuned config and checkpoint
config = XttsConfig()
config.load_json("config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    use_deepspeed=False,
)
if torch.cuda.is_available():
    model.cuda()
model.eval()

# Compute speaker latents from a reference recording
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference.wav",
    gpt_cond_len=6,
    max_ref_length=30,
    sound_norm_refs=False,
)

# Synthesize speech
out = model.inference(
    text="مرحبا، إزيك النهارده؟",  # "Hello, how are you today?"
    language="ar",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

# Save the 24 kHz output waveform
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```
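
XTTS-style models work best on sentence-length inputs, so long passages are usually split and synthesized sentence by sentence. The splitter below is our own minimal sketch (the function name and regex are assumptions, not part of the TTS package), covering both Latin and Arabic sentence-final punctuation:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split on Latin (. ! ?) and Arabic (؟ ۔) sentence-final punctuation."""
    parts = re.split(r"(?<=[.!?؟۔])\s+", text.strip())
    return [p for p in parts if p]

for sentence in split_sentences("مرحبا، إزيك النهارده؟ أنا بخير. وانت؟"):
    print(sentence)
    # feed each sentence to model.inference(...) and concatenate the waveforms
```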

## Training Details

- **Base model**: XTTS v2
- **Training data**: NileTTS dataset (38 hours, 2 speakers)
- **Epochs**: 8 (with early stopping)
- **Learning rate**: 5e-6

## Limitations

- The training data contains only 2 speaker voices
- Optimized for Egyptian Arabic; may underperform on other Arabic dialects and MSA
- Zero-shot cloning quality depends on the quality of the reference audio

## Citation

If you use this model, please cite:

[TO BE ADDED]

## License

This model is released under the Apache 2.0 license, following the original XTTS v2 license.

## Acknowledgements

- [Coqui TTS](https://github.com/coqui-ai/TTS) for the XTTS v2 base model
- The NileTTS team for the dataset creation