---
license: apache-2.0
language:
- ar
library_name: coqui
pipeline_tag: text-to-speech
tags:
- tts
- text-to-speech
- speech-synthesis
- arabic
- egyptian-arabic
- xtts
- voice-cloning
datasets:
- KickItLikeShika/NileTTS
base_model: coqui/XTTS-v2
---

# Nile-XTTS Model 🇪🇬

**Paper:** https://arxiv.org/abs/2602.15675

**Nile-XTTS** is a fine-tuned version of [XTTS v2](https://huggingface.co/coqui/XTTS-v2) optimized for **Egyptian Arabic (the Egyptian dialect, اللهجة المصرية)** text-to-speech synthesis with zero-shot voice cloning capabilities.

## Model Description

This model was fine-tuned on the [NileTTS dataset](https://huggingface.co/datasets/KickItLikeShika/NileTTS), comprising 38 hours of Egyptian Arabic speech across medical, sales, and general conversation domains.

### Key Features

- **Egyptian Arabic optimized**: trained specifically on the Egyptian dialect, not MSA or Gulf Arabic
- **Zero-shot voice cloning**: clone any voice from a reference clip of roughly 6 seconds
- **Improved intelligibility**: 29.9% relative reduction in WER compared to base XTTS v2
- **Better pronunciation**: 49.4% relative reduction in CER for Egyptian Arabic
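
Since conditioning only needs a few seconds of audio, longer reference recordings can be trimmed before cloning. The helper below is a minimal sketch of ours (the function name is an assumption, not part of any API), operating on a mono sample list for simplicity:

```python
def trim_reference(samples: list[float], sample_rate: int, seconds: float = 6.0) -> list[float]:
    """Keep at most `seconds` of audio from a mono sample list."""
    return samples[: int(sample_rate * seconds)]

# 10 s of silence at 24 kHz -> trimmed to the first 6 s (144000 samples)
clip = trim_reference([0.0] * 240_000, sample_rate=24_000)
print(len(clip))  # 144000
```

With `torchaudio.load`, the same slice applies along the sample dimension of the returned tensor.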

### Performance

| Metric | XTTS v2 (Baseline) | Nile-XTTS-v2 (Ours) | Improvement |
|--------|--------------------|---------------------|-------------|
| WER | 26.8% | **18.8%** | 29.9% |
| CER | 8.1% | **4.1%** | 49.4% |
| Speaker Similarity | 0.713 | **0.755** | +5.9% |
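
The "Improvement" column reports relative changes, which can be re-derived from the absolute scores. A quick sanity check (not part of the evaluation code):

```python
def relative_change(baseline: float, ours: float) -> float:
    """Relative reduction from baseline to ours, in percent."""
    return (baseline - ours) / baseline * 100

print(round(relative_change(26.8, 18.8), 1))     # WER reduction: 29.9
print(round(relative_change(8.1, 4.1), 1))       # CER reduction: 49.4
print(round(-relative_change(0.713, 0.755), 1))  # speaker similarity gain: 5.9
```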

## Usage

[**Interactive Demo**](https://github.com/KickItLikeShika/NileTTS/blob/main/playground.ipynb)

### Installation

```bash
pip install TTS
```

### Usage (Direct Model Loading)

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the fine-tuned config and checkpoint
config = XttsConfig()
config.load_json("config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    use_deepspeed=False,
)
if torch.cuda.is_available():
    model.cuda()
model.eval()

# Compute speaker latents from a reference recording
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference.wav",
    gpt_cond_len=6,
    max_ref_length=30,
    sound_norm_refs=False,
)

# Synthesize speech
out = model.inference(
    text="مرحبا، إزيك النهارده؟",  # "Hello, how are you today?"
    language="ar",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

# Save the 24 kHz output waveform
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```
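
XTTS-style models work best on sentence-length inputs, so long passages are usually split and synthesized sentence by sentence. The splitter below is our own minimal sketch (the function name and regex are assumptions, not part of the TTS package), covering both Latin and Arabic sentence-final punctuation:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split on Latin (. ! ?) and Arabic (؟ ۔) sentence-final punctuation."""
    parts = re.split(r"(?<=[.!?؟۔])\s+", text.strip())
    return [p for p in parts if p]

for sentence in split_sentences("مرحبا، إزيك النهارده؟ أنا بخير. وانت؟"):
    print(sentence)
    # feed each sentence to model.inference(...) and concatenate the waveforms
```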

## Training Details

- **Base model**: XTTS v2
- **Training data**: NileTTS dataset (38 hours, 2 speakers)
- **Epochs**: 8 (with early stopping)
- **Learning rate**: 5e-6

## Limitations

- The training data contains only 2 speaker voices
- Optimized for Egyptian Arabic; may underperform on other Arabic dialects and MSA
- Zero-shot cloning quality depends on the quality of the reference audio

## Citation

If you use this model, please cite:

[TO BE ADDED]

## License

This model is released under the Apache 2.0 license, following the original XTTS v2 license.

## Acknowledgements

- [Coqui TTS](https://github.com/coqui-ai/TTS) for the XTTS v2 base model
- The NileTTS team for the dataset creation