---
license: apache-2.0
language:
  - ar
library_name: coqui
pipeline_tag: text-to-speech
tags:
  - tts
  - text-to-speech
  - speech-synthesis
  - arabic
  - egyptian-arabic
  - xtts
  - voice-cloning
datasets:
  - KickItLikeShika/NileTTS
base_model: coqui/XTTS-v2
---

# Nile-XTTS Model 🇪🇬

**Paper**: https://arxiv.org/abs/2602.15675

Nile-XTTS is a fine-tuned version of XTTS v2 optimized for Egyptian Arabic (اللهجة المصرية) text-to-speech synthesis with zero-shot voice cloning capabilities.

## Model Description

This model was fine-tuned on the NileTTS dataset, comprising 38 hours of Egyptian Arabic speech across medical, sales, and general conversation domains.

### Key Features

- **Egyptian Arabic optimized**: trained specifically on the Egyptian dialect, not MSA or Gulf Arabic
- **Zero-shot voice cloning**: clone a voice from a single reference clip as short as ~6 seconds
- **Improved intelligibility**: 29.9% relative WER reduction over base XTTS v2
- **Better pronunciation**: 49.4% relative CER reduction for Egyptian Arabic
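Voice cloning quality depends on the reference clip falling in a usable duration window. As an illustration only (this helper is not part of the Coqui TTS library), a simple check mirroring the ~6-second minimum above and the `max_ref_length=30` value used in the usage example below:

```python
def is_usable_reference(num_samples: int, sample_rate: int,
                        min_seconds: float = 6.0, max_seconds: float = 30.0) -> bool:
    """Return True if a reference clip's duration is in the usable window.

    The 6 s / 30 s defaults are illustrative, matching the values used in
    the inference example in this card; adjust them for your own setup.
    """
    duration = num_samples / sample_rate
    return min_seconds <= duration <= max_seconds

# A 10-second clip at 24 kHz is long enough; a 2-second clip is not.
print(is_usable_reference(240_000, 24_000))  # True
print(is_usable_reference(48_000, 24_000))   # False
```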

## Performance

| Metric | XTTS v2 (Baseline) | Nile-XTTS-v2 (Ours) | Improvement |
|---|---|---|---|
| WER ↓ | 26.8% | 18.8% | 29.9% relative reduction |
| CER ↓ | 8.1% | 4.1% | 49.4% relative reduction |
| Speaker Similarity ↑ | 0.713 | 0.755 | +5.9% |
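The improvement column follows directly from the absolute scores as relative change versus the baseline; a quick sanity check in plain Python:

```python
def relative_change(baseline: float, ours: float) -> float:
    """Relative change versus the baseline, as a percentage."""
    return (baseline - ours) / baseline * 100

# WER and CER: lower is better, so a positive value is a reduction
print(round(relative_change(26.8, 18.8), 1))     # 29.9
print(round(relative_change(8.1, 4.1), 1))       # 49.4
# Speaker similarity: higher is better, so flip the sign
print(round(-relative_change(0.713, 0.755), 1))  # 5.9
```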

## Usage


### Installation

```bash
pip install TTS
```

### Usage (Direct Model Loading)

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the model configuration and weights
config = XttsConfig()
config.load_json("config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    use_deepspeed=False,
)
model.cuda()
model.eval()

# Extract speaker conditioning latents from the reference audio
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference.wav",
    gpt_cond_len=6,
    max_ref_length=30,
    sound_norm_refs=False,
)

# Synthesize speech ("Hello, how are you today?" in Egyptian Arabic)
out = model.inference(
    text="مرحبا، إزيك النهارده؟",
    language="ar",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

# Save the 24 kHz output waveform
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```
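XTTS caps the text length it accepts per call (the limit varies by language), so long passages are best split into sentence-sized chunks, synthesized one at a time, and concatenated. A minimal illustrative splitter — the `split_for_tts` helper and its 160-character budget are this card's sketch, not part of the Coqui API:

```python
import re

def split_for_tts(text: str, max_chars: int = 160) -> list[str]:
    """Greedily pack sentences into chunks of at most `max_chars` characters.

    Splits on common Latin and Arabic end-of-sentence marks; the default
    budget is illustrative, not a library constant.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?؟])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be passed to model.inference and the wavs concatenated.
print(split_for_tts("مرحبا. إزيك النهارده؟ كله تمام.", max_chars=15))
```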

## Training Details

- **Base model**: XTTS v2
- **Training data**: NileTTS dataset (38 hours, 2 speakers)
- **Epochs**: 8 (early stopping)
- **Learning rate**: 5e-6

## Limitations

- The training data contains only 2 speakers, which limits the range of voices the model has seen
- Optimized for Egyptian Arabic; may not perform as well on other Arabic dialects
- Zero-shot cloning quality depends on the quality of the reference audio

## Citation

If you use this model, please cite: [TO BE ADDED]

## License

This model is released under the Apache 2.0 license, following the original XTTS v2 license.

## Acknowledgements

- Coqui TTS for the XTTS v2 base model
- The NileTTS team for the dataset creation