VianLB/coqui_tts_fine-tuned

XTTS v2 Fine-tuned Voice Model

This is a fine-tuned XTTS v2 model trained for voice cloning.

Model Details

Base Model: XTTS v2 (Coqui TTS)
Training: 400 epochs
Loss Improvement: 57% reduction (2.863 → 1.231)
Languages: English & French
Sample Rate: 24 kHz
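
The reported loss improvement follows directly from the two loss values listed above. A quick check of the arithmetic, with variable names chosen here purely for illustration:

```python
# Loss values as reported in Model Details.
initial_loss = 2.863
final_loss = 1.231

# Relative reduction = (initial - final) / initial.
reduction = (initial_loss - final_loss) / initial_loss
print(f"Loss reduction: {reduction:.1%}")  # Loss reduction: 57.0%
```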

Files

best_model.pth - Fine-tuned GPT model weights
config.json - Model configuration
vocab.json - Tokenizer vocabulary
dvae.pth - Discrete VAE for audio encoding
mel_stats.pth - Mel-spectrogram normalization stats
reference.wav - Voice reference sample
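
Before loading the model, it can help to confirm that a local folder contains every file listed above. The following is a minimal sketch using only the standard library; `missing_files` is a hypothetical helper name, not part of Coqui TTS, and the folder path is whatever location you downloaded the files to.

```python
from pathlib import Path

# File names as listed in this model card.
EXPECTED_FILES = [
    "best_model.pth",
    "config.json",
    "vocab.json",
    "dvae.pth",
    "mel_stats.pth",
    "reference.wav",
]


def missing_files(model_dir):
    """Return the expected files that are not present in model_dir."""
    folder = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]
```

Calling `missing_files("path/to/model/folder")` returns an empty list when every file is in place, and the names of the absent files otherwise.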

Usage

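from TTS.api import TTS

# Load the fine-tuned model (both paths are placeholders for your local copy)
tts = TTS(model_path="path/to/model/folder", config_path="path/to/config.json")

# Generate speech in the cloned voice
tts.tts_to_file(
    text="Your text here",
    file_path="output.wav",
    speaker_wav="reference.wav",  # voice reference sample from this repository
    language="en",  # use "fr" for French
)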

Requirements

TTS>=0.22.0
torch>=2.0.0
torchaudio
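
To see which of these packages are installed in the current environment, the standard library's `importlib.metadata` can report their versions. This is only a sketch: `installed_version` is a hypothetical helper, and it prints versions without comparing them against the minimums listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the Requirements list.
REQUIRED = ["TTS", "torch", "torchaudio"]


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in REQUIRED:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```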

Training Details

Dataset: Custom voice recordings
Training Duration: ~50 hours total
Batch Size: 2
Gradient Accumulation: 126 steps
Optimizer: AdamW
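
With gradient accumulation, the optimizer updates once per accumulated group of mini-batches, so the effective batch size is the product of the two values above. The sketch below assumes training on a single device, which the card does not state:

```python
batch_size = 2          # from Training Details
grad_accum_steps = 126  # from Training Details
num_devices = 1         # assumption: device count is not given in this card

# Samples contributing to each optimizer update.
effective_batch_size = batch_size * grad_accum_steps * num_devices
print(effective_batch_size)  # 252
```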

Limitations

Optimized for the specific voice in the training data
Performs best on text similar to the training data
Quality may degrade on very long texts
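
One possible way to work around the long-text limitation is to split the input at sentence boundaries and synthesize each chunk separately. The helper below is a hypothetical sketch, not the author's method; the 200-character default is an arbitrary illustrative value, and the sentence splitting is a simple punctuation-based heuristic.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group sentences into chunks of at most max_chars characters where possible.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed as `text` to a separate `tts.tts_to_file` call, writing to a distinct output file per chunk.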

License

Apache 2.0
