# IndexTTS2 GPT Fine-tune: English Female Voice (IMDA NSC, 44k)

This repository contains IndexTTS2 GPT checkpoints fine-tuned on an English female voice from IMDA's National Speech Corpus (NSC) open-source dataset (`FEMALE_01_44k` is the internal alias).
- Repo: `speedmaker/index-tts2-female01-44k`
- Uploaded checkpoints: `model_step14000.pth`, `model_step15000.pth`, `model_step15949.pth`, `latest.pth`
- Base model expected at inference time: `IndexTeam/IndexTTS-2`
## Training Method (Important)

- Fine-tuning type: full GPT fine-tuning (all GPT parameters are trainable), not LoRA adapters.
- Scope: GPT checkpoint weights for IndexTTS2 inference runtimes.
## What Is Included

- `latest.pth`
- `model_step14000.pth`
- `model_step15000.pth`
- `model_step15949.pth`
- `train.log`
- TensorBoard event files under `logs/`
## Local-First Quick Start (Recommended)

The recommended checkpoint for first use is `model_step14000.pth`.
```bash
# 1) Download the recommended checkpoint
hf download speedmaker/index-tts2-female01-44k model_step14000.pth --repo-type model --local-dir ./female01_ckpt

# 2) Prepare your local IndexTTS2 runtime (example repo)
git clone https://github.com/index-tts/index-tts.git
cd index-tts

# 3) Put the fine-tuned GPT weight where IndexTTS2 expects it
cp ../female01_ckpt/model_step14000.pth checkpoints/gpt.pth

# 4) Run local inference
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py
```
If you want to continue training instead of running inference, use `latest.pth`.
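Step 3 of the quick start (placing the weight at `checkpoints/gpt.pth`) can also be scripted. The sketch below is a hypothetical helper, not part of this repo; the paths mirror the quick start above:

```python
import shutil
from pathlib import Path


def install_gpt_checkpoint(src: str, runtime_dir: str) -> Path:
    """Copy a fine-tuned GPT checkpoint to the path IndexTTS2 reads: checkpoints/gpt.pth."""
    src_path = Path(src)
    if not src_path.is_file():
        raise FileNotFoundError(f"checkpoint not found: {src_path}")
    dest = Path(runtime_dir) / "checkpoints" / "gpt.pth"
    dest.parent.mkdir(parents=True, exist_ok=True)  # tolerate a fresh clone
    shutil.copy2(src_path, dest)
    return dest


# e.g. install_gpt_checkpoint("./female01_ckpt/model_step14000.pth", "./index-tts")
```

Using `copy2` keeps the source file untouched, so you can A/B different steps by re-running the helper with another checkpoint.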
## Training Data

- Public source: IMDA National Speech Corpus (NSC) open-source dataset
- Voice profile used in this release: English female voice
- Internal dataset alias in this workflow: `FEMALE_01_44k`
- Sampling rate profile: 44k source set
- Dataset artifacts are not published in this repo
## Validation Summary

Validation rows are parsed from `train.log` (86 points total).

Best validation point overall (not an uploaded standalone checkpoint):

- epoch=5, step=13800, text_loss=2.6579, mel_loss=3.8906, mel_top1=0.2163
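Recovering such rows from a training log is a few lines of Python. The line format in the sketch below is an assumption (adapt `VAL_RE` to the actual `train.log`), but the approach, scanning for validation lines and keeping the lowest mel_loss, is generic:

```python
import re
from typing import Optional

# Assumed log-line shape (adjust to your actual train.log):
#   [val] epoch=5 step=13800 text_loss=2.6579 mel_loss=3.8906 mel_top1=0.2163
VAL_RE = re.compile(
    r"epoch=(\d+)\s+step=(\d+)\s+text_loss=([\d.]+)\s+mel_loss=([\d.]+)\s+mel_top1=([\d.]+)"
)


def best_val_point(log_text: str) -> Optional[dict]:
    """Return the validation row with the lowest mel_loss, or None if no rows match."""
    best = None
    for m in VAL_RE.finditer(log_text):
        row = {
            "epoch": int(m.group(1)),
            "step": int(m.group(2)),
            "text_loss": float(m.group(3)),
            "mel_loss": float(m.group(4)),
            "mel_top1": float(m.group(5)),
        }
        if best is None or row["mel_loss"] < best["mel_loss"]:
            best = row
    return best
```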
## Uploaded Checkpoints Comparison

| Checkpoint | Target Step | Validation Step Used | val_text_loss | val_mel_loss | val_mel_top1 | Recommendation |
|---|---|---|---|---|---|---|
| `model_step14000.pth` | 14000 | 14000 | 2.6579 | 3.8911 | 0.2164 | Default download (best uploaded val_mel_loss) |
| `model_step15000.pth` | 15000 | 15000 | 2.6578 | 3.8913 | 0.2163 | Very close to 14k; optional A/B |
| `model_step15949.pth` | 15949 | 15800 (nearest val point) | 2.6573 | 3.8917 | 0.2164 | Latest step; slightly worse val_mel_loss |
| `latest.pth` | latest | latest saved state | varies | varies | varies | Use for resuming training |
## Which Metric Matters Most?

- Primary ranking metric: `val_mel_loss` (best proxy for acoustic reconstruction quality).
- Secondary checks:
  - `val_text_loss`: use as a linguistic/alignment sanity check.
  - `val_mel_top1`: use as a stability/decoding-confidence sanity check.
Practical selection rule:

- Pick the checkpoint with the lowest `val_mel_loss`.
- If the gap is very small, prefer a lower `val_text_loss` and a stable or higher `val_mel_top1`.
- Make the final decision with listening tests on your target scripts.
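The first two steps of that rule can be expressed directly in code. The metrics below are the uploaded-checkpoint values from the comparison table in this card; the `tol` threshold (what counts as a "very small gap") is an assumption you should tune:

```python
def pick_checkpoint(rows, tol: float = 1e-4):
    """Selection rule: lowest val_mel_loss; break near-ties by lower val_text_loss,
    then higher val_mel_top1. `tol` defines a near-tie (an assumed threshold)."""
    best_mel = min(r["val_mel_loss"] for r in rows)
    # Candidates whose val_mel_loss is within tol of the best.
    near = [r for r in rows if r["val_mel_loss"] - best_mel <= tol]
    return min(near, key=lambda r: (r["val_text_loss"], -r["val_mel_top1"]))


rows = [
    {"name": "model_step14000.pth", "val_text_loss": 2.6579, "val_mel_loss": 3.8911, "val_mel_top1": 0.2164},
    {"name": "model_step15000.pth", "val_text_loss": 2.6578, "val_mel_loss": 3.8913, "val_mel_top1": 0.2163},
    {"name": "model_step15949.pth", "val_text_loss": 2.6573, "val_mel_loss": 3.8917, "val_mel_top1": 0.2164},
]
```

With the tight default `tol`, only `model_step14000.pth` qualifies as a candidate and the rule returns it, matching the table's default recommendation; a looser `tol` lets the `val_text_loss` tie-break take over. Listening tests still make the final call.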
## How To Use (IndexTTS2)

These checkpoints are GPT weights for an existing IndexTTS2 runtime.

- Prepare an IndexTTS2 setup with the standard base files in `checkpoints/` (`config.yaml`, `s2mel.pth`, etc.).
- Select one checkpoint from this repo and place it at the GPT path expected by the config: `checkpoints/gpt.pth` (from `config.yaml`: `gpt_checkpoint: gpt.pth`).
- Run your normal inference entrypoint.
Example:

```bash
cp /path/to/model_step14000.pth /path/to/index-tts-training_v2/checkpoints/gpt.pth
cd /path/to/index-tts-training_v2
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py
```
Python usage pattern:

```python
from indextts.infer_v2 import IndexTTS2

tts = IndexTTS2(
    cfg_path="checkpoints/config.yaml",
    model_dir="checkpoints",
    use_fp16=False,
    use_cuda_kernel=False,
    use_deepspeed=False,
)
tts.infer(
    spk_audio_prompt="examples/voice_01.wav",
    text="On a quiet morning, the streets were nearly empty.",
    output_path="gen.wav",
    verbose=True,
)
```
## Intended Use
- Research and experimentation in English female-voice TTS adaptation (IMDA NSC-derived setup)
- Internal prototyping for narration and scripted generation
## Limitations
- Single-speaker adaptation scope
- Quality and style transfer depend strongly on reference prompt audio
- Metrics are from one run and should be validated perceptually for deployment
## Safety
- Use only with explicit consent for voice likeness usage
- Do not use for impersonation, fraud, or misleading synthetic media
## Model Tree

- Base model: `IndexTeam/IndexTTS-2`