IndexTTS2 GPT Fine-tune: English Female Voice (IMDA NSC, 44k)

This repository contains IndexTTS2 GPT checkpoints fine-tuned on an English female voice from IMDA's National Speech Corpus (NSC) open-source dataset (internal alias: FEMALE_01_44k).

  • Repo: speedmaker/index-tts2-female01-44k
  • Uploaded checkpoints: model_step14000.pth, model_step15000.pth, model_step15949.pth, latest.pth
  • Base model expected at inference time: IndexTeam/IndexTTS-2

Training Method (Important)

  • Fine-tuning type: full GPT fine-tuning (all trainable GPT parameters updated), not LoRA adapters.
  • Scope: GPT checkpoint weights for IndexTTS2 inference runtimes.

What Is Included

  • latest.pth
  • model_step14000.pth
  • model_step15000.pth
  • model_step15949.pth
  • train.log
  • TensorBoard event files under logs/

Local-First Quick Start (Recommended)

The recommended local checkpoint for first use is model_step14000.pth.

# 1) Download the recommended checkpoint
hf download speedmaker/index-tts2-female01-44k model_step14000.pth --repo-type model --local-dir ./female01_ckpt

# 2) Prepare your local IndexTTS2 runtime (example repo)
git clone https://github.com/index-tts/index-tts.git
cd index-tts

# 3) Put the fine-tuned GPT weight where IndexTTS2 expects it
cp ../female01_ckpt/model_step14000.pth checkpoints/gpt.pth

# 4) Run local inference
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py

If you want to continue training instead of inference, use latest.pth.
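Before wiring a checkpoint into the runtime, it can help to confirm the file deserializes cleanly. A minimal sketch (the "model" wrapper key is an assumption about how the training loop may have saved state, not confirmed by this card):

```python
import torch

def load_gpt_state(path: str) -> dict:
    """Load a fine-tuned GPT checkpoint and return its state dict.

    Assumes the .pth file is either a bare state dict or a wrapper
    dict with a "model" key (common for training-loop saves).
    """
    ckpt = torch.load(path, map_location="cpu", weights_only=True)
    if isinstance(ckpt, dict) and "model" in ckpt:
        return ckpt["model"]
    return ckpt

# Example, using the quick-start download path:
# state = load_gpt_state("female01_ckpt/model_step14000.pth")
# print(len(state), "tensors")
```

This only confirms the file loads as tensor weights; it does not validate that the key names match what your IndexTTS2 config expects.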

Training Data

  • Public source: IMDA National Speech Corpus (NSC) open-source dataset
  • Voice profile used in this release: English female voice
  • Internal dataset alias in this workflow: FEMALE_01_44k
  • Sampling rate profile: 44k source set
  • Dataset artifacts are not published in this repo

Validation Summary

Validation rows are parsed from train.log (86 points total).

Best validation point overall (not an uploaded standalone checkpoint):

  • epoch=5, step=13800, text_loss=2.6579, mel_loss=3.8906, mel_top1=0.2163
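The validation rows above can be recovered programmatically from train.log. A minimal sketch, assuming a flat `key=value` line format (the real train.log layout may differ; adjust the regex accordingly):

```python
import re

# Assumed log line shape (illustrative, not verified against this repo's train.log):
#   "val epoch=5 step=13800 text_loss=2.6579 mel_loss=3.8906 mel_top1=0.2163"
PATTERN = re.compile(
    r"epoch=(?P<epoch>\d+).*?step=(?P<step>\d+).*?"
    r"text_loss=(?P<text>[\d.]+).*?mel_loss=(?P<mel>[\d.]+).*?mel_top1=(?P<top1>[\d.]+)"
)

def best_val_point(lines):
    """Return the validation row with the lowest mel_loss, or None."""
    rows = [m.groupdict() for line in lines if (m := PATTERN.search(line))]
    return min(rows, key=lambda r: float(r["mel"])) if rows else None

log = [
    "val epoch=5 step=13800 text_loss=2.6579 mel_loss=3.8906 mel_top1=0.2163",
    "val epoch=5 step=14000 text_loss=2.6579 mel_loss=3.8911 mel_top1=0.2164",
]
print(best_val_point(log)["step"])  # -> 13800
```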

Uploaded Checkpoints Comparison

| Checkpoint | Target Step | Validation Step Used | val_text_loss | val_mel_loss | val_mel_top1 | Recommendation |
|---|---|---|---|---|---|---|
| model_step14000.pth | 14000 | 14000 | 2.6579 | 3.8911 | 0.2164 | Default download (best uploaded val_mel_loss) |
| model_step15000.pth | 15000 | 15000 | 2.6578 | 3.8913 | 0.2163 | Very close to 14k; optional A/B |
| model_step15949.pth | 15949 | 15800 (nearest val point) | 2.6573 | 3.8917 | 0.2164 | Latest step; slightly worse val_mel_loss |
| latest.pth | latest | latest saved state | varies | varies | varies | Use for resuming training |

Which Metric Matters Most?

  • Primary ranking metric: val_mel_loss (best proxy for acoustic reconstruction quality).
  • Secondary checks:
    • val_text_loss: use as a linguistic/alignment sanity check.
    • val_mel_top1: use as a stability/decoding confidence sanity check.

Practical selection rule:

  1. Pick the lowest val_mel_loss checkpoint.
  2. If the gap is very small, prefer lower val_text_loss and stable/higher val_mel_top1.
  3. Final decision should be listening tests on your target scripts.
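As a hedged sketch, the rule above can be run against the metrics from the comparison table (the tie threshold `EPSILON` is an illustrative assumption, not a value from this card):

```python
# Metrics copied from the uploaded-checkpoints comparison table.
checkpoints = {
    "model_step14000.pth": {"text": 2.6579, "mel": 3.8911, "top1": 0.2164},
    "model_step15000.pth": {"text": 2.6578, "mel": 3.8913, "top1": 0.2163},
    "model_step15949.pth": {"text": 2.6573, "mel": 3.8917, "top1": 0.2164},
}

EPSILON = 0.001  # "very small gap" threshold; an assumption, tune to taste

def pick_checkpoint(ckpts, eps=EPSILON):
    # Step 1: primary ranking by val_mel_loss
    best_mel = min(m["mel"] for m in ckpts.values())
    # Step 2: among near-ties, prefer lower val_text_loss, then higher val_mel_top1
    candidates = {k: m for k, m in ckpts.items() if m["mel"] - best_mel <= eps}
    return min(candidates, key=lambda k: (candidates[k]["text"], -candidates[k]["top1"]))

print(pick_checkpoint(checkpoints))
```

With this loose threshold, all three checkpoints count as a near-tie on val_mel_loss and the tie-break on val_text_loss selects step 15949 rather than the default 14k download; gaps this small are exactly the situation where step 3 (listening tests on your target scripts) should make the final call.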

How To Use (IndexTTS2)

These checkpoints are GPT weights for an existing IndexTTS2 runtime.

  1. Prepare an IndexTTS2 setup with standard base files in checkpoints/ (config.yaml, s2mel.pth, etc.).
  2. Select one checkpoint from this repo and place it at the GPT path expected by config:
    • checkpoints/gpt.pth (from config.yaml: gpt_checkpoint: gpt.pth)
  3. Run your normal inference entrypoint.

Example:

cp /path/to/model_step14000.pth /path/to/index-tts-training_v2/checkpoints/gpt.pth
cd /path/to/index-tts-training_v2
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py

Python usage pattern:

from indextts.infer_v2 import IndexTTS2

tts = IndexTTS2(
    cfg_path="checkpoints/config.yaml",
    model_dir="checkpoints",
    use_fp16=False,
    use_cuda_kernel=False,
    use_deepspeed=False,
)

tts.infer(
    spk_audio_prompt="examples/voice_01.wav",
    text="On a quiet morning, the streets were nearly empty.",
    output_path="gen.wav",
    verbose=True,
)

Intended Use

  • Research and experimentation in English female-voice TTS adaptation (IMDA NSC-derived setup)
  • Internal prototyping for narration and scripted generation

Limitations

  • Single-speaker adaptation scope
  • Quality and style transfer depend strongly on reference prompt audio
  • Metrics are from one run and should be validated perceptually for deployment

Safety

  • Use only with explicit consent for voice likeness usage
  • Do not use for impersonation, fraud, or misleading synthetic media