# IndexTTS2 GPT Fine-tune: English Female Voice (IMDA NSC, 44k)

This repository contains IndexTTS2 GPT checkpoints fine-tuned on an English female voice from IMDA's National Speech Corpus (NSC) open-source dataset (`FEMALE_01_44k` is the internal alias).
- Repo: `speedmaker/index-tts2-female01-44k`
- Uploaded checkpoints: `model_step14000.pth`, `model_step15000.pth`, `model_step15949.pth`, `latest.pth`
- Base model expected at inference time: `IndexTeam/IndexTTS-2`
## Training Method (Important)

- Fine-tuning type: full GPT fine-tuning (all GPT parameters are trainable), not LoRA adapters.
- Scope: GPT checkpoint weights for IndexTTS2 inference runtimes.
## What Is Included

- `latest.pth`
- `model_step14000.pth`
- `model_step15000.pth`
- `model_step15949.pth`
- `train.log`
- TensorBoard event files under `logs/`
## Local-First Quick Start (Recommended)

The recommended checkpoint for first use is `model_step14000.pth`.
```bash
# 1) Download the recommended checkpoint
hf download speedmaker/index-tts2-female01-44k model_step14000.pth --repo-type model --local-dir ./female01_ckpt

# 2) Prepare your local IndexTTS2 runtime (example repo)
git clone https://github.com/index-tts/index-tts.git
cd index-tts

# 3) Put the fine-tuned GPT weight where IndexTTS2 expects it
cp ../female01_ckpt/model_step14000.pth checkpoints/gpt.pth

# 4) Run local inference
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py
```
If you want to continue training instead of running inference, use `latest.pth`.
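Step 3 of the quick start (placing the weight at `checkpoints/gpt.pth`) can also be scripted. The sketch below is a hypothetical helper, not part of this repo; the paths mirror the quick start above:

```python
import shutil
from pathlib import Path


def install_gpt_checkpoint(src: str, runtime_dir: str) -> Path:
    """Copy a fine-tuned GPT checkpoint to the path IndexTTS2 reads: checkpoints/gpt.pth."""
    src_path = Path(src)
    if not src_path.is_file():
        raise FileNotFoundError(f"checkpoint not found: {src_path}")
    dest = Path(runtime_dir) / "checkpoints" / "gpt.pth"
    dest.parent.mkdir(parents=True, exist_ok=True)  # tolerate a fresh clone
    shutil.copy2(src_path, dest)
    return dest


# e.g. install_gpt_checkpoint("./female01_ckpt/model_step14000.pth", "./index-tts")
```

Using `copy2` keeps the source file untouched, so you can A/B different steps by re-running the helper with another checkpoint.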
## Training Data

- Public source: IMDA National Speech Corpus (NSC) open-source dataset
- Voice profile used in this release: English female voice
- Internal dataset alias in this workflow: `FEMALE_01_44k`
- Sampling rate profile: 44k source set
- Dataset artifacts are not published in this repo
## Validation Summary

Validation rows are parsed from `train.log` (86 points total).

Best validation point overall (not an uploaded standalone checkpoint):

- epoch=5, step=13800, text_loss=2.6579, mel_loss=3.8906, mel_top1=0.2163
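Recovering such rows from a training log is a few lines of Python. The line format in the sketch below is an assumption (adapt `VAL_RE` to the actual `train.log`), but the approach, scanning for validation lines and keeping the lowest mel_loss, is generic:

```python
import re
from typing import Optional

# Assumed log-line shape (adjust to your actual train.log):
#   [val] epoch=5 step=13800 text_loss=2.6579 mel_loss=3.8906 mel_top1=0.2163
VAL_RE = re.compile(
    r"epoch=(\d+)\s+step=(\d+)\s+text_loss=([\d.]+)\s+mel_loss=([\d.]+)\s+mel_top1=([\d.]+)"
)


def best_val_point(log_text: str) -> Optional[dict]:
    """Return the validation row with the lowest mel_loss, or None if no rows match."""
    best = None
    for m in VAL_RE.finditer(log_text):
        row = {
            "epoch": int(m.group(1)),
            "step": int(m.group(2)),
            "text_loss": float(m.group(3)),
            "mel_loss": float(m.group(4)),
            "mel_top1": float(m.group(5)),
        }
        if best is None or row["mel_loss"] < best["mel_loss"]:
            best = row
    return best
```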
## Uploaded Checkpoints Comparison

| Checkpoint | Target Step | Validation Step Used | val_text_loss | val_mel_loss | val_mel_top1 | Recommendation |
|---|---|---|---|---|---|---|
| `model_step14000.pth` | 14000 | 14000 | 2.6579 | 3.8911 | 0.2164 | Default download (best uploaded val_mel_loss) |
| `model_step15000.pth` | 15000 | 15000 | 2.6578 | 3.8913 | 0.2163 | Very close to 14k; optional A/B |
| `model_step15949.pth` | 15949 | 15800 (nearest val point) | 2.6573 | 3.8917 | 0.2164 | Latest step; slightly worse val_mel_loss |
| `latest.pth` | latest | latest saved state | varies | varies | varies | Use for resuming training |
## Which Metric Matters Most?

- Primary ranking metric: `val_mel_loss` (best proxy for acoustic reconstruction quality).
- Secondary checks:
  - `val_text_loss`: use as a linguistic/alignment sanity check.
  - `val_mel_top1`: use as a stability/decoding-confidence sanity check.
Practical selection rule:

- Pick the checkpoint with the lowest `val_mel_loss`.
- If the gap is very small, prefer a lower `val_text_loss` and a stable or higher `val_mel_top1`.
- Make the final decision with listening tests on your target scripts.
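The first two steps of that rule can be expressed directly in code. The metrics below are the uploaded-checkpoint values from the comparison table in this card; the `tol` threshold (what counts as a "very small gap") is an assumption you should tune:

```python
def pick_checkpoint(rows, tol: float = 1e-4):
    """Selection rule: lowest val_mel_loss; break near-ties by lower val_text_loss,
    then higher val_mel_top1. `tol` defines a near-tie (an assumed threshold)."""
    best_mel = min(r["val_mel_loss"] for r in rows)
    # Candidates whose val_mel_loss is within tol of the best.
    near = [r for r in rows if r["val_mel_loss"] - best_mel <= tol]
    return min(near, key=lambda r: (r["val_text_loss"], -r["val_mel_top1"]))


rows = [
    {"name": "model_step14000.pth", "val_text_loss": 2.6579, "val_mel_loss": 3.8911, "val_mel_top1": 0.2164},
    {"name": "model_step15000.pth", "val_text_loss": 2.6578, "val_mel_loss": 3.8913, "val_mel_top1": 0.2163},
    {"name": "model_step15949.pth", "val_text_loss": 2.6573, "val_mel_loss": 3.8917, "val_mel_top1": 0.2164},
]
```

With the tight default `tol`, only `model_step14000.pth` qualifies as a candidate and the rule returns it, matching the table's default recommendation; a looser `tol` lets the `val_text_loss` tie-break take over. Listening tests still make the final call.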
## How To Use (IndexTTS2)

These checkpoints are GPT weights for an existing IndexTTS2 runtime.

- Prepare an IndexTTS2 setup with the standard base files in `checkpoints/` (`config.yaml`, `s2mel.pth`, etc.).
- Select one checkpoint from this repo and place it at the GPT path expected by the config: `checkpoints/gpt.pth` (from `config.yaml`: `gpt_checkpoint: gpt.pth`).
- Run your normal inference entrypoint.
Example:

```bash
cp /path/to/model_step14000.pth /path/to/index-tts-training_v2/checkpoints/gpt.pth
cd /path/to/index-tts-training_v2
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py
```
Python usage pattern:

```python
from indextts.infer_v2 import IndexTTS2

tts = IndexTTS2(
    cfg_path="checkpoints/config.yaml",
    model_dir="checkpoints",
    use_fp16=False,
    use_cuda_kernel=False,
    use_deepspeed=False,
)
tts.infer(
    spk_audio_prompt="examples/voice_01.wav",
    text="On a quiet morning, the streets were nearly empty.",
    output_path="gen.wav",
    verbose=True,
)
```
## Intended Use
- Research and experimentation in English female-voice TTS adaptation (IMDA NSC-derived setup)
- Internal prototyping for narration and scripted generation
## Limitations
- Single-speaker adaptation scope
- Quality and style transfer depend strongly on reference prompt audio
- Metrics are from one run and should be validated perceptually for deployment
## Safety
- Use only with explicit consent for voice likeness usage
- Do not use for impersonation, fraud, or misleading synthetic media
## Model Tree

- Base model: `IndexTeam/IndexTTS-2`