# IndexTTS-1.5 — GGUF (ggml-quantised)
GGUF / ggml conversion of IndexTeam/IndexTTS-1.5 for use with CrispStrobe/CrispASR.
IndexTTS-1.5 is a zero-shot voice-cloning TTS system: reference audio + text → cloned speech at 24 kHz. Architecture: Conformer conditioning encoder (6L, d=512) → Perceiver resampler (2L, 32 latents) → GPT-2 AR decoder (24L, d=1280, 20 heads) → BigVGAN vocoder (6-stage upsample, anti-aliased SnakeBeta activations). ~500M parameters total. Distributed under the Apache-2.0 license.
Two GGUF files are needed: the GPT model (conditioning + text → mel codes → latent) and the BigVGAN vocoder (latent → 24 kHz audio).
## Files
| File | Quant | Size | Notes |
|---|---|---|---|
| `indextts-gpt-q8_0.gguf` | Q8_0 | 613 MB | GPT-2 + Conformer + Perceiver — recommended |
| `indextts-gpt-q4_k.gguf` | Q4_K | 347 MB | GPT-2 + Conformer + Perceiver — smallest |
| `indextts-gpt.gguf` | F16 | 2.2 GB | GPT-2 + Conformer + Perceiver — reference quality, bit-exact Python parity |
| `indextts-bigvgan.gguf` | F16 | 256 MB | BigVGAN vocoder (shared across all GPT quants) |
All quantisation levels produce correct speech (verified by an ASR roundtrip that recovers "Hello world!"). F16 gives 100% mel-code parity with Python; Q8_0 is the best quality/size trade-off.
## Quick start
```sh
# Easiest: auto-download (~870 MB on first run)
./build/bin/crispasr --backend indextts -m auto \
  --voice reference_speaker.wav \
  --tts "Hello world, this is IndexTTS speaking."
# Output: tts_output.wav (24 kHz mono)
```
Or with explicit paths:
```sh
# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target crispasr-cli

# 2. Pull model files (pick your preferred quant)
huggingface-cli download cstr/indextts-1.5-GGUF indextts-gpt-q8_0.gguf --local-dir .
huggingface-cli download cstr/indextts-1.5-GGUF indextts-bigvgan.gguf --local-dir .

# 3. Synthesise with voice cloning
./build/bin/crispasr --backend indextts \
  -m indextts-gpt-q8_0.gguf \
  --codec-model indextts-bigvgan.gguf \
  --voice reference_speaker.wav \
  --tts "Hello world, this is IndexTTS speaking."
```
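For scripted use, the invocation above can be wrapped from Python. A minimal sketch using only the flags documented here; the binary path, model filenames, and the helper name `synthesize` are assumptions for illustration:

```python
import subprocess
from pathlib import Path

def synthesize(text: str, voice_wav: str,
               gpt_model: str = "indextts-gpt-q8_0.gguf",
               vocoder: str = "indextts-bigvgan.gguf",
               binary: str = "./build/bin/crispasr") -> list:
    """Build (and, when the binary exists, run) the crispasr TTS command.

    Only flags shown in the quick start are used; output lands in
    tts_output.wav per the note above.
    """
    cmd = [
        binary, "--backend", "indextts",
        "-m", gpt_model,
        "--codec-model", vocoder,
        "--voice", voice_wav,
        "--tts", text,
    ]
    if Path(binary).exists():  # skip execution when CrispASR is not built yet
        subprocess.run(cmd, check=True)
    return cmd
```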
## Features
- Zero-shot voice cloning — any 3-10 second reference WAV at 24 kHz (or auto-resampled)
- Multilingual — trained on English and Chinese; cross-language cloning works
- Beam search — `num_beams=3` with `repetition_penalty=10.0` (matches Python defaults)
- No external dependencies — SentencePiece tokenizer embedded in GGUF, no espeak/phonemizer needed
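The `repetition_penalty=10.0` above follows the standard CTRL-style rule used by the Python reference stack (divide a repeated token's logit when positive, multiply when negative). A minimal sketch of that rule, not the actual C++ implementation:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=10.0):
    """CTRL-style repetition penalty: discourage already-generated tokens.

    Positive logits are divided by the penalty and negative ones multiplied,
    so a repeated token always becomes less likely either way.
    """
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Token 1 was already emitted, so its logit shrinks from 2.0 to 0.2:
scores = apply_repetition_penalty([1.0, 2.0, -0.5], generated_ids=[1])
# scores == [1.0, 0.2, -0.5]
```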
## Accuracy
With the F16 model, the C++ mel codes match Python 100% (55/55 tokens identical to the Python greedy reference), and the conditioning norm matches Python to within 0.001%. Both quantised levels (Q8_0, Q4_K) produce correct speech, verified via ASR roundtrip.
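A parity check like the one described reduces to a token-by-token comparison of the two mel-code sequences. A hedged sketch (the real test harness lives in the CrispASR repo; the function name and stop-token trimming here are illustrative, with the stop token 8193 taken from the architecture notes below):

```python
def mel_code_parity(cpp_codes, py_codes, stop_token=8193):
    """Compare two mel-code sequences up to and including the stop token.

    Returns (n_matching, n_total); full parity means the pair is identical.
    """
    def trim(codes):
        codes = list(codes)
        return codes[:codes.index(stop_token) + 1] if stop_token in codes else codes

    a, b = trim(cpp_codes), trim(py_codes)
    matches = sum(x == y for x, y in zip(a, b))
    return matches, max(len(a), len(b))

matches, total = mel_code_parity([5, 7, 8193], [5, 7, 8193])
# matches == total == 3, i.e. 100% parity
```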
## Architecture details
```
Input text → uppercase → SentencePiece unigram tokenizer (12000 vocab)

Reference audio → 24 kHz resample → mel spectrogram (100 bands, hop=256)
  → Conformer encoder (6 blocks, d=512, 8 heads, Conv2d subsampling)
  → Perceiver resampler (2 layers, 32 latents, d=1280, GEGLU FFN)
  → 32 conditioning vectors

GPT-2 AR decoder:
  [32 cond latents | text_embs + text_pos | start_mel + mel_pos]
  → 24 transformer blocks (d=1280, 20 heads, GELU FFN)
  → gpt.ln_f → final_norm → mel_head → beam search (B=3)
  → mel codes (stop token = 8193)

Latent extraction (2nd pass):
  Full sequence → GPT-2 → gpt.ln_f → final_norm → [n_mel+1, 1280]

BigVGAN vocoder:
  latent [T, 1280] → conv_pre → 6× (ConvTranspose1d + AMPBlock1 with
  anti-aliased SnakeBeta) → conv_post → tanh → 24 kHz PCM
  + ECAPA-TDNN speaker embedding for voice conditioning
```
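The SnakeBeta activation in the vocoder stages follows the BigVGAN formulation x + sin²(αx)/β with learned per-channel α and β (stored in log scale in BigVGAN). A scalar sketch of that formula only, ignoring the anti-aliased up/down-sampling that wraps it in AMPBlock1:

```python
import math

def snake_beta(x, log_alpha=0.0, log_beta=0.0, eps=1e-9):
    """SnakeBeta activation: x + (1/beta) * sin(alpha * x)**2.

    alpha controls the oscillation frequency, beta its magnitude; both are
    learned in log scale, and eps guards against division by zero.
    """
    alpha, beta = math.exp(log_alpha), math.exp(log_beta)
    return x + (1.0 / (beta + eps)) * math.sin(alpha * x) ** 2

# With log_alpha = log_beta = 0 (alpha = beta = 1), the activation is the
# identity plus a bounded periodic term: snake_beta(0.0) == 0.0, and
# snake_beta(x) >= x everywhere since sin² is non-negative.
```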
## Conversion
```sh
python models/convert-indextts-to-gguf.py \
  --model-dir /path/to/IndexTTS-1.5 \
  --output indextts-gpt.gguf \
  --vocoder-output indextts-bigvgan.gguf
```
## License
Apache-2.0 (same as upstream IndexTTS).
## Citation
```bibtex
@misc{indextts2024,
  title={IndexTTS: An Industrial-Level Zero-Shot Text-to-Speech System with Controllable Timbre},
  author={IndexTeam},
  year={2024},
  url={https://github.com/index-tts/IndexTTS}
}
```