IndexTTS2-Optimized (Safetensors)

This repository provides optimized, production-ready weights for the IndexTTS v2 text-to-speech system. The original models have been converted to Safetensors format, and the directory structure has been standardized for integration with vLLM and other inference pipelines.

🌟 Acknowledgements & Credits

Original research and weights by the Index-Team. Special thanks to Ksuriuri/index-tts-vllm for the innovative vLLM V1/V2 engine optimization, which this work builds on.

🚀 Deployment & Usage

This repository is pre-structured for IndexTTS v2 (vLLM) deployment.
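
Before loading anything, it can help to confirm that a local checkout has the expected layout. The following is a minimal sketch, assuming the six component directories named in this README sit directly under the repository root. The helper name `missing_components` is hypothetical and is not part of IndexTTS or vLLM.

```python
from pathlib import Path

# Component directories as named in this README (assumed to sit at the repo root).
EXPECTED_DIRS = (
    "gpt",
    "qwen-emo",
    "s2mel",
    "bigvgan",
    "semantic_codec",
    "w2v-bert-2.0",
)


def missing_components(repo_root):
    """Return the expected component directories that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling `missing_components(".")` from the repository root should return an empty list if every component directory is present.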

GPT Initialization

from vllm import LLM
# Load the GPT backbone from ./gpt, which holds model.safetensors and config.json
llm = LLM(model="./gpt")

Emotion Adapter Initialization

from vllm import LLM
# Load the emotion adapter from ./qwen-emo
emotion_engine = LLM(model="./qwen-emo")

📦 Component Details (Consolidated)

1. GPT Backbone (`gpt/`)

Role: Text-to-Semantic Language Model (UnifiedVoice).
Architecture: Decoder-only Transformer (~460M params).
Details: Converts phoneme sequences into discrete semantic tokens conditioned on speaker/emotion.

2. Qwen Emotion Adapter (`qwen-emo/`)

Role: Emotion feature extraction and transformation.
Architecture: Lightweight adapter based on Qwen-0.5B (~60M params).
Details: Processes reference audio to provide precise emotion conditioning signals.

3. S2Mel (`s2mel/`)

Role: Spectrogram synthesis.
Architecture: Conditional Flow Matching (CFM) / Diffusion (~120M params).
Details: Generates high-quality mel-spectrograms from semantic tokens. The legacy `flow_matching.` key prefixes have been stripped so the weights load directly.
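
For readers holding an older checkpoint, prefix stripping of this kind can be sketched as follows. This is an illustration only, not the conversion script used for this repository. The function name `strip_prefix` and the example key names are hypothetical, and plain integers stand in for tensors.

```python
def strip_prefix(state_dict, prefix="flow_matching."):
    """Return a new dict with `prefix` removed from every key that starts with it."""
    stripped = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        # Refuse to silently overwrite an entry if two keys collapse to one name.
        if new_key in stripped:
            raise KeyError(f"duplicate key after stripping: {new_key}")
        stripped[new_key] = value
    return stripped


# Toy example: key names are made up, and the values stand in for tensors.
legacy = {"flow_matching.estimator.weight": 1, "length_regulator.bias": 2}
clean = strip_prefix(legacy)
```

Keys without the prefix pass through unchanged, so the function is safe to run on a checkpoint that has already been cleaned.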

4. BigVGAN (`bigvgan/`)

Role: Neural Vocoder.
Architecture: Enhanced HiFi-GAN (~110M params).
Details: Final stage synthesis; converts mel-spectrograms into 24 kHz raw audio waveforms.

5. Semantic Codec (`semantic_codec/`)

Role: Discrete Audio Codec (DAC).
Architecture: EnCodec-style architecture (~80M params).
Details: Encodes raw audio into discrete tokens for reference voice cloning.

6. W2V-BERT 2.0 (`w2v-bert-2.0/`)

Role: Semantic Encoder.
Architecture: Conformer-based (~600M params).
Details: Extracts high-level semantic representations from raw audio for speaker and content conditioning.
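
The parameter counts listed above are approximate. One way to check them for a given file is to read the Safetensors header, which stores each tensor's shape as JSON after an 8-byte little-endian length field, so no weights need to be loaded. The sketch below assumes a single-file (unsharded) checkpoint; the function name `count_parameters` is hypothetical.

```python
import json
import struct
from math import prod


def count_parameters(path):
    """Sum tensor element counts from a .safetensors header without loading weights."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as an unsigned little-endian integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sum(
        prod(entry["shape"])
        for name, entry in header.items()
        if name != "__metadata__"
    )
```

For example, `count_parameters("gpt/model.safetensors")` would report the total element count of the GPT backbone file.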

📜 License

Distributed under the Apache-2.0 License. Refer to the original IndexTTS2 repository for additional usage terms.

