IndexTTS2-Optimized (Safetensors)
This repository provides optimized weights for the IndexTTS v2 text-to-speech system. The original models have been converted to Safetensors format, and the directory structure has been standardized so that each component loads from its own sub-directory in vLLM and other inference pipelines.
Acknowledgements & Credits
Original research and weights by the Index-Team. Special thanks to Ksuriuri/index-tts-vllm for the innovative vLLM V1/V2 engine optimization which this work builds upon.
Deployment & Usage
This repository is pre-structured for IndexTTS v2 (vLLM) deployment.
GPT Initialization
from vllm import LLM
# Points to the gpt/ directory, which holds model.safetensors and config.json
llm = LLM(model="./gpt")
Emotion Adapter Initialization
emotion_engine = LLM(model="./qwen-emo")
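Before loading, it can help to confirm that the checkout has the layout this README describes. The following is a minimal sketch using only the standard library. The helper names `missing_components` and `gpt_files_present` are hypothetical, and the only file names checked are the two mentioned in the GPT snippet above; the contents of the other directories are not assumed.

```python
from pathlib import Path

# Sub-directories named in this README.
EXPECTED_DIRS = [
    "gpt",
    "qwen-emo",
    "s2mel",
    "bigvgan",
    "semantic_codec",
    "w2v-bert-2.0",
]


def missing_components(root):
    """Return the expected sub-directories that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


def gpt_files_present(root):
    """Check for the two files the GPT loading snippet relies on."""
    gpt_dir = Path(root) / "gpt"
    return all(
        (gpt_dir / name).is_file()
        for name in ("model.safetensors", "config.json")
    )


if __name__ == "__main__":
    missing = missing_components(".")
    if missing:
        print("Missing component directories:", ", ".join(missing))
    elif not gpt_files_present("."):
        print("gpt/ is present but model.safetensors or config.json is missing")
    else:
        print("Layout looks complete")
```

Run it from the repository root; it only inspects the file system and does not import vLLM.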
Component Details (Consolidated)
1. GPT Backbone (gpt/)
- Role: Text-to-Semantic Language Model (UnifiedVoice).
- Architecture: Decoder-only Transformer (~460M params).
- Details: Converts phoneme sequences into discrete semantic tokens conditioned on speaker/emotion.
2. Qwen Emotion Adapter (qwen-emo/)
- Role: Emotion feature extraction and transformation.
- Architecture: Lightweight adapter based on Qwen-0.5B (~60M params).
- Details: Processes reference audio to provide precise emotion conditioning signals.
3. S2Mel (s2mel/)
- Role: Spectrogram synthesis.
- Architecture: Conditional Flow Matching (CFM) / Diffusion (~120M params).
- Details: Generates high-quality mel-spectrograms from semantic tokens. The legacy flow_matching. key prefixes have been stripped so the weights load directly.
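The prefix stripping mentioned for S2Mel can be sketched as follows. This is an illustration, not the script used to build this repository: `strip_prefix` and `safetensors_keys` are hypothetical helpers, and the key names in the example are made up. The second helper reads only the JSON header of a `.safetensors` file (an 8-byte little-endian length followed by the header), so tensor names can be checked without loading any weights.

```python
import json
import struct


def strip_prefix(state_dict, prefix="flow_matching."):
    """Return a new dict with `prefix` removed from keys that start with it."""
    stripped = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in stripped:
            raise KeyError(f"duplicate key after stripping: {new_key}")
        stripped[new_key] = value
    return stripped


def safetensors_keys(path):
    """List tensor names from a .safetensors header without loading weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional non-tensor entry in the header.
    return [name for name in header if name != "__metadata__"]


if __name__ == "__main__":
    # Hypothetical key names, for illustration only.
    legacy = {"flow_matching.estimator.weight": 1, "length_regulator.bias": 2}
    print(strip_prefix(legacy))
```

To confirm that a downloaded checkpoint carries no legacy keys, pass its path to `safetensors_keys` and check that no name starts with `flow_matching.`.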
4. BigVGAN (bigvgan/)
- Role: Neural Vocoder.
- Architecture: Enhanced HiFi-GAN (~110M params).
- Details: Final stage synthesis; converts mel-spectrograms into 24 kHz raw audio waveforms.
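Because the vocoder output is a raw waveform, it has to be written with the correct sample rate to play back at the right speed. The sketch below saves a waveform as a 24 kHz WAV file using only the standard library. It assumes mono float samples in the range [-1.0, 1.0], which this README does not specify, and `write_wav` is a hypothetical helper.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # output rate stated in this README


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against overshoot
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


if __name__ == "__main__":
    # Stand-in signal: one second of a 440 Hz tone, not real vocoder output.
    tone = [
        0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)
    ]
    write_wav("tone_24k.wav", tone)
```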
5. Semantic Codec (semantic_codec/)
- Role: Discrete Audio Codec (DAC).
- Architecture: EnCodec-style architecture (~80M params).
- Details: Encodes raw audio into discrete tokens for reference voice cloning.
6. W2V-BERT 2.0 (w2v-bert-2.0/)
- Role: Semantic Encoder.
- Architecture: Conformer-based (~600M params).
- Details: Extracts high-level semantic representations from raw audio for speaker and content conditioning.
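The six components can be summarized in a small table to estimate the combined footprint. The parameter counts below are the approximate figures listed above. The `SYNTHESIS_PATH` ordering is inferred from the role descriptions (semantic tokens, then mel-spectrogram, then waveform) and is not a documented API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    directory: str
    role: str
    params_m: int  # approximate size in millions, as listed above


COMPONENTS = [
    Component("gpt", "Text-to-semantic language model", 460),
    Component("qwen-emo", "Emotion adapter", 60),
    Component("s2mel", "Spectrogram synthesis", 120),
    Component("bigvgan", "Neural vocoder", 110),
    Component("semantic_codec", "Discrete audio codec", 80),
    Component("w2v-bert-2.0", "Semantic encoder", 600),
]

# Inferred order of the generation stages; the remaining components
# supply conditioning from the reference audio.
SYNTHESIS_PATH = ["gpt", "s2mel", "bigvgan"]

total = sum(c.params_m for c in COMPONENTS)
print(f"~{total}M parameters in total")  # ~1430M parameters in total

for c in sorted(COMPONENTS, key=lambda c: c.params_m, reverse=True):
    print(f"{c.directory:<16}{c.params_m:>5}M  {c.role}")
```

At roughly 1.43B parameters combined, the semantic encoder and the GPT backbone account for about three quarters of the total.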
License
Distributed under the Apache-2.0 License. Refer to the original IndexTTS2 repository for additional usage terms.
Base model: ksuriuri/IndexTTS-2-vLLM