IndexTTS2-Optimized (Safetensors)

This repository provides optimized weights for the IndexTTS v2 text-to-speech system, packaged for production use. The original models have been converted to Safetensors format, and the directory structure has been standardized so that each component loads directly in vLLM and other modern inference pipelines.

🌟 Acknowledgements & Credits

Original research and weights by the Index-Team. Special thanks to Ksuriuri/index-tts-vllm for the vLLM V1/V2 engine optimization on which this work builds.

🚀 Deployment & Usage

This repository is pre-structured for IndexTTS v2 (vLLM) deployment.

GPT Initialization

from vllm import LLM
# The ./gpt directory holds model.safetensors and config.json
llm = LLM(model="./gpt")

Emotion Adapter Initialization

emotion_engine = LLM(model="./qwen-emo")
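
Before initializing either engine, it can help to confirm that the local checkout contains the component directories this README describes. The following is a minimal sketch, not part of the repository: the helper name `missing_components` is hypothetical, and the directory names are taken from the component list in this README.

```python
from pathlib import Path

# Directory names as listed in the component details of this README.
EXPECTED_COMPONENTS = (
    "gpt",
    "qwen-emo",
    "s2mel",
    "bigvgan",
    "semantic_codec",
    "w2v-bert-2.0",
)


def missing_components(repo_root: str) -> list[str]:
    """Return the expected component directories that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_COMPONENTS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_components(".")
    if missing:
        print("Missing component directories:", ", ".join(missing))
    else:
        print("All component directories found.")
```

Running the check from the repository root reports any directory that was not downloaded, which is usually easier to diagnose than a loader error later on.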

📦 Component Details (Consolidated)

1. GPT Backbone (gpt/)

  • Role: Text-to-Semantic Language Model (UnifiedVoice).
  • Architecture: Decoder-only Transformer (~460M params).
  • Details: Converts phoneme sequences into discrete semantic tokens conditioned on speaker/emotion.

2. Qwen Emotion Adapter (qwen-emo/)

  • Role: Emotion feature extraction and transformation.
  • Architecture: Lightweight adapter based on Qwen-0.5B (~60M params).
  • Details: Processes reference audio to provide precise emotion conditioning signals.

3. S2Mel (s2mel/)

  • Role: Spectrogram synthesis.
  • Architecture: Conditional Flow Matching (CFM) / Diffusion (~120M params).
  • Details: Generates high-quality mel-spectrograms from semantic tokens. The legacy "flow_matching." prefix has been stripped from the weight keys so the checkpoint loads directly.
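
The prefix stripping mentioned above can be illustrated with a short sketch. This is not the conversion script used for this repository. The function name `strip_prefix` and the example key names are hypothetical placeholders, and the real tensor names in the checkpoint may differ.

```python
def strip_prefix(state_dict: dict, prefix: str = "flow_matching.") -> dict:
    """Return a copy of state_dict with `prefix` removed from keys that start with it.

    Keys without the prefix are kept unchanged. A collision after stripping
    raises KeyError instead of silently overwriting a tensor.
    """
    stripped = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in stripped:
            raise KeyError(f"duplicate key after stripping prefix: {new_key}")
        stripped[new_key] = value
    return stripped


# Placeholder keys and values for illustration only.
legacy = {"flow_matching.layer.weight": 1, "other.bias": 2}
cleaned = strip_prefix(legacy)
print(sorted(cleaned))
```

Because the weights here already ship with the prefix removed, a step like this is only relevant when comparing against, or converting from, a legacy checkpoint.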

4. BigVGAN (bigvgan/)

  • Role: Neural Vocoder.
  • Architecture: Enhanced HiFi-GAN (~110M params).
  • Details: Final stage synthesis; converts mel-spectrograms into 24 kHz raw audio waveforms.

5. Semantic Codec (semantic_codec/)

  • Role: Discrete Audio Codec (DAC).
  • Architecture: EnCodec-style architecture (~80M params).
  • Details: Encodes raw audio into discrete tokens for reference voice cloning.

6. W2V-BERT 2.0 (w2v-bert-2.0/)

  • Role: Semantic Encoder.
  • Architecture: Conformer-based (~600M params).
  • Details: Extracts high-level semantic representations from raw audio for speaker and content conditioning.
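
As a rough planning aid, the approximate parameter counts listed above can be summed to estimate the memory needed for the weights alone. The sketch below assumes 2 bytes per parameter (fp16 or bf16), which this README does not state. It ignores activations, the KV cache, and framework overhead, so actual usage will be higher.

```python
# Approximate parameter counts (in millions) as listed in this README.
PARAMS_MILLIONS = {
    "gpt": 460,
    "qwen-emo": 60,
    "s2mel": 120,
    "bigvgan": 110,
    "semantic_codec": 80,
    "w2v-bert-2.0": 600,
}


def weight_footprint_gib(params_millions: float, bytes_per_param: int = 2) -> float:
    """Estimate weight storage in GiB; bytes_per_param=2 assumes fp16/bf16."""
    return params_millions * 1e6 * bytes_per_param / 2**30


total_millions = sum(PARAMS_MILLIONS.values())
print(f"Total parameters: ~{total_millions}M")
print(f"Weights only, 2 bytes/param: ~{weight_footprint_gib(total_millions):.2f} GiB")
```

Passing `bytes_per_param=4` gives the corresponding fp32 estimate.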

📜 License

Distributed under the Apache-2.0 License. Refer to the original IndexTTS2 repository for additional usage terms.
