CherithCutestory committed
Commit af564a4 · 1 Parent(s): 5f714f1

Initial commit

indextts2/Dockerfile ADDED
@@ -0,0 +1,35 @@
+ FROM python:3.11-slim
+
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     build-essential \
+     libsndfile1 \
+     ffmpeg \
+     git \
+     git-lfs \
+     rubberband-cli \
+     librubberband-dev \
+     && git lfs install \
+     && rm -rf /var/lib/apt/lists/*
+
+ RUN useradd -m -u 1000 user
+ WORKDIR /app
+
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ RUN mkdir -p checkpoints && \
+     python -c "from huggingface_hub import snapshot_download; snapshot_download('IndexTeam/IndexTTS-2', local_dir='checkpoints')" \
+     || echo "Model download deferred to runtime startup"
+
+ COPY . .
+
+ RUN chown -R user:user /app
+ USER user
+
+ ENV PYTHONUNBUFFERED=1
+ ENV HF_HOME=/app/.cache/huggingface
+ ENV MODEL_DIR=/app/checkpoints
+
+ EXPOSE 7860
+
+ CMD ["sh", "-c", "OMP_NUM_THREADS=4 exec uvicorn app:app --host 0.0.0.0 --port 7860"]
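Review note: the Dockerfile tolerates a failed build-time download (`|| echo ...`), and `load_model()` in `app.py` then fetches the checkpoints at startup when `config.yaml` is missing. The check can be sketched in isolation as below. The helper name `ensure_checkpoints` and the injected `download` callable are illustrative assumptions, not part of the commit; in the real service the callable would wrap `snapshot_download("IndexTeam/IndexTTS-2", local_dir=model_dir)`.

```python
import os
from typing import Callable


def ensure_checkpoints(model_dir: str, download: Callable[[str], None]) -> bool:
    """Download checkpoints only when config.yaml is absent.

    Returns True if a download was triggered, False if the files were
    already present (for example, baked in at image build time).
    """
    cfg_path = os.path.join(model_dir, "config.yaml")
    if os.path.exists(cfg_path):
        return False
    # Hypothetical hook: the caller decides how the files are fetched.
    download(model_dir)
    return True
```

Injecting the downloader keeps the check testable without network access; whether the build or the first startup pays the download cost then depends only on whether `config.yaml` already exists.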
indextts2/README.md ADDED
@@ -0,0 +1,107 @@
+ ---
+ title: VoxLibris IndexTTS2 Engine
+ emoji: 🎙️
+ colorFrom: purple
+ colorTo: indigo
+ sdk: docker
+ app_port: 7860
+ pinned: false
+ ---
+
+ # VoxLibris IndexTTS2 Engine
+
+ A Hugging Face Space that serves [IndexTTS2](https://github.com/index-tts/index-tts)
+ as a REST API, implementing the
+ [VoxLibris TTS Engine API Contract](https://github.com/your-repo/docs/tts-api-contract.md).
+
+ ## Endpoints
+
+ ### POST /GetEngineDetails
+
+ Returns engine capabilities, supported emotions, and voice cloning support.
+
+ ### POST /ConvertTextToSpeech
+
+ Converts text to speech with zero-shot voice cloning. Requires a
+ `voice_to_clone_sample` (base64-encoded WAV). Supports 16 emotion labels mapped
+ to IndexTTS2's 8-dimensional emotion vector system.
+
+ ### GET /health
+
+ Returns model loading status.
+
+ ## Authentication
+
+ Set the `API_KEY` secret in your Hugging Face Space settings.
+ Requests must include an `Authorization: Bearer <your-key>` header.
+ Leave `API_KEY` unset to disable authentication.
+
+ ## Voice Cloning
+
+ IndexTTS2 is a zero-shot voice cloning engine, so every request requires a
+ reference voice sample. Send a base64-encoded WAV file in the
+ `voice_to_clone_sample` field. A clear 6-15 second speech sample works best.
+
+ The engine disentangles speaker timbre from emotional expression, allowing
+ the cloned voice to speak with different emotions without affecting voice
+ identity.
+
+ ## Emotion Support
+
+ IndexTTS2 uses an 8-dimensional emotion vector system (happy, angry, sad,
+ afraid, disgusted, melancholic, surprised, calm) with a fine-tuned Qwen3
+ model for emotion analysis. VoxLibris emotions are automatically mapped
+ to appropriate vector blends:
+
+ | Emotion        | Mapping Strategy                           |
+ |----------------|--------------------------------------------|
+ | neutral, calm  | High calm (0.8)                            |
+ | happy          | High happy (0.8)                           |
+ | sad            | High sad (0.8)                             |
+ | angry          | High angry (0.8)                           |
+ | fear, fearful  | High afraid (0.8)                          |
+ | disgust        | High disgusted (0.8)                       |
+ | surprise       | High surprised (0.7)                       |
+ | excited        | Happy (0.6) + surprised (0.2)              |
+ | melancholy     | Sad (0.2) + melancholic (0.6)              |
+ | anxious        | Afraid (0.5) + slight calm (0.2)           |
+ | hopeful        | Happy (0.5) + calm (0.3)                   |
+ | tender         | Happy (0.2) + calm (0.5)                   |
+ | proud          | Happy (0.5) + surprised (0.1)              |
+ | confused       | Afraid (0.2) + surprised (0.3) + calm (0.3) |
+
+ The `intensity` parameter (1-100) scales the emotion vectors, with 50 as the unscaled baseline. Additional
+ prosody reinforcement is applied via pyrubberband speed/pitch adjustments.
+
+ ## Key Features
+
+ - **Emotion-Speaker Disentanglement**: Independent control over voice timbre
+   (from reference audio) and emotional expression (from emotion vectors)
+ - **Zero-Shot Voice Cloning**: Clone any voice from a short reference audio
+ - **Duration Control**: Supports both free generation and explicit token-count
+   modes for precise audio length
+ - **Multilingual**: Chinese and English (with more languages supported)
+ - **Built-in Qwen3 Emotion Model**: Fine-tuned for text-to-emotion analysis
+
+ ## Limits
+
+ - Maximum 500 characters per request (longer text is truncated at a word boundary)
+ - Output: 22050 Hz mono 16-bit WAV
+ - Reference audio: max 15 seconds (longer clips are auto-truncated)
+
+ ## Environment Variables
+
+ | Variable    | Description                            | Default         |
+ |-------------|----------------------------------------|-----------------|
+ | `API_KEY`   | Bearer token for authentication        | (none/disabled) |
+ | `MODEL_DIR` | Path to model checkpoints directory    | `checkpoints`   |
+ | `USE_FP16`  | Enable half-precision inference        | `true`          |
+
+ ## Deployment
+
+ 1. Create a new Hugging Face Space with the **Docker** SDK
+ 2. Upload the contents of this folder
+ 3. Set the `API_KEY` secret in Space settings (optional)
+ 4. The model downloads automatically during build (~5 GB), or at first startup if the build-time download fails
+ 5. Requires GPU (A10G or better recommended for reasonable speed)
+ 6. Register the Space URL in VoxLibris Settings under TTS Engine Management
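Review note: the README's emotion table and `intensity` scaling can be checked with a small standalone sketch. It follows the arithmetic of `blend_emotion_vectors` in `app.py` for a subset of the mapping table. The names `EMOTIONS` and `blend` are illustrative, and the sketch deliberately leaves out the model's own `normalize_emo_vec` step, so the values shown are pre-normalization.

```python
# Dimension order, as listed in the README:
# happy, angry, sad, afraid, disgusted, melancholic, surprised, calm
EMOTIONS = {
    "neutral": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8],
    "happy":   [0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
    "sad":     [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.1],
    "excited": [0.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0],
}


def blend(emotion_set, intensity):
    """Average the known emotion vectors, then scale by intensity / 50."""
    if not emotion_set or emotion_set == ["neutral"]:
        return list(EMOTIONS["neutral"])
    known = [EMOTIONS[e.lower()] for e in emotion_set if e.lower() in EMOTIONS]
    if not known:
        return list(EMOTIONS["neutral"])
    # Element-wise mean across all recognised emotions.
    blended = [sum(col) / len(known) for col in zip(*known)]
    factor = intensity / 50.0  # 50 leaves the vector unscaled
    blended[:7] = [v * factor for v in blended[:7]]
    # Calm (index 7) is not scaled; it is capped by whatever is left below 1.0.
    blended[7] = min(blended[7], max(0.0, 1.0 - sum(blended[:7])))
    return blended
```

At `intensity=100` a single strong emotion exceeds 1.0 before normalization and the calm component drops to zero, which is why the service passes the result through `normalize_emo_vec` afterwards.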
indextts2/app.py ADDED
@@ -0,0 +1,457 @@
+ import os
+
+ os.environ.setdefault("OMP_NUM_THREADS", "4")
+ os.environ.setdefault("HF_HUB_CACHE", "./checkpoints/hf_cache")
+
+ import io
+ import base64
+ import tempfile
+ import logging
+ import wave
+ import numpy as np
+ import torch
+ import pyrubberband as pyrb
+ from contextlib import asynccontextmanager
+ from pathlib import Path
+ from fastapi import FastAPI, Request, HTTPException
+ from fastapi.responses import Response, JSONResponse, HTMLResponse
+ from pydantic import BaseModel, Field
+ from typing import Optional
+
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger("indextts2-engine")
+
+ BEARER_TOKEN = os.environ.get("API_KEY",
+                               "")  # no hardcoded fallback: an unset API_KEY disables auth, as the README states
+ SAMPLE_RATE = 22050
+ BIT_DEPTH = 16
+ CHANNELS = 1
+ MAX_SECONDS = 60
+ MAX_CHARS = 500
+
+ VOXLIBRIS_TO_INDEXTTS2_EMOTIONS = {  # order: happy, angry, sad, afraid, disgusted, melancholic, surprised, calm
+     "neutral": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8],
+     "happy": [0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
+     "angry": [0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
+     "sad": [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.1],
+     "fear": [0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.1],
+     "disgust": [0.0, 0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.1],
+     "melancholy": [0.0, 0.0, 0.2, 0.0, 0.0, 0.6, 0.0, 0.1],
+     "surprise": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7, 0.1],
+     "calm": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8],
+     "excited": [0.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0],
+     "anxious": [0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.2],
+     "hopeful": [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3],
+     "tender": [0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5],
+     "proud": [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.2],
+     "fearful": [0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.1],
+     "confused": [0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.3, 0.3],
+ }
+
+ EMOTION_SPEED_MAP = {
+     "neutral": 1.0,
+     "happy": 1.02,
+     "sad": 0.97,
+     "angry": 1.04,
+     "fear": 1.03,
+     "fearful": 1.03,
+     "surprise": 1.05,
+     "disgust": 0.98,
+     "excited": 1.03,
+     "calm": 0.96,
+     "confused": 0.98,
+     "anxious": 1.02,
+     "hopeful": 1.01,
+     "melancholy": 0.96,
+     "tender": 0.97,
+     "proud": 1.01,
+ }
+
+ EMOTION_PITCH_MAP = {
+     "neutral": 0.0,
+     "happy": 0.5,
+     "sad": -0.3,
+     "angry": -0.2,
+     "fear": 0.3,
+     "fearful": 0.3,
+     "surprise": 0.6,
+     "disgust": -0.2,
+     "excited": 0.7,
+     "calm": -0.1,
+     "confused": 0.2,
+     "anxious": 0.3,
+     "hopeful": 0.3,
+     "melancholy": -0.4,
+     "tender": -0.1,
+     "proud": 0.2,
+ }
+
+ CANONICAL_EMOTIONS = [
+     "neutral",
+     "happy",
+     "sad",
+     "angry",
+     "fear",
+     "surprise",
+     "disgust",
+     "excited",
+     "calm",
+     "anxious",
+     "hopeful",
+     "melancholy",
+     "tender",
+     "proud",
+     "fearful",
+     "confused",
+ ]
+
+ tts_model = None
+
+
+ def load_model():
+     global tts_model
+     from indextts.infer_v2 import IndexTTS2
+
+     model_dir = os.environ.get("MODEL_DIR", "checkpoints")
+     cfg_path = os.path.join(model_dir, "config.yaml")
+
+     if not os.path.exists(cfg_path):
+         logger.info(
+             "Model not found locally, downloading IndexTeam/IndexTTS-2...")
+         from huggingface_hub import snapshot_download
+         snapshot_download("IndexTeam/IndexTTS-2", local_dir=model_dir)
+         logger.info("Model download complete.")
+     use_fp16 = os.environ.get("USE_FP16",
+                               "true").lower() in ("true", "1", "yes")
+
+     device = None
+     if torch.cuda.is_available():
+         device = "cuda:0"
+     elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+         device = "mps"
+         use_fp16 = False
+     else:
+         device = "cpu"
+         use_fp16 = False
+
+     logger.info(
+         f"Loading IndexTTS2 model from {model_dir} on {device} (fp16={use_fp16})..."
+     )
+     tts_model = IndexTTS2(
+         cfg_path=cfg_path,
+         model_dir=model_dir,
+         use_fp16=use_fp16,
+         device=device,
+     )
+     logger.info("IndexTTS2 model loaded successfully.")
+
+
+ @asynccontextmanager
+ async def lifespan(app: FastAPI):
+     load_model()
+     yield
+
+
+ app = FastAPI(title="IndexTTS2 Engine", lifespan=lifespan)
+
+
+ def verify_auth(request: Request):
+     if not BEARER_TOKEN:
+         return
+     auth = request.headers.get("Authorization", "")
+     if auth != f"Bearer {BEARER_TOKEN}":
+         raise HTTPException(status_code=401, detail="Unauthorized")
+
+
+ def numpy_to_wav_bytes(audio_np: np.ndarray, sample_rate: int) -> bytes:
+     audio_np = np.clip(audio_np, -1.0, 1.0)
+     audio_int16 = (audio_np * 32767).astype(np.int16)
+
+     buf = io.BytesIO()
+     with wave.open(buf, "wb") as wf:
+         wf.setnchannels(CHANNELS)
+         wf.setsampwidth(2)
+         wf.setframerate(sample_rate)
+         wf.writeframes(audio_int16.tobytes())
+     return buf.getvalue()
+
+
+ def blend_emotion_vectors(emotion_set: list[str],
+                           intensity: int) -> list[float]:
+     intensity_factor = intensity / 50.0  # 50 is the unscaled baseline
+
+     if not emotion_set or emotion_set == ["neutral"]:
+         base = VOXLIBRIS_TO_INDEXTTS2_EMOTIONS.get("neutral",
+                                                    [0.0] * 7 + [0.8])
+         return list(base)
+
+     blended = [0.0] * 8
+     count = 0
+     for emo in emotion_set:
+         emo_lower = emo.lower()
+         vec = VOXLIBRIS_TO_INDEXTTS2_EMOTIONS.get(emo_lower)
+         if vec:
+             for i in range(8):
+                 blended[i] += vec[i]
+             count += 1
+
+     if count == 0:
+         return list(VOXLIBRIS_TO_INDEXTTS2_EMOTIONS["neutral"])
+
+     blended = [v / count for v in blended]
+
+     for i in range(7):
+         blended[i] = blended[i] * intensity_factor
+
+     calm_remaining = max(0.0, 1.0 - sum(blended[:7]))
+     blended[7] = min(blended[7], calm_remaining)
+
+     return blended
+
+
+ class ConvertRequest(BaseModel):
+     input_text: str
+     builtin_voice_id: Optional[str] = None
+     voice_to_clone_sample: Optional[str] = None
+     random_seed: Optional[int] = None
+     emotion_set: list[str] = Field(default_factory=lambda: ["neutral"])
+     intensity: int = Field(default=50, ge=1, le=100)
+     volume: int = Field(default=75, ge=1, le=100)
+     speed_adjust: float = Field(default=0.0, ge=-5.0, le=5.0)
+     pitch_adjust: float = Field(default=0.0, ge=-5.0, le=5.0)
+
+
+ @app.post("/GetEngineDetails")
+ async def get_engine_details(request: Request):
+     verify_auth(request)
+
+     return {
+         "engine_id": "indextts2",
+         "engine_name": "IndexTTS2",
+         "sample_rate": SAMPLE_RATE,
+         "bit_depth": BIT_DEPTH,
+         "channels": CHANNELS,
+         "max_seconds_per_conversion": MAX_SECONDS,
+         "supports_voice_cloning": True,
+         "builtin_voices": [],
+         "supported_emotions": CANONICAL_EMOTIONS,
+         "extra_properties": {
+             "model":
+             "IndexTeam/IndexTTS-2",
+             "max_characters":
+             MAX_CHARS,
+             "emotion_control":
+             "8-dimensional emotion vectors via fine-tuned Qwen3",
+             "features": [
+                 "zero-shot voice cloning",
+                 "emotion-speaker disentanglement",
+                 "duration control",
+                 "multilingual (Chinese, English)",
+             ],
+         }
+     }
+
+
+ @app.post("/ConvertTextToSpeech")
+ async def convert_text_to_speech(request: Request):
+     verify_auth(request)
+
+     try:
+         body = await request.json()
+         req = ConvertRequest(**body)
+     except Exception as e:
+         return JSONResponse(status_code=400,
+                             content={
+                                 "error": str(e),
+                                 "error_code": "INVALID_REQUEST"
+                             })
+
+     if not req.input_text.strip():
+         return JSONResponse(status_code=400,
+                             content={
+                                 "error": "Input text is empty",
+                                 "error_code": "INVALID_REQUEST"
+                             })
+
+     if not req.voice_to_clone_sample:
+         return JSONResponse(
+             status_code=400,
+             content={
+                 "error": "IndexTTS2 requires a voice sample for cloning. "
+                 "Please provide a voice_to_clone_sample.",
+                 "error_code": "INVALID_REQUEST"
+             })
+
+     if req.random_seed is not None and req.random_seed > 0:
+         torch.manual_seed(req.random_seed)
+         if torch.cuda.is_available():
+             torch.cuda.manual_seed(req.random_seed)
+
+     temp_files = []
+
+     try:
+         try:
+             wav_bytes = base64.b64decode(req.voice_to_clone_sample,
+                                          validate=True)
+         except Exception:
+             return JSONResponse(
+                 status_code=400,
+                 content={
+                     "error": "Invalid voice_to_clone_sample: not valid base64",
+                     "error_code": "INVALID_REQUEST"
+                 })
+
+         if len(wav_bytes) < 44:  # shorter than a canonical 44-byte WAV header
+             return JSONResponse(
+                 status_code=400,
+                 content={
+                     "error":
+                     "Invalid voice_to_clone_sample: file too small to be valid audio",
+                     "error_code": "INVALID_REQUEST"
+                 })
+
+         tmp_voice = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
+         tmp_voice.write(wav_bytes)
+         tmp_voice.close()
+         speaker_wav_path = tmp_voice.name
+         temp_files.append(tmp_voice.name)
+
+         tmp_out = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
+         tmp_out.close()
+         output_path = tmp_out.name
+         temp_files.append(tmp_out.name)
+
+         text = req.input_text.strip()
+         if len(text) > MAX_CHARS:
+             truncated = text[:MAX_CHARS]
+             last_space = truncated.rfind(' ')
+             if last_space > MAX_CHARS * 0.6:
+                 truncated = truncated[:last_space]
+             text = truncated
+             logger.warning(f"Text truncated to {len(text)} characters")
+
+         if text and text[-1] not in '.!?;:。!?;:':
+             text += '.'
+
+         dominant_emotion = req.emotion_set[0].lower(
+         ) if req.emotion_set else "neutral"
+         emo_vector = blend_emotion_vectors(req.emotion_set, req.intensity)
+         emo_vector = tts_model.normalize_emo_vec(emo_vector, apply_bias=True)
+
+         emotion_speed = EMOTION_SPEED_MAP.get(dominant_emotion, 1.0)
+         emotion_pitch = EMOTION_PITCH_MAP.get(dominant_emotion, 0.0)
+
+         intensity_factor = req.intensity / 50.0
+         emotion_speed = 1.0 + (emotion_speed - 1.0) * intensity_factor
+         emotion_pitch = emotion_pitch * intensity_factor
+
+         is_neutral = all(e.lower() in ("neutral", "calm")
+                          for e in req.emotion_set)
+
+         logger.info(f"Generating with IndexTTS2: emotions={req.emotion_set}, "
+                     f"emo_vector={[f'{v:.2f}' for v in emo_vector]}, "
+                     f"intensity={req.intensity}, text_len={len(text)}, "
+                     f"is_neutral={is_neutral}")
+
+         kwargs = {
+             "spk_audio_prompt": speaker_wav_path,
+             "text": text,
+             "output_path": output_path,
+             "verbose": False,
+         }
+
+         if not is_neutral:
+             kwargs["emo_vector"] = emo_vector
+
+         tts_model.infer(**kwargs)
+
+         if not os.path.exists(output_path) or os.path.getsize(
+                 output_path) == 0:
+             return JSONResponse(status_code=500,
+                                 content={
+                                     "error": "IndexTTS2 produced no output",
+                                     "error_code": "GENERATION_FAILED"
+                                 })
+
+         import torchaudio
+         wav_tensor, sr = torchaudio.load(output_path)
+         audio_np = wav_tensor.squeeze().numpy().astype(np.float32)
+
+         if sr != SAMPLE_RATE:
+             import librosa
+             audio_np = librosa.resample(audio_np,
+                                         orig_sr=sr,
+                                         target_sr=SAMPLE_RATE)
+
+         peak = np.max(np.abs(audio_np))
+         if peak > 0:
+             audio_np = audio_np / peak
+
+         speed_factor = emotion_speed
+         if req.speed_adjust != 0.0:
+             user_speed = 1.0 + (req.speed_adjust / 100.0)
+             speed_factor = speed_factor * user_speed
+         speed_factor = max(0.5, min(2.0, speed_factor))
+         if abs(speed_factor - 1.0) > 0.01:
+             audio_np = pyrb.time_stretch(audio_np, SAMPLE_RATE, speed_factor)
+
+         total_pitch = emotion_pitch
+         if req.pitch_adjust != 0.0:
+             total_pitch += req.pitch_adjust * 0.24
+         if abs(total_pitch) > 0.01:
+             audio_np = pyrb.pitch_shift(audio_np, SAMPLE_RATE, total_pitch)
+
+         vol_factor = req.volume / 75.0
+         audio_np = audio_np * vol_factor
+
+         wav_bytes_out = numpy_to_wav_bytes(audio_np, SAMPLE_RATE)
+
+         return Response(content=wav_bytes_out, media_type="audio/wav")
+
+     except Exception as e:
+         logger.exception("TTS generation failed")
+         return JSONResponse(status_code=500,
+                             content={
+                                 "error": "Audio generation failed",
+                                 "error_code": "GENERATION_FAILED",
+                                 "details": str(e)
+                             })
+     finally:
+         for f in temp_files:
+             try:
+                 os.unlink(f)
+             except OSError:
+                 pass
+
+
+ @app.get("/", response_class=HTMLResponse)
+ async def root():
+     html_path = Path(__file__).parent / "index.html"
+     if html_path.exists():
+         return HTMLResponse(content=html_path.read_text())
+     return HTMLResponse(content="""
+     <html>
+     <head><title>IndexTTS2 Engine</title></head>
+     <body style="font-family: sans-serif; max-width: 800px; margin: 40px auto; padding: 20px;">
+     <h1>IndexTTS2 Engine</h1>
+     <p>VoxLibris-compatible TTS engine powered by
+     <a href="https://github.com/index-tts/index-tts">IndexTTS2</a>.</p>
+     <h2>Endpoints</h2>
+     <ul>
+     <li><code>POST /GetEngineDetails</code> - Get engine capabilities</li>
+     <li><code>POST /ConvertTextToSpeech</code> - Convert text to speech</li>
+     <li><code>GET /health</code> - Health check</li>
+     </ul>
+     </body>
+     </html>
+     """)
+
+
+ @app.get("/health")
+ async def health():
+     return {"status": "ok", "model_loaded": tts_model is not None}
+
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=7860)
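Review note: two pieces of preprocessing in `convert_text_to_speech` are easy to verify in isolation: the word-boundary truncation and the intensity-scaled speed factor. The sketch below restates both as pure functions. The helper names `truncate_text` and `speed_factor` are illustrative and not part of the commit, and the terminator set assumes the source includes full-width CJK punctuation.

```python
MAX_CHARS = 500
# Assumed terminator set: ASCII marks plus full-width CJK equivalents.
TERMINATORS = ".!?;:。!?;:"


def truncate_text(text: str, max_chars: int = MAX_CHARS) -> str:
    """Trim to max_chars, preferring a word boundary, then ensure a terminator."""
    text = text.strip()
    if len(text) > max_chars:
        truncated = text[:max_chars]
        last_space = truncated.rfind(" ")
        # Cut at the last space only if that keeps more than 60% of the budget.
        if last_space > max_chars * 0.6:
            truncated = truncated[:last_space]
        text = truncated
    if text and text[-1] not in TERMINATORS:
        text += "."
    return text


def speed_factor(emotion_speed: float, intensity: int,
                 speed_adjust: float = 0.0) -> float:
    """Scale the emotion's speed offset by intensity / 50, apply the user
    adjustment as a percentage, and clamp to the range [0.5, 2.0]."""
    scaled = 1.0 + (emotion_speed - 1.0) * (intensity / 50.0)
    if speed_adjust != 0.0:
        scaled *= 1.0 + speed_adjust / 100.0
    return max(0.5, min(2.0, scaled))
```

With the shipped maps, the largest emotion offset is 1.05 (`surprise`), so even at `intensity=100` and `speed_adjust=5.0` the factor stays near 1.155 and the clamp only guards against future table changes.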
indextts2/index.html ADDED
@@ -0,0 +1,256 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+   <meta charset="UTF-8">
+   <meta name="viewport" content="width=device-width, initial-scale=1.0">
+   <title>IndexTTS2 - Test Console</title>
+   <style>
+     *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
+     body {
+       font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
+       background: #0f0d1a;
+       color: #e2e0eb;
+       min-height: 100vh;
+       padding: 2rem;
+     }
+     .container { max-width: 720px; margin: 0 auto; }
+     h1 {
+       font-size: 1.75rem;
+       font-weight: 700;
+       background: linear-gradient(135deg, #a78bfa, #7c3aed);
+       -webkit-background-clip: text;
+       -webkit-text-fill-color: transparent;
+       margin-bottom: 0.25rem;
+     }
+     .subtitle { color: #9490a8; font-size: 0.875rem; margin-bottom: 2rem; }
+     .card {
+       background: #1a1726;
+       border: 1px solid #2d2a3a;
+       border-radius: 12px;
+       padding: 1.5rem;
+       margin-bottom: 1.25rem;
+     }
+     .card-title {
+       font-size: 0.8rem;
+       font-weight: 600;
+       text-transform: uppercase;
+       letter-spacing: 0.05em;
+       color: #a78bfa;
+       margin-bottom: 1rem;
+     }
+     label {
+       display: block;
+       font-size: 0.8rem;
+       font-weight: 500;
+       color: #b0adc0;
+       margin-bottom: 0.35rem;
+     }
+     textarea, input[type="text"], input[type="number"], select {
+       width: 100%;
+       background: #12101e;
+       border: 1px solid #2d2a3a;
+       border-radius: 8px;
+       color: #e2e0eb;
+       padding: 0.6rem 0.75rem;
+       font-size: 0.875rem;
+       margin-bottom: 1rem;
+       outline: none;
+       transition: border-color 0.2s;
+     }
+     textarea:focus, input:focus, select:focus {
+       border-color: #7c3aed;
+     }
+     textarea { resize: vertical; min-height: 80px; }
+     .row { display: flex; gap: 1rem; }
+     .row > * { flex: 1; }
+     button.primary {
+       width: 100%;
+       padding: 0.75rem;
+       background: linear-gradient(135deg, #7c3aed, #6d28d9);
+       color: white;
+       border: none;
+       border-radius: 8px;
+       font-size: 0.95rem;
+       font-weight: 600;
+       cursor: pointer;
+       transition: opacity 0.2s;
+     }
+     button.primary:hover { opacity: 0.9; }
+     button.primary:disabled { opacity: 0.5; cursor: not-allowed; }
+     #status {
+       margin-top: 1rem;
+       padding: 0.75rem;
+       border-radius: 8px;
+       font-size: 0.85rem;
+       display: none;
+     }
+     #status.error { display: block; background: #2d1520; border: 1px solid #5c2338; color: #f87171; }
+     #status.success { display: block; background: #152d1a; border: 1px solid #235c2d; color: #4ade80; }
+     #status.loading { display: block; background: #1a1726; border: 1px solid #2d2a3a; color: #a78bfa; }
+     #audioResult { margin-top: 1rem; display: none; }
+     #audioResult audio { width: 100%; margin-top: 0.5rem; }
+     .info {
+       font-size: 0.75rem;
+       color: #706d82;
+       margin-top: -0.5rem;
+       margin-bottom: 1rem;
+     }
+   </style>
+ </head>
+ <body>
+   <div class="container">
+     <h1>IndexTTS2</h1>
+     <p class="subtitle">Emotionally expressive zero-shot voice cloning TTS &mdash; Test Console</p>
+
+     <div class="card">
+       <div class="card-title">Voice Reference</div>
+       <label for="voiceFile">Upload reference audio (WAV, 6-15 seconds recommended)</label>
+       <input type="file" id="voiceFile" accept="audio/*" style="margin-bottom:1rem">
+       <p class="info">IndexTTS2 clones the timbre from your reference audio for zero-shot voice synthesis.</p>
+     </div>
+
+     <div class="card">
+       <div class="card-title">Text &amp; Emotion</div>
+       <label for="inputText">Text to synthesize</label>
+       <textarea id="inputText" rows="4" placeholder="Enter text to convert to speech..."></textarea>
+
+       <label for="emotion">Emotion</label>
+       <select id="emotion">
+         <option value="neutral" selected>Neutral</option>
+         <option value="happy">Happy</option>
+         <option value="sad">Sad</option>
+         <option value="angry">Angry</option>
+         <option value="fear">Fear</option>
+         <option value="surprise">Surprise</option>
+         <option value="disgust">Disgust</option>
+         <option value="excited">Excited</option>
+         <option value="calm">Calm</option>
+         <option value="anxious">Anxious</option>
+         <option value="hopeful">Hopeful</option>
+         <option value="melancholy">Melancholy</option>
+         <option value="tender">Tender</option>
+         <option value="proud">Proud</option>
+       </select>
+
+       <div class="row">
+         <div>
+           <label for="intensity">Intensity (1-100)</label>
+           <input type="number" id="intensity" value="50" min="1" max="100">
+         </div>
+         <div>
+           <label for="volume">Volume (1-100)</label>
+           <input type="number" id="volume" value="75" min="1" max="100">
+         </div>
+       </div>
+
+       <div class="row">
+         <div>
+           <label for="speed">Speed adjust</label>
+           <input type="number" id="speed" value="0" min="-5" max="5" step="0.1">
+         </div>
+         <div>
+           <label for="pitch">Pitch adjust</label>
+           <input type="number" id="pitch" value="0" min="-5" max="5" step="0.1">
+         </div>
+       </div>
+     </div>
+
+     <div class="card">
+       <div class="card-title">Authentication</div>
+       <label for="apiKey">API Key (if set on server)</label>
+       <input type="text" id="apiKey" placeholder="Leave empty if no auth required">
+     </div>
+
+     <button class="primary" id="generateBtn" onclick="generate()">Generate Speech</button>
+
+     <div id="status"></div>
+     <div id="audioResult">
+       <audio id="audioPlayer" controls></audio>
+     </div>
+   </div>
+
+   <script>
+     async function fileToBase64(file) {
+       return new Promise((resolve, reject) => {
+         const reader = new FileReader();
+         reader.onload = () => {
+           const base64 = reader.result.split(',')[1];
+           resolve(base64);
+         };
+         reader.onerror = reject;
+         reader.readAsDataURL(file);
+       });
+     }
+
+     async function generate() {
+       const status = document.getElementById('status');
+       const btn = document.getElementById('generateBtn');
+       const audioResult = document.getElementById('audioResult');
+       const audioPlayer = document.getElementById('audioPlayer');
+
+       const voiceFile = document.getElementById('voiceFile').files[0];
+       const text = document.getElementById('inputText').value.trim();
+       const emotion = document.getElementById('emotion').value;
+       const intensity = parseInt(document.getElementById('intensity').value);
+       const volume = parseInt(document.getElementById('volume').value);
+       const speed = parseFloat(document.getElementById('speed').value);
+       const pitch = parseFloat(document.getElementById('pitch').value);
+       const apiKey = document.getElementById('apiKey').value.trim();
+
+       if (!voiceFile) {
+         status.className = 'error';
+         status.textContent = 'Please upload a reference voice audio file.';
+         return;
+       }
+       if (!text) {
+         status.className = 'error';
+         status.textContent = 'Please enter text to synthesize.';
+         return;
+       }
+
+       btn.disabled = true;
+       status.className = 'loading';
+       status.textContent = 'Generating speech... (this may take a moment)';
+       audioResult.style.display = 'none';
+
+       try {
+         const voiceBase64 = await fileToBase64(voiceFile);
+
+         const headers = { 'Content-Type': 'application/json' };
+         if (apiKey) headers['Authorization'] = `Bearer ${apiKey}`;
+
+         const resp = await fetch('/ConvertTextToSpeech', {
+           method: 'POST',
+           headers,
+           body: JSON.stringify({
+             input_text: text,
+             voice_to_clone_sample: voiceBase64,
+             emotion_set: [emotion],
+             intensity,
+             volume,
+             speed_adjust: speed,
+             pitch_adjust: pitch,
+           }),
+         });
+
+         if (!resp.ok) {
+           const err = await resp.json();
+           throw new Error(err.error || `HTTP ${resp.status}`);
+         }
+
+         const blob = await resp.blob();
+         const url = URL.createObjectURL(blob);
+         audioPlayer.src = url;
+         audioResult.style.display = 'block';
+         status.className = 'success';
+         status.textContent = 'Speech generated successfully!';
+       } catch (e) {
+         status.className = 'error';
+         status.textContent = `Error: ${e.message}`;
+       } finally {
+         btn.disabled = false;
+       }
+     }
+   </script>
+ </body>
+ </html>
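Review note: the request that the test console sends can be reproduced outside the browser. The sketch below builds the same JSON body with the standard library only. The function name `build_tts_request`, the base URL, and the key are placeholders chosen for illustration; nothing is sent over the network until the returned object is passed to `urllib.request.urlopen`.

```python
import base64
import json
import urllib.request


def build_tts_request(base_url, wav_bytes, text, emotion="neutral",
                      intensity=50, volume=75, api_key=None):
    """Build a POST request for /ConvertTextToSpeech without sending it."""
    payload = {
        "input_text": text,
        # The service expects the reference WAV as a base64 string.
        "voice_to_clone_sample": base64.b64encode(wav_bytes).decode("ascii"),
        "emotion_set": [emotion],
        "intensity": intensity,
        "volume": volume,
        "speed_adjust": 0.0,
        "pitch_adjust": 0.0,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the Space has an API_KEY secret configured.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/ConvertTextToSpeech",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

A successful response body is the raw `audio/wav` payload, so a caller would typically read it from `urllib.request.urlopen(req)` and write the bytes straight to a `.wav` file.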
indextts2/requirements.txt ADDED
@@ -0,0 +1,11 @@
+ indextts>=2.0.0
+ torch>=2.0.0
+ torchaudio>=2.0.0
+ fastapi>=0.104.0
+ uvicorn[standard]>=0.24.0
+ numpy
+ pydantic>=2.0.0
+ pyrubberband>=0.3.0
+ soundfile>=0.12.0
+ librosa>=0.10.0
+ huggingface_hub>=0.20.0