Spaces: CherithCutestory/vlengine-indextts2

CherithCutestory committed on Mar 11

Commit

af564a4

1 Parent(s): 5f714f1

Initial commit

Files changed (5)

indextts2/Dockerfile +35 -0
indextts2/README.md +107 -0
indextts2/app.py +457 -0
indextts2/index.html +256 -0
indextts2/requirements.txt +11 -0

indextts2/Dockerfile ADDED

	@@ -0,0 +1,35 @@

+FROM python:3.11-slim
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
+    libsndfile1 \
+    ffmpeg \
+    git \
+    git-lfs \
+    rubberband-cli \
+    librubberband-dev \
+    && git lfs install \
+    && rm -rf /var/lib/apt/lists/*
+RUN useradd -m -u 1000 user
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+RUN mkdir -p checkpoints && \
+    python -c "from huggingface_hub import snapshot_download; snapshot_download('IndexTeam/IndexTTS-2', local_dir='checkpoints')" \
+    || echo "Model download deferred to runtime startup"
+COPY . .
+RUN chown -R user:user /app
+USER user
+ENV PYTHONUNBUFFERED=1
+ENV HF_HOME=/app/.cache/huggingface
+ENV MODEL_DIR=/app/checkpoints
+EXPOSE 7860
+CMD ["sh", "-c", "OMP_NUM_THREADS=4 exec uvicorn app:app --host 0.0.0.0 --port 7860"]
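
The build-time download in the Dockerfile is allowed to fail (`|| echo ...`), so the container can start without checkpoints. Below is a minimal sketch of the startup check that decides whether a runtime download is still needed, mirroring the `config.yaml` test in `load_model()` in `app.py`. The names `ensure_checkpoints` and `fake_fetch` are hypothetical; `fake_fetch` stands in for `huggingface_hub.snapshot_download` so the snippet runs without network access.

```python
import os
import tempfile
from typing import Callable


def ensure_checkpoints(model_dir: str, fetch: Callable[[str], None]) -> bool:
    """Call fetch(model_dir) only if config.yaml is missing; return True if fetched."""
    cfg_path = os.path.join(model_dir, "config.yaml")
    if os.path.exists(cfg_path):
        return False  # build-time download succeeded, nothing to do
    os.makedirs(model_dir, exist_ok=True)
    fetch(model_dir)
    return True


def fake_fetch(model_dir: str) -> None:
    # Hypothetical stand-in for
    # snapshot_download("IndexTeam/IndexTTS-2", local_dir=model_dir).
    with open(os.path.join(model_dir, "config.yaml"), "w", encoding="utf-8") as fh:
        fh.write("# placeholder\n")


with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "checkpoints")
    first = ensure_checkpoints(model_dir, fake_fetch)   # directory empty: fetch runs
    second = ensure_checkpoints(model_dir, fake_fetch)  # config.yaml present: skipped
```

With this shape, a failed build-time download only costs startup time on the first boot; later boots find `config.yaml` and skip the fetch.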

indextts2/README.md ADDED

	@@ -0,0 +1,107 @@

+---
+title: VoxLibris IndexTTS2 Engine
+emoji: 🎙️
+colorFrom: purple
+colorTo: indigo
+sdk: docker
+app_port: 7860
+pinned: false
+---
+# VoxLibris IndexTTS2 Engine
+A HuggingFace Space that serves [IndexTTS2](https://github.com/index-tts/index-tts)
+as a REST API, implementing the
+[VoxLibris TTS Engine API Contract](https://github.com/your-repo/docs/tts-api-contract.md).
+## Endpoints
+### POST /GetEngineDetails
+Returns engine capabilities, supported emotions, and voice cloning support.
+### POST /ConvertTextToSpeech
+Converts text to speech with zero-shot voice cloning. Requires a
+`voice_to_clone_sample` (base64-encoded WAV). Supports 14 emotions mapped
+to IndexTTS2's 8-dimensional emotion vector system.
+### GET /health
+Returns model loading status.
+## Authentication
+Set the `API_KEY` secret in your HuggingFace Space settings.
+Requests must include `Authorization: Bearer <your-key>` header.
+Leave `API_KEY` unset to disable authentication.
+## Voice Cloning
+IndexTTS2 is a zero-shot voice cloning engine — every request requires a
+reference voice sample. Send a base64-encoded WAV file in the
+`voice_to_clone_sample` field. A 6-15 second clear speech sample works best.
+The engine disentangles speaker timbre from emotional expression, allowing
+the cloned voice to speak with different emotions without affecting voice
+identity.
+## Emotion Support
+IndexTTS2 uses an 8-dimensional emotion vector system (happy, angry, sad,
+afraid, disgusted, melancholic, surprised, calm) with a fine-tuned Qwen3
+model for emotion analysis. VoxLibris emotions are automatically mapped
+to appropriate vector blends:
+| Emotion     | Mapping Strategy                      |
+|-------------|---------------------------------------|
+| neutral     | High calm (0.8)                       |
+| happy       | High happy (0.8)                      |
+| sad         | High sad (0.8)                        |
+| angry       | High angry (0.8)                      |
+| fear        | High afraid (0.8)                     |
+| disgust     | High disgusted (0.8)                  |
+| surprise    | High surprised (0.7)                  |
+| calm        | High calm (0.8)                       |
+| excited     | Happy (0.6) + surprised (0.2)         |
+| melancholy  | Sad (0.2) + melancholic (0.6)         |
+| anxious     | Afraid (0.5) + slight calm (0.2)      |
+| hopeful     | Happy (0.5) + calm (0.3)              |
+| tender      | Happy (0.2) + calm (0.5)              |
+| proud       | Happy (0.5) + surprised (0.1) + calm (0.2) |
+The `intensity` parameter (1-100) scales the emotion vectors. Additional
+prosody reinforcement is applied via pyrubberband speed/pitch adjustments.
+## Key Features
+- **Emotion-Speaker Disentanglement**: Independent control over voice timbre
+  (from reference audio) and emotional expression (from emotion vectors)
+- **Zero-Shot Voice Cloning**: Clone any voice from a short reference audio
+- **Duration Control**: Supports both free generation and explicit token-count
+  modes for precise audio length
+- **Multilingual**: Chinese and English (with more languages supported)
+- **Built-in Qwen3 Emotion Model**: Fine-tuned for text-to-emotion analysis
+## Limits
+- Maximum 500 characters per request (longer text is truncated at a word boundary)
+- Output: 22050 Hz mono 16-bit WAV
+- Reference audio: max 15 seconds (longer clips are auto-truncated)
+## Environment Variables
+| Variable    | Description                            | Default         |
+|-------------|----------------------------------------|-----------------|
+| `API_KEY`   | Bearer token for authentication        | (none/disabled) |
+| `MODEL_DIR` | Path to model checkpoints directory    | `checkpoints`   |
+| `USE_FP16`  | Enable half-precision inference        | `true`          |
+## Deployment
+1. Create a new HuggingFace Space with the **Docker** SDK
+2. Upload the contents of this folder
+3. Set the `API_KEY` secret in Space settings (optional)
+4. The model (~5 GB) downloads during the Docker build; if that fails, it downloads at first startup
+5. Requires GPU (A10G or better recommended for reasonable speed)
+6. Register the Space URL in VoxLibris Settings under TTS Engine Management
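
The `intensity` scaling described under Emotion Support in the README can be worked through by hand. The sketch below is a condensed restatement of `blend_emotion_vectors()` from `app.py`, using only four of the table's entries. The dimension order (happy, angry, sad, afraid, disgusted, melancholic, surprised, calm) is taken from the README's list; the function name `blend` is illustrative only.

```python
# Index order: happy, angry, sad, afraid, disgusted, melancholic, surprised, calm
EMOTIONS = {
    "neutral": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8],
    "happy":   [0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
    "excited": [0.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0],
    "hopeful": [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3],
}


def blend(emotion_set, intensity):
    """Average the known emotion vectors, scale by intensity/50, cap calm."""
    known = [EMOTIONS[e.lower()] for e in emotion_set if e.lower() in EMOTIONS]
    if not known or emotion_set == ["neutral"]:
        return list(EMOTIONS["neutral"])
    avg = [sum(col) / len(known) for col in zip(*known)]
    factor = intensity / 50.0
    scaled = [v * factor for v in avg[:7]]  # calm (index 7) is not scaled
    calm = min(avg[7], max(0.0, 1.0 - sum(scaled)))
    return scaled + [calm]


default_happy = blend(["happy"], 50)            # factor 1.0: table values unchanged
max_happy = blend(["happy"], 100)               # factor 2.0: happy doubles, calm drops to 0
mixed = blend(["happy", "hopeful"], 50)         # element-wise mean of two vectors
```

At `intensity=100` the happy component becomes 1.6, above 1.0; in `app.py` the result is then passed through `tts_model.normalize_emo_vec(...)` before inference, which this sketch does not reproduce.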

indextts2/app.py ADDED

	@@ -0,0 +1,457 @@

+import os
+os.environ.setdefault("OMP_NUM_THREADS", "4")
+os.environ.setdefault("HF_HUB_CACHE", "./checkpoints/hf_cache")
+import io
+import base64
+import tempfile
+import logging
+import wave
+import numpy as np
+import torch
+import pyrubberband as pyrb
+from contextlib import asynccontextmanager
+from pathlib import Path
+from fastapi import FastAPI, Request, HTTPException
+from fastapi.responses import Response, JSONResponse, HTMLResponse
+from pydantic import BaseModel, Field
+from typing import Optional
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger("indextts2-engine")
+# Empty default: authentication is disabled when API_KEY is unset (see README).
+BEARER_TOKEN = os.environ.get("API_KEY", "")
+SAMPLE_RATE = 22050
+BIT_DEPTH = 16
+CHANNELS = 1
+MAX_SECONDS = 60
+MAX_CHARS = 500
+VOXLIBRIS_TO_INDEXTTS2_EMOTIONS = {
+    "neutral": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8],
+    "happy": [0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
+    "angry": [0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
+    "sad": [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.1],
+    "fear": [0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.1],
+    "disgust": [0.0, 0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.1],
+    "melancholy": [0.0, 0.0, 0.2, 0.0, 0.0, 0.6, 0.0, 0.1],
+    "surprise": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7, 0.1],
+    "calm": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8],
+    "excited": [0.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0],
+    "anxious": [0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.2],
+    "hopeful": [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3],
+    "tender": [0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5],
+    "proud": [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.2],
+    "fearful": [0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.1],
+    "confused": [0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.3, 0.3],
+}
+EMOTION_SPEED_MAP = {
+    "neutral": 1.0,
+    "happy": 1.02,
+    "sad": 0.97,
+    "angry": 1.04,
+    "fear": 1.03,
+    "fearful": 1.03,
+    "surprise": 1.05,
+    "disgust": 0.98,
+    "excited": 1.03,
+    "calm": 0.96,
+    "confused": 0.98,
+    "anxious": 1.02,
+    "hopeful": 1.01,
+    "melancholy": 0.96,
+    "tender": 0.97,
+    "proud": 1.01,
+}
+EMOTION_PITCH_MAP = {
+    "neutral": 0.0,
+    "happy": 0.5,
+    "sad": -0.3,
+    "angry": -0.2,
+    "fear": 0.3,
+    "fearful": 0.3,
+    "surprise": 0.6,
+    "disgust": -0.2,
+    "excited": 0.7,
+    "calm": -0.1,
+    "confused": 0.2,
+    "anxious": 0.3,
+    "hopeful": 0.3,
+    "melancholy": -0.4,
+    "tender": -0.1,
+    "proud": 0.2,
+}
+CANONICAL_EMOTIONS = [
+    "neutral",
+    "happy",
+    "sad",
+    "angry",
+    "fear",
+    "surprise",
+    "disgust",
+    "excited",
+    "calm",
+    "anxious",
+    "hopeful",
+    "melancholy",
+    "tender",
+    "proud",
+    "fearful",
+    "confused",
+]
+tts_model = None
+def load_model():
+    global tts_model
+    from indextts.infer_v2 import IndexTTS2
+    model_dir = os.environ.get("MODEL_DIR", "checkpoints")
+    cfg_path = os.path.join(model_dir, "config.yaml")
+    if not os.path.exists(cfg_path):
+        logger.info(
+            "Model not found locally, downloading IndexTeam/IndexTTS-2...")
+        from huggingface_hub import snapshot_download
+        snapshot_download("IndexTeam/IndexTTS-2", local_dir=model_dir)
+        logger.info("Model download complete.")
+    use_fp16 = os.environ.get("USE_FP16",
+                              "true").lower() in ("true", "1", "yes")
+    device = None
+    if torch.cuda.is_available():
+        device = "cuda:0"
+    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+        device = "mps"
+        use_fp16 = False
+    else:
+        device = "cpu"
+        use_fp16 = False
+    logger.info(
+        f"Loading IndexTTS2 model from {model_dir} on {device} (fp16={use_fp16})..."
+    )
+    tts_model = IndexTTS2(
+        cfg_path=cfg_path,
+        model_dir=model_dir,
+        use_fp16=use_fp16,
+        device=device,
+    )
+    logger.info("IndexTTS2 model loaded successfully.")
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    load_model()
+    yield
+app = FastAPI(title="IndexTTS2 Engine", lifespan=lifespan)
+def verify_auth(request: Request):
+    if not BEARER_TOKEN:
+        return
+    auth = request.headers.get("Authorization", "")
+    if auth != f"Bearer {BEARER_TOKEN}":
+        raise HTTPException(status_code=401, detail="Unauthorized")
+def numpy_to_wav_bytes(audio_np: np.ndarray, sample_rate: int) -> bytes:
+    audio_np = np.clip(audio_np, -1.0, 1.0)
+    audio_int16 = (audio_np * 32767).astype(np.int16)
+    buf = io.BytesIO()
+    with wave.open(buf, "wb") as wf:
+        wf.setnchannels(CHANNELS)
+        wf.setsampwidth(2)
+        wf.setframerate(sample_rate)
+        wf.writeframes(audio_int16.tobytes())
+    return buf.getvalue()
+def blend_emotion_vectors(emotion_set: list[str],
+                          intensity: int) -> list[float]:
+    intensity_factor = intensity / 50.0
+    if not emotion_set or emotion_set == ["neutral"]:
+        base = VOXLIBRIS_TO_INDEXTTS2_EMOTIONS.get("neutral",
+                                                   [0.0] * 7 + [0.8])
+        return list(base)
+    blended = [0.0] * 8
+    count = 0
+    for emo in emotion_set:
+        emo_lower = emo.lower()
+        vec = VOXLIBRIS_TO_INDEXTTS2_EMOTIONS.get(emo_lower)
+        if vec:
+            for i in range(8):
+                blended[i] += vec[i]
+            count += 1
+    if count == 0:
+        return list(VOXLIBRIS_TO_INDEXTTS2_EMOTIONS["neutral"])
+    blended = [v / count for v in blended]
+    for i in range(7):
+        blended[i] = blended[i] * intensity_factor
+    calm_remaining = max(0.0, 1.0 - sum(blended[:7]))
+    blended[7] = min(blended[7], calm_remaining)
+    return blended
+class ConvertRequest(BaseModel):
+    input_text: str
+    builtin_voice_id: Optional[str] = None
+    voice_to_clone_sample: Optional[str] = None
+    random_seed: Optional[int] = None
+    emotion_set: list[str] = Field(default_factory=lambda: ["neutral"])
+    intensity: int = Field(default=50, ge=1, le=100)
+    volume: int = Field(default=75, ge=1, le=100)
+    speed_adjust: float = Field(default=0.0, ge=-5.0, le=5.0)
+    pitch_adjust: float = Field(default=0.0, ge=-5.0, le=5.0)
+@app.post("/GetEngineDetails")
+async def get_engine_details(request: Request):
+    verify_auth(request)
+    return {
+        "engine_id": "indextts2",
+        "engine_name": "IndexTTS2",
+        "sample_rate": SAMPLE_RATE,
+        "bit_depth": BIT_DEPTH,
+        "channels": CHANNELS,
+        "max_seconds_per_conversion": MAX_SECONDS,
+        "supports_voice_cloning": True,
+        "builtin_voices": [],
+        "supported_emotions": CANONICAL_EMOTIONS,
+        "extra_properties": {
+            "model":
+            "IndexTeam/IndexTTS-2",
+            "max_characters":
+            MAX_CHARS,
+            "emotion_control":
+            "8-dimensional emotion vectors via fine-tuned Qwen3",
+            "features": [
+                "zero-shot voice cloning",
+                "emotion-speaker disentanglement",
+                "duration control",
+                "multilingual (Chinese, English)",
+            ],
+        }
+    }
+@app.post("/ConvertTextToSpeech")
+async def convert_text_to_speech(request: Request):
+    verify_auth(request)
+    try:
+        body = await request.json()
+        req = ConvertRequest(**body)
+    except Exception as e:
+        return JSONResponse(status_code=400,
+                            content={
+                                "error": str(e),
+                                "error_code": "INVALID_REQUEST"
+                            })
+    if not req.input_text.strip():
+        return JSONResponse(status_code=400,
+                            content={
+                                "error": "Input text is empty",
+                                "error_code": "INVALID_REQUEST"
+                            })
+    if not req.voice_to_clone_sample:
+        return JSONResponse(
+            status_code=400,
+            content={
+                "error": "IndexTTS2 requires a voice sample for cloning. "
+                "Please provide a voice_to_clone_sample.",
+                "error_code": "INVALID_REQUEST"
+            })
+    if req.random_seed is not None and req.random_seed > 0:
+        torch.manual_seed(req.random_seed)
+        if torch.cuda.is_available():
+            torch.cuda.manual_seed(req.random_seed)
+    temp_files = []
+    try:
+        try:
+            wav_bytes = base64.b64decode(req.voice_to_clone_sample,
+                                         validate=True)
+        except Exception:
+            return JSONResponse(
+                status_code=400,
+                content={
+                    "error": "Invalid voice_to_clone_sample: not valid base64",
+                    "error_code": "INVALID_REQUEST"
+                })
+        if len(wav_bytes) < 44:
+            return JSONResponse(
+                status_code=400,
+                content={
+                    "error":
+                    "Invalid voice_to_clone_sample: file too small to be valid audio",
+                    "error_code": "INVALID_REQUEST"
+                })
+        tmp_voice = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
+        tmp_voice.write(wav_bytes)
+        tmp_voice.close()
+        speaker_wav_path = tmp_voice.name
+        temp_files.append(tmp_voice.name)
+        tmp_out = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
+        tmp_out.close()
+        output_path = tmp_out.name
+        temp_files.append(tmp_out.name)
+        text = req.input_text.strip()
+        if len(text) > MAX_CHARS:
+            truncated = text[:MAX_CHARS]
+            last_space = truncated.rfind(' ')
+            if last_space > MAX_CHARS * 0.6:
+                truncated = truncated[:last_space]
+            text = truncated
+            logger.warning(f"Text truncated to {len(text)} characters")
+        if text and text[-1] not in '.!?;:。！？；：':
+            text += '.'
+        dominant_emotion = req.emotion_set[0].lower(
+        ) if req.emotion_set else "neutral"
+        emo_vector = blend_emotion_vectors(req.emotion_set, req.intensity)
+        emo_vector = tts_model.normalize_emo_vec(emo_vector, apply_bias=True)
+        emotion_speed = EMOTION_SPEED_MAP.get(dominant_emotion, 1.0)
+        emotion_pitch = EMOTION_PITCH_MAP.get(dominant_emotion, 0.0)
+        intensity_factor = req.intensity / 50.0
+        emotion_speed = 1.0 + (emotion_speed - 1.0) * intensity_factor
+        emotion_pitch = emotion_pitch * intensity_factor
+        is_neutral = all(e.lower() in ("neutral", "calm")
+                         for e in req.emotion_set)
+        logger.info(f"Generating with IndexTTS2: emotions={req.emotion_set}, "
+                    f"emo_vector={[f'{v:.2f}' for v in emo_vector]}, "
+                    f"intensity={req.intensity}, text_len={len(text)}, "
+                    f"is_neutral={is_neutral}")
+        kwargs = {
+            "spk_audio_prompt": speaker_wav_path,
+            "text": text,
+            "output_path": output_path,
+            "verbose": False,
+        }
+        if not is_neutral:
+            kwargs["emo_vector"] = emo_vector
+        tts_model.infer(**kwargs)
+        if not os.path.exists(output_path) or os.path.getsize(
+                output_path) == 0:
+            return JSONResponse(status_code=500,
+                                content={
+                                    "error": "IndexTTS2 produced no output",
+                                    "error_code": "GENERATION_FAILED"
+                                })
+        import torchaudio
+        wav_tensor, sr = torchaudio.load(output_path)
+        audio_np = wav_tensor.squeeze().numpy().astype(np.float32)
+        if sr != SAMPLE_RATE:
+            import librosa
+            audio_np = librosa.resample(audio_np,
+                                        orig_sr=sr,
+                                        target_sr=SAMPLE_RATE)
+        peak = np.max(np.abs(audio_np))
+        if peak > 0:
+            audio_np = audio_np / peak
+        speed_factor = emotion_speed
+        if req.speed_adjust != 0.0:
+            user_speed = 1.0 + (req.speed_adjust / 100.0)
+            speed_factor = speed_factor * user_speed
+        speed_factor = max(0.5, min(2.0, speed_factor))
+        if abs(speed_factor - 1.0) > 0.01:
+            audio_np = pyrb.time_stretch(audio_np, SAMPLE_RATE, speed_factor)
+        total_pitch = emotion_pitch
+        if req.pitch_adjust != 0.0:
+            total_pitch += req.pitch_adjust * 0.24
+        if abs(total_pitch) > 0.01:
+            audio_np = pyrb.pitch_shift(audio_np, SAMPLE_RATE, total_pitch)
+        vol_factor = req.volume / 75.0
+        audio_np = audio_np * vol_factor
+        wav_bytes_out = numpy_to_wav_bytes(audio_np, SAMPLE_RATE)
+        return Response(content=wav_bytes_out, media_type="audio/wav")
+    except Exception as e:
+        logger.exception("TTS generation failed")
+        return JSONResponse(status_code=500,
+                            content={
+                                "error": "Audio generation failed",
+                                "error_code": "GENERATION_FAILED",
+                                "details": str(e)
+                            })
+    finally:
+        for f in temp_files:
+            try:
+                os.unlink(f)
+            except OSError:
+                pass
+@app.get("/", response_class=HTMLResponse)
+async def root():
+    html_path = Path(__file__).parent / "index.html"
+    if html_path.exists():
+        return HTMLResponse(content=html_path.read_text(encoding="utf-8"))
+    return HTMLResponse(content="""
+    <html>
+    <head><title>IndexTTS2 Engine</title></head>
+    <body style="font-family: sans-serif; max-width: 800px; margin: 40px auto; padding: 20px;">
+        <h1>IndexTTS2 Engine</h1>
+        <p>VoxLibris-compatible TTS engine powered by
+           <a href="https://github.com/index-tts/index-tts">IndexTTS2</a>.</p>
+        <h2>Endpoints</h2>
+        <ul>
+            <li><code>POST /GetEngineDetails</code> - Get engine capabilities</li>
+            <li><code>POST /ConvertTextToSpeech</code> - Convert text to speech</li>
+            <li><code>GET /health</code> - Health check</li>
+        </ul>
+    </body>
+    </html>
+    """)
+@app.get("/health")
+async def health():
+    return {"status": "ok", "model_loaded": tts_model is not None}
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=7860)
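
The post-processing arithmetic in `convert_text_to_speech()` combines an emotion-derived speed and pitch with the caller's `speed_adjust` and `pitch_adjust`. The sketch below restates that calculation on its own, with a three-entry subset of the speed and pitch maps; the function name `prosody` is hypothetical, and the returned values correspond to what the handler passes to `pyrb.time_stretch` (rate) and `pyrb.pitch_shift` (steps).

```python
EMOTION_SPEED = {"neutral": 1.0, "excited": 1.03, "sad": 0.97}
EMOTION_PITCH = {"neutral": 0.0, "excited": 0.7, "sad": -0.3}


def prosody(emotion, intensity, speed_adjust=0.0, pitch_adjust=0.0):
    """Return (stretch_rate, pitch_steps) as computed in the request handler."""
    factor = intensity / 50.0
    # Emotion offset from 1.0 is scaled by intensity, then the user's
    # speed_adjust is applied as a percentage (-5..5 -> 0.95x..1.05x).
    speed = 1.0 + (EMOTION_SPEED.get(emotion, 1.0) - 1.0) * factor
    speed *= 1.0 + speed_adjust / 100.0
    speed = max(0.5, min(2.0, speed))
    # Each unit of pitch_adjust contributes 0.24 steps.
    pitch = EMOTION_PITCH.get(emotion, 0.0) * factor + pitch_adjust * 0.24
    return speed, pitch


speed_a, pitch_a = prosody("excited", 100, speed_adjust=5.0, pitch_adjust=-5.0)
speed_b, pitch_b = prosody("sad", 50)
```

For the first call the rate works out to 1.06 × 1.05 = 1.113 and the pitch to 1.4 − 1.2 = 0.2. The handler skips each rubberband pass when the deviation is 0.01 or less, so a neutral request with no adjustments is returned without time-stretching or pitch-shifting.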

indextts2/index.html ADDED

	@@ -0,0 +1,256 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>IndexTTS2 - Test Console</title>
+  <style>
+    *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
+      background: #0f0d1a;
+      color: #e2e0eb;
+      min-height: 100vh;
+      padding: 2rem;
+    }
+    .container { max-width: 720px; margin: 0 auto; }
+    h1 {
+      font-size: 1.75rem;
+      font-weight: 700;
+      background: linear-gradient(135deg, #a78bfa, #7c3aed);
+      -webkit-background-clip: text;
+      -webkit-text-fill-color: transparent;
+      margin-bottom: 0.25rem;
+    }
+    .subtitle { color: #9490a8; font-size: 0.875rem; margin-bottom: 2rem; }
+    .card {
+      background: #1a1726;
+      border: 1px solid #2d2a3a;
+      border-radius: 12px;
+      padding: 1.5rem;
+      margin-bottom: 1.25rem;
+    }
+    .card-title {
+      font-size: 0.8rem;
+      font-weight: 600;
+      text-transform: uppercase;
+      letter-spacing: 0.05em;
+      color: #a78bfa;
+      margin-bottom: 1rem;
+    }
+    label {
+      display: block;
+      font-size: 0.8rem;
+      font-weight: 500;
+      color: #b0adc0;
+      margin-bottom: 0.35rem;
+    }
+    textarea, input[type="text"], input[type="number"], select {
+      width: 100%;
+      background: #12101e;
+      border: 1px solid #2d2a3a;
+      border-radius: 8px;
+      color: #e2e0eb;
+      padding: 0.6rem 0.75rem;
+      font-size: 0.875rem;
+      margin-bottom: 1rem;
+      outline: none;
+      transition: border-color 0.2s;
+    }
+    textarea:focus, input:focus, select:focus {
+      border-color: #7c3aed;
+    }
+    textarea { resize: vertical; min-height: 80px; }
+    .row { display: flex; gap: 1rem; }
+    .row > * { flex: 1; }
+    button.primary {
+      width: 100%;
+      padding: 0.75rem;
+      background: linear-gradient(135deg, #7c3aed, #6d28d9);
+      color: white;
+      border: none;
+      border-radius: 8px;
+      font-size: 0.95rem;
+      font-weight: 600;
+      cursor: pointer;
+      transition: opacity 0.2s;
+    }
+    button.primary:hover { opacity: 0.9; }
+    button.primary:disabled { opacity: 0.5; cursor: not-allowed; }
+    #status {
+      margin-top: 1rem;
+      padding: 0.75rem;
+      border-radius: 8px;
+      font-size: 0.85rem;
+      display: none;
+    }
+    #status.error { display: block; background: #2d1520; border: 1px solid #5c2338; color: #f87171; }
+    #status.success { display: block; background: #152d1a; border: 1px solid #235c2d; color: #4ade80; }
+    #status.loading { display: block; background: #1a1726; border: 1px solid #2d2a3a; color: #a78bfa; }
+    #audioResult { margin-top: 1rem; display: none; }
+    #audioResult audio { width: 100%; margin-top: 0.5rem; }
+    .info {
+      font-size: 0.75rem;
+      color: #706d82;
+      margin-top: -0.5rem;
+      margin-bottom: 1rem;
+    }
+  </style>
+</head>
+<body>
+  <div class="container">
+    <h1>IndexTTS2</h1>
+    <p class="subtitle">Emotionally expressive zero-shot voice cloning TTS &mdash; Test Console</p>
+    <div class="card">
+      <div class="card-title">Voice Reference</div>
+      <label for="voiceFile">Upload reference audio (WAV, 6-15 seconds recommended)</label>
+      <input type="file" id="voiceFile" accept="audio/*" style="margin-bottom:1rem">
+      <p class="info">IndexTTS2 clones the timbre from your reference audio for zero-shot voice synthesis.</p>
+    </div>
+    <div class="card">
+      <div class="card-title">Text &amp; Emotion</div>
+      <label for="inputText">Text to synthesize</label>
+      <textarea id="inputText" rows="4" placeholder="Enter text to convert to speech..."></textarea>
+      <label for="emotion">Emotion</label>
+      <select id="emotion">
+        <option value="neutral" selected>Neutral</option>
+        <option value="happy">Happy</option>
+        <option value="sad">Sad</option>
+        <option value="angry">Angry</option>
+        <option value="fear">Fear</option>
+        <option value="surprise">Surprise</option>
+        <option value="disgust">Disgust</option>
+        <option value="excited">Excited</option>
+        <option value="calm">Calm</option>
+        <option value="anxious">Anxious</option>
+        <option value="hopeful">Hopeful</option>
+        <option value="melancholy">Melancholy</option>
+        <option value="tender">Tender</option>
+        <option value="proud">Proud</option>
+      </select>
+      <div class="row">
+        <div>
+          <label for="intensity">Intensity (1-100)</label>
+          <input type="number" id="intensity" value="50" min="1" max="100">
+        </div>
+        <div>
+          <label for="volume">Volume (1-100)</label>
+          <input type="number" id="volume" value="75" min="1" max="100">
+        </div>
+      </div>
+      <div class="row">
+        <div>
+          <label for="speed">Speed adjust</label>
+          <input type="number" id="speed" value="0" min="-5" max="5" step="0.1">
+        </div>
+        <div>
+          <label for="pitch">Pitch adjust</label>
+          <input type="number" id="pitch" value="0" min="-5" max="5" step="0.1">
+        </div>
+      </div>
+    </div>
+    <div class="card">
+      <div class="card-title">Authentication</div>
+      <label for="apiKey">API Key (if set on server)</label>
+      <input type="text" id="apiKey" placeholder="Leave empty if no auth required">
+    </div>
+    <button class="primary" id="generateBtn" onclick="generate()">Generate Speech</button>
+    <div id="status"></div>
+    <div id="audioResult">
+      <audio id="audioPlayer" controls></audio>
+    </div>
+  </div>
+  <script>
+    async function fileToBase64(file) {
+      return new Promise((resolve, reject) => {
+        const reader = new FileReader();
+        reader.onload = () => {
+          const base64 = reader.result.split(',')[1];
+          resolve(base64);
+        };
+        reader.onerror = reject;
+        reader.readAsDataURL(file);
+      });
+    }
+    async function generate() {
+      const status = document.getElementById('status');
+      const btn = document.getElementById('generateBtn');
+      const audioResult = document.getElementById('audioResult');
+      const audioPlayer = document.getElementById('audioPlayer');
+      const voiceFile = document.getElementById('voiceFile').files[0];
+      const text = document.getElementById('inputText').value.trim();
+      const emotion = document.getElementById('emotion').value;
+      const intensity = parseInt(document.getElementById('intensity').value);
+      const volume = parseInt(document.getElementById('volume').value);
+      const speed = parseFloat(document.getElementById('speed').value);
+      const pitch = parseFloat(document.getElementById('pitch').value);
+      const apiKey = document.getElementById('apiKey').value.trim();
+      if (!voiceFile) {
+        status.className = 'error';
+        status.textContent = 'Please upload a reference voice audio file.';
+        return;
+      }
+      if (!text) {
+        status.className = 'error';
+        status.textContent = 'Please enter text to synthesize.';
+        return;
+      }
+      btn.disabled = true;
+      status.className = 'loading';
+      status.textContent = 'Generating speech... (this may take a moment)';
+      audioResult.style.display = 'none';
+      try {
+        const voiceBase64 = await fileToBase64(voiceFile);
+        const headers = { 'Content-Type': 'application/json' };
+        if (apiKey) headers['Authorization'] = `Bearer ${apiKey}`;
+        const resp = await fetch('/ConvertTextToSpeech', {
+          method: 'POST',
+          headers,
+          body: JSON.stringify({
+            input_text: text,
+            voice_to_clone_sample: voiceBase64,
+            emotion_set: [emotion],
+            intensity,
+            volume,
+            speed_adjust: speed,
+            pitch_adjust: pitch,
+          }),
+        });
+        if (!resp.ok) {
+          const err = await resp.json().catch(() => ({}));
+          throw new Error(err.error || err.detail || `HTTP ${resp.status}`);
+        }
+        const blob = await resp.blob();
+        const url = URL.createObjectURL(blob);
+        audioPlayer.src = url;
+        audioResult.style.display = 'block';
+        status.className = 'success';
+        status.textContent = 'Speech generated successfully!';
+      } catch (e) {
+        status.className = 'error';
+        status.textContent = `Error: ${e.message}`;
+      } finally {
+        btn.disabled = false;
+      }
+    }
+  </script>
+</body>
+</html>
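
The request that the test console's `generate()` function sends can also be built outside the browser. The following is a rough Python equivalent using only the standard library. The base URL `https://example.invalid`, the key `secret`, and the helper names are placeholders, and the silent WAV merely stands in for a real 6-15 second voice sample.

```python
import base64
import io
import json
import urllib.request
import wave


def make_silent_wav(seconds: float = 1.0, sample_rate: int = 22050) -> bytes:
    """Return a mono 16-bit PCM WAV of silence (placeholder for a voice sample)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


def build_request(base_url, text, wav_bytes, emotion="neutral", api_key=None):
    """Build the POST /ConvertTextToSpeech request without sending it."""
    payload = {
        "input_text": text,
        "voice_to_clone_sample": base64.b64encode(wav_bytes).decode("ascii"),
        "emotion_set": [emotion],
        "intensity": 50,
        "volume": 75,
        "speed_adjust": 0.0,
        "pitch_adjust": 0.0,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/ConvertTextToSpeech",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_request("https://example.invalid", "Hello there.",
                    make_silent_wav(), emotion="happy", api_key="secret")
# To send it against a running Space (returns WAV bytes on success):
#   audio = urllib.request.urlopen(req, timeout=300).read()
```

Building the request separately from sending it keeps the payload easy to inspect; on success the server responds with `audio/wav` bytes, and on failure with a JSON body containing `error` and `error_code`.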

indextts2/requirements.txt ADDED

	@@ -0,0 +1,11 @@

+indextts>=2.0.0
+torch>=2.0.0
+torchaudio>=2.0.0
+fastapi>=0.104.0
+uvicorn[standard]>=0.24.0
+numpy
+pydantic>=2.0.0
+pyrubberband>=0.3.0
+soundfile>=0.12.0
+librosa>=0.10.0
+huggingface_hub>=0.20.0