lulavc committed on
Commit 43f8b96
0 Parent(s):

fix: move theme/css from gr.Blocks() to demo.launch() for Gradio 6.x compatibility

Files changed (7)
  1. README.md +238 -0
  2. app.py +539 -0
  3. dubbing.py +188 -0
  4. i18n.py +271 -0
  5. lang_codes.py +66 -0
  6. requirements.txt +34 -0
  7. styles.py +153 -0
README.md ADDED
@@ -0,0 +1,238 @@
---
title: AnimaStudio 🎬
emoji: 🎬
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.0.2
app_file: app.py
pinned: true
license: apache-2.0
short_description: AI talking head & video dubbing — free, 23 languages
tags:
- video-generation
- talking-head
- lip-sync
- avatar
- tts
- voice-cloning
- multilingual
- mcp-server
- echomimic
- chatterbox
- dubbing
- whisper
- nllb
---

# 🎬 AnimaStudio

Turn any portrait photo into a talking head video using your voice or typed text, or dub any video into 23 languages — free, no sign-up required.

---

## ✨ Features

| Feature | Details |
|---------|---------|
| 🎭 **Realistic Lip Sync** | EchoMimic V3 Flash (AAAI 2026) — state-of-the-art talking head animation |
| 🗣️ **23-Language TTS** | Type text in any of 23 languages and generate natural speech |
| 🎙️ **Voice Cloning** | Upload a voice reference clip to clone the speaking style |
| 📤 **Audio Upload** | Upload your own WAV / MP3 / FLAC instead of using TTS |
| 🎬 **Video Dubbing** | Upload a video (up to 60 s) and dub it into any of the 23 supported languages |
| 📐 **3 Aspect Ratios** | 9:16 mobile, 1:1 square, 16:9 landscape |
| 🌐 **4 UI Languages** | Full interface in English, Português (BR), Español, and عربي |
| 📥 **Download** | One-click download of the generated MP4 |
| 🤖 **MCP Server** | Use as a tool in Claude, Cursor, and any MCP-compatible agent |

---

## 🗣️ Supported TTS Languages

Arabic · Danish · German · Greek · **English** · **Spanish** · Finnish · French · Hebrew · Hindi · Italian · Japanese · Korean · Malay · Dutch · Norwegian · Polish · **Portuguese** · Russian · Swedish · Swahili · Turkish · Chinese

---

## 📐 Output Formats

| Preset | Dimensions | Best for |
|--------|-----------|----------|
| ▮ 9:16 | 576 × 1024 | Mobile, Reels, TikTok |
| ◻ 1:1 | 512 × 512 | Social media, thumbnails |
| ▬ 16:9 | 1024 × 576 | Presentations, YouTube |

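The preset table maps one-to-one onto the `ASPECT_PRESETS` dict in `app.py`. A minimal sketch of the lookup, with the fallback behaviour taken from `generate()` (the `resolve_preset` helper name is illustrative):

```python
# Aspect presets as defined in app.py (label → (width, height))
ASPECT_PRESETS = {
    "▮ 9:16 · 576×1024": (576, 1024),
    "◻ 1:1 · 512×512": (512, 512),
    "▬ 16:9 · 1024×576": (1024, 576),
}

def resolve_preset(label: str) -> tuple[int, int]:
    # Unknown labels fall back to 1:1, matching generate() in app.py
    return ASPECT_PRESETS.get(label, (512, 512))

print(resolve_preset("▮ 9:16 · 576×1024"))  # → (576, 1024)
print(resolve_preset("unknown"))            # → (512, 512)
```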
---

## ⚙️ Advanced Settings

| Setting | Default | Range | Description |
|---------|---------|-------|-------------|
| **Inference Steps** | 20 | 5–50 | More steps = higher quality, slower |
| **Guidance Scale** | 3.5 | 1–10 | Higher = audio followed more strictly |
| **Emotion Intensity** | 0.5 | 0–1 | Controls expressiveness of TTS voice |

---

## 🤖 MCP Server

AnimaStudio runs as an **MCP (Model Context Protocol) server**, enabling AI agents to generate talking head videos programmatically.

### Using with Claude Desktop

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "animastudio": {
      "url": "https://lulavc-animastudio.hf.space/gradio_api/mcp/sse"
    }
  }
}
```

### Tool parameters

- **portrait_image** — portrait photo (file path or base64)
- **text** — text for the avatar to speak (text mode)
- **tts_language** — language for speech synthesis (23 options)
- **voice_ref** — optional voice reference audio for cloning
- **audio_file** — audio file path (audio mode)
- **aspect_ratio** — output format (9:16, 1:1, 16:9)
- **emotion** — emotion intensity 0–1
- **num_steps** — inference steps (default 20)
- **guidance_scale** — guidance scale (default 3.5)

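As a sketch of what a client sends, the parameter set above can be assembled like this. The `build_generate_args` helper and the exact wire format are hypothetical (they depend on the MCP client); the names and defaults come from the lists and tables in this README:

```python
# Hypothetical helper assembling tool-call arguments for the generate tool.
# Parameter names mirror the README list; defaults come from Advanced Settings.
def build_generate_args(portrait_image: str, text: str, **overrides) -> dict:
    args = {
        "portrait_image": portrait_image,  # file path or base64
        "text": text,                      # text mode input
        "tts_language": "English",         # one of the 23 supported languages
        "voice_ref": None,                 # optional voice-cloning reference
        "audio_file": None,                # audio-mode alternative to text
        "aspect_ratio": "1:1",             # 9:16, 1:1, or 16:9
        "emotion": 0.5,                    # 0–1
        "num_steps": 20,
        "guidance_scale": 3.5,
    }
    args.update(overrides)
    return args

print(build_generate_args("face.png", "Hello!", emotion=0.7)["emotion"])  # → 0.7
```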
---

## 🎬 Video Dubbing (Phase 2)

Upload any video (up to 60 seconds) and dub it into a different language. The pipeline:

1. **Whisper Turbo** transcribes the original speech (auto-detects language)
2. **NLLB-200** translates the transcript to the target language
3. **Chatterbox TTS** synthesizes the translated speech (with optional voice cloning)
4. **ffmpeg** muxes the new audio track onto the original video

### Dubbing Settings

| Setting | Details |
|---------|---------|
| **Input Video** | Any video with speech, up to 60 seconds |
| **Target Language** | Any of the 23 supported languages |
| **Voice Reference** | Optional audio clip to clone the speaker's voice style |

> Same language as source? The pipeline skips translation and re-synthesizes the audio directly.

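The control flow, including the same-language shortcut noted above, can be sketched with placeholder callables standing in for the Whisper, NLLB-200, and Chatterbox stages (all names here are illustrative, not the real `dubbing.py` API):

```python
# Sketch of the dubbing control flow; transcribe/translate/synthesize are
# placeholders for the Whisper, NLLB-200, and Chatterbox stages.
def dub(transcribe, translate, synthesize, target_lang: str):
    text, source_lang = transcribe()
    if source_lang != target_lang:
        text = translate(text, source_lang, target_lang)  # NLLB-200 step
    # Same language as source: translation is skipped, audio is re-synthesized
    return synthesize(text, target_lang)

# Toy stand-ins to show the branch behaviour
calls = []
audio = dub(
    transcribe=lambda: ("hello", "English"),
    translate=lambda t, s, d: calls.append("translate") or f"{t}->{d}",
    synthesize=lambda t, lang: f"wav({t})",
    target_lang="English",
)
print(audio, calls)  # → wav(hello) []   (translation skipped)
```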
---

## 🔧 Technical Details

### Models

| Model | Purpose | License | VRAM |
|-------|---------|---------|------|
| [EchoMimic V3 Flash](https://huggingface.co/BadToBest/EchoMimicV3) | Talking head video generation | Apache 2.0 | ~12 GB |
| [Chatterbox Multilingual](https://huggingface.co/ResembleAI/chatterbox) | 23-language TTS with voice cloning | MIT | ~8 GB |
| [Whisper Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | Speech transcription (809M params) | MIT | ~2 GB |
| [NLLB-200 Distilled 600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | Text translation (23 languages) | CC-BY-NC-4.0 | API (no local GPU) |

### Architecture

```
Tab 1: Create Video

  Portrait Photo + Text  ──→ Chatterbox TTS ──→ Audio WAV ──┐
  Portrait Photo + Audio ────────────────────────────────────┤
                                                             ▼
                                                   EchoMimic V3 Flash
                                                  (lip-sync animation)
                                                             │
                                                             ▼
                                                    MP4 Video Output

Tab 2: Dub Video

  Video ──→ ffmpeg (extract audio)
                 │
                 ▼
  Whisper Turbo (transcribe + detect language)
                 │
                 ▼
  NLLB-200 (translate to target language)
                 │
                 ▼
  Chatterbox TTS (synthesize translated speech)
                 │
                 ▼
  ffmpeg (mux new audio onto original video)
                 │
                 ▼
  Dubbed MP4 Output
```

### VRAM Management

Models run sequentially on ZeroGPU (A10G, 24 GB):

**Create Video tab:**
1. Chatterbox TTS → generates audio → offloads to CPU
2. EchoMimic V3 → generates video → offloads to CPU
3. `torch.cuda.empty_cache()` between stages

**Dub Video tab:**
1. Whisper Turbo → transcribes audio (~2 GB) → offloads to CPU
2. NLLB-200 → translates via HF Inference API (no local GPU)
3. Chatterbox TTS → synthesizes dubbed speech → offloads to CPU
4. `torch.cuda.empty_cache()` between stages

Peak usage never exceeds ~16 GB.

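The load-then-offload pattern above can be sketched as a small context manager. `DummyModel` and `on_gpu` are illustrative (the real `app.py` moves the actual modules with `.to()` and calls `torch.cuda.empty_cache()` inline):

```python
from contextlib import contextmanager

@contextmanager
def on_gpu(model, empty_cache=lambda: None):
    """Move a model to CUDA for one stage, then offload it back to CPU."""
    model.to("cuda")
    try:
        yield model
    finally:
        model.to("cpu")  # offload so the next stage has VRAM headroom
        empty_cache()    # torch.cuda.empty_cache() in the real pipeline

# Illustrative stand-in that records device moves
class DummyModel:
    def __init__(self):
        self.devices = []
    def to(self, device):
        self.devices.append(device)

m = DummyModel()
with on_gpu(m):
    pass  # run inference here
print(m.devices)  # → ['cuda', 'cpu']
```

The `finally` block guarantees the offload happens even when a stage raises, which is what keeps peak usage bounded across sequential stages.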
## 💡 Tips for Best Results

### Create Video
1. **Use a clear, front-facing portrait** — well-lit, neutral background, face filling most of the frame
2. **Keep audio under 20 seconds** — shorter = faster generation, tighter lip sync
3. **Add a voice reference** — upload a 5–15 second clip in the target language for natural voice cloning
4. **Match language to text** — select the correct TTS language to avoid accent issues
5. **Emotion 0.4–0.6** — sweet spot for natural-sounding delivery
6. **9:16 for social** — perfect for Reels, TikTok, and Stories
7. **20–30 steps** — good quality/speed trade-off for most use cases

### Dub Video
8. **Keep videos under 60 seconds** — pipeline enforces this limit for VRAM and quality
9. **Clear speech works best** — minimal background music/noise gives cleaner transcriptions
10. **Add a voice reference** — clone the original speaker's voice for a more natural dub
11. **Single-speaker videos** — the pipeline works best with one speaker at a time

---

## 🛠️ Running Locally

```bash
git clone https://huggingface.co/spaces/lulavc/AnimaStudio
cd AnimaStudio
pip install -r requirements.txt
python app.py
```

Requires a CUDA GPU with at least 16 GB VRAM. Set `HF_TOKEN` for private model access.

---

## 📄 License

- **Space code:** Apache 2.0
- **EchoMimic V3:** [Apache 2.0](https://huggingface.co/BadToBest/EchoMimicV3)
- **Chatterbox TTS:** [MIT](https://huggingface.co/ResembleAI/chatterbox)
- **Whisper Turbo:** [MIT](https://huggingface.co/openai/whisper-large-v3-turbo)
- **NLLB-200:** [CC-BY-NC-4.0](https://huggingface.co/facebook/nllb-200-distilled-600M)

---

**Space by [lulavc](https://huggingface.co/lulavc)** · Powered by [EchoMimic V3](https://huggingface.co/BadToBest/EchoMimicV3) + [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) + [Whisper Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) + [NLLB-200](https://huggingface.co/facebook/nllb-200-distilled-600M) · ZeroGPU · A10G
app.py ADDED
@@ -0,0 +1,539 @@
import spaces
import gradio as gr
import torch
import torchaudio
import os
import gc
import sys
import shutil
import tempfile
import subprocess
import threading
import logging

import dubbing
from i18n import T, EXAMPLES, ALL_EXAMPLES_FLAT, TTS_LANGUAGES, MAX_TEXT_LEN, MAX_AUDIO_SEC
from styles import THEME, CSS

log = logging.getLogger(__name__)

# ── Config ────────────────────────────────────────────────────────────────────
ECHOMIMIC_MODEL = os.environ.get("ECHOMIMIC_MODEL", "BadToBest/EchoMimicV3")
CHATTERBOX_MODEL = os.environ.get("CHATTERBOX_MODEL", "ResembleAI/chatterbox")
MAX_DUB_TEXT_LEN = 1500  # ~60s of typical speech at 150 wpm ≈ 900 chars; 1500 is safe headroom

ASPECT_PRESETS = {
    "▮ 9:16 · 576×1024": (576, 1024),
    "◻ 1:1 · 512×512": (512, 512),
    "▬ 16:9 · 1024×576": (1024, 576),
}

DEFAULT_STEPS = 20
DEFAULT_CFG = 3.5
DEFAULT_FPS = 25

# ── Runtime repo installs (avoid PyPI conflicts) ──────────────────────────────
_ECHOMIMIC_REPO = "https://github.com/antgroup/echomimic_v3.git"
_ECHOMIMIC_DIR = "/tmp/echomimic_v3"
_CHATTERBOX_REPO = "https://github.com/resemble-ai/chatterbox.git"
_CHATTERBOX_DIR = "/tmp/chatterbox"
_clone_lock = threading.Lock()


def _clone_repo(repo_url: str, dest: str, label: str):
    """Thread-safe shallow clone. Uses .git presence to detect complete clones."""
    with _clone_lock:
        if not os.path.exists(os.path.join(dest, ".git")):
            if os.path.exists(dest):
                shutil.rmtree(dest)
            log.info("Cloning %s…", label)
            subprocess.run(
                ["git", "clone", "--depth=1", repo_url, dest],
                check=True, timeout=180,
            )
            log.info("%s cloned", label)
        if dest not in sys.path:
            sys.path.insert(0, dest)


def _ensure_echomimic_repo():
    _clone_repo(_ECHOMIMIC_REPO, _ECHOMIMIC_DIR, "EchoMimic V3")


def _ensure_chatterbox_repo():
    _clone_repo(_CHATTERBOX_REPO, _CHATTERBOX_DIR, "Chatterbox TTS")


# ── Model singletons ──────────────────────────────────────────────────────────
_tts_model = None
_echo_pipe = None
_echo_mode = None


def _load_tts():
    global _tts_model
    if _tts_model is None:
        _ensure_chatterbox_repo()
        from chatterbox.tts import ChatterboxTTS
        log.info("Loading Chatterbox TTS…")
        _tts_model = ChatterboxTTS.from_pretrained(device="cpu")
        log.info("Chatterbox TTS ready")
    return _tts_model


def _load_echomimic():
    global _echo_pipe, _echo_mode
    if _echo_pipe is not None:
        return _echo_pipe, _echo_mode

    try:
        _ensure_echomimic_repo()
        from echomimic_v3.pipelines.pipeline_echomimic_v3 import EchoMimicV3Pipeline
        log.info("Loading EchoMimic V3 (local)…")
        _echo_pipe = EchoMimicV3Pipeline.from_pretrained(ECHOMIMIC_MODEL, torch_dtype=torch.float16)
        _echo_mode = "local"
        log.info("EchoMimic V3 ready (local)")
        return _echo_pipe, _echo_mode
    except Exception as e:
        log.warning("EchoMimic V3 local import failed: %s", e)

    try:
        from diffusers import DiffusionPipeline
        log.info("Loading EchoMimic V3 via diffusers…")
        _echo_pipe = DiffusionPipeline.from_pretrained(
            ECHOMIMIC_MODEL, torch_dtype=torch.float16, trust_remote_code=True,
        )
        _echo_mode = "diffusers"
        log.info("EchoMimic V3 ready (diffusers)")
        return _echo_pipe, _echo_mode
    except Exception as e:
        log.warning("EchoMimic V3 diffusers load failed: %s", e)

    raise RuntimeError("EchoMimic V3 could not be loaded. Check requirements and model availability.")


# ── Video utilities ───────────────────────────────────────────────────────────
def _coerce_frames(frames):
    """Normalise pipeline output to a list of (H, W, 3) uint8 numpy arrays."""
    import numpy as np
    result = []
    for frame in frames:
        if hasattr(frame, "save"):
            arr = np.array(frame.convert("RGB"))
        elif hasattr(frame, "cpu"):
            arr = frame.cpu().float().numpy()
            if arr.ndim == 3 and arr.shape[0] in (1, 3, 4):
                arr = arr.transpose(1, 2, 0)
            if arr.max() <= 1.0:
                arr = (arr * 255).clip(0, 255)
            arr = arr.astype(np.uint8)
        else:
            arr = np.array(frame)
        if arr.ndim == 2:
            import cv2
            arr = cv2.cvtColor(arr, cv2.COLOR_GRAY2RGB)
        elif arr.shape[2] == 4:
            arr = arr[:, :, :3]
        result.append(arr)
    return result


def _mux_video(frames, audio_path: str, fps: int = DEFAULT_FPS) -> str:
    """Combine frames (PIL/tensor/ndarray) + audio into an MP4 file."""
    import cv2

    coerced = _coerce_frames(frames)
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as f:
        out_path = f.name
    with tempfile.TemporaryDirectory() as tmpdir:
        for i, arr in enumerate(coerced):
            cv2.imwrite(os.path.join(tmpdir, f"{i:06d}.png"), cv2.cvtColor(arr, cv2.COLOR_RGB2BGR))
        cmd = [
            "ffmpeg", "-y", "-loglevel", "error",
            "-framerate", str(fps),
            "-i", os.path.join(tmpdir, "%06d.png"),
            "-i", audio_path,
            "-c:v", "libx264", "-preset", "fast", "-crf", "22",
            "-c:a", "aac", "-b:a", "128k",
            "-shortest", "-pix_fmt", "yuv420p",
            out_path,
        ]
        subprocess.run(cmd, check=True, timeout=120)
    return out_path

# ── TTS ───────────────────────────────────────────────────────────────────────
def _run_tts(text: str, voice_ref: str | None, emotion: float, language: str = "English") -> str:
    """Generate speech WAV. Returns temp file path."""
    model = _load_tts()
    log.info("TTS: language=%s text_len=%d emotion=%.2f", language, len(text), emotion)
    model.to("cuda")
    try:
        wav = model.generate(
            text=text.strip(),
            audio_prompt_path=voice_ref if voice_ref else None,
            exaggeration=float(emotion),
        )
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            out_path = f.name
        torchaudio.save(out_path, wav, model.sr)
        return out_path
    finally:
        model.to("cpu")
        torch.cuda.empty_cache()


# ── EchoMimic ─────────────────────────────────────────────────────────────────
def _run_echomimic(portrait_img, audio_path: str, width: int, height: int,
                   num_steps: int, guidance_scale: float) -> str:
    """Generate talking-head video. Returns MP4 file path."""
    pipe, _ = _load_echomimic()
    pipe.to("cuda")
    try:
        output = pipe(
            ref_image=portrait_img,
            audio_path=audio_path,
            width=width,
            height=height,
            num_inference_steps=num_steps,
            guidance_scale=guidance_scale,
            fps=DEFAULT_FPS,
        )
        if hasattr(output, "frames"):
            return _mux_video(output.frames[0], audio_path)
        if hasattr(output, "videos"):
            vid = output.videos[0]
            if hasattr(vid, "unbind"):
                return _mux_video(list(vid.unbind(0)), audio_path)
            return _mux_video(vid, audio_path)
        if isinstance(output, str):
            return output
        raise ValueError(f"Unexpected pipeline output type: {type(output)}")
    finally:
        pipe.to("cpu")
        torch.cuda.empty_cache()
        gc.collect()


# ── Phase 1: Generate video endpoint ─────────────────────────────────────────
@spaces.GPU(duration=180)
def generate(portrait_img, input_mode: str, text: str, tts_language: str,
             voice_ref, audio_file, aspect_ratio: str, emotion: float,
             num_steps: int, guidance_scale: float, lang: str,
             progress=gr.Progress(track_tqdm=True)):

    t = T.get(lang, T["🇺🇸 English"])
    if portrait_img is None:
        raise gr.Error(t["err_no_portrait"])

    width, height = ASPECT_PRESETS.get(aspect_ratio, (512, 512))
    _tts_tmp: str | None = None

    try:
        if input_mode == "text":
            if not text or not text.strip():
                raise gr.Error(t["err_no_text"])
            if len(text) > MAX_TEXT_LEN:
                raise gr.Error(t["err_text_long"])
            if voice_ref and not os.path.exists(voice_ref):
                voice_ref = None
            _tts_tmp = _run_tts(text, voice_ref, emotion, language=tts_language)
            audio_path = _tts_tmp
        else:
            if audio_file is None:
                raise gr.Error(t["err_no_audio"])
            info = torchaudio.info(audio_file)
            if (info.num_frames / info.sample_rate) > MAX_AUDIO_SEC:
                raise gr.Error(t["err_audio_long"])
            audio_path = audio_file

        return _run_echomimic(portrait_img, audio_path, width, height, int(num_steps), float(guidance_scale))

    except torch.cuda.OutOfMemoryError:
        raise gr.Error(t["err_oom"])
    except gr.Error:
        raise
    except Exception as e:
        raise gr.Error(f"Generation failed: {str(e)[:400]}")
    finally:
        if _tts_tmp and os.path.exists(_tts_tmp):
            try:
                os.unlink(_tts_tmp)
            except Exception:
                pass
        torch.cuda.empty_cache()
        gc.collect()

# ── Phase 2: Dubbing endpoint ─────────────────────────────────────────────────
@spaces.GPU(duration=180)
def dub_video(video_input, target_lang: str, voice_ref, emotion: float, lang: str,
              progress=gr.Progress(track_tqdm=True)):

    t = T.get(lang, T["🇺🇸 English"])
    temp_files: list[str] = []

    try:
        if video_input is None:
            raise gr.Error(t["err_no_video"])

        duration = dubbing.get_video_duration(video_input)
        if duration > dubbing.MAX_DUB_AUDIO_SEC:
            raise gr.Error(t["err_video_long"])

        progress(0.10, desc="Extracting audio…")
        audio_path = dubbing.extract_audio(video_input)
        temp_files.append(audio_path)

        progress(0.25, desc="Transcribing…")
        transcript = dubbing.transcribe(audio_path)
        dubbing._unload_whisper()

        source_display = transcript.language_display
        if source_display != target_lang:
            progress(0.45, desc="Translating…")
            try:
                translated_text = dubbing.translate(transcript.text, source_display, target_lang)
            except Exception as exc:
                raise gr.Error(f"{t['err_translate']} ({str(exc)[:200]})")
        else:
            translated_text = transcript.text

        if len(translated_text) > MAX_DUB_TEXT_LEN:
            raise gr.Error(t["err_dub_text_long"])

        progress(0.60, desc="Synthesizing speech…")
        if voice_ref and not os.path.exists(voice_ref):
            voice_ref = None
        dubbed_audio = _run_tts(translated_text, voice_ref, emotion, language=target_lang)
        temp_files.append(dubbed_audio)

        progress(0.85, desc="Combining video…")
        output_path = dubbing.mux_dubbed_video(video_input, dubbed_audio)

        status = f"✓ {source_display} → {target_lang} | {duration:.1f}s"
        return output_path, transcript.text, translated_text, status

    except torch.cuda.OutOfMemoryError:
        raise gr.Error(t["err_oom"])
    except gr.Error:
        raise
    except Exception as e:
        raise gr.Error(f"Dubbing failed: {str(e)[:400]}")
    finally:
        for fp in temp_files:
            if fp and os.path.exists(fp):
                try:
                    os.unlink(fp)
                except Exception:
                    pass
        torch.cuda.empty_cache()
        gc.collect()


# ── Language switcher ─────────────────────────────────────────────────────────
def switch_language(lang: str):
    t = T.get(lang, T["🇺🇸 English"])
    mode_choices = [(t["mode_text"], "text"), (t["mode_audio"], "audio")]
    # 26 outputs — must match _lang_out list order below
    return (
        # Phase 1 (16)
        gr.update(label=t["portrait_label"], info=t["portrait_info"]),
        gr.update(label=t["input_mode_label"], choices=mode_choices, value="text"),
        gr.update(label=t["text_label"], placeholder=t["text_ph"]),
        gr.update(label=t["tts_lang_label"]),
        gr.update(label=t["voice_ref_label"], info=t["voice_ref_info"]),
        gr.update(label=t["emotion_label"], info=t["emotion_info"]),
        gr.update(label=t["audio_label"], info=t["audio_info"]),
        gr.update(label=t["aspect_label"]),
        gr.update(label=t["advanced"]),
        gr.update(label=t["steps_label"], info=t["steps_info"]),
        gr.update(label=t["guidance_label"], info=t["guidance_info"]),
        gr.update(value=t["generate"]),
        gr.update(value=t["examples_header"]),
        gr.update(visible=True),   # text_group
        gr.update(visible=False),  # audio_group
        gr.update(label=t["output_label"]),
        # Phase 2 (10)
        gr.update(label=t["dub_video_label"], info=t["dub_video_info"]),
        gr.update(label=t["dub_target_label"]),
        gr.update(label=t["dub_voice_label"], info=t["dub_voice_info"]),
        gr.update(label=t["dub_emotion_label"]),
        gr.update(value=t["dub_btn"]),
        gr.update(label=t["dub_output_label"]),
        gr.update(label=t["dub_transcript"]),
        gr.update(label=t["dub_translation"]),
        gr.update(label=t["dub_status"]),
        gr.update(label=t["dub_details"]),
    )


def _toggle_input_mode(mode: str, _lang: str):
    is_text = (mode == "text")
    return gr.update(visible=is_text), gr.update(visible=not is_text)

# ── Interface ─────────────────────────────────────────────────────────────────
with gr.Blocks(title="AnimaStudio 🎬") as demo:

    gr.HTML("""
    <div class="as-header">
      <h1>🎬 AnimaStudio</h1>
      <p class="tagline">AI Talking Head Video Creator &amp; Video Dubbing Studio</p>
      <div class="badges">
        <span class="badge badge-purple">🎭 Lip Sync</span>
        <span class="badge badge-pink">🗣️ 23 TTS Languages</span>
        <span class="badge badge-cyan">🎙️ Voice Cloning</span>
        <span class="badge badge-teal">🎙️ Video Dubbing</span>
        <span class="badge">⚡ EchoMimic V3</span>
        <span class="badge badge-gold">🌐 EN · PT-BR · ES · AR</span>
        <span class="badge">🤖 MCP Server</span>
      </div>
    </div>
    """)

    lang_selector = gr.Radio(
        choices=list(T.keys()),
        value="🇺🇸 English",
        label=None,
        container=False,
        elem_id="lang-selector",
    )

    with gr.Tabs():

        # ══ Tab 1: Create Video ════════════════════════════════════════════════
        with gr.Tab("🎬 Create Video", id="tab-create"):
            with gr.Row(equal_height=False):
                with gr.Column(scale=1, min_width=360):
                    portrait = gr.Image(
                        label="Portrait Photo",
                        info="Upload a clear, front-facing face photo",
                        type="pil",
                        sources=["upload", "webcam"],
                    )
                    input_mode = gr.Radio(
                        choices=[(T["🇺🇸 English"]["mode_text"], "text"),
                                 (T["🇺🇸 English"]["mode_audio"], "audio")],
                        value="text",
                        label="Audio Input",
                    )
                    with gr.Group(visible=True) as text_group:
                        text_input = gr.Textbox(
                            label="Text",
                            placeholder="Type what you want the avatar to say...",
                            lines=4, max_lines=10,
                        )
                        tts_language = gr.Dropdown(choices=TTS_LANGUAGES, value="English", label="Speech Language")
                        with gr.Row():
                            voice_ref = gr.Audio(
                                label="Voice Reference",
                                info="Optional: upload audio to clone the voice style",
                                type="filepath", sources=["upload"],
                            )
                            emotion = gr.Slider(0.0, 1.0, value=0.5, step=0.05,
                                                label="Emotion Intensity", info="0 = neutral · 1 = very expressive")
                    with gr.Group(visible=False) as audio_group:
                        audio_upload = gr.Audio(
                            label="Audio File",
                            info="Upload WAV, MP3, or FLAC · max 30 seconds",
                            type="filepath", sources=["upload", "microphone"],
                        )
                    aspect_ratio = gr.Dropdown(choices=list(ASPECT_PRESETS.keys()),
                                               value="◻ 1:1 · 512×512", label="Format")
                    with gr.Accordion("⚙️ Advanced Settings", open=False) as adv_acc:
                        num_steps = gr.Slider(5, 50, value=DEFAULT_STEPS, step=1,
                                              label="Inference Steps", info="More steps = higher quality, slower")
                        guidance_scale = gr.Slider(1.0, 10.0, value=DEFAULT_CFG, step=0.5,
                                                   label="Guidance Scale", info="Higher = follows audio more strictly")
                    gen_btn = gr.Button("🎬 Generate Video", variant="primary", elem_id="gen-btn", size="lg")
                    examples_header = gr.Markdown("### 💡 Try These Examples")
                    gr.Examples(examples=ALL_EXAMPLES_FLAT, inputs=[text_input, tts_language, emotion], label=None)

                with gr.Column(scale=1, min_width=440):
                    output_video = gr.Video(label="Generated Video", format="mp4", autoplay=True,
                                            height=640, elem_id="output-video", show_download_button=True)

        # ══ Tab 2: Dub Video ═══════════════════════════════════════════════════
        with gr.Tab("🎙️ Dub Video", id="tab-dub"):
            with gr.Row(equal_height=False):
                with gr.Column(scale=1, min_width=360):
                    dub_video_input = gr.Video(label="Input Video",
                                               info="Upload a video to dub (max 60 seconds)",
                                               sources=["upload"])
                    dub_target_lang = gr.Dropdown(choices=TTS_LANGUAGES, value="English", label="Target Language")
                    dub_voice_ref = gr.Audio(label="Voice Reference",
                                             info="Optional: upload audio to clone voice style for dubbing",
                                             type="filepath", sources=["upload"])
                    dub_emotion = gr.Slider(0.0, 1.0, value=0.5, step=0.05, label="Emotion Intensity")
                    dub_btn = gr.Button("🎙️ Dub Video", variant="primary", elem_id="dub-btn", size="lg")
                    gr.HTML("""
                    <div style="color:#94a3b8;font-size:0.82rem;margin-top:0.5rem;padding:0.75rem;
                                background:rgba(6,182,212,0.05);border-radius:0.5rem;
                                border:1px solid rgba(6,182,212,0.15);">
                      <strong>How it works:</strong> Whisper transcribes → NLLB-200 translates →
                      Chatterbox TTS synthesizes → audio replaces original track.
                    </div>
                    """)

                with gr.Column(scale=1, min_width=440):
                    dub_output_video = gr.Video(label="Dubbed Video", format="mp4", autoplay=True,
                                                height=480, elem_id="dub-output-video", show_download_button=True)
                    with gr.Accordion("Details", open=False) as dub_details_acc:
                        dub_transcript_box = gr.Textbox(label="Detected Transcript", interactive=False, lines=4)
                        dub_translation_box = gr.Textbox(label="Translation", interactive=False, lines=4)
                        dub_status_box = gr.Textbox(label="Status", interactive=False, lines=2)

    gr.HTML("""
    <div class="as-footer">
      <strong>Models:</strong>
      <a href="https://huggingface.co/BadToBest/EchoMimicV3" target="_blank">EchoMimic V3</a>
      (Apache 2.0) &nbsp;·&nbsp;
      <a href="https://huggingface.co/ResembleAI/chatterbox" target="_blank">Chatterbox TTS</a>
      (MIT) &nbsp;·&nbsp;
      <a href="https://huggingface.co/openai/whisper-large-v3-turbo" target="_blank">Whisper Turbo</a>
      (MIT) &nbsp;·&nbsp;
      <a href="https://huggingface.co/facebook/nllb-200-distilled-600M" target="_blank">NLLB-200</a>
      (CC-BY-NC) &nbsp;·&nbsp;
      <strong>Space by:</strong>
      <a href="https://huggingface.co/lulavc" target="_blank">lulavc</a>
      &nbsp;·&nbsp; ZeroGPU &nbsp;·&nbsp; A10G
    </div>
    """)

    # ── Events ────────────────────────────────────────────────────────────────
    gen_btn.click(
        generate,
        inputs=[portrait, input_mode, text_input, tts_language,
                voice_ref, audio_upload, aspect_ratio, emotion,
                num_steps, guidance_scale, lang_selector],
        outputs=output_video,
    )

    input_mode.change(_toggle_input_mode, inputs=[input_mode, lang_selector],
                      outputs=[text_group, audio_group])

    dub_btn.click(
        dub_video,
        inputs=[dub_video_input, dub_target_lang, dub_voice_ref, dub_emotion, lang_selector],
        outputs=[dub_output_video, dub_transcript_box, dub_translation_box, dub_status_box],
    )

    # Language switcher — 26 outputs, must match switch_language() return tuple order
    _lang_out = [
        # Phase 1 (16)
        portrait, input_mode, text_input, tts_language,
        voice_ref, emotion, audio_upload, aspect_ratio,
        adv_acc, num_steps, guidance_scale, gen_btn, examples_header,
        text_group, audio_group, output_video,
        # Phase 2 (10)
        dub_video_input, dub_target_lang, dub_voice_ref,
        dub_emotion, dub_btn, dub_output_video,
        dub_transcript_box, dub_translation_box,
        dub_status_box, dub_details_acc,
    ]
    lang_selector.change(switch_language, inputs=lang_selector, outputs=_lang_out)


if __name__ == "__main__":
    demo.launch(theme=THEME, css=CSS, mcp_server=True)
dubbing.py ADDED
@@ -0,0 +1,188 @@
1
+ """Video dubbing pipeline: extract audio → transcribe → translate → TTS → mux."""
2
+
3
+ import gc
4
+ import logging
5
+ import os
6
+ import subprocess
7
+ import tempfile
8
+ import time
9
+ from dataclasses import dataclass
10
+ from typing import Optional
11
+
12
+ import torch
13
+ from huggingface_hub import InferenceClient
14
+
15
+ from lang_codes import get_nllb_code, whisper_code_to_display
16
+
17
+ log = logging.getLogger(__name__)
18
+
19
+ # ── Constants ────────────────────────────────────────────────────────────────
20
+ MAX_DUB_AUDIO_SEC = 60
21
+ WHISPER_MODEL_SIZE = "turbo" # ~809M params, ~2GB VRAM on A10G
22
+ NLLB_MODEL_ID = "facebook/nllb-200-distilled-600M"
23
+
24
+ # ── Singleton ─────────────────────────────────────────────────────────────────
25
+ _whisper_model = None
26
+
27
+
28
+ @dataclass(frozen=True)
29
+ class TranscriptionResult:
30
+ text: str
31
+ language: str # Whisper ISO 639-1 code (e.g. "en")
32
+ language_display: str # Display name matching TTS_LANGUAGES (e.g. "English")
33
+ segments: tuple # tuple of {"start", "end", "text"} dicts
34
+
35
+
36
+ # ── Whisper lifecycle ─────────────────────────────────────────────────────────
37
+
38
+ def _load_whisper():
39
+ global _whisper_model
40
+ if _whisper_model is None:
41
+ import whisper
42
+ log.info("Loading Whisper %s…", WHISPER_MODEL_SIZE)
43
+ _whisper_model = whisper.load_model(WHISPER_MODEL_SIZE, device="cpu")
44
+ log.info("Whisper %s ready", WHISPER_MODEL_SIZE)
45
+ return _whisper_model
46
+
47
+
48
+ def _unload_whisper():
49
+ global _whisper_model
50
+ if _whisper_model is not None:
51
+ _whisper_model.to("cpu")
52
+ del _whisper_model
53
+ _whisper_model = None
54
+ torch.cuda.empty_cache()
55
+ gc.collect()
56
+ log.info("Whisper unloaded")
57
+
58
+
59
+ # ── Step 1: Extract audio ─────────────────────────────────────────────────────
60
+
61
+ def extract_audio(video_path: str) -> str:
62
+ """Extract audio from video as 16kHz mono WAV. Returns temp file path."""
63
+ with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
64
+ out_path = f.name
65
+ cmd = [
66
+ "ffmpeg", "-y", "-loglevel", "error",
67
+ "-i", video_path,
68
+ "-vn", "-acodec", "pcm_s16le",
69
+ "-ar", "16000", "-ac", "1",
70
+ out_path,
71
+ ]
72
+ subprocess.run(cmd, check=True, timeout=60)
73
+ return out_path
74
+
75
+
76
+ # ── Step 2: Transcribe (GPU) ──────────────────────────────────────────────────
77
+
78
+ def transcribe(audio_path: str) -> TranscriptionResult:
79
+ """Transcribe audio with Whisper turbo. Moves model to CUDA, then back to CPU."""
80
+ model = _load_whisper()
81
+ model.to("cuda")
82
+ try:
83
+ result = model.transcribe(audio_path, task="transcribe", fp16=True)
84
+ detected_lang = result.get("language", "en")
85
+ display_name = whisper_code_to_display(detected_lang) or "English"
86
+ segments = tuple(
87
+ {"start": s["start"], "end": s["end"], "text": s["text"]}
88
+ for s in result.get("segments", [])
89
+ )
90
+ return TranscriptionResult(
91
+ text=result["text"].strip(),
92
+ language=detected_lang,
93
+ language_display=display_name,
94
+ segments=segments,
95
+ )
96
+ finally:
97
+ model.to("cpu")
98
+ torch.cuda.empty_cache()
99
+ gc.collect()
100
+
101
+
102
+ # ── Step 3: Translate via HF Inference API (no GPU) ──────────────────────────
103
+
104
+ def translate(text: str, source_lang: str, target_lang: str) -> str:
105
+ """Translate text using NLLB-200 via HF Inference API.
106
+
107
+ Args:
108
+ text: Source text.
109
+ source_lang: Display name e.g. "English".
110
+ target_lang: Display name e.g. "Portuguese".
111
+
112
+ Returns:
113
+ Translated text string.
114
+ """
115
+ if source_lang == target_lang:
116
+ return text
117
+
118
+ src_code = get_nllb_code(source_lang)
119
+ tgt_code = get_nllb_code(target_lang)
120
+
121
+ # Client instantiated once outside the retry loop
122
+ client = InferenceClient()
123
+ last_exc: Optional[Exception] = None
124
+ for attempt in range(3):
125
+ try:
126
+ result = client.translation(
127
+ text,
128
+ model=NLLB_MODEL_ID,
129
+ src_lang=src_code,
130
+ tgt_lang=tgt_code,
131
+ )
132
+ if isinstance(result, str):
133
+ return result
134
+ if isinstance(result, dict):
135
+ return result.get("translation_text") or result.get("generated_text") or str(result)
136
+ return str(result)
137
+ except Exception as exc:
138
+ last_exc = exc
139
+ log.warning("Translation attempt %d failed: %s", attempt + 1, exc)
140
+ time.sleep(2 ** attempt)
141
+
142
+ raise RuntimeError(f"Translation failed after 3 attempts: {last_exc}") from last_exc
143
+
144
+
145
+ # ── Step 4: Mux dubbed audio onto original video ──────────────────────────────
146
+
147
+ def mux_dubbed_video(video_path: str, audio_path: str) -> str:
148
+ """Replace video audio with dubbed audio track. Returns output MP4 path.
149
+
150
+ Cleans up the output file if ffmpeg fails (no partial file leak).
151
+ """
152
+ with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as f:
153
+ out_path = f.name
154
+ cmd = [
155
+ "ffmpeg", "-y", "-loglevel", "error",
156
+ "-i", video_path,
157
+ "-i", audio_path,
158
+ "-c:v", "copy",
159
+ "-c:a", "aac", "-b:a", "128k",
160
+ "-map", "0:v:0", "-map", "1:a:0",
161
+ "-shortest",
162
+ out_path,
163
+ ]
164
+ try:
165
+ subprocess.run(cmd, check=True, timeout=120)
166
+ return out_path
167
+ except Exception:
168
+ # Clean up partial output file on ffmpeg failure
169
+ if os.path.exists(out_path):
170
+ try:
171
+ os.unlink(out_path)
172
+ except OSError:
173
+ pass
174
+ raise
175
+
176
+
177
+ # ── Utility ───────────────────────────────────────────────────────────────────
178
+
179
+ def get_video_duration(video_path: str) -> float:
180
+ """Return video duration in seconds using ffprobe."""
181
+ cmd = [
182
+ "ffprobe", "-v", "error",
183
+ "-show_entries", "format=duration",
184
+ "-of", "default=noprint_wrappers=1:nokey=1",
185
+ video_path,
186
+ ]
187
+ result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=30)
188
+ return float(result.stdout.strip())
i18n.py ADDED
@@ -0,0 +1,271 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Internationalization: TTS language list, example texts, and UI translations."""
2
+
3
+ MAX_TEXT_LEN = 500
4
+ MAX_AUDIO_SEC = 30
5
+
6
+ TTS_LANGUAGES = [
7
+ "Arabic", "Danish", "German", "Greek", "English",
8
+ "Spanish", "Finnish", "French", "Hebrew", "Hindi",
9
+ "Italian", "Japanese", "Korean", "Malay", "Dutch",
10
+ "Norwegian", "Polish", "Portuguese", "Russian", "Swedish",
11
+ "Swahili", "Turkish", "Chinese",
12
+ ]
13
+
14
+ # ── Examples per UI language ──────────────────────────────────────────────────
15
+ # Format: [text, tts_language, emotion]
16
+ EXAMPLES = {
17
+ "🇺🇸 English": [
18
+ ["Hello! Welcome to this presentation. Today I'll be sharing some exciting insights about artificial intelligence and how it's changing the world around us.", "English", 0.5],
19
+ ["I'm thrilled to announce the launch of our new project. After months of hard work and dedication, we've created something truly special I can't wait to share with you.", "English", 0.7],
20
+ ["Good morning, students! Today's lecture covers neural networks — the backbone of modern AI. By the end of this session you'll understand how machines learn from data.", "English", 0.4],
21
+ ["Breaking news: scientists have discovered a new method for sustainable energy production that could revolutionise how we power our cities. More details coming up.", "English", 0.6],
22
+ ],
23
+ "🇧🇷 Português": [
24
+ ["Olá a todos! Sejam bem-vindos a esta apresentação. Hoje vou compartilhar com vocês algumas descobertas incríveis sobre inteligência artificial e como ela está transformando o nosso mundo.", "Portuguese", 0.5],
25
+ ["Estou muito animado para anunciar o lançamento do nosso novo projeto. Depois de meses de trabalho dedicado, criamos algo verdadeiramente especial que mal posso esperar para mostrar a vocês.", "Portuguese", 0.7],
26
+ ["Bom dia, estudantes! A aula de hoje aborda redes neurais — a base da IA moderna. Ao final desta sessão, vocês vão entender como as máquinas aprendem com os dados.", "Portuguese", 0.4],
27
+ ["Esta receita tradicional foi passada de geração em geração na minha família. Hoje vou ensinar como preparar um prato delicioso que vai impressionar todos os seus convidados.", "Portuguese", 0.6],
28
+ ],
29
+ "🇪🇸 Español": [
30
+ ["¡Hola a todos! Bienvenidos a esta presentación. Hoy voy a compartir con ustedes algunos descubrimientos fascinantes sobre la inteligencia artificial y cómo está transformando nuestro mundo.", "Spanish", 0.5],
31
+ ["Estoy muy emocionado de anunciar el lanzamiento de nuestro nuevo proyecto. Después de meses de arduo trabajo hemos creado algo verdaderamente especial que no puedo esperar para mostrarles.", "Spanish", 0.7],
32
+ ["Buenos días, estudiantes. La clase de hoy trata sobre las redes neuronales, la columna vertebral de la IA moderna. Al final de esta sesión entenderán cómo las máquinas aprenden de los datos.", "Spanish", 0.4],
33
+ ["Esta receta tradicional ha pasado de generación en generación en mi familia. Hoy les enseñaré cómo preparar un plato delicioso que impresionará a todos sus invitados.", "Spanish", 0.6],
34
+ ],
35
+ "🇪🇬 عربي": [
36
+ ["مرحباً بالجميع! أهلاً وسهلاً بكم في هذا العرض التقديمي. اليوم سأشارككم بعض الاكتشافات المثيرة حول الذكاء الاصطناعي وكيف يُغيّر عالمنا من حولنا.", "Arabic", 0.5],
37
+ ["يسعدني الإعلان عن إطلاق مشروعنا الجديد. بعد أشهر من العمل الدؤوب أبدعنا شيئاً رائعاً حقاً لا أصبر على مشاركته معكم جميعاً.", "Arabic", 0.7],
38
+ ["صباح الخير أيها الطلاب! محاضرة اليوم تتناول الشبكات العصبية التي تمثّل الأساس التقني للذكاء الاصطناعي الحديث. بنهاية هذه الجلسة ستفهمون كيف تتعلم الآلات من البيانات.", "Arabic", 0.4],
39
+ ["هذه الوصفة التقليدية انتقلت من جيل إلى جيل في عائلتي. اليوم سأعلّمكم كيفية تحضير طبق شهي سيُبهر جميع ضيوفكم ويجعلهم يطلبون المزيد.", "Arabic", 0.6],
40
+ ],
41
+ }
42
+
43
+ ALL_EXAMPLES_FLAT = [ex for exs in EXAMPLES.values() for ex in exs]
44
+
45
+ # ── UI translations ────────────────────────────────────────────────────────────
46
+ T: dict[str, dict[str, str]] = {
47
+ "🇺🇸 English": {
48
+ # Phase 1: Create Video
49
+ "tab_create": "🎬 Create Video",
50
+ "tagline": "AI Talking Head Video Creator",
51
+ "input_mode_label": "Audio Input",
52
+ "mode_text": "Text to Speech",
53
+ "mode_audio": "Upload Audio",
54
+ "portrait_label": "Portrait Photo",
55
+ "portrait_info": "Upload a clear, front-facing face photo",
56
+ "text_label": "Text",
57
+ "text_ph": "Type what you want the avatar to say...",
58
+ "tts_lang_label": "Speech Language",
59
+ "voice_ref_label": "Voice Reference",
60
+ "voice_ref_info": "Optional: upload audio to clone the voice style",
61
+ "emotion_label": "Emotion Intensity",
62
+ "emotion_info": "0 = neutral · 1 = very expressive",
63
+ "audio_label": "Audio File",
64
+ "audio_info": "Upload WAV, MP3, or FLAC · max 30 seconds",
65
+ "aspect_label": "Format",
66
+ "advanced": "⚙️ Advanced Settings",
67
+ "steps_label": "Inference Steps",
68
+ "steps_info": "More steps = higher quality, slower",
69
+ "guidance_label": "Guidance Scale",
70
+ "guidance_info": "Higher = follows audio more strictly",
71
+ "generate": "🎬 Generate Video",
72
+ "output_label": "Generated Video",
73
+ "examples_header": "### 💡 Try These Examples",
74
+ "err_no_portrait": "Please upload a portrait photo.",
75
+ "err_no_text": "Please enter some text.",
76
+ "err_no_audio": "Please upload an audio file.",
77
+ "err_text_long": f"Text too long (max {MAX_TEXT_LEN} characters).",
78
+ "err_audio_long": f"Audio too long (max {MAX_AUDIO_SEC} seconds).",
79
+ "err_oom": "GPU out of memory. Try a smaller format or fewer steps.",
80
+ "err_no_face": "No face detected. Please upload a clear front-facing portrait.",
81
+ "err_model": "Model not loaded. Please refresh and try again.",
82
+ # Phase 2: Dub Video
83
+ "tab_dub": "🎙️ Dub Video",
84
+ "dub_tagline": "Dub any video into 23 languages",
85
+ "dub_video_label": "Input Video",
86
+ "dub_video_info": "Upload a video to dub (max 60 seconds)",
87
+ "dub_target_label": "Target Language",
88
+ "dub_voice_label": "Voice Reference",
89
+ "dub_voice_info": "Optional: upload audio to clone voice style for dubbing",
90
+ "dub_emotion_label": "Emotion Intensity",
91
+ "dub_btn": "🎙️ Dub Video",
92
+ "dub_output_label": "Dubbed Video",
93
+ "dub_transcript": "Detected Transcript",
94
+ "dub_translation": "Translation",
95
+ "dub_status": "Status",
96
+ "dub_details": "Details",
97
+ "err_no_video": "Please upload a video.",
98
+ "err_video_long": "Video too long (max 60 seconds).",
99
+ "err_translate": "Translation failed. Please try again.",
100
+ "err_transcribe": "Transcription failed. Please try again.",
101
+ "err_dub_text_long": "Transcription too long to synthesize. Please use a shorter video.",
102
+ },
103
+ "🇧🇷 Português": {
104
+ # Phase 1: Create Video
105
+ "tab_create": "🎬 Criar Vídeo",
106
+ "tagline": "Criador de Vídeo Avatar com IA",
107
+ "input_mode_label": "Entrada de Áudio",
108
+ "mode_text": "Texto para Fala",
109
+ "mode_audio": "Enviar Áudio",
110
+ "portrait_label": "Foto Retrato",
111
+ "portrait_info": "Envie uma foto frontal clara do rosto",
112
+ "text_label": "Texto",
113
+ "text_ph": "Digite o que você quer que o avatar diga...",
114
+ "tts_lang_label": "Idioma da Fala",
115
+ "voice_ref_label": "Referência de Voz",
116
+ "voice_ref_info": "Opcional: envie um áudio para clonar o estilo de voz",
117
+ "emotion_label": "Intensidade da Emoção",
118
+ "emotion_info": "0 = neutro · 1 = muito expressivo",
119
+ "audio_label": "Arquivo de Áudio",
120
+ "audio_info": "Envie WAV, MP3 ou FLAC · máx. 30 segundos",
121
+ "aspect_label": "Formato",
122
+ "advanced": "⚙️ Configurações Avançadas",
123
+ "steps_label": "Etapas de Inferência",
124
+ "steps_info": "Mais etapas = maior qualidade, mais lento",
125
+ "guidance_label": "Escala de Orientação",
126
+ "guidance_info": "Maior = segue o áudio com mais precisão",
127
+ "generate": "🎬 Gerar Vídeo",
128
+ "output_label": "Vídeo Gerado",
129
+ "examples_header": "### 💡 Experimente Estes Exemplos",
130
+ "err_no_portrait": "Por favor, envie uma foto retrato.",
131
+ "err_no_text": "Por favor, insira algum texto.",
132
+ "err_no_audio": "Por favor, envie um arquivo de áudio.",
133
+ "err_text_long": f"Texto muito longo (máx. {MAX_TEXT_LEN} caracteres).",
134
+ "err_audio_long": f"Áudio muito longo (máx. {MAX_AUDIO_SEC} segundos).",
135
+ "err_oom": "GPU sem memória. Tente um formato menor ou menos etapas.",
136
+ "err_no_face": "Nenhum rosto detectado. Envie uma foto retrato frontal clara.",
137
+ "err_model": "Modelo não carregado. Atualize a página e tente novamente.",
138
+ # Phase 2: Dub Video
139
+ "tab_dub": "🎙️ Dublar Vídeo",
140
+ "dub_tagline": "Duble qualquer vídeo em 23 idiomas",
141
+ "dub_video_label": "Vídeo de Entrada",
142
+ "dub_video_info": "Envie um vídeo para dublar (máx. 60 segundos)",
143
+ "dub_target_label": "Idioma de Destino",
144
+ "dub_voice_label": "Referência de Voz",
145
+ "dub_voice_info": "Opcional: envie áudio para clonar o estilo de voz na dublagem",
146
+ "dub_emotion_label": "Intensidade da Emoção",
147
+ "dub_btn": "🎙️ Dublar Vídeo",
148
+ "dub_output_label": "Vídeo Dublado",
149
+ "dub_transcript": "Transcrição Detectada",
150
+ "dub_translation": "Tradução",
151
+ "dub_status": "Status",
152
+ "dub_details": "Detalhes",
153
+ "err_no_video": "Por favor, envie um vídeo.",
154
+ "err_video_long": "Vídeo muito longo (máx. 60 segundos).",
155
+ "err_translate": "Tradução falhou. Por favor, tente novamente.",
156
+ "err_transcribe": "Transcrição falhou. Por favor, tente novamente.",
157
+ "err_dub_text_long": "Transcrição longa demais para sintetizar. Use um vídeo mais curto.",
158
+ },
159
+ "🇪🇸 Español": {
160
+ # Phase 1: Create Video
161
+ "tab_create": "🎬 Crear Vídeo",
162
+ "tagline": "Creador de Vídeo Avatar con IA",
163
+ "input_mode_label": "Entrada de Audio",
164
+ "mode_text": "Texto a Voz",
165
+ "mode_audio": "Subir Audio",
166
+ "portrait_label": "Foto Retrato",
167
+ "portrait_info": "Sube una foto frontal clara del rostro",
168
+ "text_label": "Texto",
169
+ "text_ph": "Escribe lo que quieres que diga el avatar...",
170
+ "tts_lang_label": "Idioma del Habla",
171
+ "voice_ref_label": "Referencia de Voz",
172
+ "voice_ref_info": "Opcional: sube un audio para clonar el estilo de voz",
173
+ "emotion_label": "Intensidad Emocional",
174
+ "emotion_info": "0 = neutro · 1 = muy expresivo",
175
+ "audio_label": "Archivo de Audio",
176
+ "audio_info": "Sube WAV, MP3 o FLAC · máx. 30 segundos",
177
+ "aspect_label": "Formato",
178
+ "advanced": "⚙️ Configuración Avanzada",
179
+ "steps_label": "Pasos de Inferencia",
180
+ "steps_info": "Más pasos = mayor calidad, más lento",
181
+ "guidance_label": "Escala de Guía",
182
+ "guidance_info": "Mayor = sigue el audio con más precisión",
183
+ "generate": "🎬 Generar Vídeo",
184
+ "output_label": "Vídeo Generado",
185
+ "examples_header": "### 💡 Prueba Estos Ejemplos",
186
+ "err_no_portrait": "Por favor, sube una foto retrato.",
187
+ "err_no_text": "Por favor, ingresa algún texto.",
188
+ "err_no_audio": "Por favor, sube un archivo de audio.",
189
+ "err_text_long": f"Texto demasiado largo (máx. {MAX_TEXT_LEN} caracteres).",
190
+ "err_audio_long": f"Audio demasiado largo (máx. {MAX_AUDIO_SEC} segundos).",
191
+ "err_oom": "Sin memoria GPU. Prueba un formato menor o menos pasos.",
192
+ "err_no_face": "No se detectó rostro. Sube una foto retrato frontal clara.",
193
+ "err_model": "Modelo no cargado. Recarga la página e intenta de nuevo.",
194
+ # Phase 2: Dub Video
195
+ "tab_dub": "🎙️ Doblar Vídeo",
196
+ "dub_tagline": "Dobla cualquier vídeo a 23 idiomas",
197
+ "dub_video_label": "Vídeo de Entrada",
198
+ "dub_video_info": "Sube un vídeo para doblar (máx. 60 segundos)",
199
+ "dub_target_label": "Idioma de Destino",
200
+ "dub_voice_label": "Referencia de Voz",
201
+ "dub_voice_info": "Opcional: sube audio para clonar el estilo de voz en el doblaje",
202
+ "dub_emotion_label": "Intensidad Emocional",
203
+ "dub_btn": "🎙️ Doblar Vídeo",
204
+ "dub_output_label": "Vídeo Doblado",
205
+ "dub_transcript": "Transcripción Detectada",
206
+ "dub_translation": "Traducción",
207
+ "dub_status": "Estado",
208
+ "dub_details": "Detalles",
209
+ "err_no_video": "Por favor, sube un vídeo.",
210
+ "err_video_long": "Vídeo demasiado largo (máx. 60 segundos).",
211
+ "err_translate": "Traducción fallida. Por favor, inténtalo de nuevo.",
212
+ "err_transcribe": "Transcripción fallida. Por favor, inténtalo de nuevo.",
213
+ "err_dub_text_long": "Transcripción demasiado larga. Usa un vídeo más corto.",
214
+ },
215
+ "🇪🇬 عربي": {
216
+ # Phase 1: Create Video
217
+ "tab_create": "🎬 إنشاء فيديو",
218
+ "tagline": "منشئ فيديو الأفاتار بالذكاء الاصطناعي",
219
+ "input_mode_label": "مدخل الصوت",
220
+ "mode_text": "نص إلى كلام",
221
+ "mode_audio": "رفع ملف صوتي",
222
+ "portrait_label": "صورة الوجه",
223
+ "portrait_info": "ارفع صورة واضحة للوجه من الأمام",
224
+ "text_label": "النص",
225
+ "text_ph": "اكتب ما تريد أن يقوله الأفاتار...",
226
+ "tts_lang_label": "لغة الكلام",
227
+ "voice_ref_label": "مرجع الصوت",
228
+ "voice_ref_info": "اختياري: ارفع ملفاً صوتياً لاستنساخ أسلوب الصوت",
229
+ "emotion_label": "شدة التعبير العاطفي",
230
+ "emotion_info": "0 = محايد · 1 = تعبيري جداً",
231
+ "audio_label": "الملف الصوتي",
232
+ "audio_info": "ارفع WAV أو MP3 أو FLAC · الحد الأقصى 30 ثانية",
233
+ "aspect_label": "التنسيق",
234
+ "advanced": "⚙️ الإعدادات المتقدمة",
235
+ "steps_label": "خطوات الاستدلال",
236
+ "steps_info": "المزيد من الخطوات = جودة أعلى، وقت أطول",
237
+ "guidance_label": "مقياس التوجيه",
238
+ "guidance_info": "أعلى = يتبع الصوت بدقة أكبر",
239
+ "generate": "🎬 توليد الفيديو",
240
+ "output_label": "الفيديو المُنشأ",
241
+ "examples_header": "### 💡 جرّب هذه الأمثلة",
242
+ "err_no_portrait": "الرجاء رفع صورة وجه.",
243
+ "err_no_text": "الرجاء إدخال نص.",
244
+ "err_no_audio": "الرجاء رفع ملف صوتي.",
245
+ "err_text_long": f"النص طويل جداً (الحد الأقصى {MAX_TEXT_LEN} حرف).",
246
+ "err_audio_long": f"الصوت طويل جداً (الحد الأقصى {MAX_AUDIO_SEC} ثانية).",
247
+ "err_oom": "نفدت ذاكرة GPU. جرّب تنسيقاً أصغر أو خطوات أقل.",
248
+ "err_no_face": "لم يُكتشف أي وجه. ارفع صورة وجه واضحة من الأمام.",
249
+ "err_model": "النموذج غير محمّل. أعد تحميل الصفحة وحاول مجدداً.",
250
+ # Phase 2: Dub Video
251
+ "tab_dub": "🎙️ دبلجة فيديو",
252
+ "dub_tagline": "دبلج أي فيديو إلى 23 لغة",
253
+ "dub_video_label": "الفيديو المُدخل",
254
+ "dub_video_info": "ارفع فيديو للدبلجة (الحد الأقصى 60 ثانية)",
255
+ "dub_target_label": "اللغة الهدف",
256
+ "dub_voice_label": "مرجع الصوت",
257
+ "dub_voice_info": "اختياري: ارفع ملفاً صوتياً لاستنساخ أسلوب الصوت في الدبلجة",
258
+ "dub_emotion_label": "شدة التعبير العاطفي",
259
+ "dub_btn": "🎙️ دبلجة الفيديو",
260
+ "dub_output_label": "الفيديو المدبلج",
261
+ "dub_transcript": "النص المُكتشف",
262
+ "dub_translation": "الترجمة",
263
+ "dub_status": "الحالة",
264
+ "dub_details": "التفاصيل",
265
+ "err_no_video": "الرجاء رفع فيديو.",
266
+ "err_video_long": "الفيديو طويل جداً (الحد الأقصى 60 ثانية).",
267
+ "err_translate": "فشلت الترجمة. الرجاء المحاولة مجدداً.",
268
+ "err_transcribe": "فشل النسخ. الرجاء المحاولة مجدداً.",
269
+ "err_dub_text_long": "النص المُكتشف طويل جداً. استخدم مقطعاً أقصر.",
270
+ },
271
+ }
lang_codes.py ADDED
@@ -0,0 +1,66 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Language code mappings between TTS display names, Whisper ISO-639-1, and NLLB BCP-47."""
2
+
3
+ from dataclasses import dataclass
4
+ from typing import Mapping
5
+
6
+
7
+ @dataclass(frozen=True)
8
+ class LangInfo:
9
+ display: str # Display name matching TTS_LANGUAGES (e.g., "Portuguese")
10
+ nllb: str # NLLB-200 BCP-47 flores code (e.g., "por_Latn")
11
+ whisper: str # Whisper ISO 639-1 code (e.g., "pt")
12
+
13
+
14
+ LANG_MAP: Mapping[str, LangInfo] = {
15
+ "Arabic": LangInfo("Arabic", "arb_Arab", "ar"),
16
+ "Danish": LangInfo("Danish", "dan_Latn", "da"),
17
+ "German": LangInfo("German", "deu_Latn", "de"),
18
+ "Greek": LangInfo("Greek", "ell_Grek", "el"),
19
+ "English": LangInfo("English", "eng_Latn", "en"),
20
+ "Spanish": LangInfo("Spanish", "spa_Latn", "es"),
21
+ "Finnish": LangInfo("Finnish", "fin_Latn", "fi"),
22
+ "French": LangInfo("French", "fra_Latn", "fr"),
23
+ "Hebrew": LangInfo("Hebrew", "heb_Hebr", "he"),
24
+ "Hindi": LangInfo("Hindi", "hin_Deva", "hi"),
25
+ "Italian": LangInfo("Italian", "ita_Latn", "it"),
26
+ "Japanese": LangInfo("Japanese", "jpn_Jpan", "ja"),
27
+ "Korean": LangInfo("Korean", "kor_Hang", "ko"),
28
+ "Malay": LangInfo("Malay", "zsm_Latn", "ms"),
29
+ "Dutch": LangInfo("Dutch", "nld_Latn", "nl"),
30
+ "Norwegian": LangInfo("Norwegian", "nob_Latn", "no"),
31
+ "Polish": LangInfo("Polish", "pol_Latn", "pl"),
32
+ "Portuguese": LangInfo("Portuguese", "por_Latn", "pt"),
33
+ "Russian": LangInfo("Russian", "rus_Cyrl", "ru"),
34
+ "Swedish": LangInfo("Swedish", "swe_Latn", "sv"),
35
+ "Swahili": LangInfo("Swahili", "swh_Latn", "sw"),
36
+ "Turkish": LangInfo("Turkish", "tur_Latn", "tr"),
37
+ "Chinese": LangInfo("Chinese", "zho_Hans", "zh"),
38
+ }
39
+
40
+
41
+ def get_nllb_code(lang_display: str) -> str:
42
+ info = LANG_MAP.get(lang_display)
43
+ if info is None:
44
+ raise ValueError(f"Unknown language: {lang_display!r}")
45
+ return info.nllb
46
+
47
+
48
+ def get_whisper_code(lang_display: str) -> str:
49
+ info = LANG_MAP.get(lang_display)
50
+ if info is None:
51
+ raise ValueError(f"Unknown language: {lang_display!r}")
52
+ return info.whisper
53
+
54
+
55
+ def whisper_code_to_display(whisper_code: str) -> str | None:
56
+ for info in LANG_MAP.values():
57
+ if info.whisper == whisper_code:
58
+ return info.display
59
+ return None
60
+
61
+
62
+ def nllb_code_to_display(nllb_code: str) -> str | None:
63
+ for info in LANG_MAP.values():
64
+ if info.nllb == nllb_code:
65
+ return info.display
66
+ return None
requirements.txt ADDED
@@ -0,0 +1,34 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ── Exact versions matching chatterbox-tts 0.1.6 deps ────────────────────────
2
+ torch==2.6.0
3
+ torchaudio==2.6.0
4
+ torchvision==0.21.0
5
+ transformers==4.46.3
6
+ tokenizers==0.20.3
7
+ diffusers==0.29.0
8
+ safetensors==0.5.3
9
+ accelerate==1.2.1
10
+
11
+ # ── Gradio (conflicts with chatterbox-tts PyPI — use GitHub clone instead) ───
12
+ gradio==6.0.2
13
+ spaces
14
+
15
+ # ── Chatterbox runtime deps (no chatterbox-tts pip install — cloned at runtime) ─
16
+ librosa==0.11.0
17
+ s3tokenizer
18
+ resemble-perth==1.0.1
19
+ conformer==0.3.2
20
+ spacy-pkuseg
21
+ pykakasi==2.3.0
22
+ pyloudnorm
23
+ omegaconf
24
+ numpy>=1.24.0
25
+
26
+ # ── Other ─────────────────────────────────────────────────────────────────────
27
+ opencv-python-headless
28
+ Pillow>=10.0.0
29
+ huggingface_hub>=0.23.0
30
+
31
+ # ── Phase 2: Video Dubbing ───────────────────────────────────────────────────
32
+ openai-whisper
33
+ tiktoken
34
+ more-itertools
styles.py ADDED
@@ -0,0 +1,153 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Gradio theme and CSS for AnimaStudio."""
2
+ import gradio as gr
3
+
4
+ THEME = gr.themes.Soft(
5
+ primary_hue="purple",
6
+ secondary_hue="pink",
7
+ neutral_hue="slate",
8
+ )
9
+
10
+ CSS = """
11
+ /* ── Global ─────────────────────────────────────── */
12
+ .gradio-container {
13
+ max-width: 1380px !important;
14
+ margin: 0 auto !important;
15
+ }
16
+
17
+ /* ── Header ──────────────────────────────────────── */
18
+ .as-header {
19
+ text-align: center;
20
+ padding: 2.4rem 1rem 2rem;
21
+ border-radius: 1.5rem;
22
+ margin-bottom: 1.5rem;
23
+ background: linear-gradient(135deg, #1a0b2e 0%, #2d1b4e 40%, #1a1040 70%, #0d0b1e 100%);
24
+ border: 1px solid rgba(168,85,247,0.25);
25
+ box-shadow: 0 4px 60px rgba(168,85,247,0.1), inset 0 1px 0 rgba(255,255,255,0.05);
26
+ position: relative;
27
+ overflow: hidden;
28
+ }
29
+
30
+ .as-header::before {
31
+ content: '';
32
+ position: absolute;
33
+ top: -50%; left: -50%;
34
+ width: 200%; height: 200%;
35
+ background: radial-gradient(ellipse at center, rgba(168,85,247,0.08) 0%, transparent 60%);
36
+ pointer-events: none;
37
+ }
38
+
39
+ .as-header h1 {
40
+ font-size: 3.2rem !important;
41
+ font-weight: 900 !important;
42
+ margin: 0 0 0.6rem !important;
43
+ line-height: 1.05 !important;
44
+ background: linear-gradient(90deg, #e879f9 0%, #a855f7 40%, #f472b6 80%, #fb923c 100%);
45
+ -webkit-background-clip: text;
46
+ -webkit-text-fill-color: transparent;
47
+ background-clip: text;
48
+ letter-spacing: -0.035em;
49
+ }
50
+
51
+ .as-header .tagline {
52
+ color: #94a3b8 !important;
53
+ font-size: 1.0rem !important;
54
+ margin: 0 0 1rem !important;
55
+ }
56
+
57
+ .as-header .badges {
58
+ display: flex;
59
+ justify-content: center;
60
+ gap: 0.5rem;
61
+ flex-wrap: wrap;
62
+ }
63
+
64
+ /* ── Badges ───────────────────────────────────────── */
65
+ .badge {
66
+ display: inline-flex;
67
+ align-items: center;
68
+ gap: 0.3rem;
69
+ padding: 0.25rem 0.75rem;
70
+ border-radius: 999px;
71
+ font-size: 0.78rem;
72
+ font-weight: 600;
73
+ background: rgba(255,255,255,0.06);
74
+ border: 1px solid rgba(255,255,255,0.1);
75
+ color: #cbd5e1;
76
+ }
77
+ .badge-purple { border-color: rgba(168,85,247,0.4); color: #a855f7; background: rgba(168,85,247,0.08); }
78
+ .badge-pink { border-color: rgba(244,114,182,0.4); color: #f472b6; background: rgba(244,114,182,0.08); }
79
+ .badge-cyan { border-color: rgba(34,211,238,0.4); color: #22d3ee; background: rgba(34,211,238,0.08); }
80
+ .badge-gold { border-color: rgba(251,191,36,0.4); color: #fbbf24; background: rgba(251,191,36,0.08); }
81
+ .badge-teal { border-color: rgba(20,184,166,0.4); color: #14b8a6; background: rgba(20,184,166,0.08); }
82
+
83
+ /* ── Language selector ────────────────────────────── */
84
+ #lang-selector .wrap { gap: 4px !important; justify-content: center !important; }
85
+ #lang-selector label span { font-size: 0.9rem !important; }
86
+ #lang-selector { margin-bottom: 0.5rem !important; }
87
+
88
+ /* ── Generate Button ──────────────────────────────── */
89
+ #gen-btn {
90
+ background: linear-gradient(135deg, #a855f7 0%, #ec4899 100%) !important;
91
+ border: none !important;
92
+ font-size: 1.15rem !important;
93
+ font-weight: 700 !important;
94
+ padding: 0.85rem 1rem !important;
95
+ border-radius: 0.85rem !important;
96
+ color: white !important;
97
+ box-shadow: 0 4px 24px rgba(168,85,247,0.4) !important;
98
+ transition: all 0.2s ease !important;
99
+ letter-spacing: 0.02em !important;
100
+ width: 100% !important;
101
+ }
102
+ #gen-btn:hover {
103
+ transform: translateY(-2px) !important;
104
+ box-shadow: 0 8px 32px rgba(168,85,247,0.55) !important;
105
+ }
106
+ #gen-btn:active { transform: translateY(0) !important; }
107
+
108
+ /* ── Dub Button ────────────────────────────────────── */
109
+ #dub-btn {
110
+ background: linear-gradient(135deg, #06b6d4 0%, #a855f7 100%) !important;
111
+ border: none !important;
112
+ font-size: 1.15rem !important;
113
+ font-weight: 700 !important;
114
+ padding: 0.85rem 1rem !important;
115
+ border-radius: 0.85rem !important;
116
+ color: white !important;
117
+ box-shadow: 0 4px 24px rgba(6,182,212,0.4) !important;
118
+ transition: all 0.2s ease !important;
119
+ letter-spacing: 0.02em !important;
120
+ width: 100% !important;
121
+ }
122
+ #dub-btn:hover {
123
+ transform: translateY(-2px) !important;
124
+ box-shadow: 0 8px 32px rgba(6,182,212,0.55) !important;
125
+ }
126
+ #dub-btn:active { transform: translateY(0) !important; }
127
+
128
+ /* ── Output Video ─────────────────────────────────── */
129
+ #output-video, #dub-output-video {
130
+ border-radius: 1rem !important;
131
+ overflow: hidden !important;
132
+ background: #0f172a !important;
133
+ min-height: 420px !important;
134
+ }
135
+
136
+ /* ── Footer ──���────────────────────────────────────── */
137
+ .as-footer {
138
+ text-align: center;
139
+ padding: 1.2rem 0 0.5rem;
140
+ color: #475569;
141
+ font-size: 0.82rem;
142
+ border-top: 1px solid #1e293b;
143
+ margin-top: 1rem;
144
+ }
145
+ .as-footer a { color: #a855f7 !important; text-decoration: none !important; }
146
+ .as-footer a:hover { text-decoration: underline !important; }
147
+
148
+ /* ── Mobile ───────────────────────────────────────── */
149
+ @media (max-width: 768px) {
150
+ .as-header h1 { font-size: 2rem !important; }
151
+ .badges { gap: 0.35rem !important; }
152
+ }
153
+ """