Spaces: Running on Zero

lulavc committed on
Commit · 43f8b96 · 0 parent(s)

fix: move theme/css from gr.Blocks() to demo.launch() for Gradio 6.x compatibility

Files changed:
- README.md +238 -0
- app.py +539 -0
- dubbing.py +188 -0
- i18n.py +271 -0
- lang_codes.py +66 -0
- requirements.txt +34 -0
- styles.py +153 -0
README.md
ADDED
@@ -0,0 +1,238 @@
---
title: AnimaStudio 🎬
emoji: 🎬
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.0.2
app_file: app.py
pinned: true
license: apache-2.0
short_description: AI talking head & video dubbing — free, 23 languages
tags:
- video-generation
- talking-head
- lip-sync
- avatar
- tts
- voice-cloning
- multilingual
- mcp-server
- echomimic
- chatterbox
- dubbing
- whisper
- nllb
---

# 🎬 AnimaStudio

Turn any portrait photo into a talking head video using your voice or typed text, or dub any video into 23 languages — free, no sign-up required.

---

## ✨ Features

| Feature | Details |
|---------|---------|
| 🎭 **Realistic Lip Sync** | EchoMimic V3 Flash (AAAI 2026) — state-of-the-art talking head animation |
| 🗣️ **23-Language TTS** | Type text in any of 23 languages and generate natural speech |
| 🎙️ **Voice Cloning** | Upload a voice reference clip to clone the speaking style |
| 📤 **Audio Upload** | Upload your own WAV / MP3 / FLAC instead of using TTS |
| 🎬 **Video Dubbing** | Upload a video (up to 60 s) and dub it into any of the 23 supported languages |
| 📐 **3 Aspect Ratios** | 9:16 mobile, 1:1 square, 16:9 landscape |
| 🌐 **4 UI Languages** | Full interface in English, Português (BR), Español, and عربي |
| 📥 **Download** | One-click download of the generated MP4 |
| 🤖 **MCP Server** | Use as a tool in Claude, Cursor, and any MCP-compatible agent |

---

## 🗣️ Supported TTS Languages

Arabic · Danish · German · Greek · **English** · **Spanish** · Finnish · French · Hebrew · Hindi · Italian · Japanese · Korean · Malay · Dutch · Norwegian · Polish · **Portuguese** · Russian · Swedish · Swahili · Turkish · Chinese

---

## 📐 Output Formats

| Preset | Dimensions | Best for |
|--------|-----------|----------|
| ▮ 9:16 | 576 × 1024 | Mobile, Reels, TikTok |
| ◻ 1:1 | 512 × 512 | Social media, thumbnails |
| ▬ 16:9 | 1024 × 576 | Presentations, YouTube |
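Each preset label maps one-to-one to fixed pixel dimensions; `app.py` in this commit stores the mapping in an `ASPECT_PRESETS` dict and falls back to 1:1 for unrecognized labels. A minimal sketch (`resolve_aspect` is an illustrative helper, not part of the app):

```python
# Mirrors ASPECT_PRESETS in app.py (this commit): preset label -> (width, height)
ASPECT_PRESETS = {
    "▮ 9:16 · 576×1024": (576, 1024),
    "◻ 1:1 · 512×512": (512, 512),
    "▬ 16:9 · 1024×576": (1024, 576),
}

def resolve_aspect(label: str, default=(512, 512)) -> tuple[int, int]:
    """Return (width, height) for a preset label, falling back to 1:1
    for unknown labels, the same way generate() does."""
    return ASPECT_PRESETS.get(label, default)
```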

---

## ⚙️ Advanced Settings

| Setting | Default | Range | Description |
|---------|---------|-------|-------------|
| **Inference Steps** | 20 | 5–50 | More steps = higher quality, slower |
| **Guidance Scale** | 3.5 | 1–10 | Higher = audio followed more strictly |
| **Emotion Intensity** | 0.5 | 0–1 | Controls expressiveness of TTS voice |

---

## 🤖 MCP Server

AnimaStudio runs as an **MCP (Model Context Protocol) server**, enabling AI agents to generate talking head videos programmatically.

### Using with Claude Desktop

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "animastudio": {
      "url": "https://lulavc-animastudio.hf.space/gradio_api/mcp/sse"
    }
  }
}
```

### Tool parameters

- **portrait_image** — portrait photo (file path or base64)
- **text** — text for the avatar to speak (text mode)
- **tts_language** — language for speech synthesis (23 options)
- **voice_ref** — optional voice reference audio for cloning
- **audio_file** — audio file path (audio mode)
- **aspect_ratio** — output format (9:16, 1:1, 16:9)
- **emotion** — emotion intensity 0–1
- **num_steps** — inference steps (default 20)
- **guidance_scale** — guidance scale (default 3.5)

---

## 🎬 Video Dubbing (Phase 2)

Upload any video (up to 60 seconds) and dub it into a different language. The pipeline:

1. **Whisper Turbo** transcribes the original speech (auto-detects language)
2. **NLLB-200** translates the transcript to the target language
3. **Chatterbox TTS** synthesizes the translated speech (with optional voice cloning)
4. **ffmpeg** muxes the new audio track onto the original video
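Step 4 boils down to a single ffmpeg invocation that keeps the video stream untouched and swaps in the dubbed audio. A sketch of the command (the exact flags are an assumption; `dubbing.py` in this commit implements the real `mux_dubbed_video`, and `build_mux_cmd` here is a hypothetical helper):

```python
def build_mux_cmd(video_in: str, dubbed_wav: str, video_out: str) -> list[str]:
    """Replace the original audio track with the dubbed one, copying video frames as-is."""
    return [
        "ffmpeg", "-y", "-loglevel", "error",
        "-i", video_in,      # input 0: original video
        "-i", dubbed_wav,    # input 1: dubbed speech
        "-map", "0:v:0",     # take the video stream from input 0
        "-map", "1:a:0",     # take the audio stream from input 1
        "-c:v", "copy",      # no re-encode of video frames
        "-c:a", "aac", "-b:a", "128k",
        "-shortest",         # stop at the shorter of the two streams
        video_out,
    ]
```

Run it with `subprocess.run(build_mux_cmd(...), check=True)`; `-c:v copy` keeps muxing fast since only the audio is encoded.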
+
|
| 117 |
+
### Dubbing Settings
|
| 118 |
+
|
| 119 |
+
| Setting | Details |
|
| 120 |
+
|---------|---------|
|
| 121 |
+
| **Input Video** | Any video with speech, up to 60 seconds |
|
| 122 |
+
| **Target Language** | Any of the 23 supported languages |
|
| 123 |
+
| **Voice Reference** | Optional audio clip to clone the speaker's voice style |
|
| 124 |
+
|
| 125 |
+
> Same language as source? The pipeline skips translation and re-synthesizes the audio directly.
|
| 126 |
+
|
| 127 |
+
---
|
| 128 |
+
|
| 129 |
+
## 🔧 Technical Details
|
| 130 |
+
|
| 131 |
+
### Models
|
| 132 |
+
|
| 133 |
+
| Model | Purpose | License | VRAM |
|
| 134 |
+
|-------|---------|---------|------|
|
| 135 |
+
| [EchoMimic V3 Flash](https://huggingface.co/BadToBest/EchoMimicV3) | Talking head video generation | Apache 2.0 | ~12 GB |
|
| 136 |
+
| [Chatterbox Multilingual](https://huggingface.co/ResembleAI/chatterbox) | 23-language TTS with voice cloning | MIT | ~8 GB |
|
| 137 |
+
| [Whisper Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | Speech transcription (809M params) | MIT | ~2 GB |
|
| 138 |
+
| [NLLB-200 Distilled 600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | Text translation (23 languages) | CC-BY-NC-4.0 | API (no local GPU) |
|
| 139 |
+
|
| 140 |
+
### Architecture
|
| 141 |
+
|
| 142 |
+
```
|
| 143 |
+
┌─────────────────────────────────────────────────────────────────┐
|
| 144 |
+
│ Tab 1: Create Video │
|
| 145 |
+
│ │
|
| 146 |
+
│ Portrait Photo + Text ──→ Chatterbox TTS ──→ Audio WAV │
|
| 147 |
+
│ │ │
|
| 148 |
+
│ Portrait Photo + Audio ───────────────────────────┤ │
|
| 149 |
+
│ ▼ │
|
| 150 |
+
│ EchoMimic V3 Flash │
|
| 151 |
+
│ (lip-sync animation) │
|
| 152 |
+
│ │ │
|
| 153 |
+
│ ▼ │
|
| 154 |
+
│ MP4 Video Output │
|
| 155 |
+
├─────────────────────────────────────────────────────────────────┤
|
| 156 |
+
│ Tab 2: Dub Video │
|
| 157 |
+
│ │
|
| 158 |
+
│ Video ──→ ffmpeg (extract audio) │
|
| 159 |
+
│ │ │
|
| 160 |
+
│ ▼ │
|
| 161 |
+
│ Whisper Turbo (transcribe + detect language) │
|
| 162 |
+
│ │ │
|
| 163 |
+
│ ▼ │
|
| 164 |
+
│ NLLB-200 (translate to target language) │
|
| 165 |
+
│ │ │
|
| 166 |
+
│ ▼ │
|
| 167 |
+
│ Chatterbox TTS (synthesize translated speech) │
|
| 168 |
+
│ │ │
|
| 169 |
+
│ ▼ │
|
| 170 |
+
│ ffmpeg (mux new audio onto original video) │
|
| 171 |
+
│ │ │
|
| 172 |
+
│ ▼ │
|
| 173 |
+
│ Dubbed MP4 Output │
|
| 174 |
+
└─────────────────────────────────────────────────────────────────┘
|
| 175 |
+
```
|
| 176 |
+
|
| 177 |
+
### VRAM Management
|
| 178 |
+
|
| 179 |
+
Models run sequentially on ZeroGPU (A10G, 24 GB):
|
| 180 |
+
|
| 181 |
+
**Create Video tab:**
|
| 182 |
+
1. Chatterbox TTS → generates audio → offloads to CPU
|
| 183 |
+
2. EchoMimic V3 → generates video → offloads to CPU
|
| 184 |
+
3. `torch.cuda.empty_cache()` between stages
|
| 185 |
+
|
| 186 |
+
**Dub Video tab:**
|
| 187 |
+
1. Whisper Turbo → transcribes audio (~2 GB) → offloads to CPU
|
| 188 |
+
2. NLLB-200 → translates via HF Inference API (no local GPU)
|
| 189 |
+
3. Chatterbox TTS → synthesizes dubbed speech → offloads to CPU
|
| 190 |
+
4. `torch.cuda.empty_cache()` between stages
|
| 191 |
+
|
| 192 |
+
Peak usage never exceeds ~16 GB.
|
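The load → run → offload pattern above can be captured in one context manager. A stdlib-only sketch, assuming a model object with a `.to(device)` method; the `cleanup` callable stands in for `torch.cuda.empty_cache()` (in `app.py` the same logic is inlined in `try/finally` blocks):

```python
from contextlib import contextmanager

@contextmanager
def on_gpu(model, cleanup=lambda: None):
    """Move a model to the GPU for one pipeline stage, then always offload it,
    so the next stage has full VRAM headroom even if this one raises."""
    model.to("cuda")
    try:
        yield model
    finally:
        model.to("cpu")   # offload back to CPU
        cleanup()         # e.g. torch.cuda.empty_cache()
```

Usage would look like `with on_gpu(tts, torch.cuda.empty_cache) as m: wav = m.generate(...)`, one `with` block per stage.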

---

## 💡 Tips for Best Results

### Create Video
1. **Use a clear, front-facing portrait** — well-lit, neutral background, face filling most of the frame
2. **Keep audio under 20 seconds** — shorter = faster generation, tighter lip sync
3. **Add a voice reference** — upload a 5–15 second clip in the target language for natural voice cloning
4. **Match language to text** — select the correct TTS language to avoid accent issues
5. **Emotion 0.4–0.6** — sweet spot for natural-sounding delivery
6. **9:16 for social** — perfect for Reels, TikTok, and Stories
7. **20–30 steps** — good quality/speed trade-off for most use cases

### Dub Video
1. **Keep videos under 60 seconds** — pipeline enforces this limit for VRAM and quality
2. **Clear speech works best** — minimal background music/noise gives cleaner transcriptions
3. **Add a voice reference** — clone the original speaker's voice for a more natural dub
4. **Single-speaker videos** — the pipeline works best with one speaker at a time

---

## 🛠️ Running Locally

```bash
git clone https://huggingface.co/spaces/lulavc/AnimaStudio
cd AnimaStudio
pip install -r requirements.txt
python app.py
```

Requires a CUDA GPU with at least 16 GB VRAM. Set `HF_TOKEN` for private model access.

---

## 📄 License

- **Space code:** Apache 2.0
- **EchoMimic V3:** [Apache 2.0](https://huggingface.co/BadToBest/EchoMimicV3)
- **Chatterbox TTS:** [MIT](https://huggingface.co/ResembleAI/chatterbox)
- **Whisper Turbo:** [MIT](https://huggingface.co/openai/whisper-large-v3-turbo)
- **NLLB-200:** [CC-BY-NC-4.0](https://huggingface.co/facebook/nllb-200-distilled-600M)

---

**Space by [lulavc](https://huggingface.co/lulavc)** · Powered by [EchoMimic V3](https://huggingface.co/BadToBest/EchoMimicV3) + [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) + [Whisper Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) + [NLLB-200](https://huggingface.co/facebook/nllb-200-distilled-600M) · ZeroGPU · A10G
app.py
ADDED
@@ -0,0 +1,539 @@
import spaces
import gradio as gr
import torch
import torchaudio
import os
import gc
import sys
import shutil
import tempfile
import subprocess
import threading
import logging

import dubbing
from i18n import T, EXAMPLES, ALL_EXAMPLES_FLAT, TTS_LANGUAGES, MAX_TEXT_LEN, MAX_AUDIO_SEC
from styles import THEME, CSS

log = logging.getLogger(__name__)

# ── Config ────────────────────────────────────────────────────────────────────
ECHOMIMIC_MODEL = os.environ.get("ECHOMIMIC_MODEL", "BadToBest/EchoMimicV3")
CHATTERBOX_MODEL = os.environ.get("CHATTERBOX_MODEL", "ResembleAI/chatterbox")
MAX_DUB_TEXT_LEN = 1500  # ~60s of typical speech at 150 wpm ≈ 900 chars; 1500 is safe headroom

ASPECT_PRESETS = {
    "▮ 9:16 · 576×1024": (576, 1024),
    "◻ 1:1 · 512×512": (512, 512),
    "▬ 16:9 · 1024×576": (1024, 576),
}

DEFAULT_STEPS = 20
DEFAULT_CFG = 3.5
DEFAULT_FPS = 25

# ── Runtime repo installs (avoid PyPI conflicts) ──────────────────────────────
_ECHOMIMIC_REPO = "https://github.com/antgroup/echomimic_v3.git"
_ECHOMIMIC_DIR = "/tmp/echomimic_v3"
_CHATTERBOX_REPO = "https://github.com/resemble-ai/chatterbox.git"
_CHATTERBOX_DIR = "/tmp/chatterbox"
_clone_lock = threading.Lock()


def _clone_repo(repo_url: str, dest: str, label: str):
    """Thread-safe shallow clone. Uses .git presence to detect complete clones."""
    with _clone_lock:
        if not os.path.exists(os.path.join(dest, ".git")):
            if os.path.exists(dest):
                shutil.rmtree(dest)
            log.info("Cloning %s…", label)
            subprocess.run(
                ["git", "clone", "--depth=1", repo_url, dest],
                check=True, timeout=180,
            )
            log.info("%s cloned", label)
        if dest not in sys.path:
            sys.path.insert(0, dest)


def _ensure_echomimic_repo():
    _clone_repo(_ECHOMIMIC_REPO, _ECHOMIMIC_DIR, "EchoMimic V3")


def _ensure_chatterbox_repo():
    _clone_repo(_CHATTERBOX_REPO, _CHATTERBOX_DIR, "Chatterbox TTS")

# ── Model singletons ──────────────────────────────────────────────────────────
_tts_model = None
_echo_pipe = None
_echo_mode = None


def _load_tts():
    global _tts_model
    if _tts_model is None:
        _ensure_chatterbox_repo()
        from chatterbox.tts import ChatterboxTTS
        log.info("Loading Chatterbox TTS…")
        _tts_model = ChatterboxTTS.from_pretrained(device="cpu")
        log.info("Chatterbox TTS ready")
    return _tts_model


def _load_echomimic():
    global _echo_pipe, _echo_mode
    if _echo_pipe is not None:
        return _echo_pipe, _echo_mode

    try:
        _ensure_echomimic_repo()
        from echomimic_v3.pipelines.pipeline_echomimic_v3 import EchoMimicV3Pipeline
        log.info("Loading EchoMimic V3 (local)…")
        _echo_pipe = EchoMimicV3Pipeline.from_pretrained(ECHOMIMIC_MODEL, torch_dtype=torch.float16)
        _echo_mode = "local"
        log.info("EchoMimic V3 ready (local)")
        return _echo_pipe, _echo_mode
    except Exception as e:
        log.warning("EchoMimic V3 local import failed: %s", e)

    try:
        from diffusers import DiffusionPipeline
        log.info("Loading EchoMimic V3 via diffusers…")
        _echo_pipe = DiffusionPipeline.from_pretrained(
            ECHOMIMIC_MODEL, torch_dtype=torch.float16, trust_remote_code=True,
        )
        _echo_mode = "local"
        log.info("EchoMimic V3 ready (diffusers)")
        return _echo_pipe, _echo_mode
    except Exception as e:
        log.warning("EchoMimic V3 diffusers load failed: %s", e)

    raise RuntimeError("EchoMimic V3 could not be loaded. Check requirements and model availability.")

# ── Video utilities ───────────────────────────────────────────────────────────
def _coerce_frames(frames):
    """Normalise pipeline output to a list of (H, W, 3) uint8 numpy arrays."""
    import numpy as np
    result = []
    for frame in frames:
        if hasattr(frame, "save"):
            arr = np.array(frame.convert("RGB"))
        elif hasattr(frame, "cpu"):
            arr = frame.cpu().float().numpy()
            if arr.ndim == 3 and arr.shape[0] in (1, 3, 4):
                arr = arr.transpose(1, 2, 0)
            if arr.max() <= 1.0:
                arr = (arr * 255).clip(0, 255)
            arr = arr.astype(np.uint8)
        else:
            arr = np.array(frame)
        if arr.ndim == 2:
            import cv2
            arr = cv2.cvtColor(arr, cv2.COLOR_GRAY2RGB)
        elif arr.shape[2] == 4:
            arr = arr[:, :, :3]
        result.append(arr)
    return result


def _mux_video(frames, audio_path: str, fps: int = DEFAULT_FPS) -> str:
    """Combine frames (PIL/tensor/ndarray) + audio into an MP4 file."""
    import cv2

    coerced = _coerce_frames(frames)
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as f:
        out_path = f.name
    with tempfile.TemporaryDirectory() as tmpdir:
        for i, arr in enumerate(coerced):
            cv2.imwrite(os.path.join(tmpdir, f"{i:06d}.png"), cv2.cvtColor(arr, cv2.COLOR_RGB2BGR))
        cmd = [
            "ffmpeg", "-y", "-loglevel", "error",
            "-framerate", str(fps),
            "-i", os.path.join(tmpdir, "%06d.png"),
            "-i", audio_path,
            "-c:v", "libx264", "-preset", "fast", "-crf", "22",
            "-c:a", "aac", "-b:a", "128k",
            "-shortest", "-pix_fmt", "yuv420p",
            out_path,
        ]
        subprocess.run(cmd, check=True, timeout=120)
    return out_path


# ── TTS ───────────────────────────────────────────────────────────────────────
def _run_tts(text: str, voice_ref: str | None, emotion: float, language: str = "English") -> str:
    """Generate speech WAV. Returns temp file path."""
    model = _load_tts()
    log.info("TTS: language=%s text_len=%d emotion=%.2f", language, len(text), emotion)
    model.to("cuda")
    try:
        wav = model.generate(
            text=text.strip(),
            audio_prompt_path=voice_ref if voice_ref else None,
            exaggeration=float(emotion),
        )
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            out_path = f.name
        torchaudio.save(out_path, wav, model.sr)
        return out_path
    finally:
        model.to("cpu")
        torch.cuda.empty_cache()


# ── EchoMimic ─────────────────────────────────────────────────────────────────
def _run_echomimic(portrait_img, audio_path: str, width: int, height: int,
                   num_steps: int, guidance_scale: float) -> str:
    """Generate talking-head video. Returns MP4 file path."""
    pipe, _ = _load_echomimic()
    pipe.to("cuda")
    try:
        output = pipe(
            ref_image=portrait_img,
            audio_path=audio_path,
            width=width,
            height=height,
            num_inference_steps=num_steps,
            guidance_scale=guidance_scale,
            fps=DEFAULT_FPS,
        )
        if hasattr(output, "frames"):
            return _mux_video(output.frames[0], audio_path)
        if hasattr(output, "videos"):
            vid = output.videos[0]
            if hasattr(vid, "unbind"):
                return _mux_video(list(vid.unbind(0)), audio_path)
            return _mux_video(vid, audio_path)
        if isinstance(output, str):
            return output
        raise ValueError(f"Unexpected pipeline output type: {type(output)}")
    finally:
        pipe.to("cpu")
        torch.cuda.empty_cache()
        gc.collect()


# ── Phase 1: Generate video endpoint ─────────────────────────────────────────
@spaces.GPU(duration=180)
def generate(portrait_img, input_mode: str, text: str, tts_language: str,
             voice_ref, audio_file, aspect_ratio: str, emotion: float,
             num_steps: int, guidance_scale: float, lang: str,
             progress=gr.Progress(track_tqdm=True)):

    t = T.get(lang, T["🇺🇸 English"])
    if portrait_img is None:
        raise gr.Error(t["err_no_portrait"])

    width, height = ASPECT_PRESETS.get(aspect_ratio, (512, 512))
    _tts_tmp: str | None = None

    try:
        if input_mode == "text":
            if not text or not text.strip():
                raise gr.Error(t["err_no_text"])
            if len(text) > MAX_TEXT_LEN:
                raise gr.Error(t["err_text_long"])
            if voice_ref and not os.path.exists(voice_ref):
                voice_ref = None
            _tts_tmp = _run_tts(text, voice_ref, emotion, language=tts_language)
            audio_path = _tts_tmp
        else:
            if audio_file is None:
                raise gr.Error(t["err_no_audio"])
            info = torchaudio.info(audio_file)
            if (info.num_frames / info.sample_rate) > MAX_AUDIO_SEC:
                raise gr.Error(t["err_audio_long"])
            audio_path = audio_file

        return _run_echomimic(portrait_img, audio_path, width, height, int(num_steps), float(guidance_scale))

    except torch.cuda.OutOfMemoryError:
        raise gr.Error(t["err_oom"])
    except gr.Error:
        raise
    except Exception as e:
        raise gr.Error(f"Generation failed: {str(e)[:400]}")
    finally:
        if _tts_tmp and os.path.exists(_tts_tmp):
            try:
                os.unlink(_tts_tmp)
            except Exception:
                pass
        torch.cuda.empty_cache()
        gc.collect()


# ── Phase 2: Dubbing endpoint ─────────────────────────────────────────────────
@spaces.GPU(duration=180)
def dub_video(video_input, target_lang: str, voice_ref, emotion: float, lang: str,
              progress=gr.Progress(track_tqdm=True)):

    t = T.get(lang, T["🇺🇸 English"])
    temp_files: list[str] = []

    try:
        if video_input is None:
            raise gr.Error(t["err_no_video"])

        duration = dubbing.get_video_duration(video_input)
        if duration > dubbing.MAX_DUB_AUDIO_SEC:
            raise gr.Error(t["err_video_long"])

        progress(0.10, desc="Extracting audio…")
        audio_path = dubbing.extract_audio(video_input)
        temp_files.append(audio_path)

        progress(0.25, desc="Transcribing…")
        transcript = dubbing.transcribe(audio_path)
        dubbing._unload_whisper()

        source_display = transcript.language_display
        if source_display != target_lang:
            progress(0.45, desc="Translating…")
            try:
                translated_text = dubbing.translate(transcript.text, source_display, target_lang)
            except Exception as exc:
                raise gr.Error(f"{t['err_translate']} ({str(exc)[:200]})")
        else:
            translated_text = transcript.text

        if len(translated_text) > MAX_DUB_TEXT_LEN:
            raise gr.Error(t["err_dub_text_long"])

        progress(0.60, desc="Synthesizing speech…")
        if voice_ref and not os.path.exists(voice_ref):
            voice_ref = None
        dubbed_audio = _run_tts(translated_text, voice_ref, emotion, language=target_lang)
        temp_files.append(dubbed_audio)

        progress(0.85, desc="Combining video…")
        output_path = dubbing.mux_dubbed_video(video_input, dubbed_audio)

        status = f"✓ {source_display} → {target_lang} | {duration:.1f}s"
        return output_path, transcript.text, translated_text, status

    except torch.cuda.OutOfMemoryError:
        raise gr.Error(t["err_oom"])
    except gr.Error:
        raise
    except Exception as e:
        raise gr.Error(f"Dubbing failed: {str(e)[:400]}")
    finally:
        for fp in temp_files:
            if fp and os.path.exists(fp):
                try:
                    os.unlink(fp)
                except Exception:
                    pass
        torch.cuda.empty_cache()
        gc.collect()


# ── Language switcher ─────────────────────────────────────────────────────────
def switch_language(lang: str):
    t = T.get(lang, T["🇺🇸 English"])
    mode_choices = [(t["mode_text"], "text"), (t["mode_audio"], "audio")]
    # 26 outputs — must match _lang_out list order below
    return (
        # Phase 1 (16)
        gr.update(label=t["portrait_label"], info=t["portrait_info"]),
        gr.update(label=t["input_mode_label"], choices=mode_choices, value="text"),
        gr.update(label=t["text_label"], placeholder=t["text_ph"]),
        gr.update(label=t["tts_lang_label"]),
        gr.update(label=t["voice_ref_label"], info=t["voice_ref_info"]),
        gr.update(label=t["emotion_label"], info=t["emotion_info"]),
        gr.update(label=t["audio_label"], info=t["audio_info"]),
        gr.update(label=t["aspect_label"]),
        gr.update(label=t["advanced"]),
        gr.update(label=t["steps_label"], info=t["steps_info"]),
        gr.update(label=t["guidance_label"], info=t["guidance_info"]),
        gr.update(value=t["generate"]),
        gr.update(value=t["examples_header"]),
        gr.update(visible=True),   # text_group
        gr.update(visible=False),  # audio_group
        gr.update(label=t["output_label"]),
        # Phase 2 (10)
        gr.update(label=t["dub_video_label"], info=t["dub_video_info"]),
        gr.update(label=t["dub_target_label"]),
        gr.update(label=t["dub_voice_label"], info=t["dub_voice_info"]),
        gr.update(label=t["dub_emotion_label"]),
        gr.update(value=t["dub_btn"]),
        gr.update(label=t["dub_output_label"]),
        gr.update(label=t["dub_transcript"]),
        gr.update(label=t["dub_translation"]),
        gr.update(label=t["dub_status"]),
        gr.update(label=t["dub_details"]),
    )


def _toggle_input_mode(mode: str, _lang: str):
    is_text = (mode == "text")
    return gr.update(visible=is_text), gr.update(visible=not is_text)

# ── Interface ─────────────────────────────────────────────────────────────────
with gr.Blocks(title="AnimaStudio 🎬") as demo:

    gr.HTML("""
    <div class="as-header">
      <h1>🎬 AnimaStudio</h1>
      <p class="tagline">AI Talking Head Video Creator & Video Dubbing Studio</p>
      <div class="badges">
        <span class="badge badge-purple">🎭 Lip Sync</span>
        <span class="badge badge-pink">🗣️ 23 TTS Languages</span>
        <span class="badge badge-cyan">🎙️ Voice Cloning</span>
        <span class="badge badge-teal">🎙️ Video Dubbing</span>
        <span class="badge">⚡ EchoMimic V3</span>
        <span class="badge badge-gold">🌐 EN · PT-BR · ES · AR</span>
        <span class="badge">🤖 MCP Server</span>
      </div>
    </div>
    """)

    lang_selector = gr.Radio(
        choices=list(T.keys()),
        value="🇺🇸 English",
        label=None,
        container=False,
        elem_id="lang-selector",
    )

    with gr.Tabs():

        # ══ Tab 1: Create Video ════════════════════════════════════════════════
        with gr.Tab("🎬 Create Video", id="tab-create"):
            with gr.Row(equal_height=False):
                with gr.Column(scale=1, min_width=360):
                    portrait = gr.Image(
                        label="Portrait Photo",
                        info="Upload a clear, front-facing face photo",
                        type="pil",
                        sources=["upload", "webcam"],
                    )
                    input_mode = gr.Radio(
                        choices=[(T["🇺🇸 English"]["mode_text"], "text"),
                                 (T["🇺🇸 English"]["mode_audio"], "audio")],
                        value="text",
                        label="Audio Input",
                    )
                    with gr.Group(visible=True) as text_group:
                        text_input = gr.Textbox(
                            label="Text",
                            placeholder="Type what you want the avatar to say...",
                            lines=4, max_lines=10,
                        )
                        tts_language = gr.Dropdown(choices=TTS_LANGUAGES, value="English", label="Speech Language")
                        with gr.Row():
                            voice_ref = gr.Audio(
                                label="Voice Reference",
                                info="Optional: upload audio to clone the voice style",
                                type="filepath", sources=["upload"],
                            )
                            emotion = gr.Slider(0.0, 1.0, value=0.5, step=0.05,
                                                label="Emotion Intensity", info="0 = neutral · 1 = very expressive")
                    with gr.Group(visible=False) as audio_group:
                        audio_upload = gr.Audio(
                            label="Audio File",
                            info="Upload WAV, MP3, or FLAC · max 30 seconds",
                            type="filepath", sources=["upload", "microphone"],
                        )
                    aspect_ratio = gr.Dropdown(choices=list(ASPECT_PRESETS.keys()),
|
| 443 |
+
value="◻ 1:1 · 512×512", label="Format")
|
| 444 |
+
with gr.Accordion("⚙️ Advanced Settings", open=False) as adv_acc:
|
| 445 |
+
num_steps = gr.Slider(5, 50, value=DEFAULT_STEPS, step=1,
|
| 446 |
+
label="Inference Steps", info="More steps = higher quality, slower")
|
| 447 |
+
guidance_scale = gr.Slider(1.0, 10.0, value=DEFAULT_CFG, step=0.5,
|
| 448 |
+
label="Guidance Scale", info="Higher = follows audio more strictly")
|
| 449 |
+
gen_btn = gr.Button("🎬 Generate Video", variant="primary", elem_id="gen-btn", size="lg")
|
| 450 |
+
examples_header = gr.Markdown("### 💡 Try These Examples")
|
| 451 |
+
gr.Examples(examples=ALL_EXAMPLES_FLAT, inputs=[text_input, tts_language, emotion], label=None)
|
| 452 |
+
|
| 453 |
+
with gr.Column(scale=1, min_width=440):
|
| 454 |
+
output_video = gr.Video(label="Generated Video", format="mp4", autoplay=True,
|
| 455 |
+
height=640, elem_id="output-video", show_download_button=True)
|
| 456 |
+
|
| 457 |
+
# ══ Tab 2: Dub Video ═══════════════════════════════════════════════════
|
| 458 |
+
with gr.Tab("🎙️ Dub Video", id="tab-dub"):
|
| 459 |
+
with gr.Row(equal_height=False):
|
| 460 |
+
with gr.Column(scale=1, min_width=360):
|
| 461 |
+
dub_video_input = gr.Video(label="Input Video",
|
| 462 |
+
info="Upload a video to dub (max 60 seconds)",
|
| 463 |
+
sources=["upload"])
|
| 464 |
+
dub_target_lang = gr.Dropdown(choices=TTS_LANGUAGES, value="English", label="Target Language")
|
| 465 |
+
dub_voice_ref = gr.Audio(label="Voice Reference",
|
| 466 |
+
info="Optional: upload audio to clone voice style for dubbing",
|
| 467 |
+
type="filepath", sources=["upload"])
|
| 468 |
+
dub_emotion = gr.Slider(0.0, 1.0, value=0.5, step=0.05, label="Emotion Intensity")
|
| 469 |
+
dub_btn = gr.Button("🎙️ Dub Video", variant="primary", elem_id="dub-btn", size="lg")
|
| 470 |
+
gr.HTML("""
|
| 471 |
+
<div style="color:#94a3b8;font-size:0.82rem;margin-top:0.5rem;padding:0.75rem;
|
| 472 |
+
background:rgba(6,182,212,0.05);border-radius:0.5rem;
|
| 473 |
+
border:1px solid rgba(6,182,212,0.15);">
|
| 474 |
+
<strong>How it works:</strong> Whisper transcribes → NLLB-200 translates →
|
| 475 |
+
Chatterbox TTS synthesizes → audio replaces original track.
|
| 476 |
+
</div>
|
| 477 |
+
""")
|
| 478 |
+
|
| 479 |
+
with gr.Column(scale=1, min_width=440):
|
| 480 |
+
dub_output_video = gr.Video(label="Dubbed Video", format="mp4", autoplay=True,
|
| 481 |
+
height=480, elem_id="dub-output-video", show_download_button=True)
|
| 482 |
+
with gr.Accordion("Details", open=False) as dub_details_acc:
|
| 483 |
+
dub_transcript_box = gr.Textbox(label="Detected Transcript", interactive=False, lines=4)
|
| 484 |
+
dub_translation_box = gr.Textbox(label="Translation", interactive=False, lines=4)
|
| 485 |
+
dub_status_box = gr.Textbox(label="Status", interactive=False, lines=2)
|
| 486 |
+
|
| 487 |
+
gr.HTML("""
|
| 488 |
+
<div class="as-footer">
|
| 489 |
+
<strong>Models:</strong>
|
| 490 |
+
<a href="https://huggingface.co/BadToBest/EchoMimicV3" target="_blank">EchoMimic V3</a>
|
| 491 |
+
(Apache 2.0) ·
|
| 492 |
+
<a href="https://huggingface.co/ResembleAI/chatterbox" target="_blank">Chatterbox TTS</a>
|
| 493 |
+
(MIT) ·
|
| 494 |
+
<a href="https://huggingface.co/openai/whisper-large-v3-turbo" target="_blank">Whisper Turbo</a>
|
| 495 |
+
(MIT) ·
|
| 496 |
+
<a href="https://huggingface.co/facebook/nllb-200-distilled-600M" target="_blank">NLLB-200</a>
|
| 497 |
+
(CC-BY-NC) ·
|
| 498 |
+
<strong>Space by:</strong>
|
| 499 |
+
<a href="https://huggingface.co/lulavc" target="_blank">lulavc</a>
|
| 500 |
+
· ZeroGPU · A10G
|
| 501 |
+
</div>
|
| 502 |
+
""")
|
| 503 |
+
|
| 504 |
+
# ── Events ────────────────────────────────────────────────────────────────
|
| 505 |
+
gen_btn.click(
|
| 506 |
+
generate,
|
| 507 |
+
inputs=[portrait, input_mode, text_input, tts_language,
|
| 508 |
+
voice_ref, audio_upload, aspect_ratio, emotion,
|
| 509 |
+
num_steps, guidance_scale, lang_selector],
|
| 510 |
+
outputs=output_video,
|
| 511 |
+
)
|
| 512 |
+
|
| 513 |
+
input_mode.change(_toggle_input_mode, inputs=[input_mode, lang_selector],
|
| 514 |
+
outputs=[text_group, audio_group])
|
| 515 |
+
|
| 516 |
+
dub_btn.click(
|
| 517 |
+
dub_video,
|
| 518 |
+
inputs=[dub_video_input, dub_target_lang, dub_voice_ref, dub_emotion, lang_selector],
|
| 519 |
+
outputs=[dub_output_video, dub_transcript_box, dub_translation_box, dub_status_box],
|
| 520 |
+
)
|
| 521 |
+
|
| 522 |
+
# Language switcher — 26 outputs, must match switch_language() return tuple order
|
| 523 |
+
_lang_out = [
|
| 524 |
+
# Phase 1 (16)
|
| 525 |
+
portrait, input_mode, text_input, tts_language,
|
| 526 |
+
voice_ref, emotion, audio_upload, aspect_ratio,
|
| 527 |
+
adv_acc, num_steps, guidance_scale, gen_btn, examples_header,
|
| 528 |
+
text_group, audio_group, output_video,
|
| 529 |
+
# Phase 2 (10)
|
| 530 |
+
dub_video_input, dub_target_lang, dub_voice_ref,
|
| 531 |
+
dub_emotion, dub_btn, dub_output_video,
|
| 532 |
+
dub_transcript_box, dub_translation_box,
|
| 533 |
+
dub_status_box, dub_details_acc,
|
| 534 |
+
]
|
| 535 |
+
lang_selector.change(switch_language, inputs=lang_selector, outputs=_lang_out)
|
| 536 |
+
|
| 537 |
+
|
| 538 |
+
if __name__ == "__main__":
|
| 539 |
+
demo.launch(theme=THEME, css=CSS, mcp_server=True)
|
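`switch_language` must return exactly as many `gr.update` objects as `_lang_out` wires up (16 Phase-1 components + 10 Phase-2 components = 26), and a drift between the two fails only at runtime when a user flips the language. A stdlib-only sketch of a cheap startup guard (the function name and the hard-coded count are illustrative, not part of the app):

```python
def check_output_contract(n_updates: int, output_components: list) -> None:
    """Fail fast when a handler's return arity drifts from its wired outputs."""
    if n_updates != len(output_components):
        raise ValueError(
            f"handler returns {n_updates} updates but "
            f"{len(output_components)} outputs are wired"
        )

# 16 Phase-1 components + 10 Phase-2 components, mirroring _lang_out above
check_output_contract(26, list(range(16 + 10)))
```

Calling this once at build time turns a silent UI desync into an immediate, descriptive exception.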
dubbing.py
ADDED
@@ -0,0 +1,188 @@
"""Video dubbing pipeline: extract audio → transcribe → translate → TTS → mux."""

import gc
import logging
import os
import subprocess
import tempfile
import time
from dataclasses import dataclass
from typing import Optional

import torch
from huggingface_hub import InferenceClient

from lang_codes import get_nllb_code, whisper_code_to_display

log = logging.getLogger(__name__)

# ── Constants ────────────────────────────────────────────────────────────────
MAX_DUB_AUDIO_SEC = 60
WHISPER_MODEL_SIZE = "turbo"  # ~809M params, ~2GB VRAM on A10G
NLLB_MODEL_ID = "facebook/nllb-200-distilled-600M"

# ── Singleton ─────────────────────────────────────────────────────────────────
_whisper_model = None


@dataclass(frozen=True)
class TranscriptionResult:
    text: str
    language: str          # Whisper ISO 639-1 code (e.g. "en")
    language_display: str  # Display name matching TTS_LANGUAGES (e.g. "English")
    segments: tuple        # tuple of {"start", "end", "text"} dicts


# ── Whisper lifecycle ─────────────────────────────────────────────────────────

def _load_whisper():
    global _whisper_model
    if _whisper_model is None:
        import whisper
        log.info("Loading Whisper %s…", WHISPER_MODEL_SIZE)
        _whisper_model = whisper.load_model(WHISPER_MODEL_SIZE, device="cpu")
        log.info("Whisper %s ready", WHISPER_MODEL_SIZE)
    return _whisper_model


def _unload_whisper():
    global _whisper_model
    if _whisper_model is not None:
        _whisper_model.to("cpu")
        del _whisper_model
        _whisper_model = None
        torch.cuda.empty_cache()
        gc.collect()
        log.info("Whisper unloaded")

# ── Step 1: Extract audio ─────────────────────────────────────────────────────

def extract_audio(video_path: str) -> str:
    """Extract audio from video as 16kHz mono WAV. Returns temp file path."""
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        out_path = f.name
    cmd = [
        "ffmpeg", "-y", "-loglevel", "error",
        "-i", video_path,
        "-vn", "-acodec", "pcm_s16le",
        "-ar", "16000", "-ac", "1",
        out_path,
    ]
    subprocess.run(cmd, check=True, timeout=60)
    return out_path


# ── Step 2: Transcribe (GPU) ──────────────────────────────────────────────────

def transcribe(audio_path: str) -> TranscriptionResult:
    """Transcribe audio with Whisper turbo. Moves model to CUDA, then back to CPU."""
    model = _load_whisper()
    model.to("cuda")
    try:
        result = model.transcribe(audio_path, task="transcribe", fp16=True)
        detected_lang = result.get("language", "en")
        display_name = whisper_code_to_display(detected_lang) or "English"
        segments = tuple(
            {"start": s["start"], "end": s["end"], "text": s["text"]}
            for s in result.get("segments", [])
        )
        return TranscriptionResult(
            text=result["text"].strip(),
            language=detected_lang,
            language_display=display_name,
            segments=segments,
        )
    finally:
        model.to("cpu")
        torch.cuda.empty_cache()
        gc.collect()


# ── Step 3: Translate via HF Inference API (no GPU) ──────────────────────────

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Translate text using NLLB-200 via the HF Inference API.

    Args:
        text: Source text.
        source_lang: Display name, e.g. "English".
        target_lang: Display name, e.g. "Portuguese".

    Returns:
        Translated text string.
    """
    if source_lang == target_lang:
        return text

    src_code = get_nllb_code(source_lang)
    tgt_code = get_nllb_code(target_lang)

    # Client instantiated once outside the retry loop
    client = InferenceClient()
    last_exc: Optional[Exception] = None
    for attempt in range(3):
        try:
            result = client.translation(
                text,
                model=NLLB_MODEL_ID,
                src_lang=src_code,
                tgt_lang=tgt_code,
            )
            if isinstance(result, str):
                return result
            if isinstance(result, dict):
                return result.get("translation_text") or result.get("generated_text") or str(result)
            return str(result)
        except Exception as exc:
            last_exc = exc
            log.warning("Translation attempt %d failed: %s", attempt + 1, exc)
            time.sleep(2 ** attempt)

    raise RuntimeError(f"Translation failed after 3 attempts: {last_exc}") from last_exc

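The retry loop in `translate` is a standard exponential-backoff pattern (1 s, 2 s, 4 s between attempts; note it also sleeps after the final failure before raising). The same logic, factored into a reusable helper (a sketch, not part of the module):

```python
import time


def retry_with_backoff(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure, sleep base_delay * 2**attempt and try again."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            time.sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ...
    raise RuntimeError(f"failed after {attempts} attempts: {last_exc}") from last_exc
```

Passing `base_delay=0.0` makes the helper easy to exercise in tests without real waits.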
# ── Step 4: Mux dubbed audio onto original video ──────────────────────────────

def mux_dubbed_video(video_path: str, audio_path: str) -> str:
    """Replace video audio with dubbed audio track. Returns output MP4 path.

    Cleans up the output file if ffmpeg fails (no partial file leak).
    """
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as f:
        out_path = f.name
    cmd = [
        "ffmpeg", "-y", "-loglevel", "error",
        "-i", video_path,
        "-i", audio_path,
        "-c:v", "copy",
        "-c:a", "aac", "-b:a", "128k",
        "-map", "0:v:0", "-map", "1:a:0",
        "-shortest",
        out_path,
    ]
    try:
        subprocess.run(cmd, check=True, timeout=120)
        return out_path
    except Exception:
        # Clean up partial output file on ffmpeg failure
        if os.path.exists(out_path):
            try:
                os.unlink(out_path)
            except OSError:
                pass
        raise


# ── Utility ───────────────────────────────────────────────────────────────────

def get_video_duration(video_path: str) -> float:
    """Return video duration in seconds using ffprobe."""
    cmd = [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        video_path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=30)
    return float(result.stdout.strip())
i18n.py
ADDED
@@ -0,0 +1,271 @@
"""Internationalization: TTS language list, example texts, and UI translations."""

MAX_TEXT_LEN = 500
MAX_AUDIO_SEC = 30

TTS_LANGUAGES = [
    "Arabic", "Danish", "German", "Greek", "English",
    "Spanish", "Finnish", "French", "Hebrew", "Hindi",
    "Italian", "Japanese", "Korean", "Malay", "Dutch",
    "Norwegian", "Polish", "Portuguese", "Russian", "Swedish",
    "Swahili", "Turkish", "Chinese",
]

# ── Examples per UI language ──────────────────────────────────────────────────
# Format: [text, tts_language, emotion]
EXAMPLES = {
    "🇺🇸 English": [
        ["Hello! Welcome to this presentation. Today I'll be sharing some exciting insights about artificial intelligence and how it's changing the world around us.", "English", 0.5],
        ["I'm thrilled to announce the launch of our new project. After months of hard work and dedication, we've created something truly special I can't wait to share with you.", "English", 0.7],
        ["Good morning, students! Today's lecture covers neural networks — the backbone of modern AI. By the end of this session you'll understand how machines learn from data.", "English", 0.4],
        ["Breaking news: scientists have discovered a new method for sustainable energy production that could revolutionise how we power our cities. More details coming up.", "English", 0.6],
    ],
    "🇧🇷 Português": [
        ["Olá a todos! Sejam bem-vindos a esta apresentação. Hoje vou compartilhar com vocês algumas descobertas incríveis sobre inteligência artificial e como ela está transformando o nosso mundo.", "Portuguese", 0.5],
        ["Estou muito animado para anunciar o lançamento do nosso novo projeto. Depois de meses de trabalho dedicado, criamos algo verdadeiramente especial que mal posso esperar para mostrar a vocês.", "Portuguese", 0.7],
        ["Bom dia, estudantes! A aula de hoje aborda redes neurais — a base da IA moderna. Ao final desta sessão, vocês vão entender como as máquinas aprendem com os dados.", "Portuguese", 0.4],
        ["Esta receita tradicional foi passada de geração em geração na minha família. Hoje vou ensinar como preparar um prato delicioso que vai impressionar todos os seus convidados.", "Portuguese", 0.6],
    ],
    "🇪🇸 Español": [
        ["¡Hola a todos! Bienvenidos a esta presentación. Hoy voy a compartir con ustedes algunos descubrimientos fascinantes sobre la inteligencia artificial y cómo está transformando nuestro mundo.", "Spanish", 0.5],
        ["Estoy muy emocionado de anunciar el lanzamiento de nuestro nuevo proyecto. Después de meses de arduo trabajo hemos creado algo verdaderamente especial que no puedo esperar para mostrarles.", "Spanish", 0.7],
        ["Buenos días, estudiantes. La clase de hoy trata sobre las redes neuronales, la columna vertebral de la IA moderna. Al final de esta sesión entenderán cómo las máquinas aprenden de los datos.", "Spanish", 0.4],
        ["Esta receta tradicional ha pasado de generación en generación en mi familia. Hoy les enseñaré cómo preparar un plato delicioso que impresionará a todos sus invitados.", "Spanish", 0.6],
    ],
    "🇪🇬 عربي": [
        ["مرحباً بالجميع! أهلاً وسهلاً بكم في هذا العرض التقديمي. اليوم سأشارككم بعض الاكتشافات المثيرة حول الذكاء الاصطناعي وكيف يُغيّر عالمنا من حولنا.", "Arabic", 0.5],
        ["يسعدني الإعلان عن إطلاق مشروعنا الجديد. بعد أشهر من العمل الدؤوب أبدعنا شيئاً رائعاً حقاً لا أصبر على مشاركته معكم جميعاً.", "Arabic", 0.7],
        ["صباح الخير أيها الطلاب! محاضرة اليوم تتناول الشبكات العصبية التي تمثّل الأساس التقني للذكاء الاصطناعي الحديث. بنهاية هذه الجلسة ستفهمون كيف تتعلم الآلات من البيانات.", "Arabic", 0.4],
        ["هذه الوصفة التقليدية انتقلت من جيل إلى جيل في عائلتي. اليوم سأعلّمكم كيفية تحضير طبق شهي سيُبهر جميع ضيوفكم ويجعلهم يطلبون المزيد.", "Arabic", 0.6],
    ],
}

ALL_EXAMPLES_FLAT = [ex for exs in EXAMPLES.values() for ex in exs]

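Lookups into per-language tables like the `T` dict below are safest with an English fallback, so a key missing from one translation never raises at render time. A self-contained sketch of such a lookup (`MINI_T` and `tr` are illustrative stand-ins, not part of this module):

```python
# Minimal stand-in for the full T table; the real keys live in i18n.py.
MINI_T = {
    "🇺🇸 English": {"generate": "🎬 Generate Video"},
    "🇧🇷 Português": {"generate": "🎬 Gerar Vídeo"},
}


def tr(lang: str, key: str, table=MINI_T) -> str:
    """Return table[lang][key], falling back to English, then to the key itself."""
    default = table["🇺🇸 English"]
    return table.get(lang, default).get(key, default.get(key, key))
```

With this shape, an unsupported UI language degrades to English and a missing key degrades to its own name instead of a `KeyError`.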
| 45 |
+
# ── UI translations ────────────────────────────────────────────────────────────
|
| 46 |
+
T: dict[str, dict[str, str]] = {
|
| 47 |
+
"🇺🇸 English": {
|
| 48 |
+
# Phase 1: Create Video
|
| 49 |
+
"tab_create": "🎬 Create Video",
|
| 50 |
+
"tagline": "AI Talking Head Video Creator",
|
| 51 |
+
"input_mode_label": "Audio Input",
|
| 52 |
+
"mode_text": "Text to Speech",
|
| 53 |
+
"mode_audio": "Upload Audio",
|
| 54 |
+
"portrait_label": "Portrait Photo",
|
| 55 |
+
"portrait_info": "Upload a clear, front-facing face photo",
|
| 56 |
+
"text_label": "Text",
|
| 57 |
+
"text_ph": "Type what you want the avatar to say...",
|
| 58 |
+
"tts_lang_label": "Speech Language",
|
| 59 |
+
"voice_ref_label": "Voice Reference",
|
| 60 |
+
"voice_ref_info": "Optional: upload audio to clone the voice style",
|
| 61 |
+
"emotion_label": "Emotion Intensity",
|
| 62 |
+
"emotion_info": "0 = neutral · 1 = very expressive",
|
| 63 |
+
"audio_label": "Audio File",
|
| 64 |
+
"audio_info": "Upload WAV, MP3, or FLAC · max 30 seconds",
|
| 65 |
+
"aspect_label": "Format",
|
| 66 |
+
"advanced": "⚙️ Advanced Settings",
|
| 67 |
+
"steps_label": "Inference Steps",
|
| 68 |
+
"steps_info": "More steps = higher quality, slower",
|
| 69 |
+
"guidance_label": "Guidance Scale",
|
| 70 |
+
"guidance_info": "Higher = follows audio more strictly",
|
| 71 |
+
"generate": "🎬 Generate Video",
|
| 72 |
+
"output_label": "Generated Video",
|
| 73 |
+
"examples_header": "### 💡 Try These Examples",
|
| 74 |
+
"err_no_portrait": "Please upload a portrait photo.",
|
| 75 |
+
"err_no_text": "Please enter some text.",
|
| 76 |
+
"err_no_audio": "Please upload an audio file.",
|
| 77 |
+
"err_text_long": f"Text too long (max {MAX_TEXT_LEN} characters).",
|
| 78 |
+
"err_audio_long": f"Audio too long (max {MAX_AUDIO_SEC} seconds).",
|
| 79 |
+
"err_oom": "GPU out of memory. Try a smaller format or fewer steps.",
|
| 80 |
+
"err_no_face": "No face detected. Please upload a clear front-facing portrait.",
|
| 81 |
+
"err_model": "Model not loaded. Please refresh and try again.",
|
| 82 |
+
# Phase 2: Dub Video
|
| 83 |
+
"tab_dub": "🎙️ Dub Video",
|
| 84 |
+
"dub_tagline": "Dub any video into 23 languages",
|
| 85 |
+
"dub_video_label": "Input Video",
|
| 86 |
+
"dub_video_info": "Upload a video to dub (max 60 seconds)",
|
| 87 |
+
"dub_target_label": "Target Language",
|
| 88 |
+
"dub_voice_label": "Voice Reference",
|
| 89 |
+
"dub_voice_info": "Optional: upload audio to clone voice style for dubbing",
|
| 90 |
+
"dub_emotion_label": "Emotion Intensity",
|
| 91 |
+
"dub_btn": "🎙️ Dub Video",
|
| 92 |
+
"dub_output_label": "Dubbed Video",
|
| 93 |
+
"dub_transcript": "Detected Transcript",
|
| 94 |
+
"dub_translation": "Translation",
|
| 95 |
+
"dub_status": "Status",
|
| 96 |
+
"dub_details": "Details",
|
| 97 |
+
"err_no_video": "Please upload a video.",
|
| 98 |
+
"err_video_long": "Video too long (max 60 seconds).",
|
| 99 |
+
"err_translate": "Translation failed. Please try again.",
|
| 100 |
+
"err_transcribe": "Transcription failed. Please try again.",
|
| 101 |
+
"err_dub_text_long": "Transcription too long to synthesize. Please use a shorter video.",
|
| 102 |
+
},
|
| 103 |
+
"🇧🇷 Português": {
|
| 104 |
+
# Phase 1: Create Video
|
| 105 |
+
"tab_create": "🎬 Criar Vídeo",
|
| 106 |
+
"tagline": "Criador de Vídeo Avatar com IA",
|
| 107 |
+
"input_mode_label": "Entrada de Áudio",
|
| 108 |
+
"mode_text": "Texto para Fala",
|
| 109 |
+
"mode_audio": "Enviar Áudio",
|
| 110 |
+
"portrait_label": "Foto Retrato",
|
| 111 |
+
"portrait_info": "Envie uma foto frontal clara do rosto",
|
| 112 |
+
"text_label": "Texto",
|
| 113 |
+
"text_ph": "Digite o que você quer que o avatar diga...",
|
| 114 |
+
"tts_lang_label": "Idioma da Fala",
|
| 115 |
+
"voice_ref_label": "Referência de Voz",
|
| 116 |
+
"voice_ref_info": "Opcional: envie um áudio para clonar o estilo de voz",
|
| 117 |
+
"emotion_label": "Intensidade da Emoção",
|
| 118 |
+
"emotion_info": "0 = neutro · 1 = muito expressivo",
|
| 119 |
+
"audio_label": "Arquivo de Áudio",
|
| 120 |
+
"audio_info": "Envie WAV, MP3 ou FLAC · máx. 30 segundos",
|
| 121 |
+
"aspect_label": "Formato",
|
| 122 |
+
"advanced": "⚙️ Configurações Avançadas",
|
| 123 |
+
"steps_label": "Etapas de Inferência",
|
| 124 |
+
"steps_info": "Mais etapas = maior qualidade, mais lento",
|
| 125 |
+
"guidance_label": "Escala de Orientação",
|
| 126 |
+
"guidance_info": "Maior = segue o áudio com mais precisão",
|
| 127 |
+
"generate": "🎬 Gerar Vídeo",
|
| 128 |
+
"output_label": "Vídeo Gerado",
|
| 129 |
+
"examples_header": "### 💡 Experimente Estes Exemplos",
|
| 130 |
+
"err_no_portrait": "Por favor, envie uma foto retrato.",
|
| 131 |
+
"err_no_text": "Por favor, insira algum texto.",
|
| 132 |
+
"err_no_audio": "Por favor, envie um arquivo de áudio.",
|
| 133 |
+
"err_text_long": f"Texto muito longo (máx. {MAX_TEXT_LEN} caracteres).",
|
| 134 |
+
"err_audio_long": f"Áudio muito longo (máx. {MAX_AUDIO_SEC} segundos).",
|
| 135 |
+
"err_oom": "GPU sem memória. Tente um formato menor ou menos etapas.",
|
| 136 |
+
"err_no_face": "Nenhum rosto detectado. Envie uma foto retrato frontal clara.",
|
| 137 |
+
"err_model": "Modelo não carregado. Atualize a página e tente novamente.",
|
| 138 |
+
# Phase 2: Dub Video
|
| 139 |
+
"tab_dub": "🎙️ Dublar Vídeo",
|
| 140 |
+
"dub_tagline": "Duble qualquer vídeo em 23 idiomas",
|
| 141 |
+
"dub_video_label": "Vídeo de Entrada",
|
| 142 |
+
"dub_video_info": "Envie um vídeo para dublar (máx. 60 segundos)",
|
| 143 |
+
"dub_target_label": "Idioma de Destino",
|
| 144 |
+
"dub_voice_label": "Referência de Voz",
|
| 145 |
+
"dub_voice_info": "Opcional: envie áudio para clonar o estilo de voz na dublagem",
|
| 146 |
+
"dub_emotion_label": "Intensidade da Emoção",
|
| 147 |
+
"dub_btn": "🎙️ Dublar Vídeo",
|
| 148 |
+
"dub_output_label": "Vídeo Dublado",
|
| 149 |
+
"dub_transcript": "Transcrição Detectada",
|
| 150 |
+
"dub_translation": "Tradução",
|
| 151 |
+
"dub_status": "Status",
|
| 152 |
+
"dub_details": "Detalhes",
|
| 153 |
+
"err_no_video": "Por favor, envie um vídeo.",
|
| 154 |
+
"err_video_long": "Vídeo muito longo (máx. 60 segundos).",
|
| 155 |
+
"err_translate": "Tradução falhou. Por favor, tente novamente.",
|
| 156 |
+
"err_transcribe": "Transcrição falhou. Por favor, tente novamente.",
|
| 157 |
+
"err_dub_text_long": "Transcrição longa demais para sintetizar. Use um vídeo mais curto.",
|
| 158 |
+
},
|
| 159 |
+
"🇪🇸 Español": {
|
| 160 |
+
# Phase 1: Create Video
|
| 161 |
+
"tab_create": "🎬 Crear Vídeo",
|
| 162 |
+
"tagline": "Creador de Vídeo Avatar con IA",
|
| 163 |
+
"input_mode_label": "Entrada de Audio",
|
| 164 |
+
"mode_text": "Texto a Voz",
|
| 165 |
+
"mode_audio": "Subir Audio",
|
| 166 |
+
"portrait_label": "Foto Retrato",
|
| 167 |
+
"portrait_info": "Sube una foto frontal clara del rostro",
|
| 168 |
+
"text_label": "Texto",
|
| 169 |
+
"text_ph": "Escribe lo que quieres que diga el avatar...",
|
| 170 |
+
"tts_lang_label": "Idioma del Habla",
|
| 171 |
+
"voice_ref_label": "Referencia de Voz",
|
| 172 |
+
"voice_ref_info": "Opcional: sube un audio para clonar el estilo de voz",
|
| 173 |
+
"emotion_label": "Intensidad Emocional",
|
| 174 |
+
"emotion_info": "0 = neutro · 1 = muy expresivo",
|
| 175 |
+
"audio_label": "Archivo de Audio",
|
| 176 |
+
"audio_info": "Sube WAV, MP3 o FLAC · máx. 30 segundos",
|
| 177 |
+
"aspect_label": "Formato",
|
| 178 |
+
"advanced": "⚙️ Configuración Avanzada",
|
| 179 |
+
"steps_label": "Pasos de Inferencia",
|
| 180 |
+
"steps_info": "Más pasos = mayor calidad, más lento",
|
| 181 |
+
"guidance_label": "Escala de Guía",
|
| 182 |
+
"guidance_info": "Mayor = sigue el audio con más precisión",
|
| 183 |
+
"generate": "🎬 Generar Vídeo",
|
| 184 |
+
"output_label": "Vídeo Generado",
|
| 185 |
+
"examples_header": "### 💡 Prueba Estos Ejemplos",
|
| 186 |
+
"err_no_portrait": "Por favor, sube una foto retrato.",
|
| 187 |
+
"err_no_text": "Por favor, ingresa algún texto.",
|
| 188 |
+
"err_no_audio": "Por favor, sube un archivo de audio.",
|
| 189 |
+
"err_text_long": f"Texto demasiado largo (máx. {MAX_TEXT_LEN} caracteres).",
|
| 190 |
+
"err_audio_long": f"Audio demasiado largo (máx. {MAX_AUDIO_SEC} segundos).",
|
| 191 |
+
"err_oom": "Sin memoria GPU. Prueba un formato menor o menos pasos.",
|
| 192 |
+
"err_no_face": "No se detectó rostro. Sube una foto retrato frontal clara.",
|
        "err_model": "Modelo no cargado. Recarga la página e intenta de nuevo.",
        # Phase 2: Dub Video
        "tab_dub": "🎙️ Doblar Vídeo",
        "dub_tagline": "Dobla cualquier vídeo a 23 idiomas",
        "dub_video_label": "Vídeo de Entrada",
        "dub_video_info": "Sube un vídeo para doblar (máx. 60 segundos)",
        "dub_target_label": "Idioma de Destino",
        "dub_voice_label": "Referencia de Voz",
        "dub_voice_info": "Opcional: sube audio para clonar el estilo de voz en el doblaje",
        "dub_emotion_label": "Intensidad Emocional",
        "dub_btn": "🎙️ Doblar Vídeo",
        "dub_output_label": "Vídeo Doblado",
        "dub_transcript": "Transcripción Detectada",
        "dub_translation": "Traducción",
        "dub_status": "Estado",
        "dub_details": "Detalles",
        "err_no_video": "Por favor, sube un vídeo.",
        "err_video_long": "Vídeo demasiado largo (máx. 60 segundos).",
        "err_translate": "Traducción fallida. Por favor, inténtalo de nuevo.",
        "err_transcribe": "Transcripción fallida. Por favor, inténtalo de nuevo.",
        "err_dub_text_long": "Transcripción demasiado larga. Usa un vídeo más corto.",
    },
    "🇪🇬 عربي": {
        # Phase 1: Create Video
        "tab_create": "🎬 إنشاء فيديو",
        "tagline": "منشئ فيديو الأفاتار بالذكاء الاصطناعي",
        "input_mode_label": "مدخل الصوت",
        "mode_text": "نص إلى كلام",
        "mode_audio": "رفع ملف صوتي",
        "portrait_label": "صورة الوجه",
        "portrait_info": "ارفع صورة واضحة للوجه من الأمام",
        "text_label": "النص",
        "text_ph": "اكتب ما تريد أن يقوله الأفاتار...",
        "tts_lang_label": "لغة الكلام",
        "voice_ref_label": "مرجع الصوت",
        "voice_ref_info": "اختياري: ارفع ملفاً صوتياً لاستنساخ أسلوب الصوت",
        "emotion_label": "شدة التعبير العاطفي",
        "emotion_info": "0 = محايد · 1 = تعبيري جداً",
        "audio_label": "الملف الصوتي",
        "audio_info": "ارفع WAV أو MP3 أو FLAC · الحد الأقصى 30 ثانية",
        "aspect_label": "التنسيق",
        "advanced": "⚙️ الإعدادات المتقدمة",
        "steps_label": "خطوات الاستدلال",
        "steps_info": "المزيد من الخطوات = جودة أعلى، وقت أطول",
        "guidance_label": "مقياس التوجيه",
        "guidance_info": "أعلى = يتبع الصوت بدقة أكبر",
        "generate": "🎬 توليد الفيديو",
        "output_label": "الفيديو المُنشأ",
        "examples_header": "### 💡 جرّب هذه الأمثلة",
        "err_no_portrait": "الرجاء رفع صورة وجه.",
        "err_no_text": "الرجاء إدخال نص.",
        "err_no_audio": "الرجاء رفع ملف صوتي.",
        "err_text_long": f"النص طويل جداً (الحد الأقصى {MAX_TEXT_LEN} حرف).",
        "err_audio_long": f"الصوت طويل جداً (الحد الأقصى {MAX_AUDIO_SEC} ثانية).",
        "err_oom": "نفدت ذاكرة GPU. جرّب تنسيقاً أصغر أو خطوات أقل.",
        "err_no_face": "لم يُكتشف أي وجه. ارفع صورة وجه واضحة من الأمام.",
        "err_model": "النموذج غير محمّل. أعد تحميل الصفحة وحاول مجدداً.",
        # Phase 2: Dub Video
        "tab_dub": "🎙️ دبلجة فيديو",
        "dub_tagline": "دبلج أي فيديو إلى 23 لغة",
        "dub_video_label": "الفيديو المُدخل",
        "dub_video_info": "ارفع فيديو للدبلجة (الحد الأقصى 60 ثانية)",
        "dub_target_label": "اللغة الهدف",
        "dub_voice_label": "مرجع الصوت",
        "dub_voice_info": "اختياري: ارفع ملفاً صوتياً لاستنساخ أسلوب الصوت في الدبلجة",
        "dub_emotion_label": "شدة التعبير العاطفي",
        "dub_btn": "🎙️ دبلجة الفيديو",
        "dub_output_label": "الفيديو المدبلج",
        "dub_transcript": "النص المُكتشف",
        "dub_translation": "الترجمة",
        "dub_status": "الحالة",
        "dub_details": "التفاصيل",
        "err_no_video": "الرجاء رفع فيديو.",
        "err_video_long": "الفيديو طويل جداً (الحد الأقصى 60 ثانية).",
        "err_translate": "فشلت الترجمة. الرجاء المحاولة مجدداً.",
        "err_transcribe": "فشل النسخ. الرجاء المحاولة مجدداً.",
        "err_dub_text_long": "النص المُكتشف طويل جداً. استخدم مقطعاً أقصر.",
    },
}
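The UI strings above live in a nested dict keyed first by language label, then by string key. A minimal lookup helper with an English fallback might look like the sketch below; the `tr` helper, its fallback behavior, and the trimmed two-language dict are illustrative, not part of the app's actual i18n.py.

```python
# Sketch of an i18n lookup with a default-language fallback. The real app's
# dict is keyed by flag-prefixed labels (e.g. "🇪🇬 عربي"); plain labels here
# keep the example short.
I18N = {
    "English": {"tab_dub": "🎙️ Dub Video", "dub_status": "Status"},
    "Español": {"tab_dub": "🎙️ Doblar Vídeo"},
}

def tr(lang: str, key: str, default_lang: str = "English") -> str:
    """Return the string for (lang, key), falling back to default_lang,
    and finally to the key itself so the UI never shows a blank label."""
    return I18N.get(lang, {}).get(key) or I18N[default_lang].get(key, key)

print(tr("Español", "tab_dub"))     # → 🎙️ Doblar Vídeo
print(tr("Español", "dub_status"))  # missing in Spanish → Status
```

Falling back to the key itself means a typo in a translation table degrades to an ugly-but-visible label instead of a crash.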
lang_codes.py ADDED
@@ -0,0 +1,66 @@
"""Language code mappings between TTS display names, Whisper ISO-639-1, and NLLB BCP-47."""

from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LangInfo:
    display: str  # Display name matching TTS_LANGUAGES (e.g., "Portuguese")
    nllb: str     # NLLB-200 BCP-47 flores code (e.g., "por_Latn")
    whisper: str  # Whisper ISO 639-1 code (e.g., "pt")


LANG_MAP: Mapping[str, LangInfo] = {
    "Arabic": LangInfo("Arabic", "arb_Arab", "ar"),
    "Danish": LangInfo("Danish", "dan_Latn", "da"),
    "German": LangInfo("German", "deu_Latn", "de"),
    "Greek": LangInfo("Greek", "ell_Grek", "el"),
    "English": LangInfo("English", "eng_Latn", "en"),
    "Spanish": LangInfo("Spanish", "spa_Latn", "es"),
    "Finnish": LangInfo("Finnish", "fin_Latn", "fi"),
    "French": LangInfo("French", "fra_Latn", "fr"),
    "Hebrew": LangInfo("Hebrew", "heb_Hebr", "he"),
    "Hindi": LangInfo("Hindi", "hin_Deva", "hi"),
    "Italian": LangInfo("Italian", "ita_Latn", "it"),
    "Japanese": LangInfo("Japanese", "jpn_Jpan", "ja"),
    "Korean": LangInfo("Korean", "kor_Hang", "ko"),
    "Malay": LangInfo("Malay", "zsm_Latn", "ms"),
    "Dutch": LangInfo("Dutch", "nld_Latn", "nl"),
    "Norwegian": LangInfo("Norwegian", "nob_Latn", "no"),
    "Polish": LangInfo("Polish", "pol_Latn", "pl"),
    "Portuguese": LangInfo("Portuguese", "por_Latn", "pt"),
    "Russian": LangInfo("Russian", "rus_Cyrl", "ru"),
    "Swedish": LangInfo("Swedish", "swe_Latn", "sv"),
    "Swahili": LangInfo("Swahili", "swh_Latn", "sw"),
    "Turkish": LangInfo("Turkish", "tur_Latn", "tr"),
    "Chinese": LangInfo("Chinese", "zho_Hans", "zh"),
}


def get_nllb_code(lang_display: str) -> str:
    info = LANG_MAP.get(lang_display)
    if info is None:
        raise ValueError(f"Unknown language: {lang_display!r}")
    return info.nllb


def get_whisper_code(lang_display: str) -> str:
    info = LANG_MAP.get(lang_display)
    if info is None:
        raise ValueError(f"Unknown language: {lang_display!r}")
    return info.whisper


def whisper_code_to_display(whisper_code: str) -> str | None:
    for info in LANG_MAP.values():
        if info.whisper == whisper_code:
            return info.display
    return None


def nllb_code_to_display(nllb_code: str) -> str | None:
    for info in LANG_MAP.values():
        if info.nllb == nllb_code:
            return info.display
    return None
requirements.txt ADDED
@@ -0,0 +1,34 @@
# ── Exact versions matching chatterbox-tts 0.1.6 deps ────────────────────────
torch==2.6.0
torchaudio==2.6.0
torchvision==0.21.0
transformers==4.46.3
tokenizers==0.20.3
diffusers==0.29.0
safetensors==0.5.3
accelerate==1.2.1

# ── Gradio (conflicts with chatterbox-tts PyPI — use GitHub clone instead) ───
gradio==6.0.2
spaces

# ── Chatterbox runtime deps (no chatterbox-tts pip install — cloned at runtime) ─
librosa==0.11.0
s3tokenizer
resemble-perth==1.0.1
conformer==0.3.2
spacy-pkuseg
pykakasi==2.3.0
pyloudnorm
omegaconf
numpy>=1.24.0

# ── Other ─────────────────────────────────────────────────────────────────────
opencv-python-headless
Pillow>=10.0.0
huggingface_hub>=0.23.0

# ── Phase 2: Video Dubbing ───────────────────────────────────────────────────
openai-whisper
tiktoken
more-itertools
styles.py ADDED
@@ -0,0 +1,153 @@
"""Gradio theme and CSS for AnimaStudio."""
import gradio as gr

THEME = gr.themes.Soft(
    primary_hue="purple",
    secondary_hue="pink",
    neutral_hue="slate",
)

CSS = """
/* ── Global ─────────────────────────────────────── */
.gradio-container {
  max-width: 1380px !important;
  margin: 0 auto !important;
}

/* ── Header ──────────────────────────────────────── */
.as-header {
  text-align: center;
  padding: 2.4rem 1rem 2rem;
  border-radius: 1.5rem;
  margin-bottom: 1.5rem;
  background: linear-gradient(135deg, #1a0b2e 0%, #2d1b4e 40%, #1a1040 70%, #0d0b1e 100%);
  border: 1px solid rgba(168,85,247,0.25);
  box-shadow: 0 4px 60px rgba(168,85,247,0.1), inset 0 1px 0 rgba(255,255,255,0.05);
  position: relative;
  overflow: hidden;
}

.as-header::before {
  content: '';
  position: absolute;
  top: -50%; left: -50%;
  width: 200%; height: 200%;
  background: radial-gradient(ellipse at center, rgba(168,85,247,0.08) 0%, transparent 60%);
  pointer-events: none;
}

.as-header h1 {
  font-size: 3.2rem !important;
  font-weight: 900 !important;
  margin: 0 0 0.6rem !important;
  line-height: 1.05 !important;
  background: linear-gradient(90deg, #e879f9 0%, #a855f7 40%, #f472b6 80%, #fb923c 100%);
  -webkit-background-clip: text;
  -webkit-text-fill-color: transparent;
  background-clip: text;
  letter-spacing: -0.035em;
}

.as-header .tagline {
  color: #94a3b8 !important;
  font-size: 1.0rem !important;
  margin: 0 0 1rem !important;
}

.as-header .badges {
  display: flex;
  justify-content: center;
  gap: 0.5rem;
  flex-wrap: wrap;
}

/* ── Badges ───────────────────────────────────────── */
.badge {
  display: inline-flex;
  align-items: center;
  gap: 0.3rem;
  padding: 0.25rem 0.75rem;
  border-radius: 999px;
  font-size: 0.78rem;
  font-weight: 600;
  background: rgba(255,255,255,0.06);
  border: 1px solid rgba(255,255,255,0.1);
  color: #cbd5e1;
}
.badge-purple { border-color: rgba(168,85,247,0.4); color: #a855f7; background: rgba(168,85,247,0.08); }
.badge-pink   { border-color: rgba(244,114,182,0.4); color: #f472b6; background: rgba(244,114,182,0.08); }
.badge-cyan   { border-color: rgba(34,211,238,0.4); color: #22d3ee; background: rgba(34,211,238,0.08); }
.badge-gold   { border-color: rgba(251,191,36,0.4); color: #fbbf24; background: rgba(251,191,36,0.08); }
.badge-teal   { border-color: rgba(20,184,166,0.4); color: #14b8a6; background: rgba(20,184,166,0.08); }

/* ── Language selector ────────────────────────────── */
#lang-selector .wrap { gap: 4px !important; justify-content: center !important; }
#lang-selector label span { font-size: 0.9rem !important; }
#lang-selector { margin-bottom: 0.5rem !important; }

/* ── Generate Button ──────────────────────────────── */
#gen-btn {
  background: linear-gradient(135deg, #a855f7 0%, #ec4899 100%) !important;
  border: none !important;
  font-size: 1.15rem !important;
  font-weight: 700 !important;
  padding: 0.85rem 1rem !important;
  border-radius: 0.85rem !important;
  color: white !important;
  box-shadow: 0 4px 24px rgba(168,85,247,0.4) !important;
  transition: all 0.2s ease !important;
  letter-spacing: 0.02em !important;
  width: 100% !important;
}
#gen-btn:hover {
  transform: translateY(-2px) !important;
  box-shadow: 0 8px 32px rgba(168,85,247,0.55) !important;
}
#gen-btn:active { transform: translateY(0) !important; }

/* ── Dub Button ────────────────────────────────────── */
#dub-btn {
  background: linear-gradient(135deg, #06b6d4 0%, #a855f7 100%) !important;
  border: none !important;
  font-size: 1.15rem !important;
  font-weight: 700 !important;
  padding: 0.85rem 1rem !important;
  border-radius: 0.85rem !important;
  color: white !important;
  box-shadow: 0 4px 24px rgba(6,182,212,0.4) !important;
  transition: all 0.2s ease !important;
  letter-spacing: 0.02em !important;
  width: 100% !important;
}
#dub-btn:hover {
  transform: translateY(-2px) !important;
  box-shadow: 0 8px 32px rgba(6,182,212,0.55) !important;
}
#dub-btn:active { transform: translateY(0) !important; }

/* ── Output Video ─────────────────────────────────── */
#output-video, #dub-output-video {
  border-radius: 1rem !important;
  overflow: hidden !important;
  background: #0f172a !important;
  min-height: 420px !important;
}

/* ── Footer ───────────────────────────────────────── */
.as-footer {
  text-align: center;
  padding: 1.2rem 0 0.5rem;
  color: #475569;
  font-size: 0.82rem;
  border-top: 1px solid #1e293b;
  margin-top: 1rem;
}
.as-footer a { color: #a855f7 !important; text-decoration: none !important; }
.as-footer a:hover { text-decoration: underline !important; }

/* ── Mobile ───────────────────────────────────────── */
@media (max-width: 768px) {
  .as-header h1 { font-size: 2rem !important; }
  .badges { gap: 0.35rem !important; }
}
"""