---
title: AnimaStudio 🎬
emoji: 🎬
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.0.2
app_file: app.py
pinned: true
license: apache-2.0
short_description: AI talking head & video dubbing - free, 23 languages
tags:
  - video-generation
  - talking-head
  - lip-sync
  - avatar
  - tts
  - voice-cloning
  - multilingual
  - mcp-server
  - echomimic
  - chatterbox
  - dubbing
  - whisper
  - nllb
---

# 🎬 AnimaStudio

Turn any portrait photo into a talking head video using your voice or typed text, or dub any video into 23 languages. Free, no sign-up required.

---

## ✨ Features

| Feature | Details |
|---------|---------|
| 🎭 **Realistic Lip Sync** | EchoMimic V3 Flash (AAAI 2026), state-of-the-art talking head animation |
| 🗣️ **23-Language TTS** | Type text in any of 23 languages and generate natural speech |
| 🎙️ **Voice Cloning** | Upload a voice reference clip to clone the speaking style |
| 📤 **Audio Upload** | Upload your own WAV / MP3 / FLAC instead of using TTS |
| 🎬 **Video Dubbing** | Upload a video (up to 60 s) and dub it into any of the 23 supported languages |
| 📐 **3 Aspect Ratios** | 9:16 mobile, 1:1 square, 16:9 landscape |
| 🌐 **4 UI Languages** | Full interface in English, Português (BR), Español, and عربي |
| 📥 **Download** | One-click download of the generated MP4 |
| 🤖 **MCP Server** | Use as a tool in Claude, Cursor, and any MCP-compatible agent |

---

## 🗣️ Supported TTS Languages

Arabic · Danish · German · Greek · **English** · **Spanish** · Finnish · French · Hebrew · Hindi · Italian · Japanese · Korean · Malay · Dutch · Norwegian · Polish · **Portuguese** · Russian · Swedish · Swahili · Turkish · Chinese

---

## 📐 Output Formats

| Preset | Dimensions | Best for |
|--------|-----------|----------|
| ▮ 9:16 | 576 × 1024 | Mobile, Reels, TikTok |
| ◻ 1:1 | 512 × 512 | Social media, thumbnails |
| ▬ 16:9 | 1024 × 576 | Presentations, YouTube |

---

## ⚙️ Advanced Settings

| Setting | Default | Range | Description |
|---------|---------|-------|-------------|
| **Inference Steps** | 20 | 5–50 | More steps = higher quality, slower |
| **Guidance Scale** | 3.5 | 1–10 | Higher = audio followed more strictly |
| **Emotion Intensity** | 0.5 | 0–1 | Controls expressiveness of TTS voice |

---

## 🤖 MCP Server

AnimaStudio runs as an **MCP (Model Context Protocol) server**, enabling AI agents to generate talking head videos programmatically.

### Using with Claude Desktop

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "animastudio": {
      "url": "https://lulavc-animastudio.hf.space/gradio_api/mcp/sse"
    }
  }
}
```

### Tool parameters

- **portrait_image**: portrait photo (file path or base64)
- **text**: text for the avatar to speak (text mode)
- **tts_language**: language for speech synthesis (23 options)
- **voice_ref**: optional voice reference audio for cloning
- **audio_file**: audio file path (audio mode)
- **aspect_ratio**: output format (9:16, 1:1, 16:9)
- **emotion**: emotion intensity 0–1
- **num_steps**: inference steps (default 20)
- **guidance_scale**: guidance scale (default 3.5)

---

## 🎬 Video Dubbing (Phase 2)

Upload any video (up to 60 seconds) and dub it into a different language. The pipeline:

1. **Whisper Turbo** transcribes the original speech (auto-detects language)
2. **NLLB-200** translates the transcript to the target language
3. **Chatterbox TTS** synthesizes the translated speech (with optional voice cloning)
4. **ffmpeg** muxes the new audio track onto the original video

### Dubbing Settings

| Setting | Details |
|---------|---------|
| **Input Video** | Any video with speech, up to 60 seconds |
| **Target Language** | Any of the 23 supported languages |
| **Voice Reference** | Optional audio clip to clone the speaker's voice style |

> Same language as source? The pipeline skips translation and re-synthesizes the audio directly.

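
The control flow of the dubbing pipeline described above (transcribe, translate, synthesize, mux) can be sketched in a few lines of Python. This is a minimal sketch, not the Space's actual code: `DubStages`, `dub_video`, and every stage callable are hypothetical names introduced for illustration, and language codes are assumed to be plain strings. Only the 60-second limit and the "skip translation when the source and target languages match" rule come from this README. Passing the stages in as callables keeps the sketch runnable without loading any model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

MAX_DUB_SECONDS = 60  # input limit stated in this README


@dataclass
class DubStages:
    """Hypothetical stage callables standing in for Whisper, NLLB, Chatterbox and ffmpeg."""
    probe_duration: Callable[[str], float]                 # video path -> seconds
    transcribe: Callable[[str], Tuple[str, str]]           # video path -> (text, detected language)
    translate: Callable[[str, str, str], str]              # (text, source, target) -> translated text
    synthesize: Callable[[str, str, Optional[str]], str]   # (text, language, voice ref) -> wav path
    mux: Callable[[str, str], str]                         # (video path, wav path) -> mp4 path


def dub_video(video: str, target_lang: str, stages: DubStages,
              voice_ref: Optional[str] = None) -> str:
    """Run the four dubbing stages in order and return the dubbed video path."""
    duration = stages.probe_duration(video)
    if duration > MAX_DUB_SECONDS:
        raise ValueError(f"video is {duration:.1f} s; the limit is {MAX_DUB_SECONDS} s")

    # 1. Transcribe and detect the spoken language.
    text, source_lang = stages.transcribe(video)

    # 2. Translate only when the target differs from the detected language.
    if source_lang != target_lang:
        text = stages.translate(text, source_lang, target_lang)

    # 3. Synthesize speech in the target language (voice reference is optional).
    wav = stages.synthesize(text, target_lang, voice_ref)

    # 4. Mux the new audio track onto the original video.
    return stages.mux(video, wav)


if __name__ == "__main__":
    # Stub stages so the sketch runs without any model or ffmpeg installed.
    demo = DubStages(
        probe_duration=lambda v: 12.0,
        transcribe=lambda v: ("hello world", "en"),
        translate=lambda t, s, d: f"[{s}->{d}] {t}",
        synthesize=lambda t, lang, ref: f"speech_{lang}.wav",
        mux=lambda v, w: f"dubbed_{v}",
    )
    print(dub_video("clip.mp4", "es", demo))
```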

---

## 🔧 Technical Details

### Models

| Model | Purpose | License | VRAM |
|-------|---------|---------|------|
| [EchoMimic V3 Flash](https://huggingface.co/BadToBest/EchoMimicV3) | Talking head video generation | Apache 2.0 | ~12 GB |
| [Chatterbox Multilingual](https://huggingface.co/ResembleAI/chatterbox) | 23-language TTS with voice cloning | MIT | ~8 GB |
| [Whisper Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | Speech transcription (809M params) | MIT | ~2 GB |
| [NLLB-200 Distilled 600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | Text translation (23 languages) | CC-BY-NC-4.0 | API (no local GPU) |

### Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│  Tab 1: Create Video                                            │
│                                                                 │
│  Portrait Photo + Text ──→ Chatterbox TTS ──→ Audio WAV         │
│                                                    │            │
│  Portrait Photo + Audio ───────────────────────────┤            │
│                                                    ▼            │
│                                           EchoMimic V3 Flash    │
│                                          (lip-sync animation)   │
│                                                    │            │
│                                                    ▼            │
│                                            MP4 Video Output     │
├─────────────────────────────────────────────────────────────────┤
│  Tab 2: Dub Video                                               │
│                                                                 │
│  Video ──→ ffmpeg (extract audio)                               │
│                      │                                          │
│                      ▼                                          │
│  Whisper Turbo (transcribe + detect language)                   │
│                      │                                          │
│                      ▼                                          │
│  NLLB-200 (translate to target language)                        │
│                      │                                          │
│                      ▼                                          │
│  Chatterbox TTS (synthesize translated speech)                  │
│                      │                                          │
│                      ▼                                          │
│  ffmpeg (mux new audio onto original video)                     │
│                      │                                          │
│                      ▼                                          │
│  Dubbed MP4 Output                                              │
└─────────────────────────────────────────────────────────────────┘
```

### VRAM Management

Models run sequentially on ZeroGPU (A10G, 24 GB):

**Create Video tab:**

1. Chatterbox TTS → generates audio → offloads to CPU
2. EchoMimic V3 → generates video → offloads to CPU
3. `torch.cuda.empty_cache()` between stages

**Dub Video tab:**

1. Whisper Turbo → transcribes audio (~2 GB) → offloads to CPU
2. NLLB-200 → translates via HF Inference API (no local GPU)
3. Chatterbox TTS → synthesizes dubbed speech → offloads to CPU
4. `torch.cuda.empty_cache()` between stages

Peak usage never exceeds ~16 GB.

---

## 💡 Tips for Best Results

### Create Video

1. **Use a clear, front-facing portrait**: well-lit, neutral background, face filling most of the frame
2. **Keep audio under 20 seconds**: shorter = faster generation, tighter lip sync
3. **Add a voice reference**: upload a 5–15 second clip in the target language for natural voice cloning
4. **Match language to text**: select the correct TTS language to avoid accent issues
5. **Emotion 0.4–0.6**: sweet spot for natural-sounding delivery
6. **9:16 for social**: perfect for Reels, TikTok, and Stories
7. **20–30 steps**: good quality/speed trade-off for most use cases

### Dub Video

8. **Keep videos under 60 seconds**: pipeline enforces this limit for VRAM and quality
9. **Clear speech works best**: minimal background music/noise gives cleaner transcriptions
10. **Add a voice reference**: clone the original speaker's voice for a more natural dub
11. **Single-speaker videos**: the pipeline works best with one speaker at a time

---

## 🛠️ Running Locally

```bash
git clone https://huggingface.co/spaces/lulavc/AnimaStudio
cd AnimaStudio
pip install -r requirements.txt
python app.py
```

Requires a CUDA GPU with at least 16 GB VRAM. Set `HF_TOKEN` for private model access.

---

## 📄 License

- **Space code:** Apache 2.0
- **EchoMimic V3:** [Apache 2.0](https://huggingface.co/BadToBest/EchoMimicV3)
- **Chatterbox TTS:** [MIT](https://huggingface.co/ResembleAI/chatterbox)
- **Whisper Turbo:** [MIT](https://huggingface.co/openai/whisper-large-v3-turbo)
- **NLLB-200:** [CC-BY-NC-4.0](https://huggingface.co/facebook/nllb-200-distilled-600M)

---

**Space by [lulavc](https://huggingface.co/lulavc)** · Powered by [EchoMimic V3](https://huggingface.co/BadToBest/EchoMimicV3) + [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) + [Whisper Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) + [NLLB-200](https://huggingface.co/facebook/nllb-200-distilled-600M) · ZeroGPU · A10G