---
title: AnimaStudio 🎬
emoji: 🎬
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.0.2
app_file: app.py
pinned: true
license: apache-2.0
short_description: AI talking head & video dubbing - free, 23 languages
tags:
  - video-generation
  - talking-head
  - lip-sync
  - avatar
  - tts
  - voice-cloning
  - multilingual
  - mcp-server
  - echomimic
  - chatterbox
  - dubbing
  - whisper
  - nllb
---

# 🎬 AnimaStudio

Turn any portrait photo into a talking head video using your voice or typed text, or dub a video of up to 60 seconds into any of 23 languages. Free, no sign-up required.

---

## ✨ Features

| Feature | Details |
|---------|---------|
| 🎭 **Realistic Lip Sync** | EchoMimic V3 Flash (AAAI 2026), state-of-the-art talking head animation |
| 🗣️ **23-Language TTS** | Type text in any of 23 languages and generate natural speech |
| 🎙️ **Voice Cloning** | Upload a voice reference clip to clone the speaking style |
| 🎤 **Audio Upload** | Upload your own WAV / MP3 / FLAC instead of using TTS |
| 🎬 **Video Dubbing** | Upload a video (up to 60 s) and dub it into any of the 23 supported languages |
| 📐 **3 Aspect Ratios** | 9:16 mobile, 1:1 square, 16:9 landscape |
| 🌐 **4 UI Languages** | Full interface in English, Português (BR), Español, and عربي |
| 📥 **Download** | One-click download of the generated MP4 |
| 🤖 **MCP Server** | Use as a tool in Claude, Cursor, and any MCP-compatible agent |

---

## 🗣️ Supported TTS Languages

Arabic · Danish · German · Greek · **English** · **Spanish** · Finnish · French · Hebrew · Hindi · Italian · Japanese · Korean · Malay · Dutch · Norwegian · Polish · **Portuguese** · Russian · Swedish · Swahili · Turkish · Chinese

---

## 📐 Output Formats

| Preset | Dimensions | Best for |
|--------|-----------|----------|
| ▮ 9:16 | 576 × 1024 | Mobile, Reels, TikTok |
| ◻ 1:1 | 512 × 512 | Social media, thumbnails |
| ▬ 16:9 | 1024 × 576 | Presentations, YouTube |

---

## ⚙️ Advanced Settings

| Setting | Default | Range | Description |
|---------|---------|-------|-------------|
| **Inference Steps** | 20 | 5–50 | More steps = higher quality, slower |
| **Guidance Scale** | 3.5 | 1–10 | Higher = audio followed more strictly |
| **Emotion Intensity** | 0.5 | 0–1 | Controls expressiveness of TTS voice |
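
The defaults and ranges above, together with the preset dimensions from the Output Formats table, can be checked before a job is submitted. The following is a minimal sketch, not the app's own code: the `GenerationSettings` class and its field names are hypothetical and only mirror the values documented in this README.

```python
from dataclasses import dataclass

# (width, height) per preset, copied from the Output Formats table.
PRESET_DIMENSIONS = {
    "9:16": (576, 1024),
    "1:1": (512, 512),
    "16:9": (1024, 576),
}


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical settings container; names are illustrative only."""

    aspect_ratio: str
    num_steps: int = 20          # documented range: 5-50
    guidance_scale: float = 3.5  # documented range: 1-10
    emotion: float = 0.5         # documented range: 0-1

    def __post_init__(self):
        # Reject values outside the documented presets and ranges.
        if self.aspect_ratio not in PRESET_DIMENSIONS:
            raise ValueError(f"unknown aspect ratio: {self.aspect_ratio!r}")
        if not 5 <= self.num_steps <= 50:
            raise ValueError("num_steps must be between 5 and 50")
        if not 1.0 <= self.guidance_scale <= 10.0:
            raise ValueError("guidance_scale must be between 1 and 10")
        if not 0.0 <= self.emotion <= 1.0:
            raise ValueError("emotion must be between 0 and 1")

    @property
    def dimensions(self):
        """Output size in pixels as (width, height)."""
        return PRESET_DIMENSIONS[self.aspect_ratio]


settings = GenerationSettings(aspect_ratio="16:9", num_steps=30)
print(settings.dimensions)
```

Validating up front means an out-of-range value fails immediately instead of after a GPU slot has been allocated.
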

---

## 🤖 MCP Server

AnimaStudio runs as an **MCP (Model Context Protocol) server**, enabling AI agents to generate talking head videos programmatically.

### Using with Claude Desktop

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "animastudio": {
      "url": "https://lulavc-animastudio.hf.space/gradio_api/mcp/sse"
    }
  }
}
```

### Tool parameters

- **portrait_image**: portrait photo (file path or base64)
- **text**: text for the avatar to speak (text mode)
- **tts_language**: language for speech synthesis (23 options)
- **voice_ref**: optional voice reference audio for cloning
- **audio_file**: audio file path (audio mode)
- **aspect_ratio**: output format (9:16, 1:1, 16:9)
- **emotion**: emotion intensity, 0–1
- **num_steps**: inference steps (default 20)
- **guidance_scale**: guidance scale (default 3.5)
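
To illustrate how these parameters travel over MCP, here is a rough sketch of the JSON-RPC 2.0 `tools/call` request that an MCP client sends on the agent's behalf. The tool name `animastudio_generate`, the file path, and the `"English"` language value are assumptions for illustration; the real tool name and accepted values should be read from the server's tool listing.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` request (JSON-RPC 2.0 envelope)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Text mode: a portrait plus text to speak. Audio mode would pass
# `audio_file` instead of `text` and `tts_language`.
arguments = {
    "portrait_image": "/path/to/portrait.png",  # placeholder path
    "text": "Hello from AnimaStudio!",
    "tts_language": "English",                  # assumed value format
    "aspect_ratio": "9:16",
    "emotion": 0.5,
    "num_steps": 20,
    "guidance_scale": 3.5,
}

# "animastudio_generate" is a hypothetical tool name.
request = build_tool_call(1, "animastudio_generate", arguments)
print(json.dumps(request, indent=2))
```

In practice Claude Desktop or another MCP client builds this message itself; the sketch only shows where each documented parameter ends up.
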

---

## 🎬 Video Dubbing (Phase 2)

Upload any video (up to 60 seconds) and dub it into a different language. The pipeline:

1. **Whisper Turbo** transcribes the original speech (auto-detects language)
2. **NLLB-200** translates the transcript to the target language
3. **Chatterbox TTS** synthesizes the translated speech (with optional voice cloning)
4. **ffmpeg** muxes the new audio track onto the original video

### Dubbing Settings

| Setting | Details |
|---------|---------|
| **Input Video** | Any video with speech, up to 60 seconds |
| **Target Language** | Any of the 23 supported languages |
| **Voice Reference** | Optional audio clip to clone the speaker's voice style |

> Same language as source? The pipeline skips translation and re-synthesizes the audio directly.
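
The control flow of the four stages, including the 60-second limit and the same-language shortcut, can be sketched as below. This is an illustrative outline rather than the Space's actual code: `dub_video` and the four injected callables are hypothetical stand-ins for Whisper, NLLB-200, Chatterbox, and ffmpeg.

```python
from typing import Callable, Optional, Tuple

MAX_VIDEO_SECONDS = 60  # limit stated in the Dubbing Settings table


def dub_video(
    video_path: str,
    duration_s: float,
    target_language: str,
    transcribe: Callable[[str], Tuple[str, str]],          # -> (text, detected language)
    translate: Callable[[str, str, str], str],             # (text, source, target) -> text
    synthesize: Callable[[str, str, Optional[str]], str],  # -> path of new audio
    mux: Callable[[str, str], str],                        # (video, audio) -> output path
    voice_ref: Optional[str] = None,
) -> str:
    """Run the four dubbing stages in order and return the output path."""
    if duration_s > MAX_VIDEO_SECONDS:
        raise ValueError(f"video is {duration_s:.0f} s; limit is {MAX_VIDEO_SECONDS} s")

    # Stage 1: transcription also reports the detected source language.
    text, source_language = transcribe(video_path)

    # Stage 2: skipped entirely when source and target languages match.
    if source_language != target_language:
        text = translate(text, source_language, target_language)

    # Stage 3: synthesize speech, optionally cloning a reference voice.
    audio_path = synthesize(text, target_language, voice_ref)

    # Stage 4: replace the audio track on the original video.
    return mux(video_path, audio_path)
```

Passing the stages in as callables keeps the orchestration testable without loading any model.
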

---

## 🔧 Technical Details

### Models

| Model | Purpose | License | VRAM |
|-------|---------|---------|------|
| [EchoMimic V3 Flash](https://huggingface.co/BadToBest/EchoMimicV3) | Talking head video generation | Apache 2.0 | ~12 GB |
| [Chatterbox Multilingual](https://huggingface.co/ResembleAI/chatterbox) | 23-language TTS with voice cloning | MIT | ~8 GB |
| [Whisper Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | Speech transcription (809M params) | MIT | ~2 GB |
| [NLLB-200 Distilled 600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | Text translation (23 languages) | CC-BY-NC-4.0 | API (no local GPU) |

### Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                      Tab 1: Create Video                      │
│                                                               │
│  Portrait Photo + Text ──► Chatterbox TTS ──► Audio WAV       │
│                                                   │           │
│  Portrait Photo + Audio ──────────────────────────┤           │
│                                                   ▼           │
│                                          EchoMimic V3 Flash   │
│                                         (lip-sync animation)  │
│                                                   │           │
│                                                   ▼           │
│                                           MP4 Video Output    │
├───────────────────────────────────────────────────────────────┤
│                       Tab 2: Dub Video                        │
│                                                               │
│  Video ──► ffmpeg (extract audio)                             │
│                      │                                        │
│                      ▼                                        │
│  Whisper Turbo (transcribe + detect language)                 │
│                      │                                        │
│                      ▼                                        │
│  NLLB-200 (translate to target language)                      │
│                      │                                        │
│                      ▼                                        │
│  Chatterbox TTS (synthesize translated speech)                │
│                      │                                        │
│                      ▼                                        │
│  ffmpeg (mux new audio onto original video)                   │
│                      │                                        │
│                      ▼                                        │
│              Dubbed MP4 Output                                │
└───────────────────────────────────────────────────────────────┘
```
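
The two ffmpeg stages in the Dub Video tab can be expressed as plain argument lists. This is a sketch under assumptions, not the commands the Space actually runs: the function names are hypothetical, and the 16 kHz mono WAV and AAC output settings are common choices for speech recognition input and MP4 audio, not values taken from this README.

```python
def extract_audio_cmd(video_path, wav_path, sample_rate=16000):
    """ffmpeg arguments that drop the video stream and write mono WAV."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-vn",                    # no video in the output
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample (assumed 16 kHz)
        wav_path,
    ]


def mux_audio_cmd(video_path, audio_path, out_path):
    """ffmpeg arguments that keep the original video and swap the audio."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,   # input 0: original video
        "-i", audio_path,   # input 1: synthesized speech
        "-map", "0:v:0",    # take video from input 0
        "-map", "1:a:0",    # take audio from input 1
        "-c:v", "copy",     # do not re-encode the video stream
        "-c:a", "aac",      # assumed audio codec for MP4
        "-shortest",        # stop at the shorter of the two streams
        out_path,
    ]


print(" ".join(mux_audio_cmd("input.mp4", "dubbed.wav", "dubbed.mp4")))
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Copying the video stream avoids a re-encode, so the picture quality of the original is preserved.
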

### VRAM Management

Models run sequentially on ZeroGPU (A10G, 24 GB), and `torch.cuda.empty_cache()` is called between stages.

**Create Video tab:**

1. Chatterbox TTS → generates audio → offloads to CPU
2. EchoMimic V3 → generates video → offloads to CPU

**Dub Video tab:**

1. Whisper Turbo → transcribes audio (~2 GB) → offloads to CPU
2. NLLB-200 → translates via HF Inference API (no local GPU)
3. Chatterbox TTS → synthesizes dubbed speech → offloads to CPU

Peak usage stays at or below roughly 16 GB.
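
The load, run, offload, and clear-cache pattern described above can be wrapped in a small context manager. This is a minimal sketch, not the app's implementation: `on_gpu` is a hypothetical helper, and it assumes each model exposes a PyTorch-style `.to(device)` method. The cache-clearing function is injected so the helper has no hard dependency on torch.

```python
from contextlib import contextmanager


@contextmanager
def on_gpu(model, empty_cache=None, device="cuda"):
    """Move `model` to the GPU for one stage, then offload it to the CPU.

    `empty_cache` is an optional zero-argument callable, for example
    `torch.cuda.empty_cache`, invoked after the model has been offloaded.
    """
    model.to(device)
    try:
        yield model
    finally:
        # Always offload, even if the stage raised an exception.
        model.to("cpu")
        if empty_cache is not None:
            empty_cache()


# Usage sketch (names are placeholders, torch assumed to be installed):
#
#   with on_gpu(tts_model, torch.cuda.empty_cache) as tts:
#       audio = run_tts(tts, text)
#   with on_gpu(video_model, torch.cuda.empty_cache) as animator:
#       video = run_animation(animator, portrait, audio)
```

Because only one model is resident on the GPU at a time, peak usage is set by the largest single stage rather than by the sum of all models.
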

---

## 💡 Tips for Best Results

### Create Video

1. **Use a clear, front-facing portrait**: well-lit, neutral background, face filling most of the frame
2. **Keep audio under 20 seconds**: shorter = faster generation, tighter lip sync
3. **Add a voice reference**: upload a 5–15 second clip in the target language for natural voice cloning
4. **Match language to text**: select the correct TTS language to avoid accent issues
5. **Emotion 0.4–0.6**: sweet spot for natural-sounding delivery
6. **9:16 for social**: perfect for Reels, TikTok, and Stories
7. **20–30 steps**: good quality/speed trade-off for most use cases

### Dub Video

8. **Keep videos under 60 seconds**: the pipeline enforces this limit for VRAM and quality
9. **Clear speech works best**: minimal background music/noise gives cleaner transcriptions
10. **Add a voice reference**: clone the original speaker's voice for a more natural dub
11. **Single-speaker videos**: the pipeline works best with one speaker at a time

---

## 🛠️ Running Locally

```bash
git clone https://huggingface.co/spaces/lulavc/AnimaStudio
cd AnimaStudio
pip install -r requirements.txt
python app.py
```

Requires a CUDA GPU with at least 16 GB VRAM. Set `HF_TOKEN` for private model access.

---

## 📄 License

- **Space code:** Apache 2.0
- **EchoMimic V3:** [Apache 2.0](https://huggingface.co/BadToBest/EchoMimicV3)
- **Chatterbox TTS:** [MIT](https://huggingface.co/ResembleAI/chatterbox)
- **Whisper Turbo:** [MIT](https://huggingface.co/openai/whisper-large-v3-turbo)
- **NLLB-200:** [CC-BY-NC-4.0](https://huggingface.co/facebook/nllb-200-distilled-600M)

---

**Space by [lulavc](https://huggingface.co/lulavc)** · Powered by [EchoMimic V3](https://huggingface.co/BadToBest/EchoMimicV3) + [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) + [Whisper Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) + [NLLB-200](https://huggingface.co/facebook/nllb-200-distilled-600M) · ZeroGPU · A10G