---
title: AnimaStudio 🎬
emoji: 🎬
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.0.2
app_file: app.py
pinned: true
license: apache-2.0
short_description: AI talking head & video dubbing – free, 23 languages
tags:
- video-generation
- talking-head
- lip-sync
- avatar
- tts
- voice-cloning
- multilingual
- mcp-server
- echomimic
- chatterbox
- dubbing
- whisper
- nllb
---
# 🎬 AnimaStudio

Turn any portrait photo into a talking head video using your voice or typed text, or dub any video into 23 languages – free, no sign-up required.
## ✨ Features
| Feature | Details |
|---|---|
| 🎭 Realistic Lip Sync | EchoMimic V3 Flash (AAAI 2026) – state-of-the-art talking head animation |
| 🗣️ 23-Language TTS | Type text in any of 23 languages and generate natural speech |
| 🎙️ Voice Cloning | Upload a voice reference clip to clone the speaking style |
| 🎤 Audio Upload | Upload your own WAV / MP3 / FLAC instead of using TTS |
| 🎬 Video Dubbing | Upload a video (up to 60 s) and dub it into any of the 23 supported languages |
| 📐 3 Aspect Ratios | 9:16 mobile, 1:1 square, 16:9 landscape |
| 🌍 4 UI Languages | Full interface in English, Português (BR), Español, and عربي |
| 📥 Download | One-click download of the generated MP4 |
| 🤖 MCP Server | Use as a tool in Claude, Cursor, and any MCP-compatible agent |
## 🗣️ Supported TTS Languages

Arabic · Danish · German · Greek · English · Spanish · Finnish · French · Hebrew · Hindi · Italian · Japanese · Korean · Malay · Dutch · Norwegian · Polish · Portuguese · Russian · Swedish · Swahili · Turkish · Chinese
## 📐 Output Formats
| Preset | Dimensions | Best for |
|---|---|---|
| ▮ 9:16 | 576 × 1024 | Mobile, Reels, TikTok |
| ◻ 1:1 | 512 × 512 | Social media, thumbnails |
| ▬ 16:9 | 1024 × 576 | Presentations, YouTube |
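The preset-to-dimensions mapping above can be expressed as a small lookup helper. This is an illustrative sketch, not the app's actual code; the names `PRESETS` and `preset_dims` are assumptions.

```python
# Illustrative mapping of aspect-ratio presets to output dimensions.
# Names are hypothetical; app.py's internal code may differ.
PRESETS = {
    "9:16": (576, 1024),  # mobile, Reels, TikTok
    "1:1": (512, 512),    # social media, thumbnails
    "16:9": (1024, 576),  # presentations, YouTube
}

def preset_dims(aspect_ratio: str) -> tuple[int, int]:
    """Return (width, height) for a preset, defaulting to 9:16."""
    return PRESETS.get(aspect_ratio, PRESETS["9:16"])
```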
## ⚙️ Advanced Settings
| Setting | Default | Range | Description |
|---|---|---|---|
| Inference Steps | 20 | 5–50 | More steps = higher quality, slower |
| Guidance Scale | 3.5 | 1–10 | Higher = audio followed more strictly |
| Emotion Intensity | 0.5 | 0–1 | Controls expressiveness of TTS voice |
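Callers passing these settings programmatically may want to clamp values into the documented ranges first. A minimal sketch, assuming the ranges in the table above; `clamp_settings` is a hypothetical helper, not part of the app:

```python
def clamp_settings(num_steps=20, guidance_scale=3.5, emotion=0.5):
    """Clamp user-supplied values into the documented ranges
    (5-50 steps, 1-10 guidance, 0-1 emotion). Illustrative only."""
    return {
        "num_steps": max(5, min(50, int(num_steps))),
        "guidance_scale": max(1.0, min(10.0, float(guidance_scale))),
        "emotion": max(0.0, min(1.0, float(emotion))),
    }
```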
## 🤖 MCP Server
AnimaStudio runs as an MCP (Model Context Protocol) server, enabling AI agents to generate talking head videos programmatically.
### Using with Claude Desktop
Add to your `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "animastudio": {
      "url": "https://lulavc-animastudio.hf.space/gradio_api/mcp/sse"
    }
  }
}
```
### Tool parameters
- `portrait_image` – portrait photo (file path or base64)
- `text` – text for the avatar to speak (text mode)
- `tts_language` – language for speech synthesis (23 options)
- `voice_ref` – optional voice reference audio for cloning
- `audio_file` – audio file path (audio mode)
- `aspect_ratio` – output format (9:16, 1:1, 16:9)
- `emotion` – emotion intensity 0–1
- `num_steps` – inference steps (default 20)
- `guidance_scale` – guidance scale (default 3.5)
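An agent would assemble these parameters into a tool-call payload roughly like the following. This is a sketch: field names mirror the documented parameter list, values are example placeholders, and the exact schema should be confirmed against the live MCP endpoint.

```python
# Example argument payload for the talking-head tool (text mode).
# Field names follow the documented parameter list; values are
# placeholders, and the live MCP schema is authoritative.
payload = {
    "portrait_image": "portrait.jpg",   # file path or base64
    "text": "Hello from AnimaStudio!",  # what the avatar says
    "tts_language": "English",          # one of the 23 languages
    "voice_ref": None,                  # optional cloning reference
    "aspect_ratio": "9:16",             # 9:16, 1:1, or 16:9
    "emotion": 0.5,                     # 0-1
    "num_steps": 20,                    # default 20
    "guidance_scale": 3.5,              # default 3.5
}
```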
## 🎬 Video Dubbing (Phase 2)
Upload any video (up to 60 seconds) and dub it into a different language. The pipeline:
1. Whisper Turbo transcribes the original speech (auto-detects language)
2. NLLB-200 translates the transcript to the target language
3. Chatterbox TTS synthesizes the translated speech (with optional voice cloning)
4. ffmpeg muxes the new audio track onto the original video
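The final mux step can be sketched with the standard ffmpeg stream-selection flags. The `build_mux_cmd` helper is hypothetical; the Space's exact invocation may differ, but the flags shown are real ffmpeg options.

```python
def build_mux_cmd(video_path: str, dubbed_audio: str, out_path: str) -> list[str]:
    """Build an ffmpeg command that keeps the original video stream and
    replaces the audio track with the dubbed one. Illustrative sketch."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,    # input 0: original video
        "-i", dubbed_audio,  # input 1: dubbed speech
        "-map", "0:v:0",     # video stream from input 0
        "-map", "1:a:0",     # audio stream from input 1
        "-c:v", "copy",      # no video re-encode
        "-c:a", "aac",       # encode dubbed audio as AAC
        "-shortest",         # stop at the shorter stream
        out_path,
    ]

# Run with: subprocess.run(build_mux_cmd("in.mp4", "dub.wav", "out.mp4"), check=True)
```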
### Dubbing Settings
| Setting | Details |
|---|---|
| Input Video | Any video with speech, up to 60 seconds |
| Target Language | Any of the 23 supported languages |
| Voice Reference | Optional audio clip to clone the speaker's voice style |
**Same language as source?** The pipeline skips translation and re-synthesizes the audio directly.
## 🔧 Technical Details
### Models
| Model | Purpose | License | VRAM |
|---|---|---|---|
| EchoMimic V3 Flash | Talking head video generation | Apache 2.0 | ~12 GB |
| Chatterbox Multilingual | 23-language TTS with voice cloning | MIT | ~8 GB |
| Whisper Turbo | Speech transcription (809M params) | MIT | ~2 GB |
| NLLB-200 Distilled 600M | Text translation (23 languages) | CC-BY-NC-4.0 | API (no local GPU) |
### Architecture
```
Tab 1: Create Video

  Portrait Photo + Text ──▶ Chatterbox TTS ──▶ Audio WAV
                                                  │
  Portrait Photo + Audio ─────────────────────────┤
                                                  ▼
                                        EchoMimic V3 Flash
                                        (lip-sync animation)
                                                  │
                                                  ▼
                                         MP4 Video Output

Tab 2: Dub Video

  Video ──▶ ffmpeg (extract audio)
               │
               ▼
  Whisper Turbo (transcribe + detect language)
               │
               ▼
  NLLB-200 (translate to target language)
               │
               ▼
  Chatterbox TTS (synthesize translated speech)
               │
               ▼
  ffmpeg (mux new audio onto original video)
               │
               ▼
  Dubbed MP4 Output
```
### VRAM Management
Models run sequentially on ZeroGPU (A10G, 24 GB):
**Create Video tab:**
1. Chatterbox TTS → generates audio → offloads to CPU
2. EchoMimic V3 → generates video → offloads to CPU
3. `torch.cuda.empty_cache()` between stages

**Dub Video tab:**
1. Whisper Turbo → transcribes audio (~2 GB) → offloads to CPU
2. NLLB-200 → translates via HF Inference API (no local GPU)
3. Chatterbox TTS → synthesizes dubbed speech → offloads to CPU
4. `torch.cuda.empty_cache()` between stages

Peak usage never exceeds ~16 GB.
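The load-run-offload pattern above can be sketched as a context manager. This is a simplified illustration using a dummy object so it runs anywhere; in the real pipeline the model would be a torch module and the cleanup hook would be `torch.cuda.empty_cache()`.

```python
from contextlib import contextmanager

class DummyModel:
    """Stand-in for a torch module, so the sketch runs without a GPU."""
    def __init__(self):
        self.device = "cpu"
    def to(self, device):
        self.device = device
        return self

@contextmanager
def on_gpu(model, free_cache=lambda: None):
    """Move a model to GPU for one pipeline stage, then offload it
    back to CPU and free cached VRAM. Illustrative pattern only."""
    model.to("cuda")
    try:
        yield model
    finally:
        model.to("cpu")
        free_cache()  # torch.cuda.empty_cache() in the real pipeline
```

Each stage then wraps its model: `with on_gpu(tts): ...`, keeping only one large model resident on the GPU at a time.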
## 💡 Tips for Best Results
### Create Video
- Use a clear, front-facing portrait – well-lit, neutral background, face filling most of the frame
- Keep audio under 20 seconds – shorter = faster generation, tighter lip sync
- Add a voice reference – upload a 5–15 second clip in the target language for natural voice cloning
- Match language to text – select the correct TTS language to avoid accent issues
- Emotion 0.4–0.6 – sweet spot for natural-sounding delivery
- 9:16 for social – perfect for Reels, TikTok, and Stories
- 20–30 steps – good quality/speed trade-off for most use cases
### Dub Video
- Keep videos under 60 seconds – the pipeline enforces this limit for VRAM and quality
- Clear speech works best – minimal background music/noise gives cleaner transcriptions
- Add a voice reference – clone the original speaker's voice for a more natural dub
- Single-speaker videos – the pipeline works best with one speaker at a time
## 🛠️ Running Locally
```bash
git clone https://huggingface.co/spaces/lulavc/AnimaStudio
cd AnimaStudio
pip install -r requirements.txt
python app.py
```
Requires a CUDA GPU with at least 16 GB VRAM. Set `HF_TOKEN` for private model access.
## 📄 License
- Space code: Apache 2.0
- EchoMimic V3: Apache 2.0
- Chatterbox TTS: MIT
- Whisper Turbo: MIT
- NLLB-200: CC-BY-NC-4.0
Space by lulavc · Powered by EchoMimic V3 + Chatterbox + Whisper Turbo + NLLB-200 · ZeroGPU · A10G