---
title: AnimaStudio 🎬
emoji: 🎬
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.0.2
app_file: app.py
pinned: true
license: apache-2.0
short_description: AI talking head & video dubbing — free, 23 languages
tags:
  - video-generation
  - talking-head
  - lip-sync
  - avatar
  - tts
  - voice-cloning
  - multilingual
  - mcp-server
  - echomimic
  - chatterbox
  - dubbing
  - whisper
  - nllb
---

# 🎬 AnimaStudio

Turn any portrait photo into a talking head video using your voice or typed text, or dub any video into 23 languages — free, no sign-up required.

---

## ✨ Features

| Feature | Details |
|---------|---------|
| 🎭 **Realistic Lip Sync** | EchoMimic V3 Flash (AAAI 2026) — state-of-the-art talking head animation |
| 🗣️ **23-Language TTS** | Type text in any of 23 languages and generate natural speech |
| 🎙️ **Voice Cloning** | Upload a voice reference clip to clone the speaking style |
| 📤 **Audio Upload** | Upload your own WAV / MP3 / FLAC instead of using TTS |
| 🎬 **Video Dubbing** | Upload a video (up to 60 s) and dub it into any of the 23 supported languages |
| 📐 **3 Aspect Ratios** | 9:16 mobile, 1:1 square, 16:9 landscape |
| 🌐 **4 UI Languages** | Full interface in English, Português (BR), Español, and عربي |
| 📥 **Download** | One-click download of the generated MP4 |
| 🤖 **MCP Server** | Use as a tool in Claude, Cursor, and any MCP-compatible agent |

---

## 🗣️ Supported TTS Languages

Arabic · Danish · German · Greek · **English** · **Spanish** · Finnish · French · Hebrew · Hindi · Italian · Japanese · Korean · Malay · Dutch · Norwegian · Polish · **Portuguese** · Russian · Swedish · Swahili · Turkish · Chinese
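
For scripting against the list above, a small lookup table can be handy. The sketch below pairs each language name with its standard ISO 639-1 code. Whether AnimaStudio uses these exact identifiers internally is an assumption, and `TTS_LANGUAGES` and `language_code` are hypothetical names.

```python
# Hypothetical lookup table: display name -> ISO 639-1 code.
# The names come from the list above; the app's internal identifiers may differ.
TTS_LANGUAGES = {
    "Arabic": "ar", "Danish": "da", "German": "de", "Greek": "el",
    "English": "en", "Spanish": "es", "Finnish": "fi", "French": "fr",
    "Hebrew": "he", "Hindi": "hi", "Italian": "it", "Japanese": "ja",
    "Korean": "ko", "Malay": "ms", "Dutch": "nl", "Norwegian": "no",
    "Polish": "pl", "Portuguese": "pt", "Russian": "ru", "Swedish": "sv",
    "Swahili": "sw", "Turkish": "tr", "Chinese": "zh",
}


def language_code(name: str) -> str:
    """Return the ISO 639-1 code for a supported language name (case-insensitive)."""
    wanted = name.strip().lower()
    for known, code in TTS_LANGUAGES.items():
        if known.lower() == wanted:
            return code
    raise ValueError(f"Unsupported TTS language: {name!r}")
```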

---

## 📐 Output Formats

| Preset | Dimensions | Best for |
|--------|-----------|----------|
| ▮ 9:16 | 576 × 1024 | Mobile, Reels, TikTok |
| ◻ 1:1 | 512 × 512 | Social media, thumbnails |
| ▬ 16:9 | 1024 × 576 | Presentations, YouTube |
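
If you drive the Space from a script, the presets can be represented as a simple mapping. This is a minimal sketch with dimensions copied from the table above; `ASPECT_PRESETS` and `resolve_dimensions` are hypothetical names, not part of the app's API.

```python
# Preset label -> (width, height) in pixels, copied from the table above.
ASPECT_PRESETS = {
    "9:16": (576, 1024),
    "1:1": (512, 512),
    "16:9": (1024, 576),
}


def resolve_dimensions(aspect_ratio: str) -> tuple[int, int]:
    """Return (width, height) for a preset, rejecting unknown labels."""
    try:
        return ASPECT_PRESETS[aspect_ratio]
    except KeyError:
        raise ValueError(f"Unknown aspect ratio: {aspect_ratio!r}") from None
```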

---

## ⚙️ Advanced Settings

| Setting | Default | Range | Description |
|---------|---------|-------|-------------|
| **Inference Steps** | 20 | 5–50 | More steps = higher quality, slower |
| **Guidance Scale** | 3.5 | 1–10 | Higher = audio followed more strictly |
| **Emotion Intensity** | 0.5 | 0–1 | Controls expressiveness of TTS voice |
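
The defaults and ranges above can be checked on the client side before a request is sent. The following is a sketch only: the field names mirror the MCP tool parameters (`num_steps`, `guidance_scale`, `emotion`), while the `AdvancedSettings` class itself is hypothetical.

```python
from dataclasses import dataclass

# Inclusive (min, max) ranges copied from the table above.
RANGES = {
    "num_steps": (5, 50),
    "guidance_scale": (1.0, 10.0),
    "emotion": (0.0, 1.0),
}


@dataclass(frozen=True)
class AdvancedSettings:
    """Hypothetical container for the three advanced settings."""
    num_steps: int = 20
    guidance_scale: float = 3.5
    emotion: float = 0.5

    def __post_init__(self):
        # Reject any value outside its documented range.
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
```

For example, `AdvancedSettings(num_steps=30)` keeps the other defaults, while `AdvancedSettings(emotion=1.5)` raises `ValueError`.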

---

## 🤖 MCP Server

AnimaStudio runs as an **MCP (Model Context Protocol) server**, enabling AI agents to generate talking head videos programmatically.

### Using with Claude Desktop

Add the following to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "animastudio": {
      "url": "https://lulavc-animastudio.hf.space/gradio_api/mcp/sse"
    }
  }
}
```

### Tool parameters

- **portrait_image** — portrait photo (file path or base64)
- **text** — text for the avatar to speak (text mode)
- **tts_language** — language for speech synthesis (23 options)
- **voice_ref** — optional voice reference audio for cloning
- **audio_file** — audio file path (audio mode)
- **aspect_ratio** — output format (9:16, 1:1, 16:9)
- **emotion** — emotion intensity 0–1
- **num_steps** — inference steps (default 20)
- **guidance_scale** — guidance scale (default 3.5)
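
To make the two input modes concrete, here is a sketch of how a client might assemble the arguments for a call. The parameter names come from the list above. The helper `build_tool_arguments`, the default values for `tts_language` and `aspect_ratio`, and the rule that text mode and audio mode are mutually exclusive are assumptions, not documented behavior.

```python
from typing import Optional


def build_tool_arguments(
    portrait_image: str,
    text: Optional[str] = None,
    audio_file: Optional[str] = None,
    tts_language: str = "English",   # assumed default
    voice_ref: Optional[str] = None,
    aspect_ratio: str = "9:16",      # assumed default
    emotion: float = 0.5,
    num_steps: int = 20,
    guidance_scale: float = 3.5,
) -> dict:
    """Assemble an arguments dict using the parameter names listed above."""
    # Assumption: exactly one of text (text mode) or audio_file (audio mode) is given.
    if (text is None) == (audio_file is None):
        raise ValueError("Provide exactly one of text or audio_file")
    args = {
        "portrait_image": portrait_image,
        "aspect_ratio": aspect_ratio,
        "num_steps": num_steps,
        "guidance_scale": guidance_scale,
    }
    if text is not None:
        # TTS-only parameters are included only in text mode.
        args.update(text=text, tts_language=tts_language, emotion=emotion)
        if voice_ref is not None:
            args["voice_ref"] = voice_ref
    else:
        args["audio_file"] = audio_file
    return args
```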

---

## 🎬 Video Dubbing (Phase 2)

Upload any video (up to 60 seconds) and dub it into a different language. The pipeline runs in four stages:

1. **Whisper Turbo** transcribes the original speech (auto-detects language)
2. **NLLB-200** translates the transcript to the target language
3. **Chatterbox TTS** synthesizes the translated speech (with optional voice cloning)
4. **ffmpeg** muxes the new audio track onto the original video
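
The stages above can be sketched as a small orchestration function. This is not the app's actual code: `dub_video` and the five stage callables are hypothetical stand-ins for ffmpeg audio extraction, Whisper Turbo, NLLB-200, Chatterbox TTS, and the ffmpeg mux step. They are passed in as arguments so the control flow, including the same-language shortcut, is visible without loading any model.

```python
from typing import Callable, Optional


def dub_video(
    video_path: str,
    target_language: str,
    *,
    extract_audio: Callable[[str], str],
    transcribe: Callable[[str], tuple[str, str]],
    translate: Callable[[str, str, str], str],
    synthesize: Callable[[str, str, Optional[str]], str],
    mux: Callable[[str, str], str],
    voice_ref: Optional[str] = None,
) -> str:
    """Run the dubbing stages in order and return the dubbed video path."""
    audio_path = extract_audio(video_path)
    # transcribe returns (transcript, detected source language)
    transcript, source_language = transcribe(audio_path)
    if source_language == target_language:
        translated = transcript  # same language: skip translation
    else:
        translated = translate(transcript, source_language, target_language)
    dubbed_audio = synthesize(translated, target_language, voice_ref)
    return mux(video_path, dubbed_audio)
```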

### Dubbing Settings

| Setting | Details |
|---------|---------|
| **Input Video** | Any video with speech, up to 60 seconds |
| **Target Language** | Any of the 23 supported languages |
| **Voice Reference** | Optional audio clip to clone the speaker's voice style |

> Same language as source? The pipeline skips translation and re-synthesizes the audio directly.
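
A client can check the documented limits before uploading anything. The sketch below assumes you already know the clip duration in seconds (for example from `ffprobe`). `check_dub_request` and `MAX_DUB_SECONDS` are hypothetical names, and the app's own validation may differ.

```python
MAX_DUB_SECONDS = 60.0  # limit stated in the settings table above


def check_dub_request(
    duration_s: float,
    target_language: str,
    supported_languages: set[str],
) -> None:
    """Raise ValueError if the request violates the documented limits."""
    if duration_s <= 0:
        raise ValueError("Video has no playable duration")
    if duration_s > MAX_DUB_SECONDS:
        raise ValueError(
            f"Video is {duration_s:.1f} s long; the limit is {MAX_DUB_SECONDS:.0f} s"
        )
    if target_language not in supported_languages:
        raise ValueError(f"Unsupported target language: {target_language!r}")
```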

---

## 🔧 Technical Details

### Models

| Model | Purpose | License | VRAM |
|-------|---------|---------|------|
| [EchoMimic V3 Flash](https://huggingface.co/BadToBest/EchoMimicV3) | Talking head video generation | Apache 2.0 | ~12 GB |
| [Chatterbox Multilingual](https://huggingface.co/ResembleAI/chatterbox) | 23-language TTS with voice cloning | MIT | ~8 GB |
| [Whisper Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | Speech transcription (809M params) | MIT | ~2 GB |
| [NLLB-200 Distilled 600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | Text translation (23 languages) | CC-BY-NC-4.0 | API (no local GPU) |

### Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│  Tab 1: Create Video                                            │
│                                                                 │
│  Portrait Photo + Text ──→ Chatterbox TTS ──→ Audio WAV         │
│                                                    │            │
│  Portrait Photo + Audio ───────────────────────────┤            │
│                                                    ▼            │
│                                        EchoMimic V3 Flash       │
│                                        (lip-sync animation)     │
│                                                    │            │
│                                                    ▼            │
│                                        MP4 Video Output         │
├─────────────────────────────────────────────────────────────────┤
│  Tab 2: Dub Video                                               │
│                                                                 │
│  Video ──→ ffmpeg (extract audio)                               │
│                  │                                              │
│                  ▼                                              │
│            Whisper Turbo (transcribe + detect language)          │
│                  │                                              │
│                  ▼                                              │
│            NLLB-200 (translate to target language)               │
│                  │                                              │
│                  ▼                                              │
│            Chatterbox TTS (synthesize translated speech)         │
│                  │                                              │
│                  ▼                                              │
│            ffmpeg (mux new audio onto original video)            │
│                  │                                              │
│                  ▼                                              │
│            Dubbed MP4 Output                                    │
└─────────────────────────────────────────────────────────────────┘
```
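
The two ffmpeg steps in the Dub Video path could be expressed as the argument lists below. This is a sketch under assumptions: the README does not state the exact flags the app uses, so the 16 kHz mono WAV extraction (the input format Whisper models expect) and the AAC audio codec are illustrative choices, and both function names are hypothetical.

```python
def extract_audio_cmd(video_path: str, wav_path: str, sample_rate: int = 16000) -> list[str]:
    """ffmpeg arguments that drop the video stream and write mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",           # overwrite the output if it exists
        "-i", video_path,
        "-vn",                    # no video in the output
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to the requested rate
        "-c:a", "pcm_s16le",
        wav_path,
    ]


def mux_audio_cmd(video_path: str, audio_path: str, out_path: str) -> list[str]:
    """ffmpeg arguments that keep the original video and swap in a new audio track."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-map", "0:v:0",   # video stream from the first input
        "-map", "1:a:0",   # audio stream from the second input
        "-c:v", "copy",    # copy the video without re-encoding
        "-c:a", "aac",
        out_path,
    ]
```

Either list can be passed to `subprocess.run(cmd, check=True)`. The synthesized speech will rarely match the original duration exactly; how the app reconciles the two lengths is not documented here.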

### VRAM Management

Models run sequentially on ZeroGPU (A10G, 24 GB):

**Create Video tab:**
1. Chatterbox TTS → generates audio → offloads to CPU
2. EchoMimic V3 → generates video → offloads to CPU

`torch.cuda.empty_cache()` is called between stages.

**Dub Video tab:**
1. Whisper Turbo → transcribes audio (~2 GB) → offloads to CPU
2. NLLB-200 → translates via HF Inference API (no local GPU)
3. Chatterbox TTS → synthesizes dubbed speech → offloads to CPU

`torch.cuda.empty_cache()` is called between stages.

Peak usage never exceeds ~16 GB.
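
The load, run, offload pattern described above can be sketched as a context manager. This is an illustration rather than the app's code: `on_gpu` is a hypothetical helper that works with any object exposing a PyTorch-style `.to(device)` method, and the PyTorch import is optional so the control flow can be dry-run without a GPU.

```python
from contextlib import contextmanager

try:
    import torch
except ImportError:  # lets the sketch be dry-run without PyTorch installed
    torch = None


def _empty_cuda_cache():
    """Release cached CUDA blocks if PyTorch and a GPU are present."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


@contextmanager
def on_gpu(model, device="cuda", release=_empty_cuda_cache):
    """Move a model to the GPU for one stage, then offload it and free the cache."""
    model.to(device)
    try:
        yield model
    finally:
        # Offload even if the stage raised, so the next stage starts clean.
        model.to("cpu")
        release()
```

Each stage would then run inside its own `with on_gpu(model):` block. With only one model resident at a time, the peak is governed by the largest single stage (about 12 GB for EchoMimic V3 in the Models table) plus working memory, rather than by the 20 GB sum of EchoMimic V3 and Chatterbox.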

---

## 💡 Tips for Best Results

### Create Video
1. **Use a clear, front-facing portrait** — well-lit, neutral background, face filling most of the frame
2. **Keep audio under 20 seconds** — shorter = faster generation, tighter lip sync
3. **Add a voice reference** — upload a 5–15 second clip in the target language for natural voice cloning
4. **Match language to text** — select the correct TTS language to avoid accent issues
5. **Emotion 0.4–0.6** — sweet spot for natural-sounding delivery
6. **9:16 for social** — perfect for Reels, TikTok, and Stories
7. **20–30 steps** — good quality/speed trade-off for most use cases

### Dub Video
8. **Keep videos under 60 seconds** — pipeline enforces this limit for VRAM and quality
9. **Clear speech works best** — minimal background music/noise gives cleaner transcriptions
10. **Add a voice reference** — clone the original speaker's voice for a more natural dub
11. **Single-speaker videos** — the pipeline works best with one speaker at a time

---

## 🛠️ Running Locally

```bash
git clone https://huggingface.co/spaces/lulavc/AnimaStudio
cd AnimaStudio
pip install -r requirements.txt
python app.py
```

Running locally requires a CUDA GPU with at least 16 GB of VRAM. Set the `HF_TOKEN` environment variable if you need access to private models.
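
Before launching `app.py`, a quick pre-flight check can confirm that the machine meets these requirements. The sketch below is hypothetical and not part of the repository. GPUs usually report slightly less memory than their nominal size, so the 10% slack is an assumption chosen to avoid rejecting a nominal 16 GB card.

```python
import os

MIN_VRAM_GB = 16  # requirement stated above


def has_enough_vram(total_bytes: int, minimum_gb: float = MIN_VRAM_GB,
                    slack: float = 0.10) -> bool:
    """True if total_bytes is within `slack` of the nominal minimum (decimal GB)."""
    return total_bytes >= minimum_gb * (1.0 - slack) * 1e9


def preflight() -> list[str]:
    """Return human-readable findings; an empty list means nothing to report."""
    findings = []
    try:
        import torch
    except ImportError:
        return ["PyTorch is not installed"]
    if not torch.cuda.is_available():
        findings.append("No CUDA GPU detected")
    else:
        total = torch.cuda.get_device_properties(0).total_memory
        if not has_enough_vram(total):
            findings.append(
                f"GPU 0 reports {total / 1e9:.1f} GB; about {MIN_VRAM_GB} GB is required"
            )
    if not os.environ.get("HF_TOKEN"):
        findings.append("HF_TOKEN is not set (only needed for private models)")
    return findings
```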

---

## 📄 License

- **Space code:** Apache 2.0
- **EchoMimic V3:** [Apache 2.0](https://huggingface.co/BadToBest/EchoMimicV3)
- **Chatterbox TTS:** [MIT](https://huggingface.co/ResembleAI/chatterbox)
- **Whisper Turbo:** [MIT](https://huggingface.co/openai/whisper-large-v3-turbo)
- **NLLB-200:** [CC-BY-NC-4.0](https://huggingface.co/facebook/nllb-200-distilled-600M)

---

**Space by [lulavc](https://huggingface.co/lulavc)** · Powered by [EchoMimic V3](https://huggingface.co/BadToBest/EchoMimicV3) + [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) + [Whisper Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) + [NLLB-200](https://huggingface.co/facebook/nllb-200-distilled-600M) · ZeroGPU · A10G