---
title: AnimaStudio 🎬
emoji: 🎬
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.0.2
app_file: app.py
pinned: true
license: apache-2.0
short_description: AI talking head & video dubbing – free, 23 languages
tags:
  - video-generation
  - talking-head
  - lip-sync
  - avatar
  - tts
  - voice-cloning
  - multilingual
  - mcp-server
  - echomimic
  - chatterbox
  - dubbing
  - whisper
  - nllb
---

# 🎬 AnimaStudio

Turn any portrait photo into a talking head video using your voice or typed text, or dub any video into 23 languages – free, no sign-up required.


## ✨ Features

| Feature | Details |
|---------|---------|
| 🎭 Realistic Lip Sync | EchoMimic V3 Flash (AAAI 2026) – state-of-the-art talking head animation |
| 🗣️ 23-Language TTS | Type text in any of 23 languages and generate natural speech |
| 🎙️ Voice Cloning | Upload a voice reference clip to clone the speaking style |
| 📤 Audio Upload | Upload your own WAV / MP3 / FLAC instead of using TTS |
| 🎬 Video Dubbing | Upload a video (up to 60 s) and dub it into any of the 23 supported languages |
| 📐 3 Aspect Ratios | 9:16 mobile, 1:1 square, 16:9 landscape |
| 🌍 4 UI Languages | Full interface in English, Português (BR), Español, and عربي |
| 📥 Download | One-click download of the generated MP4 |
| 🤖 MCP Server | Use as a tool in Claude, Cursor, and any MCP-compatible agent |

๐Ÿ—ฃ๏ธ Supported TTS Languages

Arabic ยท Danish ยท German ยท Greek ยท English ยท Spanish ยท Finnish ยท French ยท Hebrew ยท Hindi ยท Italian ยท Japanese ยท Korean ยท Malay ยท Dutch ยท Norwegian ยท Polish ยท Portuguese ยท Russian ยท Swedish ยท Swahili ยท Turkish ยท Chinese
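For programmatic use (e.g. via the MCP tool), the 23 languages above map naturally onto ISO 639-1 codes. The table below is a sketch of that mapping; the exact identifiers the app expects are defined in `app.py` and may differ:

```python
# Hypothetical ISO 639-1 codes for the 23 supported TTS languages.
# The identifiers actually accepted by the app are defined in app.py.
TTS_LANGUAGES = {
    "ar": "Arabic", "da": "Danish", "de": "German", "el": "Greek",
    "en": "English", "es": "Spanish", "fi": "Finnish", "fr": "French",
    "he": "Hebrew", "hi": "Hindi", "it": "Italian", "ja": "Japanese",
    "ko": "Korean", "ms": "Malay", "nl": "Dutch", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "sv": "Swedish",
    "sw": "Swahili", "tr": "Turkish", "zh": "Chinese",
}
```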


๐Ÿ“ Output Formats

Preset Dimensions Best for
โ–ฎ 9:16 576 ร— 1024 Mobile, Reels, TikTok
โ—ป 1:1 512 ร— 512 Social media, thumbnails
โ–ฌ 16:9 1024 ร— 576 Presentations, YouTube
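In code, the presets reduce to a small lookup table (a sketch with names of my own; the app's internal representation may differ):

```python
# Preset name -> (width, height) in pixels, per the table above.
ASPECT_PRESETS = {
    "9:16": (576, 1024),   # mobile / Reels / TikTok
    "1:1":  (512, 512),    # square social posts / thumbnails
    "16:9": (1024, 576),   # presentations / YouTube
}
```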

โš™๏ธ Advanced Settings

Setting Default Range Description
Inference Steps 20 5โ€“50 More steps = higher quality, slower
Guidance Scale 3.5 1โ€“10 Higher = audio followed more strictly
Emotion Intensity 0.5 0โ€“1 Controls expressiveness of TTS voice

## 🤖 MCP Server

AnimaStudio runs as an MCP (Model Context Protocol) server, enabling AI agents to generate talking head videos programmatically.

### Using with Claude Desktop

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "animastudio": {
      "url": "https://lulavc-animastudio.hf.space/gradio_api/mcp/sse"
    }
  }
}
```

### Tool parameters

- `portrait_image` – portrait photo (file path or base64)
- `text` – text for the avatar to speak (text mode)
- `tts_language` – language for speech synthesis (23 options)
- `voice_ref` – optional voice reference audio for cloning
- `audio_file` – audio file path (audio mode)
- `aspect_ratio` – output format (9:16, 1:1, 16:9)
- `emotion` – emotion intensity, 0–1
- `num_steps` – inference steps (default 20)
- `guidance_scale` – guidance scale (default 3.5)

## 🎬 Video Dubbing (Phase 2)

Upload any video (up to 60 seconds) and dub it into a different language. The pipeline:

1. Whisper Turbo transcribes the original speech (auto-detects the language)
2. NLLB-200 translates the transcript into the target language
3. Chatterbox TTS synthesizes the translated speech (with optional voice cloning)
4. ffmpeg muxes the new audio track onto the original video
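The ffmpeg halves of the pipeline (steps 1 and 4) amount to two invocations. A sketch of the argument lists, with helper names of my own (the app's actual commands may use different options):

```python
def extract_audio_cmd(video_path: str, wav_path: str) -> list[str]:
    # Step 1 input: drop the video stream (-vn) and resample to
    # 16 kHz mono, the format Whisper expects.
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn", "-ac", "1", "-ar", "16000", wav_path]

def mux_audio_cmd(video_path: str, dubbed_wav: str, out_path: str) -> list[str]:
    # Step 4: keep the original video stream untouched (-c:v copy)
    # and replace the audio track with the dubbed one.
    return ["ffmpeg", "-y", "-i", video_path, "-i", dubbed_wav,
            "-map", "0:v", "-map", "1:a", "-c:v", "copy",
            "-shortest", out_path]
```

Each list can be passed directly to `subprocess.run(...)`.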

### Dubbing Settings

| Setting | Details |
|---------|---------|
| Input Video | Any video with speech, up to 60 seconds |
| Target Language | Any of the 23 supported languages |
| Voice Reference | Optional audio clip to clone the speaker's voice style |

**Same language as source?** The pipeline skips translation and re-synthesizes the audio directly.


## 🔧 Technical Details

### Models

| Model | Purpose | License | VRAM |
|-------|---------|---------|------|
| EchoMimic V3 Flash | Talking head video generation | Apache 2.0 | ~12 GB |
| Chatterbox Multilingual | 23-language TTS with voice cloning | MIT | ~8 GB |
| Whisper Turbo | Speech transcription (809M params) | MIT | ~2 GB |
| NLLB-200 Distilled 600M | Text translation (23 languages) | CC-BY-NC-4.0 | API (no local GPU) |

### Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│  Tab 1: Create Video                                            │
│                                                                 │
│  Portrait Photo + Text ──→ Chatterbox TTS ──→ Audio WAV         │
│                                                    │            │
│  Portrait Photo + Audio ───────────────────────────┤            │
│                                                    ▼            │
│                                        EchoMimic V3 Flash       │
│                                        (lip-sync animation)     │
│                                                    │            │
│                                                    ▼            │
│                                        MP4 Video Output         │
├─────────────────────────────────────────────────────────────────┤
│  Tab 2: Dub Video                                               │
│                                                                 │
│  Video ──→ ffmpeg (extract audio)                               │
│                  │                                              │
│                  ▼                                              │
│            Whisper Turbo (transcribe + detect language)         │
│                  │                                              │
│                  ▼                                              │
│            NLLB-200 (translate to target language)              │
│                  │                                              │
│                  ▼                                              │
│            Chatterbox TTS (synthesize translated speech)        │
│                  │                                              │
│                  ▼                                              │
│            ffmpeg (mux new audio onto original video)           │
│                  │                                              │
│                  ▼                                              │
│            Dubbed MP4 Output                                    │
└─────────────────────────────────────────────────────────────────┘
```

### VRAM Management

Models run sequentially on ZeroGPU (A10G, 24 GB):

**Create Video tab:**

1. Chatterbox TTS → generates audio → offloads to CPU
2. EchoMimic V3 → generates video → offloads to CPU
3. `torch.cuda.empty_cache()` between stages

**Dub Video tab:**

1. Whisper Turbo → transcribes audio (~2 GB) → offloads to CPU
2. NLLB-200 → translates via the HF Inference API (no local GPU)
3. Chatterbox TTS → synthesizes dubbed speech → offloads to CPU
4. `torch.cuda.empty_cache()` between stages

Peak usage never exceeds ~16 GB.
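The payoff of sequential execution with CPU offloading is that peak VRAM is the maximum over stages rather than their sum. A toy illustration using the approximate model footprints from the Models table (weights only; activations add overhead on top):

```python
# Approximate VRAM footprints in GB, from the Models table above.
STAGES = {"chatterbox_tts": 8, "echomimic_v3": 12}

def peak_vram(stages: dict[str, int], sequential: bool) -> int:
    # With offloading, only one model is resident at a time, so the
    # peak is the largest single stage; without it, footprints add up.
    return max(stages.values()) if sequential else sum(stages.values())

print(peak_vram(STAGES, sequential=True))    # 12 (fits in 24 GB with headroom)
print(peak_vram(STAGES, sequential=False))   # 20 (plus activations would risk OOM)
```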


## 💡 Tips for Best Results

### Create Video

1. **Use a clear, front-facing portrait** – well-lit, neutral background, face filling most of the frame
2. **Keep audio under 20 seconds** – shorter means faster generation and tighter lip sync
3. **Add a voice reference** – upload a 5–15 second clip in the target language for natural voice cloning
4. **Match language to text** – select the correct TTS language to avoid accent issues
5. **Emotion 0.4–0.6** – the sweet spot for natural-sounding delivery
6. **9:16 for social** – perfect for Reels, TikTok, and Stories
7. **20–30 steps** – a good quality/speed trade-off for most use cases

### Dub Video

1. **Keep videos under 60 seconds** – the pipeline enforces this limit for VRAM and quality
2. **Clear speech works best** – minimal background music/noise gives cleaner transcriptions
3. **Add a voice reference** – clone the original speaker's voice for a more natural dub
4. **Single-speaker videos** – the pipeline works best with one speaker at a time

๐Ÿ› ๏ธ Running Locally

git clone https://huggingface.co/spaces/lulavc/AnimaStudio
cd AnimaStudio
pip install -r requirements.txt
python app.py

Requires a CUDA GPU with at least 16 GB VRAM. Set HF_TOKEN for private model access.


## 📄 License

Apache 2.0. Individual models keep their own licenses (see the Models table under Technical Details).
Space by lulavc · Powered by EchoMimic V3 + Chatterbox + Whisper Turbo + NLLB-200 · ZeroGPU · A10G