---
title: AnimaStudio 🎬
emoji: 🎬
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.0.2
app_file: app.py
pinned: true
license: apache-2.0
short_description: AI talking head & video dubbing – free, 23 languages
tags:
  - video-generation
  - talking-head
  - lip-sync
  - avatar
  - tts
  - voice-cloning
  - multilingual
  - mcp-server
  - echomimic
  - chatterbox
  - dubbing
  - whisper
  - nllb
---

# 🎬 AnimaStudio

Turn any portrait photo into a talking head video using your voice or typed text, or dub any video into 23 languages – free, no sign-up required.


## ✨ Features

| Feature | Details |
|---------|---------|
| 🎭 Realistic Lip Sync | EchoMimic V3 Flash (AAAI 2026) – state-of-the-art talking head animation |
| 🗣️ 23-Language TTS | Type text in any of 23 languages and generate natural speech |
| 🎙️ Voice Cloning | Upload a voice reference clip to clone the speaking style |
| 📤 Audio Upload | Upload your own WAV / MP3 / FLAC instead of using TTS |
| 🎬 Video Dubbing | Upload a video (up to 60 s) and dub it into any of the 23 supported languages |
| 📐 3 Aspect Ratios | 9:16 mobile, 1:1 square, 16:9 landscape |
| 🌍 4 UI Languages | Full interface in English, Português (BR), Español, and عربي |
| 📥 Download | One-click download of the generated MP4 |
| 🤖 MCP Server | Use as a tool in Claude, Cursor, and any MCP-compatible agent |

๐Ÿ—ฃ๏ธ Supported TTS Languages

Arabic ยท Danish ยท German ยท Greek ยท English ยท Spanish ยท Finnish ยท French ยท Hebrew ยท Hindi ยท Italian ยท Japanese ยท Korean ยท Malay ยท Dutch ยท Norwegian ยท Polish ยท Portuguese ยท Russian ยท Swedish ยท Swahili ยท Turkish ยท Chinese
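For programmatic use (e.g. via the MCP tool), the 23 languages above map naturally onto ISO 639-1 codes. The table below is a sketch of that mapping; the exact identifiers the app expects are defined in `app.py` and may differ:

```python
# Hypothetical ISO 639-1 codes for the 23 supported TTS languages.
# The identifiers actually accepted by the app are defined in app.py.
TTS_LANGUAGES = {
    "ar": "Arabic", "da": "Danish", "de": "German", "el": "Greek",
    "en": "English", "es": "Spanish", "fi": "Finnish", "fr": "French",
    "he": "Hebrew", "hi": "Hindi", "it": "Italian", "ja": "Japanese",
    "ko": "Korean", "ms": "Malay", "nl": "Dutch", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "sv": "Swedish",
    "sw": "Swahili", "tr": "Turkish", "zh": "Chinese",
}
```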


๐Ÿ“ Output Formats

Preset Dimensions Best for
โ–ฎ 9:16 576 ร— 1024 Mobile, Reels, TikTok
โ—ป 1:1 512 ร— 512 Social media, thumbnails
โ–ฌ 16:9 1024 ร— 576 Presentations, YouTube
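In code, the presets reduce to a small lookup table (a sketch with names of my own; the app's internal representation may differ):

```python
# Preset name -> (width, height) in pixels, per the table above.
ASPECT_PRESETS = {
    "9:16": (576, 1024),   # mobile / Reels / TikTok
    "1:1":  (512, 512),    # square social posts / thumbnails
    "16:9": (1024, 576),   # presentations / YouTube
}
```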

โš™๏ธ Advanced Settings

Setting Default Range Description
Inference Steps 20 5โ€“50 More steps = higher quality, slower
Guidance Scale 3.5 1โ€“10 Higher = audio followed more strictly
Emotion Intensity 0.5 0โ€“1 Controls expressiveness of TTS voice

## 🤖 MCP Server

AnimaStudio runs as an MCP (Model Context Protocol) server, enabling AI agents to generate talking head videos programmatically.

### Using with Claude Desktop

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "animastudio": {
      "url": "https://lulavc-animastudio.hf.space/gradio_api/mcp/sse"
    }
  }
}
```

### Tool parameters

- `portrait_image` – portrait photo (file path or base64)
- `text` – text for the avatar to speak (text mode)
- `tts_language` – language for speech synthesis (23 options)
- `voice_ref` – optional voice reference audio for cloning
- `audio_file` – audio file path (audio mode)
- `aspect_ratio` – output format (9:16, 1:1, 16:9)
- `emotion` – emotion intensity, 0–1
- `num_steps` – inference steps (default 20)
- `guidance_scale` – guidance scale (default 3.5)

## 🎬 Video Dubbing (Phase 2)

Upload any video (up to 60 seconds) and dub it into a different language. The pipeline:

1. Whisper Turbo transcribes the original speech (auto-detects the language)
2. NLLB-200 translates the transcript into the target language
3. Chatterbox TTS synthesizes the translated speech (with optional voice cloning)
4. ffmpeg muxes the new audio track onto the original video
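The ffmpeg halves of the pipeline (steps 1 and 4) amount to two invocations. A sketch of the argument lists, with helper names of my own (the app's actual commands may use different options):

```python
def extract_audio_cmd(video_path: str, wav_path: str) -> list[str]:
    # Step 1 input: drop the video stream (-vn) and resample to
    # 16 kHz mono, the format Whisper expects.
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn", "-ac", "1", "-ar", "16000", wav_path]

def mux_audio_cmd(video_path: str, dubbed_wav: str, out_path: str) -> list[str]:
    # Step 4: keep the original video stream untouched (-c:v copy)
    # and replace the audio track with the dubbed one.
    return ["ffmpeg", "-y", "-i", video_path, "-i", dubbed_wav,
            "-map", "0:v", "-map", "1:a", "-c:v", "copy",
            "-shortest", out_path]
```

Each list can be passed directly to `subprocess.run(...)`.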

### Dubbing Settings

| Setting | Details |
|---------|---------|
| Input Video | Any video with speech, up to 60 seconds |
| Target Language | Any of the 23 supported languages |
| Voice Reference | Optional audio clip to clone the speaker's voice style |

**Same language as source?** The pipeline skips translation and re-synthesizes the audio directly.


## 🔧 Technical Details

### Models

| Model | Purpose | License | VRAM |
|-------|---------|---------|------|
| EchoMimic V3 Flash | Talking head video generation | Apache 2.0 | ~12 GB |
| Chatterbox Multilingual | 23-language TTS with voice cloning | MIT | ~8 GB |
| Whisper Turbo | Speech transcription (809M params) | MIT | ~2 GB |
| NLLB-200 Distilled 600M | Text translation (23 languages) | CC-BY-NC-4.0 | API (no local GPU) |

### Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│  Tab 1: Create Video                                            │
│                                                                 │
│  Portrait Photo + Text ──→ Chatterbox TTS ──→ Audio WAV         │
│                                                    │            │
│  Portrait Photo + Audio ───────────────────────────┤            │
│                                                    ▼            │
│                                        EchoMimic V3 Flash       │
│                                        (lip-sync animation)     │
│                                                    │            │
│                                                    ▼            │
│                                        MP4 Video Output         │
├─────────────────────────────────────────────────────────────────┤
│  Tab 2: Dub Video                                               │
│                                                                 │
│  Video ──→ ffmpeg (extract audio)                               │
│                  │                                              │
│                  ▼                                              │
│            Whisper Turbo (transcribe + detect language)         │
│                  │                                              │
│                  ▼                                              │
│            NLLB-200 (translate to target language)              │
│                  │                                              │
│                  ▼                                              │
│            Chatterbox TTS (synthesize translated speech)        │
│                  │                                              │
│                  ▼                                              │
│            ffmpeg (mux new audio onto original video)           │
│                  │                                              │
│                  ▼                                              │
│            Dubbed MP4 Output                                    │
└─────────────────────────────────────────────────────────────────┘
```

### VRAM Management

Models run sequentially on ZeroGPU (A10G, 24 GB):

**Create Video tab:**

1. Chatterbox TTS → generates audio → offloads to CPU
2. EchoMimic V3 → generates video → offloads to CPU
3. `torch.cuda.empty_cache()` between stages

**Dub Video tab:**

1. Whisper Turbo → transcribes audio (~2 GB) → offloads to CPU
2. NLLB-200 → translates via the HF Inference API (no local GPU)
3. Chatterbox TTS → synthesizes dubbed speech → offloads to CPU
4. `torch.cuda.empty_cache()` between stages

Peak usage never exceeds ~16 GB.
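The payoff of sequential execution with CPU offloading is that peak VRAM is the maximum over stages rather than their sum. A toy illustration using the approximate model footprints from the Models table (weights only; activations add overhead on top):

```python
# Approximate VRAM footprints in GB, from the Models table above.
STAGES = {"chatterbox_tts": 8, "echomimic_v3": 12}

def peak_vram(stages: dict[str, int], sequential: bool) -> int:
    # With offloading, only one model is resident at a time, so the
    # peak is the largest single stage; without it, footprints add up.
    return max(stages.values()) if sequential else sum(stages.values())

print(peak_vram(STAGES, sequential=True))    # 12 (fits in 24 GB with headroom)
print(peak_vram(STAGES, sequential=False))   # 20 (plus activations would risk OOM)
```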


## 💡 Tips for Best Results

### Create Video

1. **Use a clear, front-facing portrait** – well-lit, neutral background, face filling most of the frame
2. **Keep audio under 20 seconds** – shorter means faster generation and tighter lip sync
3. **Add a voice reference** – upload a 5–15 second clip in the target language for natural voice cloning
4. **Match language to text** – select the correct TTS language to avoid accent issues
5. **Emotion 0.4–0.6** – the sweet spot for natural-sounding delivery
6. **9:16 for social** – perfect for Reels, TikTok, and Stories
7. **20–30 steps** – a good quality/speed trade-off for most use cases

### Dub Video

1. **Keep videos under 60 seconds** – the pipeline enforces this limit for VRAM and quality
2. **Clear speech works best** – minimal background music/noise gives cleaner transcriptions
3. **Add a voice reference** – clone the original speaker's voice for a more natural dub
4. **Single-speaker videos** – the pipeline works best with one speaker at a time

๐Ÿ› ๏ธ Running Locally

git clone https://huggingface.co/spaces/lulavc/AnimaStudio
cd AnimaStudio
pip install -r requirements.txt
python app.py

Requires a CUDA GPU with at least 16 GB VRAM. Set HF_TOKEN for private model access.


## 📄 License

Apache 2.0. Individual models keep their own licenses (see the Models table under Technical Details).
Space by lulavc · Powered by EchoMimic V3 + Chatterbox + Whisper Turbo + NLLB-200 · ZeroGPU · A10G