
	---
	title: AnimaStudio 🎬
	emoji: 🎬
	colorFrom: purple
	colorTo: pink
	sdk: gradio
	sdk_version: 6.0.2
	app_file: app.py
	pinned: true
	license: apache-2.0
	short_description: AI talking head & video dubbing — free, 23 languages
	tags:
	- video-generation
	- talking-head
	- lip-sync
	- avatar
	- tts
	- voice-cloning
	- multilingual
	- mcp-server
	- echomimic
	- chatterbox
	- dubbing
	- whisper
	- nllb
	---

	# 🎬 AnimaStudio

	Turn any portrait photo into a talking head video using your voice or typed text, or dub any video into 23 languages — free, no sign-up required.

	---

	## ✨ Features

	| Feature | Details |
	|---------|---------|
	| 🎭 Realistic Lip Sync | EchoMimic V3 Flash (AAAI 2026): state-of-the-art talking head animation |
	| 🗣️ 23-Language TTS | Type text in any of 23 languages and generate natural speech |
	| 🎙️ Voice Cloning | Upload a voice reference clip to clone the speaking style |
	| 📤 Audio Upload | Upload your own WAV / MP3 / FLAC instead of using TTS |
	| 🎬 Video Dubbing | Upload a video (up to 60 s) and dub it into any of the 23 supported languages |
	| 📐 3 Aspect Ratios | 9:16 mobile, 1:1 square, 16:9 landscape |
	| 🌐 4 UI Languages | Full interface in English, Português (BR), Español, and عربي |
	| 📥 Download | One-click download of the generated MP4 |
	| 🤖 MCP Server | Use as a tool in Claude, Cursor, and any MCP-compatible agent |

	---

	## 🗣️ Supported TTS Languages

	Arabic · Danish · German · Greek · English · Spanish · Finnish · French · Hebrew · Hindi · Italian · Japanese · Korean · Malay · Dutch · Norwegian · Polish · Portuguese · Russian · Swedish · Swahili · Turkish · Chinese

	---

	## 📐 Output Formats

	| Preset | Dimensions (W × H, px) | Best for |
	|--------|------------------------|----------|
	| ▮ 9:16 | 576 × 1024 | Mobile, Reels, TikTok |
	| ◻ 1:1 | 512 × 512 | Social media, thumbnails |
	| ▬ 16:9 | 1024 × 576 | Presentations, YouTube |
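
The presets map directly to pixel dimensions. A minimal sketch of that lookup follows; `PRESET_DIMENSIONS` and `dimensions_for` are hypothetical names chosen for illustration and are not taken from `app.py`.

```python
# Hypothetical lookup table; the values are copied from the table above.
PRESET_DIMENSIONS = {
    "9:16": (576, 1024),   # portrait: mobile, Reels, TikTok
    "1:1": (512, 512),     # square: social media, thumbnails
    "16:9": (1024, 576),   # landscape: presentations, YouTube
}


def dimensions_for(preset: str) -> tuple:
    """Return (width, height) in pixels for an aspect-ratio preset."""
    try:
        return PRESET_DIMENSIONS[preset]
    except KeyError:
        valid = ", ".join(PRESET_DIMENSIONS)
        raise ValueError(f"unknown preset {preset!r}; expected one of: {valid}") from None


width, height = dimensions_for("9:16")
print(width, height)  # 576 1024
```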

	---

	## ⚙️ Advanced Settings

	| Setting | Default | Range | Description |
	|---------|---------|-------|-------------|
	| Inference Steps | 20 | 5–50 | More steps = higher quality, slower |
	| Guidance Scale | 3.5 | 1–10 | Higher = audio followed more strictly |
	\| Emotion Intensity \| 0.5 \| 0–1 \| Controls expressiveness of TTS voice \|

	---

	## 🤖 MCP Server

	AnimaStudio runs as an MCP (Model Context Protocol) server, enabling AI agents to generate talking head videos programmatically.

	### Using with Claude Desktop

	Add to your `claude_desktop_config.json`:

	```json
	{
	  "mcpServers": {
	    "animastudio": {
	      "url": "https://lulavc-animastudio.hf.space/gradio_api/mcp/sse"
	    }
	  }
	}
	```
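
If the config file already lists other servers, the entry has to be merged rather than pasted over them. The sketch below shows one way to do that; `add_mcp_server` is a hypothetical helper, and the file path is passed in because its location depends on your operating system and install.

```python
import json
from pathlib import Path

ANIMASTUDIO_URL = "https://lulavc-animastudio.hf.space/gradio_api/mcp/sse"


def add_mcp_server(config_path: Path, name: str = "animastudio",
                   url: str = ANIMASTUDIO_URL) -> dict:
    """Merge one server entry into the config file, keeping existing entries."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    # Create the "mcpServers" section if it is missing, then add or overwrite `name`.
    config.setdefault("mcpServers", {})[name] = {"url": url}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Call it as `add_mcp_server(Path("/path/to/claude_desktop_config.json"))`, then restart the client so it reloads the file.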

	### Tool parameters

	- portrait_image — portrait photo (file path or base64)
	- text — text for the avatar to speak (text mode)
	- tts_language — language for speech synthesis (23 options)
	- voice_ref — optional voice reference audio for cloning
	- audio_file — audio file path (audio mode)
	- aspect_ratio — output format (9:16, 1:1, 16:9)
	- emotion — emotion intensity 0–1
	- num_steps — inference steps (default 20)
	- guidance_scale — guidance scale (default 3.5)
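
A client can assemble these parameters into a single arguments object before calling the tool. The sketch below uses the parameter names listed above; `build_tool_arguments` is a hypothetical helper, and three things are assumptions rather than documented behaviour: that text mode and audio mode are mutually exclusive, that `tts_language` takes a language name such as `"English"`, and that `"9:16"` is a sensible default aspect ratio.

```python
from typing import Optional

ASPECT_RATIOS = ("9:16", "1:1", "16:9")


def build_tool_arguments(
    portrait_image: str,
    text: Optional[str] = None,
    audio_file: Optional[str] = None,
    tts_language: str = "English",   # assumed value format
    voice_ref: Optional[str] = None,
    aspect_ratio: str = "9:16",      # assumed default
    emotion: float = 0.5,
    num_steps: int = 20,
    guidance_scale: float = 3.5,
) -> dict:
    """Assemble an arguments dict using the parameter names listed above."""
    # Assumption: exactly one of text mode / audio mode is used per call.
    if (text is None) == (audio_file is None):
        raise ValueError("provide exactly one of text or audio_file")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {ASPECT_RATIOS}")

    args = {
        "portrait_image": portrait_image,
        "aspect_ratio": aspect_ratio,
        "num_steps": num_steps,
        "guidance_scale": guidance_scale,
    }
    if text is not None:
        # Text mode: TTS-related parameters apply.
        args.update(text=text, tts_language=tts_language, emotion=emotion)
        if voice_ref is not None:
            args["voice_ref"] = voice_ref
    else:
        # Audio mode: the uploaded audio drives the animation directly.
        args["audio_file"] = audio_file
    return args
```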

	---

	## 🎬 Video Dubbing (Phase 2)

	Upload any video (up to 60 seconds) and dub it into a different language. The pipeline:

	1. Whisper Turbo transcribes the original speech (auto-detects language)
	2. NLLB-200 translates the transcript to the target language
	3. Chatterbox TTS synthesizes the translated speech (with optional voice cloning)
	4. ffmpeg muxes the new audio track onto the original video

	### Dubbing Settings

	| Setting | Details |
	|---------|---------|
	| Input Video | Any video with speech, up to 60 seconds |
	| Target Language | Any of the 23 supported languages |
	| Voice Reference | Optional audio clip to clone the speaker's voice style |

	> Same language as source? The pipeline skips translation and re-synthesizes the audio directly.
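
The control flow of steps 1–3, including the 60-second limit and the same-language shortcut, can be sketched as follows. This is not the code in `app.py`: `dub_audio` and the three stage callables are hypothetical stand-ins for Whisper, NLLB-200, and Chatterbox, the language identifiers (`"en"`, `"pt"`) are assumed, and the ffmpeg extract and mux steps are left out.

```python
from typing import Callable, Optional, Tuple

MAX_VIDEO_SECONDS = 60  # limit stated in this README


def dub_audio(
    audio_path: str,
    duration_seconds: float,
    target_language: str,
    transcribe: Callable[[str], Tuple[str, str]],
    translate: Callable[[str, str, str], str],
    synthesize: Callable[[str, str, Optional[str]], str],
    voice_ref: Optional[str] = None,
) -> str:
    """Transcribe, optionally translate, then synthesize; return the new audio path."""
    if duration_seconds > MAX_VIDEO_SECONDS:
        raise ValueError(
            f"video is {duration_seconds:.1f} s long; the limit is {MAX_VIDEO_SECONDS} s"
        )

    # Stage 1: transcript plus the detected source language.
    transcript, source_language = transcribe(audio_path)

    # Stage 2: skipped when source and target languages match.
    if source_language == target_language:
        text = transcript
    else:
        text = translate(transcript, source_language, target_language)

    # Stage 3: speech synthesis, with optional voice cloning.
    return synthesize(text, target_language, voice_ref)


# Stub stages so the control flow can be exercised without loading any models.
def fake_transcribe(path: str) -> Tuple[str, str]:
    return "hello", "en"


def fake_translate(text: str, src: str, dst: str) -> str:
    return f"[{src}->{dst}] {text}"


def fake_synthesize(text: str, language: str, voice_ref: Optional[str]) -> str:
    return f"{language}:{text}.wav"


print(dub_audio("clip.wav", 12.0, "pt", fake_transcribe, fake_translate, fake_synthesize))
# pt:[en->pt] hello.wav
```

Passing the stages in as callables keeps the orchestration testable on a machine without a GPU.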

	---

	## 🔧 Technical Details

	### Models

	| Model | Purpose | License | VRAM |
	|-------|---------|---------|------|
	| [EchoMimic V3 Flash](https://huggingface.co/BadToBest/EchoMimicV3) | Talking head video generation | Apache 2.0 | ~12 GB |
	| [Chatterbox Multilingual](https://huggingface.co/ResembleAI/chatterbox) | 23-language TTS with voice cloning | MIT | ~8 GB |
	| [Whisper Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | Speech transcription (809M params) | MIT | ~2 GB |
	| [NLLB-200 Distilled 600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | Text translation (23 languages) | CC-BY-NC-4.0 | API (no local GPU) |

	### Architecture

	```
	┌─────────────────────────────────────────────────────────────────┐
	│  Tab 1: Create Video                                            │
	│                                                                 │
	│  Portrait Photo + Text ──→ Chatterbox TTS ──→ Audio WAV         │
	│                                                   │             │
	│  Portrait Photo + Audio ──────────────────────────┤             │
	│                                                   ▼             │
	│                                          EchoMimic V3 Flash     │
	│                                         (lip-sync animation)    │
	│                                                   │             │
	│                                                   ▼             │
	│                                           MP4 Video Output      │
	├─────────────────────────────────────────────────────────────────┤
	│  Tab 2: Dub Video                                               │
	│                                                                 │
	│  Video ──→ ffmpeg (extract audio)                               │
	│               │                                                 │
	│               ▼                                                 │
	│            Whisper Turbo (transcribe + detect language)         │
	│               │                                                 │
	│               ▼                                                 │
	│            NLLB-200 (translate to target language)              │
	│               │                                                 │
	│               ▼                                                 │
	│            Chatterbox TTS (synthesize translated speech)        │
	│               │                                                 │
	│               ▼                                                 │
	│            ffmpeg (mux new audio onto original video)           │
	│               │                                                 │
	│               ▼                                                 │
	│            Dubbed MP4 Output                                    │
	└─────────────────────────────────────────────────────────────────┘
	```
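
The two ffmpeg steps in the Dub Video tab can be sketched as command builders. The flags below are standard ffmpeg options, but the exact commands used by `app.py` are not shown in this README, so the choices here (mono 16 kHz WAV for transcription, stream-copying the video, AAC audio) are assumptions, and the function names are hypothetical.

```python
import subprocess
from typing import List


def extract_audio_cmd(video_path: str, wav_path: str) -> List[str]:
    """ffmpeg command that drops the video stream and writes mono 16 kHz WAV."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-vn",            # no video in the output
        "-ac", "1",       # mono
        "-ar", "16000",   # 16 kHz, the sample rate Whisper models work with
        wav_path,
    ]


def mux_audio_cmd(video_path: str, dubbed_audio_path: str, output_path: str) -> List[str]:
    """ffmpeg command that keeps the original video stream and swaps in new audio."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", dubbed_audio_path,
        "-map", "0:v:0",   # video from the first input
        "-map", "1:a:0",   # audio from the second input
        "-c:v", "copy",    # do not re-encode the video
        "-c:a", "aac",
        "-shortest",       # stop at the end of the shorter stream
        output_path,
    ]


def run(cmd: List[str]) -> None:
    """Execute a command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)
```

For example, `run(extract_audio_cmd("input.mp4", "speech.wav"))` produces the audio that the transcription stage consumes, and `run(mux_audio_cmd("input.mp4", "dubbed.wav", "dubbed.mp4"))` writes the final file.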

	### VRAM Management

	Models run sequentially on ZeroGPU (A10G, 24 GB):

	Create Video tab:
	1. Chatterbox TTS → generates audio → offloads to CPU
	2. EchoMimic V3 → generates video → offloads to CPU
	3. `torch.cuda.empty_cache()` between stages

	Dub Video tab:
	1. Whisper Turbo → transcribes audio (~2 GB) → offloads to CPU
	2. NLLB-200 → translates via HF Inference API (no local GPU)
	3. Chatterbox TTS → synthesizes dubbed speech → offloads to CPU
	4. `torch.cuda.empty_cache()` between stages

	Because only one model is resident on the GPU at a time, peak usage stays below ~16 GB.
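
The load, run, offload pattern described above can be sketched as a context manager. This is an illustration, not the code in `app.py`: `on_gpu` is a hypothetical helper, and it assumes each model object exposes a PyTorch-style `.to(device)` method.

```python
from contextlib import contextmanager

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


@contextmanager
def on_gpu(model, device: str = "cuda"):
    """Move `model` to `device` for the duration of the block, then offload it."""
    model.to(device)
    try:
        yield model
    finally:
        # Offload even if the stage raised, so the next stage starts clean.
        model.to("cpu")
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached blocks before the next stage


# Usage (model objects and stage functions are placeholders):
#
#     with on_gpu(tts_model) as tts:
#         audio = run_tts(tts, text)
#     with on_gpu(video_model) as vm:
#         video = run_video(vm, portrait, audio)
```

Putting the offload in `finally` means a failed stage does not leave its model on the GPU.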

	---

	## 💡 Tips for Best Results

	### Create Video
	1. Use a clear, front-facing portrait — well-lit, neutral background, face filling most of the frame
	2. Keep audio under 20 seconds — shorter = faster generation, tighter lip sync
	3. Add a voice reference — upload a 5–15 second clip in the target language for natural voice cloning
	4. Match language to text — select the correct TTS language to avoid accent issues
	5. Emotion 0.4–0.6 — sweet spot for natural-sounding delivery
	6. 9:16 for social — perfect for Reels, TikTok, and Stories
	7. 20–30 steps — good quality/speed trade-off for most use cases

	### Dub Video
	8. Keep videos under 60 seconds — pipeline enforces this limit for VRAM and quality
	9. Clear speech works best — minimal background music/noise gives cleaner transcriptions
	10. Add a voice reference — clone the original speaker's voice for a more natural dub
	11. Single-speaker videos — the pipeline works best with one speaker at a time

	---

	## 🛠️ Running Locally

	```bash
	git clone https://huggingface.co/spaces/lulavc/AnimaStudio
	cd AnimaStudio
	pip install -r requirements.txt
	python app.py
	```

	Requires a CUDA GPU with at least 16 GB VRAM. Set `HF_TOKEN` for private model access.

	---

	## 📄 License

	- Space code: Apache 2.0
	- EchoMimic V3: [Apache 2.0](https://huggingface.co/BadToBest/EchoMimicV3)
	- Chatterbox TTS: [MIT](https://huggingface.co/ResembleAI/chatterbox)
	- Whisper Turbo: [MIT](https://huggingface.co/openai/whisper-large-v3-turbo)
	- NLLB-200: [CC-BY-NC-4.0](https://huggingface.co/facebook/nllb-200-distilled-600M)

	---

	Space by [lulavc](https://huggingface.co/lulavc) · Powered by [EchoMimic V3](https://huggingface.co/BadToBest/EchoMimicV3) + [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) + [Whisper Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) + [NLLB-200](https://huggingface.co/facebook/nllb-200-distilled-600M) · ZeroGPU · A10G