# Zen Translator - AI Knowledge Base

**Project**: zen-translator · **Organization**: zenlm · **Repository**: https://github.com/zenlm/zen-translator · **Version**: 0.1.0 · **Last Updated**: 2025-11-27
## Project Overview
Zen Translator is a real-time multimodal translation pipeline that combines speech translation, voice cloning, and lip synchronization for seamless video dubbing and live translation.
## Core Technology Stack
| Component | Model | Parameters | Latency |
|---|---|---|---|
| Translation | Qwen3-Omni-30B-A3B | 30B (3B active MoE) | ~500ms |
| Voice Cloning | CosyVoice 2.0 | 0.5B | ~150ms |
| Lip Sync | Wav2Lip | ~100M | ~200ms |
| Total | - | - | <1 second |
## Language Support
Input (18 languages + 6 dialects):
- English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian
- Arabic, Hindi, Thai, Vietnamese, Indonesian, Malay, Turkish, Polish
- Cantonese (yue), Shanghainese (wuu), Xiang (hsn), Min Nan (nan), Hakka (hak), Min Dong (cdo)
Output (10 languages):
- English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                     Zen Translator Pipeline                     │
├─────────────────┬─────────────────┬─────────────────────────────┤
│ Audio/Video     │ Qwen3-Omni      │ Translation + Understanding │
│ Input           │ (30B MoE)       │ ~500ms                      │
├─────────────────┼─────────────────┼─────────────────────────────┤
│ Translated      │ CosyVoice 2.0   │ Voice Cloning               │
│ Text            │ (0.5B)          │ ~150ms                      │
├─────────────────┼─────────────────┼─────────────────────────────┤
│ Cloned Audio    │ Wav2Lip         │ Lip Synchronization         │
│ + Video         │                 │ ~200ms                      │
├─────────────────┴─────────────────┴─────────────────────────────┤
│               Total End-to-End Latency: <1 second               │
└─────────────────────────────────────────────────────────────────┘
```
## Project Structure

```
zen-translator/
├── src/zen_translator/
│   ├── __init__.py              # Package exports
│   ├── config.py                # TranslatorConfig, NewsAnchorConfig
│   ├── pipeline.py              # Main TranslationPipeline orchestrator
│   ├── cli.py                   # Typer CLI (zen-translate command)
│   ├── translation/
│   │   ├── __init__.py
│   │   └── qwen3_omni.py        # Qwen3-Omni translation
│   ├── voice_clone/
│   │   ├── __init__.py
│   │   └── cosyvoice.py         # CosyVoice 2.0 voice cloning
│   ├── lip_sync/
│   │   ├── __init__.py
│   │   ├── wav2lip.py           # Wav2Lip lip synchronization
│   │   └── wav2lip_model.py     # Wav2Lip neural network architecture
│   ├── streaming/
│   │   ├── __init__.py
│   │   └── server.py            # FastAPI + WebSocket server
│   └── training/
│       ├── __init__.py
│       ├── swift_config.py      # ms-swift finetuning configs
│       └── news_anchor_dataset.py  # News anchor data collection
├── configs/
│   ├── train_identity.yaml      # Zen identity finetuning
│   └── train_anchor.yaml        # News anchor adaptation
├── scripts/
│   └── download_models.py       # Model download utility
├── tests/                       # Test suite
├── data/                        # Training data directory
│   ├── news_anchors/
│   └── voices/
├── models/                      # Downloaded model cache
├── pyproject.toml               # Package configuration (uv/pip)
├── Makefile                     # Build automation
├── README.md                    # User documentation
└── LLM.md                       # AI assistant knowledge base (this file)
```
## Key Components

### 1. TranslationPipeline (`pipeline.py`)

Main orchestrator that coordinates all translation stages:
```python
from zen_translator import TranslationPipeline, TranslatorConfig

config = TranslatorConfig(target_language="es")
pipeline = TranslationPipeline(config)
await pipeline.load()

# Audio translation
result = await pipeline.translate_audio(
    audio="input.wav",
    target_lang="es",
    speaker_id="john_doe",
)

# Video translation with lip sync
result = await pipeline.translate_video(
    video="news.mp4",
    target_lang="zh",
    output_path="news_zh.mp4",
)
```
### 2. Qwen3OmniTranslator (`translation/qwen3_omni.py`)

Handles speech understanding and translation using Qwen3-Omni:
- Audio input processing
- Video multimodal analysis (lip reading, visual context)
- Streaming translation support
- Built-in TTS when voice cloning not needed
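For the streaming path, audio has to reach the translator in fixed-duration chunks. A minimal sketch of such a chunker, assuming 16 kHz mono input; the function name and signature are illustrative, not the actual `qwen3_omni.py` API:

```python
def chunk_audio(samples: list[float], sample_rate: int = 16000,
                chunk_ms: int = 500) -> list[list[float]]:
    """Split a mono audio buffer into fixed-duration chunks suitable
    for streaming translation (hypothetical helper, not the real API)."""
    chunk_len = sample_rate * chunk_ms // 1000
    return [samples[i:i + chunk_len]
            for i in range(0, len(samples), chunk_len)]

# Two seconds of 16 kHz audio -> four 500 ms chunks of 8000 samples each
chunks = chunk_audio([0.0] * 32000)
```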
### 3. CosyVoiceCloner (`voice_clone/cosyvoice.py`)

Voice cloning with 3-second reference audio:
- Speaker embedding extraction
- Emotion preservation
- Streaming synthesis (~150ms first packet)
- NewsAnchorVoiceBank for pre-registered voices
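Since only ~3 seconds of reference audio are needed, longer clips can simply be trimmed to the configured `voice_reference_seconds` before embedding extraction. A pure-Python illustration of that step, not the `cosyvoice.py` implementation (which operates on tensors):

```python
def trim_reference(samples: list[float], sample_rate: int,
                   reference_seconds: float = 3.0) -> list[float]:
    """Keep at most `reference_seconds` of audio for speaker-embedding
    extraction (illustrative sketch)."""
    max_samples = int(sample_rate * reference_seconds)
    if len(samples) < max_samples:
        raise ValueError("reference clip shorter than required duration")
    return samples[:max_samples]

# A 5-second clip at 16 kHz is cut down to 48000 samples (3 s)
ref = trim_reference([0.0] * 80000, sample_rate=16000)
```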
### 4. Wav2LipSync (`lip_sync/wav2lip.py`)

Lip synchronization for video dubbing:
- Face detection (face_alignment or OpenCV fallback)
- Mel spectrogram audio processing
- Batch processing for efficiency
- Quality presets: fast, balanced, quality
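The presets presumably trade batch size and frame resolution for speed; the concrete values below are assumptions for illustration, not what ships in `lip_sync/wav2lip.py`:

```python
# Hypothetical preset table -- the real values live in lip_sync/wav2lip.py.
QUALITY_PRESETS = {
    "fast":     {"face_det_batch": 32, "wav2lip_batch": 256, "resize_factor": 2},
    "balanced": {"face_det_batch": 16, "wav2lip_batch": 128, "resize_factor": 1},
    "quality":  {"face_det_batch": 8,  "wav2lip_batch": 64,  "resize_factor": 1},
}

def preset(name: str) -> dict:
    """Look up a lip-sync quality preset, failing loudly on typos."""
    try:
        return QUALITY_PRESETS[name]
    except KeyError:
        raise ValueError(
            f"unknown preset {name!r}; choose one of {sorted(QUALITY_PRESETS)}"
        ) from None

balanced = preset("balanced")
```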
### 5. TranslationServer (`streaming/server.py`)

FastAPI server for real-time translation:

| Endpoint | Method | Description |
|---|---|---|
| `/translate/audio` | POST | Translate audio file |
| `/translate/video` | POST | Translate video with lip sync |
| `/speakers/register` | POST | Register voice for cloning |
| `/speakers` | GET | List registered speakers |
| `/languages` | GET | Get supported languages |
| `/ws/translate` | WS | Real-time streaming translation |
## Configuration

### TranslatorConfig

```python
config = TranslatorConfig(
    # Models
    qwen3_omni_model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    cosyvoice_model="FunAudioLLM/CosyVoice2-0.5B",
    wav2lip_model="numz/wav2lip_studio",
    # Translation
    target_language="en",
    # Voice cloning
    voice_reference_seconds=3.0,
    preserve_emotion=True,
    # Lip sync
    enable_lip_sync=True,
    lip_sync_quality="balanced",
    # Hardware
    device="cuda",
    dtype="bfloat16",
    use_flash_attention=True,
)
```
### Environment Variables

```bash
ZEN_TRANSLATOR_TARGET_LANGUAGE=es
ZEN_TRANSLATOR_DEVICE=cuda
ZEN_TRANSLATOR_DTYPE=bfloat16
ZEN_TRANSLATOR_ENABLE_LIP_SYNC=true
```
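These variables map onto `TranslatorConfig` fields by name. A sketch of how such overrides could be collected; the `ZEN_TRANSLATOR_` prefix comes from the list above, but the parsing logic is an assumption, not the shipped config loader:

```python
import os

_PREFIX = "ZEN_TRANSLATOR_"
_BOOLS = {"true": True, "1": True, "false": False, "0": False}

def env_overrides(environ=os.environ) -> dict:
    """Collect ZEN_TRANSLATOR_* variables as lowercase config keys,
    coercing boolean-looking values (illustrative sketch)."""
    out = {}
    for key, value in environ.items():
        if key.startswith(_PREFIX):
            out[key[len(_PREFIX):].lower()] = _BOOLS.get(value.lower(), value)
    return out

overrides = env_overrides({"ZEN_TRANSLATOR_ENABLE_LIP_SYNC": "true",
                           "ZEN_TRANSLATOR_DEVICE": "cuda"})
```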
## Training Infrastructure

### Identity Finetuning (`ZenIdentityConfig`)

Finetunes Qwen3-Omni with the Zen Translator identity:
- Professional translation persona
- Consistent behavior and responses
- Uses ms-swift for LoRA training
### News Anchor Adaptation (`NewsAnchorConfig`)
Specialized training for broadcast translation:
- Collects data from YouTube news channels (CNN, BBC, NHK, DW, etc.)
- Segments into training samples
- Creates translation pairs
- Exports in ms-swift format
### Training Commands

```bash
# Build news anchor dataset
make dataset-build

# Generate training config
make train-anchor

# Run ms-swift training
swift sft --config outputs/anchor/train_config.yaml
```
## Development

### Setup

```bash
# Create venv and install
make install

# Install with dev dependencies
make dev

# Download models (~62GB full, ~16GB quantized)
make download
make download-quantized
```
### Testing

```bash
make test       # Run tests
make lint       # Run ruff linter
make format     # Format code
make typecheck  # Run mypy
```
### CLI Commands

```bash
# Translate file
zen-translate video.mp4 -o translated.mp4 -t spanish

# Start server
zen-serve --host 0.0.0.0 --port 8000

# Register speaker
zen-translate register-speaker john_doe reference.wav

# Download models
zen-translate download all

# Train
zen-translate train --type anchor --output ./outputs
```
## Model Requirements
| Model | Parameters | VRAM | Disk |
|---|---|---|---|
| Qwen3-Omni | 30B (3B active) | 16GB | 60GB |
| CosyVoice 2.0 | 0.5B | 2GB | 1GB |
| Wav2Lip | ~100M | 2GB | 500MB |
| Total | - | ~20GB | ~62GB |
For smaller deployments, use 4-bit quantized Qwen3-Omni (~15GB disk).
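The quantized figure is consistent with back-of-envelope arithmetic: 4-bit weights cost 0.5 bytes per parameter, ignoring quantization constants, activations, and KV cache. A quick check of the numbers in the table above:

```python
def weight_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB; ignores quantization scales,
    activations, and KV cache."""
    return n_params * bits_per_weight / 8 / 1e9

full_bf16 = weight_size_gb(30e9, 16)  # bf16: 2 bytes/param -> ~60 GB
four_bit = weight_size_gb(30e9, 4)    # 4-bit: 0.5 bytes/param -> ~15 GB
```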
## Dependencies

### Core
- torch>=2.1.0
- transformers>=4.45.0
- accelerate>=0.25.0

### Audio
- librosa>=0.10.0
- soundfile>=0.12.0
- webrtcvad>=2.0.10

### Video
- opencv-python>=4.8.0
- ffmpeg-python>=0.2.0
- av>=11.0.0

### Streaming
- fastapi>=0.109.0
- uvicorn>=0.27.0
- websockets>=12.0

### Training
- ms-swift>=2.4.0
- peft>=0.7.0
- deepspeed>=0.13.0
## Key Files

- `src/zen_translator/pipeline.py` - Main orchestration (line 23: `TranslationPipeline`)
- `src/zen_translator/translation/qwen3_omni.py` - Qwen3-Omni (line 25: `Qwen3OmniTranslator`)
- `src/zen_translator/voice_clone/cosyvoice.py` - CosyVoice (line 23: `CosyVoiceCloner`)
- `src/zen_translator/lip_sync/wav2lip.py` - Wav2Lip (line 21: `Wav2LipSync`)
- `src/zen_translator/streaming/server.py` - FastAPI server (line 92: `create_app`)
## Notes for AI Assistants

- ALWAYS update this file with significant discoveries or changes
- NEVER commit model files or weights (they're in .gitignore)
- All Zen models are based on Qwen3 (not Qwen2!)
- Use `uv` for Python environment management
- Use `make` commands for standard operations
- The Wav2Lip model requires `wav2lip_model.py` for the architecture definition
- CosyVoice has a fallback mode when not installed
- Flash Attention 2 is recommended for performance
## Related Projects
- zen - Zen AI model family
- Qwen3-Omni - Base translation model
- CosyVoice - Voice cloning
- Wav2Lip - Lip synchronization
- ms-swift - Training framework
Zen Translator: Real-time translation with voice cloning and lip sync.