# Zen Translator - AI Knowledge Base

**Project**: zen-translator
**Organization**: zenlm
**Repository**: https://github.com/zenlm/zen-translator
**Version**: 0.1.0
**Last Updated**: 2025-11-27

## Project Overview

Zen Translator is a real-time multimodal translation pipeline that combines speech translation, voice cloning, and lip synchronization for seamless video dubbing and live translation.

### Core Technology Stack

| Component | Model | Parameters | Latency |
|-----------|-------|------------|---------|
| Translation | Qwen3-Omni-30B-A3B | 30B (3B active MoE) | ~500ms |
| Voice Cloning | CosyVoice 2.0 | 0.5B | ~150ms |
| Lip Sync | Wav2Lip | ~100M | ~200ms |
| **Total** | - | - | **<1 second** |

### Language Support

**Input (18 languages + 6 dialects)**:

- English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian
- Arabic, Hindi, Thai, Vietnamese, Indonesian, Malay, Turkish, Polish
- Cantonese (yue), Shanghainese (wuu), Xiang (hsn), Min Nan (nan), Hakka (hak), Min Dong (cdo)

**Output (10 languages)**:

- English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                     Zen Translator Pipeline                     │
├─────────────────┬─────────────────┬─────────────────────────────┤
│ Audio/Video     │ Qwen3-Omni      │ Translation + Understanding │
│ Input           │ (30B MoE)       │ ~500ms                      │
├─────────────────┼─────────────────┼─────────────────────────────┤
│ Translated      │ CosyVoice 2.0   │ Voice Cloning               │
│ Text            │ (0.5B)          │ ~150ms                      │
├─────────────────┼─────────────────┼─────────────────────────────┤
│ Cloned Audio    │ Wav2Lip         │ Lip Synchronization         │
│ + Video         │                 │ ~200ms                      │
├─────────────────┴─────────────────┴─────────────────────────────┤
│               Total End-to-End Latency: <1 second               │
└─────────────────────────────────────────────────────────────────┘
```

## Project Structure

```
zen-translator/
├── src/zen_translator/
│   ├── __init__.py                 # Package exports
│   ├── config.py                   # TranslatorConfig, NewsAnchorConfig
│   ├── pipeline.py                 # Main TranslationPipeline orchestrator
│   ├── cli.py                      # Typer CLI (zen-translate command)
│   ├── translation/
│   │   ├── __init__.py
│   │   └── qwen3_omni.py           # Qwen3-Omni translation
│   ├── voice_clone/
│   │   ├── __init__.py
│   │   └── cosyvoice.py            # CosyVoice 2.0 voice cloning
│   ├── lip_sync/
│   │   ├── __init__.py
│   │   ├── wav2lip.py              # Wav2Lip lip synchronization
│   │   └── wav2lip_model.py        # Wav2Lip neural network architecture
│   ├── streaming/
│   │   ├── __init__.py
│   │   └── server.py               # FastAPI + WebSocket server
│   └── training/
│       ├── __init__.py
│       ├── swift_config.py         # ms-swift finetuning configs
│       └── news_anchor_dataset.py  # News anchor data collection
├── configs/
│   ├── train_identity.yaml         # Zen identity finetuning
│   └── train_anchor.yaml           # News anchor adaptation
├── scripts/
│   └── download_models.py          # Model download utility
├── tests/                          # Test suite
├── data/                           # Training data directory
│   ├── news_anchors/
│   └── voices/
├── models/                         # Downloaded model cache
├── pyproject.toml                  # Package configuration (uv/pip)
├── Makefile                        # Build automation
├── README.md                       # User documentation
└── LLM.md                          # AI assistant knowledge base (this file)
```
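The three pipeline stages in the architecture diagram compose sequentially. Below is a minimal sketch of that flow; the stage objects and method names (`translate`, `synthesize`, `apply`) are hypothetical placeholders, not the actual `pipeline.py` API — the real entry point is `TranslationPipeline`, documented in the next section.

```python
# Illustrative three-stage flow only; these stage objects and method
# names are hypothetical, not the actual pipeline.py API.
async def dub_video(translator, cloner, lip_sync,
                    video_path: str, target_lang: str) -> str:
    # Stage 1: Qwen3-Omni — speech understanding and translation (~500ms)
    text = await translator.translate(video_path, target_lang=target_lang)
    # Stage 2: CosyVoice 2.0 — speak the translation in the source voice (~150ms)
    audio = await cloner.synthesize(text, reference=video_path)
    # Stage 3: Wav2Lip — re-sync the mouth region to the cloned audio (~200ms)
    return await lip_sync.apply(video_path, audio)
```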
## Key Components

### 1. TranslationPipeline (pipeline.py)

Main orchestrator that coordinates all translation stages:

```python
from zen_translator import TranslationPipeline, TranslatorConfig

config = TranslatorConfig(target_language="es")
pipeline = TranslationPipeline(config)
await pipeline.load()

# Audio translation
result = await pipeline.translate_audio(
    audio="input.wav",
    target_lang="es",
    speaker_id="john_doe",
)

# Video translation with lip sync
result = await pipeline.translate_video(
    video="news.mp4",
    target_lang="zh",
    output_path="news_zh.mp4",
)
```

### 2. Qwen3OmniTranslator (translation/qwen3_omni.py)

Handles speech understanding and translation using Qwen3-Omni:

- Audio input processing
- Video multimodal analysis (lip reading, visual context)
- Streaming translation support
- Built-in TTS when voice cloning is not needed

### 3. CosyVoiceCloner (voice_clone/cosyvoice.py)

Voice cloning from a 3-second reference audio clip:

- Speaker embedding extraction
- Emotion preservation
- Streaming synthesis (~150ms to first packet)
- NewsAnchorVoiceBank for pre-registered voices

### 4. Wav2LipSync (lip_sync/wav2lip.py)

Lip synchronization for video dubbing:

- Face detection (face_alignment, with OpenCV fallback)
- Mel spectrogram audio processing
- Batch processing for efficiency
- Quality presets: fast, balanced, quality

### 5. TranslationServer (streaming/server.py)

FastAPI server for real-time translation:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/translate/audio` | POST | Translate audio file |
| `/translate/video` | POST | Translate video with lip sync |
| `/speakers/register` | POST | Register voice for cloning |
| `/speakers` | GET | List registered speakers |
| `/languages` | GET | Get supported languages |
| `/ws/translate` | WS | Real-time streaming translation |

## Configuration

### TranslatorConfig

```python
config = TranslatorConfig(
    # Models
    qwen3_omni_model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    cosyvoice_model="FunAudioLLM/CosyVoice2-0.5B",
    wav2lip_model="numz/wav2lip_studio",

    # Translation
    target_language="en",

    # Voice cloning
    voice_reference_seconds=3.0,
    preserve_emotion=True,

    # Lip sync
    enable_lip_sync=True,
    lip_sync_quality="balanced",

    # Hardware
    device="cuda",
    dtype="bfloat16",
    use_flash_attention=True,
)
```

### Environment Variables

```bash
ZEN_TRANSLATOR_TARGET_LANGUAGE=es
ZEN_TRANSLATOR_DEVICE=cuda
ZEN_TRANSLATOR_DTYPE=bfloat16
ZEN_TRANSLATOR_ENABLE_LIP_SYNC=true
```

## Training Infrastructure

### Identity Finetuning (ZenIdentityConfig)

Finetunes Qwen3-Omni with the Zen Translator identity:

- Professional translation persona
- Consistent behavior and responses
- Uses ms-swift for LoRA training (see the sketch below)
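A minimal way to launch that identity run, assuming the shipped `configs/train_identity.yaml` is consumable by `swift sft` as-is (the anchor workflow under Training Commands uses the same `swift sft --config` pattern):

```bash
# Identity finetuning with ms-swift; assumes configs/train_identity.yaml
# can be passed to `swift sft` directly, mirroring the anchor workflow.
swift sft --config configs/train_identity.yaml
```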
### News Anchor Adaptation (NewsAnchorConfig)

Specialized training for broadcast translation:

- Collects data from YouTube news channels (CNN, BBC, NHK, DW, etc.)
- Segments footage into training samples
- Creates translation pairs
- Exports in ms-swift format

### Training Commands

```bash
# Build news anchor dataset
make dataset-build

# Generate training config
make train-anchor

# Run ms-swift training
swift sft --config outputs/anchor/train_config.yaml
```

## Development

### Setup

```bash
# Create venv and install
make install

# Install with dev dependencies
make dev

# Download models (~62GB full, ~16GB quantized)
make download
make download-quantized
```

### Testing

```bash
make test       # Run tests
make lint       # Run ruff linter
make format     # Format code
make typecheck  # Run mypy
```

### CLI Commands

```bash
# Translate file
zen-translate video.mp4 -o translated.mp4 -t spanish

# Start server
zen-serve --host 0.0.0.0 --port 8000

# Register speaker
zen-translate register-speaker john_doe reference.wav

# Download models
zen-translate download all

# Train
zen-translate train --type anchor --output ./outputs
```

## Model Requirements

| Model | Parameters | VRAM | Disk |
|-------|------------|------|------|
| Qwen3-Omni | 30B (3B active) | 16GB | 60GB |
| CosyVoice 2.0 | 0.5B | 2GB | 1GB |
| Wav2Lip | ~100M | 2GB | 500MB |
| **Total** | - | **~20GB** | **~62GB** |

For smaller deployments, use the 4-bit quantized Qwen3-Omni (~15GB disk).

## Dependencies

### Core

- torch>=2.1.0
- transformers>=4.45.0
- accelerate>=0.25.0

### Audio

- librosa>=0.10.0
- soundfile>=0.12.0
- webrtcvad>=2.0.10

### Video

- opencv-python>=4.8.0
- ffmpeg-python>=0.2.0
- av>=11.0.0

### Streaming

- fastapi>=0.109.0
- uvicorn>=0.27.0
- websockets>=12.0

### Training

- ms-swift>=2.4.0
- peft>=0.7.0
- deepspeed>=0.13.0

## Key Files

- `src/zen_translator/pipeline.py` - Main orchestration (line 23: TranslationPipeline)
- `src/zen_translator/translation/qwen3_omni.py` - Qwen3-Omni (line 25: Qwen3OmniTranslator)
- `src/zen_translator/voice_clone/cosyvoice.py` - CosyVoice (line 23: CosyVoiceCloner)
- `src/zen_translator/lip_sync/wav2lip.py` - Wav2Lip (line 21: Wav2LipSync)
- `src/zen_translator/streaming/server.py` - FastAPI server (line 92: create_app)

## Notes for AI Assistants

1. **ALWAYS** update this file with significant discoveries or changes
2. **NEVER** commit model files or weights (they are in .gitignore)
3. All Zen models are based on **Qwen3** (not Qwen2!)
4. Use `uv` for Python environment management
5. Use `make` commands for standard operations
6. The Wav2Lip model requires `wav2lip_model.py` for its architecture definition
7. CosyVoice has a fallback mode when the package is not installed
8. Flash Attention 2 is recommended for performance

## Related Projects

- [zen](https://github.com/zenlm/zen) - Zen AI model family
- [Qwen3-Omni](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct) - Base translation model
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) - Voice cloning
- [Wav2Lip](https://github.com/Rudrabha/Wav2Lip) - Lip synchronization
- [ms-swift](https://github.com/modelscope/ms-swift) - Training framework

---

**Zen Translator**: Real-time translation with voice cloning and lip sync.