| # Zen Translator - AI Knowledge Base | |
| **Project**: zen-translator | |
| **Organization**: zenlm | |
| **Repository**: https://github.com/zenlm/zen-translator | |
| **Version**: 0.1.0 | |
| **Last Updated**: 2025-11-27 | |
| ## Project Overview | |
| Zen Translator is a real-time multimodal translation pipeline that combines speech translation, voice cloning, and lip synchronization for seamless video dubbing and live translation. | |
| ### Core Technology Stack | |
| | Component | Model | Parameters | Latency | | |
| |-----------|-------|------------|---------| | |
| | Translation | Qwen3-Omni-30B-A3B | 30B (3B active MoE) | ~500ms | | |
| | Voice Cloning | CosyVoice 2.0 | 0.5B | ~150ms | | |
| | Lip Sync | Wav2Lip | ~100M | ~200ms | | |
| | **Total** | - | - | **<1 second** | | |
| ### Language Support | |
| **Input (18 languages + 6 dialects)**: | |
| - English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian | |
| - Arabic, Hindi, Thai, Vietnamese, Indonesian, Malay, Turkish, Polish | |
| - Cantonese (yue), Shanghainese (wuu), Xiang (hsn), Min Nan (nan), Hakka (hak), Min Dong (cdo) | |
| **Output (10 languages)**: | |
| - English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian | |
| ## Architecture | |
| ``` | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β Zen Translator Pipeline β | |
| βββββββββββββββββββ¬ββββββββββββββββββ¬ββββββββββββββββββββββββββββββ€ | |
| β Audio/Video β Qwen3-Omni β Translation + Understanding β | |
| β Input β (30B MoE) β ~500ms β | |
| βββββββββββββββββββΌββββββββββββββββββΌββββββββββββββββββββββββββββββ€ | |
| β Translated β CosyVoice 2.0 β Voice Cloning β | |
| β Text β (0.5B) β ~150ms β | |
| βββββββββββββββββββΌββββββββββββββββββΌββββββββββββββββββββββββββββββ€ | |
| β Cloned Audio β Wav2Lip β Lip Synchronization β | |
| β + Video β β ~200ms β | |
| βββββββββββββββββββ΄ββββββββββββββββββ΄ββββββββββββββββββββββββββββββ€ | |
| β Total End-to-End Latency: <1 second β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| ``` | |
| ## Project Structure | |
| ``` | |
| zen-translator/ | |
| βββ src/zen_translator/ | |
| β βββ __init__.py # Package exports | |
| β βββ config.py # TranslatorConfig, NewsAnchorConfig | |
| β βββ pipeline.py # Main TranslationPipeline orchestrator | |
| β βββ cli.py # Typer CLI (zen-translate command) | |
| β βββ translation/ | |
| β β βββ __init__.py | |
| β β βββ qwen3_omni.py # Qwen3-Omni translation | |
| β βββ voice_clone/ | |
| β β βββ __init__.py | |
| β β βββ cosyvoice.py # CosyVoice 2.0 voice cloning | |
| β βββ lip_sync/ | |
| β β βββ __init__.py | |
| β β βββ wav2lip.py # Wav2Lip lip synchronization | |
| β β βββ wav2lip_model.py # Wav2Lip neural network architecture | |
| β βββ streaming/ | |
| β β βββ __init__.py | |
| β β βββ server.py # FastAPI + WebSocket server | |
| β βββ training/ | |
| β βββ __init__.py | |
| β βββ swift_config.py # ms-swift finetuning configs | |
| β βββ news_anchor_dataset.py # News anchor data collection | |
| βββ configs/ | |
| β βββ train_identity.yaml # Zen identity finetuning | |
| β βββ train_anchor.yaml # News anchor adaptation | |
| βββ scripts/ | |
| β βββ download_models.py # Model download utility | |
| βββ tests/ # Test suite | |
| βββ data/ # Training data directory | |
| β βββ news_anchors/ | |
| β βββ voices/ | |
| βββ models/ # Downloaded model cache | |
| βββ pyproject.toml # Package configuration (uv/pip) | |
| βββ Makefile # Build automation | |
| βββ README.md # User documentation | |
| βββ LLM.md # AI assistant knowledge base (this file) | |
| ``` | |
| ## Key Components | |
| ### 1. TranslationPipeline (pipeline.py) | |
| Main orchestrator that coordinates all translation stages: | |
| ```python | |
| from zen_translator import TranslationPipeline, TranslatorConfig | |
| config = TranslatorConfig(target_language="es") | |
| pipeline = TranslationPipeline(config) | |
| await pipeline.load() | |
| # Audio translation | |
| result = await pipeline.translate_audio( | |
| audio="input.wav", | |
| target_lang="es", | |
| speaker_id="john_doe" | |
| ) | |
| # Video translation with lip sync | |
| result = await pipeline.translate_video( | |
| video="news.mp4", | |
| target_lang="zh", | |
| output_path="news_zh.mp4" | |
| ) | |
| ``` | |
| ### 2. Qwen3OmniTranslator (translation/qwen3_omni.py) | |
| Handles speech understanding and translation using Qwen3-Omni: | |
| - Audio input processing | |
| - Video multimodal analysis (lip reading, visual context) | |
| - Streaming translation support | |
| - Built-in TTS when voice cloning not needed | |
| ### 3. CosyVoiceCloner (voice_clone/cosyvoice.py) | |
| Voice cloning with 3-second reference audio: | |
| - Speaker embedding extraction | |
| - Emotion preservation | |
| - Streaming synthesis (~150ms first packet) | |
| - NewsAnchorVoiceBank for pre-registered voices | |
| ### 4. Wav2LipSync (lip_sync/wav2lip.py) | |
| Lip synchronization for video dubbing: | |
| - Face detection (face_alignment or OpenCV fallback) | |
| - Mel spectrogram audio processing | |
| - Batch processing for efficiency | |
| - Quality presets: fast, balanced, quality | |
| ### 5. TranslationServer (streaming/server.py) | |
| FastAPI server for real-time translation: | |
| | Endpoint | Method | Description | | |
| |----------|--------|-------------| | |
| | `/translate/audio` | POST | Translate audio file | | |
| | `/translate/video` | POST | Translate video with lip sync | | |
| | `/speakers/register` | POST | Register voice for cloning | | |
| | `/speakers` | GET | List registered speakers | | |
| | `/languages` | GET | Get supported languages | | |
| | `/ws/translate` | WS | Real-time streaming translation | | |
| ## Configuration | |
| ### TranslatorConfig | |
| ```python | |
| config = TranslatorConfig( | |
| # Models | |
| qwen3_omni_model="Qwen/Qwen3-Omni-30B-A3B-Instruct", | |
| cosyvoice_model="FunAudioLLM/CosyVoice2-0.5B", | |
| wav2lip_model="numz/wav2lip_studio", | |
| # Translation | |
| target_language="en", | |
| # Voice cloning | |
| voice_reference_seconds=3.0, | |
| preserve_emotion=True, | |
| # Lip sync | |
| enable_lip_sync=True, | |
| lip_sync_quality="balanced", | |
| # Hardware | |
| device="cuda", | |
| dtype="bfloat16", | |
| use_flash_attention=True, | |
| ) | |
| ``` | |
| ### Environment Variables | |
| ```bash | |
| ZEN_TRANSLATOR_TARGET_LANGUAGE=es | |
| ZEN_TRANSLATOR_DEVICE=cuda | |
| ZEN_TRANSLATOR_DTYPE=bfloat16 | |
| ZEN_TRANSLATOR_ENABLE_LIP_SYNC=true | |
| ``` | |
| ## Training Infrastructure | |
| ### Identity Finetuning (ZenIdentityConfig) | |
| Finetunes Qwen3-Omni with Zen Translator identity: | |
| - Professional translation persona | |
| - Consistent behavior and responses | |
| - Uses ms-swift for LoRA training | |
| ### News Anchor Adaptation (NewsAnchorConfig) | |
| Specialized training for broadcast translation: | |
| - Collects data from YouTube news channels (CNN, BBC, NHK, DW, etc.) | |
| - Segments into training samples | |
| - Creates translation pairs | |
| - Exports in ms-swift format | |
| ### Training Commands | |
| ```bash | |
| # Build news anchor dataset | |
| make dataset-build | |
| # Generate training config | |
| make train-anchor | |
| # Run ms-swift training | |
| swift sft --config outputs/anchor/train_config.yaml | |
| ``` | |
| ## Development | |
| ### Setup | |
| ```bash | |
| # Create venv and install | |
| make install | |
| # Install with dev dependencies | |
| make dev | |
| # Download models (~62GB full, ~16GB quantized) | |
| make download | |
| make download-quantized | |
| ``` | |
| ### Testing | |
| ```bash | |
| make test # Run tests | |
| make lint # Run ruff linter | |
| make format # Format code | |
| make typecheck # Run mypy | |
| ``` | |
| ### CLI Commands | |
| ```bash | |
| # Translate file | |
| zen-translate video.mp4 -o translated.mp4 -t spanish | |
| # Start server | |
| zen-serve --host 0.0.0.0 --port 8000 | |
| # Register speaker | |
| zen-translate register-speaker john_doe reference.wav | |
| # Download models | |
| zen-translate download all | |
| # Train | |
| zen-translate train --type anchor --output ./outputs | |
| ``` | |
| ## Model Requirements | |
| | Model | Parameters | VRAM | Disk | | |
| |-------|------------|------|------| | |
| | Qwen3-Omni | 30B (3B active) | 16GB | 60GB | | |
| | CosyVoice 2.0 | 0.5B | 2GB | 1GB | | |
| | Wav2Lip | ~100M | 2GB | 500MB | | |
| | **Total** | - | **~20GB** | **~62GB** | | |
| For smaller deployments, use 4-bit quantized Qwen3-Omni (~15GB disk). | |
| ## Dependencies | |
| ### Core | |
| - torch>=2.1.0 | |
| - transformers>=4.45.0 | |
| - accelerate>=0.25.0 | |
| ### Audio | |
| - librosa>=0.10.0 | |
| - soundfile>=0.12.0 | |
| - webrtcvad>=2.0.10 | |
| ### Video | |
| - opencv-python>=4.8.0 | |
| - ffmpeg-python>=0.2.0 | |
| - av>=11.0.0 | |
| ### Streaming | |
| - fastapi>=0.109.0 | |
| - uvicorn>=0.27.0 | |
| - websockets>=12.0 | |
| ### Training | |
| - ms-swift>=2.4.0 | |
| - peft>=0.7.0 | |
| - deepspeed>=0.13.0 | |
| ## Key Files | |
| - `src/zen_translator/pipeline.py` - Main orchestration (line 23: TranslationPipeline) | |
| - `src/zen_translator/translation/qwen3_omni.py` - Qwen3-Omni (line 25: Qwen3OmniTranslator) | |
| - `src/zen_translator/voice_clone/cosyvoice.py` - CosyVoice (line 23: CosyVoiceCloner) | |
| - `src/zen_translator/lip_sync/wav2lip.py` - Wav2Lip (line 21: Wav2LipSync) | |
| - `src/zen_translator/streaming/server.py` - FastAPI server (line 92: create_app) | |
| ## Notes for AI Assistants | |
| 1. **ALWAYS** update this file with significant discoveries or changes | |
| 2. **NEVER** commit model files or weights (they're in .gitignore) | |
| 3. All Zen models are based on **Qwen3** (not Qwen2!) | |
| 4. Use `uv` for Python environment management | |
| 5. Use `make` commands for standard operations | |
| 6. The Wav2Lip model requires `wav2lip_model.py` for architecture definition | |
| 7. CosyVoice has fallback mode when not installed | |
| 8. Flash Attention 2 is recommended for performance | |
| ## Related Projects | |
| - [zen](https://github.com/zenlm/zen) - Zen AI model family | |
| - [Qwen3-Omni](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct) - Base translation model | |
| - [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) - Voice cloning | |
| - [Wav2Lip](https://github.com/Rudrabha/Wav2Lip) - Lip synchronization | |
| - [ms-swift](https://github.com/modelscope/ms-swift) - Training framework | |
| --- | |
| **Zen Translator**: Real-time translation with voice cloning and lip sync. | |