# Zen Translator - AI Knowledge Base
**Project**: zen-translator
**Organization**: zenlm
**Repository**: https://github.com/zenlm/zen-translator
**Version**: 0.1.0
**Last Updated**: 2025-11-27
## Project Overview
Zen Translator is a real-time multimodal translation pipeline that combines speech translation, voice cloning, and lip synchronization for seamless video dubbing and live translation.
### Core Technology Stack
| Component | Model | Parameters | Latency |
|-----------|-------|------------|---------|
| Translation | Qwen3-Omni-30B-A3B | 30B (3B active MoE) | ~500ms |
| Voice Cloning | CosyVoice 2.0 | 0.5B | ~150ms |
| Lip Sync | Wav2Lip | ~100M | ~200ms |
| **Total** | - | - | **<1 second** |
### Language Support
**Input (18 languages + 6 dialects)**:
- English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian
- Arabic, Hindi, Thai, Vietnamese, Indonesian, Malay, Turkish, Polish
- Cantonese (yue), Shanghainese (wuu), Xiang (hsn), Min Nan (nan), Hakka (hak), Min Dong (cdo)
**Output (10 languages; see the code-mapping sketch below)**:
- English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian
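Language arguments throughout this document (`target_lang="es"`, `-t spanish`, etc.) use short codes or names. A minimal, hedged sketch of a name-to-code mapping for the output set; the actual accepted values live in the package, so treat this dict as illustrative:

```python
# Illustrative name -> ISO 639-1 code map for the ten output languages.
# The codes accepted by zen-translator itself are assumed to match these.
OUTPUT_LANGUAGES = {
    "english": "en", "chinese": "zh", "japanese": "ja", "korean": "ko",
    "spanish": "es", "french": "fr", "german": "de", "italian": "it",
    "portuguese": "pt", "russian": "ru",
}

def to_lang_code(name: str) -> str:
    """Normalize a display name or bare code to a supported output code."""
    key = name.strip().lower()
    if key in OUTPUT_LANGUAGES.values():
        return key  # already a code such as "es"
    return OUTPUT_LANGUAGES[key]  # KeyError for unsupported languages
```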
## Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Zen Translator Pipeline                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Audio/Video    β”‚ Qwen3-Omni     β”‚ Translation + Understanding  β”‚
β”‚ Input          β”‚ (30B MoE)      β”‚ ~500ms                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Translated     β”‚ CosyVoice 2.0  β”‚ Voice Cloning                β”‚
β”‚ Text           β”‚ (0.5B)         β”‚ ~150ms                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Cloned Audio   β”‚ Wav2Lip        β”‚ Lip Synchronization          β”‚
β”‚ + Video        β”‚                β”‚ ~200ms                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚              Total End-to-End Latency: <1 second               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
## Project Structure
```
zen-translator/
β”œβ”€β”€ src/zen_translator/
β”‚   β”œβ”€β”€ __init__.py                 # Package exports
β”‚   β”œβ”€β”€ config.py                   # TranslatorConfig, NewsAnchorConfig
β”‚   β”œβ”€β”€ pipeline.py                 # Main TranslationPipeline orchestrator
β”‚   β”œβ”€β”€ cli.py                      # Typer CLI (zen-translate command)
β”‚   β”œβ”€β”€ translation/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── qwen3_omni.py           # Qwen3-Omni translation
β”‚   β”œβ”€β”€ voice_clone/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── cosyvoice.py            # CosyVoice 2.0 voice cloning
β”‚   β”œβ”€β”€ lip_sync/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ wav2lip.py              # Wav2Lip lip synchronization
β”‚   β”‚   └── wav2lip_model.py        # Wav2Lip neural network architecture
β”‚   β”œβ”€β”€ streaming/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── server.py               # FastAPI + WebSocket server
β”‚   └── training/
β”‚       β”œβ”€β”€ __init__.py
β”‚       β”œβ”€β”€ swift_config.py         # ms-swift finetuning configs
β”‚       └── news_anchor_dataset.py  # News anchor data collection
β”œβ”€β”€ configs/
β”‚   β”œβ”€β”€ train_identity.yaml         # Zen identity finetuning
β”‚   └── train_anchor.yaml           # News anchor adaptation
β”œβ”€β”€ scripts/
β”‚   └── download_models.py          # Model download utility
β”œβ”€β”€ tests/                          # Test suite
β”œβ”€β”€ data/                           # Training data directory
β”‚   β”œβ”€β”€ news_anchors/
β”‚   └── voices/
β”œβ”€β”€ models/                         # Downloaded model cache
β”œβ”€β”€ pyproject.toml                  # Package configuration (uv/pip)
β”œβ”€β”€ Makefile                        # Build automation
β”œβ”€β”€ README.md                       # User documentation
└── LLM.md                          # AI assistant knowledge base (this file)
```
## Key Components
### 1. TranslationPipeline (pipeline.py)
Main orchestrator that coordinates all translation stages:
```python
import asyncio

from zen_translator import TranslationPipeline, TranslatorConfig

async def main() -> None:
    config = TranslatorConfig(target_language="es")
    pipeline = TranslationPipeline(config)
    await pipeline.load()

    # Audio translation
    result = await pipeline.translate_audio(
        audio="input.wav",
        target_lang="es",
        speaker_id="john_doe",
    )

    # Video translation with lip sync
    result = await pipeline.translate_video(
        video="news.mp4",
        target_lang="zh",
        output_path="news_zh.mp4",
    )

asyncio.run(main())
```
### 2. Qwen3OmniTranslator (translation/qwen3_omni.py)
Handles speech understanding and translation using Qwen3-Omni (usage sketch below):
- Audio input processing
- Video multimodal analysis (lip reading, visual context)
- Streaming translation support
- Built-in TTS when voice cloning is not needed
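A minimal standalone sketch; the import path follows the project structure above, but the constructor and `translate()` signature are assumptions inferred from the pipeline API, so verify against `translation/qwen3_omni.py`:

```python
# Sketch only: signatures are inferred from the pipeline API above and
# not verified against translation/qwen3_omni.py.
import asyncio

from zen_translator.translation import Qwen3OmniTranslator

async def demo() -> None:
    translator = Qwen3OmniTranslator(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")
    text = await translator.translate(audio="clip.wav", target_lang="ja")
    print(text)

asyncio.run(demo())
```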
### 3. CosyVoiceCloner (voice_clone/cosyvoice.py)
Voice cloning from a 3-second reference audio clip (usage sketch below):
- Speaker embedding extraction
- Emotion preservation
- Streaming synthesis (~150ms first packet)
- NewsAnchorVoiceBank for pre-registered voices
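A hedged sketch of direct use; the method names mirror the feature list above and are assumptions rather than the verified API of `voice_clone/cosyvoice.py`:

```python
# Sketch only: register/synthesize names are assumed from the list above.
from zen_translator.voice_clone import CosyVoiceCloner

cloner = CosyVoiceCloner(model="FunAudioLLM/CosyVoice2-0.5B")

# ~3 seconds of reference audio is enough to extract a speaker embedding.
cloner.register_speaker("anchor_01", reference="anchor_01_ref.wav")

# Synthesize translated text in the registered voice.
audio = cloner.synthesize("Buenas noches y bienvenidos.", speaker_id="anchor_01")
```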
### 4. Wav2LipSync (lip_sync/wav2lip.py)
Lip synchronization for video dubbing (usage sketch below):
- Face detection (face_alignment or OpenCV fallback)
- Mel spectrogram audio processing
- Batch processing for efficiency
- Quality presets: fast, balanced, quality
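A hedged sketch of direct use, again with assumed signatures; check `lip_sync/wav2lip.py` for the real interface:

```python
# Sketch only: constructor and sync() are assumptions based on the
# quality presets and inputs described above.
from zen_translator.lip_sync import Wav2LipSync

syncer = Wav2LipSync(quality="balanced")  # "fast" | "balanced" | "quality"
syncer.sync(
    video="news.mp4",             # original footage
    audio="news_es.wav",          # cloned, translated audio track
    output_path="news_es_synced.mp4",
)
```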
### 5. TranslationServer (streaming/server.py)
FastAPI server for real-time translation (client sketch follows the endpoint table):
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/translate/audio` | POST | Translate audio file |
| `/translate/video` | POST | Translate video with lip sync |
| `/speakers/register` | POST | Register voice for cloning |
| `/speakers` | GET | List registered speakers |
| `/languages` | GET | Get supported languages |
| `/ws/translate` | WS | Real-time streaming translation |
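A minimal client sketch against these endpoints. The paths come from the table above; the multipart field names and JSON response shape are assumptions, so confirm them in `streaming/server.py`:

```python
# Client sketch: field names ("audio", "speaker_id", "target_lang") are
# assumed, not read from the server implementation.
import httpx

with httpx.Client(base_url="http://localhost:8000") as client:
    # Register a reference voice for cloning.
    with open("john_doe.wav", "rb") as f:
        client.post(
            "/speakers/register",
            files={"audio": f},
            data={"speaker_id": "john_doe"},
        )

    # Translate an audio file using the registered voice.
    with open("input.wav", "rb") as f:
        resp = client.post(
            "/translate/audio",
            files={"audio": f},
            data={"target_lang": "es", "speaker_id": "john_doe"},
        )
    print(resp.json())
```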
## Configuration
### TranslatorConfig
```python
config = TranslatorConfig(
    # Models
    qwen3_omni_model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    cosyvoice_model="FunAudioLLM/CosyVoice2-0.5B",
    wav2lip_model="numz/wav2lip_studio",
    # Translation
    target_language="en",
    # Voice cloning
    voice_reference_seconds=3.0,
    preserve_emotion=True,
    # Lip sync
    enable_lip_sync=True,
    lip_sync_quality="balanced",
    # Hardware
    device="cuda",
    dtype="bfloat16",
    use_flash_attention=True,
)
```
### Environment Variables
```bash
ZEN_TRANSLATOR_TARGET_LANGUAGE=es
ZEN_TRANSLATOR_DEVICE=cuda
ZEN_TRANSLATOR_DTYPE=bfloat16
ZEN_TRANSLATOR_ENABLE_LIP_SYNC=true
```
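These presumably map onto the matching `TranslatorConfig` fields via the `ZEN_TRANSLATOR_` prefix; whether `config.py` reads them automatically at construction time is an assumption. Under that assumption, an in-process override looks like:

```python
# Assumes TranslatorConfig picks up ZEN_TRANSLATOR_* variables when
# instantiated (prefix-to-field mapping not verified in config.py).
import os

os.environ["ZEN_TRANSLATOR_TARGET_LANGUAGE"] = "es"
os.environ["ZEN_TRANSLATOR_ENABLE_LIP_SYNC"] = "false"

from zen_translator import TranslatorConfig

config = TranslatorConfig()  # would pick up the overrides above
```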
## Training Infrastructure
### Identity Finetuning (ZenIdentityConfig)
Finetunes Qwen3-Omni with the Zen Translator identity:
- Professional translation persona
- Consistent behavior and responses
- Uses ms-swift for LoRA training
### News Anchor Adaptation (NewsAnchorConfig)
Specialized training for broadcast translation (sample record sketched after the list):
- Collects data from YouTube news channels (CNN, BBC, NHK, DW, etc.)
- Segments into training samples
- Creates translation pairs
- Exports in ms-swift format
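For orientation, one exported record plausibly looks like the following; ms-swift accepts messages-style JSONL, but the exact schema written by `news_anchor_dataset.py` is an assumption:

```python
# Hypothetical exported sample; field names other than "messages" are
# illustrative and may not match news_anchor_dataset.py.
sample = {
    "messages": [
        {"role": "user", "content": "Translate to Spanish: Good evening, our top story tonight."},
        {"role": "assistant", "content": "Buenas noches, nuestra noticia principal de esta noche."},
    ],
    "source_audio": "data/news_anchors/bbc/clip_0001.wav",  # assumed field
}
```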
### Training Commands
```bash
# Build news anchor dataset
make dataset-build
# Generate training config
make train-anchor
# Run ms-swift training
swift sft --config outputs/anchor/train_config.yaml
```
## Development
### Setup
```bash
# Create venv and install
make install
# Install with dev dependencies
make dev
# Download models (~62GB full, ~16GB quantized)
make download
make download-quantized
```
### Testing
```bash
make test # Run tests
make lint # Run ruff linter
make format # Format code
make typecheck # Run mypy
```
### CLI Commands
```bash
# Translate file
zen-translate video.mp4 -o translated.mp4 -t spanish
# Start server
zen-serve --host 0.0.0.0 --port 8000
# Register speaker
zen-translate register-speaker john_doe reference.wav
# Download models
zen-translate download all
# Train
zen-translate train --type anchor --output ./outputs
```
## Model Requirements
| Model | Parameters | VRAM | Disk |
|-------|------------|------|------|
| Qwen3-Omni | 30B (3B active) | 16GB | 60GB |
| CosyVoice 2.0 | 0.5B | 2GB | 1GB |
| Wav2Lip | ~100M | 2GB | 500MB |
| **Total** | - | **~20GB** | **~62GB** |
For smaller deployments, use 4-bit quantized Qwen3-Omni (~15GB disk).
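The usual way to get there is 4-bit weight loading via bitsandbytes. A generic sketch of the pattern; whether zen-translator wires this up internally (e.g., through a config flag) is an assumption:

```python
# General 4-bit loading pattern with bitsandbytes; not verified as the
# mechanism zen-translator itself uses for its quantized download.
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Pass quantization_config=bnb_config to the model's from_pretrained(...)
# call; weight memory drops roughly 4x versus bfloat16.
```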
## Dependencies
### Core
- torch>=2.1.0
- transformers>=4.45.0
- accelerate>=0.25.0
### Audio
- librosa>=0.10.0
- soundfile>=0.12.0
- webrtcvad>=2.0.10
### Video
- opencv-python>=4.8.0
- ffmpeg-python>=0.2.0
- av>=11.0.0
### Streaming
- fastapi>=0.109.0
- uvicorn>=0.27.0
- websockets>=12.0
### Training
- ms-swift>=2.4.0
- peft>=0.7.0
- deepspeed>=0.13.0
## Key Files
- `src/zen_translator/pipeline.py` - Main orchestration (line 23: TranslationPipeline)
- `src/zen_translator/translation/qwen3_omni.py` - Qwen3-Omni (line 25: Qwen3OmniTranslator)
- `src/zen_translator/voice_clone/cosyvoice.py` - CosyVoice (line 23: CosyVoiceCloner)
- `src/zen_translator/lip_sync/wav2lip.py` - Wav2Lip (line 21: Wav2LipSync)
- `src/zen_translator/streaming/server.py` - FastAPI server (line 92: create_app)
## Notes for AI Assistants
1. **ALWAYS** update this file with significant discoveries or changes
2. **NEVER** commit model files or weights (they're in .gitignore)
3. All Zen models are based on **Qwen3** (not Qwen2!)
4. Use `uv` for Python environment management
5. Use `make` commands for standard operations
6. The Wav2Lip model requires `wav2lip_model.py` for architecture definition
7. CosyVoiceCloner has a fallback mode for when the CosyVoice package is not installed
8. Flash Attention 2 is recommended for performance
## Related Projects
- [zen](https://github.com/zenlm/zen) - Zen AI model family
- [Qwen3-Omni](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct) - Base translation model
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) - Voice cloning
- [Wav2Lip](https://github.com/Rudrabha/Wav2Lip) - Lip synchronization
- [ms-swift](https://github.com/modelscope/ms-swift) - Training framework
---
**Zen Translator**: Real-time translation with voice cloning and lip sync.