
Zen Translator - AI Knowledge Base

Project: zen-translator
Organization: zenlm
Repository: https://github.com/zenlm/zen-translator
Version: 0.1.0
Last Updated: 2025-11-27

Project Overview

Zen Translator is a real-time multimodal translation pipeline that combines speech translation, voice cloning, and lip synchronization for seamless video dubbing and live translation.

Core Technology Stack

| Component     | Model              | Parameters          | Latency   |
|---------------|--------------------|---------------------|-----------|
| Translation   | Qwen3-Omni-30B-A3B | 30B (3B active MoE) | ~500ms    |
| Voice Cloning | CosyVoice 2.0      | 0.5B                | ~150ms    |
| Lip Sync      | Wav2Lip            | ~100M               | ~200ms    |
| Total         | -                  | -                   | <1 second |

Language Support

Input (18 languages + 6 dialects):

  • English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian
  • Arabic, Hindi, Thai, Vietnamese, Indonesian, Malay, Turkish, Polish
  • Cantonese (yue), Shanghainese (wuu), Xiang (hsn), Min Nan (nan), Hakka (hak), Min Dong (cdo)

Output (10 languages):

  • English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Zen Translator Pipeline                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚  Audio/Video    β”‚  Qwen3-Omni     β”‚  Translation + Understanding β”‚
β”‚  Input          β”‚  (30B MoE)      β”‚  ~500ms                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚  Translated     β”‚  CosyVoice 2.0  β”‚  Voice Cloning               β”‚
β”‚  Text           β”‚  (0.5B)         β”‚  ~150ms                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚  Cloned Audio   β”‚  Wav2Lip        β”‚  Lip Synchronization         β”‚
β”‚  + Video        β”‚                 β”‚  ~200ms                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚  Total End-to-End Latency: <1 second                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Project Structure

zen-translator/
β”œβ”€β”€ src/zen_translator/
β”‚   β”œβ”€β”€ __init__.py           # Package exports
β”‚   β”œβ”€β”€ config.py             # TranslatorConfig, NewsAnchorConfig
β”‚   β”œβ”€β”€ pipeline.py           # Main TranslationPipeline orchestrator
β”‚   β”œβ”€β”€ cli.py                # Typer CLI (zen-translate command)
β”‚   β”œβ”€β”€ translation/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── qwen3_omni.py     # Qwen3-Omni translation
β”‚   β”œβ”€β”€ voice_clone/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── cosyvoice.py      # CosyVoice 2.0 voice cloning
β”‚   β”œβ”€β”€ lip_sync/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ wav2lip.py        # Wav2Lip lip synchronization
β”‚   β”‚   └── wav2lip_model.py  # Wav2Lip neural network architecture
β”‚   β”œβ”€β”€ streaming/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── server.py         # FastAPI + WebSocket server
β”‚   └── training/
β”‚       β”œβ”€β”€ __init__.py
β”‚       β”œβ”€β”€ swift_config.py       # ms-swift finetuning configs
β”‚       └── news_anchor_dataset.py # News anchor data collection
β”œβ”€β”€ configs/
β”‚   β”œβ”€β”€ train_identity.yaml   # Zen identity finetuning
β”‚   └── train_anchor.yaml     # News anchor adaptation
β”œβ”€β”€ scripts/
β”‚   └── download_models.py    # Model download utility
β”œβ”€β”€ tests/                    # Test suite
β”œβ”€β”€ data/                     # Training data directory
β”‚   β”œβ”€β”€ news_anchors/
β”‚   └── voices/
β”œβ”€β”€ models/                   # Downloaded model cache
β”œβ”€β”€ pyproject.toml            # Package configuration (uv/pip)
β”œβ”€β”€ Makefile                  # Build automation
β”œβ”€β”€ README.md                 # User documentation
└── LLM.md                    # AI assistant knowledge base (this file)

Key Components

1. TranslationPipeline (pipeline.py)

Main orchestrator that coordinates all translation stages:

from zen_translator import TranslationPipeline, TranslatorConfig

config = TranslatorConfig(target_language="es")
pipeline = TranslationPipeline(config)
await pipeline.load()  # await calls assume an async context (e.g. asyncio.run)

# Audio translation
result = await pipeline.translate_audio(
    audio="input.wav",
    target_lang="es",
    speaker_id="john_doe"
)

# Video translation with lip sync
result = await pipeline.translate_video(
    video="news.mp4",
    target_lang="zh",
    output_path="news_zh.mp4"
)

2. Qwen3OmniTranslator (translation/qwen3_omni.py)

Handles speech understanding and translation using Qwen3-Omni:

  • Audio input processing
  • Video multimodal analysis (lip reading, visual context)
  • Streaming translation support
  • Built-in TTS when voice cloning is not needed
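
A minimal usage sketch. The constructor arguments and method names below are assumptions inferred from the class and config names in this file, not a confirmed API:

from zen_translator.translation import Qwen3OmniTranslator

# Hypothetical direct usage; exact signatures may differ.
translator = Qwen3OmniTranslator(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    device="cuda",
    dtype="bfloat16",
)
await translator.load()

# Translate an audio clip to Spanish text (inside an async context)
text = await translator.translate(audio="input.wav", target_lang="es")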

3. CosyVoiceCloner (voice_clone/cosyvoice.py)

Voice cloning with 3-second reference audio:

  • Speaker embedding extraction
  • Emotion preservation
  • Streaming synthesis (~150ms first packet)
  • NewsAnchorVoiceBank for pre-registered voices
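
A hedged sketch of the cloning flow; register_speaker and synthesize are illustrative names based on the features above, not confirmed methods:

from zen_translator.voice_clone import CosyVoiceCloner

# Illustrative only; method names are assumptions.
cloner = CosyVoiceCloner(model="FunAudioLLM/CosyVoice2-0.5B")
await cloner.load()

# Register a speaker from ~3 seconds of reference audio
await cloner.register_speaker("john_doe", reference="reference.wav")

# Synthesize translated text in the cloned voice (~150ms to first packet)
audio = await cloner.synthesize("Hola, buenas noches.", speaker_id="john_doe")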

4. Wav2LipSync (lip_sync/wav2lip.py)

Lip synchronization for video dubbing:

  • Face detection (face_alignment or OpenCV fallback)
  • Mel spectrogram audio processing
  • Batch processing for efficiency
  • Quality presets: fast, balanced, quality
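
A sketch under assumed names; the sync() signature is illustrative, only the quality presets come from the module:

from zen_translator.lip_sync import Wav2LipSync

# Hypothetical usage; the sync() signature is an assumption.
syncer = Wav2LipSync(quality="balanced")  # presets: fast, balanced, quality
await syncer.load()

# Re-render the video so mouth movements match the dubbed audio
await syncer.sync(video="news.mp4", audio="news_es.wav", output="news_es.mp4")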

5. TranslationServer (streaming/server.py)

FastAPI server for real-time translation:

| Endpoint           | Method | Description                     |
|--------------------|--------|---------------------------------|
| /translate/audio   | POST   | Translate audio file            |
| /translate/video   | POST   | Translate video with lip sync   |
| /speakers/register | POST   | Register voice for cloning      |
| /speakers          | GET    | List registered speakers        |
| /languages         | GET    | Get supported languages         |
| /ws/translate      | WS     | Real-time streaming translation |
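
An example client call against the REST endpoint. The multipart field names and response format are assumptions; only the routes come from the table above:

import requests

# Hypothetical request/response shape; field names are assumptions.
with open("input.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/translate/audio",
        files={"file": f},
        data={"target_lang": "es", "speaker_id": "john_doe"},
    )
resp.raise_for_status()
with open("output_es.wav", "wb") as out:
    out.write(resp.content)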

Configuration

TranslatorConfig

config = TranslatorConfig(
    # Models
    qwen3_omni_model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    cosyvoice_model="FunAudioLLM/CosyVoice2-0.5B",
    wav2lip_model="numz/wav2lip_studio",
    
    # Translation
    target_language="en",
    
    # Voice cloning
    voice_reference_seconds=3.0,
    preserve_emotion=True,
    
    # Lip sync
    enable_lip_sync=True,
    lip_sync_quality="balanced",
    
    # Hardware
    device="cuda",
    dtype="bfloat16",
    use_flash_attention=True,
)

Environment Variables

ZEN_TRANSLATOR_TARGET_LANGUAGE=es
ZEN_TRANSLATOR_DEVICE=cuda
ZEN_TRANSLATOR_DTYPE=bfloat16
ZEN_TRANSLATOR_ENABLE_LIP_SYNC=true
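
The names mirror TranslatorConfig fields: strip the ZEN_TRANSLATOR_ prefix and lower-case the rest. A minimal sketch of that convention (the actual loader may be implemented differently, e.g. via pydantic settings):

import os

def config_overrides(prefix: str = "ZEN_TRANSLATOR_") -> dict:
    """Collect prefixed env vars as lower-cased config field overrides."""
    return {
        key[len(prefix):].lower(): value
        for key, value in os.environ.items()
        if key.startswith(prefix)
    }

# ZEN_TRANSLATOR_TARGET_LANGUAGE=es  ->  {"target_language": "es"}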

Training Infrastructure

Identity Finetuning (ZenIdentityConfig)

Finetunes Qwen3-Omni with Zen Translator identity:

  • Professional translation persona
  • Consistent behavior and responses
  • Uses ms-swift for LoRA training

News Anchor Adaptation (NewsAnchorConfig)

Specialized training for broadcast translation:

  • Collects data from YouTube news channels (CNN, BBC, NHK, DW, etc.)
  • Segments into training samples
  • Creates translation pairs
  • Exports in ms-swift format
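
For reference, ms-swift accepts chat-style JSONL; a hedged sketch of what one exported translation pair might look like (the exact prompt wording and file layout are assumptions):

import json

# Assumed sample shape; zen-translator's actual export may differ.
sample = {
    "messages": [
        {"role": "user", "content": "Translate to Spanish: Good evening, here is the news."},
        {"role": "assistant", "content": "Buenas noches, aquΓ­ estΓ‘n las noticias."},
    ]
}
with open("data/news_anchors/train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")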

Training Commands

# Build news anchor dataset
make dataset-build

# Generate training config
make train-anchor

# Run ms-swift training
swift sft --config outputs/anchor/train_config.yaml
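
A hypothetical programmatic equivalent of make train-anchor; the NewsAnchorConfig fields and method below are assumptions based on swift_config.py's role:

from zen_translator.training import NewsAnchorConfig

# Hypothetical API; field and method names are assumptions.
cfg = NewsAnchorConfig(output_dir="outputs/anchor")
cfg.to_swift_yaml("outputs/anchor/train_config.yaml")  # then: swift sft --config ...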

Development

Setup

# Create venv and install
make install

# Install with dev dependencies
make dev

# Download models (~62GB full, ~16GB quantized)
make download
make download-quantized

Testing

make test       # Run tests
make lint       # Run ruff linter
make format     # Format code
make typecheck  # Run mypy

CLI Commands

# Translate file
zen-translate video.mp4 -o translated.mp4 -t spanish

# Start server
zen-serve --host 0.0.0.0 --port 8000

# Register speaker
zen-translate register-speaker john_doe reference.wav

# Download models
zen-translate download all

# Train
zen-translate train --type anchor --output ./outputs

Model Requirements

| Model         | Parameters      | VRAM  | Disk  |
|---------------|-----------------|-------|-------|
| Qwen3-Omni    | 30B (3B active) | 16GB  | 60GB  |
| CosyVoice 2.0 | 0.5B            | 2GB   | 1GB   |
| Wav2Lip       | ~100M           | 2GB   | 500MB |
| Total         | -               | ~20GB | ~62GB |

For smaller deployments, use 4-bit quantized Qwen3-Omni (~15GB disk).
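
A sketch of 4-bit loading with bitsandbytes via transformers. The exact model class for Qwen3-Omni may differ; the generic Auto class here is an assumption:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization with bfloat16 compute
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)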

Dependencies

Core

  • torch>=2.1.0
  • transformers>=4.45.0
  • accelerate>=0.25.0

Audio

  • librosa>=0.10.0
  • soundfile>=0.12.0
  • webrtcvad>=2.0.10

Video

  • opencv-python>=4.8.0
  • ffmpeg-python>=0.2.0
  • av>=11.0.0

Streaming

  • fastapi>=0.109.0
  • uvicorn>=0.27.0
  • websockets>=12.0

Training

  • ms-swift>=2.4.0
  • peft>=0.7.0
  • deepspeed>=0.13.0

Key Files

  • src/zen_translator/pipeline.py - Main orchestration (line 23: TranslationPipeline)
  • src/zen_translator/translation/qwen3_omni.py - Qwen3-Omni (line 25: Qwen3OmniTranslator)
  • src/zen_translator/voice_clone/cosyvoice.py - CosyVoice (line 23: CosyVoiceCloner)
  • src/zen_translator/lip_sync/wav2lip.py - Wav2Lip (line 21: Wav2LipSync)
  • src/zen_translator/streaming/server.py - FastAPI server (line 92: create_app)

Notes for AI Assistants

  1. ALWAYS update this file with significant discoveries or changes
  2. NEVER commit model files or weights (they're in .gitignore)
  3. All Zen models are based on Qwen3 (not Qwen2!)
  4. Use uv for Python environment management
  5. Use make commands for standard operations
  6. The Wav2Lip model requires wav2lip_model.py for architecture definition
  7. CosyVoice has a fallback mode when it is not installed (see the sketch after this list)
  8. Flash Attention 2 is recommended for performance
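
A sketch of the optional-dependency pattern note 7 describes; the real import path and fallback behavior in cosyvoice.py are assumptions:

# If CosyVoice is missing, the pipeline can fall back to Qwen3-Omni's
# built-in TTS instead of voice cloning (see section 2 above).
try:
    import cosyvoice  # optional dependency; package name is an assumption
    HAS_COSYVOICE = True
except ImportError:
    HAS_COSYVOICE = False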

Zen Translator: Real-time translation with voice cloning and lip sync.