Update model card (README.md)

---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- es
- fr
- de
- it
- pt
- ru
library_name: transformers
pipeline_tag: audio-to-audio
tags:
- translation
- voice-cloning
- lip-sync
- multimodal
- real-time
- qwen3-omni
- cosyvoice
- wav2lip
- hanzo-ai
- zen-lm
---

# Zen Translator

Real-time multimodal translation with voice cloning and lip synchronization.

## Overview

Zen Translator combines three state-of-the-art models into a sub-second end-to-end pipeline:

| Component | Model | Parameters | Latency |
|-----------|-------|------------|---------|
| Translation | [Qwen3-Omni-30B-A3B](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct) | 30B (3B active MoE) | ~500ms |
| Voice Cloning | [CosyVoice 2.0](https://github.com/FunAudioLLM/CosyVoice) | 0.5B | ~150ms |
| Lip Sync | [Wav2Lip](https://github.com/Rudrabha/Wav2Lip) | ~100M | ~200ms |
| **Total** | - | - | **<1 second** |
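
As a quick sanity check on the latency budget, the nominal per-stage figures quoted in the table sum well under one second (a minimal sketch; these are the quoted nominal values, not measurements):

```python
# Nominal per-stage latencies from the table above, in milliseconds.
stage_ms = {
    "translation": 500,    # Qwen3-Omni
    "voice_cloning": 150,  # CosyVoice 2.0
    "lip_sync": 200,       # Wav2Lip
}

total_ms = sum(stage_ms.values())
print(f"end-to-end: {total_ms} ms")  # end-to-end: 850 ms
```

The remaining ~150ms of headroom covers I/O and muxing overhead between stages.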

## Features

- **18 input languages** including Chinese dialects (Cantonese, Shanghainese, etc.)
- **10 output languages** with high-quality voice synthesis
- **3-second voice cloning** - Preserve speaker characteristics with minimal reference audio
- **Real-time streaming** - WebSocket API with <500ms first packet latency
- **Lip synchronization** - Natural video dubbing for translated content
- **News anchor training** - Domain-specific finetuning for broadcast translation

## Quick Start

```bash
# Clone repository
git clone https://github.com/zenlm/zen-translator.git
cd zen-translator

# Install with uv
make install

# Download models (~62GB full, ~16GB quantized)
make download
# OR
make download-quantized

# Start server
make serve
```

## Usage

### Python API

```python
from zen_translator import TranslationPipeline, TranslatorConfig

config = TranslatorConfig(target_language="es")
pipeline = TranslationPipeline(config)
await pipeline.load()

# Register speaker voice (3+ seconds of audio)
await pipeline.register_speaker("john_doe", "reference.wav")

# Translate video with voice cloning and lip sync
result = await pipeline.translate_video(
    video="news.mp4",
    target_lang="es",
    speaker_id="john_doe",
    output_path="news_es.mp4"
)
```
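
The snippet above uses top-level `await`, which works in a notebook or async REPL; in a plain script, wrap the calls in a coroutine and drive it with `asyncio.run`. A minimal sketch of the pattern (the `translate()` coroutine here is a stand-in, not part of the zen_translator API):

```python
import asyncio

async def translate() -> str:
    # Stand-in for the awaitable pipeline calls above
    # (load, register_speaker, translate_video).
    await asyncio.sleep(0)  # placeholder for the actual async work
    return "news_es.mp4"

result = asyncio.run(translate())
print(result)  # news_es.mp4
```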

### CLI

```bash
# Translate a video
zen-translate video.mp4 -o translated.mp4 -t spanish

# Register a speaker
zen-translate register-speaker john_doe reference.wav

# Start the API server
zen-serve --host 0.0.0.0 --port 8000
```

### REST API

```bash
# Translate audio
curl -X POST http://localhost:8000/translate/audio \
  -F "audio=@input.wav" \
  -F "target_lang=es"

# Translate video with lip sync
curl -X POST http://localhost:8000/translate/video \
  -F "video=@input.mp4" \
  -F "target_lang=zh"
```
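
From Python, the same endpoints take a multipart form with a media file field plus `target_lang`. A small hypothetical helper (not part of zen-translator) that builds the form fields matching the curl calls above:

```python
from pathlib import Path

def build_translate_form(media_path: str, target_lang: str) -> dict:
    """Build form fields mirroring the curl calls above:
    an 'audio' or 'video' file field plus 'target_lang'."""
    audio_exts = {".wav", ".mp3", ".flac"}
    field = "audio" if Path(media_path).suffix in audio_exts else "video"
    return {field: media_path, "target_lang": target_lang}

print(build_translate_form("input.wav", "es"))
# {'audio': 'input.wav', 'target_lang': 'es'}
```

Pass the file entry to your HTTP client's multipart upload (e.g. the `files=` argument in `requests.post`).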

### WebSocket (Real-time)

```javascript
const ws = new WebSocket('ws://localhost:8000/ws/translate');
ws.send(JSON.stringify({ target_lang: 'es', speaker_id: 'my_voice' }));
ws.onmessage = (event) => {
  // handle translated audio/text messages
};
```

## Language Support

### Input Languages (18 + 6 dialects)

| Language | Code |
|----------|------|
| English | en |
| Chinese | zh |
| Japanese | ja |
| Korean | ko |
| Spanish | es |
| French | fr |
| German | de |
| Italian | it |
| Portuguese | pt |
| Russian | ru |
| Arabic | ar |
| Hindi | hi |
| Thai | th |
| Vietnamese | vi |
| Indonesian | id |
| Malay | ms |
| Turkish | tr |
| Polish | pl |
| **Dialects** | |
| Cantonese | yue |
| Shanghainese | wuu |
| Xiang | hsn |
| Min Nan | nan |
| Hakka | hak |
| Min Dong | cdo |

### Output Languages (10)

English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian
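
The ten output languages map to the codes shown in the input table (and to the `language:` list in the card's front matter). A minimal illustrative lookup, useful for validating a `target_lang` before calling the API (the helper itself is hypothetical, not part of zen-translator):

```python
# Output languages and their codes, per the table above and the front matter.
OUTPUT_LANGUAGES = {
    "en": "English", "zh": "Chinese", "ja": "Japanese", "ko": "Korean",
    "es": "Spanish", "fr": "French", "de": "German", "it": "Italian",
    "pt": "Portuguese", "ru": "Russian",
}

def is_supported_output(code: str) -> bool:
    """True if the code is one of the ten supported output languages."""
    return code.lower() in OUTPUT_LANGUAGES

print(is_supported_output("ES"))   # True
print(is_supported_output("yue"))  # False: dialects are input-only
```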

## Model Requirements

| Model | VRAM | Disk |
|-------|------|------|
| Qwen3-Omni | 16GB | 60GB |
| CosyVoice 2.0 | 2GB | 1GB |
| Wav2Lip | 2GB | 500MB |
| **Total** | **~20GB** | **~62GB** |

For smaller deployments, use 4-bit quantized Qwen3-Omni (~15GB disk).
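
The totals in the table follow directly from the per-model rows (a quick arithmetic check, using the figures quoted above):

```python
# Per-model requirements from the table above, in GB.
requirements = {
    "Qwen3-Omni":    {"vram": 16, "disk": 60.0},
    "CosyVoice 2.0": {"vram": 2,  "disk": 1.0},
    "Wav2Lip":       {"vram": 2,  "disk": 0.5},
}

total_vram = sum(m["vram"] for m in requirements.values())
total_disk = sum(m["disk"] for m in requirements.values())
print(total_vram, total_disk)  # 20 61.5 -- quoted as ~20GB VRAM, ~62GB disk
```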

## Training

### News Anchor Adaptation

```bash
# Build dataset from news channels (CNN, BBC, NHK, DW)
make dataset-build

# Train news anchor adaptation
make train-anchor
swift sft --config outputs/anchor/train_config.yaml
```

## Citation

```bibtex
@software{zen_translator,
  author = {Hanzo AI and Zen LM},
  title = {Zen Translator: Real-time Multimodal Translation with Voice Cloning},
  year = {2025},
  url = {https://github.com/zenlm/zen-translator}
}
```

## Links

- **GitHub**: https://github.com/zenlm/zen-translator
- **Zen LM**: https://zenlm.org
- **Hanzo AI**: https://hanzo.ai

## License

Apache 2.0