zeekay committed on
Commit 54db085 · verified · 1 Parent(s): f0b1626

Update model card

Files changed (1)
  1. README.md +150 -158
README.md CHANGED
@@ -1,24 +1,57 @@
1
  # Zen Translator
2
 
3
  Real-time multimodal translation with voice cloning and lip synchronization.
4
 
5
- Built on:
6
- - **[Qwen3-Omni](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct)** - Real-time speech understanding and translation
7
- - **[CosyVoice 2.0](https://github.com/FunAudioLLM/CosyVoice)** - Ultra-low latency voice cloning (150ms)
8
- - **[Wav2Lip](https://github.com/Rudrabha/Wav2Lip)** - Accurate lip synchronization
 
 
 
 
 
 
9
 
10
  ## Features
11
 
12
- - 🌐 **18 input languages**, 10 output languages
13
- - 🎙️ **3-second voice cloning** - Preserve speaker characteristics
14
- - 👄 **Accurate lip sync** - Natural video dubbing
15
- - **<1 second latency** - Real-time streaming
16
- - 📺 **News anchor optimization** - Domain-specific finetuning
 
17
 
18
  ## Quick Start
19
 
20
- ### Installation
21
-
22
  ```bash
23
  # Clone repository
24
  git clone https://github.com/zenlm/zen-translator.git
@@ -27,24 +60,67 @@ cd zen-translator
27
  # Install with uv
28
  make install
29
 
30
- # Download models (requires ~100GB disk space)
31
  make download
32
  ```
33
 
34
- ### Usage
35
 
36
- **Translate a video:**
37
  ```bash
 
38
  zen-translate video.mp4 -o translated.mp4 -t spanish
 
 
 
 
 
 
39
  ```
40
 
41
- **Start the API server:**
 
42
  ```bash
43
- make serve
44
- # Server runs at http://localhost:8000
 
 
 
 
 
 
 
45
  ```
46
 
47
- **Real-time WebSocket translation:**
 
48
  ```javascript
49
  const ws = new WebSocket('ws://localhost:8000/ws/translate');
50
  ws.send(JSON.stringify({ target_lang: 'es', speaker_id: 'my_voice' }));
@@ -54,63 +130,59 @@ ws.onmessage = (event) => {
54
  };
55
  ```
56
 
57
- ## Architecture
58
-
59
- ```
60
- ┌─────────────────────────────────────────────────────────────────┐
61
- │ Zen Translator Pipeline │
62
- ├─────────────────┬─────────────────┬─────────────────────────────┤
63
- │ Audio/Video │ Qwen3-Omni │ Translation + Understanding │
64
- │ Input │ (30B MoE) │ ~500ms │
65
- ├─────────────────┼─────────────────┼─────────────────────────────┤
66
- │ Translated │ CosyVoice 2.0 │ Voice Cloning │
67
- │ Text │ (0.5B) │ ~150ms │
68
- ├─────────────────┼─────────────────┼─────────────────────────────┤
69
- │ Cloned Audio │ Wav2Lip │ Lip Synchronization │
70
- │ + Video │ │ ~200ms │
71
- ├─────────────────┴─────────────────┴─────────────────────────────┤
72
- │ Total End-to-End Latency: <1 second │
73
- └─────────────────────────────────────────────────────────────────┘
74
- ```
75
-
76
- ## Supported Languages
77
-
78
- ### Input (18 + 6 dialects)
79
- English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian, Arabic, Hindi, Thai, Vietnamese, Indonesian, Malay, Turkish, Polish, Cantonese, Shanghainese, and more.
80
 
81
- ### Output (10)
82
  English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian
83
 
84
- ## Voice Cloning
85
-
86
- Register a speaker with just 3 seconds of audio:
87
-
88
- ```python
89
- from zen_translator import TranslationPipeline
90
-
91
- pipeline = TranslationPipeline()
92
- await pipeline.load()
93
 
94
- # Register speaker
95
- await pipeline.register_speaker(
96
- speaker_id="john_doe",
97
- reference_audio="reference.wav"
98
- )
 
99
 
100
- # Translate with cloned voice
101
- result = await pipeline.translate_audio(
102
- audio="input.wav",
103
- target_lang="es",
104
- speaker_id="john_doe"
105
- )
106
- ```
107
 
108
- ## News Anchor Training
109
 
110
- Finetune for accurate news translation:
111
 
112
  ```bash
113
- # Build dataset from news channels
114
  make dataset-build
115
 
116
  # Train news anchor adaptation
@@ -120,103 +192,23 @@ make train-anchor
120
  swift sft --config outputs/anchor/train_config.yaml
121
  ```
122
 
123
- Supported news sources:
124
- - CNN, BBC News, NHK World, DW News
125
- - France24, Al Jazeera, Sky News, Reuters
126
- - CCTV, TBS, KBS, and more
127
-
128
- ## API Reference
129
-
130
- ### REST Endpoints
131
-
132
- | Endpoint | Method | Description |
133
- |----------|--------|-------------|
134
- | `/translate/audio` | POST | Translate audio file |
135
- | `/translate/video` | POST | Translate video with lip sync |
136
- | `/speakers/register` | POST | Register voice for cloning |
137
- | `/speakers` | GET | List registered speakers |
138
- | `/languages` | GET | Get supported languages |
139
- | `/ws/translate` | WS | Real-time streaming translation |
140
-
141
- ### Python API
142
-
143
- ```python
144
- from zen_translator import TranslationPipeline, TranslatorConfig
145
-
146
- # Configure
147
- config = TranslatorConfig(
148
- target_language="es",
149
- enable_lip_sync=True,
150
- preserve_emotion=True,
151
- )
152
-
153
- # Initialize
154
- pipeline = TranslationPipeline(config)
155
- await pipeline.load()
156
-
157
- # Translate video
158
- result = await pipeline.translate_video(
159
- video="news_clip.mp4",
160
- output_path="translated.mp4",
161
- )
162
- ```
163
-
164
- ## Model Requirements
165
-
166
- | Model | Parameters | VRAM | Disk |
167
- |-------|------------|------|------|
168
- | Qwen3-Omni | 30B (3B active) | 16GB | 60GB |
169
- | CosyVoice 2.0 | 0.5B | 2GB | 1GB |
170
- | Wav2Lip | ~100M | 2GB | 500MB |
171
- | **Total** | - | **~20GB** | **~62GB** |
172
-
173
- For smaller deployments, use quantized models:
174
- ```bash
175
- make download-quantized # 4-bit Qwen3-Omni (~15GB)
176
- ```
177
-
178
- ## Development
179
-
180
- ```bash
181
- # Install dev dependencies
182
- make dev
183
-
184
- # Run tests
185
- make test
186
 
187
- # Lint and format
188
- make lint format
189
-
190
- # Type check
191
- make typecheck
 
 
192
  ```
193
 
194
- ## Configuration
195
-
196
- Environment variables:
197
- ```bash
198
- export ZEN_TRANSLATOR_TARGET_LANGUAGE=es
199
- export ZEN_TRANSLATOR_DEVICE=cuda
200
- export ZEN_TRANSLATOR_DTYPE=bfloat16
201
- export ZEN_TRANSLATOR_ENABLE_LIP_SYNC=true
202
- ```
203
 
204
- Or use `.env` file in project root.
 
 
205
 
206
  ## License
207
 
208
  Apache 2.0
209
-
210
- ## Credits
211
-
212
- - **Qwen Team** - Qwen3-Omni model
213
- - **Alibaba FunAudioLLM** - CosyVoice
214
- - **Wav2Lip Authors** - Lip synchronization
215
- - **Hanzo AI / Zen LM** - Integration and finetuning
216
-
217
- ## Links
218
-
219
- - [Zen LM](https://zenlm.org)
220
- - [Qwen3-Omni](https://huggingface.co/collections/Qwen/qwen3-omni)
221
- - [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
222
- - [Wav2Lip](https://github.com/Rudrabha/Wav2Lip)
 
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - zh
6
+ - ja
7
+ - ko
8
+ - es
9
+ - fr
10
+ - de
11
+ - it
12
+ - pt
13
+ - ru
14
+ library_name: transformers
15
+ pipeline_tag: audio-to-audio
16
+ tags:
17
+ - translation
18
+ - voice-cloning
19
+ - lip-sync
20
+ - multimodal
21
+ - real-time
22
+ - qwen3-omni
23
+ - cosyvoice
24
+ - wav2lip
25
+ - hanzo-ai
26
+ - zen-lm
27
+ ---
28
+
29
  # Zen Translator
30
 
31
  Real-time multimodal translation with voice cloning and lip synchronization.
32
 
33
+ ## Overview
34
+
35
+ Zen Translator chains three models into a single pipeline with an end-to-end latency of under one second:
36
+
37
+ | Component | Model | Parameters | Latency |
38
+ |-----------|-------|------------|---------|
39
+ | Translation | [Qwen3-Omni-30B-A3B](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct) | 30B (3B active MoE) | ~500ms |
40
+ | Voice Cloning | [CosyVoice 2.0](https://github.com/FunAudioLLM/CosyVoice) | 0.5B | ~150ms |
41
+ | Lip Sync | [Wav2Lip](https://github.com/Rudrabha/Wav2Lip) | ~100M | ~200ms |
42
+ | **Total** | - | - | **<1 second** |
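
The per-stage figures in the table can be checked with simple arithmetic. The sketch below sums the three estimates, assuming the stages run strictly one after another (the README does not say whether they overlap); the dictionary and function names are illustrative only.

```python
# Per-stage latency estimates in milliseconds, copied from the table above.
STAGE_LATENCY_MS = {
    "translation": 500,    # Qwen3-Omni
    "voice_cloning": 150,  # CosyVoice 2.0
    "lip_sync": 200,       # Wav2Lip
}


def total_latency_ms(stages: dict[str, int]) -> int:
    """Sum stage latencies, assuming purely sequential execution."""
    return sum(stages.values())


total = total_latency_ms(STAGE_LATENCY_MS)
headroom = 1000 - total  # margin left under the one-second target
print(f"end-to-end: ~{total} ms, headroom under 1 s: {headroom} ms")
```

Under that sequential assumption the estimates leave a modest margin below one second, so any extra network or I/O time has to fit inside it.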
43
 
44
  ## Features
45
 
46
+ - **18 input languages** plus 6 Chinese dialects (Cantonese, Shanghainese, etc.)
47
+ - **10 output languages** with high-quality voice synthesis
48
+ - **3-second voice cloning** - Preserve speaker characteristics with minimal reference audio
49
+ - **Real-time streaming** - WebSocket API with <500ms first packet latency
50
+ - **Lip synchronization** - Natural video dubbing for translated content
51
+ - **News anchor training** - Domain-specific finetuning for broadcast translation
52
 
53
  ## Quick Start
54
 
 
 
55
  ```bash
56
  # Clone repository
57
  git clone https://github.com/zenlm/zen-translator.git
 
60
  # Install with uv
61
  make install
62
 
63
+ # Download models (~62GB full, ~16GB quantized)
64
  make download
65
+ # OR
66
+ make download-quantized
67
+
68
+ # Start server
69
+ make serve
70
+ ```
71
+
72
+ ## Usage
73
+
74
+ ### Python API
75
+
76
+ ```python
77
+ from zen_translator import TranslationPipeline, TranslatorConfig
78
+
79
+ config = TranslatorConfig(target_language="es")
80
+ pipeline = TranslationPipeline(config)
81
+ await pipeline.load()
82
+
83
+ # Register speaker voice (3+ seconds of audio)
84
+ await pipeline.register_speaker("john_doe", "reference.wav")
85
+
86
+ # Translate video with voice cloning and lip sync
87
+ result = await pipeline.translate_video(
88
+ video="news.mp4",
89
+ target_lang="es",
90
+ speaker_id="john_doe",
91
+ output_path="news_es.mp4"
92
+ )
93
  ```
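
The snippet above uses `await` at the top level, which only works in an async REPL or notebook. In a regular script the calls need to sit inside a coroutine started with `asyncio.run`. The sketch below shows that shape with a stand-in class: `StubPipeline` is hypothetical and only mimics the method names shown above, so the example runs without the models. With the package installed, `TranslationPipeline` would presumably take its place.

```python
import asyncio


class StubPipeline:
    """Hypothetical stand-in that mimics the method names from the README."""

    def __init__(self) -> None:
        self.speakers: dict[str, str] = {}
        self.loaded = False

    async def load(self) -> None:
        self.loaded = True  # the real pipeline would load model weights here

    async def register_speaker(self, speaker_id: str, reference_audio: str) -> None:
        self.speakers[speaker_id] = reference_audio

    async def translate_video(self, video: str, target_lang: str,
                              speaker_id: str, output_path: str) -> str:
        if not self.loaded:
            raise RuntimeError("call load() before translating")
        if speaker_id not in self.speakers:
            raise KeyError(f"unknown speaker: {speaker_id}")
        return output_path  # the stub only echoes where output would be written


async def main() -> str:
    pipeline = StubPipeline()
    await pipeline.load()
    await pipeline.register_speaker("john_doe", "reference.wav")
    return await pipeline.translate_video(
        video="news.mp4",
        target_lang="es",
        speaker_id="john_doe",
        output_path="news_es.mp4",
    )


result = asyncio.run(main())
print(result)
```

The stub's return value is an assumption made for the sake of a runnable example; the README does not document what `translate_video` returns.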
94
 
95
+ ### CLI
96
 
 
97
  ```bash
98
+ # Translate a video
99
  zen-translate video.mp4 -o translated.mp4 -t spanish
100
+
101
+ # Register a speaker
102
+ zen-translate register-speaker john_doe reference.wav
103
+
104
+ # Start the API server
105
+ zen-serve --host 0.0.0.0 --port 8000
106
  ```
107
 
108
+ ### REST API
109
+
110
  ```bash
111
+ # Translate audio
112
+ curl -X POST http://localhost:8000/translate/audio \
113
+ -F "audio=@input.wav" \
114
+ -F "target_lang=es"
115
+
116
+ # Translate video with lip sync
117
+ curl -X POST http://localhost:8000/translate/video \
118
+ -F "video=@input.mp4" \
119
+ -F "target_lang=zh"
120
  ```
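
The same audio request can be prepared from Python. The sketch below uses only the standard library and mirrors the `curl` form fields shown above (`audio`, `target_lang`); the helper names are illustrative, and the request is built but not sent because the response format is not documented here.

```python
import uuid
import urllib.request


def build_multipart(fields: dict[str, str],
                    files: dict[str, tuple[str, bytes]]) -> tuple[bytes, str]:
    """Encode text fields and files as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts: list[bytes] = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    for name, (filename, data) in files.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f'Content-Type: application/octet-stream\r\n\r\n'.encode()
            + data + b'\r\n'
        )
    parts.append(f'--{boundary}--\r\n'.encode())
    return b''.join(parts), f'multipart/form-data; boundary={boundary}'


def make_audio_request(audio: bytes, target_lang: str,
                       base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST to /translate/audio."""
    body, content_type = build_multipart(
        fields={"target_lang": target_lang},
        files={"audio": ("input.wav", audio)},
    )
    return urllib.request.Request(
        f"{base_url}/translate/audio",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Placeholder bytes stand in for a real WAV file read from disk.
request = make_audio_request(b"RIFF....", "es")
print(request.get_method(), request.full_url)
```

Once the server is running, passing `request` to `urllib.request.urlopen` sends it; a third-party HTTP client would shorten the encoding step considerably.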
121
 
122
+ ### WebSocket (Real-time)
123
+
124
  ```javascript
125
  const ws = new WebSocket('ws://localhost:8000/ws/translate');
126
  ws.send(JSON.stringify({ target_lang: 'es', speaker_id: 'my_voice' }));
 
130
  };
131
  ```
132
 
133
+ ## Language Support
134
+
135
+ ### Input Languages (18 + 6 dialects)
136
+
137
+ | Language | Code |
138
+ |----------|------|
139
+ | English | en |
140
+ | Chinese | zh |
141
+ | Japanese | ja |
142
+ | Korean | ko |
143
+ | Spanish | es |
144
+ | French | fr |
145
+ | German | de |
146
+ | Italian | it |
147
+ | Portuguese | pt |
148
+ | Russian | ru |
149
+ | Arabic | ar |
150
+ | Hindi | hi |
151
+ | Thai | th |
152
+ | Vietnamese | vi |
153
+ | Indonesian | id |
154
+ | Malay | ms |
155
+ | Turkish | tr |
156
+ | Polish | pl |
157
+ | **Dialects** | |
158
+ | Cantonese | yue |
159
+ | Shanghainese | wuu |
160
+ | Xiang | hsn |
161
+ | Min Nan | nan |
162
+ | Hakka | hak |
163
+ | Min Dong | cdo |
164
+
165
+ ### Output Languages (10)
166
 
 
167
  English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian
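
Because the input and output sets differ, a client may want to validate a language pair before sending a request. The sketch below encodes the tables above as plain dictionaries; the output codes are assumed to be the same codes the input table uses for those ten languages, and `is_supported_pair` is an illustrative helper, not part of the package.

```python
# Input codes copied from the table above.
INPUT_LANGUAGES = {
    "en": "English", "zh": "Chinese", "ja": "Japanese", "ko": "Korean",
    "es": "Spanish", "fr": "French", "de": "German", "it": "Italian",
    "pt": "Portuguese", "ru": "Russian", "ar": "Arabic", "hi": "Hindi",
    "th": "Thai", "vi": "Vietnamese", "id": "Indonesian", "ms": "Malay",
    "tr": "Turkish", "pl": "Polish",
}
INPUT_DIALECTS = {
    "yue": "Cantonese", "wuu": "Shanghainese", "hsn": "Xiang",
    "nan": "Min Nan", "hak": "Hakka", "cdo": "Min Dong",
}
# Assumption: output languages reuse the input table's codes.
OUTPUT_CODES = {"en", "zh", "ja", "ko", "es", "fr", "de", "it", "pt", "ru"}


def is_supported_pair(source: str, target: str) -> bool:
    """Return True if `source` is a listed input and `target` a listed output."""
    source_ok = source in INPUT_LANGUAGES or source in INPUT_DIALECTS
    return source_ok and target in OUTPUT_CODES


print(is_supported_pair("yue", "en"))  # Cantonese in, English out
print(is_supported_pair("en", "ar"))   # Arabic is input-only
```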
168
 
169
+ ## Model Requirements
170
 
171
+ | Model | VRAM | Disk |
172
+ |-------|------|------|
173
+ | Qwen3-Omni | 16GB | 60GB |
174
+ | CosyVoice 2.0 | 2GB | 1GB |
175
+ | Wav2Lip | 2GB | 500MB |
176
+ | **Total** | **~20GB** | **~62GB** |
177
 
178
+ For smaller deployments, use 4-bit quantized Qwen3-Omni (~15GB disk).
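
The totals in the table, and the quantized download size quoted in the Quick Start, follow from the per-model figures. The sketch below redoes that arithmetic; it assumes the 4-bit variant replaces only the Qwen3-Omni weights and leaves the other two models unchanged, and all names are illustrative.

```python
# (VRAM GB, disk GB) per model, copied from the table above.
REQUIREMENTS_GB = {
    "Qwen3-Omni": (16.0, 60.0),
    "CosyVoice 2.0": (2.0, 1.0),
    "Wav2Lip": (2.0, 0.5),
}
QUANTIZED_QWEN_DISK_GB = 15.0  # 4-bit figure quoted above


def totals(requirements: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Return (total VRAM, total disk) in GB."""
    vram = sum(v for v, _ in requirements.values())
    disk = sum(d for _, d in requirements.values())
    return vram, disk


vram_gb, disk_gb = totals(REQUIREMENTS_GB)
# Swap the full Qwen3-Omni disk footprint for the quantized one.
quantized_disk_gb = disk_gb - REQUIREMENTS_GB["Qwen3-Omni"][1] + QUANTIZED_QWEN_DISK_GB
print(f"VRAM: {vram_gb} GB, disk: {disk_gb} GB, quantized disk: {quantized_disk_gb} GB")
```

The README gives no VRAM figure for the quantized model, so the sketch leaves that value out rather than guessing.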
 
 
 
 
 
 
179
 
180
+ ## Training
181
 
182
+ ### News Anchor Adaptation
183
 
184
  ```bash
185
+ # Build dataset from news channels (CNN, BBC, NHK, DW)
186
  make dataset-build
187
 
188
  # Train news anchor adaptation
 
192
  swift sft --config outputs/anchor/train_config.yaml
193
  ```
194
 
195
+ ## Citation
196
 
197
+ ```bibtex
198
+ @software{zen_translator,
199
+ author = {Hanzo AI and Zen LM},
200
+ title = {Zen Translator: Real-time Multimodal Translation with Voice Cloning},
201
+ year = {2025},
202
+ url = {https://github.com/zenlm/zen-translator}
203
+ }
204
  ```
205
 
206
+ ## Links
207
 
208
+ - **GitHub**: https://github.com/zenlm/zen-translator
209
+ - **Zen LM**: https://zenlm.org
210
+ - **Hanzo AI**: https://hanzo.ai
211
 
212
  ## License
213
 
214
  Apache 2.0