ZzWater committed · verified
Commit 4e5afb7 · Parent(s): 8eb9806

Upload 2 files

Files changed (2):
  1. README.md +142 -3
  2. README_zh.md +142 -0
README.md CHANGED
@@ -1,3 +1,142 @@
- ---
- license: cc-by-nc-4.0
- ---
<h1 align="center">🚀 ViiTor Voice TTS</h1>
<p align="center">Fast, flexible speech cloning with transformers or vLLM; batch-friendly and duration-aware.</p>
<p align="center"><a href="README_zh.md">中文文档</a> · <a href="https://viitor-ai.github.io/viitor-voice/">Demo page</a> · <a href="https://github.com/viitor-ai/viitor-voice/">GitHub</a> · <a href="https://huggingface.co/ZzWater/ViiTor-voice-2.0-base">Hugging Face</a></p>

## 🍀 What it is
ViiTor Voice is a three-stage speech cloning stack:
- Stage 1: prompt + text → semantic tokens.
- Stage 2: prompt acoustic/semantic tokens + predicted semantic tokens → predicted acoustic tokens.
- Stage 3: acoustic tokens → waveform.
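The three stages compose sequentially. As a structural sketch of the flow (all function names and bodies here are illustrative placeholders, not the repo's actual API; the real models live in the SoundStorm, DualCodec, and LLM checkpoints):

```python
# Structural sketch of the three-stage pipeline described above.
# Every function body is a placeholder standing in for a model call.

def text_to_semantic(prompt_audio: bytes, text: str) -> list[int]:
    # Stage 1: prompt + text -> semantic tokens.
    return list(range(len(text)))

def semantic_to_acoustic(prompt_tokens: list[int], semantic: list[int]) -> list[int]:
    # Stage 2: prompt acoustic/semantic tokens + predicted semantic tokens
    # -> predicted acoustic tokens.
    return semantic

def acoustic_to_waveform(acoustic: list[int]) -> list[float]:
    # Stage 3: acoustic tokens -> waveform samples.
    return [float(t) for t in acoustic]

def synthesize(prompt_audio: bytes, prompt_tokens: list[int], text: str) -> list[float]:
    semantic = text_to_semantic(prompt_audio, text)
    acoustic = semantic_to_acoustic(prompt_tokens, semantic)
    return acoustic_to_waveform(acoustic)
```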
## ✨ Why it shines
- **Text-free prompts**: stronger cross-lingual cloning with less ASR dependency; raw, untranscribed prompts are welcome.
- **Similarity boost**: InfoNCE loss plus a condition encoder act as a similarity constraint; robust even with noisy or background-heavy prompts.
- **Built-in duration control**: duration prediction lives in the LLM trunk; you can force a target duration with roughly 0.5 s precision.
- **LoRA-based emotion control**: plug in LoRA adapters to steer emotion and style without full finetuning.
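Given the ~0.5 s control precision, a duration hint is effectively snapped to a half-second grid. A tiny sketch of that idea (how the model internally discretizes duration is an assumption here; only the 0.5 s step comes from the stated precision):

```python
def quantize_duration(seconds: float, step: float = 0.5) -> float:
    """Snap a duration hint to the model's ~0.5 s control granularity.

    The 0.5 s step matches the precision stated above; the rounding
    scheme itself is an illustrative assumption."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return round(seconds / step) * step
```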
`cli.py` covers both backends, two batch modes, and an optional duration hint (single-text only).

## ⚡ Quickstart (Linux)
### 1) Environment
Use the provided script (installs PyTorch, vLLM 0.12.0 for CUDA 12.8, the requirements, and dualcodec):
```
bash create_env.sh
source .venv/bin/activate
```
Notes:
- `create_env.sh` uses `uv venv` with Python 3.12; adjust if needed.
- The vLLM install targets CUDA 12.8 (`--torch-backend=cu128`); adapt it to your CUDA toolkit.

### 2) Checkpoints
Fetch the required models (from a Hugging Face mirror by default):
```
bash download_checkpoints.sh
```
Default paths (override via CLI flags if you store them elsewhere):
- SoundStorm: `checkpoints/viitor/soundstorm`
- DualCodec: `checkpoints/dualcodec`
- wav2vec: `checkpoints/w2v`
- LLM: `checkpoints/viitor/llm/zh-en`
## 🎯 Demo usage
### 🖥️ Gradio demo
Launch a web UI (served on `0.0.0.0`, Gradio share disabled):
```
python gradio_demo.py \
    --soundstorm-model-path checkpoints/viitor/soundstorm \
    --dualcodec-model-path checkpoints/dualcodec \
    --w2v-path checkpoints/w2v \
    --llm-model-path checkpoints/viitor/llm/zh-en \
    --server-port 7860
```
Upload a prompt audio file in the UI, type the text, optionally set a duration (in seconds), then click "Synthesize" to preview the generated audio.
Toggle "Enable two-pass speaker refinement (prompt + generated speech)" to reduce accent leakage; this helps cross-language cloning when you want less of the source accent.
### 💻 CLI demo
Base command (transformers backend with default checkpoints):
```
python cli.py \
    --prompt /path/to/prompt.wav \
    --text "Hello ViiTorVoice!" \
    --output outputs/out.wav
```
Common flags:
- `--use-vllm`: switch to the vLLM backend.
- `--duration <seconds>`: duration hint; honored only when exactly one text is given.
- `--speaker-windowed`: enable two-pass speaker refinement (averages the prompt embedding with a generated-speech embedding; reduces accent leakage, useful for cross-language cloning).
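The two-pass refinement behind `--speaker-windowed` amounts to averaging two speaker embeddings and renormalizing. A minimal sketch of that idea (an illustration only, not the repo's actual implementation):

```python
import math

def refine_speaker_embedding(prompt_emb: list[float],
                             generated_emb: list[float]) -> list[float]:
    """Average the prompt-speaker embedding with one extracted from a
    first-pass generation, then renormalize to unit length."""
    avg = [(p + g) / 2.0 for p, g in zip(prompt_emb, generated_emb)]
    norm = math.sqrt(sum(x * x for x in avg))
    if norm == 0.0:
        raise ValueError("embeddings cancelled out; cannot normalize")
    return [x / norm for x in avg]
```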
### 🧪 Cases
1) Single inference (transformers)
```
python cli.py \
    --prompt data/prompt.wav \
    --text "Welcome to ViiTorVoice." \
    --output outputs/single.wav
```

2) vLLM backend
```
python cli.py \
    --use-vllm \
    --prompt data/prompt.wav \
    --text "This runs with vLLM." \
    --output outputs/vllm.wav
```

3) Duration hint (single text only)
```
python cli.py \
    --prompt data/prompt.wav \
    --text "Keep this around three seconds." \
    --duration 3.0 \
    --output outputs/with_duration.wav
```
4) Batch: prompts and texts paired 1:1
```
python cli.py \
    --prompt data/p1.wav data/p2.wav \
    --text "First line" "Second line" \
    --output outputs/pair_batch/
```
Prompts and texts are paired by order; outputs are auto-named inside the directory.

5) Batch: one prompt, many texts
```
python cli.py \
    --prompt data/prompt.wav \
    --text "Line 1" "Line 2" "Line 3" \
    --output outputs/multi_text_batch/
```
Generates multiple files, auto-named `000_prompt_t0.wav`, etc.
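The auto-naming suggested by `000_prompt_t0.wav` looks like batch index + prompt stem + text index. A hedged sketch of that pattern (the exact format string used by `cli.py` is an assumption):

```python
from pathlib import Path

def batch_output_names(prompts: list[str], texts: list[str], out_dir: str) -> list[str]:
    """Generate output names like 000_prompt_t0.wav: zero-padded batch
    index, prompt-file stem, then text index."""
    names = []
    for i, prompt in enumerate(prompts):
        stem = Path(prompt).stem
        for j in range(len(texts)):
            names.append(str(Path(out_dir) / f"{i:03d}_{stem}_t{j}.wav"))
    return names
```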
### 📣 Output log
```
Saved -> path | text='...' | prompt='...' | set/predicted duration=3.00s | actual duration=2.95s
```
- `set/predicted duration`: the duration you provided, or the model-predicted value if none was given.
- `actual duration`: measured from the generated audio.
## 🧭 Tips
- Make sure your CUDA driver/toolkit matches the PyTorch/vLLM build; edit `create_env.sh` if you need a different CUDA wheel.
- vLLM prefers generous GPU memory; fall back to the transformers backend if memory is constrained.
- Keep duration hints within a reasonable range; extreme values can produce abnormal audio.

## 📌 TODO
- ✅ Open-source Chinese/English base model
- ✅ Inference code (this repo and the demos)
- ⏳ SoundStorm training recipe
- ⏳ LLM training recipe
- ✅ Gradio demo
- ⏳ Emotion-control LoRA
- ⏳ Japanese, Korean, and Cantonese model weights
- ⏳ Flow matching–based semantic-to-wav module

## 🙌 Acknowledgments
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
- [Amphion](https://github.com/open-mmlab/Amphion)
- [soundstorm-pytorch](https://github.com/lucidrains/soundstorm-pytorch)
- [IndexTTS](https://github.com/index-tts/index-tts)

## 🌟 Product
Official site: [ViiTor AI](https://www.viitor.com/)
README_zh.md ADDED
@@ -0,0 +1,142 @@