<h1 align="center">🚀 ViiTor Voice TTS</h1>

<p align="center">Fast, flexible speech cloning with transformers or vLLM — batch-friendly and duration-aware.</p>

<p align="center"><a href="README_zh.md">中文文档</a> · <a href="https://viitor-ai.github.io/viitor-voice/">Demo page</a> · <a href="https://github.com/viitor-ai/viitor-voice/">GitHub</a> · <a href="https://huggingface.co/ZzWater/ViiTor-voice-2.0-base">Hugging Face</a></p>

## 🍀 What it is

ViiTor Voice is a three-stage speech cloning stack:

- Stage 1: prompt + text → semantic tokens.
- Stage 2: prompt acoustic/semantic + predicted semantic → predicted acoustic tokens.
- Stage 3: acoustic tokens → waveform.

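The three stages compose sequentially. As a mental model only (the function names and token formats below are illustrative stand-ins, not the repo's API):

```python
def text_to_semantic(prompt_audio: str, text: str) -> list[str]:
    # Stage 1 (LLM): prompt + text -> semantic tokens.
    return [f"sem:{word}" for word in text.split()]

def semantic_to_acoustic(prompt_audio: str, semantic: list[str]) -> list[str]:
    # Stage 2 (SoundStorm-style): prompt acoustic/semantic + predicted
    # semantic -> predicted acoustic tokens.
    return [token.replace("sem:", "ac:") for token in semantic]

def acoustic_to_waveform(acoustic: list[str]) -> list[float]:
    # Stage 3 (codec decoder): acoustic tokens -> waveform samples.
    return [0.0] * (len(acoustic) * 4)

def clone(prompt_audio: str, text: str) -> list[float]:
    semantic = text_to_semantic(prompt_audio, text)
    acoustic = semantic_to_acoustic(prompt_audio, semantic)
    return acoustic_to_waveform(acoustic)
```
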
## ✨ Why it shines

- **Text-free prompts**: stronger cross-lingual cloning with less ASR dependency; raw, untranscribed prompts are welcome.
- **Similarity boost**: InfoNCE plus a condition encoder act as a similarity constraint, staying robust even with noisy or background-heavy prompts.
- **Built-in duration control**: duration prediction lives in the LLM trunk; you can force a target duration with roughly 0.5 s precision.
- **LoRA-based emotion control**: plug in LoRA adapters to steer emotion and style without full finetuning.

`cli.py` covers both backends, two batch modes, and an optional duration hint (single-text only).

## ⚡ Quickstart (Linux)

### 1) Environment

Use the provided script, which installs PyTorch, vLLM 0.12.0 (CUDA 12.8), the Python requirements, and dualcodec:

```
bash create_env.sh
source .venv/bin/activate
```

Notes:

- `create_env.sh` creates the environment with `uv venv` on Python 3.12; adjust if needed.
- The vLLM install targets CUDA 12.8 (`--torch-backend=cu128`); adapt it to your CUDA toolkit.

### 2) Checkpoints

Fetch the required models (from the Hugging Face mirror by default):

```
bash download_checkpoints.sh
```

Default paths (override via CLI flags if you store them elsewhere):

- SoundStorm: `checkpoints/viitor/soundstorm`
- DualCodec: `checkpoints/dualcodec`
- wav2vec: `checkpoints/w2v`
- LLM: `checkpoints/viitor/llm/zh-en`

## 🎯 Demo usage

### 🖥️ Gradio demo

Launch the web UI (hosted on `0.0.0.0`, Gradio share disabled):

```
python gradio_demo.py \
    --soundstorm-model-path checkpoints/viitor/soundstorm \
    --dualcodec-model-path checkpoints/dualcodec \
    --w2v-path checkpoints/w2v \
    --llm-model-path checkpoints/viitor/llm/zh-en \
    --server-port 7860
```

Upload a prompt audio file in the UI, type your text, optionally set a duration in seconds, then click "Synthesize" to preview the generated audio.

Toggle "Enable two-pass speaker refinement (prompt + generated speech)" to reduce accent leakage; this helps cross-language cloning when you want less of the source accent.

### 💻 CLI demo

Base command (transformers backend with the default checkpoints):

```
python cli.py \
    --prompt /path/to/prompt.wav \
    --text "Hello ViiTorVoice!" \
    --output outputs/out.wav
```

Common flags:

- `--use-vllm`: switch to the vLLM backend.
- `--duration <seconds>`: duration hint; honored only when exactly one text is given.
- `--speaker-windowed`: enable two-pass speaker refinement (averages the prompt embedding with the generated-speech embedding; reduces accent leakage, useful for cross-language cloning).

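The `--speaker-windowed` behavior can be pictured as a two-pass loop. `extract_embedding` and `synthesize` below are hypothetical stand-ins for the real pipeline, so treat this purely as a sketch of the averaging idea:

```python
def extract_embedding(audio: str) -> list[float]:
    # Hypothetical speaker-embedding extractor (tiny fake vector here).
    return [float(len(audio)), 1.0]

def synthesize(text: str, speaker_embedding: list[float]) -> str:
    # Hypothetical synthesis call standing in for the real TTS stack.
    return f"wav[{text}|{speaker_embedding}]"

def two_pass_clone(prompt_audio: str, text: str) -> str:
    # Pass 1: clone directly from the prompt's speaker embedding.
    prompt_emb = extract_embedding(prompt_audio)
    first_pass = synthesize(text, prompt_emb)
    # Pass 2: average the prompt embedding with the generated-speech
    # embedding and re-synthesize; this is what dilutes source-accent cues.
    generated_emb = extract_embedding(first_pass)
    averaged = [(p + g) / 2 for p, g in zip(prompt_emb, generated_emb)]
    return synthesize(text, averaged)
```
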
### 🧪 Cases

1) Single inference (transformers)

```
python cli.py \
    --prompt data/prompt.wav \
    --text "Welcome to ViiTorVoice." \
    --output outputs/single.wav
```

2) vLLM backend

```
python cli.py \
    --use-vllm \
    --prompt data/prompt.wav \
    --text "This runs with vLLM." \
    --output outputs/vllm.wav
```

3) Duration hint (single text)

```
python cli.py \
    --prompt data/prompt.wav \
    --text "Keep this around three seconds." \
    --duration 3.0 \
    --output outputs/with_duration.wav
```

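To check how closely a generated file honors the hint, its actual duration can be measured with Python's standard `wave` module (a self-contained sketch; it writes a one-second silent file just to demonstrate):

```python
import wave

def wav_duration(path: str) -> float:
    # Duration in seconds = number of frames / sample rate.
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Demo: write one second of 16 kHz, 16-bit mono silence, then measure it.
with wave.open("silence.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

print(wav_duration("silence.wav"))  # 1.0
```
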
4) Batch: prompts and texts 1:1

```
python cli.py \
    --prompt data/p1.wav data/p2.wav \
    --text "First line" "Second line" \
    --output outputs/pair_batch/
```

Prompts and texts are paired by order; outputs are auto-named inside the directory.

5) Batch: one prompt, many texts

```
python cli.py \
    --prompt data/prompt.wav \
    --text "Line 1" "Line 2" "Line 3" \
    --output outputs/multi_text_batch/
```

Generates multiple files, auto-named `000_prompt_t0.wav`, etc.

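Both batch modes reduce to a simple pairing rule. The helper below is an illustrative sketch, not `cli.py` itself; only the `000_prompt_t0.wav` naming pattern is taken from the actual output, the rest is assumed:

```python
from pathlib import Path

def pair_jobs(prompts: list[str], texts: list[str]) -> list[tuple[str, str, str]]:
    # One prompt is broadcast to every text; otherwise pair 1:1 by order.
    if len(prompts) == 1:
        prompts = prompts * len(texts)
    if len(prompts) != len(texts):
        raise ValueError("need one prompt, or exactly one prompt per text")
    jobs = []
    for i, (prompt, text) in enumerate(zip(prompts, texts)):
        # Assumed naming scheme, e.g. 000_prompt_t0.wav for the first text.
        name = f"{i:03d}_{Path(prompt).stem}_t{i}.wav"
        jobs.append((prompt, text, name))
    return jobs
```
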
### 📣 Output log

```
Saved -> path | text='...' | prompt='...' | set/predicted duration=3.00s | actual duration=2.95s
```

- `set/predicted duration`: the duration you provided (or the model's prediction if none was given).
- `actual duration`: measured from the generated audio.

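When scripting around the CLI, the log line format above is easy to pick apart; a minimal sketch assuming exactly the format shown:

```python
import re

LOG_PATTERN = re.compile(
    r"Saved -> (?P<path>\S+) \| text='(?P<text>.*?)' \| prompt='(?P<prompt>.*?)'"
    r" \| set/predicted duration=(?P<target>[\d.]+)s"
    r" \| actual duration=(?P<actual>[\d.]+)s"
)

line = (
    "Saved -> outputs/out.wav | text='Hello ViiTorVoice!' | prompt='data/prompt.wav'"
    " | set/predicted duration=3.00s | actual duration=2.95s"
)

match = LOG_PATTERN.match(line)
assert match is not None
# Drift between the requested/predicted duration and the measured one.
drift = round(abs(float(match["target"]) - float(match["actual"])), 2)
print(match["path"], drift)  # outputs/out.wav 0.05
```
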
## 🧭 Tips

- Make sure your CUDA driver/toolkit matches the PyTorch/vLLM build; edit `create_env.sh` if you need a different CUDA wheel.
- vLLM prefers generous GPU memory; fall back to the transformers backend if you are constrained.
- Keep duration hints reasonable; extreme values can produce abnormal audio.

## 📌 TODO

- ✅ Open-sourced Chinese/English base model
- ✅ Inference code (this repo and demo)
- ⏳ SoundStorm training recipe
- ⏳ LLM training recipe
- ✅ Gradio demo
- ⏳ Emotion-control LoRA
- ⏳ Japanese, Korean, Cantonese model weights
- ⏳ Flow-matching-based semantic-to-wav module

## 🙌 Acknowledgments

- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
- [Amphion](https://github.com/open-mmlab/Amphion)
- [soundstorm-pytorch](https://github.com/lucidrains/soundstorm-pytorch)
- [IndexTTS](https://github.com/index-tts/index-tts)

## 🌟 Product

Official site: [ViiTor AI](https://www.viitor.com/)