Update README.md

README.md CHANGED

@@ -45,9 +45,7 @@ We have included the following pre-trained models at Amphion:
The training data includes:

- **Emilia-101k**: about 101k hours of speech data
-
- **Sing-0.4k**: about 400 hours of open-source singing voice data as follows:
-
| Dataset Name | \#Hours |
| ------------ | --------- |
| ACESinger | 320.6 |

@@ -58,62 +56,323 @@ The training data includes:
| Opencpop | 5.1 |
| CSD | 3.8 |
| **Total** | **438.9** |
-
- **SingNet-7k**: about 7,000 hours of internal singing voice data, preprocessed using the [SingNet pipeline](https://openreview.net/pdf?id=X6ffdf6nh3). SingNet-3k is a 3,000-hour subset of SingNet-7k.

-##
-
-
-
-
-
-
-

-### Clone and Environment Setup
-
-#### 1. Clone the repository
-
-```bash
-git clone https://github.com/open-mmlab/Amphion.git
-cd Amphion
-```
-
-#### 2. Install the environment
-
-Before you start installing, make sure you are in the `Amphion` directory. If not, use `cd` to enter it.
-
-Since we use `phonemizer` to convert text to phonemes, you need to install `espeak-ng` first. More details can be found [here](https://bootphon.github.io/phonemizer/install.html). Choose the installation command that matches your operating system:
-
-```bash
-# For Debian-like distributions (e.g., Ubuntu, Mint)
-sudo apt-get install espeak-ng
-# For RedHat-like distributions (e.g., CentOS, Fedora)
-sudo yum install espeak-ng
-```
-
-Now we are going to install the environment. We recommend using conda:
-
-```bash
-conda create -n vevo python=3.10
-conda activate vevo
-
-pip install -r models/vc/vevo/requirements.txt
-```
-
-### Inference Script
-
-```sh
-# FM model only (i.e., timbre control; usually for VC and SVC)
-python -m models.svc.vevosing.infer_vevosing_fm
-
-# AR + FM (i.e., text, prosody, and style control)
-python -m models.svc.vevosing.infer_vevosing_ar
-```
-
-Running either script automatically downloads the pretrained models from HuggingFace and starts inference. The generated audio files are saved to `models/svc/vevosing/output/*.wav` by default.
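
Note that the Usage section added below also downloads each checkpoint on first run. On machines where runtime downloads are inconvenient, the checkpoints can be fetched ahead of time; here is a minimal editorial sketch using `huggingface_hub`, mirroring the `snapshot_download` calls in the added code (the broad `allow_patterns` wildcards are an assumption that covers a superset of the files those calls request):

```python
# Pre-fetch all Vevo1.5 components into the same cache the inference
# code below uses. The wildcard patterns are an assumption: they cover
# the tokenizer, AR, flow-matching, and vocoder checkpoints it requests.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="amphion/Vevo1.5",
    repo_type="model",
    cache_dir="./ckpts/Vevo1.5",
    allow_patterns=[
        "tokenizer/*",
        "contentstyle_modeling/*",
        "acoustic_modeling/*",
    ],
)
```
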
+## Usage
+
+You can refer to our [recipe](https://github.com/open-mmlab/Amphion/blob/vevosing/models/svc/vevosing/README.md) on GitHub for more usage details. For example, to use Vevo1.5 after cloning the Amphion repository, you can run a script like the following:
+
+```python
+import os
+
+import torch
+from huggingface_hub import snapshot_download
+
+from models.svc.vevosing.vevosing_utils import *
+
+
+def vevosing_tts(
+    tgt_text,
+    ref_wav_path,
+    ref_text=None,
+    timbre_ref_wav_path=None,
+    output_path=None,
+    src_language="en",
+    ref_language="en",
+):
+    if timbre_ref_wav_path is None:
+        timbre_ref_wav_path = ref_wav_path
+
+    gen_audio = inference_pipeline.inference_ar_and_fm(
+        task="synthesis",
+        src_wav_path=None,
+        src_text=tgt_text,
+        style_ref_wav_path=ref_wav_path,
+        timbre_ref_wav_path=timbre_ref_wav_path,
+        style_ref_wav_text=ref_text,
+        src_text_language=src_language,
+        style_ref_wav_text_language=ref_language,
+    )
+
+    assert output_path is not None
+    save_audio(gen_audio, output_path=output_path)
+
+
+def vevosing_editing(
+    tgt_text,
+    raw_wav_path,
+    raw_text=None,
+    output_path=None,
+    raw_language="en",
+    tgt_language="en",
+):
+    gen_audio = inference_pipeline.inference_ar_and_fm(
+        task="recognition-synthesis",
+        src_wav_path=raw_wav_path,
+        src_text=tgt_text,
+        style_ref_wav_path=raw_wav_path,
+        style_ref_wav_text=raw_text,
+        src_text_language=tgt_language,
+        style_ref_wav_text_language=raw_language,
+        timbre_ref_wav_path=raw_wav_path,  # keep the timbre of the raw wav
+        use_style_tokens_as_ar_input=True,  # use the prosody code of the raw wav
+    )
+
+    assert output_path is not None
+    save_audio(gen_audio, output_path=output_path)
+
+
+def vevosing_singing_style_conversion(
+    raw_wav_path,
+    style_ref_wav_path,
+    output_path=None,
+    raw_text=None,
+    style_ref_text=None,
+    raw_language="en",
+    style_ref_language="en",
+):
+    gen_audio = inference_pipeline.inference_ar_and_fm(
+        task="recognition-synthesis",
+        src_wav_path=raw_wav_path,
+        src_text=raw_text,
+        style_ref_wav_path=style_ref_wav_path,
+        style_ref_wav_text=style_ref_text,
+        src_text_language=raw_language,
+        style_ref_wav_text_language=style_ref_language,
+        timbre_ref_wav_path=raw_wav_path,  # keep the timbre of the raw wav
+        use_style_tokens_as_ar_input=True,  # use the prosody code of the raw wav
+    )
+
+    assert output_path is not None
+    save_audio(gen_audio, output_path=output_path)
+
+
+def vevosing_melody_control(
+    tgt_text,
+    tgt_melody_wav_path,
+    output_path=None,
+    style_ref_wav_path=None,
+    style_ref_text=None,
+    timbre_ref_wav_path=None,
+    tgt_language="en",
+    style_ref_language="en",
+):
+    gen_audio = inference_pipeline.inference_ar_and_fm(
+        task="recognition-synthesis",
+        src_wav_path=tgt_melody_wav_path,
+        src_text=tgt_text,
+        style_ref_wav_path=style_ref_wav_path,
+        style_ref_wav_text=style_ref_text,
+        src_text_language=tgt_language,
+        style_ref_wav_text_language=style_ref_language,
+        timbre_ref_wav_path=timbre_ref_wav_path,
+        use_style_tokens_as_ar_input=True,  # use the prosody code
+    )
+
+    assert output_path is not None
+    save_audio(gen_audio, output_path=output_path)
+
+
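# (Editorial note, inferred from the checkpoint names handled below: the
# pipeline chains a prosody tokenizer (6.25 Hz codes), a content-style
# tokenizer (12.5 Hz codes), an autoregressive transformer over content-style
# tokens, and a flow-matching transformer plus vocoder for the waveform.)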
+def load_inference_pipeline():
+    # ===== Device =====
+    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+
+    # ===== Prosody Tokenizer =====
+    local_dir = snapshot_download(
+        repo_id="amphion/Vevo1.5",
+        repo_type="model",
+        cache_dir="./ckpts/Vevo1.5",
+        allow_patterns=["tokenizer/prosody_fvq512_6.25hz/*"],
+    )
+    prosody_tokenizer_ckpt_path = os.path.join(
+        local_dir, "tokenizer/prosody_fvq512_6.25hz"
+    )
+
+    # ===== Content-Style Tokenizer =====
+    local_dir = snapshot_download(
+        repo_id="amphion/Vevo1.5",
+        repo_type="model",
+        cache_dir="./ckpts/Vevo1.5",
+        allow_patterns=["tokenizer/contentstyle_fvq16384_12.5hz/*"],
+    )
+    contentstyle_tokenizer_ckpt_path = os.path.join(
+        local_dir, "tokenizer/contentstyle_fvq16384_12.5hz"
+    )
+
+    # ===== Autoregressive Transformer =====
+    model_name = "ar_emilia101k_singnet7k"
+
+    local_dir = snapshot_download(
+        repo_id="amphion/Vevo1.5",
+        repo_type="model",
+        cache_dir="./ckpts/Vevo1.5",
+        allow_patterns=[f"contentstyle_modeling/{model_name}/*"],
+    )
+
+    ar_cfg_path = f"./models/svc/vevosing/config/{model_name}.json"
+    ar_ckpt_path = os.path.join(
+        local_dir,
+        f"contentstyle_modeling/{model_name}",
+    )
+
+    # ===== Flow Matching Transformer =====
+    model_name = "fm_emilia101k_singnet7k"
+
+    local_dir = snapshot_download(
+        repo_id="amphion/Vevo1.5",
+        repo_type="model",
+        cache_dir="./ckpts/Vevo1.5",
+        allow_patterns=[f"acoustic_modeling/{model_name}/*"],
+    )
+
+    fmt_cfg_path = f"./models/svc/vevosing/config/{model_name}.json"
+    fmt_ckpt_path = os.path.join(local_dir, f"acoustic_modeling/{model_name}")
+
+    # ===== Vocoder =====
+    local_dir = snapshot_download(
+        repo_id="amphion/Vevo1.5",
+        repo_type="model",
+        cache_dir="./ckpts/Vevo1.5",
+        allow_patterns=["acoustic_modeling/Vocoder/*"],
+    )
+
+    vocoder_cfg_path = "./models/svc/vevosing/config/vocoder.json"
+    vocoder_ckpt_path = os.path.join(local_dir, "acoustic_modeling/Vocoder")
+
+    # ===== Inference =====
+    inference_pipeline = VevosingInferencePipeline(
+        prosody_tokenizer_ckpt_path=prosody_tokenizer_ckpt_path,
+        content_style_tokenizer_ckpt_path=contentstyle_tokenizer_ckpt_path,
+        ar_cfg_path=ar_cfg_path,
+        ar_ckpt_path=ar_ckpt_path,
+        fmt_cfg_path=fmt_cfg_path,
+        fmt_ckpt_path=fmt_ckpt_path,
+        vocoder_cfg_path=vocoder_cfg_path,
+        vocoder_ckpt_path=vocoder_ckpt_path,
+        device=device,
+    )
+    return inference_pipeline
+
+
+if __name__ == "__main__":
+    inference_pipeline = load_inference_pipeline()
+
+    output_dir = "./models/svc/vevosing/output"
+    os.makedirs(output_dir, exist_ok=True)
+
+    ### Zero-shot Text-to-Speech and Text-to-Singing ###
+    tgt_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."
+    ref_wav_path = "./models/vc/vevo/wav/arabic_male.wav"
+    ref_text = "Flip stood undecided, his ears strained to catch the slightest sound."
+
+    jaychou_path = "./models/svc/vevosing/wav/jaychou.wav"
+    jaychou_text = (
+        "对这个世界如果你有太多的抱怨,跌倒了就不该继续往前走,为什么,人要这么的脆弱堕"
+    )
+    taiyizhenren_path = "./models/svc/vevosing/wav/taiyizhenren.wav"
+    taiyizhenren_text = (
+        "对,这就是我,万人敬仰的太乙真人。虽然有点婴儿肥,但也掩不住我,逼人的帅气。"
+    )
+
+    # the style reference and the timbre reference are the same
+    vevosing_tts(
+        tgt_text=tgt_text,
+        ref_wav_path=ref_wav_path,
+        timbre_ref_wav_path=ref_wav_path,
+        output_path=os.path.join(output_dir, "zstts.wav"),
+        ref_text=ref_text,
+        src_language="en",
+        ref_language="en",
+    )
+
+    # the style reference and the timbre reference are different
+    vevosing_tts(
+        tgt_text=tgt_text,
+        ref_wav_path=ref_wav_path,
+        timbre_ref_wav_path=jaychou_path,
+        output_path=os.path.join(output_dir, "zstts_disentangled.wav"),
+        ref_text=ref_text,
+        src_language="en",
+        ref_language="en",
+    )
+
+    # the style reference is a singing voice
+    vevosing_tts(
+        tgt_text="顿时,气氛变得沉郁起来。乍看之下,一切的困扰仿佛都围绕在我身边。我皱着眉头,感受着那份压力,但我知道我不能放弃,不能认输。于是,我深吸一口气,心底的声音告诉我:“无论如何,都要冷静下来,重新开始。”",
+        ref_wav_path=jaychou_path,
+        ref_text=jaychou_text,
+        timbre_ref_wav_path=taiyizhenren_path,
+        output_path=os.path.join(output_dir, "zstts_singing.wav"),
+        src_language="zh",
+        ref_language="zh",
+    )
+
+    ### Zero-shot Singing Editing ###
+    adele_path = "./models/svc/vevosing/wav/adele.wav"
+    adele_text = "Never mind, I'll find someone like you. I wish nothing but."
+
+    vevosing_editing(
+        tgt_text="Never mind, you'll find anyone like me. You wish nothing but.",
+        raw_wav_path=adele_path,
+        raw_text=adele_text,  # "Never mind, I'll find someone like you. I wish nothing but."
+        output_path=os.path.join(output_dir, "editing_adele.wav"),
+        raw_language="en",
+        tgt_language="en",
+    )
+
+    vevosing_editing(
+        tgt_text="对你的人生如果你有太多的期盼,跌倒了就不该低头认输,为什么啊,人要这么的彷徨堕",
+        raw_wav_path=jaychou_path,
+        raw_text=jaychou_text,  # "对这个世界如果你有太多的抱怨,跌倒了就不该继续往前走,为什么,人要这么的脆弱堕"
+        output_path=os.path.join(output_dir, "editing_jaychou.wav"),
+        raw_language="zh",
+        tgt_language="zh",
+    )
+
+    ### Zero-shot Singing Style Conversion ###
+    breathy_path = "./models/svc/vevosing/wav/breathy.wav"
+    breathy_text = "离别没说再见你是否心酸"
+
+    vibrato_path = "./models/svc/vevosing/wav/vibrato.wav"
+    vibrato_text = "玫瑰的红,容易受伤的梦,握在手中却流失于指缝"
+
+    vevosing_singing_style_conversion(
+        raw_wav_path=breathy_path,
+        raw_text=breathy_text,
+        style_ref_wav_path=vibrato_path,
+        style_ref_text=vibrato_text,
+        output_path=os.path.join(output_dir, "ssc_breathy2vibrato.wav"),
+        raw_language="zh",
+        style_ref_language="zh",
+    )
+
+    ### Melody Control for Singing Synthesis ###
+    humming_path = "./models/svc/vevosing/wav/humming.wav"
+    piano_path = "./models/svc/vevosing/wav/piano.wav"
+
+    # Humming to control the melody
+    vevosing_melody_control(
+        tgt_text="你是我的小呀小苹果,怎么爱,不嫌多",
+        tgt_melody_wav_path=humming_path,
+        output_path=os.path.join(output_dir, "melody_humming.wav"),
+        style_ref_wav_path=taiyizhenren_path,
+        style_ref_text=taiyizhenren_text,
+        timbre_ref_wav_path=taiyizhenren_path,
+        tgt_language="zh",
+        style_ref_language="zh",
+    )
+
+    # Piano to control the melody
+    vevosing_melody_control(
+        tgt_text="你是我的小呀小苹果,怎么爱,不嫌多",
+        tgt_melody_wav_path=piano_path,
+        output_path=os.path.join(output_dir, "melody_piano.wav"),
+        style_ref_wav_path=taiyizhenren_path,
+        style_ref_text=taiyizhenren_text,
+        timbre_ref_wav_path=taiyizhenren_path,
+        tgt_language="zh",
+        style_ref_language="zh",
+    )
+```
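
To adapt the added script to your own recordings, only the reference paths need to change. A minimal editorial sketch (the `my_*.wav` paths and the transcript string are hypothetical placeholders; `load_inference_pipeline` and `vevosing_tts` are the helpers defined above):

```python
# Appended after the script above. The "my_*.wav" paths and the
# transcript string are hypothetical placeholders for your own data.
inference_pipeline = load_inference_pipeline()  # the helpers read this global

vevosing_tts(
    tgt_text="Shall I compare thee to a summer's day?",
    ref_wav_path="./my_style_reference.wav",  # controls prosody and style
    ref_text="transcript of my_style_reference.wav",
    timbre_ref_wav_path="./my_timbre_reference.wav",  # controls voice identity
    output_path="./my_tts_output.wav",
    src_language="en",
    ref_language="en",
)
```
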
## Citations

If you find this work useful for your research, please cite our paper: