---
library_name: transformers
language:
- en
license: mit
---

# SylReg-Decoder Base

SylReg-Decoder Base is a flow-matching-based decoder with a BigVGAN-v2 vocoder that synthesizes English speech waveforms from discrete syllabic units.

## Model Details

### Model Description

- **Model type:** Flow-matching-based Diffusion Transformer (DiT) with a BigVGAN-v2 vocoder
- **Language(s) (NLP):** English
- **License:** MIT

### Model Sources

- **Repository:** [Code](https://github.com/ryota-komatsu/speaker_disentangled_hubert)
- **Demo:** [Project page](https://ryota-komatsu.github.io/speaker_disentangled_hubert)

## How to Get Started with the Model

Use the code below to get started with the model.

```sh
git clone https://github.com/ryota-komatsu/speaker_disentangled_hubert.git
cd speaker_disentangled_hubert

sudo apt install git-lfs  # for UTMOS

conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10.19 pip=24.0 faiss-gpu=1.12.0
conda activate py310
pip install -r requirements/requirements.txt

sh scripts/setup.sh
```

```python
import torch
import torchaudio

from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert.models.sylreg import SylRegForSyllableDiscovery

wav_path = "/path/to/wav"

# download pretrained models from the Hugging Face Hub
encoder = SylRegForSyllableDiscovery.from_pretrained("ryota-komatsu/SylReg-Distill", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/SylReg-Decoder-Base", device_map="cuda")

# load a waveform and resample it to 16 kHz
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode the waveform into discrete syllabic units
outputs = encoder(waveform.to(encoder.device))
units = outputs[0]["units"]  # e.g. [3950, 67, ..., 503]

# unit-to-speech synthesis
generated_speech = decoder(units.unsqueeze(0)).waveform.cpu()
```

## Training Details

### Training Data

| Dataset | License | Provider |
| --- | --- | --- |
| [LibriTTS-R](https://www.openslr.org/141/) | CC BY 4.0 | Y. Koizumi *et al.* |

### Training Hyperparameters

- **Training regime:** fp16 mixed precision

## Hardware

2 x NVIDIA A6000 GPUs