metadata
library_name: transformers
license: cc-by-nc-sa-4.0
language:
- en
base_model:
- ryota-komatsu/sylreg-decoder-base
SylReg-Decoder
Model Details
Model Description
- Model type: Flow-matching-based Diffusion Transformer (DiT) paired with a BigVGAN-v2 vocoder
- Language(s) (NLP): English
- License: CC BY-NC-SA 4.0
- Finetuned from model: SylReg-Decoder Base
Model Sources
- Repository: https://github.com/ryota-komatsu/speaker_disentangled_hubert
- Demo: Project page
How to Get Started with the Model
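Set up the environment with the shell commands below, then run the Python example to encode a waveform into syllabic units and synthesize speech from them.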
git clone https://github.com/ryota-komatsu/speaker_disentangled_hubert.git
cd speaker_disentangled_hubert
sudo apt install git-lfs # for UTMOS
conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10.19 pip=24.0 faiss-gpu=1.12.0
conda activate py310
pip install -r requirements/requirements.txt
sh scripts/setup.sh
import torch
import torchaudio
from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert.models.sylreg import SylRegForSyllableDiscovery
wav_path = "/path/to/wav"
# download pretrained models from the Hugging Face Hub
encoder = SylRegForSyllableDiscovery.from_pretrained("ryota-komatsu/SylReg-Distill", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/SylReg-Decoder", device_map="cuda")
# load a waveform and resample it to 16 kHz
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)
# encode a waveform into syllabic units
outputs = encoder(waveform.to(encoder.device))
units = outputs[0]["units"] # [3950, 67, ..., 503]
# unit-to-speech synthesis
generated_speech = decoder(units.unsqueeze(0)).waveform.cpu()
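The example above leaves `generated_speech` in memory. The following is a minimal sketch for writing it to disk using only the Python standard library. It assumes the waveform has been flattened to a list of floats in [-1, 1] and that you know the decoder's output sampling rate. This card does not state that rate, so `sample_rate` is a required argument with no default. `save_pcm16_wav` is a hypothetical helper for illustration and is not part of the repository.

```python
import struct
import wave


def save_pcm16_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper for illustration; not part of the repository.
    Returns the number of samples written.
    """
    # Clip each sample to the valid range, then scale to signed 16-bit integers.
    pcm = [int(round(max(-1.0, min(1.0, float(s))) * 32767)) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)  # must match the decoder's output rate
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)
```

Assuming `generated_speech` holds a single utterance, a call might look like `save_pcm16_wav("generated.wav", generated_speech.squeeze().tolist(), sample_rate)`, where `sample_rate` is taken from the decoder's configuration. If a torchaudio audio backend is available, `torchaudio.save` is an alternative that accepts the tensor directly.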
Training Details
Training Data
| Dataset | License | Provider |
|---|---|---|
| LibriTTS-R | CC BY 4.0 | Y. Koizumi et al. |
| Hi-Fi-CAPTAIN | CC BY-NC-SA 4.0 | T. Okamoto et al. |
Training Hyperparameters
- Training regime: fp16 mixed precision
Hardware
2 x NVIDIA RTX A6000 GPUs