---
library_name: transformers
license: cc-by-nc-sa-4.0
language:
- en
base_model:
- ryota-komatsu/sylreg-decoder-base
---

# SylReg-Decoder

## Model Details

### Model Description

- **Model type:** Flow-matching-based Diffusion Transformer (DiT) with BigVGAN-v2
- **Language(s) (NLP):** English
- **License:** CC BY-NC-SA 4.0
- **Finetuned from model:** [SylReg-Decoder Base](https://huggingface.co/ryota-komatsu/sylreg-decoder-base)

### Model Sources

- **Repository:** [Code](https://github.com/ryota-komatsu/speaker_disentangled_hubert)
- **Demo:** [Project page](https://ryota-komatsu.github.io/speaker_disentangled_hubert)

## How to Get Started with the Model

Use the code below to get started with the model.

```sh
git clone https://github.com/ryota-komatsu/speaker_disentangled_hubert.git
cd speaker_disentangled_hubert

sudo apt install git-lfs  # for UTMOS

conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10.19 pip=24.0 faiss-gpu=1.12.0
conda activate py310
pip install -r requirements/requirements.txt

sh scripts/setup.sh
```

```python
import re

import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer

from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert.models.sylreg import SylRegForSyllableDiscovery

wav_path = "/path/to/wav"

# download pretrained models from hugging face hub
encoder = SylRegForSyllableDiscovery.from_pretrained("ryota-komatsu/SylReg-Distill", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/SylReg-Decoder", device_map="cuda")

# load a waveform
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode a waveform into syllabic units
outputs = encoder(waveform.to(encoder.device))
units = outputs[0]["units"]  # [3950, 67, ..., 503]

# unit-to-speech synthesis
generated_speech = decoder(units.unsqueeze(0)).waveform.cpu()
```

## Training Details

### Training Data

| Dataset | License | Provider |
| --- | --- | --- |
| [LibriTTS-R](https://www.openslr.org/141/) | CC BY 4.0 | Y. Koizumi *et al.* |
| [Hi-Fi-CAPTAIN](https://ast-astrec.nict.go.jp/en/release/hi-fi-captain/) | CC BY-NC-SA 4.0 | T. Okamoto *et al.* |

### Training Hyperparameters

- **Training regime:** fp16 mixed precision

## Hardware

2 x A6000