---
library_name: transformers
language:
- en
license: mit
---

# SylReg-Decoder Base

SylReg-Decoder Base is a flow-matching-based decoder with a BigVGAN-v2 vocoder that synthesizes English speech waveforms from discrete syllabic units.

## Model Details

### Model Description

- **Model type:** Flow-matching-based Diffusion Transformer (DiT) with a BigVGAN-v2 vocoder
- **Language(s) (NLP):** English
- **License:** MIT

### Model Sources

- **Repository:** [Code](https://github.com/ryota-komatsu/speaker_disentangled_hubert)
- **Demo:** [Project page](https://ryota-komatsu.github.io/speaker_disentangled_hubert)

## How to Get Started with the Model

Use the code below to get started with the model.

```sh
git clone https://github.com/ryota-komatsu/speaker_disentangled_hubert.git
cd speaker_disentangled_hubert

sudo apt install git-lfs  # for UTMOS

conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10.19 pip=24.0 faiss-gpu=1.12.0
conda activate py310
pip install -r requirements/requirements.txt

sh scripts/setup.sh
```

```python
import torch
import torchaudio

from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert.models.sylreg import SylRegForSyllableDiscovery

wav_path = "/path/to/wav"

# download pretrained models from the Hugging Face Hub
encoder = SylRegForSyllableDiscovery.from_pretrained("ryota-komatsu/SylReg-Distill", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/SylReg-Decoder-Base", device_map="cuda")

# load a waveform and resample it to 16 kHz
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode the waveform into discrete syllabic units
outputs = encoder(waveform.to(encoder.device))
units = outputs[0]["units"]  # e.g. [3950, 67, ..., 503]

# unit-to-speech synthesis
generated_speech = decoder(units.unsqueeze(0)).waveform.cpu()
```

## Training Details

### Training Data

| Dataset | License | Provider |
| --- | --- | --- |
| [LibriTTS-R](https://www.openslr.org/141/) | CC BY 4.0 | Y. Koizumi *et al.* |

### Training Hyperparameters

- **Training regime:** fp16 mixed precision

## Hardware

2 x NVIDIA A6000 GPUs