---
library_name: transformers
license: cc-by-nc-sa-4.0
language:
- en
base_model:
- ryota-komatsu/sylreg-decoder-base
---

# SylReg-Decoder

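SylReg-Decoder is a unit-to-speech synthesizer: it converts discrete syllabic units, such as those produced by the SylReg-Distill encoder, into speech waveforms using a flow-matching-based Diffusion Transformer (DiT) with a BigVGAN-v2 vocoder.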


## Model Details

### Model Description

- **Model type:** Flow-matching-based Diffusion Transformer (DiT) with BigVGAN-v2
- **Language(s) (NLP):** English
- **License:** CC BY-NC-SA 4.0
- **Finetuned from model:** [SylReg-Decoder Base](https://huggingface.co/ryota-komatsu/sylreg-decoder-base)

### Model Sources

- **Repository:** [Code](https://github.com/ryota-komatsu/speaker_disentangled_hubert)
- **Demo:** [Project page](https://ryota-komatsu.github.io/speaker_disentangled_hubert)

## How to Get Started with the Model

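Clone the repository and set up the environment with the shell commands below, then run the Python example that follows to encode a waveform into syllabic units and resynthesize speech from them.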

```sh
git clone https://github.com/ryota-komatsu/speaker_disentangled_hubert.git
cd speaker_disentangled_hubert

sudo apt install git-lfs  # for UTMOS

conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10.19 pip=24.0 faiss-gpu=1.12.0
conda activate py310
pip install -r requirements/requirements.txt

sh scripts/setup.sh
```

```python
import torch
import torchaudio

from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert.models.sylreg import SylRegForSyllableDiscovery

wav_path = "/path/to/wav"

# download pretrained models from hugging face hub
encoder = SylRegForSyllableDiscovery.from_pretrained("ryota-komatsu/SylReg-Distill", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/SylReg-Decoder", device_map="cuda")

# load a waveform
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode a waveform into syllabic units
outputs = encoder(waveform.to(encoder.device))
units = outputs[0]["units"]  # [3950, 67, ..., 503]

# unit-to-speech synthesis
generated_speech = decoder(units.unsqueeze(0)).waveform.cpu()
```
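Because the example above resamples the input to 16 kHz before encoding, it can help to inspect a file before running it. The following is a minimal sketch, not part of the repository: `describe_wav` and `TARGET_SR` are hypothetical names, and the helper assumes an uncompressed PCM WAV file readable by Python's standard `wave` module. The model card does not state whether the encoder requires mono input, so the channel count is only reported, not changed.

```python
import wave

# Sample rate the example above resamples to before encoding.
TARGET_SR = 16000


def describe_wav(path):
    """Return basic properties of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as f:
        sample_rate = f.getframerate()
        channels = f.getnchannels()
        num_frames = f.getnframes()
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_sec": num_frames / sample_rate,
        # True when torchaudio.functional.resample will actually change the signal.
        "needs_resample": sample_rate != TARGET_SR,
    }
```

Calling `describe_wav("/path/to/wav")` returns a dictionary with the sample rate, channel count, duration in seconds, and whether resampling to 16 kHz is needed.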

## Training Details

### Training Data

| Dataset | License | Provider |
| --- | --- | --- |
| [LibriTTS-R](https://www.openslr.org/141/) | CC BY 4.0 | Y. Koizumi *et al.* |
| [Hi-Fi-CAPTAIN](https://ast-astrec.nict.go.jp/en/release/hi-fi-captain/) | CC BY-NC-SA 4.0 | T. Okamoto *et al.* |

### Training Hyperparameters

- **Training regime:** fp16 mixed precision

## Hardware

2 × NVIDIA RTX A6000 GPUs