---
library_name: transformers
license: cc-by-nc-sa-4.0
language:
- en
base_model:
- ryota-komatsu/sylreg-decoder-base
---

# SylReg-Decoder

SylReg-Decoder synthesizes English speech from discrete syllabic units produced by a SylReg encoder: a flow-matching Diffusion Transformer (DiT) generates acoustic features from the units, and a BigVGAN-v2 vocoder converts them to a waveform.

## Model Details

### Model Description

- **Model type:** Flow-matching-based Diffusion Transformer (DiT) with BigVGAN-v2
- **Language(s) (NLP):** English
- **License:** CC BY-NC-SA 4.0
- **Finetuned from model:** [SylReg-Decoder Base](https://huggingface.co/ryota-komatsu/sylreg-decoder-base)
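
The model-type bullet above can be made concrete with the standard conditional flow-matching objective (a sketch for intuition only; the exact schedule, conditioning, and feature space used by this model are defined in the repository):

$$
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\;x_0 \sim \mathcal{N}(0, I),\;x_1}\big[\,\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \rVert^2\,\big],
$$

where $x_1$ is a target acoustic-feature sequence, $x_0$ is Gaussian noise, and $c$ is the syllabic-unit conditioning; at inference the learned velocity field $v_\theta$ is integrated from $t = 0$ to $t = 1$ to produce features for the BigVGAN-v2 vocoder.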

### Model Sources

- **Repository:** [Code](https://github.com/ryota-komatsu/speaker_disentangled_hubert)
- **Demo:** [Project page](https://ryota-komatsu.github.io/speaker_disentangled_hubert)

## How to Get Started with the Model

Use the code below to get started with the model.

```sh
git clone https://github.com/ryota-komatsu/speaker_disentangled_hubert.git
cd speaker_disentangled_hubert

sudo apt install git-lfs  # for UTMOS

conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10.19 pip=24.0 faiss-gpu=1.12.0
conda activate py310
pip install -r requirements/requirements.txt

sh scripts/setup.sh
```

```python
import torch
import torchaudio

from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert.models.sylreg import SylRegForSyllableDiscovery

wav_path = "/path/to/wav"

# download pretrained models from the Hugging Face Hub
encoder = SylRegForSyllableDiscovery.from_pretrained("ryota-komatsu/SylReg-Distill", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/SylReg-Decoder", device_map="cuda")

# load a waveform and resample to the encoder's 16 kHz input rate
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

with torch.inference_mode():  # no gradients needed at inference
    # encode a waveform into syllabic units
    outputs = encoder(waveform.to(encoder.device))
    units = outputs[0]["units"]  # [3950, 67, ..., 503]

    # unit-to-speech synthesis
    generated_speech = decoder(units.unsqueeze(0)).waveform.cpu()
```
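
When preparing inputs, it can help to know how many samples the resampling step will produce. Below is a minimal, model-free sketch (the helper name is ours, not part of the repository), assuming `torchaudio.functional.resample` returns `ceil(num_samples * new_sr / orig_sr)` samples:

```python
def resampled_length(num_samples: int, orig_sr: int, new_sr: int) -> int:
    """Expected sample count after resampling from orig_sr to new_sr."""
    # ceiling division without floats: ceil(a / b) == -(-a // b)
    return -(-num_samples * new_sr // orig_sr)

# e.g., one second of 44.1 kHz audio becomes exactly 16,000 samples at 16 kHz
print(resampled_length(44100, 44100, 16000))
```

This is handy for sanity-checking clip durations before encoding, since the syllabic-unit sequence length scales with the 16 kHz sample count.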

## Training Details

### Training Data

| Dataset | License | Provider |
| --- | --- | --- |
| [LibriTTS-R](https://www.openslr.org/141/) | CC BY 4.0 | Y. Koizumi *et al.* |
| [Hi-Fi-CAPTAIN](https://ast-astrec.nict.go.jp/en/release/hi-fi-captain/) | CC BY-NC-SA 4.0 | T. Okamoto *et al.* |

### Training Hyperparameters

- **Training regime:** fp16 mixed precision

## Hardware

2x NVIDIA A6000 GPUs