---
library_name: transformers
language:
- en
license: mit
---

# SylReg-Decoder Base

SylReg-Decoder Base is a unit-to-speech decoder that synthesizes English speech from discrete syllabic units, pairing a flow-matching Diffusion Transformer (DiT) with a BigVGAN-v2 vocoder.

## Model Details

### Model Description

SylReg-Decoder Base is the unit-to-speech half of the SylReg pipeline: it converts discrete syllabic units produced by a SylReg encoder (e.g. `ryota-komatsu/SylReg-Distill`) into a speech waveform via a flow-matching Diffusion Transformer and a BigVGAN-v2 vocoder. It was trained on English LibriTTS-R data.

- **Model type:** Flow-matching-based Diffusion Transformer (DiT) with a BigVGAN-v2 vocoder
- **Language(s) (NLP):** English
- **License:** MIT

### Model Sources

- **Repository:** [Code](https://github.com/ryota-komatsu/speaker_disentangled_hubert)
- **Demo:** [Project page](https://ryota-komatsu.github.io/speaker_disentangled_hubert)

## How to Get Started with the Model

Use the code below to get started with the model.

```sh
git clone https://github.com/ryota-komatsu/speaker_disentangled_hubert.git
cd speaker_disentangled_hubert

sudo apt install git-lfs  # for UTMOS

conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10.19 pip=24.0 faiss-gpu=1.12.0
conda activate py310
pip install -r requirements/requirements.txt

sh scripts/setup.sh
```

```python
import re

import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer

from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert.models.sylreg import SylRegForSyllableDiscovery

wav_path = "/path/to/wav"

# download pretrained models from the Hugging Face Hub
encoder = SylRegForSyllableDiscovery.from_pretrained("ryota-komatsu/SylReg-Distill", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/SylReg-Decoder-Base", device_map="cuda")

# load a waveform
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode a waveform into syllabic units
outputs = encoder(waveform.to(encoder.device))
units = outputs[0]["units"]  # [3950, 67, ..., 503]

# unit-to-speech synthesis
generated_speech = decoder(units.unsqueeze(0)).waveform.cpu()
```

## Training Details

### Training Data

| Dataset | License | Provider |
| --- | --- | --- |
| [LibriTTS-R](https://www.openslr.org/141/) | CC BY 4.0 | Y. Koizumi *et al.* |

### Training Hyperparameters

- **Training regime:** fp16 mixed precision
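The fp16 regime can be sketched with PyTorch's standard automatic mixed precision (`autocast` plus `GradScaler`). This is an illustrative sketch, not the repository's actual training loop, and it falls back to full precision when CUDA is unavailable:

```python
from contextlib import nullcontext

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = torch.nn.Linear(16, 4).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# GradScaler rescales the loss so small fp16 gradients don't underflow;
# with enabled=False (no CUDA) it degrades to a plain pass-through.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(8, 16, device=device)
autocast = torch.autocast("cuda", dtype=torch.float16) if use_amp else nullcontext()
with autocast:
    loss = model(x).pow(2).mean()  # forward pass runs in fp16 where safe

scaler.scale(loss).backward()  # backward on the (possibly) scaled loss
scaler.step(optimizer)         # unscale gradients, then take the optimizer step
scaler.update()                # adjust the scale factor for the next iteration
```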

## Hardware

2 × A6000 GPUs