---
library_name: transformers
language:
- en
license: mit
---
# SylReg-Decoder Base
SylReg-Decoder Base is a unit-to-speech decoder: a flow-matching Diffusion Transformer (DiT) paired with a BigVGAN-v2 vocoder that synthesizes English speech from discrete syllabic units discovered by a SylReg encoder.
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Model type:** Flow-matching Diffusion Transformer (DiT) with a BigVGAN-v2 vocoder
- **Language(s) (NLP):** English
- **License:** MIT
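Since the model type above names flow matching as the generative objective, the following is a minimal, self-contained sketch of the conditional flow-matching loss on toy tensors. The toy vector field, the tensor shapes, and the linear probability path are illustrative assumptions for exposition only, not this repository's actual DiT or training code.

```python
import torch

def toy_vector_field(x, t):
    # placeholder for the learned network (the DiT in the real model)
    return -x * t.unsqueeze(-1)

x0 = torch.randn(8, 16)  # noise samples
x1 = torch.randn(8, 16)  # data samples (e.g. mel-spectrogram frames)
t = torch.rand(8)        # random times in [0, 1)

# linear probability path between noise and data
xt = (1 - t.unsqueeze(-1)) * x0 + t.unsqueeze(-1) * x1
target = x1 - x0         # conditional velocity along the path

# regress the vector field onto the conditional velocity
loss = ((toy_vector_field(xt, t) - target) ** 2).mean()
```

At inference time, the trained vector field is integrated from noise to data with an ODE solver; this sketch only shows the training-time regression target.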
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** [Code](https://github.com/ryota-komatsu/speaker_disentangled_hubert)
- **Demo:** [Project page](https://ryota-komatsu.github.io/speaker_disentangled_hubert)
## How to Get Started with the Model
Use the code below to get started with the model.
```sh
git clone https://github.com/ryota-komatsu/speaker_disentangled_hubert.git
cd speaker_disentangled_hubert
sudo apt install git-lfs # for UTMOS
conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10.19 pip=24.0 faiss-gpu=1.12.0
conda activate py310
pip install -r requirements/requirements.txt
sh scripts/setup.sh
```
```python
import torch
import torchaudio

from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert.models.sylreg import SylRegForSyllableDiscovery

wav_path = "/path/to/wav"

# download pretrained models from the Hugging Face Hub
encoder = SylRegForSyllableDiscovery.from_pretrained("ryota-komatsu/SylReg-Distill", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/SylReg-Decoder-Base", device_map="cuda")

# load a waveform and resample it to the 16 kHz the encoder expects
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

with torch.inference_mode():
    # encode the waveform into discrete syllabic units
    outputs = encoder(waveform.to(encoder.device))
    units = outputs[0]["units"]  # e.g. [3950, 67, ..., 503]

    # unit-to-speech synthesis
    generated_speech = decoder(units.unsqueeze(0)).waveform.cpu()
```
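The snippet above leaves `generated_speech` as a CPU tensor. As a follow-up, this sketch writes such a tensor to a mono 16-bit WAV file using only the standard-library `wave` module; the 16 kHz output rate is an assumption (check the decoder's config for the actual value), and a random tensor stands in for real synthesized speech.

```python
import wave

import torch

# synthetic 1-second clip standing in for `generated_speech`;
# the decoder's output sampling rate is assumed to be 16 kHz here
generated_speech = torch.rand(1, 16000) * 2 - 1  # float samples in [-1, 1)

# convert float samples to 16-bit PCM and write a mono WAV file
pcm = (generated_speech.squeeze(0).clamp(-1, 1) * 32767).to(torch.int16)
with wave.open("generated.wav", "wb") as f:
    f.setnchannels(1)      # mono
    f.setsampwidth(2)      # 16-bit samples
    f.setframerate(16000)  # assumed output sampling rate
    f.writeframes(pcm.numpy().tobytes())
```

Alternatively, `torchaudio.save("generated.wav", generated_speech, 16000)` does the same with less ceremony if a torchaudio audio backend is installed.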
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
| Dataset | License | Provider |
| --- | --- | --- |
| [LibriTTS-R](https://www.openslr.org/141/) | CC BY 4.0 | Y. Koizumi *et al.* |
### Training Hyperparameters
- **Training regime:** fp16 mixed precision
## Hardware
2 x NVIDIA A6000 GPUs