SylReg-Distill

Model Details

Model Description

  • Language(s) (NLP): English

Model Sources

How to Get Started with the Model

Use the code below to get started with the model.

git clone https://github.com/ryota-komatsu/speaker_disentangled_hubert.git
cd speaker_disentangled_hubert

sudo apt install git-lfs  # for UTMOS

conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10 pip=24.0 setuptools=81.0.0 faiss-gpu=1.13.2
conda activate py310
pip install -r requirements/requirements.txt

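sh scripts/setup.sh

Then run the following Python code from the repository root, so that the src package can be imported:

import re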

import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer

from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert import SylRegForSyllableDiscovery

wav_path = "/path/to/wav"

# download pretrained models from hugging face hub
encoder = SylRegForSyllableDiscovery.from_pretrained("ryota-komatsu/SylReg-Distill", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/SylReg-Decoder", device_map="cuda")
speechlm = AutoModelForCausalLM.from_pretrained("/path/to/speechLM", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained("/path/to/speechLM")

# load a waveform
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode a waveform into syllabic units
outputs = encoder(waveform.to(encoder.device))
units = outputs[0]["units"]  # [3950, 67, ..., 503]

# speech language modeling
messages = [
    {"role": "user", "content": "".join(f"<{unit}>" for unit in units)},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # return a BatchEncoding so that .input_ids is available
    return_tensors="pt",
).input_ids.to(speechlm.device)

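# generate() returns the prompt followed by the new tokens; keep only the new tokens
# so that the input units are not synthesized again
generated_ids = speechlm.generate(input_ids=input_ids, do_sample=True, temperature=0.8)[0, input_ids.shape[1]:]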

units = tokenizer.decode(generated_ids)
units = torch.tensor([int(unit) for unit in re.findall(r"<(\d+)>", units)], device=decoder.device)

# unit-to-speech synthesis
generated_speech = decoder(units.unsqueeze(0)).waveform.cpu()
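
The unit-to-text and text-to-unit steps in the code above can be checked without loading any model. The following is a minimal sketch using only the standard library; the helper names units_to_text and text_to_units, the example unit IDs, and the <eos> token are hypothetical and are not part of the repository.

```python
import re

# Same pattern as in the example above: a unit is an integer wrapped in angle brackets.
UNIT_PATTERN = re.compile(r"<(\d+)>")


def units_to_text(units):
    """Render integer unit IDs as the "<id>" string placed in the user message."""
    return "".join(f"<{unit}>" for unit in units)


def text_to_units(text):
    """Recover integer unit IDs from decoded text, skipping any non-unit tokens."""
    return [int(match) for match in UNIT_PATTERN.findall(text)]


# Hypothetical unit IDs; real ones come from the encoder output.
example_units = [3950, 67, 503]
print(units_to_text(example_units))

# Hypothetical decoded text with a non-unit special token mixed in.
print(text_to_units("<3950><67><eos><503>"))
```

Because the pattern only matches digits between angle brackets, special tokens in the decoded text are dropped before the units are passed to the decoder.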

Training Details

Training Data

LibriSpeech train-clean-100

Training Hyperparameters

  • Training regime: bf16 mixed precision

Hardware

4 x A6000

Citation

BibTeX:

@inproceedings{Komatsu_Self-Supervised_Syllable_Discovery_2024,
  author    = {Komatsu, Ryota and Shinozaki, Takahiro},
  title     = {Self-Supervised Syllable Discovery Based on Speaker-Disentangled {HuBERT}},
  year      = {2024},
  month     = {Dec.},
  booktitle = {IEEE Spoken Language Technology Workshop},
  pages     = {1131--1136},
  doi       = {10.1109/SLT61566.2024.10832325},
}

Model size: 0.1B params (Safetensors; tensor types: I64, F32)