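How to use ryota-komatsu/SylReg-Distill with Transformers:
# Load model directly. SylRegForSyllableDiscovery is imported from this model's repository (src.s5hubert), not from the transformers package, so run this inside the cloned repository set up below.
from src.s5hubert import SylRegForSyllableDiscovery
model = SylRegForSyllableDiscovery.from_pretrained("ryota-komatsu/SylReg-Distill", dtype="auto")
Use the code below to get started with the model.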
git clone https://github.com/ryota-komatsu/speaker_disentangled_hubert.git
cd speaker_disentangled_hubert
sudo apt install git-lfs # for UTMOS
conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10 pip=24.0 setuptools=81.0.0 faiss-gpu=1.13.2
conda activate py310
pip install -r requirements/requirements.txt
sh scripts/setup.sh
import re
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer
from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert import SylRegForSyllableDiscovery
wav_path = "/path/to/wav"
# download pretrained models from hugging face hub
encoder = SylRegForSyllableDiscovery.from_pretrained("ryota-komatsu/SylReg-Distill", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/SylReg-Decoder", device_map="cuda")
speechlm = AutoModelForCausalLM.from_pretrained("/path/to/speechLM", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained("/path/to/speechLM")
# load a waveform
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)
# encode a waveform into syllabic units
outputs = encoder(waveform.to(encoder.device))
units = outputs[0]["units"] # [3950, 67, ..., 503]
# speech language modeling
messages = [
    {"role": "user", "content": "".join(f"<{unit}>" for unit in units)},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # return a BatchEncoding so that .input_ids is available
    return_tensors="pt",
).input_ids.to(speechlm.device)
generated_ids = speechlm.generate(input_ids=input_ids, do_sample=True, temperature=0.8)[0]
units = tokenizer.decode(generated_ids)
units = torch.tensor([int(unit) for unit in re.findall(r"<(\d+)>", units)], device=decoder.device)
# unit-to-speech synthesis
generated_speech = decoder(units.unsqueeze(0)).waveform.cpu()
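The unit handling in the example above (writing each unit as a "<id>" token for the prompt, then recovering the integers from the decoded text with a regular expression) can be isolated into small helpers and checked without loading any model. The following is a minimal sketch, not part of the repository: the helper names format_units, parse_units, and continuation_only are hypothetical, and the special tokens in the usage example are made up for illustration. The last helper is optional. It relies on the fact that generate() on a decoder-only model returns the prompt tokens followed by the newly generated ones, and shows how the prompt prefix could be dropped if only the continuation is wanted. The example above keeps the full sequence.

```python
import re
from typing import Iterable, List

# Same pattern as in the example above: a unit is written as "<123>".
UNIT_PATTERN = re.compile(r"<(\d+)>")


def format_units(units: Iterable[int]) -> str:
    """Render unit IDs as a string of "<id>" tokens (hypothetical helper).

    int() is applied to each element so that plain ints and zero-dimensional
    tensor elements are written the same way.
    """
    return "".join(f"<{int(unit)}>" for unit in units)


def parse_units(text: str) -> List[int]:
    """Extract unit IDs from decoded text, skipping non-unit tokens."""
    return [int(match) for match in UNIT_PATTERN.findall(text)]


def continuation_only(generated_ids: List[int], prompt_length: int) -> List[int]:
    """Drop the first prompt_length token IDs (optional, hypothetical helper)."""
    return list(generated_ids[prompt_length:])


if __name__ == "__main__":
    prompt = format_units([3950, 67, 503])
    print(prompt)
    # The special tokens below are invented placeholders, not real vocabulary.
    print(parse_units("<|assistant|><12><7><|end|>"))
    print(continuation_only([5, 6, 7, 8], prompt_length=2))
```

With the real pipeline, prompt_length would be input_ids.shape[-1], and the parsed list can be passed to torch.tensor(...) exactly as in the example above.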
Training data: LibriSpeech train-clean-100
Hardware: 4 x A6000
BibTeX:
@inproceedings{Komatsu_Self-Supervised_Syllable_Discovery_2024,
author = {Komatsu, Ryota and Shinozaki, Takahiro},
title = {Self-Supervised Syllable Discovery Based on Speaker-Disentangled {HuBERT}},
year = {2024},
month = {Dec.},
booktitle = {IEEE Spoken Language Technology Workshop},
pages = {1131--1136},
doi = {10.1109/SLT61566.2024.10832325},
}