Robust Quantizer for HuBERT Base (Layer 9)
This model checkpoint contains a Robust Quantizer trained on top of the 9th layer of the hubert-base-ls960 model. It was developed as part of a reproduction and evaluation study on creating robust discrete speech units, originally proposed in Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling (Gat et al., 2023).
Model Details
This quantizer was trained to provide discrete pseudo-labels that are resilient to various acoustic perturbations. By applying data augmentations during quantizer training, the resulting discrete units, and by extension any downstream acoustic models built on them, become more robust to noise and varying acoustic conditions.
- Base Model: facebook/hubert-base-ls960
- Layer: 9
- Vocabulary Size (Clusters): 500
- Algorithm: K-Means
- Dataset: LibriSpeech (train-clean-100)
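At inference time, the k-means quantizer amounts to nearest-centroid assignment over layer-9 features. A minimal sketch of that step, using random stand-in centroids and features (the real checkpoint supplies the 500 trained centroids):

```python
import numpy as np

num_clusters, feat_dim = 500, 768  # vocabulary size, HuBERT Base hidden size
rng = np.random.default_rng(0)
centroids = rng.standard_normal((num_clusters, feat_dim))  # stand-in for trained centroids

def quantize(features: np.ndarray) -> np.ndarray:
    """Map (T, feat_dim) frame features to (T,) discrete unit ids."""
    # Squared Euclidean distance from every frame to every centroid
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # index of the nearest centroid per frame

frames = rng.standard_normal((20, feat_dim))  # e.g. 20 frames of layer-9 features
units = quantize(frames)                      # one unit id in [0, 500) per frame
```

Each 20 ms frame is thus replaced by a single integer token, which is what downstream spoken language models consume.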
Training Procedure
The model was trained for 10 epochs using the iterative training/pseudo-labeling procedure described in the original paper.
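One way to picture the iterative procedure is as alternating pseudo-labeling and re-fitting. The toy sketch below is only a rough schematic under that reading (toy dimensions, synthetic "augmented" views), not the paper's exact objective:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, T = 8, 4, 200  # toy sizes; the real quantizer uses K=500, D=768
clean = rng.standard_normal((T, D))
augmented = clean + 0.1 * rng.standard_normal((T, D))  # stand-in for augmented views

centroids = clean[rng.choice(T, K, replace=False)].copy()
for epoch in range(10):
    # 1) Pseudo-label: assign each clean frame to its nearest centroid
    dists = ((clean[:, None] - centroids[None]) ** 2).sum(-1)
    labels = dists.argmin(1)
    # 2) Re-fit: move each centroid toward the augmented frames sharing its label,
    #    pulling clean and augmented views of a frame toward the same discrete unit
    for k in range(K):
        if (labels == k).any():
            centroids[k] = augmented[labels == k].mean(0)
```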
Data Augmentations Applied:
- Time Stretching
- Pitch Shifting
- Reverberation
- Additive Noise
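As an illustration of the last augmentation, additive noise is typically mixed in at a target signal-to-noise ratio. A minimal sketch (the `add_noise` helper is hypothetical, not from the released code):

```python
import numpy as np

def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into clean speech at a target SNR in dB."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that 10*log10(clean_power / scaled_noise_power) == snr_db
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s dummy tone at 16 kHz
noise = rng.standard_normal(16000)
noisy = add_noise(speech, noise, snr_db=10.0)
```

Pitch shifting, time stretching, and reverberation follow the same pattern: the waveform is perturbed before feature extraction so the quantizer sees both clean and corrupted views.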
Intended Use
This checkpoint is intended for extracting sequences of discrete units (pseudo-labels/tokens) from raw audio waveforms.
# Pseudo-code for usage
import torch
from transformers import HubertModel

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")

# Load this quantizer (a 500-centroid k-means model)
quantizer = torch.load("path_to_downloaded_checkpoint.pt")

# waveform: 16 kHz mono audio tensor of shape (1, num_samples)
with torch.no_grad():
    outputs = hubert(waveform, output_hidden_states=True)
layer9 = outputs.hidden_states[9]  # layer-9 features, shape (1, num_frames, 768)

# Apply the quantizer to map each frame to one of the 500 discrete units
# units = quantizer(layer9)
Relevant Links
- Original Paper: Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling (Gat et al., 2023)
- Project Repository: github