
DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning


A self-supervised speech representation learning model that combines masked language modeling with self-distillation and online clustering. The paper reports results surpassing previous state-of-the-art performance on several downstream speech processing tasks.

Table of Contents

  • Model Details
  • Usage
  • Citation
  • Additional Information

Model Details

Developers

  • Alexander H. Liu, Heng-Jui Chang (MIT CSAIL)
  • Michael Auli, Wei-Ning Hsu (Meta AI)
  • James Glass (MIT CSAIL)

Model Type

Self-supervised speech representation learning (Wav2Vec2 architecture variant)

Key Features

  • Self-distillation with a teacher-student framework (EMA-updated teacher)
  • Dynamic online clustering of teacher representations into discrete codewords
  • Masked prediction of the teacher's codeword assignments with a cross-entropy objective (see the training sketch below)
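
The released checkpoint contains only the encoder weights, but the way these pieces interact during pre-training can be sketched in plain PyTorch. The following is a minimal illustration of the idea rather than the authors' implementation; the function names (ema_update, assign_and_update_codebook, dinosr_style_loss) and the use of a single shared codebook are simplifying assumptions.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, tau=0.999):
    # The teacher tracks the student through an exponential moving average
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(tau).add_(p_s, alpha=1 - tau)

@torch.no_grad()
def assign_and_update_codebook(codebook, teacher_feats, decay=0.9):
    # Online clustering: assign every teacher frame to its nearest codeword,
    # then pull the selected codewords toward the assigned features
    dists = torch.cdist(teacher_feats, codebook)   # [num_frames, num_codewords]
    codes = dists.argmin(dim=-1)                   # cluster index per frame
    for k in codes.unique():
        mean_feat = teacher_feats[codes == k].mean(dim=0)
        codebook[k] = decay * codebook[k] + (1 - decay) * mean_feat
    return codes

def dinosr_style_loss(student_logits, codes, mask):
    # The student predicts the teacher's codeword index at masked positions
    return F.cross_entropy(student_logits[mask], codes[mask])

In the paper, the clustering is applied to several of the top teacher layers, each with its own codebook; the sketch above collapses this to a single codebook for brevity.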

Usage

Feature Extraction

from transformers import Wav2Vec2ForPreTraining, Wav2Vec2FeatureExtractor
import torch
import librosa

# Load the model and feature extractor
model = Wav2Vec2ForPreTraining.from_pretrained("MohammadJRanjbar/DinoSR")
processor = Wav2Vec2FeatureExtractor.from_pretrained("MohammadJRanjbar/DinoSR")
model.eval()  # disable dropout for deterministic feature extraction

# Load audio and resample to the 16 kHz rate the feature extractor expects
audio, sr = librosa.load("speech.wav", sr=16000)
inputs = processor(audio, return_tensors="pt", sampling_rate=16000)

# Extract representations
with torch.no_grad():
    outputs = model(**inputs)

speech_features = outputs.projected_states  # [batch_size, seq_len, 256]
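
For downstream probing (e.g., SUPERB-style tasks), per-layer hidden states are often more useful than the projected states. Continuing from the snippet above, they can be requested directly; the shape comment below assumes the base configuration with hidden size 768.

# Request the hidden states of every transformer layer
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

all_layers = outputs.hidden_states   # tuple of (num_layers + 1) tensors, each [batch_size, seq_len, 768]
last_layer = all_layers[-1]          # output of the final transformer layer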

Fine-tuning for ASR

from transformers import Wav2Vec2ForCTC

# The CTC head is newly initialized; in practice, pass a vocab_size and
# pad_token_id that match your tokenizer (see the sketch below)
model = Wav2Vec2ForCTC.from_pretrained(
    "MohammadJRanjbar/DinoSR",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    layerdrop=0.1,
    ctc_loss_reduction="mean"
)

# Freeze the convolutional feature encoder so only the transformer layers
# and the CTC head are updated during fine-tuning
model.freeze_feature_encoder()
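
Fine-tuning additionally needs a CTC processor whose vocabulary matches your transcripts. Below is a minimal, illustrative sketch of building one and running a single training step on dummy data; the vocab.json file, the dummy audio/text, and the hyperparameters are assumptions for illustration, not part of this repository.

from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2ForCTC, Wav2Vec2Processor)
import torch

# Character-level vocabulary built from your own training transcripts (hypothetical file)
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("MohammadJRanjbar/DinoSR")
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Reload the model so the newly initialized CTC head matches the tokenizer's vocabulary
model = Wav2Vec2ForCTC.from_pretrained(
    "MohammadJRanjbar/DinoSR",
    vocab_size=len(tokenizer),
    pad_token_id=tokenizer.pad_token_id,
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()

# One illustrative training step on dummy data
audio = torch.randn(16000).numpy()                                  # ~1 second of audio at 16 kHz
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = tokenizer("hello world", return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()
optimizer.step()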

Citation

@article{liu2023dinosr,
  title={DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning},
  author={Liu, Alexander H and Chang, Heng-Jui and Auli, Michael and Hsu, Wei-Ning and Glass, James},
  journal={arXiv preprint arXiv:2305.10005},
  year={2023}
}

Additional Information

Resources

  • Paper: https://arxiv.org/abs/2305.10005

Contact

For questions and feedback:

This model card was generated using best practices from Model Card Creator
