iliasslasri/robust_speech_quantizer

@@ -5,54 +5,52 @@ language:
 datasets:
 - librispeech_asr
 metrics:
-- abx
-- wer
 - ued
 pipeline_tag: automatic-speech-recognition
 tags:
 - speech
 - discrete-units
 - quantization
 - hubert
-- clustering
 base_model:
 - facebook/hubert-base-ls960
 ---
-# Robust Quantizer from HuBERT Base (Layer 6)
-This model checkpoint contains a **Robust Quantizer** trained on top of the 6th layer of the `hubert-base-ls960` model. It was developed as part of a reproduction and evaluation study on creating robust discrete speech units, originally proposed in *Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling (Gat et al., 2023)*.
-## Model Details
-This quantizer was trained to provide discrete pseudo-labels that are resilient to various acoustic perturbations. By applying data augmentations during the quantization process, the resulting discrete units, and by extension downstream acoustic models, become more robust to noise and varying acoustic conditions.
-- **Base Model:** [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960)
-- **Layer:** 6
-- **Vocabulary Size (Clusters):** 100, 200, 500
-- **Algorithm:** K-Means
-- **Dataset:** [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) (`train-clean-100`)
-## Usage
-### Download the Model
 ```python
 from huggingface_hub import hf_hub_download
-model_path = hf_hub_download(repo_id="iliasslasri/robust_speech_quantizer",
-                              filename="500_vocab_size/round_1/E1_best.pt",
-                              force_download=True)
-config_path = hf_hub_download(repo_id="iliasslasri/robust_speech_quantizer",
-                               filename="500_vocab_size/config.yaml",
-                               force_download=True)
 ```
-## Augmentation Examples
-Here are examples of the data augmentations applied to the audio during the training of the quantizer:
-| Augmentation | Audio Example |
 |---|---|
 | Clean | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/00_clean.wav"></audio> |
 | Time Stretch | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/01_time_stretch.wav"></audio> |
@@ -70,6 +68,6 @@ Here are examples of the data augmentations applied to the audio during the trai
 | Duck Audio | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/13_duck_audio.wav"></audio> |
 | Up-Down Resample | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/14_updownresample.wav"></audio> |
-## Relevant Links
-- Original Paper: [Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling (Gat et al., 2023)](https://aclanthology.org/2023.iwslt-1.46/)
-- Project Repository: [github](https://github.com/iliasslasri/snlp_project)

 datasets:
 - librispeech_asr
 metrics:
 - ued
+- abx
 pipeline_tag: automatic-speech-recognition
 tags:
 - speech
 - discrete-units
 - quantization
 - hubert
+- dinosr
+- spidr
 base_model:
 - facebook/hubert-base-ls960
 ---
+# Robust Speech Quantizer (HuBERT / DinoSR / SpidR)
+**[GitHub Repository](https://github.com/iliasslasri/snlp_project)**
+MLP-based robust speech quantizers trained with CTC loss and iterative pseudo-labeling on augmented audio, following [Gat et al., IWSLT 2023](https://aclanthology.org/2023.iwslt-1.46/). Evaluated at vocabulary sizes K ∈ {100, 200, 500}.
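Because the quantizers are trained with CTC loss, one plausible way to read a unit sequence off their frame-level outputs is greedy CTC decoding: take the best-scoring class per frame, merge consecutive repeats, and drop the blank symbol. The sketch below illustrates that general technique only. The function name `ctc_greedy_units`, the input layout (one score list per frame), and the blank index of 0 are assumptions for illustration, not the repository's actual inference code.

```python
from typing import List, Optional, Sequence


def ctc_greedy_units(
    frame_scores: Sequence[Sequence[float]],
    blank_id: int = 0,  # assumption: the blank symbol sits at index 0
) -> List[int]:
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    units: List[int] = []
    previous: Optional[int] = None
    for scores in frame_scores:
        # Index of the highest score in this frame (first one wins on ties).
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit only when the class changes and is not the blank symbol.
        if best != previous and best != blank_id:
            units.append(best)
        previous = best
    return units


# Toy scores for 5 frames over 4 classes (index 0 is the assumed blank).
toy_scores = [
    [0.10, 0.80, 0.10, 0.00],
    [0.20, 0.70, 0.10, 0.00],
    [0.90, 0.05, 0.05, 0.00],
    [0.10, 0.60, 0.20, 0.10],
    [0.00, 0.10, 0.20, 0.70],
]
decoded = ctc_greedy_units(toy_scores)
print(decoded)
```

A blank between two identical classes keeps them as separate units, which is why the toy input above yields the unit 1 twice.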
+## Encoders
+| Encoder | Checkpoint | Layer | Pre-training data |
+|---|---|---|---|
+| [HuBERT Base](https://huggingface.co/facebook/hubert-base-ls960) | `hubert-base-ls960` | 6 | LibriSpeech 960h |
+| [DinoSR](https://arxiv.org/abs/2305.04582) | original + SpidR-reproduced | 5 | LibriSpeech 960h |
+| [SpidR](https://arxiv.org/abs/2512.20308) | `spidr-base` | 6 | LibriSpeech 960h |
+## Quick Start
 ```python
 from huggingface_hub import hf_hub_download
+model_path = hf_hub_download(
+    repo_id="iliasslasri/robust_speech_quantizer",
+    filename="500_vocab_size/round_1/E1_best.pt"
+)
+config_path = hf_hub_download(
+    repo_id="iliasslasri/robust_speech_quantizer",
+    filename="500_vocab_size/config.yaml"
+)
 ```
+## Augmentations
+| Augmentation | Audio |
 |---|---|
 | Clean | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/00_clean.wav"></audio> |
 | Time Stretch | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/01_time_stretch.wav"></audio> |
 | Duck Audio | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/13_duck_audio.wav"></audio> |
 | Up-Down Resample | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/14_updownresample.wav"></audio> |
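The `ued` metric in the metadata (unit edit distance) measures how much a unit sequence changes when the same utterance is augmented, as in the clean and augmented pairs in the table above. A minimal sketch, assuming the score is the Levenshtein distance between the two unit sequences divided by the length of the clean sequence; the helper names are hypothetical, and the exact normalization, deduplication, and scaling used by the paper or this repository may differ.

```python
from typing import Sequence


def levenshtein(a: Sequence[int], b: Sequence[int]) -> int:
    """Edit distance (insert, delete, substitute) between two unit sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # delete x
                    curr[j - 1] + 1,         # insert y
                    prev[j - 1] + (x != y),  # substitute, free if equal
                )
            )
        prev = curr
    return prev[-1]


def unit_edit_distance(
    clean_units: Sequence[int], augmented_units: Sequence[int]
) -> float:
    """Edit distance normalized by the clean sequence length (assumed normalization)."""
    if not clean_units:
        raise ValueError("clean_units must be non-empty")
    return levenshtein(clean_units, augmented_units) / len(clean_units)


# Hypothetical unit sequences for one utterance, clean and augmented.
clean = [5, 5, 7, 9]
augmented = [5, 7, 7, 9]
score = unit_edit_distance(clean, augmented)
print(score)
```

Lower values mean the quantizer assigns more similar units to the clean and augmented versions of the same audio.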
+## Links
+- Paper: [Gat et al., IWSLT 2023](https://aclanthology.org/2023.iwslt-1.46/)
+- Code: [GitHub](https://github.com/iliasslasri/snlp_project)