---
license: mit
language:
- en
datasets:
- librispeech_asr
metrics:
- ued
- abx
pipeline_tag: automatic-speech-recognition
tags:
- speech
- discrete-units
- quantization
- hubert
- dinosr
- spidr
base_model:
- facebook/hubert-base-ls960
---
# Robust Speech Quantizer (HuBERT / DinoSR / SpidR)
**[GitHub Repository](https://github.com/iliasslasri/snlp_project)**
MLP-based robust speech quantizers trained with CTC loss and iterative pseudo-labeling on augmented audio, following [Algayres et al., Interspeech 2023](https://aclanthology.org/2023.iwslt-1.46/). Evaluated at vocabulary sizes K ∈ {100, 200, 500}.
## Encoders
| Encoder | Checkpoint | Layer | Pre-training data |
|---|---|---|---|
| [HuBERT Base](https://huggingface.co/facebook/hubert-base-ls960) | `hubert-base-ls960` | 6 | LibriSpeech 960h |
| [DinoSR](https://arxiv.org/abs/2305.10005) | original + SpidR-reproduced | 5 | LibriSpeech 960h |
| [SpidR](https://arxiv.org/abs/2512.20308) | `spidr-base` | 6 | LibriSpeech 960h |
## Quick Start
```python
from huggingface_hub import hf_hub_download

# Download a trained quantizer checkpoint (500-unit vocabulary, round 1)
model_path = hf_hub_download(
    repo_id="iliasslasri/robust_speech_quantizer",
    filename="500_vocab_size/round_1/E1_best.pt",
)

# Download the matching training configuration
config_path = hf_hub_download(
    repo_id="iliasslasri/robust_speech_quantizer",
    filename="500_vocab_size/config.yaml",
)
```
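Checkpoints follow the `{K}_vocab_size/round_{r}/E1_best.pt` pattern shown above. A small helper (hypothetical, not part of the released code) can build these repo-relative paths; the set of K values and the round numbering are assumptions based on the filename above, so adjust them to the files actually published:

```python
# Hypothetical helper that builds checkpoint paths following the
# `{K}_vocab_size/round_{r}/E1_best.pt` pattern used in this repo.
# The allowed K values are an assumption taken from the model card.

def checkpoint_filename(vocab_size: int, round_idx: int = 1) -> str:
    """Return the repo-relative path of a quantizer checkpoint."""
    if vocab_size not in (100, 200, 500):
        raise ValueError(f"unexpected vocabulary size: {vocab_size}")
    return f"{vocab_size}_vocab_size/round_{round_idx}/E1_best.pt"

# Pass the result as `filename=` to hf_hub_download:
print(checkpoint_filename(500, 1))  # 500_vocab_size/round_1/E1_best.pt
```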
## Augmentations
| Augmentation | Audio |
|---|---|
| Clean | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/00_clean.wav"></audio> |
| Time Stretch | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/01_time_stretch.wav"></audio> |
| Pitch Shift | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/02_pitch_shift.wav"></audio> |
| Reverberation | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/03_reverberation.wav"></audio> |
| Noise | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/04_noise.wav"></audio> |
| Echo | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/05_echo.wav"></audio> |
| Random Noise | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/06_random_noise.wav"></audio> |
| Pink Noise | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/07_pink_noise.wav"></audio> |
| Lowpass Filter | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/08_lowpass_filter.wav"></audio> |
| Highpass Filter | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/09_highpass_filter.wav"></audio> |
| Bandpass Filter | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/10_bandpass_filter.wav"></audio> |
| Smooth | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/11_smooth.wav"></audio> |
| Boost Audio | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/12_boost_audio.wav"></audio> |
| Duck Audio | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/13_duck_audio.wav"></audio> |
| Up-Down Resample | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/14_updownresample.wav"></audio> |
## Experiments
We trained quantizers across different encoders, codebook sizes, and augmentation strategies. The augmentation configurations are:
- **All augmentations, chained** — all augmentations from the table above are enabled, and multiple augmentations are applied sequentially to each sample. The number of chained augmentations is sampled from a uniform distribution between 0 and 4.
- **All augmentations, single** — all augmentations are enabled, but only one randomly chosen augmentation is applied per sample.
- **No extra augmentations, single** — only the baseline augmentations (from the original paper) are used, with one applied per sample.
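The chained strategy above can be sketched as follows. This is an illustrative sketch, not the project's actual implementation: the augmentation names are plain strings standing in for the real transforms, and sampling without replacement is an assumption.

```python
import random

# Names from the augmentation table above; the real implementations
# live in the project repo, so strings stand in for them here.
AUGMENTATIONS = [
    "time_stretch", "pitch_shift", "reverberation", "noise", "echo",
    "random_noise", "pink_noise", "lowpass_filter", "highpass_filter",
    "bandpass_filter", "smooth", "boost_audio", "duck_audio",
    "updownresample",
]

def sample_chain(rng: random.Random) -> list[str]:
    """Draw a chain of 0-4 augmentations to apply sequentially."""
    n = rng.randint(0, 4)  # chain length ~ Uniform{0, ..., 4}
    return rng.sample(AUGMENTATIONS, n)  # assumed: no repeats in a chain

rng = random.Random(0)
chain = sample_chain(rng)  # e.g. a list of up to 4 augmentation names
```

In the "single" strategies, the chain length is fixed to one (`rng.sample(AUGMENTATIONS, 1)`), optionally over the reduced baseline augmentation set.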
| Encoder | Layer | Codebook | Augmentation Strategy |
|:---|:---:|:---:|:---|
| HuBERT | 6 | 500 | All augmentations, chained |
| | | | All augmentations, single |
| | | | No extra augmentations, single |
| | | | |
| SpidR | 6 | 256 | No extra augmentations, single |
| | | | All augmentations, chained |
| | | | |
| DinoSR (original) | 5 | 256 | All augmentations, chained |
| DinoSR (reproduced) | 5 | 256 | All augmentations, chained |
## Links
- Paper: [Algayres et al., Interspeech 2023](https://aclanthology.org/2023.iwslt-1.46/)
- Code: [GitHub](https://github.com/iliasslasri/snlp_project)