---
license: mit
language:
- en
datasets:
- librispeech_asr
metrics:
- ued
- abx
pipeline_tag: automatic-speech-recognition
tags:
- speech
- discrete-units
- quantization
- hubert
- dinosr
- spidr
base_model:
- facebook/hubert-base-ls960
---
# Robust Speech Quantizer (HuBERT / DinoSR / SpidR)
**[GitHub Repository](https://github.com/iliasslasri/snlp_project)**
MLP-based robust speech quantizers trained with CTC loss and iterative pseudo-labeling on augmented audio, following [Algayres et al., Interspeech 2023](https://aclanthology.org/2023.iwslt-1.46/). Evaluated at vocabulary sizes K ∈ {100, 200, 500}.
## Encoders
| Encoder | Checkpoint | Layer | Pre-training data |
|---|---|---|---|
| [HuBERT Base](https://huggingface.co/facebook/hubert-base-ls960) | `hubert-base-ls960` | 6 | LibriSpeech 960h |
| [DinoSR](https://arxiv.org/abs/2305.10005) | original + SpidR-reproduced | 5 | LibriSpeech 960h |
| [SpidR](https://arxiv.org/abs/2512.20308) | `spidr-base` | 6 | LibriSpeech 960h |
## Quick Start
```python
from huggingface_hub import hf_hub_download

# Download a trained quantizer checkpoint (500-unit vocabulary, round 1)
model_path = hf_hub_download(
    repo_id="iliasslasri/robust_speech_quantizer",
    filename="500_vocab_size/round_1/E1_best.pt",
)

# Download the matching training configuration
config_path = hf_hub_download(
    repo_id="iliasslasri/robust_speech_quantizer",
    filename="500_vocab_size/config.yaml",
)
```
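Checkpoints follow the `{K}_vocab_size/round_{r}/E1_best.pt` pattern shown above. A small helper (hypothetical, not part of the released code) can build these repo-relative paths; the set of K values and the round numbering are assumptions based on the filename above, so adjust them to the files actually published:

```python
# Hypothetical helper that builds checkpoint paths following the
# `{K}_vocab_size/round_{r}/E1_best.pt` pattern used in this repo.
# The allowed K values are an assumption taken from the model card.

def checkpoint_filename(vocab_size: int, round_idx: int = 1) -> str:
    """Return the repo-relative path of a quantizer checkpoint."""
    if vocab_size not in (100, 200, 500):
        raise ValueError(f"unexpected vocabulary size: {vocab_size}")
    return f"{vocab_size}_vocab_size/round_{round_idx}/E1_best.pt"

# Pass the result as `filename=` to hf_hub_download:
print(checkpoint_filename(500, 1))  # 500_vocab_size/round_1/E1_best.pt
```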
## Augmentations
| Augmentation | Audio |
|---|---|
| Clean | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/00_clean.wav"></audio> |
| Time Stretch | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/01_time_stretch.wav"></audio> |
| Pitch Shift | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/02_pitch_shift.wav"></audio> |
| Reverberation | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/03_reverberation.wav"></audio> |
| Noise | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/04_noise.wav"></audio> |
| Echo | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/05_echo.wav"></audio> |
| Random Noise | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/06_random_noise.wav"></audio> |
| Pink Noise | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/07_pink_noise.wav"></audio> |
| Lowpass Filter | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/08_lowpass_filter.wav"></audio> |
| Highpass Filter | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/09_highpass_filter.wav"></audio> |
| Bandpass Filter | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/10_bandpass_filter.wav"></audio> |
| Smooth | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/11_smooth.wav"></audio> |
| Boost Audio | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/12_boost_audio.wav"></audio> |
| Duck Audio | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/13_duck_audio.wav"></audio> |
| Up-Down Resample | <audio controls src="https://huggingface.co/iliasslasri/robust_speech_quantizer/resolve/main/augmentations/14_updownresample.wav"></audio> |
## Experiments
We trained quantizers across different encoders, codebook sizes, and augmentation strategies. The augmentation configurations are:
- **All augmentations, chained** — all augmentations from the table above are enabled, and multiple augmentations are applied sequentially to each sample. The number of chained augmentations is sampled from a uniform distribution between 0 and 4.
- **All augmentations, single** — all augmentations are enabled, but only one randomly chosen augmentation is applied per sample.
- **No extra augmentations, single** — only the baseline augmentations (from the original paper) are used, with one applied per sample.
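The chained strategy above can be sketched as follows. This is an illustrative sketch, not the project's actual implementation: the augmentation names are plain strings standing in for the real transforms, and sampling without replacement is an assumption.

```python
import random

# Names from the augmentation table above; the real implementations
# live in the project repo, so strings stand in for them here.
AUGMENTATIONS = [
    "time_stretch", "pitch_shift", "reverberation", "noise", "echo",
    "random_noise", "pink_noise", "lowpass_filter", "highpass_filter",
    "bandpass_filter", "smooth", "boost_audio", "duck_audio",
    "updownresample",
]

def sample_chain(rng: random.Random) -> list[str]:
    """Draw a chain of 0-4 augmentations to apply sequentially."""
    n = rng.randint(0, 4)  # chain length ~ Uniform{0, ..., 4}
    return rng.sample(AUGMENTATIONS, n)  # assumed: no repeats in a chain

rng = random.Random(0)
chain = sample_chain(rng)  # e.g. a list of up to 4 augmentation names
```

In the "single" strategies, the chain length is fixed to one (`rng.sample(AUGMENTATIONS, 1)`), optionally over the reduced baseline augmentation set.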
| Encoder | Layer | Codebook | Augmentation Strategy |
|:---|:---:|:---:|:---|
| HuBERT | 6 | 500 | All augmentations, chained |
| | | | All augmentations, single |
| | | | No extra augmentations, single |
| | | | |
| SpidR | 6 | 256 | No extra augmentations, single |
| | | | All augmentations, chained |
| | | | |
| DinoSR (original) | 5 | 256 | All augmentations, chained |
| DinoSR (reproduced) | 5 | 256 | All augmentations, chained |
## Links
- Paper: [Algayres et al., Interspeech 2023](https://aclanthology.org/2023.iwslt-1.46/)
- Code: [GitHub](https://github.com/iliasslasri/snlp_project)