ILS-SSL
ILS-SSL: Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision
The data preparation and pre-training for the first iteration follow the same pipeline as Hubert. We give example scripts for ILS-Hubert pre-training and fine-tuning in src/examples/hubert/scripts
Pre-Trained and Fine-tuned Models
| Model | Pretraining Dataset | Finetuning Dataset | Model |
|---|---|---|---|
| ILS-Base | 960h LibriSpeech | - | Download |
| ILS-Large | 60k hrs Libri-Light | - | Download |
| ILS-Large | 60k hrs Libri-Light | 960h LibriSpeech | Download |
Results on Librispeech
| Base Model | Finetuning set | LM | test-clean | test-other |
|---|---|---|---|---|
| wav2vec2.0 | 1 hour | None | 24.5 | 29.7 |
| Hubert | 1 hour | None | 20.9 | 27.5 |
| ILS-SSL | 1 hour | None | 17.9 | 23.1 |
| wav2vec2.0 | 1 hour | 4-gram | 5.5 | 11.3 |
| Hubert | 1 hour | 4-gram | 6.1 | 11.3 |
| ILS-SSL | 1 hour | 4-gram | 5.4 | 10.2 |
| wav2vec2.0 | 10 hour | None | 11.1 | 17.6 |
| Hubert | 10 hour | None | 10.1 | 16.8 |
| ILS-SSL | 10 hour | None | 8.3 | 13.6 |
| wav2vec2.0 | 10 hour | 4-gram | 4.3 | 9.5 |
| Hubert | 10 hour | 4-gram | 4.3 | 9.4 |
| ILS-SSL | 10 hour | 4-gram | 3.8 | 8.1 |
| wav2vec2.0 | 100 hour | None | 6.1 | 13.3 |
| Hubert | 100 hour | None | 6.3 | 13.2 |
| ILS-SSL | 100 hour | None | 4.7 | 10.1 |
| wav2vec2.0 | 100 hour | 4-gram | 3.4 | 8.0 |
| Hubert | 100 hour | 4-gram | 3.4 | 8.1 |
| ILS-SSL | 100 hour | 4-gram | 3.0 | 6.9 |
| Large Model | Finetuning set | LM | test-clean | test-other |
|---|---|---|---|---|
| wav2vec2.0 | 1 hour | None | 17.2 | 20.3 |
| Hubert | 1 hour | None | 17.4 | 20.3 |
| ILS-SSL | 1 hour | None | 14.3 | 16.9 |
| wav2vec2.0 | 1 hour | Transf | 2.9 | 5.8 |
| Hubert | 1 hour | Transf | 2.9 | 5.4 |
| ILS-SSL | 1 hour | Transf | 2.8 | 5.3 |
| wav2vec2.0 | 10 hour | None | 6.3 | 10.0 |
| Hubert | 10 hour | None | 6.2 | 9.6 |
| ILS-SSL | 10 hour | None | 6.1 | 9.1 |
| wav2vec2.0 | 10 hour | Transf | 2.6 | 4.9 |
| Hubert | 10 hour | Transf | 2.4 | 4.6 |
| ILS-SSL | 10 hour | Transf | 2.5 | 4.5 |
| wav2vec2.0 | 100 hour | None | 3.1 | 6.3 |
| Hubert | 100 hour | None | 2.9 | 6.0 |
| ILS-SSL | 100 hour | None | 2.9 | 5.8 |
| wav2vec2.0 | 100 hour | Transf | 2.0 | 4.0 |
| Hubert | 100 hour | Transf | 2.1 | 3.9 |
| ILS-SSL | 100 hour | Transf | 2.0 | 4.0 |
| wav2vec2.0 | 960 hour | None | 2.2 | 4.5 |
| Hubert | 960 hour | None | 2.1 | 4.3 |
| ILS-SSL | 960 hour | None | 1.9 | 3.8 |
| wav2vec2.0 | 960 hour | Transf | 1.8 | 3.3 |
| Hubert | 960 hour | Transf | 1.9 | 3.3 |
| ILS-SSL | 960 hour | Transf | 1.8 | 3.2 |