mHuBERT-147 is a family of compact and competitive multilingual HuBERT models trained on 90K hours of open-license data in 147 languages.
Unlike *traditional* HuBERTs, mHuBERT-147 models are trained using faiss IVF discrete speech units.
Training employs two-level up-sampling over languages and data sources. See [our paper](https://arxiv.org/pdf/2406.06371) for more information.

**This repository contains:**
* Fairseq checkpoint (original);
* Faiss index for continuous pre-training (`OPQ16_64,IVF1000_HNSW32,PQ16x4fsr`).
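
As a quick illustration of how these pieces fit together, the sketch below extracts frame-level features and assigns each frame to one of the index's 1000 IVF centroids, which act as the discrete speech units. This is a minimal sketch, not the official labeling pipeline (see the scripts repository under Training): the repo id, the index filename, the transformer layer, and the assumption that a transformers-compatible checkpoint is available here are all illustrative.

```python
# Minimal sketch (assumptions flagged below; see the official scripts
# repository for the exact labeling pipeline).
import faiss  # pip install faiss-cpu
import torch
from transformers import AutoFeatureExtractor, HubertModel

repo = "utter-project/mHuBERT-147"  # assumed repo id
extractor = AutoFeatureExtractor.from_pretrained(repo)
model = HubertModel.from_pretrained(repo).eval()

wav = torch.zeros(16000)  # stand-in for 1 s of 16 kHz mono audio
inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
feats = out.hidden_states[9][0].numpy().astype("float32")  # layer 9 is illustrative

# The OPQ16_64,IVF1000_HNSW32,PQ16x4fsr index holds 1000 IVF centroids that
# serve as the units: apply the OPQ transform, then assign each frame to its
# nearest centroid via the coarse (HNSW) quantizer.
index = faiss.read_index("mhubert147_faiss.index")  # assumed filename
opq = faiss.downcast_VectorTransform(index.chain.at(0))
ivf = faiss.extract_index_ivf(index)
_, units = ivf.quantizer.search(opq.apply(feats), 1)
print(units[:, 0])  # one unit id in [0, 1000) per 20 ms frame
```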

**Related Models:**
* [Second Iteration repository](https://huggingface.co/utter-project/mHuBERT-147-base-2nd-iter)
* [First Iteration repository](https://huggingface.co/utter-project/mHuBERT-147-base-1st-iter)
* [CommonVoice Prototype (12 languages)](https://huggingface.co/utter-project/hutter-12-3rd-base)

# Training

* **[Manifest list available here.](https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest)** Please note that there have been CommonVoice removal requests since training, so some of the listed files are no longer available.
* **[Fairseq fork](https://github.com/utter-project/fairseq)** contains the scripts for training with multilingual batching and two-level up-sampling; a toy sketch of the up-sampling follows this list.
* **[Scripts for pre-processing/faiss clustering available here.](https://github.com/utter-project/mHuBERT-147-scripts)**
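
To make the two-level scheme concrete, the toy sketch below first samples a language with probability proportional to its hours of data raised to a smoothing exponent, then samples a data source within that language the same way; exponents below 1 up-sample low-resource languages and sources. This is not the fairseq fork's actual code, and the hours and exponent values are made up for illustration.

```python
# Toy sketch of two-level (language, then data source) up-sampling.
# Hours and exponents are invented for illustration.
import random

HOURS = {  # hypothetical hours per language and data source
    "en": {"CommonVoice": 2000.0, "VoxPopuli": 500.0},
    "fr": {"CommonVoice": 600.0, "VoxPopuli": 250.0},
    "sw": {"CommonVoice": 50.0},
}

def smoothed_probs(sizes, beta):
    """p_k proportional to size_k ** beta; beta < 1 flattens the distribution."""
    weights = {k: v ** beta for k, v in sizes.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def pick(probs):
    keys = list(probs)
    return random.choices(keys, weights=[probs[k] for k in keys])[0]

def sample_batch_source(beta_lang=0.7, beta_source=0.7):
    # Level 1: choose a language, smoothing over total hours per language.
    lang_hours = {lang: sum(srcs.values()) for lang, srcs in HOURS.items()}
    lang = pick(smoothed_probs(lang_hours, beta_lang))
    # Level 2: choose a data source within that language.
    source = pick(smoothed_probs(HOURS[lang], beta_source))
    return lang, source

print(sample_batch_source())  # e.g. ('sw', 'CommonVoice')
```
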
# ML-SUPERB Scores

mHuBERT-147 reaches second and first positions on the 10min and 1h leaderboards, respectively. We achieve new SOTA scores for three LID tasks.
See more information in [our paper](https://arxiv.org/pdf/2406.06371).
