Update README.md
Fixed typo: the accent dataset has 800 minutes, not 800 hours
README.md
CHANGED
@@ -101,16 +101,16 @@ The results were the following:
 
 ## Catalan Accent Classification
 
-We created train, validation and test Catalan Accent Classification-labelled datasets using a 800 hours subsample from the [projecte-aina/annotated_catalan_common_voice_v17](https://huggingface.co/datasets/projecte-aina/annotated_catalan_common_voice_v17) dataset.
+We created train, validation and test Catalan Accent Classification-labelled datasets using a 800 minutes (13 hours) subsample from the [projecte-aina/annotated_catalan_common_voice_v17](https://huggingface.co/datasets/projecte-aina/annotated_catalan_common_voice_v17) dataset.
 For each partition and accent, there is an important imbalance in the number of speakers and in the amount of hours available.
 We created new (smaller) splits assuring that:
 - Every accent has the same amount of speakers
 - Every speaker has at most 10 sentences (to avoid super-present speakers).
-As a result of that, we obtained balanced train (730
+As a result of that, we obtained balanced train (730 minutes), validation (30 minutes) and test (37 minutes) splits.
 We used the field “assigned_accent” as target label.
 This label can take the following values: "central", "northern", "northwestern", "valencian" or "balearic".
 
-We fine-tuned on this Catalan Accent Classification-labelled 800
+We fine-tuned on this Catalan Accent Classification-labelled 800 minutes training split the following models:
 - Catalan pre-trained HuBERT: [BSC-LT/hubert-base-ca-2k](https://huggingface.co/BSC-LT/hubert-base-ca-2k) (our model)
 - English pre-trained HuBERT: [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960)
 - Multi-lingual pre-trained HuBERT: [utter-project/mHuBERT-147](https://huggingface.co/utter-project/mHuBERT-147)
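
For readers who want to see what the two balancing rules in the README text could look like in practice (the same number of speakers per accent, and at most 10 sentences per speaker), here is a minimal sketch. It is not the authors' actual procedure, which the README does not show. The function name `balance_by_accent`, the `client_id` speaker field, the random sampling strategy and the fixed seed are all assumptions made for illustration; only `assigned_accent` and the cap of 10 come from the README.

```python
import random
from collections import defaultdict


def balance_by_accent(rows, max_per_speaker=10, seed=0):
    """Return a subset of `rows` with the same number of speakers per accent
    and at most `max_per_speaker` sentences per speaker.

    Each row is a dict with an "assigned_accent" key (field named in the README)
    and a "client_id" key (assumed speaker identifier, hypothetical here).
    The sketch assumes every speaker has exactly one accent.
    """
    rng = random.Random(seed)

    # Group sentences by accent, then by speaker.
    by_accent = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_accent[row["assigned_accent"]][row["client_id"]].append(row)
    if not by_accent:
        return []

    # The accent with the fewest speakers sets the speaker budget for all accents.
    n_speakers = min(len(speakers) for speakers in by_accent.values())

    balanced = []
    for accent in sorted(by_accent):
        speakers = by_accent[accent]
        # Randomly keep `n_speakers` speakers for this accent.
        for speaker in rng.sample(sorted(speakers), n_speakers):
            sentences = speakers[speaker]
            # Cap each speaker to avoid over-represented speakers.
            keep = min(max_per_speaker, len(sentences))
            balanced.extend(rng.sample(sentences, keep))
    return balanced
```

Note that this sketch equalises speaker counts and caps sentences per speaker, but it does not by itself guarantee equal audio duration per accent; the minute totals reported in the README would have to be checked separately.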