BSC-LT/hubert-base-ca-2k

@@ -101,16 +101,16 @@ The results were the following:
 ## Catalan Accent Classification
-We created train, validation and test Catalan Accent Classification-labelled datasets using a 800 hours subsample from the [projecte-aina/annotated_catalan_common_voice_v17](https://huggingface.co/datasets/projecte-aina/annotated_catalan_common_voice_v17) dataset.
 For each partition and accent, there is an important imbalance in the number of speakers and in the amount of hours available.
 We created new (smaller) splits assuring that:
 - Every accent has the same amount of speakers
 - Every speaker has at most 10 sentences (to avoid super-present speakers).
-As a result of that, we obtained balanced train (730 hours), validation (30 hours) and test (37 hours) splits.
 We used the field “assigned_accent” as target label.
 This label can take the following values: "central", "northern", "northwestern", "valencian" or "balearic".
-We fine-tuned on this Catalan Accent Classification-labelled 800 hours training split the following models:
 - Catalan pre-trained HuBERT: [BSC-LT/hubert-base-ca-2k](https://huggingface.co/BSC-LT/hubert-base-ca-2k) (our model)
 - English pre-trained HuBERT: [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960)
 - Multi-lingual pre-trained HuBERT: [utter-project/mHuBERT-147](https://huggingface.co/utter-project/mHuBERT-147)

 ## Catalan Accent Classification
+We created train, validation and test datasets labelled for Catalan Accent Classification from an 800-minute (about 13 hours) subsample of the [projecte-aina/annotated_catalan_common_voice_v17](https://huggingface.co/datasets/projecte-aina/annotated_catalan_common_voice_v17) dataset.
 For each partition and accent, there is an important imbalance in the number of speakers and in the amount of hours available.
 We created new (smaller) splits assuring that:
 - Every accent has the same amount of speakers
 - Every speaker has at most 10 sentences (to avoid super-present speakers).
+As a result of that, we obtained balanced train (730 minutes), validation (30 minutes) and test (37 minutes) splits.
 We used the field “assigned_accent” as target label.
 This label can take the following values: "central", "northern", "northwestern", "valencian" or "balearic".
+We fine-tuned the following models on the training split of this 800-minute Catalan Accent Classification-labelled dataset:
 - Catalan pre-trained HuBERT: [BSC-LT/hubert-base-ca-2k](https://huggingface.co/BSC-LT/hubert-base-ca-2k) (our model)
 - English pre-trained HuBERT: [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960)
 - Multi-lingual pre-trained HuBERT: [utter-project/mHuBERT-147](https://huggingface.co/utter-project/mHuBERT-147)
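
The balancing rules described in the changed section (the same number of speakers for every accent, and at most 10 sentences per speaker) could be sketched roughly as follows. This is only an illustration and not the authors' actual procedure: the `client_id` speaker field, the random sampling, and the `balance_split` helper are assumptions; only `assigned_accent` and the limit of 10 sentences come from the text.

```python
import random
from collections import defaultdict

# Limit stated in the model card: at most 10 sentences per speaker.
MAX_SENTENCES_PER_SPEAKER = 10


def balance_split(rows, seed=0):
    """Return a speaker-balanced subset of `rows`.

    `rows` is a list of dicts with an "assigned_accent" key (named in the
    model card) and a "client_id" speaker key (an assumption made here).
    """
    rng = random.Random(seed)

    # Group utterances by accent, then by speaker.
    by_accent = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_accent[row["assigned_accent"]][row["client_id"]].append(row)

    # The accent with the fewest speakers sets the per-accent speaker count.
    n_speakers = min(len(speakers) for speakers in by_accent.values())

    balanced = []
    for accent in sorted(by_accent):
        speakers = by_accent[accent]
        # Hypothetical choice: pick speakers uniformly at random.
        for speaker in rng.sample(sorted(speakers), n_speakers):
            utterances = speakers[speaker]
            keep = min(len(utterances), MAX_SENTENCES_PER_SPEAKER)
            balanced.extend(rng.sample(utterances, keep))
    return balanced
```

Equalising the speaker count and capping sentences per speaker bounds how far the accents can diverge in size, but it does not by itself guarantee identical durations per accent, since speakers may contribute fewer than 10 sentences and sentence lengths vary.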