federicocosta1989 committed
Commit 4793388 · verified · 1 Parent(s): 0e21528

Update README.md

Fixed typo: the accent dataset has 800 minutes, not 800 hours

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -101,16 +101,16 @@ The results were the following:

  ## Catalan Accent Classification

- We created train, validation and test Catalan Accent Classification-labelled datasets using a 800 hours subsample from the [projecte-aina/annotated_catalan_common_voice_v17](https://huggingface.co/datasets/projecte-aina/annotated_catalan_common_voice_v17) dataset.
+ We created train, validation and test Catalan Accent Classification-labelled datasets using a 800 minutes (13 hours) subsample from the [projecte-aina/annotated_catalan_common_voice_v17](https://huggingface.co/datasets/projecte-aina/annotated_catalan_common_voice_v17) dataset.
  For each partition and accent, there is an important imbalance in the number of speakers and in the amount of hours available.
  We created new (smaller) splits assuring that:
  - Every accent has the same amount of speakers
  - Every speaker has at most 10 sentences (to avoid super-present speakers).
- As a result of that, we obtained balanced train (730 hours), validation (30 hours) and test (37 hours) splits.
+ As a result of that, we obtained balanced train (730 minutes), validation (30 minutes) and test (37 minutes) splits.
  We used the field “assigned_accent” as target label.
  This label can take the following values: "central", "northern", "northwestern", "valencian" or "balearic".

- We fine-tuned on this Catalan Accent Classification-labelled 800 hours training split the following models:
+ We fine-tuned on this Catalan Accent Classification-labelled 800 minutes training split the following models:
  - Catalan pre-trained HuBERT: [BSC-LT/hubert-base-ca-2k](https://huggingface.co/BSC-LT/hubert-base-ca-2k) (our model)
  - English pre-trained HuBERT: [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960)
  - Multi-lingual pre-trained HuBERT: [utter-project/mHuBERT-147](https://huggingface.co/utter-project/mHuBERT-147)
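
The balancing rules quoted in the README text above (the same number of speakers for every accent, and at most 10 sentences per speaker) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' actual preprocessing script: the function name `balance_split`, the `speaker_id` key, and the choice to use the accent with the fewest speakers as the per-accent budget are all hypothetical. Only the `assigned_accent` field name comes from the README.

```python
import random
from collections import defaultdict


def balance_split(rows, max_per_speaker=10, seed=0):
    """Return a speaker-balanced subset of `rows`.

    `rows` is a list of dicts. Each dict must carry an "assigned_accent"
    key (field name from the README) and a "speaker_id" key (hypothetical
    name; the real dataset may identify speakers differently).
    """
    rng = random.Random(seed)

    # Group clips first by accent, then by speaker.
    by_accent = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_accent[row["assigned_accent"]][row["speaker_id"]].append(row)

    # Assumption: the accent with the fewest speakers sets the number of
    # speakers kept for every accent.
    n_speakers = min(len(speakers) for speakers in by_accent.values())

    balanced = []
    for accent in sorted(by_accent):
        speakers = by_accent[accent]
        # Sample the same number of speakers for each accent.
        for speaker in rng.sample(sorted(speakers), n_speakers):
            clips = speakers[speaker]
            # Cap each speaker's contribution to avoid over-represented voices.
            balanced.extend(rng.sample(clips, min(len(clips), max_per_speaker)))
    return balanced
```

Note that equalizing speakers and capping sentences does not by itself guarantee equal audio duration per accent, so durations would still need to be checked after such a step.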