Blanca committed
Commit 66e9b75 · 1 Parent(s): 97e9d0f

Update dataset names

Files changed (1):
  1. README.md +9 -9
README.md CHANGED
@@ -124,13 +124,13 @@ It contains the following tasks and their related datasets:
 1. Named Entity Recognition (NER)
 
 
-**[AnCora Catalan 2.0.0](https://zenodo.org/record/4762031#.YKaFjqGxWUk)**: extracted named entities from the original [Ancora](https://doi.org/10.5281/zenodo.4762030) version,
+**[NER (AnCora)](https://zenodo.org/record/4762031#.YKaFjqGxWUk)**: extracted named entities from the original [Ancora](https://doi.org/10.5281/zenodo.4762030) version,
 filtering out some unconventional ones, like book titles, and transcribed them into a standard CONLL-IOB format
 
 
 2. Part-of-Speech Tagging (POS)
 
-Catalan-Ancora: from the [Universal Dependencies treebank](https://github.com/UniversalDependencies/UD_Catalan-AnCora) of the well-known Ancora corpus.
+**[POS (AnCora)](https://zenodo.org/record/4762031#.YKaFjqGxWUk)**: from the [Universal Dependencies treebank](https://github.com/UniversalDependencies/UD_Catalan-AnCora) of the well-known Ancora corpus.
 
 3. Text Classification (TC)
 
@@ -138,11 +138,11 @@ It contains the following tasks and their related datasets:
 
 4. Textual Entailment (TE)
 
-**[TECa](https://huggingface.co/datasets/projecte-aina/teca)**: consisting of 21,163 pairs of premises and hypotheses, annotated according to the inference relation they have (implication, contradiction, or neutral), extracted from the [Catalan Textual Corpus](https://huggingface.co/datasets/projecte-aina/catalan_textual_corpus).
+**[TE-ca](https://huggingface.co/datasets/projecte-aina/teca)**: consisting of 21,163 pairs of premises and hypotheses, annotated according to the inference relation they have (implication, contradiction, or neutral), extracted from the [Catalan Textual Corpus](https://huggingface.co/datasets/projecte-aina/catalan_textual_corpus).
 
 5. Semantic Textual Similarity (STS)
 
-**[Catalan semantic textual similarity](https://huggingface.co/datasets/projecte-aina/sts-ca)**: consisting of more than 3000 sentence pairs, annotated with the semantic similarity between them, scraped from the [Catalan Textual Corpus](https://huggingface.co/datasets/projecte-aina/catalan_textual_corpus).
+**[STS-ca](https://huggingface.co/datasets/projecte-aina/sts-ca)**: consisting of more than 3000 sentence pairs, annotated with the semantic similarity between them, scraped from the [Catalan Textual Corpus](https://huggingface.co/datasets/projecte-aina/catalan_textual_corpus).
 
 6. Question Answering (QA):
 
@@ -152,7 +152,7 @@ It contains the following tasks and their related datasets:
 
 **[CatalanQA](https://huggingface.co/datasets/projecte-aina/catalanqa)**: an aggregation of 2 previous datasets (VilaQuAD and ViquiQuAD), 21,427 pairs of Q/A balanced by type of question, containing one question and one answer per context, although the contexts can repeat multiple times.
 
-**[XQuAD](https://huggingface.co/datasets/projecte-aina/xquad-ca)**: the Catalan translation of XQuAD, a multilingual collection of manual translations of 1,190 question-answer pairs from English Wikipedia used only as a _test set_.
+**[XQuAD-ca](https://huggingface.co/datasets/projecte-aina/xquad-ca)**: the Catalan translation of XQuAD, a multilingual collection of manual translations of 1,190 question-answer pairs from English Wikipedia used only as a _test set_.
 
 Here are the train/dev/test splits of the datasets:
 
@@ -160,23 +160,23 @@ Here are the train/dev/test splits of the datasets:
 |:--|:--|:--|:--|:--|
 | NER (Ancora) |13,581 | 10,628 | 1,427 | 1,526 |
 | POS (Ancora)| 16,678 | 13,123 | 1,709 | 1,846 |
-| STS | 3,073 | 2,073 | 500 | 500 |
+| STS (STS-ca) | 3,073 | 2,073 | 500 | 500 |
 | TC (TeCla) | 137,775 | 110,203 | 13,786 | 13,786|
-| TE (TECa) | 21,163 | 16,930 | 2,116 | 2,117
+| TE (TE-ca) | 21,163 | 16,930 | 2,116 | 2,117
 | QA (VilaQuAD) | 6,282 | 3,882 | 1,200 | 1,200 |
 | QA (ViquiQuAD) | 14,239 | 11,255 | 1,492 | 1,429 |
 | QA (CatalanQA) | 21,427 | 17,135 | 2,157 | 2,135 |
 
 ### Evaluation Results
 
-| Task | NER (F1) | POS (F1) | STS (Comb) | TC (Acc.) | TE (Acc.) | QA (VilaQuAD) (F1/EM)| QA (ViquiQuAD) (F1/EM) | QA (CatalanQA) (F1/EM) | QA (XQuAD-Ca)<sup>1</sup> (F1/EM) |
+| Task | NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TE-ca (Acc.) | VilaQuAD (F1/EM)| ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
 | ------------|:-------------:| -----:|:------|:------|:-------|:------|:----|:----|:----|
 | RoBERTa-base-ca-v2 | **89.45** | 99.09 | 79.07 | **74.26** | **83.14** | **87.74/72.58** | **88.72/75.91** | **89.50**/76.63 | **73.64/55.42** |
 | BERTa | 88.94 | **99.10** | **80.19** | 73.65 | 79.26 | 85.93/70.58 | 87.12/73.11 | 89.17/**77.14** | 69.20/51.47 |
 | mBERT | 87.36 | 98.98 | 74.26 | 69.90 | 74.63 | 82.78/67.33 | 86.89/73.53 | 86.90/74.19 | 68.79/50.80 |
 | XLM-RoBERTa | 88.07 | 99.03 | 61.61 | 70.14 | 33.30 | 86.29/71.83 | 86.88/73.11 | 88.17/75.93 | 72.55/54.16 |
 
-<sup>1</sup> : Trained on CatalanQA, tested on XQuAD-Ca.
+<sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca.
 
 ## Licensing Information
 
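For context on the metrics renamed in the Evaluation Results table: the QA columns report SQuAD-style F1/EM, and the STS-ca "Comb" column is assumed here to be the average of Pearson and Spearman correlations (as in the BERTa evaluation setup). A minimal sketch under those assumptions; all function names are illustrative, not part of this repository:

```python
import math
import re
from collections import Counter

def _normalize(text):
    # SQuAD-style normalization: lowercase, drop punctuation, squeeze whitespace
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def exact_match(pred, gold):
    """EM: 1.0 if the normalized answers are identical, else 0.0."""
    return float(_normalize(pred) == _normalize(gold))

def token_f1(pred, gold):
    """Token-overlap F1 between a predicted and a gold answer span."""
    p, g = _normalize(pred).split(), _normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def _ranks(v):
    # Rank positions for Spearman; ties are ignored for brevity
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    for rank, i in enumerate(order):
        ranks[i] = float(rank)
    return ranks

def combined_score(preds, golds):
    """Assumed 'Comb' metric: mean of Pearson and Spearman correlations."""
    return (_pearson(preds, golds) + _pearson(_ranks(preds), _ranks(golds))) / 2
```

For example, `token_f1("the city of Barcelona", "Barcelona")` gives 0.4 (precision 0.25, recall 1.0), while `exact_match` on the same pair gives 0.0.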