Update dataset names

README.md
It contains the following tasks and their related datasets:

1. Named Entity Recognition (NER)

   **[NER (AnCora)](https://zenodo.org/record/4762031#.YKaFjqGxWUk)**: named entities extracted from the original [Ancora](https://doi.org/10.5281/zenodo.4762030) version, filtering out some unconventional ones, like book titles, and transcribed into a standard CONLL-IOB format.

2. Part-of-Speech Tagging (POS)

   **[POS (AnCora)](https://zenodo.org/record/4762031#.YKaFjqGxWUk)**: from the [Universal Dependencies treebank](https://github.com/UniversalDependencies/UD_Catalan-AnCora) of the well-known AnCora corpus.

3. Text Classification (TC)

4. Textual Entailment (TE)

   **[TE-ca](https://huggingface.co/datasets/projecte-aina/teca)**: 21,163 pairs of premises and hypotheses, annotated with the inference relation between them (implication, contradiction, or neutral), extracted from the [Catalan Textual Corpus](https://huggingface.co/datasets/projecte-aina/catalan_textual_corpus).

5. Semantic Textual Similarity (STS)

   **[STS-ca](https://huggingface.co/datasets/projecte-aina/sts-ca)**: more than 3,000 sentence pairs, annotated with their semantic similarity, scraped from the [Catalan Textual Corpus](https://huggingface.co/datasets/projecte-aina/catalan_textual_corpus).

6. Question Answering (QA)

   **[CatalanQA](https://huggingface.co/datasets/projecte-aina/catalanqa)**: an aggregation of two previous datasets (VilaQuAD and ViquiQuAD), with 21,427 Q/A pairs balanced by question type and one question and one answer per context, although contexts can repeat multiple times.

   **[XQuAD-ca](https://huggingface.co/datasets/projecte-aina/xquad-ca)**: the Catalan translation of XQuAD, a multilingual collection of manual translations of 1,190 question-answer pairs from English Wikipedia, used only as a _test set_.
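The NER data above is distributed in CONLL-IOB format, where each token carries a `B-` (begin), `I-` (inside), or `O` (outside) tag. As a rough illustration only (the helper name, tokens, and tags below are invented examples, not taken from AnCora), entity spans can be recovered from such a tag sequence like this:

```python
def iob_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from a CONLL-IOB tag sequence."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # start of a new entity
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)             # continuation of the open entity
        else:                                    # "O" or an inconsistent I- tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(toks)) for etype, toks in spans]

tokens = ["La", "Generalitat", "de", "Catalunya", "és", "a", "Barcelona", "."]
tags   = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC", "O"]
print(iob_to_spans(tokens, tags))
# → [('ORG', 'Generalitat de Catalunya'), ('LOC', 'Barcelona')]
```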
Here are the train/dev/test splits of the datasets:

| Task (Dataset) | Total | Train | Dev | Test |
|:--|:--|:--|:--|:--|
| NER (AnCora) | 13,581 | 10,628 | 1,427 | 1,526 |
| POS (AnCora) | 16,678 | 13,123 | 1,709 | 1,846 |
| STS (STS-ca) | 3,073 | 2,073 | 500 | 500 |
| TC (TeCla) | 137,775 | 110,203 | 13,786 | 13,786 |
| TE (TE-ca) | 21,163 | 16,930 | 2,116 | 2,117 |
| QA (VilaQuAD) | 6,282 | 3,882 | 1,200 | 1,200 |
| QA (ViquiQuAD) | 14,239 | 11,255 | 1,492 | 1,429 |
| QA (CatalanQA) | 21,427 | 17,135 | 2,157 | 2,135 |
### Evaluation Results

| Model | NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TE-ca (Acc.) | VilaQuAD (F1/EM) | ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca<sup>1</sup> (F1/EM) |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| RoBERTa-base-ca-v2 | **89.45** | 99.09 | 79.07 | **74.26** | **83.14** | **87.74/72.58** | **88.72/75.91** | **89.50**/76.63 | **73.64/55.42** |
| BERTa | 88.94 | **99.10** | **80.19** | 73.65 | 79.26 | 85.93/70.58 | 87.12/73.11 | 89.17/**77.14** | 69.20/51.47 |
| mBERT | 87.36 | 98.98 | 74.26 | 69.90 | 74.63 | 82.78/67.33 | 86.89/73.53 | 86.90/74.19 | 68.79/50.80 |
| XLM-RoBERTa | 88.07 | 99.03 | 61.61 | 70.14 | 33.30 | 86.29/71.83 | 86.88/73.11 | 88.17/75.93 | 72.55/54.16 |

<sup>1</sup>: Trained on CatalanQA, tested on XQuAD-ca.
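The QA columns above report SQuAD-style token F1 and exact-match (EM) scores. A minimal sketch of the two metrics, assuming plain lowercasing and whitespace tokenization (the official evaluation scripts apply additional answer normalization, so treat this as an approximation, not the scoring code used here):

```python
from collections import Counter

def exact_match(prediction, gold):
    """1.0 if the normalized answer strings are identical, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("la ciutat de Barcelona", "Barcelona"))  # → 0.4
print(exact_match("Barcelona", "barcelona"))            # → 1.0
```

Per-example scores like these are averaged over the test set (and, with multiple gold answers, the maximum per example is taken) to produce the F1/EM pairs in the table.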
## Licensing Information