Update dataset names

README.md
It contains the following tasks and their related datasets:

1. Named Entity Recognition (NER)

   **[NER (AnCora)](https://zenodo.org/record/4762031#.YKaFjqGxWUk)**: named entities extracted from the original [Ancora](https://doi.org/10.5281/zenodo.4762030) version, filtering out some unconventional ones, like book titles, and transcribed into a standard CONLL-IOB format.

2. Part-of-Speech Tagging (POS)

   **[POS (AnCora)](https://zenodo.org/record/4762031#.YKaFjqGxWUk)**: from the [Universal Dependencies treebank](https://github.com/UniversalDependencies/UD_Catalan-AnCora) of the well-known AnCora corpus.

3. Text Classification (TC)

4. Textual Entailment (TE)

   **[TE-ca](https://huggingface.co/datasets/projecte-aina/teca)**: 21,163 pairs of premises and hypotheses, annotated with the inference relation between them (implication, contradiction, or neutral), extracted from the [Catalan Textual Corpus](https://huggingface.co/datasets/projecte-aina/catalan_textual_corpus).

5. Semantic Textual Similarity (STS)

   **[STS-ca](https://huggingface.co/datasets/projecte-aina/sts-ca)**: more than 3,000 sentence pairs, annotated with their semantic similarity, scraped from the [Catalan Textual Corpus](https://huggingface.co/datasets/projecte-aina/catalan_textual_corpus).

6. Question Answering (QA)

   **[CatalanQA](https://huggingface.co/datasets/projecte-aina/catalanqa)**: an aggregation of two previous datasets (VilaQuAD and ViquiQuAD), with 21,427 Q/A pairs balanced by question type and one question and one answer per context, although contexts can repeat multiple times.

   **[XQuAD-ca](https://huggingface.co/datasets/projecte-aina/xquad-ca)**: the Catalan translation of XQuAD, a multilingual collection of manual translations of 1,190 question-answer pairs from English Wikipedia, used only as a _test set_.
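The NER data above is distributed in CONLL-IOB format, where each token carries a `B-` (begin), `I-` (inside), or `O` (outside) tag. As a rough illustration only (the helper name, tokens, and tags below are invented examples, not taken from AnCora), entity spans can be recovered from such a tag sequence like this:

```python
def iob_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from a CONLL-IOB tag sequence."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # start of a new entity
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)             # continuation of the open entity
        else:                                    # "O" or an inconsistent I- tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(toks)) for etype, toks in spans]

tokens = ["La", "Generalitat", "de", "Catalunya", "és", "a", "Barcelona", "."]
tags   = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC", "O"]
print(iob_to_spans(tokens, tags))
# → [('ORG', 'Generalitat de Catalunya'), ('LOC', 'Barcelona')]
```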
Here are the train/dev/test splits of the datasets:

| Task (Dataset) | Total | Train | Dev | Test |
|:--|:--|:--|:--|:--|
| NER (AnCora) | 13,581 | 10,628 | 1,427 | 1,526 |
| POS (AnCora) | 16,678 | 13,123 | 1,709 | 1,846 |
| STS (STS-ca) | 3,073 | 2,073 | 500 | 500 |
| TC (TeCla) | 137,775 | 110,203 | 13,786 | 13,786 |
| TE (TE-ca) | 21,163 | 16,930 | 2,116 | 2,117 |
| QA (VilaQuAD) | 6,282 | 3,882 | 1,200 | 1,200 |
| QA (ViquiQuAD) | 14,239 | 11,255 | 1,492 | 1,429 |
| QA (CatalanQA) | 21,427 | 17,135 | 2,157 | 2,135 |
### Evaluation Results

| Model | NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TE-ca (Acc.) | VilaQuAD (F1/EM) | ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca<sup>1</sup> (F1/EM) |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| RoBERTa-base-ca-v2 | **89.45** | 99.09 | 79.07 | **74.26** | **83.14** | **87.74/72.58** | **88.72/75.91** | **89.50**/76.63 | **73.64/55.42** |
| BERTa | 88.94 | **99.10** | **80.19** | 73.65 | 79.26 | 85.93/70.58 | 87.12/73.11 | 89.17/**77.14** | 69.20/51.47 |
| mBERT | 87.36 | 98.98 | 74.26 | 69.90 | 74.63 | 82.78/67.33 | 86.89/73.53 | 86.90/74.19 | 68.79/50.80 |
| XLM-RoBERTa | 88.07 | 99.03 | 61.61 | 70.14 | 33.30 | 86.29/71.83 | 86.88/73.11 | 88.17/75.93 | 72.55/54.16 |

<sup>1</sup>: Trained on CatalanQA, tested on XQuAD-ca.
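The QA columns above report SQuAD-style token F1 and exact-match (EM) scores. A minimal sketch of the two metrics, assuming plain lowercasing and whitespace tokenization (the official evaluation scripts apply additional answer normalization, so treat this as an approximation, not the scoring code used here):

```python
from collections import Counter

def exact_match(prediction, gold):
    """1.0 if the normalized answer strings are identical, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("la ciutat de Barcelona", "Barcelona"))  # → 0.4
print(exact_match("Barcelona", "barcelona"))            # → 1.0
```

Per-example scores like these are averaged over the test set (and, with multiple gold answers, the maximum per example is taken) to produce the F1/EM pairs in the table.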
## Licensing Information