projecte-aina/roberta-base-ca-v2

RoBERTa-base-ca-v2

Catalan Textual Corpus


gonzalez-agirre committed on Jul 22, 2022

Commit fdd8ca4 · 1 Parent(s): c57b9f4

Update README.md

Files changed (1)

README.md +7 -7


@@ -120,14 +120,14 @@ that has been created along with the model.
 It contains the following tasks and their related datasets:
- 1. Part-of-Speech Tagging (POS)
-    Catalan-Ancora: from the [Universal Dependencies treebank](https://github.com/UniversalDependencies/UD_Catalan-AnCora) of the well-known Ancora corpus
- 2. Named Entity Recognition (NER)
-    **[AnCora Catalan 2.0.0](https://zenodo.org/record/4762031#.YKaFjqGxWUk)**: extracted named entities from the original [Ancora](https://doi.org/10.5281/zenodo.4762030) version,
-    filtering out some unconventional ones, like book titles, and transcribed them into a standard CONLL-IOB format
  3. Text Classification (TC)
@@ -135,7 +135,7 @@ It contains the following tasks and their related datasets:
  4. Textual Entailment (TE)
-    **[TeCa](https://huggingface.co/datasets/projecte-aina/teca)**: consisting of 21,163 pairs of premises and hypotheses, annotated according to the inference relation they have (implication, contradiction, or neutral), extracted from the [Catalan Textual Corpus](https://huggingface.co/datasets/projecte-aina/catalan_textual_corpus).
  5. Semantic Textual Similarity (STS)
@@ -159,7 +159,7 @@ Here are the train/dev/test splits of the datasets:
 | POS (Ancora)| 16,678 | 13,123 | 1,709 | 1,846 |
 | STS         | 3,073 | 2,073 | 500 | 500 |
 | TC (TeCla) |  137,775 | 110,203 | 13,786 |  13,786|
-| TE (TeCa) |  21,163 | 16,930 | 2,116 | 2,117
 | QA (VilaQuAD) | 6,282  | 3,882  | 1,200  | 1,200 |
 | QA (ViquiQuAD) | 14,239  | 11,255  | 1,492  | 1,429 |
 | QA (CatalanQA) | 21,427  | 17,135  | 2,157  | 2,135 |

 It contains the following tasks and their related datasets:
+ 1. Named Entity Recognition (NER)
+    **[AnCora Catalan 2.0.0](https://zenodo.org/record/4762031#.YKaFjqGxWUk)**: extracted named entities from the original [Ancora](https://doi.org/10.5281/zenodo.4762030) version, filtering out some unconventional ones, like book titles, and transcribed them into a standard CONLL-IOB format.
+ 2. Part-of-Speech Tagging (POS)
+    Catalan-Ancora: from the [Universal Dependencies treebank](https://github.com/UniversalDependencies/UD_Catalan-AnCora) of the well-known Ancora corpus.
  3. Text Classification (TC)
  4. Textual Entailment (TE)
+    **[TECa](https://huggingface.co/datasets/projecte-aina/teca)**: consisting of 21,163 pairs of premises and hypotheses, annotated according to the inference relation they have (implication, contradiction, or neutral), extracted from the [Catalan Textual Corpus](https://huggingface.co/datasets/projecte-aina/catalan_textual_corpus).
  5. Semantic Textual Similarity (STS)
 | POS (Ancora)| 16,678 | 13,123 | 1,709 | 1,846 |
 | STS         | 3,073 | 2,073 | 500 | 500 |
 | TC (TeCla) |  137,775 | 110,203 | 13,786 |  13,786|
+| TE (TECa) |  21,163 | 16,930 | 2,116 | 2,117
 | QA (VilaQuAD) | 6,282  | 3,882  | 1,200  | 1,200 |
 | QA (ViquiQuAD) | 14,239  | 11,255  | 1,492  | 1,429 |
 | QA (CatalanQA) | 21,427  | 17,135  | 2,157  | 2,135 |
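
Since this commit touches a row of the splits table, a quick consistency check on the figures may be useful when reviewing it. The sketch below is only an illustration: it assumes the four numeric columns are total, train, dev and test in that order (the header row is not visible in this diff), and `SPLITS` and `split_mismatches` are hypothetical names introduced here, not part of the repository.

```python
# Rows transcribed from the splits table in this diff.
# Assumption: the columns are (total, train, dev, test), in that order.
SPLITS = {
    "POS (Ancora)": (16678, 13123, 1709, 1846),
    "STS": (3073, 2073, 500, 500),
    "TC (TeCla)": (137775, 110203, 13786, 13786),
    "TE (TECa)": (21163, 16930, 2116, 2117),
    "QA (VilaQuAD)": (6282, 3882, 1200, 1200),
    "QA (ViquiQuAD)": (14239, 11255, 1492, 1429),
    "QA (CatalanQA)": (21427, 17135, 2157, 2135),
}


def split_mismatches(rows):
    """Return {task: (stated_total, summed_total)} for rows whose splits do not add up."""
    mismatches = {}
    for task, (total, train, dev, test) in rows.items():
        summed = train + dev + test
        if summed != total:
            mismatches[task] = (total, summed)
    return mismatches


if __name__ == "__main__":
    for task, (stated, summed) in split_mismatches(SPLITS).items():
        print(f"{task}: stated total {stated}, splits sum to {summed}")
```

With the values as transcribed here, the TE (TECa) row edited by this commit is consistent, and only the QA (ViquiQuAD) row is flagged: its three splits sum to 14,176 against a stated 14,239. Whether that reflects a typo in the table or a different meaning of the first column is not something this diff settles.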