@@ -10,9 +10,9 @@ tags:
 - "catalan"
-- "named entity recognition"
-- "ner"
 - "CaText"
@@ -20,25 +20,29 @@ tags:
 datasets:
-- "projecte-aina/ancora-ca-ner"
 metrics:
 - f1
 model-index:
-- name: roberta-base-ca-cased-ner
   results:
   - task:
       type: token-classification
     dataset:
-      type:   projecte-aina/ancora-ca-ner
-      name: Ancora-ca-NER
     metrics:
       - name: F1
         type: f1
-        value: 0.8813
 widget:
 - text: "Em dic Lluïsa i visc a Santa Maria del Camí."
@@ -49,7 +53,7 @@ widget:
 ---
-# Catalan BERTa (roberta-base-ca) finetuned for Named Entity Recognition.
 ## Table of Contents
 - [Model Description](#model-description)
@@ -68,11 +72,11 @@ widget:
 ## Model description
-The **roberta-base-ca-cased-ner** is a Named Entity Recognition (NER) model for the Catalan language fine-tuned from the [roberta-base-ca](https://huggingface.co/projecte-aina/roberta-base-ca) model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-size corpus collected from publicly available corpora and crawlers (check the roberta-base-ca model card for more details).
 ## Intended Uses and Limitations
-**roberta-base-ca-cased-ner** model can be used to recognize Named Entities in the provided text. The model is limited by its training dataset and may not generalize well for all use cases.
 ## How to Use
@@ -82,17 +86,16 @@ Here is how to use this model:
 from transformers import pipeline
 from pprint import pprint
-nlp = pipeline("ner", model="projecte-aina/roberta-base-ca-cased-ner")
 example = "Em dic Lluïsa i visc a Santa Maria del Camí."
-ner_results = nlp(example)
-pprint(ner_results)
 ```
 ## Training
 ### Training data
-We used the NER dataset in Catalan called [AnCora-Ca-NER](https://huggingface.co/datasets/projecte-aina/ancora-ca-ner) for training and evaluation.
 ### Training Procedure
 The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5 epochs. We then selected the best checkpoint using the downstream task metric in the corresponding development set and then evaluated it on the test set.
@@ -103,16 +106,16 @@ The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5
 This model was finetuned maximizing F1 score.
-### Evaluation results
-We evaluated the _roberta-base-ca-cased-ner_ on the AnCora-Ca-NER test set against standard multilingual and monolingual baselines:
-| Model        | Ancora-ca-ner (F1)|
 | ------------|:-------------|
-| roberta-base-ca-cased-ner | **88.13** |
-| mBERT       | 86.38 |
-| XLM-RoBERTa | 87.66 |
-| WikiBERT-ca | 77.66 |
 For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).
@@ -120,8 +123,7 @@ For more details, check the fine-tuning and evaluation scripts in the official [
 [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
-## Citation Information
 If you use any of these resources (datasets or models) in your work, please cite our latest paper:
 ```bibtex
 @inproceedings{armengol-estape-etal-2021-multilingual,
@@ -146,4 +148,11 @@ If you use any of these resources (datasets or models) in your work, please cite
 ```
 ### Funding
 This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

 - "catalan"
+- "part of speech tagging"
+- "pos"
 - "CaText"
 datasets:
+- "universal_dependencies"
 metrics:
 - f1
+inference:
+  parameters:
+    aggregation_strategy: "first"
 model-index:
+- name: roberta-base-ca-cased-pos
   results:
   - task:
       type: token-classification
     dataset:
+      type:   universal_dependencies
+      name: Ancora-ca-POS
     metrics:
       - name: F1
         type: f1
+        value: 0.9893832385244624
 widget:
 - text: "Em dic Lluïsa i visc a Santa Maria del Camí."
 ---
+# Catalan BERTa (roberta-base-ca) finetuned for Part-of-speech-tagging (POS)
 ## Table of Contents
 - [Model Description](#model-description)
 ## Model description
+The **roberta-base-ca-cased-pos** is a part-of-speech (POS) tagging model for the Catalan language, fine-tuned from the roberta-base-ca model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-size corpus collected from publicly available corpora and crawlers.
 ## Intended Uses and Limitations
+The **roberta-base-ca-cased-pos** model can be used for part-of-speech (POS) tagging of Catalan text. The model is limited by its training dataset and may not generalize well for all use cases.
 ## How to Use
 from transformers import pipeline
 from pprint import pprint
+nlp = pipeline("token-classification", model="projecte-aina/roberta-base-ca-cased-pos")
 example = "Em dic Lluïsa i visc a Santa Maria del Camí."
+pos_results = nlp(example)
+pprint(pos_results)
 ```
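The front matter sets `aggregation_strategy: "first"` for the hosted inference widget. A minimal sketch of applying the same setting locally is shown below, assuming the aggregated pipeline output carries the usual `word` and `entity_group` keys. The helper names `to_word_tag_pairs` and `tag_text` are hypothetical and not part of the model card or of `transformers`.

```python
from typing import Dict, List, Tuple

MODEL_ID = "projecte-aina/roberta-base-ca-cased-pos"


def to_word_tag_pairs(results: List[Dict]) -> List[Tuple[str, str]]:
    """Reduce aggregated token-classification output to (word, tag) pairs.

    Assumes each item has "word" and "entity_group" keys, which is the
    shape returned when an aggregation strategy other than "none" is set.
    """
    return [(item["word"].strip(), item["entity_group"]) for item in results]


def tag_text(text: str) -> List[Tuple[str, str]]:
    """Hypothetical convenience wrapper: tag `text` and return (word, tag) pairs."""
    # Imported lazily so to_word_tag_pairs can be used without transformers installed.
    from transformers import pipeline

    nlp = pipeline(
        "token-classification",
        model=MODEL_ID,
        # Same setting as the `inference.parameters` entry in the front matter:
        # each word takes the label predicted for its first sub-word token.
        aggregation_strategy="first",
    )
    return to_word_tag_pairs(nlp(text))
```

Calling `tag_text("Em dic Lluïsa i visc a Santa Maria del Camí.")` downloads the model on first use. Because the POS labels carry no `B-`/`I-` prefixes, adjacent words that share a tag may be merged into a single group by the aggregation step, so the pairs are not guaranteed to be one per word.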
 ## Training
 ### Training data
+For training and evaluation, we used the Catalan POS dataset from the [Universal Dependencies Treebank](https://huggingface.co/datasets/universal_dependencies), which we refer to as _Ancora-ca-pos_.
 ### Training Procedure
 The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5 epochs. We then selected the best checkpoint using the downstream task metric in the corresponding development set and then evaluated it on the test set.
 This model was finetuned maximizing F1 score.
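The procedure above can be summarised as a small configuration plus a selection step. The sketch below is illustrative only and is not the official fine-tuning script (that lives in the GitHub repository linked in this card). `FineTuneConfig` and `select_best_checkpoint` are hypothetical names, and the per-checkpoint scores are assumed to be development-set F1 values supplied by the caller.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters stated in the Training Procedure section."""
    batch_size: int = 16
    learning_rate: float = 5e-5
    num_epochs: int = 5
    selection_metric: str = "f1"  # checkpoints are compared on development-set F1


def select_best_checkpoint(dev_scores: Dict[str, float]) -> str:
    """Return the checkpoint name with the highest development-set score."""
    if not dev_scores:
        raise ValueError("no checkpoints to choose from")
    return max(dev_scores, key=dev_scores.get)
```

Only the checkpoint chosen this way is then scored on the test set, which keeps the test set out of model selection.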
+## Evaluation results
+We evaluated the _roberta-base-ca-cased-pos_ on the AnCora-Ca-POS test set against standard multilingual and monolingual baselines:
+| Model        | AnCora-Ca-POS (F1)   |
 | ------------|:-------------|
+| roberta-base-ca-cased-pos | **98.93** |
+| mBERT       | 98.82 |
+| XLM-RoBERTa | 98.89 |
+| WikiBERT-ca | 97.60 |
 For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).
 [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+## Citation Information
 If you use any of these resources (datasets or models) in your work, please cite our latest paper:
 ```bibtex
 @inproceedings{armengol-estape-etal-2021-multilingual,
 ```
 ### Funding
 This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
+## Contributions
+[N/A]