Commit c993230 (parent: 944ed9d): Update README.md

README.md:
---
language:
- ca
license: apache-2.0
tags:
- "catalan"
- "masked-lm"
- "RoBERTa-base-ca-v2"
- "CaText"
- "Catalan Textual Corpus"
widget:
- text: "El Català és una llengua molt <mask>."
- text: "Salvador Dalí va viure a <mask>."
- text: "Vaig al <mask> a buscar bolets."
- text: "Antoni Gaudí va ser un <mask> molt important per la ciutat."
- text: "Catalunya és una referència en <mask> a nivell europeu."
---

# Catalan BERTa-v2 (roberta-base-ca-v2) base model

## Table of Contents
- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Use](#how-to-use)
- [Training](#training)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
  - [CLUB Benchmark](#club-benchmark)
  - [Evaluation Results](#evaluation-results)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Funding](#funding)
- [Contributions](#contributions)

## Model Description

RoBERTa-ca-v2 is a transformer-based masked language model for the Catalan language.
It is based on the [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) base model
and has been trained on a medium-sized corpus collected from publicly available corpora and crawlers.

## Intended Uses and Limitations

The **roberta-base-ca-v2** model is ready to use only for masked language modeling, i.e. the Fill Mask task (try the inference API or read the next section).
However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.
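
As an illustration, such fine-tuning can be set up with the standard Hugging Face `Trainer` API. The sketch below is a minimal, hypothetical setup: the toy texts, labels, `num_labels`, and output directory are placeholders, not the recipe used for the official downstream models:

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = 'projecte-aina/roberta-base-ca-v2'
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels is task-dependent; 2 is a placeholder for a binary task.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy labeled examples standing in for a real corpus such as TeCla.
texts = ['El Barça guanya la lliga.', 'El Govern aprova els pressupostos.']
labels = [0, 1]
enc = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    """Wraps tokenized encodings and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='roberta-ca-v2-finetuned', num_train_epochs=1),
    train_dataset=ToyDataset(enc, labels),
)
trainer.train()
```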

## How to Use

Here is how to use this model for the Fill Mask task:

```python
from pprint import pprint
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the model with its masked-language-modeling head.
tokenizer_hf = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')
model = AutoModelForMaskedLM.from_pretrained('projecte-aina/roberta-base-ca-v2')
model.eval()

# Build a fill-mask pipeline and query it with a masked Catalan sentence.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "Em dic <mask>."
res_hf = pipeline(text)

# Print the candidate tokens proposed for the <mask> position.
pprint([r['token_str'] for r in res_hf])
```
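
Equivalently, the generic `pipeline` factory can load the model and tokenizer in one call; this mirrors the snippet above, and the `top_k` argument is optional:

```python
from transformers import pipeline

# The fill-mask pipeline resolves both model and tokenizer from the model name.
unmasker = pipeline('fill-mask', model='projecte-aina/roberta-base-ca-v2')
print(unmasker('Em dic <mask>.', top_k=5))
```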

## Training

### Training Data

The training corpus consists of several corpora gathered from web crawling and public corpora.

| Corpus  | Size in GB |
|---------|-----------:|
| …       | …          |
| Vilaweb | 0.06       |
| Tweets  | 0.02       |

### Training Procedure

The training corpus has been tokenized using a byte-level version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2),
as used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 52,000 tokens.
The RoBERTa-ca-v2 pretraining consists of masked language model training that follows the approach employed for the RoBERTa base model,
with the same hyperparameters as in the original work.
The training lasted a total of 96 hours on 16 NVIDIA V100 GPUs with 16GB of memory each.
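
A quick way to see this tokenizer at work (a small sketch; the example sentence is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')

# The byte-level BPE vocabulary should report the 52,000 entries noted above.
print(tokenizer.vocab_size)

# Inspect how a Catalan sentence is split into subword units.
print(tokenizer.tokenize('Catalunya és una referència a nivell europeu.'))
```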

## Evaluation

### CLUB Benchmark

The BERTa model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
which was created along with the model.

Here are the train/dev/test splits of the datasets:

| Task (Dataset) | Total   | Train   | Dev    | Test   |
|----------------|--------:|--------:|-------:|-------:|
| …              | …       | …       | …      | …      |
| TC (TeCla)     | 137,775 | 110,203 | 13,786 | 13,786 |
| QA (ViquiQuAD) | 14,239  | 11,255  | 1,492  | 1,429  |
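
Once fine-tuned on a CLUB task, a checkpoint can be queried through the standard task pipelines. The sketch below assumes a hypothetical question-answering checkpoint derived from this model; the model path is illustrative, not an official release:

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoint; substitute your own model directory.
qa = pipeline('question-answering', model='path/to/roberta-base-ca-v2-finetuned-qa')
print(qa(question='On va viure Salvador Dalí?',
         context='Salvador Dalí va viure a Figueres durant molts anys.'))
```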

### Evaluation Results

| Model       | NER (F1) | POS (F1) | STS (Pearson) | TC (accuracy) | QA (ViquiQuAD) (F1/EM) | QA (XQuAD) (F1/EM) |
|-------------|:--------:|:--------:|:-------------:|:-------------:|:----------------------:|:------------------:|
| …           | …        | …        | …             | …             | …                      | …                  |
| XLM-RoBERTa | 87.66    | 98.89    | 75.40         | 71.68         | 85.50/70.47            | 67.10/46.42        |
| WikiBERT-ca | 77.66    | 97.60    | 77.18         | 73.22         | 85.45/70.75            | 65.21/36.60        |

## Licensing Information

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation Information

If you use any of these resources (datasets or models) in your work, please cite our latest paper:

```bibtex
@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi and
      Carrino, Casimiro Pio and
      Rodriguez-Penagos, Carlos and
      de Gibert Bonet, Ona and
      Armentano-Oller, Carme and
      Gonzalez-Agirre, Aitor and
      Melero, Maite and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}
```

## Funding

This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/en/inici/index.html) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

## Contributions

[N/A]