Commit c993230 (parent: 944ed9d): Update README.md

README.md:
---
language:
- ca
license: apache-2.0
tags:
- "catalan"
- "masked-lm"
- "RoBERTa-base-ca-v2"
- "CaText"
- "Catalan Textual Corpus"
widget:
- text: "El Català és una llengua molt <mask>."
- text: "Salvador Dalí va viure a <mask>."
- text: "Vaig al <mask> a buscar bolets."
- text: "Antoni Gaudí va ser un <mask> molt important per la ciutat."
- text: "Catalunya és una referència en <mask> a nivell europeu."
---

# Catalan BERTa-v2 (roberta-base-ca-v2) base model

## Table of Contents
- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Use](#how-to-use)
- [Training](#training)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
  - [CLUB Benchmark](#club-benchmark)
  - [Evaluation Results](#evaluation-results)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Funding](#funding)
- [Contributions](#contributions)

## Model Description

RoBERTa-ca-v2 is a transformer-based masked language model for the Catalan language.
It is based on the [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) base model
and has been trained on a medium-sized corpus collected from publicly available corpora and crawlers.

## Intended Uses and Limitations

The **roberta-base-ca-v2** model is ready to use only for masked language modeling, i.e. the Fill Mask task (try the inference API or read the next section).
However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.
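
As an illustration, such fine-tuning can be set up with the standard Hugging Face `Trainer` API. The sketch below is a minimal, hypothetical setup: the toy texts, labels, `num_labels`, and output directory are placeholders, not the recipe used for the official downstream models:

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = 'projecte-aina/roberta-base-ca-v2'
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels is task-dependent; 2 is a placeholder for a binary task.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy labeled examples standing in for a real corpus such as TeCla.
texts = ['El Barça guanya la lliga.', 'El Govern aprova els pressupostos.']
labels = [0, 1]
enc = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    """Wraps tokenized encodings and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='roberta-ca-v2-finetuned', num_train_epochs=1),
    train_dataset=ToyDataset(enc, labels),
)
trainer.train()
```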

## How to Use

Here is how to use this model for the Fill Mask task:

```python
from pprint import pprint
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the model with its masked-language-modeling head.
tokenizer_hf = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')
model = AutoModelForMaskedLM.from_pretrained('projecte-aina/roberta-base-ca-v2')
model.eval()

# Build a fill-mask pipeline and query it with a masked Catalan sentence.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "Em dic <mask>."
res_hf = pipeline(text)

# Print the candidate tokens proposed for the <mask> position.
pprint([r['token_str'] for r in res_hf])
```
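
Equivalently, the generic `pipeline` factory can load the model and tokenizer in one call; this mirrors the snippet above, and the `top_k` argument is optional:

```python
from transformers import pipeline

# The fill-mask pipeline resolves both model and tokenizer from the model name.
unmasker = pipeline('fill-mask', model='projecte-aina/roberta-base-ca-v2')
print(unmasker('Em dic <mask>.', top_k=5))
```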

## Training

### Training Data

The training corpus consists of several corpora gathered from web crawling and public corpora.

| Corpus  | Size in GB |
|---------|-----------:|
| …       | …          |
| Vilaweb | 0.06       |
| Tweets  | 0.02       |

### Training Procedure

The training corpus has been tokenized using a byte-level version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2),
as used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 52,000 tokens.
The RoBERTa-ca-v2 pretraining consists of masked language model training that follows the approach employed for the RoBERTa base model,
with the same hyperparameters as in the original work.
The training lasted a total of 96 hours on 16 NVIDIA V100 GPUs with 16GB of memory each.
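
A quick way to see this tokenizer at work (a small sketch; the example sentence is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')

# The byte-level BPE vocabulary should report the 52,000 entries noted above.
print(tokenizer.vocab_size)

# Inspect how a Catalan sentence is split into subword units.
print(tokenizer.tokenize('Catalunya és una referència a nivell europeu.'))
```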

## Evaluation

### CLUB Benchmark

The BERTa model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
which was created along with the model.

Here are the train/dev/test splits of the datasets:

| Task (Dataset) | Total   | Train   | Dev    | Test   |
|----------------|--------:|--------:|-------:|-------:|
| …              | …       | …       | …      | …      |
| TC (TeCla)     | 137,775 | 110,203 | 13,786 | 13,786 |
| QA (ViquiQuAD) | 14,239  | 11,255  | 1,492  | 1,429  |
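
Once fine-tuned on a CLUB task, a checkpoint can be queried through the standard task pipelines. The sketch below assumes a hypothetical question-answering checkpoint derived from this model; the model path is illustrative, not an official release:

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoint; substitute your own model directory.
qa = pipeline('question-answering', model='path/to/roberta-base-ca-v2-finetuned-qa')
print(qa(question='On va viure Salvador Dalí?',
         context='Salvador Dalí va viure a Figueres durant molts anys.'))
```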

### Evaluation Results

| Model       | NER (F1) | POS (F1) | STS (Pearson) | TC (accuracy) | QA (ViquiQuAD) (F1/EM) | QA (XQuAD) (F1/EM) |
|-------------|:--------:|:--------:|:-------------:|:-------------:|:----------------------:|:------------------:|
| …           | …        | …        | …             | …             | …                      | …                  |
| XLM-RoBERTa | 87.66    | 98.89    | 75.40         | 71.68         | 85.50/70.47            | 67.10/46.42        |
| WikiBERT-ca | 77.66    | 97.60    | 77.18         | 73.22         | 85.45/70.75            | 65.21/36.60        |

## Licensing Information

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation Information

If you use any of these resources (datasets or models) in your work, please cite our latest paper:

```bibtex
@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi and
      Carrino, Casimiro Pio and
      Rodriguez-Penagos, Carlos and
      de Gibert Bonet, Ona and
      Armentano-Oller, Carme and
      Gonzalez-Agirre, Aitor and
      Melero, Maite and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}
```

## Funding

This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/en/inici/index.html) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

## Contributions

[N/A]