gonzalez-agirre committed
Commit c993230 · 1 Parent(s): 944ed9d

Update README.md

Files changed (1):
  1. README.md +105 -21
README.md CHANGED
@@ -1,9 +1,22 @@
  ---
- language: "ca"
  tags:
- - masked-lm
- - RoBERTa-base-ca-v2
- - catalan
  widget:
  - text: "El Català és una llengua molt <mask>."
  - text: "Salvador Dalí va viure a <mask>."
@@ -13,24 +26,57 @@ widget:
  - text: "Vaig al <mask> a buscar bolets."
  - text: "Antoni Gaudí va ser un <mask> molt important per la ciutat."
  - text: "Catalunya és una referència en <mask> a nivell europeu."
- license: apache-2.0
  ---

  ## Model description

  RoBERTa-ca-v2 is a transformer-based masked language model for the Catalan language.
  It is based on the [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) base model
  and has been trained on a medium-size corpus collected from publicly available corpora and crawlers.

- ## Tokenization and pretraining

- The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
- used in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 52,000 tokens.
- The RoBERTa-ca-v2 pretraining consists of a masked language model training that follows the approach employed for the RoBERTa base model
- with the same hyperparameters as in the original work.
- The training lasted a total of 96 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM.

- ## Training corpora and preprocessing

  The training corpus consists of several corpora gathered from web crawling and public corpora.

@@ -52,9 +98,18 @@ The training corpus consists of several corpora gathered from web crawling and p
  | Vilaweb | 0.06 |
  | Tweets | 0.02 |

  ## Evaluation

- ### CLUB benchmark

  The BERTa model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
  which has been created along with the model.
@@ -95,7 +150,7 @@ Here are the train/dev/test splits of the datasets:
  | TC (TeCla) | 137,775 | 110,203 | 13,786 | 13,786|
  | QA (ViquiQuAD) | 14,239 | 11,255 | 1,492 | 1,429 |

- ### Results

  | Task | NER (F1) | POS (F1) | STS (Pearson) | TC (accuracy) | QA (ViquiQuAD) (F1/EM) | QA (XQuAD) (F1/EM) |
  | ------------|:-------------:| -----:|:------|:-------|:------|:----|
@@ -105,10 +160,39 @@ Here are the train/dev/test splits of the datasets:
  | XLM-RoBERTa | 87.66 | 98.89 | 75.40 | 71.68 | 85.50/70.47 | 67.10/46.42 |
  | WikiBERT-ca | 77.66 | 97.60 | 77.18 | 73.22 | 85.45/70.75 | 65.21/36.60 |

- ## Intended uses & limitations
- The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section)
- However, the is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification or Named Entity Recognition.
-
-
- ## Funding
- This work was funded by the Generalitat de Catalunya within the framework of the AINA language technologies plan.

  ---
+ language:
+
+ - ca
+
+ license: apache-2.0
+
  tags:
+
+ - "catalan"
+
+ - "masked-lm"
+
+ - "RoBERTa-base-ca-v2"
+
+ - "CaText"
+
+ - "Catalan Textual Corpus"
+
  widget:
  - text: "El Català és una llengua molt <mask>."
  - text: "Salvador Dalí va viure a <mask>."

  - text: "Vaig al <mask> a buscar bolets."
  - text: "Antoni Gaudí va ser un <mask> molt important per la ciutat."
  - text: "Catalunya és una referència en <mask> a nivell europeu."
+
  ---

+ # Catalan BERTa-v2 (roberta-base-ca-v2) base model
+
+ ## Table of Contents
+ - [Model Description](#model-description)
+ - [Intended Uses and Limitations](#intended-uses-and-limitations)
+ - [How to Use](#how-to-use)
+ - [Training](#training)
+ - [Training Data](#training-data)
+ - [Training Procedure](#training-procedure)
+ - [Evaluation](#evaluation)
+ - [CLUB Benchmark](#club-benchmark)
+ - [Evaluation Results](#evaluation-results)
+ - [Licensing Information](#licensing-information)
+ - [Citation Information](#citation-information)
+ - [Funding](#funding)
+ - [Contributions](#contributions)
+
+
  ## Model description

  RoBERTa-ca-v2 is a transformer-based masked language model for the Catalan language.
  It is based on the [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) base model
  and has been trained on a medium-size corpus collected from publicly available corpora and crawlers.

+ ## Intended Uses and Limitations

+ The **roberta-base-ca-v2** model is ready-to-use only for masked language modeling to perform the Fill Mask task (try the inference API or read the next section).
+ However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.
+
+ ## How to Use

+ Here is how to use this model:
+
+ ```python
+ from transformers import AutoModelForMaskedLM
+ from transformers import AutoTokenizer, FillMaskPipeline
+ from pprint import pprint
+
+ # Load the tokenizer and the pretrained masked-language model from the Hub
+ tokenizer_hf = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')
+ model = AutoModelForMaskedLM.from_pretrained('projecte-aina/roberta-base-ca-v2')
+ model.eval()
+
+ # Build a fill-mask pipeline and predict the masked token in a Catalan sentence
+ pipeline = FillMaskPipeline(model, tokenizer_hf)
+ text = "Em dic <mask>."
+ res_hf = pipeline(text)
+ pprint([r['token_str'] for r in res_hf])
+ ```
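
By default, `FillMaskPipeline` returns the five highest-scoring candidates for the masked position, each with its score, token id, token string, and the completed sentence; the snippet above prints only the predicted token strings.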
+ ## Training
+
+ ### Training data

  The training corpus consists of several corpora gathered from web crawling and public corpora.


  | Vilaweb | 0.06 |
  | Tweets | 0.02 |

+ ### Training Procedure
+
+ The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
+ used in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 52,000 tokens.
+ The RoBERTa-ca-v2 pretraining consists of a masked language model training that follows the approach employed for the RoBERTa base model
+ with the same hyperparameters as in the original work.
+ The training lasted a total of 96 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM.
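
As a quick check of the tokenizer described above, the sketch below (not part of the original card) loads it from the Hub and inspects the vocabulary; it assumes the `projecte-aina/roberta-base-ca-v2` checkpoint from the usage example and the Hugging Face `transformers` library.

```python
from transformers import AutoTokenizer

# Load the byte-level BPE tokenizer shipped with the checkpoint
# (same model id as in the "How to Use" example).
tokenizer = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')

# The card states a vocabulary of 52,000 tokens; print what the tokenizer reports.
print(tokenizer.vocab_size)

# Byte-level BPE splits unseen words into subword pieces.
print(tokenizer.tokenize("Generalitat de Catalunya"))
```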
+
+
  ## Evaluation

+ ### CLUB Benchmark

  The BERTa model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
  which has been created along with the model.

  | TC (TeCla) | 137,775 | 110,203 | 13,786 | 13,786|
  | QA (ViquiQuAD) | 14,239 | 11,255 | 1,492 | 1,429 |

+ ### Evaluation Results

  | Task | NER (F1) | POS (F1) | STS (Pearson) | TC (accuracy) | QA (ViquiQuAD) (F1/EM) | QA (XQuAD) (F1/EM) |
  | ------------|:-------------:| -----:|:------|:-------|:------|:----|

  | XLM-RoBERTa | 87.66 | 98.89 | 75.40 | 71.68 | 85.50/70.47 | 67.10/46.42 |
  | WikiBERT-ca | 77.66 | 97.60 | 77.18 | 73.22 | 85.45/70.75 | 65.21/36.60 |
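
The fine-tuning recipes behind these results are not included in the card; as a rough illustration of the intended fine-tuning use, the sketch below (an editorial addition, not the CLUB setup) loads the pretrained checkpoint with a text-classification head. The label count and example sentence are placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Attach a randomly initialised classification head to the pretrained encoder;
# num_labels is a placeholder for the label count of the task at hand.
tokenizer = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')
model = AutoModelForSequenceClassification.from_pretrained(
    'projecte-aina/roberta-base-ca-v2', num_labels=5
)

# Tokenize a Catalan sentence and run a forward pass; the head still needs
# fine-tuning on labelled data (e.g. with the Trainer API) before its
# predictions are meaningful.
inputs = tokenizer("Catalunya és una referència en recerca.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 5])
```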

+ ## Licensing Information
+
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+ ## Citation Information
+
+ If you use any of these resources (datasets or models) in your work, please cite our latest paper:
+ ```bibtex
+ @inproceedings{armengol-estape-etal-2021-multilingual,
+     title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
+     author = "Armengol-Estap{\'e}, Jordi and
+       Carrino, Casimiro Pio and
+       Rodriguez-Penagos, Carlos and
+       de Gibert Bonet, Ona and
+       Armentano-Oller, Carme and
+       Gonzalez-Agirre, Aitor and
+       Melero, Maite and
+       Villegas, Marta",
+     booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
+     month = aug,
+     year = "2021",
+     address = "Online",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2021.findings-acl.437",
+     doi = "10.18653/v1/2021.findings-acl.437",
+     pages = "4933--4946",
+ }
+ ```
+
+ ## Funding
+ This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/en/inici/index.html) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
+
+
+ ## Contributions
+
+ [N/A]