Update README.md
README.md (CHANGED)
````diff
@@ -67,7 +67,7 @@ This model is ready-to-use only for masked language modeling (MLM) to perform th
 ```python
 from pprint import pprint
 from transformers import pipeline
-pipe = pipeline("fill-mask", model="aina/
 text = "El <mask> és el meu dia preferit de la setmana."
 pprint(pipe(text))
 ```
````
```diff
@@ -90,22 +90,22 @@ So, in a “teacher-student learning” setup, a relatively small student model
 
 The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
 
-| Corpus
-|
-| Catalan Crawling
-| RacoCatalá
-| Catalan Oscar
-| CaWaC
-| Cat. General Crawling
-| Wikipedia
-| DOGC
-| Padicat
-| ACN
-| Nació Digital
-| Cat.
-| Vilaweb
-| Catalan Open Subtitles
-| Tweets
 
 ## Evaluation
 
```
```diff
@@ -128,11 +128,10 @@ This model has been fine-tuned on the downstream tasks of the [Catalan Language
 
 This is how it compares to its teacher when fine-tuned on the aforementioned downstream tasks:
 
-|
-|
-| RoBERTa-
-|
-| DistilRoBERTa-base-ca-v2| xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx/xx.xx | xx.xx/xx.xx | xx.xx/xx.xx | xx.xx/xx.xx |
 
 <sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca.
 
```
```diff
@@ -146,7 +145,7 @@ The Text Mining Unit (TeMU) from Barcelona Supercomputing Center ([bsc-temu@bsc.
 
 For further information, send an email to [aina@bsc.es](aina@bsc.es).
 
-
 
 Copyright by the Text Mining Unit at Barcelona Supercomputing Center.
 
```
````diff
@@ -67,7 +67,7 @@ This model is ready-to-use only for masked language modeling (MLM) to perform th
 ```python
 from pprint import pprint
 from transformers import pipeline
+pipe = pipeline("fill-mask", model="projecte-aina/distilroberta-base-ca")
 text = "El <mask> és el meu dia preferit de la setmana."
 pprint(pipe(text))
 ```
````
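For reference, a `fill-mask` pipeline like the one added above returns a list of candidate fills ranked by probability, each a dict with `score`, `token_str`, and `sequence` keys. The sketch below walks that output structure using made-up sample scores, not real predictions from this model:

```python
# Illustrative sample of a fill-mask pipeline's output: a list of dicts,
# one per candidate token, ranked by probability. Scores here are invented.
sample_output = [
    {"score": 0.42, "token_str": "dissabte",
     "sequence": "El dissabte és el meu dia preferit de la setmana."},
    {"score": 0.31, "token_str": "diumenge",
     "sequence": "El diumenge és el meu dia preferit de la setmana."},
    {"score": 0.08, "token_str": "divendres",
     "sequence": "El divendres és el meu dia preferit de la setmana."},
]

def best_fill(predictions):
    """Return the highest-scoring candidate token."""
    return max(predictions, key=lambda p: p["score"])["token_str"]

print(best_fill(sample_output))  # prints "dissabte" for this sample data
```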
```diff
@@ -90,22 +90,22 @@ So, in a “teacher-student learning” setup, a relatively small student model
 
 The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
 
+| Corpus                   | Size (GB) |
+|--------------------------|-----------|
+| Catalan Crawling         | 13.00     |
+| RacoCatalá               | 8.10      |
+| Catalan Oscar            | 4.00      |
+| CaWaC                    | 3.60      |
+| Cat. General Crawling    | 2.50      |
+| Wikipedia                | 1.10      |
+| DOGC                     | 0.78      |
+| Padicat                  | 0.63      |
+| ACN                      | 0.42      |
+| Nació Digital            | 0.42      |
+| Cat. Government Crawling | 0.24      |
+| Vilaweb                  | 0.06      |
+| Catalan Open Subtitles   | 0.02      |
+| Tweets                   | 0.02      |
 
 ## Evaluation
 
```
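The “teacher-student learning” setup referenced in the hunk above is commonly implemented by minimizing the KL divergence between temperature-softened teacher and student output distributions. This is a minimal illustrative sketch of that objective over raw logits, not the project's actual training code:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-softened probabilities: higher temperature flattens
    # the distribution, exposing the teacher's "dark knowledge".
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)

# Identical logits give (near-)zero loss; diverging logits increase it.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # ≈ 0.0
print(distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]))  # > 0
```

In practice this term is combined with the ordinary MLM cross-entropy loss on the student, but the KL term above is the core of the teacher-student signal.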
```diff
@@ -128,11 +128,10 @@ This model has been fine-tuned on the downstream tasks of the [Catalan Language
 
 This is how it compares to its teacher when fine-tuned on the aforementioned downstream tasks:
 
+| Model \ Task            | NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TEca (Acc.) | VilaQuAD (F1/EM) | ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
+|-------------------------|:--------:|:--------:|:-------------:|:------------:|:-----------:|:----------------:|:-----------------:|:-----------------:|:-----------------------------:|
+| RoBERTa-base-ca-v2      | 89.29    | 98.96    | 79.07         | 74.26        | 83.14       | 87.74/72.58      | 88.72/75.91       | 89.50/76.63       | 73.64/55.42                   |
+| DistilRoBERTa-base-ca-v2| xx.xx    | xx.xx    | xx.xx         | xx.xx        | xx.xx       | xx.xx/xx.xx      | xx.xx/xx.xx       | xx.xx/xx.xx       | xx.xx/xx.xx                   |
 
 <sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca.
 
```
```diff
@@ -146,7 +145,7 @@ The Text Mining Unit (TeMU) from Barcelona Supercomputing Center ([bsc-temu@bsc.
 
 For further information, send an email to [aina@bsc.es](aina@bsc.es).
 
+### Copyright
 
 Copyright by the Text Mining Unit at Barcelona Supercomputing Center.
 
```