Update README.md
README.md (CHANGED)
````diff
@@ -67,7 +67,7 @@ This model is ready-to-use only for masked language modeling (MLM) to perform th
 ```python
 from pprint import pprint
 from transformers import pipeline
-pipe = pipeline("fill-mask", model="aina/
 text = "El <mask> és el meu dia preferit de la setmana."
 pprint(pipe(text))
 ```
````
```diff
@@ -90,22 +90,22 @@ So, in a “teacher-student learning” setup, a relatively small student model
 
 The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
 
-| Corpus
-|
-| Catalan Crawling
-| RacoCatalá
-| Catalan Oscar
-| CaWaC
-| Cat. General Crawling
-| Wikipedia
-| DOGC
-| Padicat
-| ACN
-| Nació Digital
-| Cat.
-| Vilaweb
-| Catalan Open Subtitles
-| Tweets
 
 ## Evaluation
 
```
```diff
@@ -128,11 +128,10 @@ This model has been fine-tuned on the downstream tasks of the [Catalan Language
 
 This is how it compares to its teacher when fine-tuned on the aforementioned downstream tasks:
 
-|
-|
-| RoBERTa-
-|
-| DistilRoBERTa-base-ca-v2| xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx/xx.xx | xx.xx/xx.xx | xx.xx/xx.xx | xx.xx/xx.xx |
 
 <sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca.
 
```
```diff
@@ -146,7 +145,7 @@ The Text Mining Unit (TeMU) from Barcelona Supercomputing Center ([bsc-temu@bsc.
 
 For further information, send an email to [aina@bsc.es](aina@bsc.es).
 
-
 
 Copyright by the Text Mining Unit at Barcelona Supercomputing Center.
 
```
````diff
@@ -67,7 +67,7 @@ This model is ready-to-use only for masked language modeling (MLM) to perform th
 ```python
 from pprint import pprint
 from transformers import pipeline
+pipe = pipeline("fill-mask", model="projecte-aina/distilroberta-base-ca")
 text = "El <mask> és el meu dia preferit de la setmana."
 pprint(pipe(text))
 ```
````
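For reference, a `fill-mask` pipeline like the one added above returns a list of candidate fills ranked by probability, each a dict with `score`, `token_str`, and `sequence` keys. The sketch below walks that output structure using made-up sample scores, not real predictions from this model:

```python
# Illustrative sample of a fill-mask pipeline's output: a list of dicts,
# one per candidate token, ranked by probability. Scores here are invented.
sample_output = [
    {"score": 0.42, "token_str": "dissabte",
     "sequence": "El dissabte és el meu dia preferit de la setmana."},
    {"score": 0.31, "token_str": "diumenge",
     "sequence": "El diumenge és el meu dia preferit de la setmana."},
    {"score": 0.08, "token_str": "divendres",
     "sequence": "El divendres és el meu dia preferit de la setmana."},
]

def best_fill(predictions):
    """Return the highest-scoring candidate token."""
    return max(predictions, key=lambda p: p["score"])["token_str"]

print(best_fill(sample_output))  # prints "dissabte" for this sample data
```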
```diff
@@ -90,22 +90,22 @@ So, in a “teacher-student learning” setup, a relatively small student model
 
 The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
 
+| Corpus                   | Size (GB) |
+|--------------------------|-----------|
+| Catalan Crawling         | 13.00     |
+| RacoCatalá               | 8.10      |
+| Catalan Oscar            | 4.00      |
+| CaWaC                    | 3.60      |
+| Cat. General Crawling    | 2.50      |
+| Wikipedia                | 1.10      |
+| DOGC                     | 0.78      |
+| Padicat                  | 0.63      |
+| ACN                      | 0.42      |
+| Nació Digital            | 0.42      |
+| Cat. Government Crawling | 0.24      |
+| Vilaweb                  | 0.06      |
+| Catalan Open Subtitles   | 0.02      |
+| Tweets                   | 0.02      |
 
 ## Evaluation
 
```
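The “teacher-student learning” setup referenced in the hunk above is commonly implemented by minimizing the KL divergence between temperature-softened teacher and student output distributions. This is a minimal illustrative sketch of that objective over raw logits, not the project's actual training code:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-softened probabilities: higher temperature flattens
    # the distribution, exposing the teacher's "dark knowledge".
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)

# Identical logits give (near-)zero loss; diverging logits increase it.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # ≈ 0.0
print(distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]))  # > 0
```

In practice this term is combined with the ordinary MLM cross-entropy loss on the student, but the KL term above is the core of the teacher-student signal.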
```diff
@@ -128,11 +128,10 @@ This model has been fine-tuned on the downstream tasks of the [Catalan Language
 
 This is how it compares to its teacher when fine-tuned on the aforementioned downstream tasks:
 
+| Model \ Task            | NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TEca (Acc.) | VilaQuAD (F1/EM) | ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
+|-------------------------|:--------:|:--------:|:-------------:|:------------:|:-----------:|:----------------:|:-----------------:|:-----------------:|:-----------------------------:|
+| RoBERTa-base-ca-v2      | 89.29    | 98.96    | 79.07         | 74.26        | 83.14       | 87.74/72.58      | 88.72/75.91       | 89.50/76.63       | 73.64/55.42                   |
+| DistilRoBERTa-base-ca-v2| xx.xx    | xx.xx    | xx.xx         | xx.xx        | xx.xx       | xx.xx/xx.xx      | xx.xx/xx.xx       | xx.xx/xx.xx       | xx.xx/xx.xx                   |
 
 <sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca.
 
```
```diff
@@ -146,7 +145,7 @@ The Text Mining Unit (TeMU) from Barcelona Supercomputing Center ([bsc-temu@bsc.
 
 For further information, send an email to [aina@bsc.es](aina@bsc.es).
 
+### Copyright
 
 Copyright by the Text Mining Unit at Barcelona Supercomputing Center.
 
```