mapama247 committed on
Commit 15b1de2 · 1 Parent(s): c6b7ead

Update README.md

Files changed (1):
  1. README.md (+22 −23)
README.md CHANGED

@@ -67,7 +67,7 @@ This model is ready-to-use only for masked language modeling (MLM) to perform th
 ```python
 from pprint import pprint
 from transformers import pipeline
-pipe = pipeline("fill-mask", model="aina/DistilBERTa")
+pipe = pipeline("fill-mask", model="projecte-aina/distilroberta-base-ca")
 text = "El <mask> és el meu dia preferit de la setmana."
 pprint(pipe(text))
 ```
@@ -90,22 +90,22 @@ So, in a “teacher-student learning” setup, a relatively small student model
 
 The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
 
-| Corpus | Size (GB) |
-|-------------------------|------------|
-| Catalan Crawling | 13.00 |
-| RacoCatalá | 8.10 |
-| Catalan Oscar | 4.00 |
-| CaWaC | 3.60 |
-| Cat. General Crawling | 2.50 |
-| Wikipedia | 1.10 |
-| DOGC | 0.78 |
-| Padicat | 0.63 |
-| ACN | 0.42 |
-| Nació Digital | 0.42 |
-| Cat. Goverment Crawling | 0.24 |
-| Vilaweb | 0.06 |
-| Catalan Open Subtitles | 0.02 |
-| Tweets | 0.02 |
+| Corpus | Size (GB) |
+|--------------------------|------------|
+| Catalan Crawling | 13.00 |
+| RacoCatalá | 8.10 |
+| Catalan Oscar | 4.00 |
+| CaWaC | 3.60 |
+| Cat. General Crawling | 2.50 |
+| Wikipedia | 1.10 |
+| DOGC | 0.78 |
+| Padicat | 0.63 |
+| ACN | 0.42 |
+| Nació Digital | 0.42 |
+| Cat. Government Crawling | 0.24 |
+| Vilaweb | 0.06 |
+| Catalan Open Subtitles | 0.02 |
+| Tweets | 0.02 |
 
 ## Evaluation
 
@@ -128,11 +128,10 @@ This model has been fine-tuned on the downstream tasks of the [Catalan Language
 
 This is how it compares to its teacher when fine-tuned on the aforementioned downstream tasks:
 
-| Model \ Task| NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TEca (Acc.) | VilaQuAD (F1/EM)| ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
-| ------------|:-------------:| -----:|:------|:------|:-------|:------|:----|:----|:----|
-| RoBERTa-large-ca-v2 | 89.82 | 99.02 | 83.41 | 75.46 | 83.61 | 89.34/75.50 | 89.20/75.77 | 90.72/79.06 | 73.79/55.34 |
-| RoBERTa-base-ca-v2 | 89.29 | 98.96 | 79.07 | 74.26 | 83.14 | 87.74/72.58 | 88.72/75.91 | 89.50/76.63 | 73.64/55.42 |
-| DistilRoBERTa-base-ca-v2| xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx/xx.xx | xx.xx/xx.xx | xx.xx/xx.xx | xx.xx/xx.xx |
+| Model \ Task | NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TEca (Acc.) | VilaQuAD (F1/EM) | ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
+| ------------------------|:-------------:|:-------|:------------|:-----------|:------------|:----------------|:----------------|:----|:----|
+| RoBERTa-base-ca-v2 | 89.29 | 98.96 | 79.07 | 74.26 | 83.14 | 87.74/72.58 | 88.72/75.91 | 89.50/76.63 | 73.64/55.42 |
+| DistilRoBERTa-base-ca-v2| xx.xx | xx.xx | xx.xx | xx.xx | xx.xx | xx.xx/xx.xx | xx.xx/xx.xx | xx.xx/xx.xx | xx.xx/xx.xx |
 
 <sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca.
 
@@ -146,7 +145,7 @@ The Text Mining Unit (TeMU) from Barcelona Supercomputing Center ([bsc-temu@bsc.
 
 For further information, send an email to [aina@bsc.es](aina@bsc.es).
 
-## Copyright
+### Copyright
 
 Copyright by the Text Mining Unit at Barcelona Supercomputing Center.
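
The README excerpt above mentions a "teacher-student learning" (knowledge distillation) setup, under which the student is trained to match the teacher's softened output distribution. As an illustration only, not the project's actual training code, here is a minimal sketch of the standard distillation loss (temperature-scaled softmax plus KL divergence, scaled by T²); the function names and the temperature value are hypothetical:

```python
import math

def softmax(logits, temperature=1.0):
    # Scale logits by the temperature, then normalize into probabilities.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence between the teacher's softened distribution and the
    # student's, multiplied by T^2 so gradient magnitudes stay comparable
    # across temperatures (the usual soft-label distillation term).
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (temperature ** 2) * kl

# A student that matches the teacher exactly incurs zero loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # → 0.0
```

A higher temperature flattens both distributions, exposing the teacher's relative preferences among wrong answers, which is the signal a small student model distils from.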
151