mapama247 committed · Commit 69aa348 · 1 Parent(s): d987a1e

Update README.md

Files changed (1):
  1. README.md +16 -20
README.md CHANGED
@@ -5,8 +5,6 @@ tags:
 - "catalan"
 - "masked-lm"
 - "distilroberta"
-- "CaText"
-- "Catalan Textual Corpus"
 widget:
 - text: "El Català és una llengua molt <mask>."
 - text: "Salvador Dalí va viure a <mask>."
@@ -54,15 +52,15 @@ widget:
 
 ## Model description
 
-This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2). It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108). The code for the distillation process can be found [HERE_TODO](https://github.com/TeMU-BSC/distillation).
+This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2). It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108), using the implementation of Knowledge Distillation from [the paper's repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).
 
-The model has 6 layers, 768 dimension and 12 heads, totalizing 82M parameters (compared to 125M parameters for RoBERTa-base). On average, it is twice as fast as its teacher.
+The model has 6 layers, 768-dimensional embeddings and 12 attention heads, totalizing 82M parameters (compared to the 125M parameters of standard RoBERTa-base models). On average, it is twice as fast as its teacher.
 
 We encourage users of this model to check out the [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) model card to learn more details about the teacher model, as well as the training and evaluation data.
 
 ## Intended uses and limitations
 
-This model is ready-to-use only for masked language modeling (MLM) to perform the Fill-Mask task. However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.
+This model is ready-to-use only for masked language modeling (MLM) to perform the Fill-Mask task. However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification or Named Entity Recognition.
 
 ## How to use
 
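For context on the Fill-Mask usage described above, here is a minimal sketch. It is not taken from the README's own "How to use" section; the model identifier is a placeholder assumption for this repository's actual id, and the prompt reuses a widget example from the YAML header:

```python
# Minimal fill-mask sketch; the model id is a placeholder assumption for this
# repository's actual identifier.
from transformers import AutoModelForMaskedLM, pipeline

model_id = "projecte-aina/distilroberta-base-ca-v2"  # placeholder id

# RoBERTa-style tokenizers use "<mask>" as the mask token, as in the widget
# examples from the YAML header above.
unmasker = pipeline("fill-mask", model=model_id)
for pred in unmasker("El Català és una llengua molt <mask>."):
    print(f"{pred['token_str']:>15}  {pred['score']:.3f}")

# Quick sanity check of the ~82M parameter figure quoted in the description.
model = AutoModelForMaskedLM.from_pretrained(model_id)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```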
@@ -82,42 +80,40 @@ At the time of submission, no measures have been taken to estimate the bias embe
 
 ### Training data
 
-The training corpus consists of several corpora gathered from web crawling and public corpora.
+The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
 
-| Corpus                  | Size in GB |
+| Corpus                  | Size (GB)  |
 |-------------------------|------------|
 | Catalan Crawling        | 13.00      |
-| Wikipedia               | 1.10       |
-| DOGC                    | 0.78       |
-| Catalan Open Subtitles  | 0.02       |
+| RacoCatalá              | 8.10       |
 | Catalan Oscar           | 4.00       |
 | CaWaC                   | 3.60       |
 | Cat. General Crawling   | 2.50       |
-| Cat. Government Crawling| 0.24       |
-| ACN                     | 0.42       |
+| Wikipedia               | 1.10       |
+| DOGC                    | 0.78       |
 | Padicat                 | 0.63       |
-| RacoCatalá              | 8.10       |
+| ACN                     | 0.42       |
 | Nació Digital           | 0.42       |
+| Cat. Government Crawling| 0.24       |
 | Vilaweb                 | 0.06       |
+| Catalan Open Subtitles  | 0.02       |
 | Tweets                  | 0.02       |
 
 ### Training procedure
 
 This model has been trained using a technique known as Knowledge Distillation, which is used to shrink networks to a reasonable size while minimizing the loss in performance.
 
-The main idea is to distill a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).
+It basically consists in distilling a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).
 
-So, in a “teacher-student learning” setup, a small student model is trained to mimic the behavior of a larger teacher model.
+So, in a “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model.
 
-As an example, the distilled version of BERT has 40% fewer parameters and runs 60% faster while preserving 97% of BERT's performance on the GLUE benchmark. This translates in lower inference time and the ability to run in commodity hardware.
+As a result, the student has lower inference time and the ability to run on commodity hardware.
 
 ## Evaluation
 
 ### Evaluation benchmark
 
-This model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB).
-
-Here are the train/dev/test splits of each dataset:
+This model has been fine-tuned on the downstream tasks of the [Catalan Language Understanding Evaluation benchmark (CLUB)](https://club.aina.bsc.es/), which includes the following datasets:
 
 | Dataset   | Task| Total   | Train  | Dev   | Test  |
 |:----------|:----|:--------|:-------|:------|:------|
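The "Training procedure" paragraphs in the hunk above describe DistilBERT-style teacher-student distillation in prose only. As an illustrative sketch under assumptions (this is not the project's training code, which is the linked transformers distillation example), the student is typically optimized with a temperature-softened KL term against the teacher's logits plus the usual MLM cross-entropy; the original DistilBERT recipe also adds a cosine loss over hidden states, omitted here for brevity:

```python
# Illustrative sketch of a DistilBERT-style distillation objective; NOT the
# repository's training code. Weights, temperature and names are assumptions.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha_ce=5.0, alpha_mlm=2.0):
    """Soft-target KL against the teacher plus the usual hard-target MLM loss."""
    # Soft targets: match the teacher's output distribution, softened by a
    # temperature so that low-probability tokens still carry signal.
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: standard masked-language-modeling cross-entropy, computed
    # only on masked positions (labels are -100 elsewhere, as in transformers).
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm
```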
@@ -132,7 +128,7 @@ Here are the train/dev/test splits of each dataset:
 
 ### Evaluation results
 
-This is how it compares to the teacher model when fine-tuned on the same downstream tasks:
+This is how it compares to its teacher when fine-tuned on the same downstream tasks:
 
 | Model \ Task| NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TEca (Acc.) | VilaQuAD (F1/EM)| ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
 | ------------|:-------------:| -----:|:------|:------|:-------|:------|:----|:----|:----|
 
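The evaluation above reports scores obtained by fine-tuning the student on CLUB downstream tasks. As a minimal, hypothetical illustration of such a fine-tuning run with the transformers Trainer (model and dataset identifiers, column names and hyperparameters below are placeholder assumptions, not the project's actual setup):

```python
# Rough fine-tuning sketch for a CLUB-style text-classification task.
# Identifiers, column names and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "projecte-aina/distilroberta-base-ca-v2"  # placeholder for this repo's id
dataset = load_dataset("path/to/club-task")          # e.g. a TeCla-like dataset with "text"/"label" columns

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=dataset["train"].features["label"].num_classes,
)

def tokenize(batch):
    # Truncate to the model's maximum sequence length.
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="club-finetune", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```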