mapama247 committed · Commit 69aa348 · 1 Parent(s): d987a1e

Update README.md

Files changed (1):
  1. README.md +16 -20
README.md CHANGED
@@ -5,8 +5,6 @@ tags:
 - "catalan"
 - "masked-lm"
 - "distilroberta"
-- "CaText"
-- "Catalan Textual Corpus"
 widget:
 - text: "El Català és una llengua molt <mask>."
 - text: "Salvador Dalí va viure a <mask>."
@@ -54,15 +52,15 @@ widget:
 
 ## Model description
 
-This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2). It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108). The code for the distillation process can be found [HERE_TODO](https://github.com/TeMU-BSC/distillation).
+This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2). It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108), using the implementation of Knowledge Distillation from [the paper's repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).
 
-The model has 6 layers, 768 dimension and 12 heads, totalizing 82M parameters (compared to 125M parameters for RoBERTa-base). On average, it is twice as fast as its teacher.
+The model has 6 layers, 768-dimensional embeddings and 12 attention heads, totalizing 82M parameters (compared to the 125M parameters of standard RoBERTa-base models). On average, it is twice as fast as its teacher.
 
 We encourage users of this model to check out the [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) model card to learn more details about the teacher model, as well as the training and evaluation data.
 
 ## Intended uses and limitations
 
-This model is ready-to-use only for masked language modeling (MLM) to perform the Fill-Mask task. However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.
+This model is ready-to-use only for masked language modeling (MLM) to perform the Fill-Mask task. However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification or Named Entity Recognition.
 
 ## How to use
 
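For context on the Fill-Mask usage described above, here is a minimal sketch. It is not taken from the README's own "How to use" section; the model identifier is a placeholder assumption for this repository's actual id, and the prompt reuses a widget example from the YAML header:

```python
# Minimal fill-mask sketch; the model id is a placeholder assumption for this
# repository's actual identifier.
from transformers import AutoModelForMaskedLM, pipeline

model_id = "projecte-aina/distilroberta-base-ca-v2"  # placeholder id

# RoBERTa-style tokenizers use "<mask>" as the mask token, as in the widget
# examples from the YAML header above.
unmasker = pipeline("fill-mask", model=model_id)
for pred in unmasker("El Català és una llengua molt <mask>."):
    print(f"{pred['token_str']:>15}  {pred['score']:.3f}")

# Quick sanity check of the ~82M parameter figure quoted in the description.
model = AutoModelForMaskedLM.from_pretrained(model_id)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```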
@@ -82,42 +80,40 @@ At the time of submission, no measures have been taken to estimate the bias embe
 
 ### Training data
 
-The training corpus consists of several corpora gathered from web crawling and public corpora.
+The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
 
-| Corpus                  | Size in GB |
+| Corpus                  | Size (GB)  |
 |-------------------------|------------|
 | Catalan Crawling        | 13.00      |
-| Wikipedia               | 1.10       |
-| DOGC                    | 0.78       |
-| Catalan Open Subtitles  | 0.02       |
+| RacoCatalá              | 8.10       |
 | Catalan Oscar           | 4.00       |
 | CaWaC                   | 3.60       |
 | Cat. General Crawling   | 2.50       |
-| Cat. Government Crawling| 0.24       |
-| ACN                     | 0.42       |
+| Wikipedia               | 1.10       |
+| DOGC                    | 0.78       |
 | Padicat                 | 0.63       |
-| RacoCatalá              | 8.10       |
+| ACN                     | 0.42       |
 | Nació Digital           | 0.42       |
+| Cat. Government Crawling| 0.24       |
 | Vilaweb                 | 0.06       |
+| Catalan Open Subtitles  | 0.02       |
 | Tweets                  | 0.02       |
 
 ### Training procedure
 
 This model has been trained using a technique known as Knowledge Distillation, which is used to shrink networks to a reasonable size while minimizing the loss in performance.
 
-The main idea is to distill a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).
+It basically consists in distilling a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).
 
-So, in a “teacher-student learning” setup, a small student model is trained to mimic the behavior of a larger teacher model.
+So, in a “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model.
 
-As an example, the distilled version of BERT has 40% fewer parameters and runs 60% faster while preserving 97% of BERT's performance on the GLUE benchmark. This translates in lower inference time and the ability to run in commodity hardware.
+As a result, the student has lower inference time and the ability to run on commodity hardware.
 
 ## Evaluation
 
 ### Evaluation benchmark
 
-This model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB).
-
-Here are the train/dev/test splits of each dataset:
+This model has been fine-tuned on the downstream tasks of the [Catalan Language Understanding Evaluation benchmark (CLUB)](https://club.aina.bsc.es/), which includes the following datasets:
 
 | Dataset   | Task| Total   | Train  | Dev   | Test  |
 |:----------|:----|:--------|:-------|:------|:------|
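The "Training procedure" paragraphs in the hunk above describe DistilBERT-style teacher-student distillation in prose only. As an illustrative sketch under assumptions (this is not the project's training code, which is the linked transformers distillation example), the student is typically optimized with a temperature-softened KL term against the teacher's logits plus the usual MLM cross-entropy; the original DistilBERT recipe also adds a cosine loss over hidden states, omitted here for brevity:

```python
# Illustrative sketch of a DistilBERT-style distillation objective; NOT the
# repository's training code. Weights, temperature and names are assumptions.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha_ce=5.0, alpha_mlm=2.0):
    """Soft-target KL against the teacher plus the usual hard-target MLM loss."""
    # Soft targets: match the teacher's output distribution, softened by a
    # temperature so that low-probability tokens still carry signal.
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: standard masked-language-modeling cross-entropy, computed
    # only on masked positions (labels are -100 elsewhere, as in transformers).
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm
```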
@@ -132,7 +128,7 @@ Here are the train/dev/test splits of each dataset:
 
 ### Evaluation results
 
-This is how it compares to the teacher model when fine-tuned on the same downstream tasks:
+This is how it compares to its teacher when fine-tuned on the same downstream tasks:
 
 | Model \ Task| NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TEca (Acc.) | VilaQuAD (F1/EM)| ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
 | ------------|:-------------:| -----:|:------|:------|:-------|:------|:----|:----|:----|
 
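The evaluation above reports scores obtained by fine-tuning the student on CLUB downstream tasks. As a minimal, hypothetical illustration of such a fine-tuning run with the transformers Trainer (model and dataset identifiers, column names and hyperparameters below are placeholder assumptions, not the project's actual setup):

```python
# Rough fine-tuning sketch for a CLUB-style text-classification task.
# Identifiers, column names and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "projecte-aina/distilroberta-base-ca-v2"  # placeholder for this repo's id
dataset = load_dataset("path/to/club-task")          # e.g. a TeCla-like dataset with "text"/"label" columns

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=dataset["train"].features["label"].num_classes,
)

def tokenize(batch):
    # Truncate to the model's maximum sequence length.
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="club-finetune", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```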