Update README.md
tags:
- "catalan"
- "masked-lm"
- "distilroberta"
widget:
- text: "El Català és una llengua molt <mask>."
- text: "Salvador Dalí va viure a <mask>."

## Model description

This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2). It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108), using the implementation of Knowledge Distillation from [the paper's repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).

The model has 6 layers, 768-dimensional embeddings and 12 attention heads, for a total of 82M parameters (compared to the 125M parameters of standard RoBERTa-base models). On average, it is twice as fast as its teacher.
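
As a quick sanity check of these figures, the configuration and parameter count can be inspected with the `transformers` library. This is a minimal sketch; the model id below is an assumption and may need to be adjusted to this repository's actual id.

```python
from transformers import AutoConfig, AutoModelForMaskedLM

# Assumed model id, for illustration only; replace with this repository's actual id.
model_id = "projecte-aina/distilroberta-base-ca-v2"

config = AutoConfig.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Expected: 6 hidden layers, a hidden size of 768 and 12 attention heads.
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)

# Total number of parameters (roughly 82M for this distilled model).
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```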

We encourage users of this model to check out the [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) model card to learn more details about the teacher model, as well as the training and evaluation data.

## Intended uses and limitations

This model is ready-to-use only for masked language modeling (MLM) to perform the Fill-Mask task. However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.

## How to use
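A minimal Fill-Mask sketch with the `transformers` pipeline, using one of the widget examples above. The model id is an assumption and may need to be adjusted to this repository's actual id.

```python
from transformers import pipeline

# Assumed model id, for illustration only; replace with this repository's actual id.
unmasker = pipeline("fill-mask", model="projecte-aina/distilroberta-base-ca-v2")

# Prints the top candidate tokens for the <mask> position together with their scores.
for prediction in unmasker("El Català és una llengua molt <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```
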
### Training data
The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:

| Corpus                   | Size (GB) |
|--------------------------|-----------|
| Catalan Crawling         | 13.00     |
| RacoCatalá               | 8.10      |
| Catalan Oscar            | 4.00      |
| CaWaC                    | 3.60      |
| Cat. General Crawling    | 2.50      |
| Wikipedia                | 1.10      |
| DOGC                     | 0.78      |
| Padicat                  | 0.63      |
| ACN                      | 0.42      |
| Nació Digital            | 0.42      |
| Cat. Government Crawling | 0.24      |
| Vilaweb                  | 0.06      |
| Catalan Open Subtitles   | 0.02      |
| Tweets                   | 0.02      |

### Training procedure
This model has been trained using a technique known as Knowledge Distillation, which is used to shrink networks to a reasonable size while minimizing the loss in performance.

It basically consists of distilling a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).
So, in a “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model.

As a result, the student has lower inference time and the ability to run on commodity hardware.
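
The exact training code is the DistilBERT distillation implementation linked above. Purely as an illustration of the idea, a teacher-student objective of this kind typically combines a temperature-softened KL-divergence term on the teacher's output distribution with the usual hard-label MLM loss, roughly as in the sketch below (function name and hyperparameter values are illustrative, not the ones used for this model).

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Illustrative teacher-student loss: soft-target KL term plus the standard MLM cross-entropy."""
    # Soften both output distributions with the temperature and match them via KL divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_term = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (temperature ** 2)

    # Standard masked-language-modeling cross-entropy on the hard labels (-100 = ignored positions).
    mlm_term = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    return alpha * kd_term + (1.0 - alpha) * mlm_term
```
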
## Evaluation
### Evaluation benchmark
This model has been fine-tuned on the downstream tasks of the [Catalan Language Understanding Evaluation benchmark (CLUB)](https://club.aina.bsc.es/), which includes the following datasets:

| Dataset | Task | Total | Train | Dev | Test |
|:--------|:-----|:------|:------|:----|:-----|

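
As an illustration of how such fine-tuning is typically set up with the `transformers` Trainer, here is a sketch for a text classification task; the dataset id, column names, split names, and hyperparameters are assumptions rather than the exact configuration used to produce the results below.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed ids, for illustration only.
model_id = "projecte-aina/distilroberta-base-ca-v2"  # this model (id assumed)
dataset = load_dataset("projecte-aina/tecla")        # TeCla dataset (id assumed)

tokenizer = AutoTokenizer.from_pretrained(model_id)

def tokenize(batch):
    # Assumes a "text" column; truncate to the model's maximum sequence length.
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)
num_labels = tokenized["train"].features["label"].num_classes  # assumes a ClassLabel "label" column

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=num_labels)

args = TrainingArguments(output_dir="distilroberta-ca-tecla",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         evaluation_strategy="epoch")

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=tokenized["train"], eval_dataset=tokenized["validation"])
trainer.train()
```
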
### Evaluation results
This is how it compares to its teacher when fine-tuned on the same downstream tasks:
| Model \ Task | NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TEca (Acc.) | VilaQuAD (F1/EM) | ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
| ------------|:-------------:| -----:|:------|:------|:-------|:------|:----|:----|:----|
|