Update README.md

The model has 6 layers, 768 dimensions, and 12 heads, totaling 82M parameters (…).
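As a quick sanity check of those numbers, the sketch below rebuilds an equivalent configuration with the `transformers` library and counts parameters; the vocabulary size is an assumption (it is not stated here), so the total is only approximate.

```python
# Minimal sketch: rebuild the described architecture and count parameters.
# The vocabulary size below is an assumption (~50k BPE tokens), not a value
# taken from this README, so the resulting count is approximate.
from transformers import RobertaConfig, RobertaModel

config = RobertaConfig(
    vocab_size=50262,        # assumption: ~50k BPE vocabulary
    num_hidden_layers=6,     # 6 layers, as stated above
    hidden_size=768,         # 768-dimensional hidden states
    num_attention_heads=12,  # 12 attention heads
    intermediate_size=3072,  # standard 4x feed-forward width
)
model = RobertaModel(config)
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.0f}M parameters")  # ~82M with this vocabulary size
```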
We encourage users of this model to check out the [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) model card for more details about the training and evaluation data.

**About Knowledge Distillation**

Knowledge distillation is a technique used to shrink networks to a reasonable size while minimizing the loss in performance.

The main idea is to distill a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student). In this “teacher-student learning” setup, the small student model is trained to mimic the behavior of the larger teacher model.

As an example, the distilled version of BERT has 40% fewer parameters and runs 60% faster while preserving 97% of BERT's performance on the GLUE benchmark. This translates into lower inference time and the ability to run on commodity hardware.
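To make the teacher-student idea concrete, here is a minimal, hypothetical sketch of a soft-target distillation loss in PyTorch; the function name, loss weighting, and temperature are illustrative assumptions, not the exact recipe used to train this model.

```python
# Hypothetical sketch of a distillation objective: the student is trained on a
# mix of the usual hard-label loss and a KL term that pushes its predictions
# towards the teacher's temperature-softened output distribution.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    vocab_size = student_logits.size(-1)
    # Hard-label loss (standard masked-language-modeling cross-entropy).
    hard = F.cross_entropy(student_logits.view(-1, vocab_size),
                           labels.view(-1), ignore_index=-100)
    # Soft-label loss: match the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft
```

In such a setup only the student's parameters are updated; the teacher is run in inference mode to provide the target logits.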
## Intended uses and limitations
This model is ready-to-use only for masked language modeling (MLM) to perform the Fill-Mask task. However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.
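For a minimal usage sketch of the Fill-Mask task with the `transformers` pipeline, see below; for illustration it loads the teacher checkpoint referenced above, so substitute this model's own Hub id to use the distilled version (the example sentence is likewise just an illustration).

```python
# Minimal Fill-Mask sketch. The model id below is the teacher referenced above,
# used here only as a stand-in; replace it with this model's own Hub id.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="projecte-aina/roberta-base-ca-v2")
for prediction in fill_mask("La capital de Catalunya és <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```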