The model has 6 layers, a hidden dimension of 768, and 12 heads, totaling 82M parameters (…).
We encourage users of this model to check out the [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) model card to learn more about the training and evaluation data.

**About Knowledge Distillation**

Knowledge distillation is a technique used to shrink a network to a reasonable size while minimizing the loss in performance.

The main idea is to distill a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).

In this "teacher-student learning" setup, a small student model is trained to mimic the behavior of the larger teacher model.

As an example, DistilBERT, the distilled version of BERT, has 40% fewer parameters and runs 60% faster while preserving 97% of BERT's performance on the GLUE benchmark. This translates into lower inference time and the ability to run on commodity hardware.
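The teacher-student setup described above is usually trained with a blended loss: a soft-target term that pushes the student's output distribution toward the teacher's (computed with a softmax temperature, following Hinton et al.'s formulation), plus the ordinary cross-entropy on the true labels. This is not the exact training code of this model; it is a minimal NumPy sketch, and the function names, `temperature`, and `alpha` values are illustrative assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # relative confidences across wrong classes ("dark knowledge").
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend of a soft-target KL term (teacher) and hard-label cross-entropy.

    Illustrative sketch only; `temperature` and `alpha` are hyperparameters
    assumed here, not values taken from this model's training recipe.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    soft_loss = (temperature ** 2) * kl.mean()
    # Standard cross-entropy of the student against the gold labels.
    p_hard = softmax(student_logits)
    hard_loss = -np.mean(np.log(p_hard[np.arange(len(labels)), labels]))
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

A student whose logits already match the teacher's contributes zero to the soft term, so the loss smoothly interpolates between "copy the teacher" and "fit the labels" as `alpha` moves between 1 and 0.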