mapama247 committed
Commit 938eb16 · 1 Parent(s): 7220a2f

Update README.md

Files changed (1): README.md (+4, -1)
README.md CHANGED
@@ -57,9 +57,12 @@ The model has 6 layers, 768 dimension and 12 heads, totalizing 82M parameters (c
  We encourage users of this model to check out the [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) model card to learn more details about the training and evaluation data.
 
  **About Knowledge Distiallation**
+
  It is a technique used to shrink networks to a reasonable size while minimizing the loss in performance.
 
- The main idea is to distill a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student). So, in a “teacher-student learning” setup, a small student model is trained to mimic the behavior of a larger teacher model.
+ The main idea is to distill a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).
+
+ So, in a “teacher-student learning” setup, a small student model is trained to mimic the behavior of a larger teacher model.
 
  As an example, the distilled version of BERT has 40% fewer parameters and runs 60% faster while preserving 97% of BERT's performance on the GLUE benchmark. This translates in lower inference time and the ability to run in commodity hardware.
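The "teacher-student learning" setup the README describes is usually trained with a loss that mixes soft teacher targets and the ordinary hard-label cross-entropy. The sketch below is a minimal illustration of that idea in plain numpy; the temperature `T`, the mixing weight `alpha`, and all function names are illustrative assumptions, not part of this model card or its training code.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Illustrative KD objective: alpha * soft-target KL + (1 - alpha) * hard CE.

    T and alpha are hypothetical hyperparameters for this sketch.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) on softened distributions, scaled by T^2
    # so gradient magnitudes stay comparable across temperatures
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                             - np.log(p_student + 1e-12)), axis=-1)
    # standard cross-entropy against the gold labels (T = 1)
    p_hard = softmax(student_logits)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12)
    return alpha * (T ** 2) * kl.mean() + (1 - alpha) * ce.mean()
```

The student is pushed toward the teacher's full output distribution (which carries more signal than one-hot labels alone), which is what lets a 6-layer student recover most of the larger teacher's accuracy.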