Update README.md
README.md CHANGED
@@ -27,8 +27,8 @@ widget:
 - [How to use](#how-to-use)
 - [Limitations and bias](#limitations-and-bias)
 - [Training](#training)
-- [Training data](#training-data)
 - [Training procedure](#training-procedure)
+- [Training data](#training-data)
 - [Evaluation](#evaluation)
 - [Evaluation benchmark](#evaluation-benchmark)
 - [Evaluation results](#evaluation-results)
@@ -78,6 +78,16 @@ At the time of submission, no measures have been taken to estimate the bias embe
 
 ## Training
 
+### Training procedure
+
+This model has been trained using Knowledge Distillation, a technique used to shrink networks to a reasonable size while minimizing the loss in performance.
+
+It consists of distilling a large language model (the teacher) into a lighter, more energy-efficient, and production-friendly model (the student).
+
+In this “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model.
+
+As a result, the student has lower inference time and can run on commodity hardware.
+
 ### Training data
 
 The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
@@ -99,16 +109,6 @@ The training corpus consists of several corpora gathered from web crawling and p
 | Catalan Open Subtitles | 0.02 |
 | Tweets | 0.02 |
 
-### Training procedure
-
-This model has been trained using Knowledge Distillation, a technique used to shrink networks to a reasonable size while minimizing the loss in performance.
-
-It consists of distilling a large language model (the teacher) into a lighter, more energy-efficient, and production-friendly model (the student).
-
-In this “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model.
-
-As a result, the student has lower inference time and can run on commodity hardware.
-
 ## Evaluation
 
 ### Evaluation benchmark
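The moved “Training procedure” section describes the teacher-student objective only in prose. As a rough illustration of what such an objective typically looks like, here is a minimal PyTorch sketch of a distillation loss; the temperature `T`, loss weight `alpha`, and the toy tensor shapes are illustrative assumptions, not details taken from this model card.

```python
# Minimal sketch of a knowledge-distillation loss (illustrative only).
# The temperature T, loss weight alpha, and toy shapes below are
# assumptions for demonstration, not details from this model card.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term (match the teacher) with hard-label CE."""
    # Soften both distributions with temperature T; the T*T factor keeps
    # the gradient scale comparable to the plain cross-entropy term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage: only the student receives gradients; the teacher is frozen.
vocab = 32000                                  # hypothetical vocabulary size
teacher_logits = torch.randn(8, vocab)         # one batch of teacher outputs
student_logits = torch.randn(8, vocab, requires_grad=True)
labels = torch.randint(0, vocab, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

Minimizing the KL term pushes the student's softened output distribution toward the teacher's, which is the “trained to mimic the behavior of a larger teacher model” step the README describes.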