mapama247 committed
Commit 90f4cae · 1 Parent(s): 0c46b1d

Update README.md

Files changed (1)
  1. README.md +11 -11
README.md CHANGED
@@ -27,8 +27,8 @@ widget:
  - [How to use](#how-to-use)
  - [Limitations and bias](#limitations-and-bias)
  - [Training](#training)
- - [Training data](#training-data)
  - [Training procedure](#training-procedure)
+ - [Training data](#training-data)
  - [Evaluation](#evaluation)
  - [Evaluation benchmark](#evaluation-benchmark)
  - [Evaluation results](#evaluation-results)
@@ -78,6 +78,16 @@ At the time of submission, no measures have been taken to estimate the bias embe

  ## Training

+ ### Training procedure
+
+ This model has been trained using a technique known as Knowledge Distillation, which is used to shrink networks to a reasonable size while minimizing the loss in performance.
+
+ It basically consists in distilling a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).
+
+ So, in a “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model.
+
+ As a result, the student has lower inference time and the ability to run in commodity hardware.
+
  ### Training data

  The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
@@ -99,16 +109,6 @@ The training corpus consists of several corpora gathered from web crawling and p
  | Catalan Open Subtitles | 0.02 |
  | Tweets | 0.02 |

- ### Training procedure
-
- This model has been trained using a technique known as Knowledge Distillation, which is used to shrink networks to a reasonable size while minimizing the loss in performance.
-
- It basically consists in distilling a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).
-
- So, in a “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model.
-
- As a result, the student has lower inference time and the ability to run in commodity hardware.
-
  ## Evaluation

  ### Evaluation benchmark
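
The "Training procedure" paragraphs moved by this commit describe knowledge distillation only in prose, and the README ships no training code. As a purely illustrative sketch, not the actual recipe used for this model, the standard soft-target distillation loss can be written in a few lines of PyTorch; the temperature and mixing weight below are placeholder values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend the usual hard-label loss with a soft-target loss that
    pushes the student toward the teacher's output distribution.
    Hyperparameters here are illustrative, not the model's actual settings."""
    # Soft targets: KL divergence between temperature-softened distributions.
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy example: random logits for a batch of 4 examples and 10 classes.
torch.manual_seed(0)
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)   # teacher outputs are fixed
labels = torch.randint(0, 10, (4,))

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()                        # gradients flow only to the student
print(f"distillation loss: {loss.item():.4f}")
```

In a real teacher-student setup the teacher logits would come from the full-size teacher model run in inference mode, while only the student's parameters receive gradient updates.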