Update README.md
README.md CHANGED
@@ -27,8 +27,8 @@ widget:
 - [How to use](#how-to-use)
 - [Limitations and bias](#limitations-and-bias)
 - [Training](#training)
-- [Training data](#training-data)
 - [Training procedure](#training-procedure)
+- [Training data](#training-data)
 - [Evaluation](#evaluation)
 - [Evaluation benchmark](#evaluation-benchmark)
 - [Evaluation results](#evaluation-results)
@@ -78,6 +78,16 @@ At the time of submission, no measures have been taken to estimate the bias embe
 
 ## Training
 
+### Training procedure
+
+This model has been trained using Knowledge Distillation, a technique used to shrink networks to a reasonable size while minimizing the loss in performance.
+
+It consists of distilling a large language model (the teacher) into a lighter, more energy-efficient, and production-friendly model (the student).
+
+In this “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model.
+
+As a result, the student has lower inference time and can run on commodity hardware.
+
 ### Training data
 
 The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
@@ -99,16 +109,6 @@ The training corpus consists of several corpora gathered from web crawling and p
 | Catalan Open Subtitles | 0.02 |
 | Tweets | 0.02 |
 
-### Training procedure
-
-This model has been trained using Knowledge Distillation, a technique used to shrink networks to a reasonable size while minimizing the loss in performance.
-
-It consists of distilling a large language model (the teacher) into a lighter, more energy-efficient, and production-friendly model (the student).
-
-In this “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model.
-
-As a result, the student has lower inference time and can run on commodity hardware.
-
 ## Evaluation
 
 ### Evaluation benchmark
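The moved “Training procedure” section describes the teacher-student objective only in prose. As a rough illustration of what such an objective typically looks like, here is a minimal PyTorch sketch of a distillation loss; the temperature `T`, loss weight `alpha`, and the toy tensor shapes are illustrative assumptions, not details taken from this model card.

```python
# Minimal sketch of a knowledge-distillation loss (illustrative only).
# The temperature T, loss weight alpha, and toy shapes below are
# assumptions for demonstration, not details from this model card.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term (match the teacher) with hard-label CE."""
    # Soften both distributions with temperature T; the T*T factor keeps
    # the gradient scale comparable to the plain cross-entropy term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage: only the student receives gradients; the teacher is frozen.
vocab = 32000                                  # hypothetical vocabulary size
teacher_logits = torch.randn(8, vocab)         # one batch of teacher outputs
student_logits = torch.randn(8, vocab, requires_grad=True)
labels = torch.randint(0, vocab, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

Minimizing the KL term pushes the student's softened output distribution toward the teacher's, which is the “trained to mimic the behavior of a larger teacher model” step the README describes.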