# XtremeDistilTransformers for Distilling Massive Neural Networks
XtremeDistilTransformers is a distilled task-agnostic transformer model that leverages task transfer to learn a small universal model that can be applied to arbitrary tasks and languages, as outlined in the paper [XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation](https://arxiv.org/abs/2106.04563).

We leverage task transfer combined with multi-task distillation techniques from the papers [XtremeDistil: Multi-stage Distillation for Massive Multilingual Models](https://arxiv.org/abs/2004.05686) and [MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957), with the accompanying [GitHub code](https://github.com/microsoft/xtreme-distil-transformers).
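To make the idea concrete, here is a schematic sketch of a generic two-part distillation objective (soft-label KL on logits plus hidden-state matching through a learned projection) in the spirit of the cited papers. It is an illustration only, not the released XtremeDistil implementation; all names and weights are assumptions.

```python
# Schematic sketch of a generic distillation objective: NOT the released
# XtremeDistil code. Combines soft-label distillation on logits with
# hidden-state matching through a learned projection.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      projection, temperature=2.0, alpha=0.5):
    # Soft cross-entropy between temperature-scaled teacher and student
    # distributions (scaled by T^2, as is standard for soft-label KD).
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    logit_loss = F.kl_div(log_probs, soft_targets,
                          reduction="batchmean") * temperature ** 2

    # Match student hidden states to the wider teacher hidden states through
    # a learned linear projection, since the hidden sizes differ.
    hidden_loss = F.mse_loss(projection(student_hidden), teacher_hidden)

    return alpha * logit_loss + (1 - alpha) * hidden_loss
```

For a 384-wide student distilled from a 768-wide teacher, `projection` would be something like `torch.nn.Linear(384, 768)`, trained jointly with the student.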
This l6-h384 checkpoint, with 6 layers, a hidden size of 384, and 12 attention heads, has 22 million parameters and a 5.3x speedup over BERT-base.
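As a minimal usage sketch, the checkpoint loads like any BERT-style encoder with Hugging Face Transformers; the model id below is an assumption about where the checkpoint is hosted on the Hub.

```python
# Minimal sketch: load the distilled checkpoint with Hugging Face Transformers.
# The model id is an assumption; substitute the actual Hub id if it differs.
from transformers import AutoTokenizer, AutoModel

model_id = "microsoft/xtremedistil-l6-h384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("XtremeDistil is a small universal model.", return_tensors="pt")
outputs = model(**inputs)
# Hidden size matches the l6-h384 configuration: (batch, seq_len, 384)
print(outputs.last_hidden_state.shape)
```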
Other available checkpoints: `xtremedistil-l6-h256-uncased` and `xtremedistil-l12-h384-uncased`
The following table shows the results on the GLUE dev set and SQuAD-v2.
| Models | #Params (M) | Speedup | MNLI | QNLI | QQP | RTE | SST | MRPC | SQuAD2 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 109 | 1x | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8 | 84.8 |
| DistilBERT | 66 | 2x | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7 | 81.3 |
| TinyBERT | 66 | 2x | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1 | 84.3 |
| MiniLM | 66 | 2x | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4 | 84.9 |
| MiniLM | 22 | 5.3x | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9 | 83.3 |
| XtremeDistil-l6-h256 | 13 | 8.7x | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1 | 85.6 |
| XtremeDistil-l6-h384 | 22 | 5.3x | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6 | 86.6 |
| XtremeDistil-l12-h384 | 33 | 2.7x | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2 | 88.5 |
Tested with TensorFlow 2.3.1, Transformers 4.1.1, and PyTorch 1.6.0.
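For reference, below is a hedged sketch of fine-tuning this checkpoint on one GLUE task (SST-2) with the Hugging Face `Trainer`. The model id and hyperparameters are illustrative assumptions, not the settings used to produce the table above.

```python
# Hedged sketch: fine-tune the distilled checkpoint on GLUE SST-2.
# Model id and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "microsoft/xtremedistil-l6-h384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tokenize the single-sentence SST-2 inputs; Trainer pads each batch
# dynamically when a tokenizer is supplied.
dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True), batched=True
)

args = TrainingArguments(
    output_dir="xtremedistil-sst2",
    learning_rate=3e-5,              # assumption: a typical fine-tuning rate
    per_device_train_batch_size=32,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```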
If you use this checkpoint in your work, please cite:
```bibtex
@misc{mukherjee2021xtremedistiltransformers,
      title={XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation},
      author={Subhabrata Mukherjee and Ahmed Hassan Awadallah and Jianfeng Gao},
      year={2021},
      eprint={2106.04563},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```