Jinawei committed on
Commit 1fbf003 · 1 Parent(s): 56edca2

Update README.md

Files changed (1)
  1. README.md +30 -16
README.md CHANGED
@@ -1,27 +1,40 @@
- XtremeDistilTransformers for Distilling Massive Neural Networks
- XtremeDistilTransformers is a distilled task-agnostic transformer model that leverages task transfer for learning a small universal model that can be applied to arbitrary tasks and languages as outlined in the paper XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation.
+ ---
+ language: en
+ thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
+ tags:
+ - text-classification
+ license: mit
+ ---
 
- We leverage task transfer combined with multi-task distillation techniques from the papers XtremeDistil: Multi-stage Distillation for Massive Multilingual Models and MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers with the following Github code.
+ # XtremeDistilTransformers for Distilling Massive Neural Networks
 
- This l6-h384 checkpoint with 6 layers, 384 hidden size, 12 attention heads corresponds to 22 million parameters with 5.3x speedup over BERT-base.
+ XtremeDistilTransformers is a distilled task-agnostic transformer model that leverages task transfer for learning a small universal model that can be applied to arbitrary tasks and languages as outlined in the paper [XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation](https://arxiv.org/abs/2106.04563).
 
- Other available checkpoints: xtremedistil-l6-h384-uncased and xtremedistil-l12-h384-uncased
+ We leverage task transfer combined with multi-task distillation techniques from the papers [XtremeDistil: Multi-stage Distillation for Massive Multilingual Models](https://www.aclweb.org/anthology/2020.acl-main.202.pdf) and [MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://proceedings.neurips.cc/paper/2020/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) with the following [Github code](https://github.com/microsoft/xtreme-distil-transformers).
+
+
+ This l6-h384 checkpoint with **6** layers, **384** hidden size, **12** attention heads corresponds to **22 million** parameters with **5.3x** speedup over BERT-base.
+
+ Other available checkpoints: [xtremedistil-l6-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased) and [xtremedistil-l12-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l12-h384-uncased)
 
  The following table shows the results on GLUE dev set and SQuAD-v2.
 
- Models #Params Speedup MNLI QNLI QQP RTE SST MRPC SQUAD2 Avg
- BERT 109 1x 84.5 91.7 91.3 68.6 93.2 87.3 76.8 84.8
- DistilBERT 66 2x 82.2 89.2 88.5 59.9 91.3 87.5 70.7 81.3
- TinyBERT 66 2x 83.5 90.5 90.6 72.2 91.6 88.4 73.1 84.3
- MiniLM 66 2x 84.0 91.0 91.0 71.5 92.0 88.4 76.4 84.9
- MiniLM 22 5.3x 82.8 90.3 90.6 68.9 91.3 86.6 72.9 83.3
- XtremeDistil-l6-h256 13 8.7x 83.9 89.5 90.6 80.1 91.2 90.0 74.1 85.6
- XtremeDistil-l6-h384 22 5.3x 85.4 90.3 91.0 80.9 92.3 90.0 76.6 86.6
- XtremeDistil-l12-h384 33 2.7x 87.2 91.9 91.3 85.6 93.1 90.4 80.2 88.5
- Tested with tensorflow 2.3.1, transformers 4.1.1, torch 1.6.0
+ | Models | #Params | Speedup | MNLI | QNLI | QQP | RTE | SST | MRPC | SQUAD2 | Avg |
+ |----------------|--------|---------|------|------|------|------|------|------|--------|-------|
+ | BERT | 109 | 1x | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8 | 84.8 |
+ | DistilBERT | 66 | 2x | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7 | 81.3 |
+ | TinyBERT | 66 | 2x | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1 | 84.3 |
+ | MiniLM | 66 | 2x | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4 | 84.9 |
+ | MiniLM | 22 | 5.3x | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9 | 83.3 |
+ | XtremeDistil-l6-h256 | 13 | 8.7x | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1 | 85.6 |
+ | XtremeDistil-l6-h384 | 22 | 5.3x | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6 | 86.6 |
+ | XtremeDistil-l12-h384 | 33 | 2.7x | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2 | 88.5 |
+
+ Tested with `tensorflow 2.3.1, transformers 4.1.1, torch 1.6.0`
 
  If you use this checkpoint in your work, please cite:
 
+ ``` latex
  @misc{mukherjee2021xtremedistiltransformers,
  title={XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation},
  author={Subhabrata Mukherjee and Ahmed Hassan Awadallah and Jianfeng Gao},
@@ -29,4 +42,5 @@ If you use this checkpoint in your work, please cite:
  eprint={2106.04563},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
- }
+ }
+ ```
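
The README text in this diff states that 6 layers with a hidden size of 384 corresponds to 22 million parameters, and the table lists BERT at 109 million. As a rough sanity check of those figures, the sketch below counts weights and biases for a standard BERT-style encoder. `count_bert_params` is a hypothetical helper written for this note, and several values are assumptions that the README does not state: a vocabulary of 30522 tokens, 512 positions, 2 segment types, a feed-forward width of 4x the hidden size, and a pooler layer on top.

```python
def count_bert_params(num_layers, hidden, intermediate=None,
                      vocab_size=30522, max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT-style encoder, including the pooler.

    All defaults are assumptions borrowed from the usual BERT-uncased setup;
    they are not taken from the README.
    """
    if intermediate is None:
        intermediate = 4 * hidden  # assumption: usual 4x feed-forward width

    layer_norm = 2 * hidden  # one scale and one shift vector

    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections, plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward: expand to `intermediate`, project back, plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)

    # Pooler: a single dense layer applied to the first token's output.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * (attention + feed_forward) + pooler


if __name__ == "__main__":
    for name, layers, hidden in [("l6-h384", 6, 384), ("BERT-base", 12, 768)]:
        total = count_bert_params(layers, hidden)
        print(f"{name}: {total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the count comes to 22,713,216 for the l6-h384 shape and 109,482,240 for the BERT-base shape, in line with the 22 and 109 in the `#Params` column. The number of attention heads does not appear in the function because the heads split the same projection matrices and therefore do not change the total.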