ncsu-dk-lab/AutoDisProxyT-COLA

@@ -1,27 +1,40 @@
-XtremeDistilTransformers for Distilling Massive Neural Networks
-XtremeDistilTransformers is a distilled task-agnostic transformer model that leverages task transfer for learning a small universal model that can be applied to arbitrary tasks and languages as outlined in the paper XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation.
-We leverage task transfer combined with multi-task distillation techniques from the papers XtremeDistil: Multi-stage Distillation for Massive Multilingual Models and MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers with the following Github code.
-This l6-h384 checkpoint with 6 layers, 384 hidden size, 12 attention heads corresponds to 22 million parameters with 5.3x speedup over BERT-base.
-Other available checkpoints: xtremedistil-l6-h384-uncased and xtremedistil-l12-h384-uncased
 The following table shows the results on GLUE dev set and SQuAD-v2.
-Models	#Params	Speedup	MNLI	QNLI	QQP	RTE	SST	MRPC	SQUAD2	Avg
-BERT	109	1x	84.5	91.7	91.3	68.6	93.2	87.3	76.8	84.8
-DistilBERT	66	2x	82.2	89.2	88.5	59.9	91.3	87.5	70.7	81.3
-TinyBERT	66	2x	83.5	90.5	90.6	72.2	91.6	88.4	73.1	84.3
-MiniLM	66	2x	84.0	91.0	91.0	71.5	92.0	88.4	76.4	84.9
-MiniLM	22	5.3x	82.8	90.3	90.6	68.9	91.3	86.6	72.9	83.3
-XtremeDistil-l6-h256	13	8.7x	83.9	89.5	90.6	80.1	91.2	90.0	74.1	85.6
-XtremeDistil-l6-h384	22	5.3x	85.4	90.3	91.0	80.9	92.3	90.0	76.6	86.6
-XtremeDistil-l12-h384	33	2.7x	87.2	91.9	91.3	85.6	93.1	90.4	80.2	88.5
-Tested with tensorflow 2.3.1, transformers 4.1.1, torch 1.6.0
 If you use this checkpoint in your work, please cite:
 @misc{mukherjee2021xtremedistiltransformers,
       title={XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation},
       author={Subhabrata Mukherjee and Ahmed Hassan Awadallah and Jianfeng Gao},
@@ -29,4 +42,5 @@ If you use this checkpoint in your work, please cite:
       eprint={2106.04563},
       archivePrefix={arXiv},
       primaryClass={cs.CL}
-}

+---
+language: en
+thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
+tags:
+- text-classification
+license: mit
+---
+# XtremeDistilTransformers for Distilling Massive Neural Networks
+XtremeDistilTransformers is a distilled task-agnostic transformer model that leverages task transfer for learning a small universal model that can be applied to arbitrary tasks and languages as outlined in the paper [XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation](https://arxiv.org/abs/2106.04563).
+We combine task transfer with multi-task distillation techniques from the papers [XtremeDistil: Multi-stage Distillation for Massive Multilingual Models](https://www.aclweb.org/anthology/2020.acl-main.202.pdf) and [MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://proceedings.neurips.cc/paper/2020/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf); the code is available on [GitHub](https://github.com/microsoft/xtreme-distil-transformers).
+This l6-h384 checkpoint has **6** layers, a hidden size of **384**, and **12** attention heads, for **22 million** parameters and a **5.3x** speedup over BERT-base.
+Other available checkpoints: [xtremedistil-l6-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased) and [xtremedistil-l12-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l12-h384-uncased)
 The following table shows the results on GLUE dev set and SQuAD-v2.
+| Models                | #Params (M) | Speedup | MNLI | QNLI | QQP  | RTE  | SST  | MRPC | SQuAD-v2 | Avg  |
+|-----------------------|-------------|---------|------|------|------|------|------|------|----------|------|
+| BERT                  | 109         | 1x      | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8     | 84.8 |
+| DistilBERT            | 66          | 2x      | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7     | 81.3 |
+| TinyBERT              | 66          | 2x      | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1     | 84.3 |
+| MiniLM                | 66          | 2x      | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4     | 84.9 |
+| MiniLM                | 22          | 5.3x    | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9     | 83.3 |
+| XtremeDistil-l6-h256  | 13          | 8.7x    | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1     | 85.6 |
+| XtremeDistil-l6-h384  | 22          | 5.3x    | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6     | 86.6 |
+| XtremeDistil-l12-h384 | 33          | 2.7x    | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2     | 88.5 |
+Tested with `tensorflow 2.3.1`, `transformers 4.1.1`, and `torch 1.6.0`.
 If you use this checkpoint in your work, please cite:
+```bibtex
 @misc{mukherjee2021xtremedistiltransformers,
       title={XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation},
       author={Subhabrata Mukherjee and Ahmed Hassan Awadallah and Jianfeng Gao},
       eprint={2106.04563},
       archivePrefix={arXiv},
       primaryClass={cs.CL}
+}
+```
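
As a rough sanity check on the "22 million parameters" figure quoted in the README, the sketch below counts the weights of a BERT-style encoder from its shape. It is not taken from the XtremeDistil code. The `BertShape` class and `count_parameters` function are hypothetical names, and the vocabulary size (30522), maximum position count (512), token-type count (2), and the feed-forward width of 4 times the hidden size are assumptions borrowed from the standard BERT-uncased layout, not values stated in the README.

```python
from dataclasses import dataclass


@dataclass
class BertShape:
    """Sizes of a BERT-style encoder (hypothetical helper, for illustration)."""
    layers: int
    hidden: int
    intermediate: int          # feed-forward inner size, assumed 4 * hidden
    vocab: int = 30522         # assumed: BERT-uncased WordPiece vocabulary
    max_positions: int = 512   # assumed
    type_vocab: int = 2        # assumed


def count_parameters(s: BertShape) -> int:
    """Count weights and biases of the encoder, including the pooler."""
    layer_norm = 2 * s.hidden  # scale and bias
    # Word, position, and token-type embedding tables plus one LayerNorm.
    embeddings = (s.vocab + s.max_positions + s.type_vocab) * s.hidden + layer_norm
    # Query, key, value, and output projections plus one LayerNorm.
    attention = 4 * (s.hidden * s.hidden + s.hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden) plus one LayerNorm.
    feed_forward = (
        (s.hidden * s.intermediate + s.intermediate)
        + (s.intermediate * s.hidden + s.hidden)
        + layer_norm
    )
    pooler = s.hidden * s.hidden + s.hidden
    return embeddings + s.layers * (attention + feed_forward) + pooler


l6_h384 = count_parameters(BertShape(layers=6, hidden=384, intermediate=1536))
bert_base = count_parameters(BertShape(layers=12, hidden=768, intermediate=3072))
print(f"l6-h384:   {l6_h384 / 1e6:.1f}M parameters")
print(f"BERT-base: {bert_base / 1e6:.1f}M parameters")
```

Under these assumptions the count comes to about 22.7 million for l6-h384 and about 109.5 million for BERT-base, in line with the #Params column of the table. The number of attention heads does not appear in the formula because the heads split the hidden dimension rather than adding weights. About half of the l6-h384 total sits in the embedding tables.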