microsoft/deberta-v2-xxlarge

@@ -7,11 +7,11 @@ license: mit
 ## DeBERTa: Decoding-enhanced BERT with Disentangled Attention
-[DeBERTa](https://arxiv.org/abs/2006.03654) improves the BERT and RoBERTa models using disentangled attention and enhanced mask decoder. With those two improvements, DeBERTa out perform RoBERTa on a majority of NLU tasks with 80GB training data.
 Please check the [official repository](https://github.com/microsoft/DeBERTa) for more details and updates.
-This is the DeBERTa V2 xxlarge model with 48 layers, 1536 hidden size. Total parameters 1.5B. It's trained with 160GB data.
 ### Fine-tuning on NLU tasks
@@ -46,20 +46,20 @@ export TASK_NAME=mnli
 output_dir="ds_results"
 num_gpus=8
 batch_size=8
-python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
-  run_glue.py \
-  --model_name_or_path microsoft/deberta-v2-xxlarge \
-  --task_name $TASK_NAME \
-  --do_train \
-  --do_eval \
-  --max_seq_length 256 \
-  --per_device_train_batch_size ${batch_size} \
-  --learning_rate 3e-6 \
-  --num_train_epochs 3 \
-  --output_dir $output_dir \
-  --overwrite_output_dir \
-  --logging_steps 10 \
-  --logging_dir $output_dir \
   --deepspeed ds_config.json
 ```
@@ -67,8 +67,8 @@ You can also run with `--sharded_ddp`
 ```bash
 cd transformers/examples/text-classification/
 export TASK_NAME=mnli
-python -m torch.distributed.launch --nproc_per_node=8 run_glue.py   --model_name_or_path microsoft/deberta-v2-xxlarge   \
---task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 256   --per_device_train_batch_size 8   \
 --learning_rate 3e-6   --num_train_epochs 3   --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --sharded_ddp --fp16
 ```

 ## DeBERTa: Decoding-enhanced BERT with Disentangled Attention
+[DeBERTa](https://arxiv.org/abs/2006.03654) improves the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. It outperforms BERT and RoBERTa on a majority of NLU tasks with 80GB of training data.
 Please check the [official repository](https://github.com/microsoft/DeBERTa) for more details and updates.
+This is the DeBERTa V2 xxlarge model with 48 layers and a hidden size of 1536. It has 1.5B parameters in total and was trained on 160GB of raw data.
 ### Fine-tuning on NLU tasks
 output_dir="ds_results"
 num_gpus=8
 batch_size=8
+python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
+  run_glue.py \
+  --model_name_or_path microsoft/deberta-v2-xxlarge \
+  --task_name $TASK_NAME \
+  --do_train \
+  --do_eval \
+  --max_seq_length 256 \
+  --per_device_train_batch_size ${batch_size} \
+  --learning_rate 3e-6 \
+  --num_train_epochs 3 \
+  --output_dir $output_dir \
+  --overwrite_output_dir \
+  --logging_steps 10 \
+  --logging_dir $output_dir \
   --deepspeed ds_config.json
 ```
 ```bash
 cd transformers/examples/text-classification/
 export TASK_NAME=mnli
+python -m torch.distributed.launch --nproc_per_node=8 run_glue.py   --model_name_or_path microsoft/deberta-v2-xxlarge   \
+--task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 256   --per_device_train_batch_size 8   \
 --learning_rate 3e-6   --num_train_epochs 3   --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --sharded_ddp --fp16
 ```
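
The model description quotes 48 layers, a hidden size of 1536, and 1.5B total parameters. A rough back-of-the-envelope check of that figure can be sketched as below. This is only an approximation under stated assumptions: the feed-forward width is taken as 4x the hidden size, the vocabulary is taken as roughly 128K tokens, and biases, LayerNorm weights, and DeBERTa's relative-position parameters are ignored. The function name `estimate_encoder_params` is hypothetical and is not part of any library.

```python
def estimate_encoder_params(num_layers, hidden_size, vocab_size, ffn_multiplier=4):
    """Rough weight-matrix count for a BERT-style encoder.

    Ignores biases, LayerNorm, and DeBERTa-specific relative-position
    parameters, so the result is an estimate, not an exact count.
    """
    # Q, K, V, and output projections: four hidden x hidden matrices per layer
    attention = 4 * hidden_size * hidden_size
    # Feed-forward up and down projections (width assumed to be 4x hidden)
    feed_forward = 2 * ffn_multiplier * hidden_size * hidden_size
    encoder = num_layers * (attention + feed_forward)
    # Token embedding table
    embeddings = vocab_size * hidden_size
    return encoder, embeddings


# 48 layers and hidden size 1536 come from the model description;
# the 128K vocabulary size is an assumption made for this sketch.
encoder, embeddings = estimate_encoder_params(48, 1536, 128_000)
total = encoder + embeddings
print(f"encoder ~ {encoder / 1e9:.2f}B, embeddings ~ {embeddings / 1e9:.2f}B, total ~ {total / 1e9:.2f}B")
```

Under these assumptions the estimate lands in the neighborhood of the quoted 1.5B, which suggests the figure includes the embedding table and not only the encoder layers.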