# Discourse Mutual Information (DMI)
This repository hosts the PyTorch-based implementation for the DMI model proposed in [**Representation Learning for Conversational Data using Discourse Mutual Information Maximization**](https://arxiv.org/abs/2112.05787).
## Requirements
- wandb
- transformers
- datasets
- torch 1.8.2 (lts)
## Getting Access to the Source Code or Pretrained Models
To get access to the source-code or pretrained-model checkpoints, please send a request to [AcadGrants@service.microsoft.com](mailto:AcadGrants@service.microsoft.com) and cc to *pawang.iitk [_at_] iitkgp.ac.in* and *bsantraigi [_at_] gmail.com*.
**Note:** The requesting third party (1) may download and use these deliverables for research as well as commercial purposes, (2) may modify them as they like but must cite our work and include this README, and (3) may not redistribute them to any other organization.
**Cite As**
```bibtex
@inproceedings{santra2022representation,
title={Representation Learning for Conversational Data using Discourse Mutual Information Maximization},
author={Santra, Bishal and Roychowdhury, Sumegh and Mandal, Aishik and Gurram, Vasu and Naik, Atharva and Gupta, Manish and Goyal, Pawan},
booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
year={2022}
}
```
## How to run?
### Loading and Finetuning the model for a task
For finetuning the model on the tasks mentioned in the paper, or on a new task, use the `run_finetune.py` script or modify it according to your requirements. Example commands for launching finetuning based on some DMI checkpoints can be found in the `auto_eval` directory.
For example, if you have downloaded the checkpoint `DMI_small_model.pth` into the `checkpoints/` directory, you can launch one of these evaluation scripts as:
```bash
MODEL_NAME_PATH="checkpoints/DMI_small_model.pth" bash long_eval/probe_part1_rob.sh
```
#### Special Note
For running experiments with the checkpoints of different sizes, use the corresponding code branches as directed below:
* **DMI_Base**: `master` branch
- Finetuning scripts in `auto_eval/`
* **DMI_Medium**: `berty` branch
- Finetuning scripts in `auto_eval/8L/`
* **DMI_Small**: `berty` branch
- Finetuning scripts in `auto_eval/`
The main difference among these models is that DMI_Base uses the `Roberta-base` architecture as the core of its encoder, whereas the other two use `bert-8L` and `bert-6L`, respectively.
### Pretraining Dataset
Two types of dataset structure are available for model pretraining.
In the case of smaller, **"Normal"** datasets, a single train-dialogues file contains all the training data and is consumed fully during each epoch.
In the case of **"Large"** datasets, the data is split into smaller shards saved as `.json` files.
1. **Normal Datasets**: For examples of this format, see the `data/dailydialog` or `data/reddit_1M` directories.
```sh
data/reddit_1M
├── test_dialogues.txt
├── train_dialogues.txt
└── val_dialogues.txt
```
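The exact line format of these dialogue files is repository-specific. As a minimal sketch, assuming each line of `train_dialogues.txt` holds one dialogue with utterances separated by a tab (an assumption for illustration, not a documented format), a loader might look like:

```python
from pathlib import Path

def load_dialogues(path, sep="\t"):
    """Read one dialogue per line; split each line into utterances.

    NOTE: the tab separator is an assumption for illustration -- check the
    actual files under data/dailydialog or data/reddit_1M for the real format.
    """
    dialogues = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:
            dialogues.append(line.split(sep))
    return dialogues
```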
2. **Large Datasets**: This mode is activated by setting the `--dataset` argument to `rMax`, i.e., `--dataset rMax` or `-dd rMax`. It also requires the `-rmp` argument, which gives the path to the directory containing the `.json` files. For validation during pretraining, the model uses the DailyDialog validation set by default.
```sh
data/rMax-subset
├── test-00000-of-01000.json
├── test-00001-of-01000.json
├── test-00002-of-01000.json
├── test-00003-of-01000.json
├── ...
├── train-00000-of-01000.json
├── train-00001-of-01000.json
├── train-00002-of-01000.json
├── train-00003-of-01000.json
└── ...
```
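The shards follow a `split-xxxxx-of-01000.json` naming pattern. A minimal sketch of producing shards with this naming is shown below; storing each shard as a plain JSON list of dialogues is an assumption for illustration, as the actual r727m schema may differ:

```python
import json
from pathlib import Path

def write_shards(dialogues, out_dir, split="train", num_shards=4, total=1000):
    """Split a list of dialogues into numbered .json shard files.

    NOTE: a plain JSON list per shard is an assumption for illustration;
    inspect the real r727m files for the actual per-shard schema.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    per_shard = -(-len(dialogues) // num_shards)  # ceiling division
    for i in range(num_shards):
        chunk = dialogues[i * per_shard:(i + 1) * per_shard]
        name = f"{split}-{i:05d}-of-{total:05d}.json"
        (out / name).write_text(json.dumps(chunk), encoding="utf-8")
```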
### For training a model
A new model can be trained using the `pretrain.py` script.
**Example:**
1. For training from scratch:
```bash
python pretrain.py \
-dd rMax -voc roberta \
--roberta_init \
-sym \
-bs 64 -ep 1000 -vi 400 -li 50 -lr 5e-5 -scdl \
--data_path ./data \
-rmp /disk2/infonce-dialog/data/r727m/ \
-t 1 \
-ddp --world_size 6 \
-ntq
```
2. To resume training from an existing checkpoint: This example shows resuming training from a checkpoint saved under `checkpoints/DMI-Small_BERT-26Jan/`. Also note how we specify the name of an existing BERT/RoBERTa model (via `-robname`), which defines the architecture and the original initialization of the model weights.
```bash
python pretrain.py \
-dd rMax -voc bert \
--roberta_init \
-robname google/bert_uncased_L-8_H-768_A-12 \
-sym -bs 130 -lr 1e-5 -scdl -ep 1000 -vi 400 -li 50 \
--data_path ./data \
-rmp /disk2/infonce-dialog/data/r727m/ \
-ddp --world_size 4 \
-ntq -t 1 \
-re -rept checkpoints/DMI-Small_BERT-26Jan/model_current.pth
```
**`pretrain.py` accepts the following arguments.**
```
-h, --help show this help message and exit
-dd {dd,r5k,r100k,r1M,r1M/cc,rMax,rMax++,paa,WoW}, --dataset {dd,r5k,r100k,r1M,r1M/cc,rMax,rMax++,paa,WoW}
which dataset to use for pretraining.
-rf, --reddit_filter_enabled
Enable reddit data filter for removing low quality dialogs.
-rmp RMAX_PATH, --rmax_path RMAX_PATH
path to dir for r727m (.json) data files.
-dp DATA_PATH, --data_path DATA_PATH
path to the root data folder.
-op OUTPUT_PATH, --output_path OUTPUT_PATH
Path to store the output ``model.pth'' files
-voc {bert,blender,roberta,dgpt-m}, --vocab {bert,blender,roberta,dgpt-m}
mention which tokenizer was used for pretraining? bert or blender
-rob, --roberta_init Initialize transformer-encoder with roberta weights?
-robname ROBERTA_NAME, --roberta_name ROBERTA_NAME
name of checkpoint from huggingface
-d D_MODEL, --d_model D_MODEL
size of transformer encoders' hidden representation
-d_ff DIM_FEEDFORWARD, --dim_feedforward DIM_FEEDFORWARD
dim_feedforward for transformer encoder.
-p PROJECTION, --projection PROJECTION
size of projection layer output
-el ENCODER_LAYERS, --encoder_layers ENCODER_LAYERS
number of layers in transformer encoder
-eh ENCODER_HEADS, --encoder_heads ENCODER_HEADS
number of heads in tformer enc
-sym, --symmetric_loss
whether to train using symmetric infonce
-udrl, --unsupervised_discourse_losses
Additional unsupervised discourse-relation loss components
-sdrl, --supervised_discourse_losses
Additional supervised discourse-relation loss components
-es {infonce,jsd,nwj,tuba,dv,smile,infonce/td}, --estimator {infonce,jsd,nwj,tuba,dv,smile,infonce/td}
which MI estimator is used as the loss function.
-bs BATCH_SIZE, --batch_size BATCH_SIZE
batch size during pretraining
-ep EPOCHS, --epochs EPOCHS
epochs for pretraining
-vi VAL_INTERVAL, --val_interval VAL_INTERVAL
validation interval during training
-li LOG_INTERVAL, --log_interval LOG_INTERVAL
logging interval during training
-lr LEARNING_RATE, --learning_rate LEARNING_RATE
set learning rate
-lrc, --learning_rate_control
LRC: outer layer and projection layer will have faster LR and rest will be LR/10
-t {0,1}, --tracking {0,1}
whether to track training+validation loss wandb
-scdl, --use_scheduler
whether to use a warmup+decay schedule for LR
-ntq, --no_tqdm disable tqdm to create concise log files!
-ddp, --distdp Should it use pytorch Distributed dataparallel?
-ws WORLD_SIZE, --world_size WORLD_SIZE
world size when using DDP with pytorch.
-re, --resume 2-stage pretrain: Resume training from a previous checkpoint?
-rept RESUME_MODEL_PATH, --resume_model_path RESUME_MODEL_PATH
If ``Resuming'', path to ckpt file.
```
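For reference, the mapping from short flags to options above can be mirrored with `argparse`. The snippet below is an illustrative reconstruction of a small subset of the flags listed, not the repository's actual parser (defaults here are taken from the example commands, not from the source):

```python
import argparse

def build_parser():
    """Illustrative subset of pretrain.py's CLI (not the actual parser)."""
    p = argparse.ArgumentParser()
    p.add_argument("-dd", "--dataset",
                   choices=["dd", "r5k", "r100k", "r1M", "r1M/cc",
                            "rMax", "rMax++", "paa", "WoW"],
                   help="which dataset to use for pretraining")
    p.add_argument("-rmp", "--rmax_path",
                   help="path to dir for r727m (.json) data files")
    p.add_argument("-bs", "--batch_size", type=int, default=64)
    p.add_argument("-lr", "--learning_rate", type=float, default=5e-5)
    p.add_argument("-sym", "--symmetric_loss", action="store_true")
    p.add_argument("-re", "--resume", action="store_true")
    p.add_argument("-rept", "--resume_model_path",
                   help="if resuming, path to ckpt file")
    return p
```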