bytestorm committed on
Commit fc3f9ec · verified · 1 Parent(s): c402aa6

Upload 6 files
DMI_base_model.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c8510a30f0ea990bf6fe3019406dafb98610afc4dc0c02d29770b5cfff207e88
+ size 1495966005
DMI_medium_model.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a731803871088db88c971fb33eb2f6eb4ed087798575d902f456c51c97237f31
+ size 973719413
DMI_small_model.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0f1553b691e52078e5d669a0c28ebd634fb15c428abd19e3ed750031eefaed48
+ size 803576213
Discourse-Mutual-Information-DMI-berty.zip ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5a3c5a1177ed55f25226b18c091d97a5bac29d2c9019408960494843572cd64b
+ size 3667861
Discourse-Mutual-Information-DMI-main.zip ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1c72b21d9749a93bfab1c041c6f273dbb04a9f680ab08032e8a292b1facb9faa
+ size 3663207
README.md CHANGED
@@ -1,5 +1,176 @@
- ---
- license: other
- license_name: custom-no-redistribution-license
- license_link: https://bsantraigi.github.io/DMI/#note
- ---
# Discourse Mutual Information (DMI)

This repository hosts the PyTorch-based implementation of the DMI model proposed in [**Representation Learning for Conversational Data using Discourse Mutual Information Maximization**](https://arxiv.org/abs/2112.05787).

## Requirements

- wandb
- transformers
- datasets
- torch 1.8.2 (lts)

## Getting Access to the Source Code or Pretrained Models

To get access to the source code or pretrained model checkpoints, please send a request to [AcadGrants@service.microsoft.com](mailto:AcadGrants@service.microsoft.com) and cc *pawang.iitk [_at_] iitkgp.ac.in* and *bsantraigi [_at_] gmail.com*.

**Note:** The requesting third party (1) can download and use these deliverables for research as well as commercial use, (2) may modify them as they like but must cite our work and include this readme, and (3) strictly cannot redistribute them to any other organization.

**Cite As**

```bibtex
@inproceedings{santra2022representation,
  title={Representation Learning for Conversational Data using Discourse Mutual Information Maximization},
  author={Santra, Bishal and Roychowdhury, Sumegh and Mandal, Aishik and Gurram, Vasu and Naik, Atharva and Gupta, Manish and Goyal, Pawan},
  booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  year={2022}
}
```

## How to run?

### Loading and Finetuning the model for a task

To finetune the model on the tasks mentioned in the paper, or on a new task, use the `run_finetune.py` script or modify it according to your requirements. Example commands for launching finetuning from DMI checkpoints can be found in the `auto_eval` directory.

For example, if you have downloaded the checkpoint `DMI_small_model.pth` into the `checkpoints/` directory, you can launch one of these scripts as:

```bash
MODEL_NAME_PATH="checkpoints/DMI_small_model.pth" bash long_eval/probe_part1_rob.sh
```

#### Special Note

For running experiments with the checkpoints of different sizes, use the corresponding code branches as directed below:

* **DMI_Base**: `master` branch
  - Finetuning scripts in `auto_eval/`
* **DMI_Medium**: `berty` branch
  - Finetuning scripts in `auto_eval/8L/`
* **DMI_Small**: `berty` branch
  - Finetuning scripts in `auto_eval/`

The main difference among these models is that DMI_Base uses the `Roberta-base` architecture as the core of its encoder, whereas the other two use `bert-8L` and `bert-6L`, respectively.

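The checkpoint-to-encoder mapping above can be summarized in a small lookup table. Note this is a hedged sketch: the 8L identifier matches the resume example later in this README, but `roberta-base` and the 6L identifier are assumptions here, not confirmed by the repository.

```python
# Hedged sketch: assumed Hugging Face core encoder per DMI checkpoint.
# The 8L name appears in the resume-training example in this README;
# "roberta-base" and the 6L name are assumptions.
CORE_ENCODER = {
    "DMI_base_model.pth":   "roberta-base",                        # master branch
    "DMI_medium_model.pth": "google/bert_uncased_L-8_H-768_A-12",  # berty branch
    "DMI_small_model.pth":  "google/bert_uncased_L-6_H-768_A-12",  # berty branch
}

def core_encoder_for(checkpoint_filename):
    """Return the assumed core-encoder name for a DMI checkpoint file."""
    return CORE_ENCODER[checkpoint_filename]
```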
### Pretraining Dataset

Two types of dataset structure are available for model pretraining.

For smaller, **"Normal"** datasets, a single train_dialog file contains all the training data and is consumed fully during each epoch.

For **"Large"** datasets, the files are split into smaller shards and saved as .json files.

1. **Normal Datasets**: For examples of this, check the `data/dailydialog` or `data/reddit_1M` directories.
   ```sh
   data/reddit_1M
   ├── test_dialogues.txt
   ├── train_dialogues.txt
   └── val_dialogues.txt
   ```
2. **Large Datasets**: This mode can be activated by setting the `--dataset` argument to `rMax`, i.e., `--dataset rMax` or `-dd rMax`. This also requires you to provide the `-rmp` argument with the path of the directory containing the json files. For validation during pretraining, this mode uses the DailyDialog validation set by default.
   ```sh
   data/rMax-subset
   ├── test-00000-of-01000.json
   ├── test-00001-of-01000.json
   ├── test-00002-of-01000.json
   ├── test-00003-of-01000.json
   ├── ...
   ├── train-00000-of-01000.json
   ├── train-00001-of-01000.json
   ├── train-00002-of-01000.json
   ├── train-00003-of-01000.json
   └── ...
   ```

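A sharded layout like the one above can be consumed shard-by-shard without loading everything at once. A minimal sketch, assuming each shard holds one JSON list of dialogues (the real r727m shard schema may differ); the demo uses tiny synthetic shards, not the actual data:

```python
import glob
import json
import os
import tempfile

def iter_rmax_dialogues(rmax_path, split="train"):
    """Yield dialogues from every shard of a split, in filename order.

    Assumes each `<split>-XXXXX-of-XXXXX.json` shard is a JSON list of dialogues.
    """
    pattern = os.path.join(rmax_path, f"{split}-*.json")
    for shard in sorted(glob.glob(pattern)):
        with open(shard) as f:
            yield from json.load(f)

# Demo with two tiny synthetic shards (stand-ins for the real shards).
root = tempfile.mkdtemp()
with open(os.path.join(root, "train-00000-of-01000.json"), "w") as f:
    json.dump(["Hi ! How are you ?", "Good , thanks ."], f)
with open(os.path.join(root, "train-00001-of-01000.json"), "w") as f:
    json.dump(["See you later ."], f)

dialogues = list(iter_rmax_dialogues(root))
```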
### For training a model

A new model can be trained using the `pretrain.py` script.

**Example:**

1. For training from scratch:
   ```bash
   python pretrain.py \
       -dd rMax -voc roberta \
       --roberta_init \
       -sym \
       -bs 64 -ep 1000 -vi 400 -li 50 -lr 5e-5 -scdl \
       --data_path ./data \
       -rmp /disk2/infonce-dialog/data/r727m/ \
       -t 1 \
       -ddp --world_size 6 \
       -ntq
   ```
2. To resume training from an existing checkpoint: this example shows resuming training from a checkpoint saved under `checkpoints/DMI-Small_BERT-26Jan/`. Also note how we specify the name of an existing BERT/RoBERTa model, which defines the architecture and the original initialization of the model weights.
   ```bash
   python pretrain.py \
       -dd rMax -voc bert \
       --roberta_init \
       -robname google/bert_uncased_L-8_H-768_A-12 \
       -sym -bs 130 -lr 1e-5 -scdl -ep 1000 -vi 400 -li 50 \
       --data_path ./data \
       -rmp /disk2/infonce-dialog/data/r727m/ \
       -ddp --world_size 4 \
       -ntq -t 1 \
       -re -rept checkpoints/DMI-Small_BERT-26Jan/model_current.pth
   ```

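Both commands pass `-sym`, which selects the symmetric InfoNCE objective: the InfoNCE cross-entropy is applied over a batch score matrix both row-wise (context to response) and column-wise (response to context), and the two are averaged. A toy pure-Python sketch of that idea (the actual training code operates on batched PyTorch tensors and may differ in details):

```python
import math

def info_nce(scores):
    """Toy InfoNCE: mean cross-entropy over rows of an (N x N) score matrix,
    treating the diagonal entry as the positive pair."""
    loss = 0.0
    for i, row in enumerate(scores):
        z = [math.exp(s) for s in row]
        loss += -math.log(z[i] / sum(z))
    return loss / len(scores)

def symmetric_info_nce(scores):
    """Average of row-wise and column-wise InfoNCE (the 'symmetric' variant)."""
    cols = [list(c) for c in zip(*scores)]  # transpose: response -> context
    return 0.5 * (info_nce(scores) + info_nce(cols))

# Toy score matrix: diagonal (matched context/response pairs) scores highest.
toy = [[5.0, 0.1, 0.2],
       [0.0, 4.0, 0.3],
       [0.1, 0.2, 6.0]]
loss = symmetric_info_nce(toy)
```

Because the symmetric loss sums both directions, it is invariant to transposing the score matrix, which is a quick sanity check for an implementation.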
**It accepts the following arguments.**

```
-h, --help            show this help message and exit
-dd {dd,r5k,r100k,r1M,r1M/cc,rMax,rMax++,paa,WoW}, --dataset {dd,r5k,r100k,r1M,r1M/cc,rMax,rMax++,paa,WoW}
                      which dataset to use for pretraining.
-rf, --reddit_filter_enabled
                      Enable reddit data filter for removing low quality dialogs.
-rmp RMAX_PATH, --rmax_path RMAX_PATH
                      path to dir for r727m (.json) data files.
-dp DATA_PATH, --data_path DATA_PATH
                      path to the root data folder.
-op OUTPUT_PATH, --output_path OUTPUT_PATH
                      Path to store the output ``model.pth'' files
-voc {bert,blender,roberta,dgpt-m}, --vocab {bert,blender,roberta,dgpt-m}
                      which tokenizer to use for pretraining: bert or blender
-rob, --roberta_init  Initialize transformer-encoder with roberta weights?
-robname ROBERTA_NAME, --roberta_name ROBERTA_NAME
                      name of checkpoint from huggingface
-d D_MODEL, --d_model D_MODEL
                      size of transformer encoders' hidden representation
-d_ff DIM_FEEDFORWARD, --dim_feedforward DIM_FEEDFORWARD
                      dim_feedforward for transformer encoder.
-p PROJECTION, --projection PROJECTION
                      size of projection layer output
-el ENCODER_LAYERS, --encoder_layers ENCODER_LAYERS
                      number of layers in transformer encoder
-eh ENCODER_HEADS, --encoder_heads ENCODER_HEADS
                      number of heads in transformer encoder
-sym, --symmetric_loss
                      whether to train using symmetric infonce
-udrl, --unsupervised_discourse_losses
                      Additional unsupervised discourse-relation loss components
-sdrl, --supervised_discourse_losses
                      Additional supervised discourse-relation loss components
-es {infonce,jsd,nwj,tuba,dv,smile,infonce/td}, --estimator {infonce,jsd,nwj,tuba,dv,smile,infonce/td}
                      which MI estimator is used as the loss function.
-bs BATCH_SIZE, --batch_size BATCH_SIZE
                      batch size during pretraining
-ep EPOCHS, --epochs EPOCHS
                      epochs for pretraining
-vi VAL_INTERVAL, --val_interval VAL_INTERVAL
                      validation interval during training
-li LOG_INTERVAL, --log_interval LOG_INTERVAL
                      logging interval during training
-lr LEARNING_RATE, --learning_rate LEARNING_RATE
                      set learning rate
-lrc, --learning_rate_control
                      LRC: outer layer and projection layer will have faster LR and rest will be LR/10
-t {0,1}, --tracking {0,1}
                      whether to track training+validation loss on wandb
-scdl, --use_scheduler
                      whether to use a warmup+decay schedule for LR
-ntq, --no_tqdm       disable tqdm to create concise log files!
-ddp, --distdp        Should it use pytorch Distributed dataparallel?
-ws WORLD_SIZE, --world_size WORLD_SIZE
                      world size when using DDP with pytorch.
-re, --resume         2-stage pretrain: Resume training from a previous checkpoint?
-rept RESUME_MODEL_PATH, --resume_model_path RESUME_MODEL_PATH
                      If ``Resuming'', path to ckpt file.
```
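The flag aliases above follow standard argparse conventions (short multi-character flags plus long forms). A minimal, hedged reconstruction of a few of them for quick reference; this is not the actual parser in `pretrain.py`, whose defaults and types may differ:

```python
import argparse

# Hedged reconstruction of a handful of pretrain.py flags from the help text.
parser = argparse.ArgumentParser(prog="pretrain.py")
parser.add_argument(
    "-dd", "--dataset",
    choices=["dd", "r5k", "r100k", "r1M", "r1M/cc", "rMax", "rMax++", "paa", "WoW"],
    help="which dataset to use for pretraining.",
)
parser.add_argument("-voc", "--vocab",
                    choices=["bert", "blender", "roberta", "dgpt-m"],
                    help="which tokenizer to use for pretraining")
parser.add_argument("-bs", "--batch_size", type=int,
                    help="batch size during pretraining")
parser.add_argument("-lr", "--learning_rate", type=float,
                    help="set learning rate")
parser.add_argument("-sym", "--symmetric_loss", action="store_true",
                    help="whether to train using symmetric infonce")

# Parse the flags used in the from-scratch pretraining example.
args = parser.parse_args(
    ["-dd", "rMax", "-voc", "roberta", "-bs", "64", "-lr", "5e-5", "-sym"]
)
```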