# Discourse Mutual Information (DMI)

This repository hosts the PyTorch-based implementation of the DMI model proposed in [**Representation Learning for Conversational Data using Discourse Mutual Information Maximization**](https://arxiv.org/abs/2112.05787).

## Requirements

- wandb
- transformers
- datasets
- torch 1.8.2 (lts)

## Getting Access to the Source Code or Pretrained Models

To get access to the source code or pretrained model checkpoints, please send a request to [AcadGrants@service.microsoft.com](mailto:AcadGrants@service.microsoft.com) and cc *pawang.iitk [_at_] iitkgp.ac.in* and *bsantraigi [_at_] gmail.com*.

**Note:** The requesting third party (1) can download and use these deliverables for research as well as commercial use, (2) can modify them as they like but should cite our work and include this readme, and (3) strictly cannot redistribute them to any other organization.

**Cite As**

```bibtex
@inproceedings{santra2022representation,
  title={Representation Learning for Conversational Data using Discourse Mutual Information Maximization},
  author={Santra, Bishal and Roychowdhury, Sumegh and Mandal, Aishik and Gurram, Vasu and Naik, Atharva and Gupta, Manish and Goyal, Pawan},
  booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  year={2022}
}
```

## How to run?

### Loading and Finetuning the model for a task

For finetuning the model on the tasks mentioned in the paper, or on a new task, use the `run_finetune.py` script or modify it according to your requirements. Example commands for launching finetuning from DMI checkpoints can be found in the `auto_eval` directory. For example, if you have downloaded the checkpoint `DMI_small_model.pth` into the `checkpoints/` directory, you can launch one of these scripts as:

```bash
MODEL_NAME_PATH="checkpoints/DMI_small_model.pth" bash long_eval/probe_part1_rob.sh
```

#### Special Note

For running experiments with checkpoints of different sizes, use the corresponding code branches as directed below:

* **DMI_Base**: `master` branch - Finetuning scripts in `auto_eval/`
* **DMI_Medium**: `berty` branch - Finetuning scripts in `auto_eval/8L/`
* **DMI_Small**: `berty` branch - Finetuning scripts in `auto_eval/`

The main difference among these models is that DMI_Base uses the `Roberta-base` architecture as the core of its encoder, whereas the other two use `bert-8L` and `bert-6L`, respectively.

### Pretraining Dataset

Two types of dataset structures are available for model pretraining. In the case of smaller, **"Normal"** datasets, a single `train_dialogues.txt` file contains all the training data and is consumed fully during each epoch. In the case of **"Large"** datasets, the data is split into smaller shards saved as .json files.

1. **Normal Datasets**: For examples, check the `data/dailydialog` or `data/reddit_1M` directories.

   ```sh
   data/reddit_1M
   ├── test_dialogues.txt
   ├── train_dialogues.txt
   └── val_dialogues.txt
   ```

2. **Large Datasets**: This mode can be activated by setting the `--dataset` argument to `rMax`, i.e., `--dataset rMax` or `-dd rMax`. This also requires you to provide the `-rmp` argument with the path of the directory containing the json files. In this mode, the DailyDialog validation set is used by default for validation during pretraining. A minimal sketch of how such sharded data could be streamed is shown after the listing below.

   ```sh
   data/rMax-subset
   ├── test-00000-of-01000.json
   ├── test-00001-of-01000.json
   ├── test-00002-of-01000.json
   ├── test-00003-of-01000.json
   ├── ...
   ├── train-00000-of-01000.json
   ├── train-00001-of-01000.json
   ├── train-00002-of-01000.json
   ├── train-00003-of-01000.json
   └── ...
   ```
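The internal schema of these shard files is defined by the released source code and is not documented here. Purely as an illustration of how a sharded layout like the one above can be consumed without loading the whole corpus into memory, the sketch below streams dialogues one shard at a time. The parsing assumptions (each shard is a JSON list of dialogues, each dialogue a list of utterance strings) and the function name `stream_rmax_shards` are hypothetical, not part of the repository.

```python
import glob
import json
import os
from typing import Iterator, List


def stream_rmax_shards(rmax_path: str, split: str = "train") -> Iterator[List[str]]:
    """Yield dialogues shard by shard from `<split>-XXXXX-of-XXXXX.json` files.

    NOTE: the internal structure of each shard is an assumption here;
    adapt the parsing to whatever the released code actually writes.
    """
    pattern = os.path.join(rmax_path, f"{split}-*.json")
    for shard_file in sorted(glob.glob(pattern)):
        with open(shard_file, encoding="utf-8") as f:
            shard = json.load(f)  # assumed: a list of dialogues per shard
        for dialogue in shard:
            # assumed: each dialogue is a list of utterance strings
            yield dialogue


# Example: count training dialogues without holding the corpus in memory.
if __name__ == "__main__":
    n = sum(1 for _ in stream_rmax_shards("data/rMax-subset", split="train"))
    print(f"train dialogues: {n}")
```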
### For training a model

Training of a new model can be started using the `pretrain.py` script.

**Examples:**

1. Training from scratch:

   ```bash
   python pretrain.py \
       -dd rMax -voc roberta \
       --roberta_init \
       -sym \
       -bs 64 -ep 1000 -vi 400 -li 50 -lr 5e-5 -scdl \
       --data_path ./data \
       -rmp /disk2/infonce-dialog/data/r727m/ \
       -t 1 \
       -ddp --world_size 6 \
       -ntq
   ```

2. Resuming training from an existing checkpoint. This example resumes training from a checkpoint saved under `checkpoints/DMI-Small_BERT-26Jan/`. Also note how we specify the name of an existing BERT/RoBERTa model, which defines the architecture and the original initialization of the model weights.

   ```bash
   python pretrain.py \
       -dd rMax -voc bert \
       --roberta_init \
       -robname google/bert_uncased_L-8_H-768_A-12 \
       -sym -bs 130 -lr 1e-5 -scdl -ep 1000 -vi 400 -li 50 \
       --data_path ./data \
       -rmp /disk2/infonce-dialog/data/r727m/ \
       -ddp --world_size 4 \
       -ntq -t 1 \
       -re -rept checkpoints/DMI-Small_BERT-26Jan/model_current.pth
   ```

**`pretrain.py` accepts the following arguments:**

```
  -h, --help            show this help message and exit
  -dd {dd,r5k,r100k,r1M,r1M/cc,rMax,rMax++,paa,WoW}, --dataset {dd,r5k,r100k,r1M,r1M/cc,rMax,rMax++,paa,WoW}
                        which dataset to use for pretraining.
  -rf, --reddit_filter_enabled
                        Enable reddit data filter for removing low quality dialogs.
  -rmp RMAX_PATH, --rmax_path RMAX_PATH
                        path to dir for r727m (.json) data files.
  -dp DATA_PATH, --data_path DATA_PATH
                        path to the root data folder.
  -op OUTPUT_PATH, --output_path OUTPUT_PATH
                        Path to store the output ``model.pth'' files
  -voc {bert,blender,roberta,dgpt-m}, --vocab {bert,blender,roberta,dgpt-m}
                        mention which tokenizer was used for pretraining? bert or blender
  -rob, --roberta_init  Initialize transformer-encoder with roberta weights?
  -robname ROBERTA_NAME, --roberta_name ROBERTA_NAME
                        name of checkpoint from huggingface
  -d D_MODEL, --d_model D_MODEL
                        size of transformer encoders' hidden representation
  -d_ff DIM_FEEDFORWARD, --dim_feedforward DIM_FEEDFORWARD
                        dim_feedforward for transformer encoder.
  -p PROJECTION, --projection PROJECTION
                        size of projection layer output
  -el ENCODER_LAYERS, --encoder_layers ENCODER_LAYERS
                        number of layers in transformer encoder
  -eh ENCODER_HEADS, --encoder_heads ENCODER_HEADS
                        number of heads in tformer enc
  -sym, --symmetric_loss
                        whether to train using symmetric infonce
  -udrl, --unsupervised_discourse_losses
                        Additional unsupervised discourse-relation loss components
  -sdrl, --supervised_discourse_losses
                        Additional supervised discourse-relation loss components
  -es {infonce,jsd,nwj,tuba,dv,smile,infonce/td}, --estimator {infonce,jsd,nwj,tuba,dv,smile,infonce/td}
                        which MI estimator is used as the loss function.
  -bs BATCH_SIZE, --batch_size BATCH_SIZE
                        batch size during pretraining
  -ep EPOCHS, --epochs EPOCHS
                        epochs for pretraining
  -vi VAL_INTERVAL, --val_interval VAL_INTERVAL
                        validation interval during training
  -li LOG_INTERVAL, --log_interval LOG_INTERVAL
                        logging interval during training
  -lr LEARNING_RATE, --learning_rate LEARNING_RATE
                        set learning rate
  -lrc, --learning_rate_control
                        LRC: outer layer and projection layer will have faster LR and rest will be LR/10
  -t {0,1}, --tracking {0,1}
                        whether to track training+validation loss wandb
  -scdl, --use_scheduler
                        whether to use a warmup+decay schedule for LR
  -ntq, --no_tqdm       disable tqdm to create concise log files!
  -ddp, --distdp        Should it use pytorch Distributed dataparallel?
  -ws WORLD_SIZE, --world_size WORLD_SIZE
                        world size when using DDP with pytorch.
  -re, --resume         2-stage pretrain: Resume training from a previous checkpoint?
  -rept RESUME_MODEL_PATH, --resume_model_path RESUME_MODEL_PATH
                        If ``Resuming'', path to ckpt file.
```
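As context for the `-sym`/`--symmetric_loss` and `-es infonce` options above: pretraining maximizes a lower bound on the mutual information between context and response representations. The sketch below is a generic, self-contained formulation of a symmetric InfoNCE loss over in-batch negatives, included only for illustration; it is not the repository's exact implementation, and the cosine similarity, temperature, and variable names are assumptions.

```python
import torch
import torch.nn.functional as F


def symmetric_infonce(ctx_proj: torch.Tensor,
                      rsp_proj: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (context, response) projection pairs.

    ctx_proj, rsp_proj: [batch, dim] outputs of the two projection heads.
    For each context, the other responses in the batch act as negatives
    (and vice versa for the symmetric direction).

    NOTE: illustrative only; the actual DMI loss may differ, e.g. in the
    similarity function, temperature, or normalization.
    """
    ctx = F.normalize(ctx_proj, dim=-1)
    rsp = F.normalize(rsp_proj, dim=-1)
    logits = ctx @ rsp.t() / temperature              # [batch, batch] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_c2r = F.cross_entropy(logits, targets)       # context -> response direction
    loss_r2c = F.cross_entropy(logits.t(), targets)   # response -> context direction
    return 0.5 * (loss_c2r + loss_r2c)


# Example usage with random projections:
if __name__ == "__main__":
    ctx = torch.randn(8, 256)
    rsp = torch.randn(8, 256)
    print(symmetric_infonce(ctx, rsp).item())
```

Choosing a different `-es` estimator (e.g., `jsd` or `nwj`) corresponds to replacing this scoring scheme with the respective mutual-information lower bound.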