| # Gupshup | |
| GupShup: Summarizing Open-Domain Code-Switched Conversations (EMNLP 2021) | |
| Paper: [https://aclanthology.org/2021.emnlp-main.499.pdf](https://aclanthology.org/2021.emnlp-main.499.pdf) | |
| Github: [https://github.com/midas-research/gupshup](https://github.com/midas-research/gupshup) | |
| ## Dataset | |
| Please request the GupShup data using [this Google form](https://docs.google.com/forms/d/1zvUk7WcldVF3RCoHdWzQPzPprtSJClrnHoIOYbzaJEI/edit?ts=61381ec0). | |
| The dataset is available for two tasks: `Hinglish Dialogues to English Summarization` (h2e) and `English Dialogues to English Summarization` (e2e). For each task, dialogue files use the `.source` extension (e.g. `train.source`) and summary files use the `.target` extension (e.g. `train.target`). Pass the `.source` file to the `input_path` argument and the `.target` file to the `reference_path` argument of the scripts. | |
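To inspect a split before running the scripts, the `.source` and `.target` files can be read side by side. The sketch below is not part of the repository: `load_pairs` is a hypothetical helper, and it assumes each file holds one example per line, with line *i* of the `.source` file matching line *i* of the `.target` file. Check this against the files you receive.

```python
from pathlib import Path
from typing import List, Tuple


def load_pairs(source_path: str, target_path: str) -> List[Tuple[str, str]]:
    """Pair each dialogue with its summary.

    Assumption (not confirmed by the README): one example per line,
    and both files are aligned line by line.
    """
    sources = Path(source_path).read_text(encoding="utf-8").splitlines()
    targets = Path(target_path).read_text(encoding="utf-8").splitlines()
    if len(sources) != len(targets):
        raise ValueError(
            f"{source_path} has {len(sources)} lines but "
            f"{target_path} has {len(targets)}"
        )
    return list(zip(sources, targets))


# Example usage (paths follow the folder layout used in this README):
# pairs = load_pairs("data/h2e/test.source", "data/h2e/test.target")
# dialogue, summary = pairs[0]
```

The length check is there so that a truncated or mismatched download fails early instead of silently pairing dialogues with the wrong summaries.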
| ## Models | |
| All model weights are available on the Hugging Face model hub. You can either download the weights locally and pass that path to the `model_name` argument of the scripts, or pass one of the aliases below to `model_name`, in which case the scripts download the weights automatically. | |
| Aliases follow the pattern `gupshup_TASK_MODEL`, where `TASK` is `h2e` or `e2e` and `MODEL` is `mbart`, `pegasus`, etc., as listed below. | |
| **1. Hinglish Dialogues to English Summary (h2e)** | |
| | Model | Huggingface Alias | | |
| |---------|-------------------------------------------------------------------------------| | |
| | mBART | [midas/gupshup_h2e_mbart](https://huggingface.co/midas/gupshup_h2e_mbart) | | |
| | PEGASUS | [midas/gupshup_h2e_pegasus](https://huggingface.co/midas/gupshup_h2e_pegasus) | | |
| | T5 MTL | [midas/gupshup_h2e_t5_mtl](https://huggingface.co/midas/gupshup_h2e_t5_mtl) | | |
| | T5 | [midas/gupshup_h2e_t5](https://huggingface.co/midas/gupshup_h2e_t5) | | |
| | BART | [midas/gupshup_h2e_bart](https://huggingface.co/midas/gupshup_h2e_bart) | | |
| | GPT-2 | [midas/gupshup_h2e_gpt](https://huggingface.co/midas/gupshup_h2e_gpt) | | |
| **2. English Dialogues to English Summary (e2e)** | |
| | Model | Huggingface Alias | | |
| |---------|-------------------------------------------------------------------------------| | |
| | mBART | [midas/gupshup_e2e_mbart](https://huggingface.co/midas/gupshup_e2e_mbart) | | |
| | PEGASUS | [midas/gupshup_e2e_pegasus](https://huggingface.co/midas/gupshup_e2e_pegasus) | | |
| | T5 MTL | [midas/gupshup_e2e_t5_mtl](https://huggingface.co/midas/gupshup_e2e_t5_mtl) | | |
| | T5 | [midas/gupshup_e2e_t5](https://huggingface.co/midas/gupshup_e2e_t5) | | |
| | BART | [midas/gupshup_e2e_bart](https://huggingface.co/midas/gupshup_e2e_bart) | | |
| | GPT-2 | [midas/gupshup_e2e_gpt](https://huggingface.co/midas/gupshup_e2e_gpt) | | |
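The aliases in the two tables follow one naming pattern, so they can be generated rather than typed by hand. This is a minimal sketch; `build_alias`, `TASKS` and `MODELS` are hypothetical names introduced here, and the task and model suffixes are copied from the tables above.

```python
# Task and model suffixes as they appear in the alias tables above.
TASKS = ("h2e", "e2e")
MODELS = ("mbart", "pegasus", "t5_mtl", "t5", "bart", "gpt")


def build_alias(task: str, model: str, org: str = "midas") -> str:
    """Return the hub alias for a task/model pair, e.g. midas/gupshup_h2e_mbart."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}, expected one of {TASKS}")
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}, expected one of {MODELS}")
    return f"{org}/gupshup_{task}_{model}"


if __name__ == "__main__":
    # List every alias from both tables.
    for task in TASKS:
        for model in MODELS:
            print(build_alias(task, model))
```

Note that GPT-2 uses the suffix `gpt` and the multi-task T5 model uses `t5_mtl`.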
| ## Inference | |
| ### Using command line | |
| 1. Clone this repo and create a Python virtual environment (https://docs.python.org/3/library/venv.html), then install the required packages: | |
| ``` | |
| git clone https://github.com/midas-research/gupshup.git | |
| cd gupshup | |
| pip install -r requirements.txt | |
| ``` | |
| 2. The `run_eval.py` script accepts the following arguments. | |
| * **model_name** : Path or alias to one of our models available on Huggingface as listed above. | |
| * **input_path** : Path to the source file containing the conversations to be summarized. | |
| * **save_path** : File path where the summaries generated by the model are saved. | |
| * **reference_path** : Path to the target file containing the reference summaries, used to calculate metrics. | |
| * **score_path** : File path where to save scores. | |
| * **bs** : Batch size | |
| * **device** : CUDA devices to use. | |
| Make sure you have obtained the GupShup dataset through the Google form above, and pass the correct file paths to the `input_path` and `reference_path` arguments. Alternatively, place `test.source` and `test.target` in the `data/h2e/` (Hinglish to English) or `data/e2e/` (English to English) folder. For example, to generate English summaries from Hinglish dialogues using the mBART model, run: | |
| ``` | |
| python run_eval.py \ | |
| --model_name midas/gupshup_h2e_mbart \ | |
| --input_path data/h2e/test.source \ | |
| --save_path generated_summary.txt \ | |
| --reference_path data/h2e/test.target \ | |
| --score_path scores.txt \ | |
| --bs 8 | |
| ``` | |
| As another example, to generate English summaries from English dialogues using the PEGASUS model, run: | |
| ``` | |
| python run_eval.py \ | |
| --model_name midas/gupshup_e2e_pegasus \ | |
| --input_path data/e2e/test.source \ | |
| --save_path generated_summary.txt \ | |
| --reference_path data/e2e/test.target \ | |
| --score_path scores.txt \ | |
| --bs 8 | |
| ``` | |
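The two commands above differ only in the task and model, so when evaluating several models it can help to build the argument list programmatically. The following is a sketch under stated assumptions: `build_run_eval_command` is a hypothetical helper that is not part of the repository, it uses only the flags shown in this README, and the value format expected by `--device` is not documented here, so that flag is passed through unchanged and omitted by default.

```python
import shlex
from typing import List, Optional


def build_run_eval_command(
    task: str,
    model: str,
    bs: int = 8,
    save_path: str = "generated_summary.txt",
    score_path: str = "scores.txt",
    data_dir: str = "data",
    device: Optional[str] = None,
) -> List[str]:
    """Build the run_eval.py argument list for one task/model pair."""
    cmd = [
        "python", "run_eval.py",
        "--model_name", f"midas/gupshup_{task}_{model}",
        "--input_path", f"{data_dir}/{task}/test.source",
        "--save_path", save_path,
        "--reference_path", f"{data_dir}/{task}/test.target",
        "--score_path", score_path,
        "--bs", str(bs),
    ]
    if device is not None:
        # Assumption: the value is forwarded as-is; check run_eval.py for the expected format.
        cmd += ["--device", device]
    return cmd


if __name__ == "__main__":
    # Print the command for inspection; run it with subprocess.run(cmd, check=True) if desired.
    print(shlex.join(build_run_eval_command("h2e", "mbart")))
```

Returning a list instead of a single string keeps each argument separate, so paths containing spaces are passed to `subprocess.run` without extra quoting.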
| Please create an issue if you are facing any difficulties in replicating the results. | |
| ## References | |
| Please cite [[1]](https://aclanthology.org/2021.emnlp-main.499.pdf) if you find the resources in this repository useful. | |
| [1] Mehnaz, Laiba, Debanjan Mahata, Rakesh Gosangi, Uma Sushmitha Gunturi, Riya Jain, Gauri Gupta, Amardeep Kumar, Isabelle G. Lee, Anish Acharya, and Rajiv Shah. [*GupShup: Summarizing Open-Domain Code-Switched Conversations*](https://aclanthology.org/2021.emnlp-main.499.pdf). In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6177-6192, 2021. | |
| ``` | |
| @inproceedings{mehnaz2021gupshup, | |
| title={GupShup: Summarizing Open-Domain Code-Switched Conversations}, | |
| author={Mehnaz, Laiba and Mahata, Debanjan and Gosangi, Rakesh and Gunturi, Uma Sushmitha and Jain, Riya and Gupta, Gauri and Kumar, Amardeep and Lee, Isabelle G and Acharya, Anish and Shah, Rajiv}, | |
| booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing}, | |
| pages={6177--6192}, | |
| year={2021} | |
| } | |
| ``` | |