Commit b855bf2 by nianlong · verified · 1 parent: 36fb1d1

Update README.md

  1. README.md +322 -0
README.md CHANGED
@@ -1,3 +1,325 @@
1
  ---
2
  license: apache-2.0
3
  ---
4
+ [![DOI](https://img.shields.io/badge/DOI-10%2E18653%2Fv1%2F2022%2Eacl--long%2E450-blue)](http://dx.doi.org/10.18653/v1/2022.acl-long.450)
5
+ # MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes
6
+
7
+ Code for the ACL 2022 paper on extractive summarization of long documents: [MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes](https://aclanthology.org/2022.acl-long.450/).
8
+
9
+ ## Set Up Environment
10
+
11
+ 1. Create an Anaconda environment, for example named `memsum`.
12
+
13
+ **Note**: Unless stated otherwise, run the following commands in the working directory where this Jupyter notebook is located.
14
+ ```bash
15
+ conda create -n memsum python=3.10
16
+ ```
17
+ 2. Activate this environment.
18
+ ```bash
19
+ source activate memsum
20
+ ```
21
+
22
+ 3. Install PyTorch (GPU version).
23
+ ```bash
24
+ pip install torch torchvision torchaudio
25
+ ```
26
+ 4. Install dependencies via pip
27
+ ```bash
28
+ pip install -r requirements.txt
29
+ ```
30
+
31
+ ## Download Datasets and Pretrained Model Checkpoints
32
+
33
+ ### Download All Datasets Used in the Paper
34
+
35
+
36
+ ```python
37
+ import os
38
+ import subprocess
39
+ import wget
40
+
41
+ for dataset_name in [ "arxiv", "pubmed", "gov-report"]:
42
+     print(dataset_name)
43
+     os.makedirs( "data/"+dataset_name, exist_ok=True )
44
+
45
+     ## dataset is stored at huggingface hub
46
+     train_dataset_path = f"https://huggingface.co/datasets/nianlong/long-doc-extractive-summarization-{dataset_name}/resolve/main/train.jsonl"
47
+     val_dataset_path = f"https://huggingface.co/datasets/nianlong/long-doc-extractive-summarization-{dataset_name}/resolve/main/val.jsonl"
48
+     test_dataset_path = f"https://huggingface.co/datasets/nianlong/long-doc-extractive-summarization-{dataset_name}/resolve/main/test.jsonl"
49
+
50
+     wget.download( train_dataset_path, out = "data/"+dataset_name )
51
+     wget.download( val_dataset_path, out = "data/"+dataset_name )
52
+     wget.download( test_dataset_path, out = "data/"+dataset_name )
53
+ ```
54
+
55
+ ### Download Pretrained Model Checkpoints
56
+
57
+ The trained MemSum model checkpoints are stored on the Hugging Face Hub.
58
+
59
+
60
+ ```python
61
+ from huggingface_hub import snapshot_download
62
+ ## download the pretrained glove word embedding (200 dimension)
63
+ snapshot_download('nianlong/memsum-word-embedding', local_dir = "model/word_embedding" )
64
+
65
+ ## download model checkpoint on the arXiv dataset
66
+ snapshot_download('nianlong/memsum-arxiv-summarization', local_dir = "model/memsum-arxiv" )
67
+
68
+ ## download model checkpoint on the PubMed dataset
69
+ snapshot_download('nianlong/memsum-pubmed-summarization', local_dir = "model/memsum-pubmed" )
70
+
71
+ ## download model checkpoint on the Gov-Report dataset
72
+ snapshot_download('nianlong/memsum-gov-report-summarization', local_dir = "model/memsum-gov-report" )
73
+ ```
74
+
75
+ ## Testing Pretrained Model on a Given Dataset
76
+
77
+ For example, the following commands test the performance of the full MemSum model. Before running this code, make sure the current working directory is the main directory "MemSum/", so that the import `from src.summarizer import MemSum` resolves.
78
+
79
+
80
+ ```python
81
+ from src.summarizer import MemSum
82
+ from tqdm import tqdm
83
+ from rouge_score import rouge_scorer
84
+ import json
85
+ import numpy as np
86
+ ```
87
+
88
+
89
+ ```python
90
+ rouge_cal = rouge_scorer.RougeScorer(['rouge1','rouge2', 'rougeLsum'], use_stemmer=True)
91
+
92
+ memsum_arxiv = MemSum( "model/memsum-arxiv/model.pt",
93
+ "model/word_embedding/vocabulary_200dim.pkl",
94
+ gpu = 0 , max_doc_len = 500 )
95
+
96
+ memsum_pubmed = MemSum( "model/memsum-pubmed/model.pt",
97
+ "model/word_embedding/vocabulary_200dim.pkl",
98
+ gpu = 0 , max_doc_len = 500 )
99
+
100
+ memsum_gov_report = MemSum( "model/memsum-gov-report/model.pt",
101
+ "model/word_embedding/vocabulary_200dim.pkl",
102
+ gpu = 0 , max_doc_len = 500 )
103
+ ```
104
+
105
+
106
+ ```python
107
+ test_corpus_arxiv = [ json.loads(line) for line in open("data/arxiv/test.jsonl") ]
108
+ test_corpus_pubmed = [ json.loads(line) for line in open("data/pubmed/test.jsonl") ]
109
+ test_corpus_gov_report = [ json.loads(line) for line in open("data/gov-report/test.jsonl") ]
110
+ ```
111
+
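The list comprehensions above leave the file handles open and assume every line is a valid record. A minimal sketch of a more defensive loader follows. It assumes, based on how `data["text"]` and `data["summary"]` are used in this README, that each record holds two lists of sentence strings. The helper names `load_jsonl` and `check_record` and the sample record are made up for illustration and are not part of the MemSum code.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def check_record(record):
    """Return True if 'text' and 'summary' are both lists of strings (assumed format)."""
    for key in ("text", "summary"):
        value = record.get(key)
        if not isinstance(value, list) or not all(isinstance(s, str) for s in value):
            return False
    return True


# Tiny made-up file standing in for data/<dataset_name>/test.jsonl.
sample = {"text": ["first sentence .", "second sentence ."], "summary": ["a summary ."]}
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "test.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(sample) + "\n")
    corpus = load_jsonl(sample_path)

print(len(corpus), check_record(corpus[0]))
```

Running `check_record` over a downloaded corpus before evaluation is one way to catch malformed lines early.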
112
+ ### Evaluation on ROUGE
113
+
114
+
115
+ ```python
116
+ def evaluate( model, corpus, p_stop, max_extracted_sentences, rouge_cal ):
117
+     scores = []
118
+     for data in tqdm(corpus):
119
+         gold_summary = data["summary"]
120
+         extracted_summary = model.extract( [data["text"]], p_stop_thres = p_stop, max_extracted_sentences_per_document = max_extracted_sentences )[0]
121
+
122
+         score = rouge_cal.score( "\n".join( gold_summary ), "\n".join(extracted_summary) )
123
+         scores.append( [score["rouge1"].fmeasure, score["rouge2"].fmeasure, score["rougeLsum"].fmeasure ] )
124
+
125
+     return np.asarray(scores).mean(axis = 0)
126
+ ```
127
+
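To make the numbers returned by `evaluate` easier to interpret, here is a simplified sketch of the underlying idea: compute an F1 score per document from unigram overlap, then average over the corpus. This is not the `rouge_score` implementation (it does no stemming, uses plain whitespace tokenization, and covers only a ROUGE-1-style score). The names `rouge1_f1` and `mean_score` and the example strings are hypothetical.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: each token counts at most as often as it occurs in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mean_score(pairs):
    """Average per-document scores, analogous to the mean(axis = 0) step in evaluate."""
    scores = [rouge1_f1(ref, cand) for ref, cand in pairs]
    return sum(scores) / len(scores)


pairs = [
    ("the cat sat on the mat", "the cat sat"),
    ("anxiety impairs working memory", "sleep improves mood"),
]
print(mean_score(pairs))
```

The first pair has perfect precision but only half the recall, and the second pair shares no tokens, so the corpus-level mean is pulled down by the document with no overlap.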
128
+
129
+ ```python
130
+ evaluate( memsum_arxiv, test_corpus_arxiv, 0.5, 5, rouge_cal )
131
+ ```
132
+
133
+ 100%|████████████████████████████████████████████████████████████| 6440/6440 [08:00<00:00, 13.41it/s]
134
+
135
+ array([0.47946925, 0.19970128, 0.42075852])
136
+
137
+
138
+ ```python
139
+ evaluate( memsum_pubmed, test_corpus_pubmed, 0.6, 7, rouge_cal )
140
+ ```
141
+
142
+ 100%|████████████████████████████████████████████████████████████| 6658/6658 [09:22<00:00, 11.84it/s]
143
+
144
+ array([0.49260137, 0.22916328, 0.44415123])
145
+
146
+
147
+ ```python
148
+ evaluate( memsum_gov_report, test_corpus_gov_report, 0.6, 22, rouge_cal )
149
+ ```
150
+
151
+ 100%|████████████████████████████████████████████████████████████| 973/973 [04:33<00:00, 3.55it/s]
152
+
153
+ array([0.59445629, 0.28507926, 0.56677073])
154
+
155
+
156
+
157
+ ### Summarization Examples
158
+
159
+ Given a document with a list of sentences, e.g.:
160
+
161
+
162
+ ```python
163
+ document = test_corpus_pubmed[0]["text"]
164
+ ```
165
+
166
+ We can summarize this document extractively by:
167
+
168
+
169
+ ```python
170
+ extracted_summary = memsum_pubmed.extract( [ document ],
171
+ p_stop_thres = 0.6,
172
+ max_extracted_sentences_per_document = 7
173
+ )[0]
174
+ extracted_summary
175
+ ```
176
+
177
+
178
+
179
+
180
+ ['more specifically , we found that pd patients with anxiety were more impaired on the trail making test part b which assessed attentional set - shifting , on both digit span tests which assessed working memory and attention , and to a lesser extent on the logical memory test which assessed memory and new verbal learning compared to pd patients without anxiety . taken together ,',
181
+ 'this study is the first to directly compare cognition between pd patients with and without anxiety .',
182
+ 'results from this study showed selective verbal memory deficits in rpd patients with anxiety compared to rpd without anxiety , whereas lpd patients with anxiety had greater attentional / working memory deficits compared to lpd without anxiety .',
183
+ 'given that research on healthy young adults suggests that anxiety reduces processing capacity and impairs processing efficiency , especially in the central executive and attentional systems of working memory [ 26 , 27 ] , we hypothesized that pd patients with anxiety would show impairments in attentional set - shifting and working memory compared to pd patients without anxiety .',
184
+ 'the findings confirmed our hypothesis that anxiety negatively influences attentional set - shifting and working memory in pd .',
185
+ 'seventeen pd patients with anxiety and thirty - three pd patients without anxiety were included in this study ( see table 1 ) .']
186
+
187
+
188
+
189
+
193
+
194
+ We can also get the indices of the extracted sentences in the original document:
195
+
196
+
197
+ ```python
198
+ extracted_summary_batch, extracted_indices_batch = memsum_pubmed.extract( [ document ],
199
+ p_stop_thres = 0.6,
200
+ max_extracted_sentences_per_document = 7,
201
+ return_sentence_position=1
202
+ )
203
+ ```
204
+
205
+
206
+ ```python
207
+ extracted_summary_batch[0]
208
+ ```
209
+
210
+
211
+
212
+
213
+ ['more specifically , we found that pd patients with anxiety were more impaired on the trail making test part b which assessed attentional set - shifting , on both digit span tests which assessed working memory and attention , and to a lesser extent on the logical memory test which assessed memory and new verbal learning compared to pd patients without anxiety . taken together ,',
214
+ 'this study is the first to directly compare cognition between pd patients with and without anxiety .',
215
+ 'results from this study showed selective verbal memory deficits in rpd patients with anxiety compared to rpd without anxiety , whereas lpd patients with anxiety had greater attentional / working memory deficits compared to lpd without anxiety .',
216
+ 'given that research on healthy young adults suggests that anxiety reduces processing capacity and impairs processing efficiency , especially in the central executive and attentional systems of working memory [ 26 , 27 ] , we hypothesized that pd patients with anxiety would show impairments in attentional set - shifting and working memory compared to pd patients without anxiety .',
217
+ 'the findings confirmed our hypothesis that anxiety negatively influences attentional set - shifting and working memory in pd .',
218
+ 'seventeen pd patients with anxiety and thirty - three pd patients without anxiety were included in this study ( see table 1 ) .']
219
+
220
+
221
+
222
+
223
+ ```python
224
+ extracted_indices_batch[0]
225
+ ```
226
+
227
+
228
+
229
+
230
+ [50, 48, 70, 14, 49, 16]
231
+
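The indices above are not sorted, which suggests they are listed in the order in which the sentences were selected rather than in document order. If a summary that follows the flow of the source document is preferred, the sentences can be reordered by index. The sketch below uses placeholder sentence strings, and the helper name `in_document_order` is made up for illustration.

```python
def in_document_order(sentences, indices):
    """Reorder extracted sentences by their position in the source document."""
    paired = sorted(zip(indices, sentences), key=lambda pair: pair[0])
    ordered_sentences = [sentence for _, sentence in paired]
    ordered_indices = [index for index, _ in paired]
    return ordered_sentences, ordered_indices


extracted_indices = [50, 48, 70, 14, 49, 16]
# Placeholder strings standing in for extracted_summary_batch[0].
extracted_sentences = [f"sentence {i}" for i in extracted_indices]

ordered_sentences, ordered_indices = in_document_order(extracted_sentences, extracted_indices)
print(ordered_indices)    # [14, 16, 48, 49, 50, 70]
print(ordered_sentences[0])  # sentence 14
```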
232
+
233
+
234
+
238
+
239
+ ## Training MemSum
240
+
241
+ Please refer to the documentation [Training_Pipeline.md](Training_Pipeline.md) for the complete pipeline of training MemSum on a custom dataset.
242
+
243
+ You can also run the training pipeline directly on Google Colab: <a href="https://colab.research.google.com/github/nianlonggu/MemSum/blob/main/Training_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
244
+
245
+
249
+
250
+ ## Updates
251
+
252
+ ### Update 09-02-2023: Released the dataset for human evaluation (comparing MemSum with NeuSum).
253
+ The data is available in the folder human_eval_results/. It contains the samples we used for human evaluation and the records of participants' labelling.
254
+
255
+ We also released a Colab notebook that contains the interface for conducting the human evaluation. It can be used for reproducibility tests.
256
+
257
+ Documentation: [MemSum_Human_Evaluation.md](MemSum_Human_Evaluation.md)
258
+
259
+ Run it on Google Colab (recommended): <a href="https://colab.research.google.com/github/nianlonggu/MemSum/blob/main/MemSum_Human_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
260
+
261
+ ![human evaluation interface](images/human_evaluation_interface.png)
262
+
263
+
264
+ ### Update 28-07-2022: Code for obtaining the greedy summary of a document
265
+
266
+
267
+ ```python
268
+ from data_preprocessing.utils import greedy_extract
269
+ import json
270
+ test_corpus_custom_data = [ json.loads(line) for line in open("data/custom_data/test.jsonl")]
271
+ example_data = test_corpus_custom_data[0]
272
+ ```
273
+
274
+
275
+ ```python
276
+ example_data.keys()
277
+ ```
278
+
279
+
280
+
281
+
282
+ dict_keys(['text', 'summary'])
283
+
284
+
285
+
286
+ We can extract the oracle summary by calling the function `greedy_extract` with `beamsearch_size = 1`.
287
+
288
+
289
+ ```python
290
+ greedy_extract( example_data["text"], example_data["summary"], beamsearch_size = 1 )[0]
291
+ ```
292
+
293
+
294
+
295
+
296
+ [[50, 13, 41, 24, 31, 0, 3, 48], 0.4563635838327488]
297
+
298
+
299
+
300
+ Here the first element is a list of sentence indices in the document, and the second element is the average of the ROUGE F1 scores.
301
+
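The general idea behind a greedy oracle can be sketched as follows: repeatedly add the sentence that most improves the score against the gold summary, and stop when no sentence helps. This is a toy illustration under stated assumptions, not the repository's `greedy_extract`. It scores with a simplified unigram F1 instead of an average of ROUGE F1 scores, and the names `rouge1_f1` and `greedy_oracle`, the `max_sentences` parameter, and the example sentences are all hypothetical.

```python
from collections import Counter


def rouge1_f1(reference_tokens, candidate_tokens):
    """Simplified unigram-overlap F1 (no stemming), used as a stand-in score."""
    overlap = sum((Counter(reference_tokens) & Counter(candidate_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(sentences, summary_sentences, max_sentences=3):
    """Greedily pick sentence indices while the score keeps improving."""
    reference = " ".join(summary_sentences).split()
    selected, best_score = [], 0.0
    while len(selected) < max_sentences:
        best_candidate = None
        for idx in range(len(sentences)):
            if idx in selected:
                continue
            tokens = " ".join(sentences[i] for i in selected + [idx]).split()
            score = rouge1_f1(reference, tokens)
            if score > best_score:
                best_candidate, best_score = idx, score
        if best_candidate is None:  # no remaining sentence improves the score
            break
        selected.append(best_candidate)
    return [selected, best_score]


sentences = [
    "the weather was mild that day",
    "anxiety impairs working memory",
    "patients were recruited from clinics",
    "attention deficits were also observed",
]
summary = ["anxiety impairs working memory", "attention deficits were also observed"]
result = greedy_oracle(sentences, summary)
print(result)
```

As in the output shown above, the result is a pair of selected indices (in selection order) and the score reached by that selection.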
302
+
306
+
307
+ ### References
308
+ When using our code or models for your application, please cite the following paper:
309
+ ```
310
+ @inproceedings{gu-etal-2022-memsum,
311
+ title = "{M}em{S}um: Extractive Summarization of Long Documents Using Multi-Step Episodic {M}arkov Decision Processes",
312
+ author = "Gu, Nianlong and
313
+ Ash, Elliott and
314
+ Hahnloser, Richard",
315
+ booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
316
+ month = may,
317
+ year = "2022",
318
+ address = "Dublin, Ireland",
319
+ publisher = "Association for Computational Linguistics",
320
+ url = "https://aclanthology.org/2022.acl-long.450",
321
+ pages = "6507--6522",
322
+ abstract = "We introduce MemSum (Multi-step Episodic Markov decision process extractive SUMmarizer), a reinforcement-learning-based extractive summarizer enriched at each step with information on the current extraction history. When MemSum iteratively selects sentences into the summary, it considers a broad information set that would intuitively also be used by humans in this task: 1) the text content of the sentence, 2) the global text context of the rest of the document, and 3) the extraction history consisting of the set of sentences that have already been extracted. With a lightweight architecture, MemSum obtains state-of-the-art test-set performance (ROUGE) in summarizing long documents taken from PubMed, arXiv, and GovReport. Ablation studies demonstrate the importance of local, global, and history information. A human evaluation confirms the high quality and low redundancy of the generated summaries, stemming from MemSum{'}s awareness of extraction history.",
323
+ }
324
+ ```
325
+