---
license: mit
language:
- en
base_model:
- facebook/bart-large-cnn
pipeline_tag: summarization
library_name: transformers
tags:
- movies
- books
- quotes
- quote-detection
- extractive
---

# BART-base-quotes

BART \[1\] fine-tuned for extractive summarization on a dataset of movie and book quotes.

The model was obtained by continued fine-tuning from the BART-base checkpoint.

**Compare**: The larger model [BART-large-quotes](https://huggingface.co/ChrisBridges/bart-large-quotes) achieves slightly better ROUGE scores but favors much longer quotes (~4x the length on average).

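## Usage

The model can be loaded with the standard `transformers` summarization pipeline. A minimal sketch follows; the repository id is an assumption inferred from the model name, so adjust it to the actual repository.

```python
from transformers import pipeline

# Repository id inferred from the model name (assumption); adjust if it differs.
MODEL_ID = "ChrisBridges/bart-base-quotes"

summarizer = pipeline("summarization", model=MODEL_ID)

# A context passage surrounding a potential quote, e.g. a few sentences
# from a movie script or a book chapter.
context = (
    "He looked at the crowd and paused. Frankly, my dear, I don't give a damn. "
    "Then he turned and walked out into the fog without another word."
)

# max_length mirrors the 128-token output cap used during training.
result = summarizer(context, max_length=128)
print(result[0]["summary_text"])
```

Since the model was fine-tuned for extractive quote detection, the output should typically be a short, quote-like span of the input.
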
## Training Description

### Dataset

The model was trained on 11295 quotes, comprising 6280 movie quotes from the Cornell Movie Quotes dataset \[2\] and 5015 book quotes from the T50 dataset \[3\].
As described in the T50 paper, each movie quote is accompanied by a context of 4 sentences on each side, while 10 sentences per side are used for book quotes.
Training/development/test splits in the proportions 7:1:2 were created with stratified sampling (one way to reproduce such a split is sketched after the tables below).
The tables below report the sample sizes of each data split and the length statistics of the contexts and quotes.

| Split | Total | Movie | Book |
| ----- | ----- | ----- | ---- |
| Train | 7906  | 4396  | 3510 |
| Dev   | 1130  | 628   | 502  |
| Test  | 2259  | 1256  | 1003 |

| Data          | min | median | max  | mean ± std      |
| ------------- | --- | ------ | ---- | --------------- |
| Movie Context | 38  | 148    | 3358 | 167.13 ± 102.57 |
| Movie Quote   | 5   | 20     | 592  | 28.22 ± 27.79   |
| T50 Context   | 86  | 628    | 6497 | 659.14 ± 310.49 |
| T50 Quote     | 6   | 41     | 877  | 61.87 ± 63.89   |
| Total Context | 38  | 246    | 6497 | 385.58 ± 329.26 |
| Total Quote   | 5   | 26     | 877  | 43.16 ± 50.21   |

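The exact splitting code is not published with this card; the sketch below shows one plausible way to create a 7:1:2 stratified split with scikit-learn, stratifying on the quote source (movie vs. book), which is an assumption.

```python
from sklearn.model_selection import train_test_split

def make_splits(examples, sources, seed=42):
    """7:1:2 train/dev/test split, stratified by quote source (movie vs. book)."""
    # Hold out 20% of the data as the test set.
    train_dev, test, src_train_dev, _ = train_test_split(
        examples, sources, test_size=0.2, stratify=sources, random_state=seed
    )
    # Split the remaining 80% into train (70% of total) and dev (10% of total).
    train, dev = train_test_split(
        train_dev, test_size=0.125, stratify=src_train_dev, random_state=seed
    )
    return train, dev, test
```
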
### Parameters

Each experiment uses a maximum input length of 1024 tokens and a maximum output length of 128 tokens, to account for the short average length of quotes.
While quote lengths vary considerably, we are primarily interested in short, poignant statements.

Each BART model is trained with a batch size of 32 for 30 epochs (7440 steps) using AdamW with 0.01 weight decay and a linearly annealed learning rate peaking at 5e-5.
The first 5% of steps, i.e., 1.5 epochs, are used for linear warmup. The model is evaluated every 500 steps on ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum.
After training, the checkpoint with the best eval_rougeL is loaded, to prefer extractive over abstractive summaries. FP16 mixed precision is used throughout.

In addition, we evaluate T5-base \[4\], trained with a batch size of 8 (29670 steps) due to its greater memory footprint, and a peak learning rate of 3e-4.
The learning rates were chosen empirically from shorter training runs of 5 epochs.

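For reference, these hyperparameters correspond roughly to the following `Seq2SeqTrainingArguments`. This is an illustrative sketch, not the exact training script; the argument names assume a recent `transformers` release, and the output directory is made up.

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative configuration for the BART runs described above.
# Inputs are truncated to 1024 tokens at tokenization time.
training_args = Seq2SeqTrainingArguments(
    output_dir="bart-base-quotes",   # hypothetical
    num_train_epochs=30,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=5e-5,              # 3e-4 for the T5-base run
    weight_decay=0.01,               # AdamW weight decay
    lr_scheduler_type="linear",
    warmup_ratio=0.05,               # first 5% of steps (~1.5 epochs)
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="rougeL",  # keep the most extractive checkpoint
    predict_with_generate=True,
    generation_max_length=128,       # maximum output length
    fp16=True,
)
```
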
### Evaluation

Since no data splits were published with the T50 paper \[3\], we cannot directly reproduce its results; instead, we evaluate on the data splits described above.
Rather than using the whole test set at once, we split it into 3 equally sized, disjoint random samples of 753 examples each.
Each model is evaluated on all 3 samples, and we report the mean and 95% confidence interval of every score.
Additionally, we report the average predicted quote length, the number of epochs, and the total training time.

| Model          | ROUGE-1         | ROUGE-2         | ROUGE-L         | ROUGE-Lsum      | Avg Quote Length | Epochs | Time    |
| -------------- | --------------- | --------------- | --------------- | --------------- | ---------------- | ------ | ------- |
| T5-base        | 0.3758 ± 0.0175 | 0.2990 ± 0.0128 | 0.3628 ± 0.0189 | 0.3684 ± 0.0201 | 18.1576 ± 0.1084 | 1.01   | 3:39:14 |
| BART-base      | 0.4236 ± 0.0133 | 0.3498 ± 0.0116 | 0.4112 ± 0.0135 | 0.4165 ± 0.0107 | 19.1027 ± 0.1755 | 12.10  | 0:44:48 |
| BART-large     | 0.4252 ± 0.0240 | 0.3456 ± 0.0204 | 0.4115 ± 0.0206 | 0.4171 ± 0.0209 | 19.2877 ± 0.1819 | 6.05   | 2:43:56 |
| BART-large-cnn | 0.4384 ± 0.0225 | 0.3693 ± 0.0197 | 0.4165 ± 0.0239 | 0.4317 ± 0.0234 | 81.8623 ± 1.5324 | 28.23  | 3:48:24 |

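The per-sample aggregation can be reproduced with the `evaluate` library. The sketch below is illustrative: it assumes predictions and references have already been generated for each of the three test samples, and the confidence-interval formula is an assumption since the exact method is not stated here.

```python
import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def rouge_with_ci(predictions_per_sample, references_per_sample, key="rougeL"):
    """Mean and 95% confidence interval of a ROUGE score over the test samples."""
    scores = [
        rouge.compute(predictions=preds, references=refs)[key]
        for preds, refs in zip(predictions_per_sample, references_per_sample)
    ]
    mean = float(np.mean(scores))
    # Normal-approximation 95% CI over n=3 samples (assumption); a t-based
    # interval would be wider for so few samples.
    half_width = 1.96 * float(np.std(scores, ddof=1)) / np.sqrt(len(scores))
    return mean, half_width
```
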
## References

\[1\] [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)

\[2\] [You Had Me at Hello: How Phrasing Affects Memorability](https://aclanthology.org/P12-1094/)

\[3\] [Quote Detection: A New Task and Dataset for NLP](https://aclanthology.org/2023.latechclfl-1.3/)

\[4\] [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)