Update README.md

README.md

The model was trained on 11295 quotes, comprising 6280 movie quotes from the Cornell Movie Quotes dataset \[2\] and 5015 book quotes from the T50 dataset \[3\].
As described in the T50 paper, each movie quote is accompanied by a context of 4 sentences each on the left and the right, while 10 sentences are used for book quotes.
Training/Development/Test splits of proportions 7:1:2 were created with stratified sampling.
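
A minimal sketch of how such a split can be produced, assuming scikit-learn, a pandas frame, and a hypothetical `source` column marking movie vs. book quotes (the project's actual preprocessing code may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 11295 quotes; "source" is the stratification key.
quotes = pd.DataFrame({
    "context": ["..."] * 100,
    "quote": ["..."] * 100,
    "source": ["movie"] * 56 + ["book"] * 44,
})

# 7:1:2 proportions: 70% train, then the remaining 30% into 10% dev / 20% test.
train, rest = train_test_split(quotes, train_size=0.7, stratify=quotes["source"], random_state=42)
dev, test = train_test_split(rest, train_size=1 / 3, stratify=rest["source"], random_state=42)
```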

The tables below report the sample sizes in each data split and the length statistics of the contexts and quotes in each sample.

| Split | Total | Movie | Book |
| ----- | ----- | ----- | ---- |

### Parameters

Each experiment uses a max input length of 1024 and a max output length of 128 to account for the short average length of quotes.
While there is significant variance in the length of quotes, poignant statements are of the most interest.
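
The tokenization step presumably truncates contexts and quotes to these limits; a hedged sketch (the checkpoint ID and variable names are illustrative, not taken from the project):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; the README does not name the exact base models.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

contexts = ["Four sentences of left context [...] four sentences of right context."]
quotes = ["The quote to be extracted."]

# Inputs are capped at 1024 tokens and targets at 128, matching the limits above.
model_inputs = tokenizer(contexts, max_length=1024, truncation=True)
labels = tokenizer(text_target=quotes, max_length=128, truncation=True)
```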

Each BART model is trained with a batch size of 32 for 30 epochs (7440 steps) using AdamW with 0.01 weight decay and a linearly annealing learning rate of 5e-5.
The first 5% of steps, i.e., 1.5 epochs, are used for a linear warmup. The model is evaluated every 500 steps w.r.t. ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum.
After training, the checkpoint with the best `eval_rougeL` is loaded to prefer extractive over abstractive summarization. FP16 mixed precision is used.
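
If the models are trained with the Hugging Face `Seq2SeqTrainer` (an assumption; the repository's training script may be structured differently), the BART configuration above roughly maps to:

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="bart-quote-extraction",   # hypothetical directory name
    per_device_train_batch_size=32,
    num_train_epochs=30,                  # 7440 steps at batch size 32
    learning_rate=5e-5,
    weight_decay=0.01,                    # AdamW is the Trainer's default optimizer
    lr_scheduler_type="linear",           # linearly annealing learning rate
    warmup_ratio=0.05,                    # first 5% of steps (~1.5 epochs)
    evaluation_strategy="steps",          # `eval_strategy` in newer transformers
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_rougeL",  # prefer the most extractive checkpoint
    fp16=True,
    predict_with_generate=True,
    generation_max_length=128,
)
```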

In addition, T5-base \[4\] is evaluated. Due to its greater memory footprint, it is trained with a batch size of 8 (29670 steps) and a peak learning rate of 3e-4.

The learning rates were chosen empirically on shorter training runs of 5 epochs.

### Evaluation

Since no data splits were published with the T50 paper \[3\], the results are not fully reproducible, and the models are evaluated on the data splits described above.
Rather than using the whole test set at once for evaluation, it is split into 3 equally sized, disjoint random samples of 753 examples each.
Each model is evaluated on all 3 samples, and the mean and 95% confidence interval of each score are reported below.
Additionally, the table includes the average predicted quote length, the number of epochs of the best training checkpoint, and the total training time.
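
The aggregation could look like the following minimal sketch (the per-sample scores are made up, and the README does not state the exact confidence-interval formula; Student's t is one common choice for n = 3):

```python
import statistics
from scipy import stats

rouge_l = [34.2, 33.8, 34.9]  # hypothetical ROUGE-L scores on the 3 test samples
n = len(rouge_l)
mean = statistics.mean(rouge_l)
sem = statistics.stdev(rouge_l) / n ** 0.5    # standard error of the mean
half_width = stats.t.ppf(0.975, n - 1) * sem  # 95% CI half-width, n - 1 dof
print(f"ROUGE-L: {mean:.2f} ± {half_width:.2f}")
```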

| Checkpoint | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | Avg Quote Length | Epochs | Time |
| ---------- | ------- | ------- | ------- | ---------- | ---------------- | ------ | ---- |