Update README.md

README.md

The model was trained on 11295 quotes, comprising 6280 movie quotes from the Cornell Movie Quotes dataset \[2\] and 5015 book quotes from the T50 dataset \[3\].
As described in the T50 paper, each movie quote is accompanied by a context of 4 sentences each on the left and the right, while 10 sentences are used for book quotes.
Training/Development/Test splits of proportions 7:1:2 were created with stratified sampling.
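
A minimal sketch of how such a split can be produced, assuming scikit-learn, a pandas frame, and a hypothetical `source` column marking movie vs. book quotes (the project's actual preprocessing code may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 11295 quotes; "source" is the stratification key.
quotes = pd.DataFrame({
    "context": ["..."] * 100,
    "quote": ["..."] * 100,
    "source": ["movie"] * 56 + ["book"] * 44,
})

# 7:1:2 proportions: 70% train, then the remaining 30% into 10% dev / 20% test.
train, rest = train_test_split(quotes, train_size=0.7, stratify=quotes["source"], random_state=42)
dev, test = train_test_split(rest, train_size=1 / 3, stratify=rest["source"], random_state=42)
```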

The tables below report the sample sizes in each data split and the length statistics of the contexts and quotes in each sample.

| Split | Total | Movie | Book |
| ----- | ----- | ----- | ---- |

### Parameters

Each experiment uses a max input length of 1024 and a max output length of 128 to account for the short average length of quotes.
While there is significant variance in the length of quotes, poignant statements are of the most interest.
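
The tokenization step presumably truncates contexts and quotes to these limits; a hedged sketch (the checkpoint ID and variable names are illustrative, not taken from the project):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; the README does not name the exact base models.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

contexts = ["Four sentences of left context [...] four sentences of right context."]
quotes = ["The quote to be extracted."]

# Inputs are capped at 1024 tokens and targets at 128, matching the limits above.
model_inputs = tokenizer(contexts, max_length=1024, truncation=True)
labels = tokenizer(text_target=quotes, max_length=128, truncation=True)
```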

Each BART model is trained with a batch size of 32 for 30 epochs (7440 steps) using AdamW with 0.01 weight decay and a linearly annealing learning rate of 5e-5.
The first 5% of steps, i.e., 1.5 epochs, are used for a linear warmup. The model is evaluated every 500 steps w.r.t. ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum.
After training, the checkpoint with the best `eval_rougeL` is loaded to prefer extractive over abstractive summarization. FP16 mixed precision is used.
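
If the models are trained with the Hugging Face `Seq2SeqTrainer` (an assumption; the repository's training script may be structured differently), the BART configuration above roughly maps to:

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="bart-quote-extraction",   # hypothetical directory name
    per_device_train_batch_size=32,
    num_train_epochs=30,                  # 7440 steps at batch size 32
    learning_rate=5e-5,
    weight_decay=0.01,                    # AdamW is the Trainer's default optimizer
    lr_scheduler_type="linear",           # linearly annealing learning rate
    warmup_ratio=0.05,                    # first 5% of steps (~1.5 epochs)
    evaluation_strategy="steps",          # `eval_strategy` in newer transformers
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_rougeL",  # prefer the most extractive checkpoint
    fp16=True,
    predict_with_generate=True,
    generation_max_length=128,
)
```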

In addition, T5-base \[4\] is evaluated. Due to its greater memory footprint, it is trained with a batch size of 8 (29670 steps) and a peak learning rate of 3e-4.

The learning rates were chosen empirically on shorter training runs of 5 epochs.

### Evaluation

Since no data splits were published with the T50 paper \[3\], the results are not fully reproducible, and the models are evaluated on the data splits described above.
Rather than using the whole test set at once for evaluation, it is split into 3 equally sized, disjoint random samples of 753 examples each.
Each model is evaluated on all 3 samples, and the mean and 95% confidence interval of each score are reported below.
Additionally, the table includes the average predicted quote length, the number of epochs of the best training checkpoint, and the total training time.
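
The aggregation could look like the following minimal sketch (the per-sample scores are made up, and the README does not state the exact confidence-interval formula; Student's t is one common choice for n = 3):

```python
import statistics
from scipy import stats

rouge_l = [34.2, 33.8, 34.9]  # hypothetical ROUGE-L scores on the 3 test samples
n = len(rouge_l)
mean = statistics.mean(rouge_l)
sem = statistics.stdev(rouge_l) / n ** 0.5    # standard error of the mean
half_width = stats.t.ppf(0.975, n - 1) * sem  # 95% CI half-width, n - 1 dof
print(f"ROUGE-L: {mean:.2f} ± {half_width:.2f}")
```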

| Checkpoint | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | Avg Quote Length | Epochs | Time |
| ---------- | ------- | ------- | ------- | ---------- | ---------------- | ------ | ---- |