ChrisBridges committed · verified · Commit b7648b2 · 1 Parent(s): c4aeb82

Update README.md

Files changed (1): README.md (+7 -7)
README.md CHANGED

@@ -30,7 +30,7 @@ but favors much longer quotes (~4x length on average).
  The model was trained on 11295 quotes, comprising 6280 movie quotes from the Cornell Movie Quotes dataset \[2\] and 5015 book quotes from the T50 dataset \[3\].
  As described in the T50 paper, each movie quote is accompanied by a context of 4 sentences each on the left and the right, while 10 sentences are used for book quotes.
  Training/Development/Test splits of proportions 7:1:2 were created with stratified sampling.
- In the tables below, we report the sample sizes in each data split and the length statistics of the contexts and quotes in each sample.
+ The tables below report the sample sizes in each data split and the length statistics of the contexts and quotes in each sample.

  | Split | Total | Movie | Book |
  | ----- | ----- | ----- | ---- |
@@ -50,22 +50,22 @@ In the tables below, we report the sample sizes in each data split and the length statistics of the contexts and quotes in each sample.
  ### Parameters

  Each experiment uses a max input length of 1024 and a max output length of 128 to account for the short average length of quotes.
- While there is a significant variance in the length of quotes, we are more interested in poignant statements.
+ While there is a significant variance in the length of quotes, poignant statements are of the most interest.

  Each BART model is trained with a batch size of 32 for 30 epochs (7440 steps) using AdamW with 0.01 weight decay and a linearly annealing learning rate of 5e-5.
  The first 5% of steps, i.e., 1.5 epochs, are used for a linear warmup. The model is evaluated every 500 steps w.r.t. ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum.
  After the training, the checkpoint with the best eval_rougeL is loaded to prefer extractive over abstractive summarization. FP16 mixed precision is used.

- In addition, we evaluate T5-base \[4\] with a batch size of 8 (29670 steps) due to the greater memory footprint, and a peak learning rate of 3e-4.
+ In addition, T5-base \[4\] is evaluated with a batch size of 8 (29670 steps) due to the greater memory footprint, and a peak learning rate of 3e-4.

  The learning rates were chosen empirically on shorter training runs of 5 epochs.

  ### Evaluation

- Since no data splits were published with the T50 paper \[3\], we cannot reproduce the results and evaluate on the previously described training data.
- Rather than using the whole test set at once for evaluation, we split it into 3 equally-sized disjoint random samples of size 753.
- Each model is evaluated on all 3 samples, and we report the mean scores and 95% confidence interval for all scores.
- Additionally, we report the average predicted quote length, the number of epochs, and the total training time.
+ Since no data splits were published with the T50 paper \[3\], the results are not fully reproducible, and models are evaluated on the previously described training data.
+ Rather than using the whole test set at once for evaluation, it is split into 3 equally-sized disjoint random samples of size 753.
+ Each model is evaluated on all 3 samples, and the mean scores and 95% confidence interval for all scores are reported below.
+ Additionally, the table includes the average predicted quote length, the number of epochs of the best training checkpoint, and the total training time.

  | Checkpoint | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | Avg Quote Length | Epochs | Time |
  | -------------- | --------------- | --------------- | --------------- | --------------- | ---------------- | ------ | ------- |
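For readers reproducing the 7:1:2 stratified split mentioned in the diff, the sketch below shows one way to build it with scikit-learn. This is illustrative only, not the authors' preprocessing; stratifying by quote source (movie vs. book) is an assumption that fits the per-source columns in the README's split table, and the function and seed names are hypothetical.

```python
# Sketch of a stratified 7:1:2 Train/Dev/Test split, assuming
# stratification by quote source (movie vs. book). Illustrative
# only; not the authors' code.
from sklearn.model_selection import train_test_split

def split_7_1_2(examples, sources, seed=42):
    """Return train/dev/test lists in 7:1:2 proportions, stratified by source."""
    # First peel off the 20% test portion, keeping source labels for the rest...
    rest, test, rest_sources, _ = train_test_split(
        examples, sources, test_size=0.2, stratify=sources, random_state=seed
    )
    # ...then split the remaining 80% into 70/10, i.e. dev is 1/8 of the rest.
    train, dev = train_test_split(
        rest, test_size=0.125, stratify=rest_sources, random_state=seed
    )
    return train, dev, test
```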
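The training setup in the Parameters section maps naturally onto the Hugging Face `transformers` Trainer API. The following is a minimal sketch, not the authors' actual script: only the hyperparameters named in the README (batch size, epochs, AdamW weight decay, peak learning rate, linear decay with 5% warmup, evaluation every 500 steps, best checkpoint by eval_rougeL, FP16, max output length 128) come from the source; the output directory name is invented, and on transformers versions before 4.41 the `eval_strategy` argument is spelled `evaluation_strategy`.

```python
# Hypothetical sketch of the BART fine-tuning configuration described
# above; hyperparameters mirror the README, everything else is assumed.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="bart-quote-extraction",   # hypothetical name
    per_device_train_batch_size=32,       # BART runs; T5-base used 8 (29670 steps)
    num_train_epochs=30,                  # 7440 steps at batch size 32
    learning_rate=5e-5,                   # peak LR; 3e-4 for T5-base
    weight_decay=0.01,                    # AdamW weight decay
    lr_scheduler_type="linear",           # linearly annealing learning rate
    warmup_ratio=0.05,                    # first 5% of steps, i.e. ~1.5 epochs
    eval_strategy="steps",                # "evaluation_strategy" on older versions
    eval_steps=500,                       # ROUGE evaluated every 500 steps
    save_steps=500,                       # checkpoints aligned with evaluations
    load_best_model_at_end=True,
    metric_for_best_model="eval_rougeL",  # prefer extractive over abstractive
    predict_with_generate=True,
    generation_max_length=128,            # max output length for quotes
    fp16=True,                            # mixed-precision training
)
```

Note that the max input length of 1024 is not a `TrainingArguments` field; it would be applied when tokenizing the contexts (e.g. `max_length=1024, truncation=True`).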
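The mean-and-confidence-interval reporting in the Evaluation section can be computed with a few lines of SciPy. A generic sketch under stated assumptions, not the authors' code: each metric is assumed to have been computed once per disjoint test sample (n = 3), and a Student-t interval is used, the usual choice for so few replicates; the example scores are dummy values.

```python
# Sketch of the reported statistics: mean and 95% confidence interval
# over the 3 disjoint test samples of size 753. Generic code; the
# example ROUGE-L values below are made up for illustration.
import numpy as np
from scipy import stats

def mean_with_ci(scores, confidence=0.95):
    """Mean and half-width of a Student-t confidence interval."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = stats.sem(scores)  # standard error of the mean (ddof=1)
    half_width = sem * stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
    return mean, half_width

# e.g. ROUGE-L measured on the three 753-example test shards (dummy values)
rouge_l_per_shard = [42.1, 41.4, 42.9]
mean, ci = mean_with_ci(rouge_l_per_shard)
print(f"ROUGE-L: {mean:.2f} ± {ci:.2f} (95% CI, n={len(rouge_l_per_shard)})")
```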