Update README.md
README.md CHANGED

@@ -68,11 +68,11 @@ Consider the following chat interaction:

The model must predict the bolded parts. So, we randomly mask tokens from the bolded parts, and run the model once on the masked sequence and once on the full sequence.
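
The diff does not show the masking step itself, so here is a minimal sketch of one way it could work; the names (`input_ids`, `target_mask`, `mask_token_id`, `p_mask`) are hypothetical, not taken from the repo:

```python
import torch

def mask_targets(input_ids, target_mask, mask_token_id, p_mask=0.3):
    """Hypothetical helper: randomly mask tokens inside the target spans.

    input_ids: (batch, seq) token ids.
    target_mask: (batch, seq) bool, True on the spans the model must predict.
    mask_token_id and p_mask are illustrative assumptions, not repo values.
    """
    drop = torch.rand(input_ids.shape, device=input_ids.device) < p_mask
    fill = torch.full_like(input_ids, mask_token_id)
    return torch.where(target_mask & drop, fill, input_ids)
```

Running the model on `mask_targets(...)` and again on the untouched `input_ids` gives the two predictions, `p_masked` and `p_full`, compared next.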

- We then compute a
+ We then compute a divergence loss `D(p_masked, p_full)` between the two predictions. For this, I used the average of the backwards and forwards KL divergences between the predictions.

Finally, we add this loss to the standard cross-entropy language modeling losses from each prediction, with a weighting value:

```
- loss = CE(p_masked, labels) + CE(p_full, labels) + weight*D(p_masked, p_full)
+ loss = 0.5*(CE(p_masked, labels) + CE(p_full, labels)) + weight*D(p_masked, p_full)
```
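
To make the updated objective concrete, here is a minimal PyTorch sketch of the new loss line, using the symmetric KL divergence described above and the 0.1 regularization weight listed under the training details. The tensor names and shapes are illustrative assumptions, not the repo's actual code:

```python
import torch
import torch.nn.functional as F

def remask_loss(logits_masked, logits_full, labels, weight=0.1):
    """Hypothetical sketch: averaged CE plus a symmetric KL consistency term.

    logits_masked / logits_full: (batch, seq, vocab) from the two passes.
    labels: (batch, seq) with -100 on positions that are not predicted.
    weight: the regularization weight (0.1 in the training details below).
    """
    ce_masked = F.cross_entropy(
        logits_masked.flatten(0, 1), labels.flatten(), ignore_index=-100
    )
    ce_full = F.cross_entropy(
        logits_full.flatten(0, 1), labels.flatten(), ignore_index=-100
    )

    log_p_masked = F.log_softmax(logits_masked, dim=-1)
    log_p_full = F.log_softmax(logits_full, dim=-1)
    # D(p_masked, p_full): average of the two KL directions.
    kl_a = F.kl_div(log_p_masked, log_p_full, log_target=True, reduction="batchmean")
    kl_b = F.kl_div(log_p_full, log_p_masked, log_target=True, reduction="batchmean")
    divergence = 0.5 * (kl_a + kl_b)

    return 0.5 * (ce_masked + ce_full) + weight * divergence
```

Averaging the two cross-entropy terms keeps the language-modeling loss on the same scale as a single forward pass, so `weight` only scales the consistency term.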

***ReMask-CoT:***

@@ -103,7 +103,7 @@ Here are some benchmark results, computed using the LM Evaluation Harness wi

| Masked Thought | 24.18% | *43.60%* |
| **ReMask** | **27.90%** | 43.26% |

- As I expected, it improves GSM8K doesn't do much to ARC.
+ As I expected, it improves GSM8K, but doesn't do much to ARC.

## Training details
- Framework: PyTorch Lightning

@@ -115,4 +115,5 @@ As I expected, it improves GSM8K, but doesn't do much to ARC.

- Batch size: 16, accumulated to 256
- Epochs: 6
- Learning rate: 1e-5
- Learning rate schedule: One Cycle, cosine, no cycle_momentum
+ - Regularization weight: 0.1
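
For the schedule listed above (One Cycle, cosine annealing, momentum cycling off), a plain-PyTorch sketch might look like the following; the AdamW optimizer, placeholder model, and `total_steps` value are assumptions, and in PyTorch Lightning this would live in `configure_optimizers` with a per-step interval:

```python
import torch

# Placeholder model and optimizer; AdamW and the step count are assumptions.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One Cycle schedule with cosine annealing and momentum cycling disabled,
# matching "One Cycle, cosine, no cycle_momentum" with max_lr = 1e-5.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-5,
    total_steps=1000,        # assumed: (steps per epoch) * (epochs)
    anneal_strategy="cos",
    cycle_momentum=False,
)

for _ in range(1000):
    optimizer.step()         # gradient computation omitted in this sketch
    scheduler.step()         # OneCycleLR advances once per optimizer step
```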