## Training
For training, the learning rate is warmed up from $1 \times 10^{-7}$ to a maximum of $3 \times 10^{-4}$ over the first 2000 steps. We apply a weight decay of 0.1 and gradient clipping at 1.0. The effective batch size is 81,920 tokens per gradient step, distributed over 40 NVIDIA H100-64GB GPUs. We train with DeepSpeed in full `float32` precision. The table below lists the training hyperparameters:

| **Hyper-Parameter** | **Value** |
|---------------------|-----------|
| Batch size          | 40        |
| Number of Epochs    | 1         |
| Optimizer           | Adam      |
| Adam-β₁             | 0.9       |
| Adam-β₂             | 0.999     |
| Adam-ε              | 1e-08     |
| Learning rate       | 3e-04     |
| LR Scheduler        | Linear    |
| Warmup Steps        | 2000      |

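As a sketch, the schedule described above can be written as a plain function: linear warm-up from $1 \times 10^{-7}$ to $3 \times 10^{-4}$ over 2000 steps, followed by linear decay. The total step count below is a hypothetical placeholder, not a value from the paper:

```python
MAX_LR = 3e-4        # peak learning rate (table: 3e-04)
MIN_LR = 1e-7        # initial learning rate before warm-up
WARMUP_STEPS = 2000  # warm-up duration (table: Warmup Steps)
TOTAL_STEPS = 100_000  # hypothetical placeholder; actual count depends on the corpus

def learning_rate(step: int) -> float:
    """Linear warm-up from MIN_LR to MAX_LR, then linear decay to 0."""
    if step < WARMUP_STEPS:
        frac = step / WARMUP_STEPS
        return MIN_LR + frac * (MAX_LR - MIN_LR)
    remaining = max(TOTAL_STEPS - step, 0)
    return MAX_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

print(learning_rate(0))     # 1e-07 at the first step
print(learning_rate(2000))  # 0.0003 at the end of warm-up
```

In an actual training loop this would be paired with `torch.optim.Adam` (β₁ = 0.9, β₂ = 0.999, ε = 1e-08, weight decay 0.1) and `torch.nn.utils.clip_grad_norm_` with `max_norm=1.0` before each optimizer step.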
More training details are specified in the [paper](). Code for training the model and running other experiments can be found in our [GitHub repository](https://github.com/projecte-aina/Plume).
## Evaluation