## Training
For training, the learning rate is warmed up from $1 \times 10^{-7}$ to a maximum of $3 \times 10^{-4}$ over the first 2000 steps. We apply a weight decay of 0.1 and gradient clipping at 1.0. The effective batch size is 81,920 tokens per gradient step, distributed over 40 NVIDIA H100-64GB GPUs. We train with DeepSpeed in full `float32` precision. The table below lists the training hyperparameters:

| **Hyper-Parameter** | **Value** |
|---------------------|-----------|
| Batch size          | 40        |
| Number of Epochs    | 1         |
| Optimizer           | Adam      |
| Adam-β₁             | 0.9       |
| Adam-β₂             | 0.999     |
| Adam-ε              | 1e-08     |
| Learning rate       | 3e-04     |
| LR Scheduler        | Linear    |
| Warmup Steps        | 2000      |

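As a sketch, the schedule described above can be written as a plain function: linear warm-up from $1 \times 10^{-7}$ to $3 \times 10^{-4}$ over 2000 steps, followed by linear decay. The total step count below is a hypothetical placeholder, not a value from the paper:

```python
MAX_LR = 3e-4        # peak learning rate (table: 3e-04)
MIN_LR = 1e-7        # initial learning rate before warm-up
WARMUP_STEPS = 2000  # warm-up duration (table: Warmup Steps)
TOTAL_STEPS = 100_000  # hypothetical placeholder; actual count depends on the corpus

def learning_rate(step: int) -> float:
    """Linear warm-up from MIN_LR to MAX_LR, then linear decay to 0."""
    if step < WARMUP_STEPS:
        frac = step / WARMUP_STEPS
        return MIN_LR + frac * (MAX_LR - MIN_LR)
    remaining = max(TOTAL_STEPS - step, 0)
    return MAX_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

print(learning_rate(0))     # 1e-07 at the first step
print(learning_rate(2000))  # 0.0003 at the end of warm-up
```

In an actual training loop this would be paired with `torch.optim.Adam` (β₁ = 0.9, β₂ = 0.999, ε = 1e-08, weight decay 0.1) and `torch.nn.utils.clip_grad_norm_` with `max_norm=1.0` before each optimizer step.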
More training details are specified in the [paper](). Code for training the model and running other experiments can be found in our [GitHub repository](https://github.com/projecte-aina/Plume).
## Evaluation