Update README.md

README.md — the training-setup list, with the previously empty **Scheduler** entry filled in:

```diff
@@ -28,7 +28,7 @@ DPO lets us directly train the model to score preferred responses higher than le
 - **Optimizer**: AdamW (learning rate = `2e-6`, weight decay = `0`)
 - **Precision**: bf16
 - **Batch size**: 2 (gradient accumulation = 4)
-- **Scheduler**:
+- **Scheduler**: cosine with 1% warmup
 - **DPO Beta**: 0.1
 - **Eval & Checkpointing**: Every epoch
 - **Monitoring**: Weights & Biases (WandB)
```
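For reference, the settings above map onto a training config roughly like the following sketch. This assumes Hugging Face TRL's `DPOConfig` (the README does not say which trainer is used, so the library choice, the `output_dir`, and `num_train_epochs` here are illustrative assumptions, not taken from the commit):

```python
# Hypothetical sketch: README hyperparameters expressed as a TRL DPOConfig.
# Assumes `pip install trl`; field names follow transformers' TrainingArguments.
from trl import DPOConfig

config = DPOConfig(
    output_dir="dpo-output",            # assumption: not specified in the README
    learning_rate=2e-6,                 # Optimizer: AdamW (AdamW is the default)
    weight_decay=0.0,                   # weight decay = 0
    bf16=True,                          # Precision: bf16
    per_device_train_batch_size=2,      # Batch size: 2
    gradient_accumulation_steps=4,      # gradient accumulation = 4
    lr_scheduler_type="cosine",         # Scheduler: cosine
    warmup_ratio=0.01,                  # 1% warmup
    beta=0.1,                           # DPO Beta
    eval_strategy="epoch",              # Eval every epoch
    save_strategy="epoch",              # Checkpoint every epoch
    report_to="wandb",                  # Monitoring: Weights & Biases
    num_train_epochs=1,                 # assumption: epoch count not stated
)
```

The config would then be passed to `DPOTrainer` along with the model, reference model, and preference dataset; older `transformers` versions spell `eval_strategy` as `evaluation_strategy`.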