Update README.md
README.md CHANGED
@@ -43,25 +43,9 @@ For research, experimentation, and educational purposes where a small instructio
 
 ## Training Details
 
-### Supervised Fine-Tuning (SFT)
-
-**Key Configurations:**
-- `batch_size`: 2
-- `compute_type`: bf16
-- `learning_rate`: 5e-5
-- `lora_alpha`: 32
-- `lora_dropout`: 0
-- `lr_scheduler_type`: cosine
-
-### Direct Preference Optimization (DPO)
-
-**Key Configurations:**
-- `batch_size`: 2
-- `compute_type`: bf16
-- `learning_rate`: 5e-5
-- `lora_alpha`: 32
-- `lora_dropout`: 0.95
-- `lr_scheduler_type`: cosine
+Both SFT and DPO share the same core settings: the liger_kernel booster, LoRA fine-tuning of a custom model, BF16 compute, a batch size of 2, and a cosine scheduler with a learning rate of 5e-5. RSLoRA is enabled with a rank of 16 and an alpha of 32.
+
+The two runs differ mainly in dataset and regularization: SFT trains on CrashCourse_120K with packing enabled and a LoRA dropout of 0, while DPO trains on orca_pairs with packing disabled and a LoRA dropout of 0.95.
 
 ## Evaluation
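For readers who want to reproduce these settings, here is a minimal sketch of roughly equivalent configs expressed with Hugging Face peft and TRL. The commit shows only the README text, not the training code, so the library choice and every name below (`use_rslora`, `use_liger_kernel`, the output paths) are assumptions for illustration, not the project's actual setup.

```python
# Hypothetical re-expression of the settings described above using peft + TRL.
# None of this comes from the repository; it only mirrors the README's values.
from peft import LoraConfig
from trl import SFTConfig, DPOConfig

# Shared adapter settings: RSLoRA with rank 16 and alpha 32.
def make_lora_config(dropout: float) -> LoraConfig:
    return LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=dropout,  # 0 for SFT, 0.95 for DPO per the README
        use_rslora=True,       # rank-stabilized LoRA scaling
        task_type="CAUSAL_LM",
    )

sft_lora = make_lora_config(dropout=0.0)
dpo_lora = make_lora_config(dropout=0.95)

# Stage 1: SFT on CrashCourse_120K with sequence packing enabled.
sft_args = SFTConfig(
    output_dir="sft-out",            # placeholder path
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    bf16=True,                       # BF16 compute type
    packing=True,                    # packing enabled for SFT
    use_liger_kernel=True,           # liger_kernel booster (transformers >= 4.45)
)

# Stage 2: DPO on orca_pairs; no packing option here, matching "packing disabled".
dpo_args = DPOConfig(
    output_dir="dpo-out",            # placeholder path
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    bf16=True,
    use_liger_kernel=True,
)
```

These config objects would then be passed to `SFTTrainer` and `DPOTrainer` along with the model, tokenizer, and the respective datasets; that wiring is omitted here since the diff gives no detail about it.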