# Algorithm Config Variants (Step-200 Checkpoints)

This repository contains checkpoint artifacts for several training variants, selected to enable a fair, quick comparison under similar training compute.

W&B workspace (quick-pass tracking): https://wandb.ai/leon_at_work/soni_ablation_4b/workspace

## Included Checkpoints

| Variant Folder | Short Description | Checkpoint Step |
|---|---|---|
| `Baseline 4B` | Baseline configuration used for comparison | 200 |
| `a4_length_norm_sqrt` | GRPO with sqrt length normalization (plus normalized reduction) | 200 |
| `a5_sapo_on_sqrt` | SAPO on top of the sqrt length-normalized setting | 200 |
| `a6_eps_clip_02` | Sqrt length-normalized setting with PPO clip set to 0.2/0.2 | 200 |
| `a8_dr_grpo_gspo` | Dr.GRPO-style setting with GSPO | 200 |
| `a10_grpo_norm_by_std` | GRPO with std-based normalization enabled | 200 |
| `a11_vanilla_grpo` | Vanilla GRPO-style setting (`norm_by_std + regular loss + eps=0.2`) | 200 |

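To fetch a single variant folder rather than the whole repository, `huggingface_hub`'s `snapshot_download` accepts glob patterns via `allow_patterns`. A minimal sketch; the repo id below is a placeholder and should be replaced with this repository's actual id:

```python
def variant_patterns(variant: str) -> list[str]:
    """Glob patterns selecting a single variant folder, e.g. "a4_length_norm_sqrt/*"."""
    return [f"{variant}/*"]


def download_variant(repo_id: str, variant: str) -> str:
    """Download one checkpoint folder and return the local snapshot path."""
    from huggingface_hub import snapshot_download  # requires `pip install huggingface_hub`

    return snapshot_download(repo_id=repo_id, allow_patterns=variant_patterns(variant))


if __name__ == "__main__":
    # "your-org/your-repo" is a placeholder repo id -- substitute the real one.
    print(download_variant("your-org/your-repo", "a4_length_norm_sqrt"))
```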
## Training-Parameter Deltas

Baseline reference (`Baseline 4B`):

`advantage_estimator=grpo`, `policy_loss_type=gspo`, `loss_reduction=sequence_mean`, `eps_clip_low/high=3e-4/4e-4`, `grpo_norm_by_std=false`; the reward config includes `multilevel_localization_f1_reward + multiturn_reward(minimal_turns=4, maximal_turns=4)`.

| Variant Folder | Parameter Delta vs `Baseline 4B` |
|---|---|
| `a4_length_norm_sqrt` | `advantage_estimator=grpo_length_norm_sqrt`; `loss_reduction=seq_mean_token_sum_norm` |
| `a5_sapo_on_sqrt` | same as `a4_length_norm_sqrt`, plus `policy_loss_type=sapo` |
| `a6_eps_clip_02` | `a4`-style setup, with `eps_clip_low=0.2`, `eps_clip_high=0.2` |
| `a8_dr_grpo_gspo` | `advantage_estimator=grpo`; `policy_loss_type=gspo`; `loss_reduction=seq_mean_token_sum_norm` |
| `a10_grpo_norm_by_std` | `grpo_norm_by_std=true` (other key settings close to baseline) |
| `a11_vanilla_grpo` | `grpo_norm_by_std=true`; `policy_loss_type=regular`; `eps_clip_low/high=0.2/0.2`; `loss_reduction=sequence_mean` |

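The deltas above can be read as shallow overrides on the baseline hyperparameters. An illustrative sketch in plain dicts (not the trainer's actual config schema; three of the six variants shown, the rest follow the same pattern):

```python
# Baseline hyperparameters, as listed above.
BASELINE = {
    "advantage_estimator": "grpo",
    "policy_loss_type": "gspo",
    "loss_reduction": "sequence_mean",
    "eps_clip_low": 3e-4,
    "eps_clip_high": 4e-4,
    "grpo_norm_by_std": False,
}

# Per-variant overrides.
A4 = {
    "advantage_estimator": "grpo_length_norm_sqrt",
    "loss_reduction": "seq_mean_token_sum_norm",
}
DELTAS = {
    "a4_length_norm_sqrt": A4,
    "a5_sapo_on_sqrt": {**A4, "policy_loss_type": "sapo"},  # a4 plus SAPO loss
    "a11_vanilla_grpo": {
        "grpo_norm_by_std": True,
        "policy_loss_type": "regular",
        "eps_clip_low": 0.2,
        "eps_clip_high": 0.2,
        "loss_reduction": "sequence_mean",
    },
}


def variant_config(name: str) -> dict:
    """Baseline with the variant's overrides applied on top."""
    return {**BASELINE, **DELTAS.get(name, {})}
```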
## Notes on Early Observations

- In quick-pass validation (on a subset), these algorithmic variants have not yet clearly outperformed the current baseline.
- Vanilla GRPO can show higher training reward without necessarily improving validation performance.
- Data prefiltering looks promising based on learning-curve shape; the related runs are still being monitored.

## Extra Files

- `configs/rewards/baseline_4b.yaml` is included for reference to the baseline reward configuration.