# Algorithm Config Variants (Step-200 Checkpoints)

This repository contains checkpoint artifacts for several training variants, selected to enable a fair, quick comparison under similar training compute.

W&B workspace: https://wandb.ai/leon_at_work/soni_ablation_4b/workspace
## Included Checkpoints

| Variant Folder | Short Description | Checkpoint Step |
|---|---|---|
| `Baseline 4B` | Baseline configuration used for comparison | 200 |
| `a2_sapo` | Vanilla SAPO (`norm_by_std=true` + `sapo` + `eps=0.2` + `sequence_mean`) | 200 |
| `a3_dr_grpo_pure` | Dr.GRPO pure (`grpo` + `regular` + `seq_mean_token_sum_norm` + no-std + `eps=0.2`) | 200 |
| `a4_length_norm_sqrt` | GRPO with sqrt length normalization (plus normalized reduction) | 200 |
| `a5_sapo_on_sqrt` | SAPO on top of the sqrt length-normalized setting | 200 |
| `a6_eps_clip_02` | Sqrt length-normalized setting with PPO clip set to 0.2/0.2 | 200 |
| `a8_dr_grpo_gspo` | Dr.GRPO-style setting with GSPO | 200 |
| `a9_dr_grpo_sapo` | Dr.GRPO-style setting with SAPO | 200 |
| `a10_grpo_norm_by_std` | GRPO with std-based normalization enabled | 200 |
| `a11_vanilla_grpo` | Vanilla GRPO-style setting (`norm_by_std` + `regular` loss + `eps=0.2`) | 200 |
## Training-Parameter Deltas

Baseline reference (`Baseline 4B`): `advantage_estimator=grpo`, `policy_loss_type=gspo`, `loss_reduction=sequence_mean`, `eps_clip_low/high=3e-4/4e-4`, `grpo_norm_by_std=false`; the reward config includes `multilevel_localization_f1_reward` + `multiturn_reward(minimal_turns=4, maximal_turns=4)`.

| Variant Folder | Parameter Delta vs `Baseline 4B` |
|---|---|
| `a2_sapo` | `advantage_estimator=grpo`; `policy_loss_type=sapo`; `loss_reduction=sequence_mean`; `eps_clip_low/high=0.2/0.2`; `grpo_norm_by_std=true` |
| `a3_dr_grpo_pure` | `advantage_estimator=grpo`; `policy_loss_type=regular`; `loss_reduction=seq_mean_token_sum_norm`; `eps_clip_low/high=0.2/0.2`; `grpo_norm_by_std=false` |
| `a4_length_norm_sqrt` | `advantage_estimator=grpo_length_norm_sqrt`; `loss_reduction=seq_mean_token_sum_norm` |
| `a5_sapo_on_sqrt` | same as `a4_length_norm_sqrt`, plus `policy_loss_type=sapo` |
| `a6_eps_clip_02` | `a4`-style setup, with `eps_clip_low=0.2`, `eps_clip_high=0.2` |
| `a8_dr_grpo_gspo` | `advantage_estimator=grpo`; `policy_loss_type=gspo`; `loss_reduction=seq_mean_token_sum_norm` |
| `a9_dr_grpo_sapo` | `advantage_estimator=grpo`; `policy_loss_type=sapo`; `loss_reduction=seq_mean_token_sum_norm` |
| `a10_grpo_norm_by_std` | `grpo_norm_by_std=true` (other key settings close to baseline) |
| `a11_vanilla_grpo` | `grpo_norm_by_std=true`; `policy_loss_type=regular`; `eps_clip_low/high=0.2/0.2`; `loss_reduction=sequence_mean` |
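The delta notation above can be made concrete with a small sketch: each variant's effective settings are the baseline values with that variant's overrides applied on top. The dictionaries and the `effective_config` helper below are a hypothetical illustration using the parameter names from the tables, not the repository's actual config files.

```python
# Sketch: reconstruct a variant's effective settings by overlaying its
# parameter deltas on the baseline. Keys mirror the parameter names used
# in the tables above; values are taken from the baseline reference.
BASELINE = {
    "advantage_estimator": "grpo",
    "policy_loss_type": "gspo",
    "loss_reduction": "sequence_mean",
    "eps_clip_low": 3e-4,
    "eps_clip_high": 4e-4,
    "grpo_norm_by_std": False,
}

# Two example variants from the delta table (others follow the same pattern).
DELTAS = {
    "a2_sapo": {
        "policy_loss_type": "sapo",
        "eps_clip_low": 0.2,
        "eps_clip_high": 0.2,
        "grpo_norm_by_std": True,
    },
    "a8_dr_grpo_gspo": {
        "loss_reduction": "seq_mean_token_sum_norm",
    },
}

def effective_config(variant: str) -> dict:
    """Return the baseline settings with the variant's overrides applied."""
    return {**BASELINE, **DELTAS.get(variant, {})}

print(effective_config("a8_dr_grpo_gspo")["loss_reduction"])
# -> seq_mean_token_sum_norm
```

Reading the table this way makes it explicit that, e.g., `a8_dr_grpo_gspo` differs from the baseline only in its loss reduction, while keeping the GSPO policy loss and tight clip range.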
## Notes on Early Observations

- In quick-pass validation on a subset, these algorithmic variants have not yet clearly outperformed the current baseline.
- Vanilla GRPO can show higher training reward without necessarily improving validation performance.
- Data prefiltering looks promising based on its learning-curve shape; the related runs are still being monitored.
## Extra Files

- `configs/rewards/baseline_4b.yaml` is included as a reference for the baseline reward configuration.