# Algorithm Config Variants (Step-200 Checkpoints)

This repository contains checkpoint artifacts for several training variants, selected to enable a fair, quick comparison under similar training compute.

W&B workspace: https://wandb.ai/leon_at_work/soni_ablation_4b/workspace
## Included Checkpoints

| Variant Folder | Short Description | Checkpoint Step |
|---|---|---|
| `Baseline 4B` | Baseline configuration used for comparison | 200 |
| `a2_sapo` | Vanilla SAPO (`norm_by_std=true` + `sapo` + `eps=0.2` + `sequence_mean`) | 200 |
| `a3_dr_grpo_pure` | Dr.GRPO pure (`grpo` + `regular` + `seq_mean_token_sum_norm` + no-std + `eps=0.2`) | 200 |
| `a4_length_norm_sqrt` | GRPO with sqrt length normalization (plus normalized reduction) | 200 |
| `a5_sapo_on_sqrt` | SAPO on top of the sqrt length-normalized setting | 200 |
| `a6_eps_clip_02` | Sqrt length-normalized setting with PPO clip set to 0.2/0.2 | 200 |
| `a8_dr_grpo_gspo` | Dr.GRPO-style setting with GSPO | 200 |
| `a9_dr_grpo_sapo` | Dr.GRPO-style setting with SAPO | 200 |
| `a10_grpo_norm_by_std` | GRPO with std-based normalization enabled | 200 |
| `a11_vanilla_grpo` | Vanilla GRPO-style setting (`norm_by_std` + `regular` loss + `eps=0.2`) | 200 |
## Training-Parameter Deltas

Baseline reference (`Baseline 4B`): `advantage_estimator=grpo`, `policy_loss_type=gspo`, `loss_reduction=sequence_mean`, `eps_clip_low/high=3e-4/4e-4`, `grpo_norm_by_std=false`; the reward config includes `multilevel_localization_f1_reward` + `multiturn_reward(minimal_turns=4, maximal_turns=4)`.

| Variant Folder | Parameter Delta vs `Baseline 4B` |
|---|---|
| `a2_sapo` | `advantage_estimator=grpo`; `policy_loss_type=sapo`; `loss_reduction=sequence_mean`; `eps_clip_low/high=0.2/0.2`; `grpo_norm_by_std=true` |
| `a3_dr_grpo_pure` | `advantage_estimator=grpo`; `policy_loss_type=regular`; `loss_reduction=seq_mean_token_sum_norm`; `eps_clip_low/high=0.2/0.2`; `grpo_norm_by_std=false` |
| `a4_length_norm_sqrt` | `advantage_estimator=grpo_length_norm_sqrt`; `loss_reduction=seq_mean_token_sum_norm` |
| `a5_sapo_on_sqrt` | same as `a4_length_norm_sqrt`, plus `policy_loss_type=sapo` |
| `a6_eps_clip_02` | `a4`-style setup, with `eps_clip_low=0.2`, `eps_clip_high=0.2` |
| `a8_dr_grpo_gspo` | `advantage_estimator=grpo`; `policy_loss_type=gspo`; `loss_reduction=seq_mean_token_sum_norm` |
| `a9_dr_grpo_sapo` | `advantage_estimator=grpo`; `policy_loss_type=sapo`; `loss_reduction=seq_mean_token_sum_norm` |
| `a10_grpo_norm_by_std` | `grpo_norm_by_std=true` (other key settings close to baseline) |
| `a11_vanilla_grpo` | `grpo_norm_by_std=true`; `policy_loss_type=regular`; `eps_clip_low/high=0.2/0.2`; `loss_reduction=sequence_mean` |
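The delta notation above can be made concrete with a small sketch: each variant's effective settings are the baseline values with that variant's overrides applied on top. The dictionaries and the `effective_config` helper below are a hypothetical illustration using the parameter names from the tables, not the repository's actual config files.

```python
# Sketch: reconstruct a variant's effective settings by overlaying its
# parameter deltas on the baseline. Keys mirror the parameter names used
# in the tables above; values are taken from the baseline reference.
BASELINE = {
    "advantage_estimator": "grpo",
    "policy_loss_type": "gspo",
    "loss_reduction": "sequence_mean",
    "eps_clip_low": 3e-4,
    "eps_clip_high": 4e-4,
    "grpo_norm_by_std": False,
}

# Two example variants from the delta table (others follow the same pattern).
DELTAS = {
    "a2_sapo": {
        "policy_loss_type": "sapo",
        "eps_clip_low": 0.2,
        "eps_clip_high": 0.2,
        "grpo_norm_by_std": True,
    },
    "a8_dr_grpo_gspo": {
        "loss_reduction": "seq_mean_token_sum_norm",
    },
}

def effective_config(variant: str) -> dict:
    """Return the baseline settings with the variant's overrides applied."""
    return {**BASELINE, **DELTAS.get(variant, {})}

print(effective_config("a8_dr_grpo_gspo")["loss_reduction"])
# -> seq_mean_token_sum_norm
```

Reading the table this way makes it explicit that, e.g., `a8_dr_grpo_gspo` differs from the baseline only in its loss reduction, while keeping the GSPO policy loss and tight clip range.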
## Notes on Early Observations

- In quick-pass validation on a subset, these algorithmic variants have not yet clearly outperformed the current baseline.
- Vanilla GRPO can show higher training reward without necessarily improving validation performance.
- Data prefiltering looks promising based on its learning-curve shape; the related runs are still being monitored.
## Extra Files

- `configs/rewards/baseline_4b.yaml` is included as a reference for the baseline reward configuration.