Leon-Leee committed on
Commit 617bc69 · verified · 1 Parent(s): ff505ab

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -11,7 +11,7 @@ https://wandb.ai/leon_at_work/soni_ablation_4b/workspace
  |---|---|---|
  | `Baseline 4B` | Baseline configuration used for comparison | 200 |
  | `a2_sapo` | Vanilla SAPO (norm_by_std=true + sapo + eps=0.2 + sequence_mean) | 200 |
- | a3_dr_grpo_pure | Dr.GRPO pure (grpo + regular + seq_mean_token_sum_norm + no-std + eps=0.2) | 200 |
+ | `a3_dr_grpo_pure` | Dr.GRPO pure (grpo + regular + seq_mean_token_sum_norm + no-std + eps=0.2) | 200 |
  | `a4_length_norm_sqrt` | GRPO with sqrt length normalization (plus normalized reduction) | 200 |
  | `a5_sapo_on_sqrt` | SAPO on top of sqrt length-normalized setting | 200 |
  | `a6_eps_clip_02` | Sqrt length-normalized setting with PPO clip set to 0.2/0.2 | 200 |
@@ -28,7 +28,7 @@ Baseline reference (`Baseline 4B`):
  | Variant Folder | Parameter Delta vs `Baseline 4B` |
  |---|---|
  | `a2_sapo` | `advantage_estimator=grpo`; `policy_loss_type=sapo`; `loss_reduction=sequence_mean`; `eps_clip_low/high=0.2/0.2`; `grpo_norm_by_std=true` |
- | a3_dr_grpo_pure | advantage_estimator=grpo; policy_loss_type=regular; loss_reduction=seq_mean_token_sum_norm; eps_clip_low/high=0.2/0.2; grpo_norm_by_std=false |
+ | `a3_dr_grpo_pure` | advantage_estimator=grpo; policy_loss_type=regular; loss_reduction=seq_mean_token_sum_norm; eps_clip_low/high=0.2/0.2; grpo_norm_by_std=false |
  | `a4_length_norm_sqrt` | `advantage_estimator=grpo_length_norm_sqrt`; `loss_reduction=seq_mean_token_sum_norm` |
  | `a5_sapo_on_sqrt` | same as `a4_length_norm_sqrt`, plus `policy_loss_type=sapo` |
  | `a6_eps_clip_02` | `a4`-style setup, with `eps_clip_low=0.2`, `eps_clip_high=0.2` |
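
For readers who want to see how the "Parameter Delta" rows compose, here is a minimal Python sketch. The keys and values are copied from the table rows in this diff; the dict layout, the `VARIANT_DELTAS` name, and the `resolve` helper are hypothetical and not part of the repository. Reading `a6_eps_clip_02` as "the `a4` delta plus the two clip values" is an assumption based on the "`a4`-style setup" wording. The baseline values are not shown in this diff, so the caller has to supply them.

```python
# Hypothetical layout: each variant maps to its overrides on top of `Baseline 4B`.
# Keys and values come from the README table; the structure itself is an assumption.
A4_DELTA = {
    "advantage_estimator": "grpo_length_norm_sqrt",
    "loss_reduction": "seq_mean_token_sum_norm",
}

VARIANT_DELTAS = {
    "a2_sapo": {
        "advantage_estimator": "grpo",
        "policy_loss_type": "sapo",
        "loss_reduction": "sequence_mean",
        "eps_clip_low": 0.2,
        "eps_clip_high": 0.2,
        "grpo_norm_by_std": True,
    },
    "a3_dr_grpo_pure": {
        "advantage_estimator": "grpo",
        "policy_loss_type": "regular",
        "loss_reduction": "seq_mean_token_sum_norm",
        "eps_clip_low": 0.2,
        "eps_clip_high": 0.2,
        "grpo_norm_by_std": False,
    },
    "a4_length_norm_sqrt": dict(A4_DELTA),
    # "same as a4_length_norm_sqrt, plus policy_loss_type=sapo"
    "a5_sapo_on_sqrt": {**A4_DELTA, "policy_loss_type": "sapo"},
    # Assumption: "a4-style setup" means the a4 delta plus the clip values.
    "a6_eps_clip_02": {**A4_DELTA, "eps_clip_low": 0.2, "eps_clip_high": 0.2},
}


def resolve(baseline: dict, variant: str) -> dict:
    """Return a new config: the caller's baseline with the variant's overrides applied."""
    return {**baseline, **VARIANT_DELTAS[variant]}
```

Calling `resolve(baseline, "a5_sapo_on_sqrt")` with your own baseline dict returns a copy in which only the three listed keys are overridden; every other baseline key is left untouched.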