Leon-Leee committed on
Commit 9d5af36 · verified · 1 Parent(s): 22c20db

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +2 -0
README.md CHANGED
@@ -14,6 +14,7 @@ https://wandb.ai/leon_at_work/soni_ablation_4b/workspace
 | `a5_sapo_on_sqrt` | SAPO on top of sqrt length-normalized setting | 200 |
 | `a6_eps_clip_02` | Sqrt length-normalized setting with PPO clip set to 0.2/0.2 | 200 |
 | `a8_dr_grpo_gspo` | Dr.GRPO-style setting with GSPO | 200 |
+| `a9_dr_grpo_sapo` | Dr.GRPO-style setting with SAPO | 200 |
 | `a10_grpo_norm_by_std` | GRPO with std-based normalization enabled | 200 |
 | `a11_vanilla_grpo` | Vanilla GRPO-style setting (`norm_by_std + regular loss + eps=0.2`) | 200 |
 
@@ -28,6 +29,7 @@ Baseline reference (`Baseline 4B`):
 | `a5_sapo_on_sqrt` | same as `a4_length_norm_sqrt`, plus `policy_loss_type=sapo` |
 | `a6_eps_clip_02` | `a4`-style setup, with `eps_clip_low=0.2`, `eps_clip_high=0.2` |
 | `a8_dr_grpo_gspo` | `advantage_estimator=grpo`; `policy_loss_type=gspo`; `loss_reduction=seq_mean_token_sum_norm` |
+| `a9_dr_grpo_sapo` | `advantage_estimator=grpo`; `policy_loss_type=sapo`; `loss_reduction=seq_mean_token_sum_norm` |
 | `a10_grpo_norm_by_std` | `grpo_norm_by_std=true` (other key settings close to baseline) |
 | `a11_vanilla_grpo` | `grpo_norm_by_std=true`; `policy_loss_type=regular`; `eps_clip_low/high=0.2/0.2`; `loss_reduction=sequence_mean` |
 
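As listed in the README rows, the added `a9_dr_grpo_sapo` run appears to differ from `a8_dr_grpo_gspo` only in `policy_loss_type`. A minimal sketch that makes this explicit, assuming the settings can be represented as flat key/value overrides; the dict names and the `diff_overrides` helper are hypothetical and not part of the repository:

```python
# Hypothetical flat override dicts. Keys and values are copied from the
# README rows; the dict representation itself is an assumption.
A8_DR_GRPO_GSPO = {
    "advantage_estimator": "grpo",
    "policy_loss_type": "gspo",
    "loss_reduction": "seq_mean_token_sum_norm",
}
A9_DR_GRPO_SAPO = {
    "advantage_estimator": "grpo",
    "policy_loss_type": "sapo",
    "loss_reduction": "seq_mean_token_sum_norm",
}


def diff_overrides(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


changed = diff_overrides(A8_DR_GRPO_GSPO, A9_DR_GRPO_SAPO)
print(changed)
```

Under these assumptions the only differing key is `policy_loss_type`, so any gap between the two runs can be attributed to swapping GSPO for SAPO in the Dr.GRPO-style setting.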