explcre/phase8_rl

explcre committed 25 days ago

Commit bca3e91 (verified)

1 Parent(s): d01b5e3

server3_h100_20260502: server3_h100_20260502/README.md

Files changed (1)

server3_h100_20260502/README.md ADDED

+# Phase 8 RL — server3 H100 cluster runs (2026-05-02)
+This subfolder contains Phase 8 RL ablation runs from the server3 H100 cluster,
+in addition to the lab cluster's runs already present in the repo.
+## Algos × ckpts
+- {svgspo, dapo, gspo_v2, grpo} × {ckpt_step000100.pt (peak), ckpt_step000200.pt (final)}
+## Audit
+- GSPO includes a length-normalization fix (Qwen 2025 geometric-mean sequence
+  ratio); see GitHub commit `fa2b5ab` on branch `mllm-integrate-server3`.
+- DAPO is a "Clip-Higher only" ablation (1 of the 4 components in the paper).
+- See EXPERIMENTS.md cycles 25-33 for full audit + cross-cluster comparison.
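
Reviewer note: the first two audit items concern how the policy ratio is formed and how it is clipped. The sketch below is a minimal illustration, assuming per-token log-probabilities are available as plain Python lists. The function names `gspo_sequence_ratio` and `clip_higher_objective` and the epsilon defaults are hypothetical and are not taken from commit `fa2b5ab`; the repo's actual implementation may differ.

```python
import math


def gspo_sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: exp(mean_t(logp_new_t - logp_old_t)).

    This is the geometric mean of the per-token ratios, so it does not
    grow or shrink just because a response is longer.
    """
    if len(logp_new) != len(logp_old) or not logp_new:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


def clip_higher_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with an asymmetric range (Clip-Higher).

    eps_low / eps_high are illustrative defaults (assumption); with
    eps_high > eps_low the upper bound is looser than the lower bound.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Same per-token log-ratio (0.1) at two different lengths.
    short = gspo_sequence_ratio([-0.9] * 10, [-1.0] * 10)
    long_ = gspo_sequence_ratio([-0.9] * 100, [-1.0] * 100)
    print(short, long_)  # both are approximately exp(0.1)
    print(clip_higher_objective(1.5, 1.0))
```

Without the length normalization, the product of per-token ratios for the 100-token case would be roughly exp(10), which is why sequence-level clipping needs the geometric mean to stay on a comparable scale across response lengths.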
+## Logs
+- log.jsonl — per-step training metrics (18 columns: reward channels, ratio,
+  KL, clip_frac).
+- rollouts.jsonl — per-rollout case-study log (1600 rollouts per algo) with
+  full reward channel breakdown for visualization.
+## Figures
+- F6_rl_training_curves.pdf — basic 6-panel grid (4 algos overlaid)
+- F6_rl_rich.pdf / .png — rolling-mean + per-channel + best-of-K + clip activity
+  (4 rows × 4 cols; addresses the "raw curve looks spiky" question).
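
Reviewer note: the rolling-mean view used to smooth the spiky raw curves can be approximated directly from a per-step JSONL log. This is a minimal sketch using only the standard library. The metric key `"reward"` and the window size are assumptions for illustration; the actual column names in `log.jsonl` are not listed in this README.

```python
import json
from collections import deque


def read_metric(lines, key):
    """Collect one numeric field from JSONL lines, skipping blanks and
    records that do not contain the key."""
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if key in record:
            values.append(float(record[key]))
    return values


def rolling_mean(values, window):
    """Trailing rolling mean; the first window-1 outputs average over
    however many points are available so far."""
    if window < 1:
        raise ValueError("window must be >= 1")
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # value about to be evicted by append
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


if __name__ == "__main__":
    # Hypothetical records; a real run would iterate over open("log.jsonl").
    sample = [
        '{"step": 1, "reward": 1.0}',
        '{"step": 2, "reward": 2.0}',
        '{"step": 3, "reward": 3.0}',
        '{"step": 4, "reward": 4.0}',
    ]
    print(rolling_mean(read_metric(sample, "reward"), window=2))
```

A trailing window is used here so that each smoothed point depends only on past steps; a centered window would look smoother but would mix in future steps.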