server3_h100_20260502: server3_h100_20260502/README.md
server3_h100_20260502/README.md
ADDED
@@ -0,0 +1,24 @@
# Phase 8 RL: server3 H100 cluster runs (2026-05-02)

This subfolder contains Phase 8 RL ablation runs from the server3 H100 cluster.
They supplement the lab-cluster runs already present in the repo.

## Algos × ckpts
- {svgspo, dapo, gspo_v2, grpo} × {ckpt_step000100.pt (peak), ckpt_step000200.pt (final)}
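
The cross product above expands to eight algorithm/checkpoint combinations. A minimal sketch of enumerating them follows; the `<algo>/<checkpoint>` path layout and the helper name `run_grid` are assumptions made for illustration, not something this README specifies.

```python
from itertools import product

# Names taken from the list above; the roles are the README's own labels.
ALGOS = ["svgspo", "dapo", "gspo_v2", "grpo"]
CKPTS = {"ckpt_step000100.pt": "peak", "ckpt_step000200.pt": "final"}


def run_grid(algos=ALGOS, ckpts=CKPTS):
    """Return (algo, checkpoint file, role) for every combination."""
    return [(algo, ckpt, role) for algo, (ckpt, role) in product(algos, ckpts.items())]


if __name__ == "__main__":
    for algo, ckpt, role in run_grid():
        # Hypothetical layout: <algo>/<checkpoint>. Adjust to the real directory tree.
        print(f"{algo}/{ckpt}  ({role})")
```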

## Audit
- GSPO includes a length-normalization fix (Qwen 2025 geometric-mean ratio); see GitHub commit
  `fa2b5ab` on branch `mllm-integrate-server3`.
- DAPO is run as a "Clip-Higher only" ablation (1 of the 4 components in the DAPO paper).
- See `EXPERIMENTS.md`, cycles 25-33, for the full audit and cross-cluster comparison.
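
To make the two audit items concrete, here is a minimal sketch of a length-normalized (geometric-mean) sequence ratio and of an asymmetric "Clip-Higher" surrogate. It is not the code from commit `fa2b5ab`: the function names are hypothetical, the inputs are plain Python lists rather than tensors, and the default clip bounds (0.2 and 0.28) are the values reported in the DAPO paper, which may differ from the settings used in these runs.

```python
import math


def gspo_sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: exp(mean_t(logp_new[t] - logp_old[t])).

    This equals the geometric mean of the per-token probability ratios.
    """
    if len(logp_new) != len(logp_old) or not logp_new:
        raise ValueError("need two non-empty log-prob lists of equal length")
    mean_log_ratio = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    return math.exp(mean_log_ratio)


def clip_higher_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with asymmetric bounds [1 - eps_low, 1 + eps_high]."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

Without the division by sequence length, the ratio would be the product of the token ratios, which shrinks or grows with response length; the mean in log space removes that dependence.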

## Logs
- `log.jsonl`: per-step training metrics (18 columns: reward channels, ratio,
  KL, clip_frac).
- `rollouts.jsonl`: per-rollout case-study log (1600 rollouts per algo) with the
  full reward-channel breakdown for visualization.
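
Both logs are JSON Lines files, so they can be read with the standard library alone. The sketch below assumes one JSON object per line; `read_jsonl` and `column` are hypothetical helper names, and the exact key spellings inside the files (for example `clip_frac`) are assumed from the column list above rather than verified.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def column(records, key):
    """Collect one metric across records, ignoring rows that lack the key."""
    return [r[key] for r in records if key in r]
```

For example, `column(read_jsonl("log.jsonl"), "clip_frac")` would return the per-step clip fraction as a list, provided the key is spelled that way in the file.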

## Figures
- `F6_rl_training_curves.pdf`: basic 6-panel grid (4 algos overlaid).
- `F6_rl_rich.pdf` / `.png`: rolling mean, per-channel breakdown, best-of-K, and clip activity
  (4 rows × 4 cols; addresses the "raw curve looks spiky" question).
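
The rolling mean is what smooths the spiky raw curves. A minimal sketch of one common variant, a trailing window, is shown here; the README does not say which window size or alignment the figure script uses, so both are assumptions, as is the function name.

```python
def rolling_mean(values, window):
    """Trailing rolling mean; early points average over the samples seen so far."""
    if window < 1:
        raise ValueError("window must be >= 1")
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the sample that just left the window.
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out
```

A larger window gives a smoother curve at the cost of lagging behind real changes in the metric, so it is worth plotting the raw series faintly underneath.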