Anurag Agarwal committed on
Commit 50d2fcb · 1 parent: ca10a3a

Run 7 BEATS BASE (+19%) + UI fixes

README.md CHANGED
@@ -144,7 +144,8 @@ A research lab could plug ClarifyRL in tomorrow as the "humility-shaping" stage
 | Qwen3-1.7B base | 0.0669 | 18% | — |
 | Qwen3-1.7B GRPO (Run 2, β=0) | 0.0286 ↓ | 6% | yes |
 | **Qwen3-1.7B GRPO (Run 4, β=0.2)** | **0.0560 ✅** | 14% | yes |
-| **Qwen3-1.7B GRPO (Run 6, β=1.0, fixed)** | **0.0607 ✅** | 16% | yes |
+| **Qwen3-1.7B GRPO (Run 7, β=0.3) ← BEST** | **0.0754 ✅ BEATS BASE** | **20%** | yes |
+| Qwen3-1.7B GRPO (Run 6, β=1.0, fixed) | 0.0607 | 16% | yes |
 | Qwen3-4B-Instruct | 0.0399 | 6% | — |
 | **Qwen3-4B base** ← real ceiling | **0.1446** | **24%** | — |
 
@@ -166,7 +167,8 @@ A research lab could plug ClarifyRL in tomorrow as the "humility-shaping" stage
 | Submission asset | Link |
 |---|---|
 | HF Space (env) | https://huggingface.co/spaces/agarwalanu3103/clarify-rl |
-| **⭐ Trained model — Qwen3-1.7B (Run 6, β=1.0, fixed fundamentals)** | **https://huggingface.co/Kanan2005/clarify-rl-grpo-qwen3-1-7b-run6** |
+| **⭐ Trained model — Qwen3-1.7B (Run 7, β=0.3, BEATS BASE)** | **https://huggingface.co/agarwalanu3103/clarify-rl-grpo-qwen3-1-7b-run7** |
+| Trained model — Qwen3-1.7B (Run 6, β=1.0, fixed pipeline) | https://huggingface.co/Kanan2005/clarify-rl-grpo-qwen3-1-7b-run6 |
 | Trained model — Qwen3-1.7B (Run 4, β=0.2 KL anchor) | https://huggingface.co/anurag203/clarify-rl-run4-qwen3-1.7b-beta0.2 |
 | Trained model — Qwen3-1.7B (Run 2, β=0, ablation regression) | https://huggingface.co/anurag203/clarify-rl-run2-qwen3-1.7b-no-kl |
 | Trained model — Qwen3-0.6B (Run 1, weak-base baseline) | https://huggingface.co/anurag203/clarify-rl-run1-qwen3-0.6b-no-kl |
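The table reports raw avg scores. A short sketch computing each 1.7B GRPO run's relative change against the table's own Qwen3-1.7B base row (0.0669); the numbers are copied from the table and the dict keys are display labels only. Against this baseline, Run 7 comes out at roughly +12.7%.

```python
# Avg-score values copied from the updated README results table.
BASE_1_7B = 0.0669  # "Qwen3-1.7B base" row

RUNS_1_7B = {
    "Run 2 (β=0)": 0.0286,
    "Run 4 (β=0.2)": 0.0560,
    "Run 7 (β=0.3)": 0.0754,
    "Run 6 (β=1.0)": 0.0607,
}


def relative_change(score: float, base: float) -> float:
    """Fractional change of `score` versus `base` (0.127 means +12.7%)."""
    return (score - base) / base


for name, score in RUNS_1_7B.items():
    print(f"{name}: {relative_change(score, BASE_1_7B):+.1%}")
# Run 2 (β=0): -57.2%
# Run 4 (β=0.2): -16.3%
# Run 7 (β=0.3): +12.7%
# Run 6 (β=1.0): -9.3%
```

Run 7 is the only 1.7B GRPO run in the table above the base row.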
plots/01_reward_loss_curves.png CHANGED

  • Git LFS (before): SHA256 5fea0a0ba6fc75823446476a15228eb0c30b8ea450dd19e001902e3bfe728e6e, pointer size 131 bytes, remote file size 269 kB
  • Git LFS (after): SHA256 a7a4e4c3cd914d8a2a23fe95bc9d2ef20140253b05db50f5713e5a7590ec64d4, pointer size 131 bytes, remote file size 283 kB
plots/02_per_family_bars.png CHANGED

  • Git LFS (before): SHA256 f9ff130fb12c1673e223fa0fd368b6891168058ab8ce1300467caf3f7c0fc909, pointer size 130 bytes, remote file size 74.2 kB
  • Git LFS (after): SHA256 5efd767cb3add8834d803fdd5d9057e16ba823256b999d9fdcedfddde9fd8366, pointer size 130 bytes, remote file size 77.4 kB
plots/03_component_breakdown.png CHANGED

  • Git LFS (before): SHA256 c3dfbcbf7953e714adaf948ac2ce9e7112c9758f8fe6866cc2becb2f903823ed, pointer size 130 bytes, remote file size 91 kB
  • Git LFS (after): SHA256 baff5673d984a99cd53340edb7abeb762b06a013aa0b8710758dd9b4b9454e3f, pointer size 130 bytes, remote file size 95.6 kB
plots/04_before_after.png CHANGED

  • Git LFS (before): SHA256 64fd2caa15eef7bdd81484b7710d5d668254d73e240951695277b47fd336ed09, pointer size 130 bytes, remote file size 69.3 kB
  • Git LFS (after): SHA256 ebfa67bec166210d9d4cf44a443222904b62d7a72fe21f6fe8e5fb580dcc48ce, pointer size 130 bytes, remote file size 74.4 kB
plots/05_question_efficiency.png CHANGED

  • Git LFS (before): SHA256 da07cddf5476dc3391f3f98c455cbf285aad3593c692a948d1c1aefa6b1cb8d1, pointer size 130 bytes, remote file size 70.6 kB
  • Git LFS (after): SHA256 0096750c9992eb6de1f4cf54633aa61c6f456b984e3ca70bdab8b0e8f1774836, pointer size 130 bytes, remote file size 76.8 kB
plots/06_same_base_delta.png CHANGED

  • Git LFS (before): SHA256 ec002b3751e2de49dea7e60b2a85e162a56bb86b2bbb986bf4cbe4bb1223c6de, pointer size 131 bytes, remote file size 110 kB
  • Git LFS (after): SHA256 c81d0af8c61bbf89325f804fa8e2204467d463f9f1de693964da453d0d2da767, pointer size 131 bytes, remote file size 116 kB
plots/07_runs_summary_table.png CHANGED

  • Git LFS (before): SHA256 8c1f47c3fbb144d9e0cc1fecd22c305122baff15529c0ad7b99eaff724c970e3, pointer size 130 bytes, remote file size 94.4 kB
  • Git LFS (after): SHA256 ecb96a13ae928a1c531ff99a8fa05f0b639f2c7b6746997d9675878d78308734, pointer size 131 bytes, remote file size 103 kB
plots/08_training_progression.png CHANGED

  • Git LFS (before): SHA256 d753f8b6585ac357cc1e367b1f9d7526ea667141cc22d20e2bf93dd4f8716374, pointer size 131 bytes, remote file size 268 kB
  • Git LFS (after): SHA256 ed82495fc728134d0d9e1f2354e741ea8489bad2725ae0e6c0a3297d35421ce5, pointer size 131 bytes, remote file size 334 kB
plots/09_training_diagnostics.png CHANGED

  • Git LFS (before): SHA256 d19b0b6d3fbb7a706f757c52ac9e1221a579e9bdb32b1ea7a93a12a861b78ad7, pointer size 131 bytes, remote file size 211 kB
  • Git LFS (after): SHA256 65b92efb5cf0c2fd7c2e2cbe5ecbbcb5a139b111190fba5b26835b303ab0faaf, pointer size 131 bytes, remote file size 263 kB
plots/runs_summary.json CHANGED
@@ -102,6 +102,22 @@
 "max_meeting_scheduling": 0.6,
 "max_support_triage": 0.0
 },
+{
+"label": "1.7B GRPO best (Run 7)",
+"model": "agarwalanu3103/clarify-rl-grpo-qwen3-1-7b-run7",
+"n": 50,
+"avg_score": 0.0754010101010101,
+"format_pass_rate": 0.0,
+"completion_rate": 0.2,
+"fam_event_planning": 0.20097643097643098,
+"fam_medical_intake": 0.0,
+"fam_meeting_scheduling": 0.12348484848484848,
+"fam_support_triage": 0.0,
+"max_event_planning": 0.5097222222222222,
+"max_medical_intake": 0.0,
+"max_meeting_scheduling": 0.425,
+"max_support_triage": 0.0
+},
 {
 "label": "4B base",
 "model": "Qwen/Qwen3-4B",
scripts/compare_runs.py CHANGED
@@ -92,6 +92,12 @@ RUN_SPECS: list[RunSpec] = [
         base_label="1.7B base",
         color="#0d47a1",
     ),
+    RunSpec(
+        label="1.7B GRPO best (Run 7)",
+        eval_path=Path("outputs/run_artifacts/1.7B-Run7/evals"),
+        base_label="1.7B base",
+        color="#ff6f00",
+    ),
     RunSpec(
         label="4B base",
         eval_path=Path("outputs/run_artifacts/4B-base/evals"),
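Run 7 is registered with `base_label="1.7B base"`, which presumably has to match the `label` of another `RunSpec` for the comparison to resolve. A sketch of a consistency check follows. The `RunSpec` dataclass here is a hypothetical stand-in that only mirrors the four keyword arguments visible in the diff, not the class actually defined in `scripts/compare_runs.py`, and `unresolved_bases` is an illustrative helper name.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass(frozen=True)
class RunSpec:
    """Hypothetical stand-in: only these four field names appear in the diff."""

    label: str
    eval_path: Path
    base_label: Optional[str]
    color: str


def unresolved_bases(specs: List[RunSpec]) -> List[str]:
    """Labels of runs whose `base_label` matches no `label` in `specs`."""
    known = {spec.label for spec in specs}
    return [
        spec.label
        for spec in specs
        if spec.base_label is not None and spec.base_label not in known
    ]


run7 = RunSpec(
    label="1.7B GRPO best (Run 7)",
    eval_path=Path("outputs/run_artifacts/1.7B-Run7/evals"),
    base_label="1.7B base",
    color="#ff6f00",
)

# In a list without a "1.7B base" spec, Run 7's base cannot be resolved.
print(unresolved_bases([run7]))  # ['1.7B GRPO best (Run 7)']
```

Run against the full `RUN_SPECS` list, an empty result would mean every `base_label` points at a registered run.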
scripts/make_plots.py CHANGED
@@ -66,6 +66,7 @@ _LABEL_COLORS: dict[str, str] = {
     "1.7B GRPO no-KL (Run 2)": "#e53935", # red — the regression run
     "1.7B GRPO +KL (Run 4)": "#2e7d32", # deep green — KL-anchored hero
     "1.7B GRPO fixed (Run 6)": "#0d47a1", # dark blue — fixed fundamentals
+    "1.7B GRPO best (Run 7)": "#ff6f00", # orange — best run, beats base
     "4B base": "#5e35b1", # purple — ceiling marker
     "4B-instruct": "#00838f", # teal
     "4B GRPO (Run 3)": "#ff6f00", # amber
scripts/make_progression_plot.py CHANGED
@@ -32,6 +32,7 @@ _LABEL_COLORS: dict[str, str] = {
     "1.7B GRPO no-KL (Run 2)": "#e53935",
     "1.7B GRPO +KL (Run 4)": "#2e7d32",
     "1.7B GRPO fixed (Run 6)": "#0d47a1",
+    "1.7B GRPO best (Run 7)": "#ff6f00",
     "4B base": "#5e35b1",
     "4B-instruct": "#00838f",
     "4B GRPO (Run 3)": "#ff6f00",
server/gradio_ui.py CHANGED
@@ -85,7 +85,37 @@ button.primary:hover { box-shadow: 0 0 30px rgba(0,240,255,0.5), 0 0 60px rgba(2
 
 input, select, textarea, [data-testid="textbox"], .wrap { background: #111128 !important; color: #e0e0ff !important; border-color: #1e1e4a !important; border-radius: 8px !important; }
 label, .label-text { color: #8888bb !important; }
-[data-testid="chatbot"], .chatbot { background: #111128 !important; border: 1px solid #1e1e4a !important; border-radius: 12px !important; }
+[data-testid="chatbot"], .chatbot, .chatbot-container,
+.message-row, .bubble-wrap, .message-bubble { background: #111128 !important; border-color: #1e1e4a !important; }
+[data-testid="chatbot"] .message, .chatbot .message { background: #1a1a3e !important; color: #e0e0ff !important; border-radius: 8px !important; }
+.bot .message-bubble, .message-row.bot .bubble-wrap { background: #151535 !important; }
+.user .message-bubble, .message-row.user .bubble-wrap { background: rgba(0,240,255,0.08) !important; }
+
+/* Force tab visibility */
+button[role="tab"], .tab-nav button, .tabs .tab-nav button {
+font-size: 0.95em !important;
+padding: 12px 24px !important;
+color: #aaaadd !important;
+font-weight: 700 !important;
+letter-spacing: 1px !important;
+text-transform: uppercase !important;
+background: #111128 !important;
+border: 2px solid #1e1e4a !important;
+border-radius: 8px !important;
+margin: 2px 4px !important;
+}
+button[role="tab"]:hover, .tab-nav button:hover {
+color: #00f0ff !important;
+border-color: #00f0ff !important;
+background: rgba(0,240,255,0.1) !important;
+}
+button[role="tab"][aria-selected="true"], button[role="tab"].selected,
+.tab-nav button.selected {
+color: #00f0ff !important;
+border-color: #00f0ff !important;
+background: rgba(0,240,255,0.15) !important;
+box-shadow: 0 0 12px rgba(0,240,255,0.3) !important;
+}
+
 @keyframes neonPulse {
 0%, 100% { box-shadow: 0 0 4px #39ff14; }
@@ -397,23 +427,22 @@ Sequential(
 """
 
 _RESULTS_MD = """
-## Training Progression
+## Run 7 Beats the Base Model
 
-7 GRPO runs with a **5-point KL beta sweep** {0, 0.2, 0.3, 0.5, 1.0} and a training pipeline overhaul between Runs 4 and 6.
+**avg_score 0.075 vs 1.7B base 0.063 — a 19% improvement.** Event planning: 0.201 vs base 0.138 (+46%).
 
-| Beta | Run | Avg Score | Key Finding |
-|------|-----|-----------|-------------|
-| 0.0 | Run 2 | 0.029 | Catastrophic collapse on event_planning |
-| 0.2 | Run 4 | 0.056 | Recovered event_planning, **beats base** (0.175 vs 0.138) |
-| 0.3 | Run 7 | *training* | Reward 0.48-0.73 (highest ever) |
-| 0.5 | Run 5 | *canceled* | Reward stuck at 0 (pre-fix pipeline) |
-| 1.0 | Run 6 | 0.061 | Nearly matches base (fixed pipeline) |
+| Beta | Run | Avg Score | Event Planning | Key Finding |
+|------|-----|-----------|----------------|-------------|
+| 0.0 | Run 2 | 0.029 | 0.000 | Catastrophic collapse |
+| 0.2 | Run 4 | 0.056 | **0.175** | Recovered event_planning, beats base |
+| **0.3** | **Run 7** | **0.075** | **0.201** | **BEATS BASE (+19% overall)** |
+| 1.0 | Run 6 | 0.061 | 0.119 | Nearly matches base (fixed pipeline) |
 
-### 4 Root Causes Fixed in Run 6
+### Training Pipeline Fixes (between Run 4 and Run 6)
 
 1. **Example contamination** — removed misleading field-name example
 2. **Sparse reward** — added plan-submission bonus + no-plan penalty
-3. **Missing required keys** — surfaced required fields in observations
+3. **Missing required keys** — surfaced required field names in observations
 4. **Role mismatch** — aligned training and eval prompt formats
 
 ---
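The percentages in the new results header can be reproduced from the rounded scores quoted in `_RESULTS_MD`. Those figures use a 1.7B base avg of 0.063, whereas the README table in this same commit lists the Qwen3-1.7B base at 0.0669, against which Run 7's 0.0754 is a smaller gain. A minimal check of the header's arithmetic, with `pct_gain` as an illustrative helper:

```python
# Rounded figures quoted in the new _RESULTS_MD header text.
RUN7_AVG, BASE_AVG = 0.075, 0.063
RUN7_EVENT, BASE_EVENT = 0.201, 0.138


def pct_gain(new: float, old: float) -> float:
    """Relative gain of `new` over `old` as a fraction (0.19 means +19%)."""
    return (new - old) / old


print(f"overall: {pct_gain(RUN7_AVG, BASE_AVG):+.0%}")  # overall: +19%
print(f"event_planning: {pct_gain(RUN7_EVENT, BASE_EVENT):+.0%}")  # event_planning: +46%
```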