Spaces: agarwalanu3103/clarify-rl

Anurag Agarwal committed on Apr 26

Commit

50d2fcb

1 Parent(s): ca10a3a

Run 7 BEATS BASE (+19%) + UI fixes

Files changed (15)

README.md +4 -2
plots/01_reward_loss_curves.png +2 -2
plots/02_per_family_bars.png +2 -2
plots/03_component_breakdown.png +2 -2
plots/04_before_after.png +2 -2
plots/05_question_efficiency.png +2 -2
plots/06_same_base_delta.png +2 -2
plots/07_runs_summary_table.png +2 -2
plots/08_training_progression.png +2 -2
plots/09_training_diagnostics.png +2 -2
plots/runs_summary.json +16 -0
scripts/compare_runs.py +6 -0
scripts/make_plots.py +1 -0
scripts/make_progression_plot.py +1 -0
server/gradio_ui.py +41 -12

README.md CHANGED

@@ -144,7 +144,8 @@ A research lab could plug ClarifyRL in tomorrow as the "humility-shaping" stage
 | Qwen3-1.7B base | 0.0669 | 18% | — |
 | Qwen3-1.7B GRPO (Run 2, β=0) | 0.0286 ↓ | 6% | yes |
 | **Qwen3-1.7B GRPO (Run 4, β=0.2)** | **0.0560 ✅** | 14% | yes |
-| **Qwen3-1.7B GRPO (Run 6, β=1.0, fixed)** | **0.0607 ✅** | 16% | yes |
+| **Qwen3-1.7B GRPO (Run 7, β=0.3) ← BEST** | **0.0754 ✅ BEATS BASE** | **20%** | yes |
+| Qwen3-1.7B GRPO (Run 6, β=1.0, fixed) | 0.0607 | 16% | yes |
 | Qwen3-4B-Instruct | 0.0399 | 6% | — |
 | **Qwen3-4B base** ← real ceiling | **0.1446** | **24%** | — |
@@ -166,7 +167,8 @@ A research lab could plug ClarifyRL in tomorrow as the "humility-shaping" stage
 | Submission asset | Link |
 |---|---|
 | HF Space (env) | https://huggingface.co/spaces/agarwalanu3103/clarify-rl |
-| **⭐ Trained model — Qwen3-1.7B (Run 6, β=1.0, fixed fundamentals)** | **https://huggingface.co/Kanan2005/clarify-rl-grpo-qwen3-1-7b-run6** |
+| **⭐ Trained model — Qwen3-1.7B (Run 7, β=0.3, BEATS BASE)** | **https://huggingface.co/agarwalanu3103/clarify-rl-grpo-qwen3-1-7b-run7** |
+| Trained model — Qwen3-1.7B (Run 6, β=1.0, fixed pipeline) | https://huggingface.co/Kanan2005/clarify-rl-grpo-qwen3-1-7b-run6 |
 | Trained model — Qwen3-1.7B (Run 4, β=0.2 KL anchor) | https://huggingface.co/anurag203/clarify-rl-run4-qwen3-1.7b-beta0.2 |
 | Trained model — Qwen3-1.7B (Run 2, β=0, ablation regression) | https://huggingface.co/anurag203/clarify-rl-run2-qwen3-1.7b-no-kl |
 | Trained model — Qwen3-0.6B (Run 1, weak-base baseline) | https://huggingface.co/anurag203/clarify-rl-run1-qwen3-0.6b-no-kl |

plots/01_reward_loss_curves.png CHANGED

Git LFS Details

SHA256: 5fea0a0ba6fc75823446476a15228eb0c30b8ea450dd19e001902e3bfe728e6e
Pointer size: 131 Bytes
Size of remote file: 269 kB

Git LFS Details

SHA256: a7a4e4c3cd914d8a2a23fe95bc9d2ef20140253b05db50f5713e5a7590ec64d4
Pointer size: 131 Bytes
Size of remote file: 283 kB

plots/02_per_family_bars.png CHANGED

Git LFS Details

SHA256: f9ff130fb12c1673e223fa0fd368b6891168058ab8ce1300467caf3f7c0fc909
Pointer size: 130 Bytes
Size of remote file: 74.2 kB

Git LFS Details

SHA256: 5efd767cb3add8834d803fdd5d9057e16ba823256b999d9fdcedfddde9fd8366
Pointer size: 130 Bytes
Size of remote file: 77.4 kB

plots/03_component_breakdown.png CHANGED

Git LFS Details

SHA256: c3dfbcbf7953e714adaf948ac2ce9e7112c9758f8fe6866cc2becb2f903823ed
Pointer size: 130 Bytes
Size of remote file: 91 kB

Git LFS Details

SHA256: baff5673d984a99cd53340edb7abeb762b06a013aa0b8710758dd9b4b9454e3f
Pointer size: 130 Bytes
Size of remote file: 95.6 kB

plots/04_before_after.png CHANGED

Git LFS Details

SHA256: 64fd2caa15eef7bdd81484b7710d5d668254d73e240951695277b47fd336ed09
Pointer size: 130 Bytes
Size of remote file: 69.3 kB

Git LFS Details

SHA256: ebfa67bec166210d9d4cf44a443222904b62d7a72fe21f6fe8e5fb580dcc48ce
Pointer size: 130 Bytes
Size of remote file: 74.4 kB

plots/05_question_efficiency.png CHANGED

Git LFS Details

SHA256: da07cddf5476dc3391f3f98c455cbf285aad3593c692a948d1c1aefa6b1cb8d1
Pointer size: 130 Bytes
Size of remote file: 70.6 kB

Git LFS Details

SHA256: 0096750c9992eb6de1f4cf54633aa61c6f456b984e3ca70bdab8b0e8f1774836
Pointer size: 130 Bytes
Size of remote file: 76.8 kB

plots/06_same_base_delta.png CHANGED

Git LFS Details

SHA256: ec002b3751e2de49dea7e60b2a85e162a56bb86b2bbb986bf4cbe4bb1223c6de
Pointer size: 131 Bytes
Size of remote file: 110 kB

Git LFS Details

SHA256: c81d0af8c61bbf89325f804fa8e2204467d463f9f1de693964da453d0d2da767
Pointer size: 131 Bytes
Size of remote file: 116 kB

plots/07_runs_summary_table.png CHANGED

Git LFS Details

SHA256: 8c1f47c3fbb144d9e0cc1fecd22c305122baff15529c0ad7b99eaff724c970e3
Pointer size: 130 Bytes
Size of remote file: 94.4 kB

Git LFS Details

SHA256: ecb96a13ae928a1c531ff99a8fa05f0b639f2c7b6746997d9675878d78308734
Pointer size: 131 Bytes
Size of remote file: 103 kB

plots/08_training_progression.png CHANGED

Git LFS Details

SHA256: d753f8b6585ac357cc1e367b1f9d7526ea667141cc22d20e2bf93dd4f8716374
Pointer size: 131 Bytes
Size of remote file: 268 kB

Git LFS Details

SHA256: ed82495fc728134d0d9e1f2354e741ea8489bad2725ae0e6c0a3297d35421ce5
Pointer size: 131 Bytes
Size of remote file: 334 kB

plots/09_training_diagnostics.png CHANGED

Git LFS Details

SHA256: d19b0b6d3fbb7a706f757c52ac9e1221a579e9bdb32b1ea7a93a12a861b78ad7
Pointer size: 131 Bytes
Size of remote file: 211 kB

Git LFS Details

SHA256: 65b92efb5cf0c2fd7c2e2cbe5ecbbcb5a139b111190fba5b26835b303ab0faaf
Pointer size: 131 Bytes
Size of remote file: 263 kB

plots/runs_summary.json CHANGED

@@ -102,6 +102,22 @@
       "max_meeting_scheduling": 0.6,
       "max_support_triage": 0.0
     },
+    {
+      "label": "1.7B GRPO best (Run 7)",
+      "model": "agarwalanu3103/clarify-rl-grpo-qwen3-1-7b-run7",
+      "n": 50,
+      "avg_score": 0.0754010101010101,
+      "format_pass_rate": 0.0,
+      "completion_rate": 0.2,
+      "fam_event_planning": 0.20097643097643098,
+      "fam_medical_intake": 0.0,
+      "fam_meeting_scheduling": 0.12348484848484848,
+      "fam_support_triage": 0.0,
+      "max_event_planning": 0.5097222222222222,
+      "max_medical_intake": 0.0,
+      "max_meeting_scheduling": 0.425,
+      "max_support_triage": 0.0
+    },
     {
       "label": "4B base",
       "model": "Qwen/Qwen3-4B",

scripts/compare_runs.py CHANGED

@@ -92,6 +92,12 @@ RUN_SPECS: list[RunSpec] = [
         base_label="1.7B base",
         color="#0d47a1",
     ),
+    RunSpec(
+        label="1.7B GRPO best (Run 7)",
+        eval_path=Path("outputs/run_artifacts/1.7B-Run7/evals"),
+        base_label="1.7B base",
+        color="#ff6f00",
+    ),
     RunSpec(
         label="4B base",
         eval_path=Path("outputs/run_artifacts/4B-base/evals"),

scripts/make_plots.py CHANGED

@@ -66,6 +66,7 @@ _LABEL_COLORS: dict[str, str] = {
     "1.7B GRPO no-KL (Run 2)":  "#e53935",   # red — the regression run
     "1.7B GRPO +KL (Run 4)":    "#2e7d32",   # deep green — KL-anchored hero
     "1.7B GRPO fixed (Run 6)":  "#0d47a1",   # dark blue — fixed fundamentals
+    "1.7B GRPO best (Run 7)":   "#ff6f00",   # orange — best run, beats base
     "4B base":                  "#5e35b1",   # purple — ceiling marker
     "4B-instruct":              "#00838f",   # teal
     "4B GRPO (Run 3)":          "#ff6f00",   # amber

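Review note: in the hunk above, the new "1.7B GRPO best (Run 7)" entry is given `#ff6f00`, the same value the visible context already assigns to "4B GRPO (Run 3)". Whether the two series ever share a plot is not visible from this diff, so this may be harmless. A minimal sketch of a collision check follows. The dictionary is a hand-copied excerpt of only the entries shown in the hunk, and `find_color_collisions` is a hypothetical helper, not part of the repository.

```python
from collections import defaultdict

# Excerpt copied from the hunk above; the real _LABEL_COLORS may hold more entries.
LABEL_COLORS = {
    "1.7B GRPO no-KL (Run 2)": "#e53935",
    "1.7B GRPO +KL (Run 4)": "#2e7d32",
    "1.7B GRPO fixed (Run 6)": "#0d47a1",
    "1.7B GRPO best (Run 7)": "#ff6f00",
    "4B base": "#5e35b1",
    "4B-instruct": "#00838f",
    "4B GRPO (Run 3)": "#ff6f00",
}


def find_color_collisions(colors):
    """Return {color: [labels]} for every color used by more than one label."""
    by_color = defaultdict(list)
    for label, color in colors.items():
        # Normalize case so "#FF6F00" and "#ff6f00" count as the same color.
        by_color[color.lower()].append(label)
    return {c: labels for c, labels in by_color.items() if len(labels) > 1}


collisions = find_color_collisions(LABEL_COLORS)
for color, labels in collisions.items():
    print(f"{color} is shared by: {', '.join(labels)}")
```

If both runs can appear on the same axes, giving Run 7 a distinct hex value would keep the legend unambiguous.
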
scripts/make_progression_plot.py CHANGED

@@ -32,6 +32,7 @@ _LABEL_COLORS: dict[str, str] = {
     "1.7B GRPO no-KL (Run 2)":  "#e53935",
     "1.7B GRPO +KL (Run 4)":    "#2e7d32",
     "1.7B GRPO fixed (Run 6)":  "#0d47a1",
+    "1.7B GRPO best (Run 7)":   "#ff6f00",
     "4B base":                  "#5e35b1",
     "4B-instruct":              "#00838f",
     "4B GRPO (Run 3)":          "#ff6f00",

server/gradio_ui.py CHANGED

@@ -85,7 +85,37 @@ button.primary:hover { box-shadow: 0 0 30px rgba(0,240,255,0.5), 0 0 60px rgba(2
 input, select, textarea, [data-testid="textbox"], .wrap { background: #111128 !important; color: #e0e0ff !important; border-color: #1e1e4a !important; border-radius: 8px !important; }
 label, .label-text { color: #8888bb !important; }
-[data-testid="chatbot"], .chatbot { background: #111128 !important; border: 1px solid #1e1e4a !important; border-radius: 12px !important; }
+[data-testid="chatbot"], .chatbot, .chatbot-container,
+.message-row, .bubble-wrap, .message-bubble { background: #111128 !important; border-color: #1e1e4a !important; }
+[data-testid="chatbot"] .message, .chatbot .message { background: #1a1a3e !important; color: #e0e0ff !important; border-radius: 8px !important; }
+.bot .message-bubble, .message-row.bot .bubble-wrap { background: #151535 !important; }
+.user .message-bubble, .message-row.user .bubble-wrap { background: rgba(0,240,255,0.08) !important; }
+/* Force tab visibility */
+button[role="tab"], .tab-nav button, .tabs .tab-nav button {
+    font-size: 0.95em !important;
+    padding: 12px 24px !important;
+    color: #aaaadd !important;
+    font-weight: 700 !important;
+    letter-spacing: 1px !important;
+    text-transform: uppercase !important;
+    background: #111128 !important;
+    border: 2px solid #1e1e4a !important;
+    border-radius: 8px !important;
+    margin: 2px 4px !important;
+}
+button[role="tab"]:hover, .tab-nav button:hover {
+    color: #00f0ff !important;
+    border-color: #00f0ff !important;
+    background: rgba(0,240,255,0.1) !important;
+}
+button[role="tab"][aria-selected="true"], button[role="tab"].selected,
+.tab-nav button.selected {
+    color: #00f0ff !important;
+    border-color: #00f0ff !important;
+    background: rgba(0,240,255,0.15) !important;
+    box-shadow: 0 0 12px rgba(0,240,255,0.3) !important;
+}
 @keyframes neonPulse {
     0%, 100% { box-shadow: 0 0 4px #39ff14; }
@@ -397,23 +427,22 @@ Sequential(
 """
 _RESULTS_MD = """
-## Training Progression
-7 GRPO runs with a **5-point KL beta sweep** {0, 0.2, 0.3, 0.5, 1.0} and a training pipeline overhaul between Runs 4 and 6.
-| Beta | Run | Avg Score | Key Finding |
-|------|-----|-----------|-------------|
-| 0.0 | Run 2 | 0.029 | Catastrophic collapse on event_planning |
-| 0.2 | Run 4 | 0.056 | Recovered event_planning, **beats base** (0.175 vs 0.138) |
-| 0.3 | Run 7 | *training* | Reward 0.48-0.73 (highest ever) |
-| 0.5 | Run 5 | *canceled* | Reward stuck at 0 (pre-fix pipeline) |
-| 1.0 | Run 6 | 0.061 | Nearly matches base (fixed pipeline) |
-### 4 Root Causes Fixed in Run 6
+## Run 7 Beats the Base Model
+**avg_score 0.075 vs 1.7B base 0.063 — a 19% improvement.** Event planning: 0.201 vs base 0.138 (+46%).
+| Beta | Run | Avg Score | Event Planning | Key Finding |
+|------|-----|-----------|----------------|-------------|
+| 0.0 | Run 2 | 0.029 | 0.000 | Catastrophic collapse |
+| 0.2 | Run 4 | 0.056 | **0.175** | Recovered event_planning, beats base |
+| **0.3** | **Run 7** | **0.075** | **0.201** | **BEATS BASE (+19% overall)** |
+| 1.0 | Run 6 | 0.061 | 0.119 | Nearly matches base (fixed pipeline) |
+### Training Pipeline Fixes (between Run 4 and Run 6)
 1. **Example contamination** &mdash; removed misleading field-name example
 2. **Sparse reward** &mdash; added plan-submission bonus + no-plan penalty
-3. **Missing required keys** &mdash; surfaced required fields in observations
+3. **Missing required keys** &mdash; surfaced required field names in observations
 4. **Role mismatch** &mdash; aligned training and eval prompt formats
 ---
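
Review note: the "+19%" headline depends on which 1.7B base score is used. The README table in this same commit lists the base at 0.0669, while the `_RESULTS_MD` text compares against 0.063. The sketch below recomputes the relative gain under both figures, using only numbers that appear in this diff. `relative_gain` is a hypothetical helper, and which base score is the correct one cannot be determined from the diff alone.

```python
def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# Scores as printed in this commit (rounded as shown in the diff).
run7_avg = 0.0754          # README table; runs_summary.json has 0.07540101...
base_avg_readme = 0.0669   # "Qwen3-1.7B base" row in the README table
base_avg_ui = 0.063        # base figure quoted in _RESULTS_MD

run7_event = 0.201         # event_planning, Run 7
base_event = 0.138         # event_planning, base (per _RESULTS_MD)

gain_vs_readme = relative_gain(run7_avg, base_avg_readme)
gain_vs_ui = relative_gain(run7_avg, base_avg_ui)
gain_event = relative_gain(run7_event, base_event)

print(f"avg_score vs README base: {gain_vs_readme:+.1f}%")
print(f"avg_score vs UI base:     {gain_vs_ui:+.1f}%")
print(f"event_planning vs base:   {gain_event:+.1f}%")
```

Against the UI's 0.063 the gain is about 19.7%, in line with the commit title. Against the README's 0.0669 it is about 12.7%. The event-planning figure comes to about 45.7%, consistent with the quoted +46%. Reconciling the two base values in one place would make the headline claim easier to verify.
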