"""Light theme CSS, SVG diagrams, and HTML content for the ESCTR Gradio UI."""
INJECT_CSS = """<style>
@import url('https://fonts.googleapis.com/css2?family=IBM+Plex+Mono:wght@300;400;500;600;700&display=swap');
.gradio-container{background:linear-gradient(135deg,#dbeafe 0%,#e0e7ff 40%,#ede9fe 70%,#ecfdf5 100%)!important;font-family:'IBM Plex Mono',monospace!important;color:#1e293b!important;max-width:960px!important;margin:0 auto!important}
/* Tabs: border-bottom highlight, no dark hover */
.tabs>.tab-nav{justify-content:center!important;border-bottom:none!important;gap:8px!important;padding:8px 0!important;background:transparent!important}
.tabs>.tab-nav>button{border:1px solid transparent!important;border-radius:0!important;padding:8px 20px!important;font-family:'IBM Plex Mono',monospace!important;font-weight:500!important;background:transparent!important;color:#64748b!important;font-size:13px!important;border-bottom:2px solid transparent!important}
.tabs>.tab-nav>button:hover{background:transparent!important;color:#1e293b!important;border-bottom:2px solid #94a3b8!important}
.tabs>.tab-nav>button.selected{border:1px solid #1e293b!important;border-radius:6px!important;color:#1e293b!important;font-weight:600!important;background:#fff!important}
.tabitem{background:transparent!important;border:none!important}
label,span{font-family:'IBM Plex Mono',monospace!important;color:#334155!important}
.prose{max-width:760px;margin:0 auto;line-height:1.8;color:#334155}
.prose h2{color:#0f172a;font-size:1.5rem;font-weight:700;margin:2.5rem 0 1rem;border-bottom:1px solid #e2e8f0;padding-bottom:8px}
.prose h3{color:#1e293b;font-size:1.15rem;font-weight:600;margin:1.8rem 0 0.8rem}
.prose p{margin:0.8rem 0;font-size:0.92rem}
.prose a{color:#4f46e5;text-decoration:none}
.prose code{background:#f1f5f9;padding:2px 6px;border-radius:4px;font-size:0.85em;color:#7c3aed;border:1px solid #e2e8f0}
.prose blockquote{border-left:3px solid #6366f1;padding:0.5rem 1rem;margin:1rem 0;color:#64748b;font-style:italic;background:#f8fafc;border-radius:0 6px 6px 0}
.prose table{width:100%;border-collapse:collapse;margin:1.2rem 0;font-size:0.85rem}
.prose th{background:#f1f5f9;text-align:left;padding:10px 12px;border:1px solid #e2e8f0;color:#1e293b;font-weight:600}
.prose td{padding:8px 12px;border:1px solid #e2e8f0;color:#334155}
.prose tr:hover td{background:#f8fafc}
.prose .formula{background:#f8fafc;border:1px solid #e2e8f0;border-radius:8px;padding:1rem 1.5rem;margin:1rem 0;text-align:center;font-size:1rem;color:#7c3aed;letter-spacing:0.02em}
.prose img{border-radius:8px;border:1px solid #e2e8f0;max-width:100%;margin:0.5rem 0;box-shadow:0 1px 3px rgba(0,0,0,0.08)}
.svgbox{text-align:center;margin:1.5rem 0}
.svgbox svg{max-width:100%}
.lb-table{width:100%;border-collapse:collapse;font-family:'IBM Plex Mono',monospace;font-size:0.85rem;margin:1rem 0}
.lb-table th{background:#f1f5f9;color:#1e293b;padding:12px 14px;text-align:left;border:1px solid #e2e8f0;font-weight:600}
.lb-table td{padding:10px 14px;border:1px solid #e2e8f0;color:#334155}
.lb-table tr:hover td{background:#f8fafc}
.lb-table .rank{color:#64748b;font-weight:600;text-align:center}
.lb-table .model{font-weight:500;color:#0f172a}
.lb-table .best{color:#16a34a;font-weight:700}
.lb-table .ongoing{color:#ca8a04;font-style:italic}
/* Form inputs: always white bg, dark text */
input,textarea,select,.gr-input,.gr-text-input{background:#fff!important;color:#1e293b!important;border-color:#cbd5e1!important;font-family:'IBM Plex Mono',monospace!important}
textarea{font-family:'IBM Plex Mono',monospace!important;font-size:0.82rem!important;color:#1e293b!important}
/* Buttons */
.gr-button{font-family:'IBM Plex Mono',monospace!important;color:#1e293b!important;background:#fff!important;border:1px solid #cbd5e1!important}
.gr-button:hover{background:#f1f5f9!important}
.gr-button.primary,.gr-button[variant="primary"],button.primary{background:#4f46e5!important;color:#fff!important;border-color:#4f46e5!important}
.gr-button.stop,.gr-button[variant="stop"],button.stop{background:#dc2626!important;color:#fff!important;border-color:#dc2626!important}
/* Panels, groups, boxes: white */
.gr-panel,.gr-box,.gr-form,.gr-group,.panel,.block{background:#fff!important;border-color:#e2e8f0!important}
/* Accordions: light bg, dark text headers */
.gr-accordion,.accordion{background:#f8fafc!important;border-color:#e2e8f0!important;border:1px solid #e2e8f0!important;border-radius:8px!important}
.gr-accordion>.label-wrap,.accordion>.label-wrap{background:#f8fafc!important;color:#1e293b!important}
.gr-accordion>.label-wrap *,.accordion>.label-wrap *{color:#1e293b!important}
.gr-accordion>.label-wrap:hover,.accordion>.label-wrap:hover{background:#f1f5f9!important}
/* Force dark text on light background: override Gradio 6 theme vars */
*{--body-text-color:#1e293b!important;--block-label-text-color:#334155!important;--block-title-text-color:#0f172a!important;--input-text-color:#1e293b!important;--color-accent:#4f46e5!important;--block-background-fill:#fff!important;--background-fill-secondary:#f8fafc!important;--border-color-primary:#e2e8f0!important;--block-border-color:#e2e8f0!important;--button-secondary-background-fill:#fff!important;--button-secondary-text-color:#1e293b!important;--button-secondary-border-color:#cbd5e1!important}
.gradio-container{color:#1e293b!important}
.gradio-container p,.gradio-container span,.gradio-container div,.gradio-container li,.gradio-container td,.gradio-container th,.gradio-container label,.gradio-container h1,.gradio-container h2,.gradio-container h3,.gradio-container h4,.gradio-container h5,.gradio-container h6{color:#1e293b!important;font-family:'IBM Plex Mono',monospace!important}
.gradio-container .prose p,.gradio-container .prose span,.gradio-container .prose li,.gradio-container .prose td{color:#334155!important}
.gradio-container .prose h2{color:#0f172a!important}
.gradio-container .prose h3{color:#1e293b!important}
.gradio-container .prose blockquote,.gradio-container .prose blockquote *{color:#64748b!important}
.gradio-container .prose code{color:#7c3aed!important}
.gradio-container .prose a{color:#4f46e5!important}
.gradio-container .prose th{color:#1e293b!important}
.gradio-container .lb-table th{color:#1e293b!important}
.gradio-container .lb-table td{color:#334155!important}
.gradio-container .lb-table .best{color:#16a34a!important}
.gradio-container .lb-table .ongoing,.gradio-container .lb-table .ongoing *{color:#ca8a04!important}
.gradio-container .lb-table .rank{color:#64748b!important}
.gradio-container .formula,.gradio-container .formula *{color:#7c3aed!important}
[data-testid] label,[data-testid] span{color:#334155!important}
.block-label,.block-title,.label-text{color:#334155!important}
/* Dropdown: light */
.dropdown,.gr-dropdown{background:#fff!important;color:#1e293b!important}
.dropdown li,.gr-dropdown li{color:#1e293b!important;background:#fff!important}
.dropdown li:hover,.gr-dropdown li:hover{background:#f1f5f9!important}
</style>"""
HEADER_HTML = """<div style="text-align:center;padding:2rem 1rem 0.5rem">
<h1 style="font-family:'IBM Plex Mono',monospace;font-size:2rem;font-weight:700;color:#0f172a;margin:0;letter-spacing:-0.02em">ESCTR</h1>
<p style="font-family:'IBM Plex Mono',monospace;font-size:0.95rem;color:#64748b;margin:4px 0;font-style:italic">Enterprise Supply Chain &amp; Tax Reconciliation</p>
<p style="font-family:'IBM Plex Mono',monospace;font-size:0.75rem;color:#94a3b8;margin:4px 0">
<a href="https://huggingface.co/spaces/musharraf7/esctr-environment/blob/main/Blog.md" style="color:#4f46e5;text-decoration:none">Blog</a> ·
<a href="https://github.com/Musharraf1128/esctr-environment" style="color:#4f46e5;text-decoration:none">GitHub</a> ·
<a href="https://huggingface.co/spaces/musharraf7/esctr-grpo-trained" style="color:#4f46e5;text-decoration:none">Training Dashboard</a>
</p></div>"""
ARCH_SVG = """<div class="svgbox"><svg width="720" height="320" viewBox="0 0 720 320" xmlns="http://www.w3.org/2000/svg">
<rect x="0" y="0" width="720" height="320" fill="#f8fafc" rx="8"/>
<rect x="260" y="10" width="450" height="300" rx="8" fill="none" stroke="#cbd5e1" stroke-width="1.5"/>
<text x="485" y="35" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="12" fill="#4f46e5" font-weight="600">ESCTR Environment</text>
<rect x="20" y="100" width="190" height="120" rx="6" fill="#fff" stroke="#4f46e5" stroke-width="1.5"/>
<text x="115" y="150" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="14" fill="#0f172a" font-weight="600">Agent</text>
<text x="115" y="172" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="10" fill="#64748b">(Qwen3 LLM)</text>
<text x="115" y="190" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="10" fill="#64748b">GRPO-trained</text>
<defs><marker id="ah" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto"><path d="M0,0 L8,3 L0,6" fill="#4f46e5"/></marker>
<marker id="ag" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto"><path d="M0,0 L8,3 L0,6" fill="#16a34a"/></marker></defs>
<line x1="210" y1="140" x2="280" y2="140" stroke="#4f46e5" stroke-width="1.5" marker-end="url(#ah)"/>
<text x="245" y="132" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="9" fill="#64748b">action</text>
<line x1="280" y1="180" x2="210" y2="180" stroke="#16a34a" stroke-width="1.5" marker-end="url(#ag)"/>
<text x="245" y="198" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="9" fill="#64748b">obs</text>
<rect x="290" y="60" width="200" height="200" rx="6" fill="#fff" stroke="#e2e8f0" stroke-width="1"/>
<text x="390" y="85" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="12" fill="#1e293b" font-weight="500">Tool Engine</text>
<text x="310" y="115" font-family="IBM Plex Mono,monospace" font-size="10" fill="#4f46e5">▸ query_database</text>
<text x="310" y="140" font-family="IBM Plex Mono,monospace" font-size="10" fill="#4f46e5">▸ read_document</text>
<text x="310" y="165" font-family="IBM Plex Mono,monospace" font-size="10" fill="#4f46e5">▸ communicate_vendor</text>
<text x="310" y="194" font-family="IBM Plex Mono,monospace" font-size="10" fill="#dc2626">▸ submit_financial_decision</text>
<text x="310" y="210" font-family="IBM Plex Mono,monospace" font-size="8" fill="#94a3b8"> (terminal action)</text>
<text x="310" y="238" font-family="IBM Plex Mono,monospace" font-size="9" fill="#64748b">Procedurally generated</text>
<text x="310" y="251" font-family="IBM Plex Mono,monospace" font-size="9" fill="#64748b">from seed → deterministic</text>
<rect x="530" y="80" width="160" height="140" rx="6" fill="#fff" stroke="#e2e8f0" stroke-width="1"/>
<text x="610" y="108" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="11" fill="#1e293b" font-weight="500">Reward Verifier</text>
<text x="545" y="135" font-family="IBM Plex Mono,monospace" font-size="9" fill="#16a34a">R_outcome 60-70%</text>
<text x="545" y="155" font-family="IBM Plex Mono,monospace" font-size="9" fill="#16a34a">R_trajectory 30-40%</text>
<text x="545" y="175" font-family="IBM Plex Mono,monospace" font-size="9" fill="#dc2626">- penalties</text>
<text x="610" y="205" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="10" fill="#7c3aed">R ∈ (0.01, 0.99)</text>
<line x1="490" y1="150" x2="530" y2="150" stroke="#cbd5e1" stroke-width="1" marker-end="url(#ah)"/>
</svg></div>"""
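# --- Illustrative sketch (not part of the ESCTR grader) ----------------------
# The "Reward Verifier" box above and the "Reward Design" section of BLOG_HTML
# describe R_total = alpha * R_outcome + beta * R_trajectory - penalties, with
# R kept inside (0.01, 0.99). The helper below is a minimal sketch of that
# formula, not the environment's actual verifier.
# Assumptions: the function name, its arguments, the default weights (0.7 / 0.3,
# picked from the stated 60-70% / 30-40% ranges) and the clamping step are
# hypothetical; the penalty sizes are the ones quoted in BLOG_HTML.
def sketch_total_reward(
    r_outcome: float,
    r_trajectory: float,
    steps: int = 0,
    hallucinations: int = 0,
    accepted_bad_settlement: bool = False,
    alpha: float = 0.7,
    beta: float = 0.3,
) -> float:
    """Combine outcome and trajectory scores, subtract penalties, then clamp."""
    # Step cost and hallucination penalty scale with their counts.
    penalties = 0.005 * steps + 0.02 * hallucinations
    # Accepting a bad vendor settlement is a one-off penalty.
    if accepted_bad_settlement:
        penalties += 0.20
    total = alpha * r_outcome + beta * r_trajectory - penalties
    # Keep the result within the documented range (bounds treated as inclusive
    # here for simplicity).
    return min(0.99, max(0.01, total))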
EPISODE_SVG = """<div class="svgbox"><svg width="600" height="300" viewBox="0 0 600 300" xmlns="http://www.w3.org/2000/svg">
<rect x="0" y="0" width="600" height="300" fill="#f8fafc" rx="8"/>
<text x="300" y="25" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="12" fill="#64748b" font-weight="500">Typical Episode Flow</text>
<defs><marker id="ah2" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto"><path d="M0,0 L8,3 L0,6" fill="#4f46e5"/></marker></defs>
<rect x="40" y="40" width="220" height="36" rx="4" fill="#fff" stroke="#4f46e5" stroke-width="1"/>
<text x="150" y="63" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="10" fill="#4f46e5">① query_database(POs)</text>
<line x1="150" y1="76" x2="150" y2="96" stroke="#cbd5e1" stroke-width="1" marker-end="url(#ah2)"/>
<rect x="40" y="96" width="220" height="36" rx="4" fill="#fff" stroke="#4f46e5" stroke-width="1"/>
<text x="150" y="119" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="10" fill="#4f46e5">② query_database(invoices)</text>
<line x1="150" y1="132" x2="150" y2="152" stroke="#cbd5e1" stroke-width="1" marker-end="url(#ah2)"/>
<rect x="40" y="152" width="220" height="36" rx="4" fill="#fff" stroke="#4f46e5" stroke-width="1"/>
<text x="150" y="175" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="10" fill="#4f46e5">③ read_document(PO-XXXX)</text>
<line x1="150" y1="188" x2="150" y2="208" stroke="#cbd5e1" stroke-width="1" marker-end="url(#ah2)"/>
<rect x="40" y="208" width="220" height="36" rx="4" fill="#fff" stroke="#4f46e5" stroke-width="1"/>
<text x="150" y="231" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="10" fill="#4f46e5">④ read_document(INV-XXXX)</text>
<line x1="150" y1="244" x2="150" y2="264" stroke="#cbd5e1" stroke-width="1" marker-end="url(#ah2)"/>
<rect x="40" y="264" width="220" height="36" rx="4" fill="#fff" stroke="#dc2626" stroke-width="1.5"/>
<text x="150" y="287" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="10" fill="#dc2626">⑤ submit_financial_decision</text>
<rect x="340" y="55" width="230" height="230" rx="6" fill="#fff" stroke="#e2e8f0" stroke-width="1"/>
<text x="455" y="80" text-anchor="middle" font-family="IBM Plex Mono,monospace" font-size="11" fill="#1e293b" font-weight="500">Agent Reasoning</text>
<text x="355" y="110" font-family="IBM Plex Mono,monospace" font-size="9" fill="#64748b">① Discover relevant PO IDs</text>
<text x="355" y="135" font-family="IBM Plex Mono,monospace" font-size="9" fill="#64748b">② Discover invoice IDs</text>
<text x="355" y="160" font-family="IBM Plex Mono,monospace" font-size="9" fill="#64748b">③ Cross-reference prices</text>
<text x="355" y="185" font-family="IBM Plex Mono,monospace" font-size="9" fill="#64748b">④ Calculate discrepancy</text>
<text x="355" y="215" font-family="IBM Plex Mono,monospace" font-size="9" fill="#16a34a">⑤ Submit exact adjustment</text>
<text x="355" y="245" font-family="IBM Plex Mono,monospace" font-size="9" fill="#7c3aed"> → Reward computed</text>
<text x="355" y="268" font-family="IBM Plex Mono,monospace" font-size="9" fill="#7c3aed"> → R = f(accuracy,</text>
<text x="355" y="281" font-family="IBM Plex Mono,monospace" font-size="9" fill="#7c3aed"> procedure, steps)</text>
</svg></div>"""
LEADERBOARD_HTML = """<div class="prose">
<h2 style="text-align:center">Model Leaderboard</h2>
<p style="text-align:center;color:#64748b;font-style:italic;font-size:0.85rem">All models trained on the ESCTR environment using TRL's GRPOTrainer with <code>environment_factory</code>.</p>
<table class="lb-table">
<thead><tr><th class="rank">#</th><th>Model</th><th>Params</th><th>Method</th><th>GPU</th><th>Peak Reward</th><th>Tool Calls</th><th>Failures</th><th>Time</th></tr></thead>
<tbody>
<tr><td class="rank">1</td><td class="model">Qwen3-0.6B</td><td>0.6B</td><td>GRPO</td><td>T4</td><td class="best">0.30</td><td>4.0</td><td>0</td><td>~2h</td></tr>
<tr><td class="rank">2</td><td class="model">Qwen3-4B (LoRA)</td><td>4B</td><td>GRPO + Shaped</td><td>RTX 4090</td><td class="best">0.27</td><td>4.0</td><td>0</td><td>71m</td></tr>
<tr><td class="rank">3</td><td class="model ongoing">Qwen3-1.7B (LoRA)</td><td>1.7B</td><td>GRPO + Shaped</td><td>T4 (HF)</td><td class="ongoing">0.195*</td><td>3.9</td><td>0</td><td>~7h</td></tr>
<tr style="opacity:0.45"><td class="rank">-</td><td class="model">Baseline (untrained)</td><td>-</td><td>-</td><td>-</td><td>0.09</td><td>1-4</td><td>frequent</td><td>-</td></tr>
</tbody></table>
<p style="font-size:0.8rem;color:#94a3b8">* In-progress run on HF Jobs. Peak reward as of step 20. Zero tool failures across all logged steps.</p>
<h3>Key Findings</h3>
<table>
<thead><tr><th>Metric</th><th>Untrained</th><th>Trained (best)</th></tr></thead>
<tbody>
<tr><td>Mean Reward</td><td>0.09</td><td><strong style="color:#16a34a">0.30</strong> (+233%)</td></tr>
<tr><td>Tool Success Rate</td><td>60%</td><td><strong style="color:#16a34a">100%</strong></td></tr>
<tr><td>Investigation Completeness</td><td>40%</td><td><strong style="color:#16a34a">100%</strong></td></tr>
<tr><td>Tool Calls / Episode</td><td>Erratic (1-4)</td><td><strong style="color:#16a34a">Stable 4.0</strong></td></tr>
</tbody></table>
</div>"""
PLOT_BASE = "https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots"
BLOG_HTML = f"""<div class="prose">
<blockquote>Training LLMs to investigate procurement fraud, enforce SLA penalties, and reject bad vendor settlements, autonomously.</blockquote>
<h2>The Problem</h2>
<p>Every day, enterprises process millions of procurement transactions. Between Purchase Orders, shipping manifests, SLA contracts, and vendor invoices, discrepancies are inevitable. A vendor bills <code>$45/unit</code> instead of the contracted <code>$40</code>. A shipment arrives 5 days late, triggering penalty clauses. The vendor disputes the penalty.</p>
<p>Resolving these disputes means humans manually cross-referencing siloed databases, interpreting contract clauses, and performing precise arithmetic under pressure. Current LLMs can't solve this reliably, not because the individual steps are hard, but because the <em>combination</em> is: multi-step tool use, precise arithmetic, adversarial reasoning, and state tracking across 10-20 interaction steps.</p>
<p>This is the capability gap that <strong>Reinforcement Learning with Verifiable Rewards (RLVR)</strong> was designed to close.</p>
<h2>The Environment</h2>
{ARCH_SVG}
<p>ESCTR gives the agent three scenarios of increasing complexity:</p>
<table>
<thead><tr><th>Task</th><th>Difficulty</th><th>What the Agent Must Do</th></tr></thead>
<tbody>
<tr><td><strong>Procurement Reconciliation</strong></td><td>🟢 Easy</td><td>Identify overcharged line items, calculate exact overcharge</td></tr>
<tr><td><strong>SLA Enforcement</strong></td><td>🟡 Medium</td><td>Discover late shipments, retrieve SLA contract, compute penalty</td></tr>
<tr><td><strong>Adversarial Auditing</strong></td><td>🔴 Hard</td><td>All above + disprove vendor counter-claims using warehouse logs</td></tr>
</tbody></table>
<p>Every scenario is <strong>procedurally generated from a seed</strong>: infinite training configurations with deterministic, reproducible grading. No memorization possible.</p>
<h2>Reward Design</h2>
<div class="formula">R<sub>total</sub> = α · R<sub>outcome</sub> + β · R<sub>trajectory</sub> − penalties</div>
<p><strong>R<sub>outcome</sub></strong> (60-70%): Did the agent submit the exact correct adjustment? <strong>R<sub>trajectory</sub></strong> (30-40%): Did the agent follow proper investigative procedure? <strong>Penalties</strong>: step costs (−0.005/step), hallucination (−0.02), accepting bad settlements (−0.20).</p>
<p>The correct answer is always a <strong>precise floating-point number</strong> derived from contract terms. No LLM-as-judge, no fuzzy rubric: pure programmatic verification.</p>
<h2>Training Journey</h2>
<h3>Phase 1: Proof of Concept (0.6B)</h3>
<p>Validated the training loop with Qwen3-0.6B on a T4 GPU. Reward improved from <strong>0.09 → 0.30</strong> (+233%) in 500 episodes. The model learned the canonical investigation procedure with zero tool failures.</p>
<div style="display:flex;gap:12px;flex-wrap:wrap">
<img src="{PLOT_BASE}/reward_curve.png" style="flex:1;min-width:280px" alt="0.6B reward curve"/>
<img src="{PLOT_BASE}/training_dashboard.png" style="flex:1;min-width:280px" alt="Training dashboard"/>
</div>
<h3>Phase 2: Scaling to 4B, and Hitting a Wall</h3>
<p>Scaled to Qwen3-4B on an RTX 4090 with LoRA. The first three attempts <strong>completely failed</strong>: loss stayed flat at 0.0.</p>
<p><strong>Problem 1: Token Budget Exhaustion.</strong> The model consumed its entire 512-token budget on <code>&lt;think&gt;</code> blocks before making a single tool call.</p>
<p><strong>Problem 2: Deterministic Starvation.</strong> At <code>temperature=1.0</code>, all K=4 rollouts were identical. Zero reward variance → zero gradient signal.</p>
<h3>Phase 2.5: The Fix</h3>
<p><strong>1. Shaped Rewards</strong>: +0.05 partial credit per valid investigation step.<br/>
<strong>2. High Temperature</strong>: T=1.5 with K=4 rollouts forced exploration diversity.</p>
<h3>Phase 3: Success with 4B in 71 Minutes</h3>
<div style="display:flex;gap:12px;flex-wrap:wrap">
<img src="{PLOT_BASE}/reward_curve_4b.png" style="flex:1;min-width:280px" alt="4B reward curve"/>
<img src="{PLOT_BASE}/tool_calls_4b.png" style="flex:1;min-width:280px" alt="4B tool discipline"/>
</div>
<p>The tool graph tells the story: early chaos (2-4.25 calls/episode) collapses into rigid discipline, with exactly 4.0 tool calls, the optimal investigate-investigate-investigate-submit pipeline.</p>
<h3>Phase 4: Iterating on 1.7B (HF Jobs)</h3>
<p>Launched on HuggingFace's T4-medium. Early metrics confirm the shaped reward architecture transfers cleanly to a different model size with <strong>zero modifications</strong>.</p>
<table>
<thead><tr><th>Step</th><th>Loss</th><th>Reward</th><th>Tool Calls</th><th>Entropy</th></tr></thead>
<tbody>
<tr><td>5</td><td>0.184</td><td><strong>0.195</strong></td><td>3.9</td><td>0.132</td></tr>
<tr><td>10</td><td>0.116</td><td>0.195</td><td>3.9</td><td>0.127</td></tr>
<tr><td>15</td><td>0.088</td><td>0.180</td><td>3.6</td><td>0.028</td></tr>
<tr><td>20</td><td>0.186</td><td>0.190</td><td>3.8</td><td>0.047</td></tr>
</tbody></table>
<h2>Technical Summary</h2>
<table>
<thead><tr><th>Param</th><th>0.6B</th><th>4B</th><th>1.7B</th></tr></thead>
<tbody>
<tr><td>Model</td><td>Qwen3-0.6B</td><td>Qwen3-4B</td><td>Qwen3-1.7B</td></tr>
<tr><td>GPU</td><td>T4 (Colab)</td><td>RTX 4090</td><td>T4 (HF Jobs)</td></tr>
<tr><td>Quant</td><td>None</td><td>4-bit QLoRA</td><td>4-bit QLoRA</td></tr>
<tr><td>Adapter</td><td>Full</td><td>LoRA r=16</td><td>LoRA r=16</td></tr>
<tr><td>Episodes</td><td>500</td><td>300</td><td>500</td></tr>
<tr><td>Time</td><td>~2h</td><td>71m</td><td>~7h</td></tr>
</tbody></table>
</div>"""
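# --- Illustrative sketch (not the TRL implementation) ------------------------
# "Deterministic Starvation" in BLOG_HTML above: GRPO scores each rollout
# relative to the other rollouts in its group, so K identical rewards carry no
# learning signal. The helper below is a minimal sketch of that effect,
# assuming the common mean/std normalisation of group rewards.
# Assumptions: the function name and the `eps` default are hypothetical.
from statistics import mean, pstdev


def sketch_group_advantages(rewards, eps=1e-4):
    """Return one group-relative advantage per rollout reward."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Identical rollouts: every advantage is zero, so there is nothing to learn from.
#   sketch_group_advantages([0.09, 0.09, 0.09, 0.09])
# Diverse rollouts (the goal of shaped rewards plus a higher temperature) give
# non-zero advantages with mixed signs.
#   sketch_group_advantages([0.09, 0.14, 0.19, 0.30])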