| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta charset="utf-8"> |
| <meta name="viewport" content="width=device-width, initial-scale=1"> |
| <title>ReasoningEconomicsEnv: An OpenEnv Benchmark Where LLMs Learn to Budget Their Own Thinking Across a Shared-Budget Episode</title> |
| <meta name="description" content="ReasoningEconomicsEnv: an OpenEnv-native RL environment for sequential, shared-budget reasoning. GRPO training pipeline with TRL 1.0 rollout_func, verifiable math grading, and tokenizer-native budget accounting."> |
| <link rel="preconnect" href="https://fonts.googleapis.com"> |
| <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin> |
| <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700;800&family=JetBrains+Mono:wght@400;600&display=swap" rel="stylesheet"> |
| |
| <script type="module"> |
| import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs'; |
| mermaid.initialize({ |
| startOnLoad: true, |
| theme: 'dark', |
| themeVariables: { |
| primaryColor: '#6366f1', |
| primaryTextColor: '#e2e8f0', |
| primaryBorderColor: '#818cf8', |
| lineColor: '#818cf8', |
| secondaryColor: '#1e293b', |
| tertiaryColor: '#172033', |
| background: '#0f172a', |
| mainBkg: '#1e293b', |
| nodeBorder: '#818cf8', |
| clusterBkg: '#172033', |
| clusterBorder: '#334155', |
| titleColor: '#e2e8f0', |
| edgeLabelBackground: '#1e293b', |
| nodeTextColor: '#e2e8f0' |
| }, |
| flowchart: { curve: 'basis', htmlLabels: true }, |
| fontFamily: 'Inter, sans-serif' |
| }); |
| </script> |
| <style> |
| :root { |
| --bg: #0f172a; --surface: #1e293b; --surface-2: #172033; --border: #334155; |
| --text: #e2e8f0; --muted: #94a3b8; --accent: #6366f1; |
| --accent2: #818cf8; --green: #22c55e; --red: #ef4444; |
| --orange: #f59e0b; --radius: 12px; |
| } |
| * { margin: 0; padding: 0; box-sizing: border-box; } |
| html { scroll-behavior: smooth; } |
| body { font-family: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; |
| background: var(--bg); color: var(--text); line-height: 1.7; |
| -webkit-font-smoothing: antialiased; } |
| .container { max-width: 820px; margin: 0 auto; padding: 2rem 1.5rem 4rem; } |
| |
| .topnav { position: sticky; top: 0; z-index: 10; background: rgba(15,23,42,.85); |
| backdrop-filter: blur(10px); border-bottom: 1px solid var(--border); |
| padding: .9rem 1.5rem; display: flex; justify-content: space-between; |
| align-items: center; font-size: .88rem; } |
| .topnav .brand { font-weight: 700; color: var(--text); text-decoration: none; |
| display: flex; align-items: center; gap: .5rem; } |
| .topnav .brand .dot { width: 8px; height: 8px; border-radius: 50%; |
| background: var(--green); box-shadow: 0 0 8px rgba(34,197,94,.6); } |
| .topnav .links { display: flex; gap: 1.25rem; } |
| .topnav .links a { color: var(--muted); text-decoration: none; transition: color .15s; } |
| .topnav .links a:hover { color: var(--accent2); } |
| |
| .hero { text-align: center; padding: 4rem 0 2.5rem; } |
| .hero-badge { display: inline-block; background: rgba(99,102,241,.15); color: var(--accent2); |
| padding: .4rem 1.1rem; border-radius: 20px; font-size: .78rem; font-weight: 600; |
| letter-spacing: .08em; margin-bottom: 1.25rem; |
| border: 1px solid rgba(99,102,241,.3); text-transform: uppercase; } |
| .hero h1 { font-size: clamp(2rem, 4.2vw, 3.2rem); font-weight: 800; letter-spacing: -.025em; |
| line-height: 1.15; |
| background: linear-gradient(135deg, #e2e8f0 25%, #6366f1 100%); |
| -webkit-background-clip: text; -webkit-text-fill-color: transparent; |
| background-clip: text; } |
| .hero .subtitle { color: var(--muted); font-size: 1.15rem; max-width: 640px; |
| margin: 1rem auto 0; } |
| .hero .byline { color: var(--muted); font-size: .85rem; margin-top: 1.5rem; |
| font-style: italic; } |
| .banner { width: 100%; border-radius: var(--radius); margin: 2rem 0 3rem; |
| border: 1px solid var(--border); } |
| |
| .badges { display: flex; justify-content: center; gap: .6rem; flex-wrap: wrap; |
| margin: 1.5rem 0; } |
| .badges img { height: 22px; } |
| |
| .btn-group { display: flex; gap: .75rem; justify-content: center; margin: 2rem 0; |
| flex-wrap: wrap; } |
| .btn { display: inline-flex; align-items: center; gap: .45rem; padding: .6rem 1.35rem; |
| background: var(--accent); color: white; border-radius: 8px; font-size: .88rem; |
| font-weight: 600; text-decoration: none; transition: all .2s; } |
| .btn:hover { background: var(--accent2); transform: translateY(-1px); } |
| .btn-outline { background: transparent; border: 1px solid var(--border); color: var(--text); } |
| .btn-outline:hover { border-color: var(--accent); color: var(--accent2); |
| background: rgba(99,102,241,.08); } |
| |
| .toc { background: var(--surface); border: 1px solid var(--border); border-radius: var(--radius); |
| padding: 1.25rem 1.5rem; margin: 0 0 2.5rem; } |
| .toc h3 { font-size: .82rem; font-weight: 700; letter-spacing: .08em; text-transform: uppercase; |
| color: var(--accent2); margin-bottom: .85rem; } |
| .toc ol { list-style: none; counter-reset: toc; display: flex; flex-wrap: wrap; gap: .35rem .8rem; |
| margin: 0; padding: 0; } |
| .toc ol li { counter-increment: toc; font-size: .88rem; } |
| .toc ol li::before { content: counter(toc) "."; color: var(--accent); font-weight: 700; |
| font-size: .8rem; margin-right: .3rem; } |
| .toc ol li a { color: var(--muted); text-decoration: none; transition: color .15s; } |
| .toc ol li a:hover { color: var(--accent2); } |
| |
| section { margin: 3.5rem 0; } |
| section h2 { font-size: 1.55rem; font-weight: 800; letter-spacing: -.01em; |
| margin-bottom: 1rem; color: var(--text); |
| border-left: 3px solid var(--accent); padding-left: .9rem; } |
| section h3 { font-size: 1.1rem; font-weight: 700; margin: 2rem 0 .75rem; |
| color: var(--accent2); } |
| section p { color: #cbd5e1; margin-bottom: 1rem; font-size: 1.02rem; } |
| section p strong { color: var(--text); } |
| section ul, section ol { color: #cbd5e1; margin: 1rem 0 1rem 1.5rem; } |
| section ul li, section ol li { margin-bottom: .5rem; font-size: 1rem; } |
| section ul li strong, section ol li strong { color: var(--text); } |
| |
| blockquote { border-left: 3px solid var(--accent2); |
| background: rgba(99,102,241,.06); padding: 1.1rem 1.25rem; |
| margin: 1.5rem 0; border-radius: 0 8px 8px 0; |
| color: #e2e8f0; font-size: 1.02rem; } |
| |
| .table-wrap { margin: 1.5rem 0; overflow-x: auto; |
| background: var(--surface); border: 1px solid var(--border); |
| border-radius: var(--radius); } |
| table { width: 100%; border-collapse: collapse; font-size: .92rem; } |
| th { background: rgba(99,102,241,.1); color: var(--accent2); |
| font-size: .72rem; font-weight: 700; letter-spacing: .06em; |
| text-transform: uppercase; padding: .85rem 1rem; text-align: left; } |
| td { padding: .7rem 1rem; border-top: 1px solid var(--border); color: #cbd5e1; } |
| td.num { text-align: right; font-variant-numeric: tabular-nums; |
| font-family: 'JetBrains Mono', monospace; font-size: .88rem; } |
| tr:hover td { background: rgba(99,102,241,.04); } |
| td strong, th strong { color: var(--text); } |
| .task-id { font-family: 'JetBrains Mono', monospace; font-weight: 700; |
| color: var(--accent2); font-size: .85rem; } |
| tr.avg-row td { background: rgba(99,102,241,.08); font-weight: 700; |
| color: var(--text); } |
| tr.novel td:first-child { color: #fca5a5; } |
| |
| pre { background: #0b1120; border: 1px solid var(--border); |
| border-radius: var(--radius); padding: 1.1rem 1.25rem; overflow-x: auto; |
| margin: 1.25rem 0; font-family: 'JetBrains Mono', monospace; |
| font-size: .85rem; line-height: 1.6; color: #d1d5db; } |
| pre .c { color: #64748b; } |
| code { font-family: 'JetBrains Mono', monospace; font-size: .88em; |
| background: rgba(99,102,241,.12); color: var(--accent2); |
| padding: .1em .35em; border-radius: 4px; } |
| pre code { background: none; color: inherit; padding: 0; font-size: 1em; } |
| |
| figure { margin: 2rem 0; } |
| figure img { width: 100%; border-radius: var(--radius); |
| border: 1px solid var(--border); } |
| figcaption { text-align: center; color: var(--muted); font-size: .85rem; |
| margin-top: .75rem; } |
| |
| .mermaid-wrap { margin: 2rem 0; background: var(--surface); border: 1px solid var(--border); |
| border-radius: var(--radius); padding: 1.5rem 1rem; overflow-x: auto; } |
| .mermaid-wrap .mermaid { display: flex; justify-content: center; } |
| .mermaid-caption { text-align: center; color: var(--muted); font-size: .85rem; |
| margin-top: .75rem; } |
| |
| .episode-trace { background: var(--surface); border: 1px solid var(--border); |
| border-radius: var(--radius); padding: 1.25rem 1.5rem; margin: 1.5rem 0; |
| position: relative; } |
| .episode-trace::before { content: ''; position: absolute; left: 1.5rem; top: 2.5rem; |
| bottom: 1.25rem; width: 2px; background: var(--border); } |
| .trace-step { position: relative; padding-left: 2rem; margin-bottom: 1.25rem; } |
| .trace-step:last-child { margin-bottom: 0; } |
| .trace-step .step-marker { position: absolute; left: -.45rem; top: .2rem; width: 12px; |
| height: 12px; border-radius: 50%; border: 2px solid var(--accent); |
| background: var(--bg); z-index: 1; } |
| .trace-step .step-marker.terminal { background: var(--red); border-color: var(--red); } |
| .trace-step .step-marker.good { background: var(--green); border-color: var(--green); } |
| .trace-step .step-label { font-family: 'JetBrains Mono', monospace; font-size: .78rem; |
| color: var(--accent2); font-weight: 700; margin-bottom: .25rem; } |
| .trace-step .step-content { font-size: .9rem; color: #cbd5e1; } |
| .trace-step .step-content code { font-size: .82em; } |
| .trace-verdict { margin-top: 1rem; padding: .75rem 1rem; border-radius: 8px; |
| font-size: .9rem; font-weight: 600; } |
| .trace-verdict.bad { background: rgba(239,68,68,.1); border: 1px solid rgba(239,68,68,.3); |
| color: #fca5a5; } |
| .trace-verdict.good { background: rgba(34,197,94,.1); border: 1px solid rgba(34,197,94,.3); |
| color: #86efac; } |
| |
| .callout { text-align: center; padding: 2rem 1.5rem; margin: 3rem 0; |
| background: linear-gradient(135deg, rgba(99,102,241,.08), rgba(129,140,248,.04)); |
| border: 1px solid rgba(99,102,241,.25); border-radius: var(--radius); } |
| .callout .q { font-size: 1.25rem; font-weight: 700; color: var(--text); |
| font-style: italic; margin-bottom: .5rem; } |
| .callout .sub { color: var(--muted); font-size: .95rem; } |
| |
| .footer { text-align: center; padding: 3rem 0 1rem; color: var(--muted); |
| font-size: .85rem; border-top: 1px solid var(--border); margin-top: 3rem; } |
| .footer a { color: var(--accent2); text-decoration: none; margin: 0 .5rem; } |
| .footer a:hover { text-decoration: underline; } |
| @media (max-width: 640px) { |
| .container { padding: 1rem 1rem 3rem; } |
| .hero { padding: 2.5rem 0 1.5rem; } |
| .topnav .links { display: none; } |
| section h2 { font-size: 1.3rem; } |
| table { font-size: .82rem; } |
| th, td { padding: .55rem .6rem; } |
| .toc ol { flex-direction: column; } |
| .episode-trace { padding: 1rem; } |
| .episode-trace::before { left: 1rem; } |
| } |
| |
| .math-display { |
| margin: 1.25rem 0; |
| padding: 1rem 1.25rem 1.15rem; |
| overflow-x: auto; |
| background: var(--surface); |
| border: 1px solid var(--border); |
| border-radius: var(--radius); |
| text-align: center; |
| } |
| .math-display mjx-container[jax="CHTML"][display="true"] { margin: 0.65em 0 !important; } |
| .math-display mjx-container { color: #e2e8f0 !important; } |
| .math-note { font-size: .9rem; color: var(--muted); margin-top: .35rem; margin-bottom: 0; } |
| </style> |
| |
| <script> |
| window.MathJax = { |
| tex: { |
| inlineMath: [['\\(', '\\)']], |
| displayMath: [['\\[', '\\]']] |
| }, |
| options: { |
| renderActions: { |
| addMenu: [0, '', ''] |
| } |
| } |
| }; |
| </script> |
| <script defer src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" id="MathJax-script"></script> |
| </head> |
| <body> |
|
|
| <nav class="topnav"> |
| <a href="#top" class="brand"><span class="dot"></span> ReasoningEconomicsEnv Blog</a> |
| <div class="links"> |
| <a href="#why">Motivation</a> |
| <a href="#env-design">Env Design</a> |
| <a href="#scoring">Scoring</a> |
| <a href="#traces">Pathology</a> |
| <a href="#results">Results</a> |
| <a href="#engineering">Engineering</a> |
| <a href="#positioning">Positioning</a> |
| <a href="https://huggingface.co/spaces/yashu2000/reasoning-economic-env" target="_blank">Live Space ↗</a> |
| </div> |
| </nav> |
|
|
| <div class="container" id="top"> |
|
|
| <div class="hero"> |
| <div class="hero-badge">OpenEnv · AgentX OpenEnv Track</div> |
| <h1>ReasoningEconomicsEnv</h1> |
| <p class="subtitle">An OpenEnv Benchmark Where LLMs Learn to Budget Their Own Thinking Across a Shared-Budget Episode.</p> |
| <div class="badges"> |
| <a href="https://github.com/sharma-yash01/ReasoningEconomicsEnv" target="_blank"><img src="https://img.shields.io/badge/GitHub-Environment-181717?logo=github" alt="GitHub Env"/></a> |
| <a href="https://github.com/sharma-yash01/ReasoningEconomicsPT" target="_blank"><img src="https://img.shields.io/badge/GitHub-Training-181717?logo=github" alt="GitHub PT"/></a> |
| <a href="https://huggingface.co/spaces/yashu2000/reasoning-economic-env" target="_blank"><img src="https://img.shields.io/badge/HF%20Space-Live%20Env-FFD21E?logo=huggingface&logoColor=black" alt="HF Space"/></a> |
| <img src="https://img.shields.io/badge/Qwen3--14B-8xA100%20%7C%20ZeRO--3%20%2B%20Unsloth%20LoRA%20%2B%20vLLM%20TP%3D2-6366f1" alt="Qwen3-14B · 8xA100 · ZeRO-3 + Unsloth LoRA + vLLM TP=2"/> |
| <img src="https://img.shields.io/badge/Mean%20Reward-%2B0.47%20%7C%20480%20eps-22c55e" alt="Mean reward +0.47, 480 eps"/> |
| <img src="https://img.shields.io/badge/OpenEnv-Native-4B8BBE" alt="OpenEnv"/> |
| <img src="https://img.shields.io/badge/Training-GRPO%20%7C%20TRL%201.0-orange" alt="GRPO | TRL 1.0"/> |
| <img src="https://img.shields.io/badge/Budget%20Modes-Hard%20%7C%20Soft-8A2BE2" alt="Budget Modes"/> |
| </div> |
| <div class="byline">AgentX OpenEnv Track · UC Berkeley RDI | Yashaswi Sharma, Harshawn Singh, Ryu Lun, Prabhu Pugalenthi, Khushi Kumari, Defu Cao, Muyan Weng</div> |
| </div> |
|
|
| <img src="banner.png" alt="ReasoningEconomicsEnv: allocate a shared budget, solve each question, pay in tokens and correctness" class="banner"/> |
|
|
| <div class="btn-group"> |
| <a class="btn" href="https://huggingface.co/spaces/yashu2000/reasoning-economic-env" target="_blank">Live Environment Space →</a> |
| <a class="btn btn-outline" href="https://github.com/sharma-yash01/ReasoningEconomicsEnv" target="_blank">GitHub Repo</a> |
| </div> |
|
|
| |
| <nav class="toc" id="toc"> |
| <h3>Contents</h3> |
| <ol> |
| <li><a href="#why">Motivation: allocation, not compression</a></li> |
| <li><a href="#matters">Why a shared budget changes the problem</a></li> |
| <li><a href="#prior-work">Prior work & novelty</a></li> |
| <li><a href="#design">What ReasoningEconomicsEnv is</a></li> |
| <li><a href="#env-design">Environment design</a></li> |
| <li><a href="#openenv">Why OpenEnv</a></li> |
| <li><a href="#scoring">Scoring: per-step + terminal reward</a></li> |
| <li><a href="#architecture">Architecture & training pipeline</a></li> |
| <li><a href="#traces">Training pathology & zero-advantage collapse</a></li> |
| <li><a href="#results">Results: what we found</a></li> |
| <li><a href="#engineering">Engineering lessons</a></li> |
| <li><a href="#positioning">Positioning: online, sequential, shared-budget</a></li> |
| <li><a href="#foundations">Foundations & citations</a></li> |
| <li><a href="#quickstart">Quick start</a></li> |
| <li><a href="#future">Future work</a></li> |
| <li><a href="#conclusion">Conclusion</a></li> |
| </ol> |
| </nav> |
|
|
| |
| <section id="why"> |
| <h2>Motivation: reasoning LLMs do not allocate tokens; they spend them</h2> |
| <p>Frontier reasoning models — DeepSeek-R1, QwQ, Qwen3 thinking mode, the o-series — over-spend tokens on easy items and under-spend on hard ones. Chain-of-thought length is only weakly correlated with ground-truth difficulty: trivial arithmetic consumes thousands of thinking tokens, and genuinely hard items get truncated before a <code>\boxed{}</code>. The finding is repeated across Han et al.'s Token-Budget-Aware LLM (<a href="#prior-work">arXiv:2412.18547</a>), Xu et al.'s Chain-of-Draft (<a href="#prior-work">arXiv:2502.18600</a>), and Moonshot AI's Kimi K1.5 Long2Short ablations (<a href="#prior-work">arXiv:2501.12599</a>).</p> |
| <p>Inference tokens are not a per-query resource. In any real deployment — an exam battery, an eval suite, a multi-turn tool loop, a long-horizon agent — they are a <strong>shared, capped resource across a sequence of prompts</strong>. Misallocation is not just slow; it is lost accuracy per dollar. A deployed reasoner has to infer difficulty from text alone, decide what a prompt is worth <em>given what's left</em>, conserve on easy items so it can invest on hard ones, and pace itself under irrecoverable depletion.</p> |
|
|
| <h3>What existing work does not solve — the Long2Short delta</h3> |
| <p>The four families in <a href="#prior-work">Prior Work</a> each cover one axis of reasoning-length control, and all of them leave the same axis empty: a budget shared across prompts.</p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Family</th><th>What it optimizes</th><th>Axis still empty</th></tr></thead> |
| <tbody> |
| <tr><td><strong>Prompt-Guided</strong><br><span style="font-size:.85em;color:var(--muted)">Token-Budget, Chain-of-Draft, CCoT, Token Complexity</span></td><td>Shorten a single chain via prompting</td><td>No cross-prompt budget; no learning</td></tr> |
| <tr><td><strong>RL with Length Reward</strong><br><span style="font-size:.85em;color:var(--muted)">L1/LCPO, O1-Pruner, <strong>Kimi K1.5 Long2Short</strong>, DAST, SelfBudgeter</span></td><td>RL-trained per-response length control. Long2Short distills a long-reasoning teacher into a shorter policy; the reward is conditioned on <em>one</em> chain's length.</td><td>The policy cannot trade tokens from Q1 to Q7 — it is never shown a shared budget</td></tr> |
| <tr><td><strong>SFT / Distillation</strong><br><span style="font-size:.85em;color:var(--muted)">CoT-Valve, TokenSkip, Z1</span></td><td>Bake shorter reasoning into the weights</td><td>Still per-prompt; no episode state</td></tr> |
| <tr><td><strong>Dynamic Early Exit</strong><br><span style="font-size:.85em;color:var(--muted)">Dynasor-CoT, Budget Forcing / s1, DEER</span></td><td>Decoding-time termination within one prompt</td><td>The policy has no knowledge that another prompt downstream will also need tokens</td></tr> |
| </tbody> |
| </table> |
| </div> |
| <p><strong>The delta in one sentence.</strong> Kimi K1.5 Long2Short asks <em>"when should I stop this chain?"</em>; ReasoningEconomicsEnv asks <em>"how should I split my budget across these N chains?"</em> — a different action space (portfolio over prompts) under a different reward surface (joint accuracy × utilization across a battery). Long2Short has no notion of a shared episode budget; it cannot express the trade "save 400 tokens on Q1 so I have them on Q4" because Q1 and Q4 never share state in its MDP.</p> |
| <p>Long2Short is the offline, single-chain limit of this MDP (<code>N=1</code>, no shared budget). A numerical comparison would require collapsing our environment to <code>N=1</code>, at which point the two formulations coincide by construction. We therefore frame Long2Short as a <em>special case</em> of our formulation rather than a competing baseline.</p> |
|
|
| <h3>What ReasoningEconomicsEnv is</h3> |
| <p><strong>ReasoningEconomicsEnv</strong> is an OpenEnv-native RL environment where the LLM is both the <strong>budget allocator</strong> (by choosing how long to think) and the <strong>solver</strong> (by producing the answer). The environment is a stateless grader and budget accountant. Over a multi-turn episode, the agent learns <em>meta-reasoning</em>: when to think long, when to think cheaply, how to trade correctness against compute under a single shared cap — with no difficulty labels.</p> |
| <blockquote>To our knowledge, this is the first sequential multi-turn MDP that jointly incentivizes <em>reasoning-trace reduction</em> and <em>answer accuracy</em> under a shared, session-level budget. Every prior family — prompt-guided caps, RL with length rewards including <strong>Kimi K1.5 Long2Short</strong>, SFT/distillation on compressed traces, dynamic early exit — optimizes compression <em>within</em> a single query. None learn <strong>pacing across a sequence of queries</strong>.</blockquote> |
| </section> |
|
|
| |
| <section id="matters"> |
| <h2>Why a shared budget changes the problem</h2> |
| <p>Per-query budgeting is a local optimization. Shared-budget reasoning is a sequential resource-allocation problem with partial observability over future question difficulty.</p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Axis</th><th>Prior work (R1 / QwQ / Xu 2025 / Kimi K1.5)</th><th>ReasoningEconomicsEnv</th></tr></thead> |
| <tbody> |
| <tr><td><strong>Budget scope</strong></td><td>Per-query, isolated</td><td>Shared across N questions</td></tr> |
| <tr><td><strong>Difficulty signal</strong></td><td>Explicit label or classifier</td><td>Inferred from text only</td></tr> |
| <tr><td><strong>Horizon</strong></td><td>Single step</td><td>Sequential (N steps / episode)</td></tr> |
| <tr><td><strong>Pacing pressure</strong></td><td>None</td><td>Irrecoverable depletion</td></tr> |
| <tr><td><strong>Training cost</strong></td><td>Live API per rollout</td><td>Grader-only env (CPU) + local vLLM</td></tr> |
| <tr><td><strong>Decision learned</strong></td><td><em>How short can this answer be?</em></td><td><em>How should I spend what I have left?</em></td></tr> |
| </tbody> |
| </table> |
| </div> |
| <p>The failure modes we want to surface are distinctly sequential:</p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Failure</th><th>What goes wrong</th></tr></thead> |
| <tbody> |
| <tr><td><strong>Over-invest early</strong></td><td>Budget gone before the last (possibly hard) question arrives</td></tr> |
| <tr><td><strong>Over-conserve</strong></td><td>Easy questions answered; hard questions starved, cap under-used</td></tr> |
| <tr><td><strong>Fixed pacing</strong></td><td>Uniform spend ignores difficulty variance across items</td></tr> |
| <tr><td><strong>Thinking-mode blowup</strong></td><td><code>&lt;think&gt;…&lt;/think&gt;</code> runs past <code>max_completion_length</code>; the answer is truncated and grading returns zero</td></tr> |
| <tr><td><strong>Unit drift</strong></td><td>Budget cap and spend tallied in <em>different tokenizers</em> — phantom budget</td></tr> |
| </tbody> |
| </table> |
| </div> |
| <p>Every row in the table above is a failure mode we hit and diagnosed end-to-end (see <a href="#engineering">Engineering Lessons</a> and <a href="#results">Results</a>).</p> |
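<p>The first three failure modes can be made concrete with a toy simulation. This is a minimal sketch under stated assumptions, not the environment's grader: <code>run_episode</code>, the three policies, and the rule that a question is solved only when the tokens spent on it reach its hidden requirement are all hypothetical, chosen for illustration.</p>

```python
from typing import Callable, List

# Hypothetical policy signature: (hidden_need, remaining_budget, questions_left) -> tokens to spend.
# A real policy never sees hidden_need; it has to infer difficulty from the question text.
Policy = Callable[[int, int, int], int]


def run_episode(needs: List[int], total_budget: int, policy: Policy) -> int:
    """Count questions solved under a hard shared cap (toy solve rule: spend >= need)."""
    remaining = total_budget
    solved = 0
    for i, need in enumerate(needs):
        questions_left = len(needs) - i
        # The policy can never spend more than what is left in the episode.
        spend = min(policy(need, remaining, questions_left), remaining)
        remaining -= spend
        if spend >= need:
            solved += 1
    return solved


def front_loaded(need: int, remaining: int, questions_left: int) -> int:
    """Over-invest early: always ask for a large fixed amount."""
    return 600


def fixed_pacing(need: int, remaining: int, questions_left: int) -> int:
    """Uniform spend: split what is left evenly, ignoring difficulty."""
    return remaining // questions_left


def difficulty_aware(need: int, remaining: int, questions_left: int) -> int:
    """Idealized allocator that spends exactly what each question needs."""
    return need


# Three easy questions followed by one hard one, sharing 1200 tokens.
needs = [100, 100, 100, 700]
results = {
    "front_loaded": run_episode(needs, 1200, front_loaded),
    "fixed_pacing": run_episode(needs, 1200, fixed_pacing),
    "difficulty_aware": run_episode(needs, 1200, difficulty_aware),
}
```

<p>In this toy setting the front-loaded policy solves 2 of 4 questions, fixed pacing solves 3, and the difficulty-aware allocator solves all 4 with tokens to spare. The gap between the last two is the headroom that only cross-question allocation can recover.</p>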
| </section> |
|
|
| |
| <section id="prior-work"> |
| <h2>Prior work & novelty</h2> |
| <p>The reasoning-economics literature falls into four families. <strong>All four optimize <em>per-prompt</em> reasoning length; none expose a shared <em>cross-prompt</em> token budget.</strong> ReasoningEconomicsEnv is the missing fifth regime — portfolio allocation under a joint budget.</p> |
|
|
| <h3>Prompt-Guided</h3> |
| <p>Inference-time prompting asks the model to self-regulate. No training signal, per-query scope.</p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Method</th><th>Mechanism</th><th>Link</th></tr></thead> |
| <tbody> |
| <tr><td><strong>Token-Budget</strong> (Han et al., 2024)</td><td>LLM self-estimates a token budget per query and embeds it in the prompt to constrain CoT length; reports ~68% token reduction with minimal accuracy loss</td><td><a href="https://arxiv.org/abs/2412.18547" target="_blank" style="color:var(--accent2)">arXiv:2412.18547</a></td></tr> |
| <tr><td><strong>Chain-of-Draft</strong> (Xu et al., 2025)</td><td>Prompts the model to write ≤5 words per reasoning step; matches CoT accuracy at ~7.6% of the tokens</td><td><a href="https://arxiv.org/abs/2502.18600" target="_blank" style="color:var(--accent2)">arXiv:2502.18600</a></td></tr> |
| <tr><td><strong>CCoT</strong> (Renze & Guven, 2024)</td><td>Appends "be concise" to CoT prompts; reduces length with a minor accuracy penalty on weaker models</td><td><a href="https://arxiv.org/abs/2401.05618" target="_blank" style="color:var(--accent2)">arXiv:2401.05618</a></td></tr> |
| <tr><td><strong>Token Complexity</strong> (Lee et al., 2025)</td><td>Benchmarks compression prompts (word limits, bullets, abbreviations); finds LLMs natively adjust length to difficulty even without sophisticated prompting</td><td><a href="https://arxiv.org/abs/2503.01141" target="_blank" style="color:var(--accent2)">arXiv:2503.01141</a></td></tr> |
| </tbody> |
| </table> |
| </div> |
|
|
| <h3>RL with Length Reward</h3> |
| <p>Training-time methods that shape a reward around reasoning length on single-prompt responses.</p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Method</th><th>Mechanism</th><th>Link</th></tr></thead> |
| <tbody> |
| <tr><td><strong>L1 / LCPO</strong> (Aggarwal & Welleck, 2025)</td><td>GRPO with a length-penalty reward; controls reasoning length via a "Think for N tokens" prompt prefix</td><td><a href="https://arxiv.org/abs/2503.04697" target="_blank" style="color:var(--accent2)">arXiv:2503.04697</a></td></tr> |
| <tr><td><strong>O1-Pruner</strong> (Luo et al., 2025)</td><td>PPO with a reward that penalizes token usage relative to a target length; applied to Marco-o1 and QwQ</td><td><a href="https://arxiv.org/abs/2501.12570" target="_blank" style="color:var(--accent2)">arXiv:2501.12570</a></td></tr> |
| <tr class="novel"><td><strong>Kimi K1.5 / Long2Short</strong> (Moonshot AI, 2025)</td><td>Length-conditioned RL distillation of a long-reasoning teacher into a shorter policy. <strong>The paper ReasoningEconomicsEnv most directly contrasts with</strong> — Long2Short shortens <em>one</em> chain, we allocate <em>across many</em>.</td><td><a href="https://arxiv.org/abs/2501.12599" target="_blank" style="color:var(--accent2)">arXiv:2501.12599</a></td></tr> |
| <tr><td><strong>DAST</strong> (Shu et al., 2025)</td><td>SimPO-based preference optimization on constructed short/long preference pairs</td><td><a href="https://arxiv.org/abs/2503.04472" target="_blank" style="color:var(--accent2)">arXiv:2503.04472</a></td></tr> |
| <tr><td><strong>SelfBudgeter</strong> (Li et al., 2025)</td><td>Model prepends a self-predicted token budget before reasoning and is trained to respect it</td><td><a href="https://arxiv.org/abs/2505.11274" target="_blank" style="color:var(--accent2)">arXiv:2505.11274</a></td></tr> |
| </tbody> |
| </table> |
| </div> |
|
|
| <h3>SFT / Distillation</h3> |
| <p>Supervised fine-tuning on shortened CoT traces. No RL, no budget state.</p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Method</th><th>Mechanism</th><th>Link</th></tr></thead> |
| <tbody> |
| <tr><td><strong>CoT-Valve</strong> (Ma et al., 2025)</td><td>Single model trained on CoT of varying lengths; inference-time "valve" parameter controls reasoning depth</td><td><a href="https://arxiv.org/abs/2502.09601" target="_blank" style="color:var(--accent2)">arXiv:2502.09601</a></td></tr> |
| <tr><td><strong>TokenSkip</strong> (Xia et al., 2025)</td><td>Compresses existing CoT by skipping non-essential tokens, then fine-tunes on the compressed traces</td><td><a href="https://arxiv.org/abs/2502.12067" target="_blank" style="color:var(--accent2)">arXiv:2502.12067</a></td></tr> |
| <tr><td><strong>Z1</strong> (Zhang et al., 2025)</td><td>SFT on compressed-thought data that shortens each reasoning step</td><td><a href="https://arxiv.org/abs/2504.00810" target="_blank" style="color:var(--accent2)">arXiv:2504.00810</a></td></tr> |
| </tbody> |
| </table> |
| </div> |
|
|
| <h3>Dynamic Early Exit</h3> |
| <p>Decoding-time heuristics that terminate a single chain early. The policy has no knowledge of downstream prompts.</p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Method</th><th>Mechanism</th><th>Link</th></tr></thead> |
| <tbody> |
| <tr><td><strong>Dynasor-CoT</strong> (Fu et al., 2025)</td><td>Probes intermediate answers at fixed intervals; terminates when consecutive answers agree</td><td><a href="https://arxiv.org/abs/2412.20993" target="_blank" style="color:var(--accent2)">arXiv:2412.20993</a></td></tr> |
| <tr><td><strong>Budget Forcing / s1</strong> (Muennighoff et al., 2025)</td><td>Forces end-of-thinking + "Final Answer:" at the max token budget; simple and strong baseline</td><td><a href="https://arxiv.org/abs/2501.19393" target="_blank" style="color:var(--accent2)">arXiv:2501.19393</a></td></tr> |
| <tr><td><strong>DEER</strong> (Yang et al., 2025)</td><td>Detects reflection signals (e.g., "Wait,", "Let me check") in the output as dynamic exit points and terminates reasoning if deemed sufficient</td><td><a href="https://arxiv.org/abs/2504.15895" target="_blank" style="color:var(--accent2)">arXiv:2504.15895</a></td></tr> |
| </tbody> |
| </table> |
| </div> |
|
|
| <h3>Framework we inherit</h3> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Component</th><th>What we inherit</th><th>Link</th></tr></thead> |
| <tbody> |
| <tr><td><strong>OpenEnv</strong></td><td>Gym-style <code>reset</code>/<code>step</code> over WebSocket; HF Space deployment; per-session state; concurrent sessions</td><td><a href="https://huggingface.co/blog/openenv" target="_blank" style="color:var(--accent2)">HF Blog: Introducing OpenEnv</a></td></tr> |
| </tbody> |
| </table> |
| </div> |
|
|
| <div class="callout"> |
| <div class="q">Every method above optimizes "how long should <em>this</em> chain be". ReasoningEconomicsEnv optimizes "how should I <em>split my budget</em> across these N chains".</div> |
| <div class="sub">A different action space (portfolio over prompts) and a different reward surface (joint <code>accuracy × utilization</code> across a battery). To our knowledge, this is the first OpenEnv-native RL environment — and the first sequential MDP of any kind — where a single reward function jointly incentivizes <em>reasoning-trace reduction</em> and <em>answer accuracy</em> across a multi-turn, shared-budget episode.</div> |
| </div> |
| <p><strong>Scope caveat.</strong> The novelty is the MDP, reward coupling, and budget accounting. The RL method (GRPO on verifiable math) is shared with the DeepSeekMath / Kimi K1.5 lineages; we reuse those techniques rather than propose new ones.</p> |
| </section> |
|
|
| |
| <section id="design"> |
| <h2>What ReasoningEconomicsEnv is</h2> |
| <blockquote>A stateless grader plus a budget accountant, served over OpenEnv's WebSocket protocol. The LLM is the policy. The reward is verifiable. The MDP is multi-turn. Nothing else is invented.</blockquote> |
| <p>Each episode samples <strong>10 math questions</strong> (configurable, <code>num_questions</code>) from <code>meta-math/MetaMathQA</code> — keyed by <code>type</code> (<code>GSM_SV</code>, <code>MATH_FOBAR</code>, …) and drawn from the <strong>first 5 000 rows</strong> of the dataset (<code>subset_start_idx=0</code>, <code>subset_size=5000</code>) so every run samples from the same fixed window. The agent receives one question at a time alongside its remaining budget, chooses how long to reason, and emits a single response containing its chain-of-thought and a <code>\boxed{…}</code> final answer. The environment grades the answer against ground truth and returns a reward and the next question — until the episode terminates (10 questions completed or budget exhausted, depending on mode).</p> |
| <p class="math-note"><strong>Dataset scope today:</strong> MetaMathQA only, first 5 000 rows. <code>AI-MO/NuminaMath-TIR</code> is wired into the sampler (<code>NUMINA_PROBLEM_TYPE = "NuminaMath_TIR"</code>, <code>numina_subset_size</code>) but kept out of the current training mix; enabling the Numina channel for an even MetaMath + Numina mix is tracked in <a href="#future">Future Work</a>.</p> |
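<p>The fixed-window sampling described above can be sketched as follows. The parameter names <code>num_questions</code>, <code>subset_start_idx</code>, and <code>subset_size</code> come from the text; the function <code>sample_episode</code>, the <code>seed</code> argument, the uniform sampling rule, and the synthetic rows are assumptions made for illustration and may differ from the real sampler.</p>

```python
import random
from typing import Dict, List


def sample_episode(rows: List[Dict[str, str]], num_questions: int = 10,
                   subset_start_idx: int = 0, subset_size: int = 5000,
                   seed: int = 0) -> List[Dict[str, str]]:
    """Draw one episode's questions from a fixed window of the dataset.

    Assumption: questions are drawn uniformly without replacement from the window.
    """
    window = rows[subset_start_idx:subset_start_idx + subset_size]
    rng = random.Random(seed)  # seeded so the same seed reproduces the same episode
    return rng.sample(window, num_questions)


# Synthetic stand-in rows; a real run would load meta-math/MetaMathQA instead.
rows = [
    {"idx": str(i), "type": "GSM_SV" if i % 2 == 0 else "MATH_FOBAR", "query": f"question {i}"}
    for i in range(8000)
]
episode = sample_episode(rows, seed=7)
```

<p>Because the window is fixed, rows beyond index 4 999 never appear in an episode, which keeps runs comparable even as the sampling seed changes.</p>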
| <p>The agent's action interface is deliberately minimal: <strong>raw text / JSON output, no tool-call protocol, no markdown parsing fragility</strong>. The LLM outputs a response string; the env parses it. Crucially, the training client (<em>ReasoningEconomicsPT</em>) never imports env Pydantic types — it speaks dict shapes over the wire, matching OpenEnv's client/server contract.</p> |
| </section> |
|
|
| |
| <section id="env-design"> |
| <h2>Environment design</h2> |
| <h3>MDP</h3> |
| <p>The core contract is two Pydantic types exchanged over the OpenEnv WebSocket:</p> |
| <pre><code><span class="c"># Observation (env → agent)</span> |
| class ReasonBudgetObservation(Observation): |
|     question: str <span class="c"># raw problem text</span> |
|     remaining_budget: int <span class="c"># tokens left in the episode</span> |
|     questions_remaining: int |
|     budget_per_remaining_question: float <span class="c"># pacing signal</span> |
|     accuracy_so_far: float |
|     episode_history: list[HistoryItem] <span class="c"># in-context Q/A memory</span> |
|     done: bool |
|     reward: Optional[float] |
|     metadata: dict <span class="c"># problem_type, total_budget, budget_source,</span> |
|     <span class="c"># budget_mode, min_tokens, max_tokens</span> |


| <span class="c"># Action (agent → env)</span> |
| class ReasonBudgetAction(Action): |
|     response: str <span class="c"># thinking trace + \boxed{answer}</span> |
|     metadata: dict <span class="c"># optional tokenizer_name override,</span> |
|     <span class="c"># optional grading_response (visible tail)</span></code></pre> |
|
|
| <h3>Reward</h3> |
| <p>The reward has two layers, per-step and terminal, following the OpenEnv per-step plus terminal-bonus pattern. <strong>Length pressure and accuracy pressure are active at the same time</strong>: the per-step cost penalty discourages <em>overlong traces</em>, the correctness term rewards <em>right answers</em>, and the terminal bonus couples accuracy and utilization multiplicatively so neither can be sacrificed to optimize the other.</p> |
| <p><strong>Per-step</strong> (accumulated every turn):</p> |
| <div class="math-display" aria-label="Per-step reward decomposition"> |
| \[ |
| r_{\text{step}} = \text{correctness} + \text{efficiency\_bonus} - \text{cost\_penalty} - \text{overspend\_penalty} |
| \] |
| </div> |
| <ul> |
| <li><code>correctness</code>: <strong><code>+1.0</code></strong> if <code>extract_boxed_answer</code> + SymPy equality matches ground truth, else <strong><code>−0.1</code></strong> — <strong>incentivizes answer accuracy</strong> (wrong answers carry a small per-step penalty; see <a href="#repo-legend">Two-repo cheat sheet</a> / <code>compute_reward</code> in <code>env/reward.py</code>).</li> |
| <li><code>efficiency_bonus</code>: reward for correct answers that spend less than <code>fair_share</code>, paid on correct steps only. <strong>Incentivizes trace reduction when correctness is preserved</strong>.</li> |
| <li><code>cost_penalty</code>: linear in the fraction by which this turn's tokens exceed <code>fair_share</code>, and zero at or below it. <strong>Direct trace-length pressure</strong>.</li> |
| <li><code>overspend_penalty</code>: active only in soft-budget mode; 0 under hard cap.</li> |
| </ul> |
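<p>The correctness term depends on pulling the final answer out of <code>\boxed{…}</code>. The sketch below shows one way such an extractor could work, matching braces by depth so that nested groups like <code>\boxed{\frac{1}{2}}</code> survive. It is not the repo's <code>extract_boxed_answer</code>; the helper name <code>extract_last_boxed</code> is hypothetical, and the SymPy equality step is left out to keep the example dependency-free.</p>

```python
from typing import Optional


def extract_last_boxed(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in response, or None.

    Hypothetical helper: braces are matched by depth, so nested LaTeX
    groups inside the box are kept intact.
    """
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in response[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Closing brace of the box itself: stop before appending it.
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces: treat as "no answer"
```

<p>Taking the last box rather than the first is a design choice in this sketch: a thinking trace may box intermediate values before committing to a final answer.</p>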
| <p><strong>Terminal</strong> (added to the final step's reward):</p> |
| <div class="math-display" aria-label="Terminal episode reward"> |
| \[ |
| r_{\text{episode}} = \lambda_{\text{ep}} \cdot \bigl(\text{episode\_accuracy} \,\times\, \text{budget\_utilization\_score}\bigr) |
| \] |
| </div> |
| <p class="math-note">Where <code>budget_utilization_score = max(0, 1 − |spent/total_budget − target_utilization|)</code> rewards finishing <em>close to</em> the target utilization. Because the deviation enters as an absolute value, undershooting and overshooting the target by the same fraction cost the same. <code>fair_share = total_budget / num_questions</code> is used in both <code>efficiency_bonus</code> and <code>cost_penalty</code>.</p> |
| <blockquote><strong>Why the product form.</strong> Compressing every trace to zero tokens (accuracy = 0) and answering correctly but wasting the budget (utilization bad) are both punished. The only way to maximize <code>r_episode</code> is to spend the budget well <em>and</em> be right.</blockquote> |
|
|
| <h3>Reward hyperparameters</h3> |
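| <p>All runs in this blog use the repo defaults from <code>ReasoningEconomicsEnv/env/config.py</code> and <code>ReasoningEconomicsEnv/env/reward.py</code> for the reward weights; none were tuned per run. Episode-shape settings such as <code>num_questions</code> and <code>max_tokens_per_step</code> are overridden in individual runs where <a href="#results">Results</a> says so. The defaults are:</p> |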
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Symbol</th><th>Name</th><th>Default</th><th>Role</th></tr></thead> |
| <tbody> |
| <tr><td><code>β</code></td><td><code>beta</code> (cost-penalty weight)</td><td class="num">0.05</td><td>Linear per-step token cost: <code>β · max(0, tokens_used / fair_share − 1)</code>. Only fires when the step overspends its fair share.</td></tr> |
| <tr><td><code>γ</code></td><td><code>gamma</code> (efficiency-bonus weight)</td><td class="num">0.1</td><td>Reward for solving under fair share: <code>γ · (1 − spend_ratio)</code>, correct steps only.</td></tr> |
| <tr><td><code>λ<sub>ep</sub></code></td><td><code>lambda_ep</code> (terminal weight)</td><td class="num">0.5</td><td>Scales terminal <code>episode_accuracy × budget_utilization_score</code>. Product form prevents unilateral optimization of either factor.</td></tr> |
| <tr><td>—</td><td><code>target_utilization</code></td><td class="num">0.9</td><td>Utilization peak for <code>budget_utilization_score</code>; rewards finishing close to 90% of the total budget.</td></tr> |
| <tr><td>—</td><td><code>correctness</code> reward</td><td class="num">+1.0 / −0.1</td><td>Per-step: <code>+1</code> on SymPy match, <code>−0.1</code> on wrong — a small negative signal so trivial "don't answer" policies lose reward.</td></tr> |
| <tr><td>—</td><td><code>soft_overspend_penalty</code></td><td class="num">0.25</td><td>Active only in soft-budget mode: <code>0.25 · (overspend_tokens / fair_share)</code>. Hard-cap mode zeroes this term.</td></tr> |
| <tr><td>—</td><td><code>budget_ratio</code></td><td class="num">2.0</td><td>Total-budget multiplier used whenever the client does not pass <code>total_budget</code>: applied to the summed question tokens on the tokenizer-native path and to the config formula otherwise (<a href="#env-design">budget priority table</a>).</td></tr> |
| <tr><td>—</td><td><code>num_questions</code> / <code>min_tokens</code> / <code>max_tokens</code> / <code>max_tokens_per_step</code></td><td class="num">10 / 10 / 800 / 2048</td><td>Episode length and per-step token window; <code>min_tokens</code> also sets the hard-cap early-termination threshold.</td></tr> |
| </tbody> |
| </table> |
| </div> |
| <p class="math-note">Values are the <code>EnvConfig</code> dataclass defaults and the default kwargs on <code>compute_reward</code> / <code>compute_episode_bonus</code>. They were not swept in this submission; tuning <code>β</code>, <code>γ</code>, and <code>λ<sub>ep</sub></code> jointly against baseline runs is part of <a href="#future">Future Work</a>.</p> |
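<p>To make the formulas in the table concrete, here is a minimal sketch of the per-step and terminal reward using the documented defaults. It is not the code in <code>env/reward.py</code>: the function names are hypothetical, <code>spend_ratio = tokens_used / fair_share</code> is an assumed definition, and clamping the efficiency bonus at zero is an assumption based on "solving under fair share".</p>

```python
def step_reward(correct, tokens_used, fair_share, *, beta=0.05, gamma=0.1,
                overspend_tokens=0, soft_overspend_penalty=0.25, hard_cap=True):
    """Per-step reward assembled from the documented terms (illustrative only)."""
    spend_ratio = tokens_used / fair_share  # assumed definition of spend_ratio
    correctness = 1.0 if correct else -0.1
    # Efficiency bonus: correct steps only; clamped at 0 (assumption) so that
    # spending above fair share is handled by the cost penalty alone.
    efficiency_bonus = gamma * max(0.0, 1.0 - spend_ratio) if correct else 0.0
    # Cost penalty: fires only when the step overspends its fair share.
    cost_penalty = beta * max(0.0, spend_ratio - 1.0)
    # Overspend penalty: soft-budget mode only; hard cap zeroes the term.
    overspend = 0.0 if hard_cap else soft_overspend_penalty * overspend_tokens / fair_share
    return correctness + efficiency_bonus - cost_penalty - overspend


def budget_utilization_score(spent, total_budget, target_utilization=0.9):
    """1 at the target utilization, falling off linearly on both sides."""
    return max(0.0, 1.0 - abs(spent / total_budget - target_utilization))


def episode_bonus(num_correct, num_questions, spent, total_budget, *,
                  lambda_ep=0.5, target_utilization=0.9):
    """Terminal bonus: lambda_ep * accuracy * utilization score."""
    accuracy = num_correct / num_questions
    return lambda_ep * accuracy * budget_utilization_score(
        spent, total_budget, target_utilization)
```

<p>Under these assumptions a correct answer at half of fair share earns about <code>1.05</code>, a wrong answer at twice fair share earns about <code>−0.15</code>, and an episode with zero accuracy earns no terminal bonus however well it hits the utilization target.</p>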
|
|
| <h3>Budget modes</h3> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Mode</th><th>Behavior</th><th>Use</th></tr></thead> |
| <tbody> |
| <tr><td><strong>Hard-cap</strong> (default)</td><td>Per-step spend clipped to <code>remaining_budget</code>; episode terminates early when <code>remaining_budget &lt; min_tokens</code></td><td>Final evaluation, competition scoring</td></tr> |
| <tr><td><strong>Soft-budget</strong></td><td>No clipping, no early termination; overspend smoothly penalized</td><td>Training curriculum — lets the policy experience the whole episode before discipline is enforced</td></tr> |
| </tbody> |
| </table> |
| </div> |
| <p>Dual modes are not a convenience. Hard-cap's early termination produces zero-advantage groups in GRPO: uniform truncation across all generations → <code>std(r)=0</code> → zero gradient. Soft-budget bridges that window until the policy learns to finish. This is the same pathology we diagnose in full under <a href="#engineering">Engineering Lessons</a> (<a href="#truncation">Pathology 2</a>).</p> |
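<p>The zero-gradient claim follows directly from how group-relative advantages are normalized. The sketch below uses the standard GRPO form <code>(r − mean) / (std + eps)</code> within one group; whether the trainer uses population or sample standard deviation, and the exact <code>eps</code>, are assumptions here, and <code>group_advantages</code> is a hypothetical helper rather than a TRL function.</p>

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages: (r - mean) / (std + eps) within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every generation truncated at the same boundary -> identical rewards.
truncated_group = [-0.1, -0.1, -0.1, -0.1]
# At least one generation finishes and answers correctly -> rewards differ.
mixed_group = [1.05, -0.1, -0.1, 1.0]

print(group_advantages(truncated_group))  # every advantage is zero
print(group_advantages(mixed_group))      # non-zero advantages
```

<p>When every reward in the group is identical, each numerator is zero, so the policy-gradient term vanishes for the whole group regardless of how bad the shared reward is.</p>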
|
|
| <h3>Tokenizer-native budgets</h3> |
| <p>A subtle bug we surfaced and fixed: per-step <strong>spend</strong> was counted with a live <code>AutoTokenizer</code>, but the episode <strong>cap</strong> (<code>total_budget</code>) was computed from an abstract config formula in an entirely different unit system. Caps and spends did not share units. The environment now resolves <code>total_budget</code> at <code>reset()</code> in priority order:</p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Priority</th><th>Condition</th><th>Formula</th><th><code>budget_source</code></th></tr></thead> |
| <tbody> |
| <tr><td class="num">1</td><td>Client passes <code>total_budget</code></td><td>Exact integer</td><td><code>"client"</code></td></tr> |
| <tr><td class="num">2</td><td>Client passes <code>tokenizer_name</code></td><td><code>budget_ratio × Σ tokenize(q_i)</code> over all questions</td><td><code>"tokenizer_native"</code></td></tr> |
| <tr><td class="num">2b</td><td>Tokenizer load fails</td><td>Config formula + warning</td><td><code>"config"</code></td></tr> |
| <tr><td class="num">3</td><td>Neither passed</td><td><code>budget_ratio × N × (min_tokens + max_tokens) / 2</code></td><td><code>"config"</code></td></tr> |
| </tbody> |
| </table> |
| </div> |
| <p>Observation metadata returns <code>total_budget</code> and <code>budget_source</code> so the client can verify the path taken. Cap and spend now live in the same policy-token unit system. The fix is a tokenizer-mismatch mitigation: align the env's <code>AutoTokenizer</code> id with the policy tokenizer via <code>--env_tokenizer_name</code> (or via the Hub id stored in <code>REPT_MODEL_HUB_ID</code> when the checkpoint is a local path).</p> |
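<p>The priority table can be read as a single fall-through function. The sketch below is an illustration of that ordering, not the environment's <code>reset()</code> code: <code>resolve_total_budget</code> and the injected <code>load_tokenizer</code> callable are hypothetical, and the tokenizer is passed in as a plain callable so the example runs without downloading a model.</p>

```python
import warnings
from typing import Callable, Optional, Sequence, Tuple

Tokenize = Callable[[str], Sequence]


def resolve_total_budget(
    questions: Sequence[str],
    *,
    total_budget: Optional[int] = None,
    tokenizer_name: Optional[str] = None,
    load_tokenizer: Optional[Callable[[str], Tokenize]] = None,
    budget_ratio: float = 2.0,
    min_tokens: int = 10,
    max_tokens: int = 800,
) -> Tuple[int, str]:
    """Return (total_budget, budget_source) following the priority table."""
    # Priority 1: an explicit client budget always wins.
    if total_budget is not None:
        return int(total_budget), "client"

    # Priority 2: budget expressed in the policy tokenizer's own units.
    if tokenizer_name is not None and load_tokenizer is not None:
        try:
            tokenize = load_tokenizer(tokenizer_name)
            question_tokens = sum(len(tokenize(q)) for q in questions)
            return int(budget_ratio * question_tokens), "tokenizer_native"
        except Exception as exc:
            # Priority 2b: warn and fall through to the config formula.
            warnings.warn(f"tokenizer load failed ({exc}); using config formula")

    # Priority 3 (and 2b): abstract config formula.
    n = len(questions)
    return int(budget_ratio * n * (min_tokens + max_tokens) / 2), "config"
```

<p>With the documented defaults and ten questions, the config path yields <code>2.0 × 10 × (10 + 800) / 2 = 8100</code>. Returning the source string alongside the number is what lets a client detect that it silently landed on the fallback.</p>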
| </section> |
|
|
| |
| <section id="openenv"> |
| <h2>Why OpenEnv</h2> |
| <ul> |
| <li><strong>Base types only.</strong> <code>EnvClient</code>, <code>Environment</code>, Pydantic <code>Observation</code> / <code>Action</code>. No invented abstractions.</li> |
| <li><strong>Per-WebSocket session state.</strong> One environment per socket. The invariant is load-bearing — violating it silently produced zero-reward episodes (see <a href="#stack-split">Engineering Pathology 3</a>).</li> |
| <li><strong>Concurrent sessions.</strong> <code>SUPPORTS_CONCURRENT_SESSIONS = True</code>; <code>max_concurrent_envs = 64</code> — built to be hammered by DDP ranks.</li> |
| <li><strong>Metadata channel.</strong> All extensions (<code>total_budget</code>, <code>budget_source</code>, <code>budget_mode</code>, <code>problem_type</code>, per-step <code>tokenizer_name</code>, <code>grading_response</code>) ride on <code>Observation.metadata</code> and <code>Action.metadata</code>. No new method signatures.</li> |
| <li><strong>Verifiable, bounded reward.</strong> <code>extract_boxed_answer</code> + SymPy. Per-step reward numerically bounded; episode bonus is a clamped product. No LLM judge, no circularity.</li> |
| </ul> |
| </section> |
|
|
| |
| <section id="scoring"> |
| <h2>Scoring: per-step + terminal, coupled multiplicatively</h2> |
| <p>The reward has two layers. Per-step reward fires every turn and sums into an episode total. The terminal bonus is added only at the final step and couples accuracy with budget utilization through a product. An optional scalar <code>--alpha</code> multiplies the per-step reward before episode accumulation (<code>raw_step_reward * alpha</code> inside <code>EpisodeSession</code>). A companion training-side <code>beta</code> scalar is reserved for future shaping and currently has no effect; it is not the env-side cost-penalty weight <code>β</code> listed under <a href="#env-design">Reward hyperparameters</a>.</p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Component</th><th>When</th><th>What it rewards</th></tr></thead> |
| <tbody> |
| <tr><td><strong>Correctness</strong></td><td>per step</td><td>Boxed answer matches ground truth under SymPy equality</td></tr> |
| <tr><td><strong>Efficiency bonus</strong></td><td>per step</td><td>Right answer on an easy item with few tokens</td></tr> |
| <tr><td><strong>Cost penalty</strong></td><td>per step</td><td>Linear in tokens spent beyond fair share this turn, counted on the full decoded response rather than only the visible tail</td></tr> |
| <tr><td><strong>Overspend penalty</strong></td><td>per step (soft-budget only)</td><td>Smooth penalty proportional to <code>overspend_tokens / fair_share</code></td></tr> |
| <tr><td><strong>Terminal bonus</strong></td><td>last step only</td><td><code>λ_ep × (episode_accuracy × budget_utilization_score)</code></td></tr> |
| </tbody> |
| </table> |
| </div> |
| <blockquote><strong>Why multiplicative coupling.</strong> A sum of the two factors invites reward hacking: the policy can drop accuracy to zero as long as it aces utilization, or vice versa. The product kills both shortcuts: if either factor is zero, the terminal bonus is zero, regardless of how well the other is optimized. The agent has to be right <em>and</em> pace itself, which is the entire learning problem.</blockquote> |
| </section> |
|
|
| |
| <section id="architecture"> |
| <h2>Architecture & training pipeline</h2> |
| <p>The project is two strictly separated packages: <strong>ReasoningEconomicsEnv</strong> (the OpenEnv environment) and <strong>ReasoningEconomicsPT</strong> (the GRPO training client). They communicate exclusively over WebSocket — no in-process imports. The PT repo subclasses <code>EnvClient</code> as <code>ReasonBudgetClient</code> with plain dict actions and observations, so it never touches env Pydantic types.</p> |
|
|
| <h3 id="repo-legend">Two-repo cheat sheet</h3> |
| <p>Every code reference in this blog lives in exactly one of the two repos below. When a symbol is mentioned in later sections (<a href="#engineering">Engineering Lessons</a>, <a href="#stack-split">Stack Split</a>, <a href="#quickstart">Quick Start</a>), this table is the canonical resolution.</p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Symbol / file</th><th>Repo</th><th>Path</th><th>Role</th></tr></thead> |
| <tbody> |
| <tr><td><code>ReasonBudgetEnvironment</code></td><td><strong>Env</strong></td><td><code>env/reason_budget_env.py</code></td><td>FastAPI + OpenEnv environment served over WebSocket; one instance per session.</td></tr> |
| <tr><td><code>ReasonBudgetObservation</code> / <code>ReasonBudgetAction</code></td><td><strong>Env</strong></td><td><code>env/models.py</code></td><td>Pydantic wire types; PT never imports these, only dict shapes.</td></tr> |
| <tr><td><code>EnvConfig</code></td><td><strong>Env</strong></td><td><code>env/config.py</code></td><td>Episode + budget defaults; overridden by <code>REASON_BUDGET_*</code> env vars.</td></tr> |
| <tr><td><code>compute_reward</code> / <code>compute_episode_bonus</code></td><td><strong>Env</strong></td><td><code>env/reward.py</code></td><td>Per-step and terminal reward math (<a href="#scoring">Scoring</a>).</td></tr> |
| <tr><td><code>EpisodeSampler</code> / dataset loaders</td><td><strong>Env</strong></td><td><code>env/episode_sampler.py</code>, <code>data/loaders.py</code></td><td>MetaMathQA window (<code>subset_size=5000</code>); Numina wired but disabled.</td></tr> |
| <tr><td><code>start_openenv_server.sh</code></td><td><strong>Env</strong></td><td><code>scripts/start_openenv_server.sh</code></td><td>Spawns the FastAPI WebSocket on <code>127.0.0.1:8000</code>.</td></tr> |
| <tr><td><code>ReasonBudgetClient</code> / <code>EpisodeSession</code></td><td><strong>PT</strong></td><td><code>clients/reason_budget_client.py</code></td><td>Dict-typed <code>EnvClient</code> subclass; context manager holds one WebSocket for the full episode.</td></tr> |
| <tr><td><code>rollout_func</code></td><td><strong>PT</strong></td><td><code>training/rollout.py</code></td><td>TRL 1.0 multi-turn rollout driver; emits <code>env_reward</code> / <code>env_mask</code>.</td></tr> |
| <tr><td><code>resolve_env_tokenizer_name</code></td><td><strong>PT</strong></td><td><code>training/tokenizer_sync.py</code></td><td>Aligns env tokenizer with policy tokenizer via <code>REPT_MODEL_HUB_ID</code>.</td></tr> |
| <tr><td><code>_sync_fsdp2_params_to_vllm</code></td><td><strong>PT</strong></td><td><code>training/weight_sync.py</code></td><td>Weight-sync path to <code>trl vllm-serve</code> under Branch A (FSDP2).</td></tr> |
| <tr><td><code>REPT_*</code> env vars</td><td><strong>PT</strong></td><td><code>scripts/run_grpo_lambda.sh</code></td><td><code>REPT_MODEL</code>, <code>REPT_NUM_GPUS</code>, <code>REPT_VLLM_MODE</code>, <code>REPT_VLLM_TP</code>, <code>REPT_VLLM_PORT</code>, <code>REPT_MODEL_HUB_ID</code>.</td></tr> |
| <tr><td><code>REASON_BUDGET_*</code> env vars</td><td><strong>Env</strong></td><td><code>env/config.py</code></td><td><code>REASON_BUDGET_NUM_QUESTIONS</code>, <code>REASON_BUDGET_HARD_CAP_MODE</code>, <code>REASON_BUDGET_BUDGET_RATIO</code>, <code>REASON_BUDGET_TOKENIZER_NAME</code>.</td></tr> |
| <tr><td><code>reward_logs.jsonl</code></td><td><strong>PT</strong> writes, mirrors <strong>Env</strong> schema</td><td><code>runs/&lt;run_id&gt;/reward_logs.jsonl</code></td><td>Per-step reward audit (<code>step_index</code>, <code>remaining_budget_before</code>, <code>visible_response</code>, <code>raw_step_reward</code>, <code>scaled_step_reward</code>, <code>done_after_step</code>, <code>episode_reward</code>).</td></tr> |
| </tbody> |
| </table> |
| </div> |
| <p class="math-note">Shorthand used below: <em>Env-side</em> = <a href="https://github.com/sharma-yash01/ReasoningEconomicsEnv" target="_blank" style="color:var(--accent2)">sharma-yash01/ReasoningEconomicsEnv</a>; <em>PT-side</em> = <a href="https://github.com/sharma-yash01/ReasoningEconomicsPT" target="_blank" style="color:var(--accent2)">sharma-yash01/ReasoningEconomicsPT</a>.</p> |
|
|
| <div class="mermaid-wrap"> |
| <pre class="mermaid"> |
| flowchart LR |
| Policy["Policy GPUs 0-5, ZeRO-3, CPU optimizer offload"] |
| VLLM["vLLM server GPUs 6-7, tensor parallel 2"] |
| Env["OpenEnv FastAPI WebSocket localhost 8000"] |
| RewardLog["reward_logs.jsonl"] |
| Policy -->|rollout_func| VLLM |
| VLLM -->|generations| Policy |
| Policy -->|reset, step| Env |
| Env -->|Observation, reward, done| Policy |
| Policy --> RewardLog |
| </pre> |
| <p class="mermaid-caption">Figure 1. 8×A100 production topology for the headline Qwen3-14B run (<a href="#stack-split">Branch B</a>). GPUs 0–5 run the DeepSpeed ZeRO-3 trainer with CPU optimizer offload <strong>and Unsloth-integrated LoRA</strong>; GPUs 6–7 run <code>trl vllm-serve</code> with <code>tensor_parallel_size=2</code>; the OpenEnv server runs in a separate process listening on <code>ws://127.0.0.1:8000</code>. The Qwen2.5-3B tranche (<a href="#supporting-3b">3B results</a>) runs on <a href="#stack-split">Branch A</a> instead: FSDP2 sharding, full fine-tune, no LoRA. The environment is configurable via the env vars <code>REASON_BUDGET_NUM_QUESTIONS</code>, <code>REASON_BUDGET_HARD_CAP_MODE</code>, <code>REASON_BUDGET_BUDGET_RATIO</code>, and <code>REASON_BUDGET_TOKENIZER_NAME</code>, and launches reproducibly via <code>start_openenv_server.sh</code>.</p> |
| </div> |
|
|
| <p>Training uses <strong>GRPO</strong> (Group Relative Policy Optimization) via TRL 1.0.0's <code>rollout_func</code> contract, which gives us explicit control over the generate → parse → step loop. We chose <code>rollout_func</code> over <code>environment_factory</code> specifically to avoid TRL 1.0.0's Qwen3-only <code>add_response_schema</code> allowlist — that path is biased toward Qwen3/Qwen3.5 chat-template parsing, while we want to run Qwen2.5 and other families too.</p> |
| <p>The critical invariant is <strong>one WebSocket per episode</strong>. <code>_rollout_one_episode</code> runs inside <code>with EpisodeSession(...) as session:</code>, so <code>reset</code> and every <code>step</code> share the same socket. Per turn: <code>trainer.vllm_generation.generate()</code> → decode → <code>session.apply_response(text, …)</code> → remote <code>step({"response": …})</code>. The function returns <code>prompt_ids</code>, <code>completion_ids</code>, <code>logprobs</code>, <code>env_mask</code>, and <code>env_reward</code>; the reward hook <code>reward_from_env(…, **kwargs)</code> simply reads <code>kwargs["env_reward"]</code>. The env's tokenizer id on reset comes from <code>resolve_env_tokenizer_name</code>: <code>--env_tokenizer_name</code> if set, otherwise the tokenizer's <code>name_or_path</code>, falling back to <code>--model</code>. When the checkpoint is a local NFS path, the launcher saves the Hub id into <code>REPT_MODEL_HUB_ID</code> so the remote env receives an HF-resolvable id.</p> |
| <p>Hybrid-thinking models need per-family wiring. <code>training/model_profiles.json</code> + <code>training/model_profiles.py</code> provide a <code>ModelProfileRegistry</code> keyed on model id (exact match first, then longest-prefix), supplying <code>chat_template_kwargs</code> (e.g. <code>enable_thinking</code>), <code>output_parser</code> (<code>qwen3_think</code> or <code>null</code>), think-tag delimiters, and <code>grading_use_visible_only</code>. Env-side invariant: budget always counts <code>_count_tokens(action.response)</code> on the full string, while grading uses <code>metadata["grading_response"]</code> (visible tail) when non-empty. Budget stays honest; grading stays robust to think traces.</p> |
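<p>The "exact match first, then longest-prefix" rule is easy to get subtly wrong, so here is a small sketch of it. <code>ProfileLookup</code> is a hypothetical stand-in for <code>ModelProfileRegistry</code>, and the three profile entries are illustrative values, not the contents of <code>training/model_profiles.json</code>.</p>

```python
from typing import Any, Dict, Optional


class ProfileLookup:
    """Hypothetical stand-in for ModelProfileRegistry's matching rule."""

    def __init__(self, profiles: Dict[str, Dict[str, Any]]):
        self._profiles = profiles

    def resolve(self, model_id: str) -> Optional[Dict[str, Any]]:
        # 1) An exact id match wins outright.
        if model_id in self._profiles:
            return self._profiles[model_id]
        # 2) Otherwise pick the longest registered key that prefixes the id.
        prefixes = [key for key in self._profiles if model_id.startswith(key)]
        if not prefixes:
            return None
        return self._profiles[max(prefixes, key=len)]


# Illustrative entries only; the real JSON file may differ.
profiles = {
    "Qwen/": {"output_parser": None, "grading_use_visible_only": False},
    "Qwen/Qwen3": {
        "chat_template_kwargs": {"enable_thinking": True},
        "output_parser": "qwen3_think",
        "grading_use_visible_only": True,
    },
    "Qwen/Qwen2.5": {"output_parser": None, "grading_use_visible_only": False},
}
lookup = ProfileLookup(profiles)
qwen3_profile = lookup.resolve("Qwen/Qwen3-14B")
```

<p>In this sketch <code>Qwen/Qwen3-14B</code> matches both <code>Qwen/</code> and <code>Qwen/Qwen3</code>; picking the longest prefix is what keeps a family-specific profile from being shadowed by a generic one.</p>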
| <p><strong>NCCL padding.</strong> In <code>vllm_mode=server</code> with <code>world_size > 1</code>, every <code>trainer.vllm_generation.generate()</code> runs <code>accelerate.gather_object</code> → NCCL <code>all_gather_object</code> → <code>broadcast_object_list</code>. Our rollout is a <code>while not session.done</code> loop, so different ranks make different numbers of <code>generate()</code> calls per episode — permanent collective desync. Fix: a fixed <code>DIST_SERVER_GENERATES_PER_EPISODE = 8</code> cap, with dummy 1-token generates padding each episode to exactly 8 calls. Dummies are discarded; <code>env_reward</code>, <code>completion_ids</code>, and <code>logprobs</code> are byte-identical to the unpadded case. This is active only in server mode with DDP; colocate + TP=1 does not enter the <code>gather_object</code> path.</p> |
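<p>The padding fix amounts to making the number of generate calls per episode a constant. The sketch below shows that control flow with stubs: <code>FakeSession</code>, <code>session.prompt()</code>, and the <code>dummy_generate</code> callable are hypothetical, and no real collective or vLLM call is made. Stopping the real loop at the cap is also an assumption about how the fixed count is enforced.</p>

```python
DIST_SERVER_GENERATES_PER_EPISODE = 8  # fixed per-episode call count from the text


class FakeSession:
    """Stub episode that ends after n_turns responses (illustration only)."""

    def __init__(self, n_turns):
        self.turns_left = n_turns

    @property
    def done(self):
        return self.turns_left == 0

    def prompt(self):
        return f"question with {self.turns_left} turns left"

    def apply_response(self, text):
        self.turns_left -= 1


def rollout_with_padding(session, generate, dummy_generate,
                         pad_to=DIST_SERVER_GENERATES_PER_EPISODE):
    """Run one episode while issuing exactly pad_to generate-style calls."""
    kept, calls = [], 0
    while not session.done and calls < pad_to:
        completion = generate(session.prompt())
        session.apply_response(completion)
        kept.append(completion)
        calls += 1
    # Pad with throwaway calls so every rank enters the collective pad_to times.
    while calls < pad_to:
        dummy_generate()  # result is intentionally discarded
        calls += 1
    return kept, calls


dummy_calls = []
kept, calls = rollout_with_padding(
    FakeSession(n_turns=3),
    generate=lambda prompt: "answer",
    dummy_generate=lambda: dummy_calls.append(1),
)
```

<p>Here a three-turn episode keeps three completions and issues five dummy calls, so a rank whose episode ends early still makes the same eight calls as a rank whose episode runs long.</p>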
| </section> |
|
|
| |
| <section id="traces"> |
| <h2>Training pathology & zero-advantage collapse</h2> |
| <p>We observed three named, reproducible pathologies when shipping GRPO + OpenEnv + vLLM on hybrid-thinking models under a shared-budget MDP: <strong>NCCL desync</strong> under variable-length rollouts in server mode; <strong>truncation-induced zero-advantage collapse</strong> when every completion in a GRPO group hits the same clip boundary; and <strong>stack non-composition</strong> across TRL + vLLM + Unsloth + FSDP. Before the headline 14B run, the dominant learning-signal failure was truncation collapse — WebSocket, padding, and tokenizer alignment could all be green while the policy still received a structurally zero gradient.</p> |
| <p><strong>Full telemetry tables, evidence links, a log-backed truncation episode, root-cause chains, and structural fixes are documented once in <a href="#engineering">Engineering Lessons</a></strong> (<a href="#nccl">Pathology 1</a>, <a href="#truncation">Pathology 2</a> including an expandable log excerpt, <a href="#stack-split">Pathology 3</a>, then <a href="#takeaways">Takeaways</a>). This section stays short so the blog does not narrate the same three failures twice.</p> |
| </section> |
|
|
| |
| <section id="results"> |
| <h2>Results: what we found</h2> |
| <p>Runs are organized by what they contribute: (1) the <strong>headline Qwen3-14B 8×A100 completed run</strong>, our strongest positive-mean-reward evidence; (2) the <strong>Qwen2.5-3B tranche</strong> on 1×H100 that validates the pipeline end-to-end; (3) <strong>boundary / failure runs</strong> that delimit the tractable region of the TRL 1.0 / vLLM stack.</p> |
|
|
| <h3 id="headline-14b">Headline — Qwen3-14B, 8×A100, ZeRO-3 + vLLM TP=2, true-4q</h3> |
| <p>Run <code>14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe</code> (see <a href="#stack-split">Engineering Pathology 3</a> for topology). 20/20 optimizer steps completed with artifacts saved. <strong>Our first completed multi-question shared-budget GRPO training run on 14B.</strong></p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Metric</th><th>Value</th></tr></thead> |
| <tbody> |
| <tr><td><strong>Episodes</strong></td><td class="num">480</td></tr> |
| <tr><td><strong>Env turns</strong></td><td class="num">1920</td></tr> |
| <tr class="avg-row"><td><strong>Mean episode reward</strong></td><td class="num">+0.4692 ± 0.9758</td></tr> |
| <tr><td><strong>Min / Max episode reward</strong></td><td class="num">−0.40 / +4.5205</td></tr> |
| <tr><td><strong>Accuracy (per-question)</strong></td><td class="num">17.76%</td></tr> |
| <tr><td><strong>Cap-hit rate</strong></td><td class="num">13.54%</td></tr> |
| <tr><td><strong><code>env_step_error</code></strong></td><td class="num">0</td></tr> |
| </tbody> |
| </table> |
| </div> |
| <p><strong>Why <code>true-4q</code> (four questions per episode), not the designed ten.</strong> The headline run sets <code>num_questions=4</code>, not 10, because Qwen3-14B + Unsloth LoRA + vLLM tensor-parallel inference + multi-turn thinking-mode rollouts saturated 40 GB A100s at four questions per episode with two generations per GRPO group. Pushing to ten questions per episode triggered OOM in the rollout cache on the same hardware. That reduction is a <strong>VRAM ceiling</strong>, not a claim that the full MDP is solved; the designed 10-question battery remains the target for future hardware or a smaller policy (see <a href="#stack-split">Pathology 3</a> for why LoRA + ZeRO-3 + CPU offload entered the recipe).</p> |
| <p>Three things this run shows. (1) The multiplicatively-coupled per-step + terminal reward designed in <a href="#scoring">Scoring</a> can produce a <strong>positive mean</strong> under a shared budget; every run in the supporting 3B tranche had a negative mean. (2) The max episode reward of <strong>+4.52</strong> means at least one episode hit both high accuracy <em>and</em> good utilization; the product is only large when both factors are. (3) <code>env_step_error = 0</code> over 1920 turns means the OpenEnv WebSocket invariant, tokenizer alignment, and DeepSpeed ZeRO-3 / vLLM TP=2 split all held under a full training run, not just a smoke test.</p> |
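<p>The reported min and max can be sanity-checked against the reward definition with a few lines of arithmetic. This is a sketch under stated assumptions, not a reading of the logs: it assumes <code>alpha = 1</code>, that the logged episode reward is the sum of per-step rewards plus the terminal bonus, and that the efficiency bonus is at most <code>γ</code> per correct step.</p>

```python
N = 4                       # questions per episode in the headline run
GAMMA, LAMBDA_EP = 0.1, 0.5  # documented defaults

# Every answer wrong, no cost or overspend penalty, zero terminal bonus.
all_wrong = N * -0.1


def best_case(num_correct):
    """Upper bound on episode reward with num_correct right answers."""
    per_step = num_correct * (1.0 + GAMMA) + (N - num_correct) * -0.1
    terminal = LAMBDA_EP * (num_correct / N) * 1.0  # utilization score at most 1
    return per_step + terminal


print(round(all_wrong, 4))     # -0.4
print(round(best_case(3), 4))  # 3.575
print(round(best_case(4), 4))  # 4.9
```

<p>Under these assumptions the reported minimum of −0.40 matches an episode with all four answers wrong and no overspend, and the reported maximum of +4.5205 exceeds the three-correct bound of 3.575, so that episode would have had to answer all four questions correctly.</p>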
|
|
| <h4 style="color:var(--accent2);margin-top:1.25rem;">Contextualizing the 17.76% accuracy — baselines not yet run</h4> |
| <p><strong>We have not yet run baselines against the headline 14B checkpoint.</strong> The per-question accuracy (17.76%) is therefore uncalibrated. It should be read only against the designed performance targets for each baseline in <a href="#baselines">Baselines (planned)</a>, reproduced here for the reader who doesn't want to scroll:</p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Baseline (planned, not run)</th><th>Designed performance target</th><th>What a converged RL policy should beat</th></tr></thead> |
| <tbody> |
| <tr><td><strong>uniform-random-split</strong></td><td>Mean reward ≈ <code>0</code>, accuracy distributed by dataset-difficulty mean; utilization unshaped</td><td>Lower bound — any structured policy should clear this</td></tr> |
| <tr><td><strong>greedy-first</strong></td><td>Accuracy capped at <code>≤ 1/N</code> (25% for 4q, 10% for 10q) because later questions are starved; utilization poor on the depleted tail</td><td>Our policy must show cross-prompt pacing, not just solve Q1</td></tr> |
| <tr><td><strong>always-same-budget</strong></td><td>Accuracy approaches the dataset-difficulty mean at the given <code>total_budget / N</code>; zero utilization shaping because allocation is oblivious to difficulty</td><td>Our policy's marginal value comes from difficulty-aware allocation, not just from having a budget at all</td></tr> |
| <tr><td><strong>zero-shot LLM (same Qwen3-14B, no RL)</strong></td><td>Measurable ceiling achievable without RL; expected to be strong on per-question accuracy but poor on <code>budget_utilization_score</code> because the base model does not pace</td><td>RL contribution = product gain (<code>accuracy × utilization</code>) the base model cannot reach</td></tr> |
| </tbody> |
| </table> |
| </div> |
| <p class="math-note">17.76% per-question accuracy over the true-4q battery is meaningful only once the zero-shot LLM row above is filled in. Until then it is a raw rate, not a claim about RL uplift. Baseline evaluation against the released checkpoint is the next result in <a href="#future">Future Work</a>.</p> |
|
|
| <h3 id="supporting-3b">Supporting — Qwen2.5-3B tranche (1×H100)</h3> |
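| <p>This tranche validates the full pipeline end-to-end. Mean reward improves monotonically across the four runs as training length, <code>max_completion_length</code> (<code>mcl</code> in the table), and <code>max_tokens_per_step</code> (<code>tokens/step</code>) grow. Every mean is still negative, so the interesting signal is the spread and the positive tail (the max in parentheses), not the mean itself.</p> |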
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Run</th><th>Model</th><th>Episodes</th><th>Key settings</th><th>Mean (max)</th></tr></thead> |
| <tbody> |
| <tr><td class="task-id">qwen25_3b_best_single_h100</td><td>Qwen2.5-3B-Instruct</td><td class="num">87</td><td><code>batch=2</code>, <code>gens=2</code>, <code>lr=7e-7</code>, <code>mcl=640</code>, turns=4</td><td class="num">−0.4598 (+1.29)</td></tr> |
| <tr><td class="task-id">qwen25_3b_better</td><td>Qwen2.5-3B-Instruct</td><td class="num">285</td><td>2 epochs, same shape</td><td class="num">−0.3156 (+1.33)</td></tr> |
| <tr><td class="task-id">qwen25_3b_more_context</td><td>Qwen2.5-3B-Instruct</td><td class="num">368</td><td>2 epochs, <code>mcl=1024</code>, <code>tokens/step=512</code></td><td class="num">−0.2279 (+2.34)</td></tr> |
| <tr class="avg-row"><td class="task-id">qwen25_3b_best_v2</td><td>Qwen2.5-3B-Instruct</td><td class="num">687</td><td>4 epochs, <code>mcl=1024</code>, <code>tokens/step=1024</code>, <code>lr=5e-7</code></td><td class="num">−0.1476 (+2.29)</td></tr> |
| </tbody> |
| </table> |
| </div> |
| <p>Plumbing smokes on Qwen2.5-0.5B confirm the trainer path (loss <code>0.00889</code> / <code>0.006714</code>; wall-clock 1246 s / 1279 s). These are infrastructure validation, not research signal.</p> |
|
|
| <h3 id="boundary">Boundary / failure runs</h3> |
| <p>Where today's stack breaks. Each row is a training-stack limit, not an env or reward-shape limit.</p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Attempt</th><th>Setup</th><th>Outcome</th><th>Blocker</th></tr></thead> |
| <tbody> |
| <tr><td>Unsloth 14B, 1×A100-40GB</td><td><code>vllm_mode=colocate</code>, <code>max_model_len=3500</code>, <code>gens=2</code></td><td>1/120 steps; <code>grad_norm=NaN</code>; <code>selective_log_softmax</code> <code>RuntimeError</code></td><td>Unsloth truncation × thinking-mode completions (<a href="#truncation">Pathology 2</a>)</td></tr> |
| <tr><td class="task-id">qwen25_7b_4q_ultralow</td><td>1×H100 colocate, <code>mcl=256</code></td><td>Repeated CUDA OOM</td><td>7B thinking context doesn't fit colocate headroom</td></tr> |
| <tr><td>Qwen3-8B sharded</td><td>8×A100 FSDP + vLLM server</td><td>No stable reward summary</td><td>TRL/FSDP weight-sync (<a href="#stack-split">Pathology 3</a>)</td></tr> |
| <tr><td>Qwen3-30B-A3B-Instruct-2507</td><td>8×A100 FSDP + server-mode vLLM</td><td>No completed summary</td><td>vLLM KV-cache / communicator startup</td></tr> |
| <tr><td>Qwen2.5-32B-Instruct</td><td>8×A100 FSDP + server-mode vLLM</td><td>No completed summary</td><td>Prompt length + vLLM startup + NCCL init</td></tr> |
| <tr><td>Early Qwen2.5-14B smoke</td><td>8×A100 FSDP + server</td><td>2 episodes (reward −0.07, −0.30)</td><td>Superseded by the headline ZeRO-3 run</td></tr> |
| </tbody> |
| </table> |
| </div> |
|
|
| <h3 id="baselines">Baselines (planned)</h3> |
| <p>No baselines have been computed against the headline 14B checkpoint yet. We planned four, in the spirit of <a href="#foundations">LotteryElicitationEnv</a>'s baseline set, each isolating one axis of the allocation decision.</p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Baseline</th><th>Policy</th><th>What it isolates</th></tr></thead> |
| <tbody> |
| <tr><td><strong>always-same-budget</strong></td><td>Allocate <code>total_budget / N</code> per question; greedy decode within each per-question cap</td><td>Difficulty-awareness gain (is it just the total budget that matters, or how it's split?)</td></tr> |
| <tr><td><strong>greedy-first</strong></td><td>Spend up to cap on Q1, Q2, … until budget runs out; truncate remainder</td><td>Pacing cost (what's lost by no-foresight allocation?)</td></tr> |
| <tr><td><strong>uniform-random-split</strong></td><td>Dirichlet-sampled per-question allocations at the same total budget</td><td>Lower bound — beating it proves the policy learned <em>anything</em> structured</td></tr> |
| <tr><td><strong>zero-shot LLM</strong></td><td>Same base model (Qwen3-14B) with no RL, greedy decode against the same battery at the same budget</td><td>RL contribution — measurable upper bound achievable without learning</td></tr> |
| </tbody> |
| </table> |
| </div> |
| <p><strong>Baseline performance targets.</strong> These are expectations, not measurements. <code>uniform-random-split</code> should land near mean reward 0, because a random split is unlikely to hit good utilization by accident. <code>greedy-first</code> should see accuracy capped at <code>1/N</code>, because later questions are starved. <code>always-same-budget</code> should approach the env-difficulty mean with zero utilization shaping. <code>zero-shot LLM</code> is the measurable ceiling without RL shaping. A converged RL policy should beat all four on the product (<code>accuracy × utilization</code>), not on any single factor.</p> |
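| <p>The three non-LLM baselines in the table can be sketched as pure allocation functions. This is a minimal sketch rather than the evaluation harness: the function names, the <code>per_question_cap</code> parameter, and the flat Dirichlet(1, …, 1) prior are assumptions made for illustration, and the zero-shot LLM baseline is left out because it needs a model.</p> |

```python
import random


def always_same_budget(total_budget: int, n: int) -> list[int]:
    """Equal split: total_budget // n tokens for every question."""
    return [total_budget // n] * n


def greedy_first(total_budget: int, n: int, per_question_cap: int) -> list[int]:
    """Spend up to the cap on Q1, Q2, ... until the shared budget runs out."""
    remaining = total_budget
    allocations = []
    for _ in range(n):
        spend = min(per_question_cap, remaining)
        allocations.append(spend)
        remaining -= spend
    return allocations


def uniform_random_split(total_budget: int, n: int, rng: random.Random) -> list[int]:
    """Dirichlet(1, ..., 1) split, sampled as normalized Gamma(1, 1) draws."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
    scale = total_budget / sum(draws)
    allocations = [int(d * scale) for d in draws]
    # Give the rounding remainder to the last question so the total is exact.
    allocations[-1] += total_budget - sum(allocations)
    return allocations


print(always_same_budget(2000, 4))                  # [500, 500, 500, 500]
print(greedy_first(2000, 4, per_question_cap=800))  # [800, 800, 400, 0]
print(uniform_random_split(2000, 4, random.Random(0)))
```

| <p>All three return per-question token allocations at the same total budget, which is what makes their rewards comparable. In the <code>greedy-first</code> example the fourth question receives nothing, which is the starvation effect the targets above refer to.</p> |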
|
|
| <h3>Status</h3> |
| <p>We have: a completed 14B training run with positive mean reward (+0.4692), a 3B tranche validated end-to-end, a mapped set of stack boundaries, and three named pathologies (<a href="#nccl">NCCL desync</a>, <a href="#truncation">truncation collapse</a>, <a href="#stack-split">stack non-composition</a>) with reproducible signatures and structural fixes. We do not yet have baselines evaluated against the headline checkpoint, or cross-family model comparisons. Those are <a href="#future">Future Work</a>.</p> |
| </section> |
|
|
| |
| <section id="engineering"> |
| <h2>Engineering lessons</h2> |
| <p>Shipping a real GRPO + OpenEnv + vLLM pipeline on a multi-turn verifiable-reward environment surfaced three major pathologies. Each is a named failure with a reproducible signature and a structural fix. We document them so the next OpenEnv submission can avoid the same dead ends.</p> |
|
|
| <h3 id="nccl">Pathology 1 — NCCL desync under variable-length rollouts</h3> |
| <p><strong>Signature.</strong> <code>all_gather_object</code> / <code>broadcast_object_list</code> hang on the second optimizer step; heartbeat timeout at ~7200 s; all ranks report <code>env_step_error=0</code> and <code>ConnectionClosedError=0</code> on the rollout but block at the post-rollout barrier. NCCL flight recorder shows mismatched <code>last enqueued</code> / <code>last completed</code> sequence numbers across ranks — structural, not configurational.</p> |
| <p><strong>Root cause.</strong> In <code>vllm_mode=server</code> with <code>world_size > 1</code>, every <code>trainer.vllm_generation.generate()</code> runs <code>accelerate.gather_object</code> → NCCL <code>all_gather_object</code> → <code>broadcast_object_list</code>. Our rollout is a <code>while not session.done</code> loop — different ranks make different numbers of <code>generate()</code> calls per episode. NCCL is sequence-numbered; different call counts per rank → permanent desync → eventual <code>_pickle.UnpicklingError</code> or a BROADCAST timeout. A secondary trigger: stale <code>__pycache__</code> on NFS kept loading the <em>pre-fix</em> <code>.pyc</code>, reintroducing the desync after <code>git pull</code> reported the repo clean.</p> |
| <p><strong>Evidence.</strong> Full write-ups in <a href="impl-context/dist-train-desync-issue.md">impl-context/dist-train-desync-issue.md</a> and <a href="impl-context/dist-train-issue-hung-gpu.md">impl-context/dist-train-issue-hung-gpu.md</a>.</p> |
| <p><strong>Fix stack.</strong></p> |
| <ol> |
| <li>Clear bytecode caches in every launch script: |
<pre><code># Remove stale bytecode so the patched sources are what actually loads
find "$REPT_ROOT" -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null
find "$REPT_VENV" -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null
PYTHONDONTWRITEBYTECODE=1 torchrun --nproc_per_node=7 ...</code></pre>
| </li> |
| <li>Fixed-count <code>generate()</code> padding per episode. Every rank performs exactly <code>DIST_SERVER_GENERATES_PER_EPISODE = 8</code> calls — real generates for active turns, 1-token dummies (discarded) for the rest. Active only when <code>vllm_mode == "server"</code> and <code>world_size > 1</code>. Reward, logprobs, and credit assignment are byte-identical to the unpadded case.</li> |
| <li>On the 8×A100 headline path we sidestep the server-mode collective entirely: DeepSpeed ZeRO-3 with vLLM on dedicated GPUs keeps the rollout collective off the hot path (see Pathology 3). This is what unblocked <code>14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe</code>.</li> |
| </ol> |
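| <p>The fixed-count padding in step 2 can be sketched as follows. This is an illustration under stated assumptions, not the repository's rollout code: <code>ToySession</code>, <code>run_episode</code>, and the <code>generate(prompt, max_new_tokens)</code> callable are hypothetical stand-ins, and only the constant <code>DIST_SERVER_GENERATES_PER_EPISODE = 8</code> comes from the text above.</p> |

```python
from typing import Callable, List

DIST_SERVER_GENERATES_PER_EPISODE = 8  # constant quoted in the fix stack


class ToySession:
    """Hypothetical stand-in for an episode session with a fixed number of turns."""

    def __init__(self, num_turns: int) -> None:
        self.num_turns = num_turns
        self.turn = 0

    @property
    def done(self) -> bool:
        return self.turn >= self.num_turns

    def next_prompt(self) -> str:
        return f"question {self.turn}"

    def step(self, completion: str) -> None:
        self.turn += 1


def run_episode(session, generate: Callable[[str, int], str],
                max_new_tokens: int, pad_calls: bool) -> List[str]:
    """Run one episode; optionally pad to a fixed number of generate() calls."""
    completions: List[str] = []
    calls = 0
    while not session.done:
        if calls >= DIST_SERVER_GENERATES_PER_EPISODE:
            raise RuntimeError("episode needs more generate() calls than the fixed count")
        text = generate(session.next_prompt(), max_new_tokens)
        session.step(text)
        completions.append(text)
        calls += 1
    if pad_calls:
        # Dummy 1-token calls keep the per-rank call count identical.
        # Their outputs are discarded and never reach reward or logprobs.
        while calls < DIST_SERVER_GENERATES_PER_EPISODE:
            generate("", 1)
            calls += 1
    return completions
```

| <p>A caller would pass <code>pad_calls = (vllm_mode == "server" and world_size &gt; 1)</code>, matching the activation condition above. The sketch raises if an episode needs more than eight real calls instead of silently truncating it; that guard is our assumption, not documented behaviour.</p> |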
| <p><strong>Why this matters beyond our repo.</strong> Any TRL <code>rollout_func</code> user running variable-length rollouts in server mode has this bug latent. Heartbeat and batch tuning only hide it longer.</p> |
|
|
| <h3 id="truncation">Pathology 2 — Truncation-induced zero-advantage collapse</h3> |
| <p><strong>Signature.</strong> From the most recent Lambda run (<a href="#results">terminal 36</a>, Qwen3-4B + Unsloth, step 1/120):</p> |
| <pre><code>Unsloth: Input IDs of shape torch.Size([2, 12986]) with length 12986 |
| > the model's max sequence length of 3500. |
| We shall truncate it ourselves. |
| RuntimeError: Size does not match at dimension 1 |
| expected index [2, 12760, 1] to be no larger than |
| self [2, 3499, 151936] apart from dimension 2 # in selective_log_softmax</code></pre> |
| <p>Upstream telemetry in the same step: <code>completions/clipped_ratio = 1.0</code>, <code>mean_terminated_length = 0</code>, <code>frac_reward_zero_std = 0.375</code>, <code>importance_sampling_ratio/mean = 0.072</code>, <code>grad_norm = NaN</code>. Every completion is clipped, zero episodes terminate with <code>&lt;/think&gt;</code> + <code>\boxed{}</code>, and 37.5% of GRPO groups have <code>reward_std = 0</code>, so their advantages are exactly zero; what gradient remains is noise, and the step ends in <code>NaN</code>.</p> |
| <p><strong>Root cause.</strong> Unsloth silently patches the policy tokenizer to <code>max_seq_length = 3500</code> even when vLLM is configured at 32768. GRPO computes completion logits against the <em>truncated</em> 3499-token sequence, then tries to gather at the <em>untruncated</em> 12760 completion indices — the shapes disagree and the forward breaks. More broadly: when truncation is uniform across the GRPO group, every completion lands at the same wrong answer → <code>std(r) = 0</code> → <code>A_i = 0</code> → zero gradient. The policy also learns to emit filler (repeated <code>!</code>) because that is what survives the hard cap with the least cost penalty.</p> |
| <p><strong>Fix.</strong> Two structural changes together, not either alone. (1) <em>Keep Unsloth</em>, but only inside the DeepSpeed ZeRO-3 branch we document in <a href="#stack-split">Pathology 3</a>. In our runs, ZeRO-3's all-gather window <em>empirically</em> aligned with Unsloth's forward where FSDP's per-module <code>summon_full_params</code> did not, which removed the <code>selective_log_softmax</code> shape mismatch we hit on FSDP. (2) Independently, enforce <code>vllm_max_model_length</code> = <code>tokenizer.model_max_length</code> = actual episode budget at startup, validated via an assertion, and raise <code>max_completion_length</code> to give the closing <code>&lt;/think&gt;</code> and <code>\boxed{}</code> room. On top of those two: start in soft-budget mode and warm into hard-cap, and increase <code>num_generations</code> to break group-level uniformity. <strong>LR tuning is the wrong instinct</strong>: the gradient is <em>structurally</em> zero, not noisy. Branch A (FSDP, full fine-tune) sidesteps the Unsloth composition entirely and does not hit this pathology.</p> |
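| <p>The startup check in change (2) could look like the sketch below. It is a minimal illustration: the function name and its plain-integer arguments are hypothetical, and a real launcher would read the three values from its vLLM config, tokenizer, and environment settings.</p> |

```python
def check_length_alignment(vllm_max_model_length: int,
                           tokenizer_model_max_length: int,
                           episode_budget_tokens: int) -> None:
    """Fail fast if the three length settings disagree."""
    lengths = {
        "vllm_max_model_length": vllm_max_model_length,
        "tokenizer.model_max_length": tokenizer_model_max_length,
        "episode_budget_tokens": episode_budget_tokens,
    }
    if len(set(lengths.values())) != 1:
        # Any disagreement means one component will truncate behind the others' backs.
        raise AssertionError(f"length settings disagree: {lengths}")


# Aligned settings pass silently.
check_length_alignment(8192, 8192, 8192)
```

| <p>Calling it with the mismatch described above (vLLM at 32768, a patched tokenizer limit of 3500) raises before the first optimizer step, instead of surfacing later as a shape error inside <code>selective_log_softmax</code>.</p> |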
| <p><strong>Why this matters beyond our repo.</strong> Any GRPO run on a hybrid-thinking model under a budget-constrained MDP with a partially-right-but-cheap shortcut has this bug latent. The <code>clipped_ratio = 1</code> + <code>reward_std ≈ 0</code> pair is the fingerprint.</p> |
|
|
| <details> |
| <summary>Expand: telemetry table, log-backed truncation episode, and root-cause chain</summary> |
| <h4 style="margin-top:1rem;">The telemetry that reveals collapse</h4> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Signal</th><th>Observed value</th><th>What it means</th></tr></thead> |
| <tbody> |
| <tr><td><code>reward</code></td><td class="num">≈ −0.1 to −0.47 (flat)</td><td>Uniform negative reward across GRPO groups</td></tr> |
| <tr><td><code>completions/clipped_ratio</code></td><td class="num">1.0</td><td>Every completion hits <code>max_completion_length</code></td></tr> |
| <tr><td><code>completions/mean_terminated_length</code></td><td class="num">0</td><td>Nothing terminates naturally: no <code>&lt;/think&gt;</code>, no <code>\boxed{}</code></td></tr> |
| <tr><td><code>frac_reward_zero_std</code></td><td class="num">0.375 – 1.0</td><td>Partial to full GRPO group collapse</td></tr> |
| <tr><td><code>importance_sampling_ratio/mean</code></td><td class="num">≈ 0.07</td><td>IS ratio collapsing under policy drift</td></tr> |
| <tr><td><code>entropy</code>, <code>grad_norm</code></td><td class="num">low / NaN</td><td>Near-zero (or explicitly broken) gradient signal</td></tr> |
| </tbody> |
| </table> |
| </div> |
| <p class="math-note">Numbers above are drawn from the most recent Lambda <code>train.log</code> (Qwen3-4B, single A100-40GB, Unsloth LoRA, step 1/120) and reconciled against earlier 2×H100 Qwen3-4B trainer metrics. The headline Qwen3-14B 8×A100 run in <a href="#headline-14b">Results</a> resolved this signature — it is what a stack that <em>doesn't</em> truncate looks like.</p> |
|
|
| <h4>A truncation-collapsed episode, log-verified</h4> |
| <p>Training emits a fixed <code>reward_logs.jsonl</code> schema on every step (<code>step_index</code>, <code>question</code>, <code>remaining_budget_before</code>, <code>visible_response</code>, <code>raw_step_reward</code>, <code>scaled_step_reward</code>, <code>done_after_step</code>, <code>episode_reward</code>). The sanitized excerpt below — from the single-GPU Unsloth + Qwen3-4B stack — shows episode-level telemetry when truncation collapse is in full effect.</p> |
| <div class="episode-trace"> |
| <div class="trace-step"> |
| <div class="step-marker terminal"></div> |
| <div class="step-label">Step 1 of 10 · <code>remaining_budget_before = 1256</code></div> |
| <div class="step-content"> |
| <strong>Question:</strong> <em>The matrices … are inverses. Enter the ordered pair <code>(a,b)</code>. The answer is 14. What is the value of unknown variable <code>X</code>?</em><br> |
| <strong><code>visible_response</code>:</strong> <code>"!!!!!!!!!!!! … (truncated) !!!!!!!!!!!!"</code><br> |
| <strong><code>reasoning_trace</code>:</strong> <code>""</code> (empty)<br> |
| <strong><code>raw_step_reward</code>:</strong> <code>−0.11</code> · <strong><code>done_after_step</code>:</strong> <code>false</code><br> |
| <span class="math-note">Per <a href="#scoring">Scoring</a> / <code>compute_reward</code>: <code>correctness = −0.1</code> (wrong answer) plus a small <code>cost_penalty</code> from mild overspend past <code>fair_share</code> (≈ <code>−0.01</code>); <code>efficiency_bonus = 0</code> because the step is incorrect.</span> |
| </div> |
| </div> |
| <div class="trace-step"> |
| <div class="step-marker terminal"></div> |
| <div class="step-label">Step 9 · <code>remaining_budget_before = 451.0</code></div> |
| <div class="step-content"> |
| <strong><code>visible_response</code>:</strong> same repeated-<code>!</code> pattern; <code>was_correct = false</code>.<br> |
| <strong><code>raw_step_reward</code>:</strong> <code>−0.1</code><br> |
| <span class="math-note">Here <code>correctness = −0.1</code> alone; <code>cost_penalty ≈ 0</code> (no meaningful overspend past <code>fair_share</code> on this step).</span> |
| </div> |
| </div> |
| <div class="trace-step"> |
| <div class="step-marker terminal"></div> |
| <div class="step-label">Step 10 · episode terminal</div> |
| <div class="step-content"> |
| <strong><code>final_observation.accuracy_so_far</code>:</strong> <code>0.0</code><br> |
| <strong><code>episode_reward</code>:</strong> <code>−1.0174</code><br> |
| <span class="math-note">Arithmetic: ten wrong steps each carry <code>correctness = −0.1</code> (≈ <code>−1.0</code> summed), plus small per-step <code>cost_penalty</code> terms where <code>spend_ratio > 1</code>; the terminal term <code>λ<sub>ep</sub> · episode_accuracy · budget_utilization_score</code> is <strong>zero</strong> because <code>episode_accuracy = 0</code>. The logged <code>−1.0174</code> is therefore dominated by wrong-answer penalties, not "ten pure cost penalties."</span><br> |
| <strong>Trainer-side consequence:</strong> <code>completions/clipped_ratio = 1.0</code>, <code>mean_terminated_length = 0</code>, <code>frac_reward_zero_std = 0.375</code>, <code>grad_norm = NaN</code>. Step 2 never runs. |
| </div> |
| </div> |
| <div class="trace-verdict bad"> |
| Ten degenerate completions, zero correctness, zero terminal bonus by construction. The multiplicative coupling refuses to reward a policy that didn't solve anything — doing exactly what it was designed to do. The failure is upstream: the policy can't produce a syntactically valid answer because Unsloth truncated the input out from under it. |
| </div> |
| </div> |
| <p class="math-note">Caption: truncation-collapsed Unsloth episode on 1×A100, <code>episode_reward = −1.0174</code>. Grader behaves correctly; the failure is upstream in the stack.</p> |
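| <p>The arithmetic in the terminal step can be checked directly. The sketch assumes, as that note implies, that <code>episode_reward</code> is the plain sum of per-step rewards plus the terminal term, with no extra scaling; the variable names are ours.</p> |

```python
WRONG_ANSWER_REWARD = -0.1       # per-step correctness term quoted in the trace
NUM_STEPS = 10
LOGGED_EPISODE_REWARD = -1.0174  # episode_reward from the log excerpt

correctness_total = WRONG_ANSWER_REWARD * NUM_STEPS
terminal_term = 0.0              # episode_accuracy = 0 zeroes the multiplicative term

# Whatever is left after correctness and the terminal term is cost penalty.
cost_penalty_total = LOGGED_EPISODE_REWARD - correctness_total - terminal_term

print(round(correctness_total, 4))   # -1.0
print(round(cost_penalty_total, 4))  # -0.0174
```

| <p>Under that assumption, wrong-answer penalties account for roughly 98% of the logged reward, and all cost penalties together contribute about <code>−0.0174</code>.</p> |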
|
|
| <h4>Root cause chain</h4> |
| <ol> |
| <li><code>max_completion_length = 4096</code> default + Qwen3 thinking → <code>&lt;think&gt;…&lt;/think&gt;</code> span alone consumes 1500–4000 tokens.</li> |
| <li>Completion gets truncated <strong>before</strong> the closing <code>&lt;/think&gt;</code> → no <code>\boxed{…}</code> answer.</li> |
| <li>With <code>grading_use_visible_only = True</code>, the empty visible tail grades as incorrect; in the latest Lambda run the visible tail is a degenerate string of <code>!</code> characters (no <code>\boxed{}</code>, <code>reasoning_trace</code> empty).</li> |
| <li>Uniform truncation across all completions in the GRPO group → identical rewards → <code>std(r) = 0</code> → <code>A_i = 0</code> → <strong>zero gradient</strong>.</li> |
| <li>Policy divergence over reuse epochs pulls the IS ratio down (<code>mean ≈ 0.07</code>, <code>min = 0</code>) — noise on top of zero signal.</li> |
| <li>Hard-cap multi-turn budget amplifies the effect: verbose truncated completions terminate episodes early, shrinking the learning signal further.</li> |
| </ol> |
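| <p>Step 4 of the chain is easy to reproduce in isolation. The sketch below uses the standard group-relative advantage, <code>(r_i − mean) / (std + eps)</code>; the population standard deviation and the <code>eps</code> value are simplifying assumptions and may differ from what the trainer uses internally. The helper names are ours.</p> |

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Group-relative advantage: (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def frac_reward_zero_std(groups: list[list[float]]) -> float:
    """Fraction of groups whose rewards are all identical."""
    zero = sum(1 for g in groups if pstdev(g) == 0.0)
    return zero / len(groups)


collapsed = [-0.11, -0.11, -0.11, -0.11]  # uniform truncation: identical rewards
mixed = [-0.11, 1.0, -0.11, -0.11]        # one completion actually finished

print(group_advantages(collapsed))  # every advantage is zero
print(group_advantages(mixed))      # the finished completion stands out

# Three collapsed groups out of eight reproduces the logged 0.375.
print(frac_reward_zero_std([collapsed] * 3 + [mixed] * 5))
```

| <p>A collapsed group contributes no gradient regardless of learning rate, which is why the fix has to change what the completions contain rather than how large the update step is.</p> |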
| <p><strong>The most recent Lambda run (2026-04-14) adds a second failure mode on top:</strong> with Unsloth enabled and <code>max_model_len = 3500</code>, the accumulated prompt + completion reached <strong>12 986 tokens</strong>. Unsloth warned <code>"We shall truncate it ourselves"</code> and the trainer then crashed inside <code>_get_per_token_logps_and_entropies → selective_log_softmax</code> with <code>RuntimeError: Size does not match at dimension 1 expected index [2, 12760, 1] to be no larger than self [2, 3499, 151936] apart from dimension 2</code>. The full-length completion indices (12760) and the truncated logits (3499) disagreed, the forward broke, and <code>grad_norm</code> became <code>NaN</code>. The fix is the same as for the root truncation chain above: raise <code>max_completion_length</code> (and the vLLM <code>max_model_len</code>) to match what thinking mode actually emits, or disable thinking mode outright. LR tuning cannot fix this: the gradient is <em>structurally</em> zero (or undefined).</p> |
| <blockquote>This failure mode is general, not specific to ReasoningEconomicsEnv. <strong>Any GRPO run on a hybrid-thinking model under a budget-constrained multi-turn MDP with a partially-right-but-cheap shortcut has this bug latent.</strong> We believe every future sequential-budget RL run on reasoning models needs to start from this diagnosis.</blockquote> |
| </details> |
|
|
| <h3 id="stack-split">Pathology 3 — Stack non-composition; two validated branches</h3> |
| <p><strong>Signature.</strong> <code>Unsloth for GRPO is not yet implemented! Just ignore this function.</code> (<a href="#results">terminal 36</a>); FSDP2 shard-gather blocking <code>rollout_func</code>; vLLM colocate stealing VRAM from the policy on 40 GB A100s; repeated CUDA OOM at 7B on single H100 (<code>qwen25_7b_4q_ultralow</code>); Qwen3-30B-A3B and Qwen2.5-32B failures at vLLM KV-cache / communicator startup.</p> |
| <p><strong>Root cause.</strong> TRL 1.0.0 + vLLM (colocate) + Unsloth + FSDP do not compose as a four-way intersection. Each pairwise composition has a known sharp edge (TRL PR #3582 FSDP weight-sync, Unsloth's <code>FastLanguageModel.get_peft_model</code> × FSDP, FSDP1 <code>_is_root</code> assertion under TRL's <code>summon_full_params</code> per child module, <code>GuidedDecodingParams</code> movement across vLLM versions). Specifically, <strong>Unsloth + FSDP does not work</strong> in this stack — FSDP's parameter sharding and Unsloth's fused kernels disagree on tensor shapes during the GRPO log-prob forward pass.</p> |
| <p><strong>Resolution — two branches, chosen by model scale and LoRA need.</strong></p> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Branch</th><th>Sharding / optimizer</th><th>LoRA</th><th>Where it's used</th><th>What it unlocks</th></tr></thead> |
| <tbody> |
| <tr><td><strong>Branch A</strong></td><td>FSDP2 (or FSDP1) via <code>model-sharding-fsdp2.yaml</code> / <code>model-sharding.yaml</code>, no CPU optimizer offload</td><td><strong>None</strong> — full fine-tune</td><td>Multi-GPU server-mode vLLM; Qwen2.5-3B / Qwen3-4B tranches</td><td>Cleanest weight-sync to <code>trl vllm-serve</code> (<code>_sync_fsdp2_params_to_vllm</code>); no Unsloth interaction risk.</td></tr> |
| <tr><td><strong>Branch B</strong></td><td><strong>DeepSpeed ZeRO-3</strong> (stage 3 parameter + optimizer sharding) with <strong>CPU optimizer offload</strong></td><td><strong>Unsloth-integrated LoRA</strong> (4-bit QLoRA, Unsloth fused kernels)</td><td>Headline Qwen3-14B 8×A100 run (<code>14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe</code>)</td><td><strong>Our finding:</strong> this is the only configuration we found in which TRL 1.0 + GRPO + Unsloth completes end-to-end at 14B under 40 GB A100s with trainable adapters. ZeRO-3's all-gather window <em>empirically</em> aligned with Unsloth's forward in our runs, where FSDP2's per-module <code>summon_full_params</code> did not. Neither pairing has an upstream-blessed recipe; we report it as an engineering contribution, not a library guarantee.</td></tr> |
| </tbody> |
| </table> |
| </div> |
| <p><strong>Shared infra across both branches.</strong></p> |
| <ul> |
| <li><strong>vLLM split.</strong> Server-mode on the <em>last</em> <code>REPT_VLLM_TP</code> GPUs (e.g. GPUs 6–7 with TP=2 on the 14B headline); policy trains on the remaining ranks. No colocate stealing VRAM from the policy.</li> |
| <li><strong>OpenEnv process.</strong> Separate process, <code>start_openenv_server.sh</code>, WebSocket on <code>127.0.0.1:8000</code>, configurable via <code>REASON_BUDGET_NUM_QUESTIONS</code>, <code>REASON_BUDGET_HARD_CAP_MODE</code>, <code>REASON_BUDGET_BUDGET_RATIO</code>, <code>REASON_BUDGET_TOKENIZER_NAME</code>. Same env binary across every run; only env vars change.</li> |
| <li><strong>Port matrix.</strong> OpenEnv on 8000, <code>trl vllm-serve</code> on 8001 (<code>REPT_VLLM_PORT</code>), NCCL rendezvous on 51216, Accelerate on 29500.</li> |
| <li><strong>Dependency pin.</strong> Both branches use the same core stack: <code>trl==1.0.0</code>, <code>vllm==0.10.2</code>, <code>transformers>=5.2,<5.4</code>, <code>torch==2.8.*</code> (<code>requirements.lambda.txt</code> / <code>requirements.carc-cu121.txt</code>). The branches differ only in sharding recipe and LoRA presence.</li> |
| </ul> |
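| <p>The GPU split and port matrix above are simple enough to validate in a preflight step. The sketch below is illustrative: <code>split_gpus</code> and the <code>PORTS</code> dictionary are hypothetical names, while the "last <code>REPT_VLLM_TP</code> GPUs" rule and the four port numbers are taken from the list above.</p> |

```python
def split_gpus(num_gpus: int, vllm_tp: int) -> tuple[list[int], list[int]]:
    """Return (policy_gpus, vllm_gpus); vLLM takes the last vllm_tp devices."""
    if not 0 < vllm_tp < num_gpus:
        raise ValueError("vllm_tp must leave at least one GPU for the policy")
    policy_gpus = list(range(num_gpus - vllm_tp))
    vllm_gpus = list(range(num_gpus - vllm_tp, num_gpus))
    return policy_gpus, vllm_gpus


PORTS = {
    "openenv": 8000,
    "vllm_serve": 8001,
    "nccl_rendezvous": 51216,
    "accelerate": 29500,
}

policy_gpus, vllm_gpus = split_gpus(num_gpus=8, vllm_tp=2)
print(policy_gpus)  # [0, 1, 2, 3, 4, 5]
print(vllm_gpus)    # [6, 7]

# Every service needs its own port.
assert len(set(PORTS.values())) == len(PORTS)
```

| <p>With eight GPUs and TP=2 this reproduces the headline layout: vLLM on GPUs 6–7 and the policy on the remaining six.</p> |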
| <p><strong>Evidence.</strong> Branch B is what <code>14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe</code> ran on — 20/20 optimizer steps, 480 episodes, 1920 env turns, mean reward <strong>+0.4692</strong>, <code>env_step_error = 0</code>. Branch A is the path the Qwen2.5-3B tranche (<a href="#supporting-3b">Supporting — 3B tranche</a>) ran on, end-to-end without LoRA.</p> |
| <p><strong>Per-WebSocket invariant (prerequisite for both branches).</strong> Both branches rest on one OpenEnv invariant: <strong>one <code>Environment</code> instance per WebSocket session</strong>. <code>EpisodeSession</code> is a context manager held for the full multi-turn episode. Violating it (e.g. opening <code>client.sync()</code> inside <code>reset</code>/<code>step</code>) silently collapses reward to zero: episodes come back with <code>num_steps=1</code>, <code>done_after_step=true</code>, and an empty <code>final_observation</code>. Tokenizer id alignment between env and policy, via <code>resolve_env_tokenizer_name</code> and <code>REPT_MODEL_HUB_ID</code>, is the second half of that invariant.</p> |
|
|
| <h3 id="takeaways">Takeaways (single synthesis)</h3> |
| <p>The engineering story of this submission reduces to three lessons. Each maps to one of the three pathologies above; together they cover every material failure we encountered, and every subsequent design decision in <a href="#env-design">Environment Design</a>, <a href="#scoring">Scoring</a>, and <a href="#architecture">Architecture</a> follows from one of them.</p> |
| <ol> |
| <li><strong>Variable-length rollouts break NCCL server-mode by default</strong> (<a href="#nccl">Pathology 1</a>). Fixed-count padding and a dedicated-inference topology are the only structural fixes we found.</li> |
| <li><strong>Truncation is not a plumbing problem; it is a learning-signal problem</strong> (<a href="#truncation">Pathology 2</a>). <code>clipped_ratio = 1</code> + <code>reward_std ≈ 0</code> ⇒ zero gradient. No LR tune can fix that; fix it with <code>max_completion_length</code>, soft-budget warmup, and <code>num_generations</code>.</li> |
| <li><strong>The reasoning-RL stack is a non-composition, but it splits cleanly into two branches</strong> (<a href="#stack-split">Pathology 3</a>). Branch A = FSDP + full fine-tune (no LoRA) for small/mid policies. Branch B = DeepSpeed ZeRO-3 + Unsloth-integrated LoRA for 14B+. In <em>our</em> experiments, Unsloth + GRPO completed only under ZeRO-3, not under FSDP — an empirical split, not an upstream guarantee.</li> |
| </ol> |
| <p class="math-note">These three lessons are the only things from this blog worth copying into the next OpenEnv + TRL 1.0 + multi-turn submission before writing a single line of env code.</p> |
| </section> |
|
|
| |
| <section id="positioning"> |
| <h2>Positioning: online, sequential, shared-budget</h2> |
| <div class="mermaid-wrap"> |
| <pre class="mermaid"> |
| quadrantChart |
| title LLMs + Reasoning Economics · Budget Scope vs Horizon |
| x-axis "Per-query budget" --> "Shared, session-level budget" |
| y-axis "Single-turn / inference-time" --> "Sequential multi-turn MDP" |
| quadrant-1 "Our target" |
| quadrant-2 "Multi-turn, no shared budget" |
| quadrant-3 "Most prior work" |
quadrant-4 "Emerging"
"Token-Budget / CCoT": [0.12, 0.12]
"Chain-of-Draft (Xu 2025)": [0.18, 0.2]
| "L1 / LCPO / O1-Pruner": [0.25, 0.25] |
| "Kimi K1.5 Long2Short": [0.3, 0.3] |
| "SelfBudgeter": [0.4, 0.3] |
| "CoT-Valve / TokenSkip": [0.15, 0.18] |
| "Budget Forcing / s1": [0.22, 0.32] |
| "Dynasor-CoT": [0.2, 0.15] |
| "ReasoningEconomicsEnv": [0.85, 0.88] |
| </pre> |
| <p class="mermaid-caption">Figure 2. Positioning relative to prior reasoning-economics work. Our contribution occupies the shared-budget, sequential multi-turn quadrant; every prior system compresses <em>within</em> a single query.</p> |
| </div> |
| <p class="math-note">If the quadrant chart fails to render (Mermaid <code>quadrantChart</code> is marked experimental), the intent is: horizontal axis runs from per-query budget (left) to shared session-level budget (right); vertical axis runs from single-turn inference-time methods (bottom) to sequential multi-turn MDPs (top). Prior work clusters in the bottom-left; ReasoningEconomicsEnv is the top-right.</p> |
|
|
| <h3>Why the y-axis is "Sequential multi-turn MDP" — the online-learning angle</h3> |
| <p>The vertical axis is kept as a sequential multi-turn MDP deliberately: <strong>our problem is an online-decision problem, not an offline length-control problem.</strong> Every prior family in the bottom-left picks a reasoning length <em>once per prompt</em>, in isolation — prompt-guided caps, RL length rewards (including Kimi K1.5 Long2Short), SFT/distillation on shorter traces, dynamic early exit. That is an offline decision in the RL-theory sense: the policy never sees the consequence of its earlier spend affecting what's available for later prompts.</p> |
| <p>Under a shared session-level budget, the policy must <strong>revise pacing after every single observation</strong>. The remaining-budget state is non-stationary by construction: every answer shrinks the feasible set of future allocations, and a wrong call on Q1 can starve Q10 irreversibly. This is exactly the online-learning setting — a stream of observations with no resets between decisions, where the cost of a bad action compounds across the episode rather than being absorbed by an independent prompt.</p> |
| <p>Three concrete consequences of the online framing:</p> |
| <ul> |
| <li><strong>The MDP has no offline equivalent.</strong> A supervised-length-control dataset cannot encode "Q1 spent X because the agent expected Q4 to be hard" — that counterfactual only exists in sequential rollouts.</li> |
| <li><strong>The reward is temporally coupled.</strong> Terminal <code>accuracy × utilization</code> ties every intermediate allocation to a single episode-level outcome; the agent cannot learn local policies in isolation (see <a href="#scoring">Scoring</a>).</li> |
| <li><strong>The training algorithm must be on-policy enough to track non-stationary state.</strong> GRPO's group-relative advantages give us that without a separate critic, which is why the MDP and the optimizer compose — and why zero-advantage collapse under truncation (<a href="#truncation">Pathology 2</a>) is catastrophic: it removes the only signal tying per-step decisions to episode outcomes.</li> |
| </ul> |
| <p>Everything else in the bottom-left quadrant can be served by a length-annotated fine-tune or a decoding-time heuristic. The top-right quadrant — our target — cannot; it needs online sequential learning under a shared budget, which is the setting ReasoningEconomicsEnv is built to expose.</p> |
| </section> |
|
|
| |
| <section id="foundations"> |
| <h2>Foundations & citations</h2> |
| <div class="table-wrap"> |
| <table> |
| <thead><tr><th>Foundation</th><th>Role in this project</th><th>Citation</th></tr></thead> |
| <tbody> |
| <tr><td><strong>GRPO</strong></td><td>Critic-free RL objective with group-relative advantages; ideal for terminal-only and sparse verifiable rewards</td><td>Shao et al., <a href="https://arxiv.org/abs/2402.03300" target="_blank" style="color:var(--accent2)">arXiv:2402.03300</a> (DeepSeekMath)</td></tr> |
| <tr><td><strong>OpenEnv</strong></td><td>Gym-style reset/step, WebSocket transport, HF Space deployment, per-session state, concurrent sessions</td><td><a href="https://huggingface.co/blog/openenv" target="_blank" style="color:var(--accent2)">HF Blog: Introducing OpenEnv</a></td></tr> |
| <tr><td><strong>TRL 1.0 + <code>rollout_func</code></strong></td><td>Explicit multi-turn stepping; <code>env_reward</code> / <code>env_mask</code> contract; avoids <code>add_response_schema</code> Qwen3/3.5 allowlist</td><td><a href="https://huggingface.co/docs/trl/en/openenv" target="_blank" style="color:var(--accent2)">TRL × OpenEnv docs</a></td></tr> |
| <tr><td><strong>vLLM</strong></td><td>High-throughput inference; colocate and server modes; <code>trl vllm-serve</code> weight sync</td><td><a href="https://github.com/vllm-project/vllm" target="_blank" style="color:var(--accent2)">vllm-project/vllm</a></td></tr> |
| <tr><td><strong>Kimi K1.5 / Long2Short</strong></td><td>State-of-the-art per-query length RL; the strongest "compress within a single trace" reference point we position against</td><td>Moonshot AI, <a href="https://arxiv.org/abs/2501.12599" target="_blank" style="color:var(--accent2)">arXiv:2501.12599</a> (2025)</td></tr> |
| <tr><td><strong>DeepSpeed ZeRO-3</strong></td><td>Stage-3 parameter + optimizer sharding with CPU offload; the sharding backbone of Branch B (14B + Unsloth LoRA)</td><td>Rajbhandari et al., <a href="https://arxiv.org/abs/1910.02054" target="_blank" style="color:var(--accent2)">arXiv:1910.02054</a></td></tr> |
| <tr><td><strong>Unsloth</strong></td><td>Fused kernels for QLoRA training; we integrated it only on Branch B (ZeRO-3). Composition with TRL GRPO is an <em>empirical finding in our stack</em>, not a documented upstream pairing.</td><td><a href="https://github.com/unslothai/unsloth" target="_blank" style="color:var(--accent2)">unslothai/unsloth</a></td></tr> |
| <tr><td><strong>FSDP2</strong></td><td>Per-parameter fully-sharded data parallel; the sharding backbone of Branch A (full fine-tune, no LoRA)</td><td>PyTorch, <a href="https://pytorch.org/docs/stable/distributed.fsdp.fully_shard.html" target="_blank" style="color:var(--accent2)">docs</a></td></tr> |
| <tr><td><strong>MetaMathQA</strong> (active) / <strong>NuminaMath-TIR</strong> (planned)</td><td>Source dataset for episode question sampling — public, verifiable, SymPy-gradable. Current runs draw from the first 5 000 rows of MetaMathQA; NuminaMath-TIR is wired but disabled (<a href="#future">Future Work</a>).</td><td><code>meta-math/MetaMathQA</code>, <code>AI-MO/NuminaMath-TIR</code> on Hugging Face</td></tr> |
| <tr><td><strong>LotteryElicitationEnv/PT</strong></td><td>Sibling project — structural template for two-repo split, <code>rollout_func</code>, DDP padding</td><td>Same monorepo</td></tr> |
| </tbody> |
| </table> |
| </div> |
| </section> |
|
|
| |
| <section id="quickstart"> |
| <h2>Quick start</h2> |
| <p><strong>Supported quick-start path: single-GPU colocate.</strong> Multi-GPU paths (FSDP full fine-tune, and DeepSpeed ZeRO-3 + Unsloth LoRA for 14B) are described in <a href="#stack-split">Engineering Pathology 3</a>; this section sticks to the simplest working configuration.</p> |
| <pre><code><span class="c"># 1. Environment (HF Space or local Docker)</span> |
| <span class="c"># Local Docker is the most robust; point ENV_BASE_URL at http://127.0.0.1:8000.</span> |
<span class="c"># If you prefer the HF Space, use its direct host (https://&lt;owner&gt;-&lt;space&gt;.hf.space), not the hf.co/spaces page.</span>
| export ENV_BASE_URL="http://127.0.0.1:8000" |
|
|
| <span class="c"># 2. Training client (ReasoningEconomicsPT) — single-GPU colocate only</span> |
| export REPT_ROOT="$PWD" |
| export REPT_VENV="$PWD/.venv" |
| export REPT_MODEL="Qwen/Qwen3-4B" |
| export CUDA_VISIBLE_DEVICES=0 |
| export REPT_NUM_GPUS=1 |
| export REPT_VLLM_MODE=colocate <span class="c"># vLLM and policy share one GPU</span> |
| export REPT_VLLM_TP=1 |
|
|
| bash scripts/bootstrap_lambda.sh |
| bash scripts/preflight_lambda.sh |
| bash scripts/run_grpo_lambda.sh --dry-run |
| bash scripts/run_grpo_lambda.sh</code></pre> |
| <p>All episodes are seeded. Grading is deterministic (<code>extract_boxed_answer</code> with last-match semantics + SymPy equality). Budget resolution is fully specified by the four-priority table in <a href="#env-design">Environment Design</a>, with <code>budget_source</code> returned in observation metadata for audit.</p> |
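| <p>Last-match extraction is what makes grading robust to a model that boxes an intermediate value before its final answer. The sketch below illustrates that semantics only: <code>extract_last_boxed</code> is a hypothetical helper, not the project's <code>extract_boxed_answer</code>, and the SymPy equality step is not reproduced here.</p> |

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 0
    for i in range(content_start, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                return text[content_start:i].strip()
            depth -= 1
    # Unbalanced braces, e.g. a completion truncated mid-answer.
    return None


print(extract_last_boxed("First \\boxed{3}, then \\boxed{\\frac{1}{2}}"))  # \frac{1}{2}
print(extract_last_boxed("!!!!!!!!!!!!"))                                  # None
```

| <p>The second call mirrors the truncation-collapsed completions from Pathology 2: with no boxed answer there is nothing to grade, so the step is scored as incorrect.</p> |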
| <p><strong>Dependency pin:</strong> <code>trl==1.0.0</code> + <code>vllm==0.10.2</code> + <code>transformers>=5.2,<5.4</code> + <code>torch==2.8.*</code> (see <code>requirements.lambda.txt</code> / <code>requirements.carc-cu121.txt</code>). Multi-GPU variants and the two engineering branches (FSDP / DeepSpeed ZeRO-3) are covered in <a href="#stack-split">Engineering Pathology 3</a>.</p> |
|
|
| <h3>Baseline execution (planned — not yet run for our checkpoint)</h3> |
| <p>This mirrors the LotteryElicitationEnv baseline CLI pattern and evaluates the trained policy against the four planned baselines in <a href="#baselines">Baselines</a>.</p> |
| <pre><code>python -m reasoning_economics_pt.eval.evaluate \ |
| --policy hf --model ./outputs/ckpt-last \ |
| --episodes 200 \ |
| --baselines always_same_budget,greedy_first,uniform_random_split,zero_shot_llm \ |
| --num_questions 4 --budget_ratio 0.8 --hard_cap_mode strict</code></pre> |
| <p>The harness reports <code>reward_mean</code>, <code>accuracy_mean</code>, <code>budget_utilization_clamped</code>, <code>overspend_tokens</code>, average tokens per question, and questions completed, per baseline and per policy. Episodes are seeded identically across baselines so allocation deltas are directly comparable.</p> |
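| <p>The sketch below shows one way the reported metrics could be aggregated from per-episode records. Only the metric names are taken from the text; the <code>EpisodeResult</code> record, the <code>summarize</code> function, and every formula are assumptions. In particular, it assumes clamped utilization means <code>min(tokens_used / budget, 1)</code>, overspend means tokens used beyond the budget, and accuracy is measured against all questions in the episode.</p> |

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EpisodeResult:
    # Hypothetical per-episode record; field names are illustrative.
    reward: float
    correct: int        # questions answered correctly
    completed: int      # questions attempted before the episode ended
    num_questions: int  # questions in the episode
    tokens_used: int
    budget: int         # shared token budget for the episode


def summarize(results):
    """Aggregate episode records under the assumed metric definitions."""
    results = list(results)
    if not results:
        raise ValueError("no episodes to summarize")
    total_tokens = sum(r.tokens_used for r in results)
    total_completed = sum(r.completed for r in results)
    return {
        "reward_mean": mean(r.reward for r in results),
        "accuracy_mean": mean(r.correct / r.num_questions for r in results),
        # Utilization is clamped at 1.0 so overspend is reported separately.
        "budget_utilization_clamped": mean(
            min(r.tokens_used / r.budget, 1.0) for r in results
        ),
        "overspend_tokens": mean(
            max(r.tokens_used - r.budget, 0) for r in results
        ),
        "avg_tokens_per_question": total_tokens / max(total_completed, 1),
        "questions_completed": mean(r.completed for r in results),
    }


if __name__ == "__main__":
    episodes = [
        EpisodeResult(reward=1.0, correct=3, completed=4, num_questions=4,
                      tokens_used=800, budget=1000),
        EpisodeResult(reward=0.0, correct=1, completed=2, num_questions=4,
                      tokens_used=1200, budget=1000),
    ]
    for name, value in summarize(episodes).items():
        print(name, value)
```

| <p>Keeping clamped utilization and overspend as separate columns lets a policy that blows through the budget show up as overspend instead of inflating utilization past 1.</p> |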
| </section> |
|
|
| <div class="callout"> |
| <div class="q">When every token costs, can the model learn when to think?</div> |
| <div class="sub">Shared-budget reasoning under a verifiable reward is the test. The pipeline is built. The 14B headline run lands at a <strong>mean reward of +0.47</strong> across 480 episodes; convergence, baselines, and 10q training are next.</div> |
| </div> |
|
|
| |
| <section id="future"> |
| <h2>Future work</h2> |
| <ul> |
| <li><strong>Run the prescribed fix stack to convergence.</strong> Raise <code>max_completion_length</code> to 8192, cap thinking via <code>max_thinking_tokens</code>, start in soft-budget mode and warm into hard-cap, monitor the IS ratio, and add a small partial-credit term for emitting <code>&lt;/think&gt;</code>. The pathology in <a href="#traces">Training Pathology</a> tells us exactly where to intervene.</li> |
| <li><strong>TRL + Unsloth + FSDP did not complete in our stack; TRL + Unsloth + DeepSpeed ZeRO-3 did (Branch B).</strong> That is the only 14B path we validated end-to-end. Getting Unsloth 4-bit QLoRA to compose with a working FSDP2 recipe, if and when upstream support lands, would collapse Branches A and B into one; for now, Branch A (FSDP, full fine-tune) remains the fallback, without LoRA.</li> |
| <li><strong>Domain is math-only today.</strong> Grading is <code>\boxed{}</code> + SymPy equality. The MDP generalizes to any verifiable-answer domain, but we have not validated that claim empirically.</li> |
| <li><strong>Cross-model comparison.</strong> Baselines currently run on Qwen2.5-3B/7B and Qwen3-1.7B/4B; a Qwen vs DeepSeek-R1-distill vs open-model matrix is the natural next result.</li> |
| <li><strong>Large models remain infra-limited.</strong> Qwen3-8B, Qwen3-30B-A3B, Qwen2.5-32B will stay infra-limited until FSDP2 is fixed in TRL or we add tensor parallelism to the training path.</li> |
| <li><strong>Push the NCCL padding pattern upstream into TRL.</strong> The bug is general, the fix is simple, and every TRL <code>rollout_func</code> user on variable-length rollouts + server mode has it latent.</li> |
| <li><strong>Domain-transfer sibling.</strong> <em>LotteryElicitationEnv</em> targets prospect-theory preference recovery on the same training harness — validating that the allocator generalizes beyond math is future work.</li> |
| </ul> |
| </section> |
|
|
| |
| <section id="conclusion"> |
| <h2>Conclusion</h2> |
| <p><strong>ReasoningEconomicsEnv</strong> reframes the reasoning-economics question from <em>how short can this answer be?</em> to <em>how should I spend what I have left?</em> A stateless grader, a tokenizer-native budget accountant, and a multiplicatively coupled per-step-plus-terminal reward give us a sequential MDP where every component is auditable and every dollar of compute is accounted for in the unit system the policy actually sees.</p> |
| <p>The engineering contribution is summarized in one place: <a href="#takeaways">Takeaways</a>. We do not re-enumerate it here.</p> |
| <p>The research question remains open: <em>can a GRPO-trained LLM learn to pace its own reasoning across a shared-budget episode?</em> The headline Qwen3-14B 8×A100 run (480 episodes, mean reward <strong>+0.4692</strong>, max <strong>+4.52</strong>, <code>env_step_error=0</code>) is the first evidence we have that the answer is yes, but that reading stays provisional until baselines are run against the same checkpoint (designed targets in <a href="#baselines">Baselines</a>). Those baseline runs against the released 14B checkpoint, together with the planned NuminaMath-TIR channel, are the next results on the roadmap.</p> |
| </section> |
|
|
| <div class="footer"> |
| <p>ReasoningEconomicsEnv · AgentX OpenEnv Track · UC Berkeley RDI</p> |
| <p style="margin-top:.5rem;"> |
| <a href="https://github.com/sharma-yash01/ReasoningEconomicsEnv" target="_blank">Environment</a> · |
| <a href="https://github.com/sharma-yash01/ReasoningEconomicsPT" target="_blank">Training client</a> · |
| <a href="https://huggingface.co/spaces/yashu2000/reasoning-economic-env" target="_blank">HF Space</a> · |
| <a href="https://github.com/meta-pytorch/OpenEnv" target="_blank">OpenEnv Framework</a> · |
| <a href="https://huggingface.co/docs/trl/en/openenv" target="_blank">TRL × OpenEnv</a> |
| </p> |
| </div> |
|
|
| </div> |
| </body> |
| </html> |
|
|