<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>SearchEconomicsEnv: A Pandora's-Box RL Environment for Cost-Constrained Agentic Search</title>
<meta name="description" content="SearchEconomicsEnv: an OpenEnv-native benchmark where LLM agents answer multi-hop HotpotQA questions under a hard search-credit budget, with reward shaped by Weitzman's 1979 Pandora's Box model of optimal sequential search. Partnership with Ceramic AI.">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700;800&family=JetBrains+Mono:wght@400;600&display=swap" rel="stylesheet">
<!-- Mermaid for inline diagrams -->
<script type="module">
import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';
mermaid.initialize({
startOnLoad: true,
theme: 'dark',
themeVariables: {
primaryColor: '#6366f1',
primaryTextColor: '#e2e8f0',
primaryBorderColor: '#818cf8',
lineColor: '#818cf8',
secondaryColor: '#1e293b',
tertiaryColor: '#172033',
background: '#0f172a',
mainBkg: '#1e293b',
nodeBorder: '#818cf8',
clusterBkg: '#172033',
clusterBorder: '#334155',
titleColor: '#e2e8f0',
edgeLabelBackground: '#1e293b',
nodeTextColor: '#e2e8f0'
},
flowchart: { curve: 'basis', htmlLabels: true },
fontFamily: 'Inter, sans-serif'
});
</script>
<style>
:root {
--bg: #0f172a; --surface: #1e293b; --surface-2: #172033; --border: #334155;
--text: #e2e8f0; --muted: #94a3b8; --accent: #6366f1;
--accent2: #818cf8; --green: #22c55e; --red: #ef4444;
--orange: #f59e0b; --radius: 12px;
}
* { margin: 0; padding: 0; box-sizing: border-box; }
html { scroll-behavior: smooth; }
body { font-family: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
background: var(--bg); color: var(--text); line-height: 1.7;
-webkit-font-smoothing: antialiased; }
.container { max-width: 820px; margin: 0 auto; padding: 2rem 1.5rem 4rem; }
.topnav { position: sticky; top: 0; z-index: 10; background: rgba(15,23,42,.85);
backdrop-filter: blur(10px); border-bottom: 1px solid var(--border);
padding: .9rem 1.5rem; display: flex; justify-content: space-between;
align-items: center; font-size: .88rem; }
.topnav .brand { font-weight: 700; color: var(--text); text-decoration: none;
display: flex; align-items: center; gap: .5rem; }
.topnav .brand .dot { width: 8px; height: 8px; border-radius: 50%;
background: var(--green); box-shadow: 0 0 8px rgba(34,197,94,.6); }
.topnav .links { display: flex; gap: 1.25rem; }
.topnav .links a { color: var(--muted); text-decoration: none; transition: color .15s; }
.topnav .links a:hover { color: var(--accent2); }
.hero { text-align: center; padding: 4rem 0 2.5rem; }
.hero-badge { display: inline-block; background: rgba(99,102,241,.15); color: var(--accent2);
padding: .4rem 1.1rem; border-radius: 20px; font-size: .78rem; font-weight: 600;
letter-spacing: .08em; margin-bottom: 1.25rem;
border: 1px solid rgba(99,102,241,.3); text-transform: uppercase; }
.hero h1 { font-size: clamp(2rem, 4.2vw, 3.2rem); font-weight: 800; letter-spacing: -.025em;
line-height: 1.15;
background: linear-gradient(135deg, #e2e8f0 25%, #6366f1 100%);
-webkit-background-clip: text; -webkit-text-fill-color: transparent;
background-clip: text; }
.hero .subtitle { color: var(--muted); font-size: 1.15rem; max-width: 680px;
margin: 1rem auto 0; }
.hero .byline { color: var(--muted); font-size: .85rem; margin-top: 1.5rem;
font-style: italic; }
.banner { width: 100%; border-radius: var(--radius); margin: 2rem 0 3rem;
border: 1px solid var(--border); }
.badges { display: flex; justify-content: center; gap: .6rem; flex-wrap: wrap;
margin: 1.5rem 0; }
.badges img { height: 22px; }
.btn-group { display: flex; gap: .75rem; justify-content: center; margin: 2rem 0;
flex-wrap: wrap; }
.btn { display: inline-flex; align-items: center; gap: .45rem; padding: .6rem 1.35rem;
background: var(--accent); color: white; border-radius: 8px; font-size: .88rem;
font-weight: 600; text-decoration: none; transition: all .2s; }
.btn:hover { background: var(--accent2); transform: translateY(-1px); }
.btn-outline { background: transparent; border: 1px solid var(--border); color: var(--text); }
.btn-outline:hover { border-color: var(--accent); color: var(--accent2);
background: rgba(99,102,241,.08); }
.toc { background: var(--surface); border: 1px solid var(--border); border-radius: var(--radius);
padding: 1.25rem 1.5rem; margin: 0 0 2.5rem; }
.toc h3 { font-size: .82rem; font-weight: 700; letter-spacing: .08em; text-transform: uppercase;
color: var(--accent2); margin-bottom: .85rem; }
.toc ol { list-style: none; counter-reset: toc; display: flex; flex-wrap: wrap; gap: .35rem .8rem;
margin: 0; padding: 0; }
.toc ol li { counter-increment: toc; font-size: .88rem; }
.toc ol li::before { content: counter(toc) "."; color: var(--accent); font-weight: 700;
font-size: .8rem; margin-right: .3rem; }
.toc ol li a { color: var(--muted); text-decoration: none; transition: color .15s; }
.toc ol li a:hover { color: var(--accent2); }
section { margin: 3.5rem 0; }
section h2 { font-size: 1.55rem; font-weight: 800; letter-spacing: -.01em;
margin-bottom: 1rem; color: var(--text);
border-left: 3px solid var(--accent); padding-left: .9rem; }
section h3 { font-size: 1.1rem; font-weight: 700; margin: 2rem 0 .75rem;
color: var(--accent2); }
section p { color: #cbd5e1; margin-bottom: 1rem; font-size: 1.02rem; }
section p strong { color: var(--text); }
section ul, section ol { color: #cbd5e1; margin: 1rem 0 1rem 1.5rem; }
section ul li, section ol li { margin-bottom: .5rem; font-size: 1rem; }
section ul li strong, section ol li strong { color: var(--text); }
blockquote { border-left: 3px solid var(--accent2);
background: rgba(99,102,241,.06); padding: 1.1rem 1.25rem;
margin: 1.5rem 0; border-radius: 0 8px 8px 0;
color: #e2e8f0; font-size: 1.02rem; }
.table-wrap { margin: 1.5rem 0; overflow-x: auto;
background: var(--surface); border: 1px solid var(--border);
border-radius: var(--radius); }
table { width: 100%; border-collapse: collapse; font-size: .92rem; }
th { background: rgba(99,102,241,.1); color: var(--accent2);
font-size: .72rem; font-weight: 700; letter-spacing: .06em;
text-transform: uppercase; padding: .85rem 1rem; text-align: left; }
td { padding: .7rem 1rem; border-top: 1px solid var(--border); color: #cbd5e1; }
td.num { text-align: right; font-variant-numeric: tabular-nums;
font-family: 'JetBrains Mono', monospace; font-size: .88rem; }
tr:hover td { background: rgba(99,102,241,.04); }
td strong, th strong { color: var(--text); }
tr.avg-row td { background: rgba(99,102,241,.08); font-weight: 700;
color: var(--text); }
tr.novel td:first-child { color: #fca5a5; }
pre { background: #0b1120; border: 1px solid var(--border);
border-radius: var(--radius); padding: 1.1rem 1.25rem; overflow-x: auto;
margin: 1.25rem 0; font-family: 'JetBrains Mono', monospace;
font-size: .85rem; line-height: 1.6; color: #d1d5db; }
pre .c { color: #64748b; }
code { font-family: 'JetBrains Mono', monospace; font-size: .88em;
background: rgba(99,102,241,.12); color: var(--accent2);
padding: .1em .35em; border-radius: 4px; }
pre code { background: none; color: inherit; padding: 0; font-size: 1em; }
figure { margin: 2rem 0; }
figure img { width: 100%; border-radius: var(--radius);
border: 1px solid var(--border); }
figcaption { text-align: center; color: var(--muted); font-size: .85rem;
margin-top: .75rem; }
.mermaid-wrap { margin: 2rem 0; background: var(--surface); border: 1px solid var(--border);
border-radius: var(--radius); padding: 1.5rem 1rem; overflow-x: auto; }
.mermaid-wrap .mermaid { display: flex; justify-content: center; }
.mermaid-caption { text-align: center; color: var(--muted); font-size: .85rem;
margin-top: .75rem; }
.episode-trace { background: var(--surface); border: 1px solid var(--border);
border-radius: var(--radius); padding: 1.25rem 1.5rem; margin: 1.5rem 0;
position: relative; }
.episode-trace::before { content: ''; position: absolute; left: 1.5rem; top: 2.5rem;
bottom: 1.25rem; width: 2px; background: var(--border); }
.trace-step { position: relative; padding-left: 2rem; margin-bottom: 1.25rem; }
.trace-step:last-child { margin-bottom: 0; }
.trace-step .step-marker { position: absolute; left: -.45rem; top: .2rem; width: 12px;
height: 12px; border-radius: 50%; border: 2px solid var(--accent);
background: var(--bg); z-index: 1; }
.trace-step .step-marker.terminal { background: var(--red); border-color: var(--red); }
.trace-step .step-marker.good { background: var(--green); border-color: var(--green); }
.trace-step .step-label { font-family: 'JetBrains Mono', monospace; font-size: .78rem;
color: var(--accent2); font-weight: 700; margin-bottom: .25rem; }
.trace-step .step-content { font-size: .9rem; color: #cbd5e1; }
.trace-step .step-content code { font-size: .82em; }
.trace-verdict { margin-top: 1rem; padding: .75rem 1rem; border-radius: 8px;
font-size: .9rem; font-weight: 600; }
.trace-verdict.bad { background: rgba(239,68,68,.1); border: 1px solid rgba(239,68,68,.3);
color: #fca5a5; }
.trace-verdict.good { background: rgba(34,197,94,.1); border: 1px solid rgba(34,197,94,.3);
color: #86efac; }
.callout { text-align: center; padding: 2rem 1.5rem; margin: 3rem 0;
background: linear-gradient(135deg, rgba(99,102,241,.08), rgba(129,140,248,.04));
border: 1px solid rgba(99,102,241,.25); border-radius: var(--radius); }
.callout .q { font-size: 1.25rem; font-weight: 700; color: var(--text);
font-style: italic; margin-bottom: .5rem; }
.callout .sub { color: var(--muted); font-size: .95rem; }
.callout.status { background: linear-gradient(135deg, rgba(245,158,11,.08), rgba(245,158,11,.03));
border-color: rgba(245,158,11,.25); margin: 1.5rem 0 2.5rem; }
.callout.status .q { color: var(--orange); }
figure.banner-figure { margin: 2rem 0 3rem; }
figure.banner-figure img { width: 100%; border-radius: var(--radius);
border: 1px solid var(--border); display: block; }
figure.banner-figure figcaption { text-align: center; color: var(--muted);
font-size: .85rem; margin-top: .75rem; }
.ceramic-inline-logo { display: inline-block; vertical-align: middle; margin-left: .85rem; }
.ceramic-inline-logo .ceramic-logo { height: 40px; width: auto; vertical-align: middle;
display: inline-block; }
.ceramic-inline-logo a { line-height: 0; }
.ceramic-inline-logo a:hover .ceramic-logo { filter: brightness(1.08); }
.footer { text-align: center; padding: 3rem 0 1rem; color: var(--muted);
font-size: .85rem; border-top: 1px solid var(--border); margin-top: 3rem; }
.footer a { color: var(--accent2); text-decoration: none; margin: 0 .5rem; }
.footer a:hover { text-decoration: underline; }
@media (max-width: 640px) {
.container { padding: 1rem 1rem 3rem; }
.hero { padding: 2.5rem 0 1.5rem; }
.topnav .links { display: none; }
section h2 { font-size: 1.3rem; }
table { font-size: .82rem; }
th, td { padding: .55rem .6rem; }
.toc ol { flex-direction: column; }
.episode-trace { padding: 1rem; }
.episode-trace::before { left: 1rem; }
}
.math-display {
margin: 1.25rem 0;
padding: 1rem 1.25rem 1.15rem;
overflow-x: auto;
background: var(--surface);
border: 1px solid var(--border);
border-radius: var(--radius);
text-align: center;
}
.math-display mjx-container[jax="CHTML"][display="true"] { margin: 0.65em 0 !important; }
.math-display mjx-container { color: #e2e8f0 !important; }
.math-note { font-size: .9rem; color: var(--muted); margin-top: .35rem; margin-bottom: 0; }
</style>
<script>
window.MathJax = {
tex: {
inlineMath: [['\\(', '\\)']],
displayMath: [['\\[', '\\]']]
},
options: {
renderActions: {
addMenu: [0, '', '']
}
}
};
</script>
<script defer src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" id="MathJax-script"></script>
</head>
<body>
<nav class="topnav">
<a href="#top" class="brand"><span class="dot"></span> SearchEconomicsEnv Blog</a>
<div class="links">
<a href="#hook">Why</a>
<a href="#design">Design</a>
<a href="#reward">Reward</a>
<a href="#traces">Traces</a>
<a href="#engineering">Engineering</a>
<a href="#quickstart">Quick Start</a>
<a href="https://huggingface.co/spaces/yashu2000/search-economics-env" target="_blank">Live Space &#8599;</a>
</div>
</nav>
<div class="container" id="top">
<div class="hero">
<div class="hero-badge">OpenEnv &middot; AgentX Submission</div>
<h1>SearchEconomicsEnv</h1>
<p class="subtitle">A Pandora's-Box RL environment where LLMs learn how <em>often</em> to search, not just how to answer. Multi-hop HotpotQA under a hard search-credit budget, with rewards shaped by Weitzman's 1979 optimal-stopping theorem.</p>
<div class="badges">
<a href="https://github.com/sharma-yash01/SearchEconomicsEnv" target="_blank"><img src="https://img.shields.io/badge/GitHub-Repository-181717?logo=github" alt="GitHub"/></a>
<a href="https://huggingface.co/spaces/yashu2000/search-economics-env" target="_blank"><img src="https://img.shields.io/badge/HF%20Space-Live%20Demo-FFD21E?logo=huggingface&logoColor=black" alt="HF Space"/></a>
<img src="https://img.shields.io/badge/OpenEnv-Native-4B8BBE" alt="OpenEnv"/>
<img src="https://img.shields.io/badge/Dataset-HotpotQA-22c55e" alt="HotpotQA"/>
<img src="https://img.shields.io/badge/Training-GRPO%20(planned)-orange" alt="GRPO"/>
<img src="https://img.shields.io/badge/Partner-Ceramic%20AI-6366f1" alt="Ceramic AI"/>
</div>
<div class="byline">AgentX OpenEnv Track &nbsp;|&nbsp; Yashaswi Sharma (USC) &middot; Defu Cao (USC) &middot; Muyan Weng (USC) &nbsp;|&nbsp; in partnership with <strong>Ceramic AI</strong> (Lucas Han)</div>
</div>
<figure class="banner-figure">
<img src="banner.png" alt="SearchEconomicsEnv: search then observe then commit loop. Agent pays a flat search cost to open a Ceramic box, observes snippet scores and a rolling context window, then commits an answer graded by a Weitzman composite reward."/>
<figcaption>Schematic of the <code>search</code> &rarr; <code>observe</code> &rarr; <code>commit</code> loop. Snippet titles, scores, and the <code>Robert Zemeckis</code> answer shown in the banner are illustrative placeholders for visual clarity, not retrieved Ceramic results or trained-model output.</figcaption>
</figure>
<div class="callout status">
<div class="q">Environment shipped. Trained policy pending.</div>
<div class="sub">The OpenEnv environment, Ceramic integration, baselines, hermetic fallback path, and Docker deploy wiring are complete. <strong>We have no GRPO-trained checkpoint yet: we exhausted our compute budget before a run converged.</strong> Every episode trace, banner snippet, and reward number in this post that is not directly read from <code>EnvConfig</code> defaults is an <em>illustrative target</em>, not measured model output.</div>
</div>
<div class="btn-group">
<a class="btn" href="https://huggingface.co/spaces/yashu2000/search-economics-env" target="_blank">Live Environment Space &rarr;</a>
<a class="btn btn-outline" href="https://github.com/sharma-yash01/SearchEconomicsEnv" target="_blank">GitHub Repo</a>
<a class="btn btn-outline" href="https://rdi.berkeley.edu/agentx-agentbeats.html" target="_blank">AgentX Track</a>
</div>
<!-- Table of Contents -->
<nav class="toc" id="toc">
<h3>Contents</h3>
<ol>
<li><a href="#hook">How Often Should an Agent Search?</a></li>
<li><a href="#matters">Why This Benchmark Matters</a></li>
<li><a href="#prior-work">Prior Work &amp; Novelty</a></li>
<li><a href="#design">What SearchEconomicsEnv Is</a></li>
<li><a href="#env-design">Environment Design &amp; Schemas</a></li>
<li><a href="#reward">Reward: Weitzman's Pandora's Box in Math</a></li>
<li><a href="#agent-policies">How the Agent Interacts with Ceramic</a></li>
<li><a href="#ceramic-partnership">The Ceramic AI Partnership</a></li>
<li><a href="#architecture">Architecture</a></li>
<li><a href="#baselines">Baselines &amp; Expected Behaviors</a></li>
<li><a href="#traces">Episode Traces: Failure vs Ideal</a></li>
<li><a href="#engineering">Engineering Lesson: The Pydantic Zero-Results Bug</a></li>
<li><a href="#risks">Risk Register</a></li>
<li><a href="#foundations">Foundations &amp; Citations</a></li>
<li><a href="#quickstart">Quick Start</a></li>
<li><a href="#future">Future Work</a></li>
</ol>
</nav>
<!-- 1. HOOK -->
<section id="hook">
<h2>How Often Should an Agent Search?</h2>
<p>Modern LLM agents are taught to <strong>use tools</strong> but almost never taught <em>how often</em> to use them. Retrieval-augmented agents typically either search on every turn or never search at all, and in both cases the search itself is treated as free. <strong>SearchEconomicsEnv</strong> changes that. It is an OpenEnv-compliant RL environment where an LLM answers multi-hop HotpotQA questions under a hard <strong>search-credit budget</strong>, and each Ceramic search costs real reward.</p>
<p>The reward shape is lifted directly from Martin Weitzman's 1979 <em>Pandora's Box</em> model of optimal sequential search: open a box (issue a search) for cost &beta;, or stop and commit. The research question is whether GRPO-trained LLMs <strong>rediscover</strong> Weitzman's optimal threshold rule from the raw reward signal alone, without ever being told what optimal stopping is. The environment is fully Dockerised, OpenEnv-validated, and Ceramic-AI-integrated.</p>
<blockquote><em>Every Ceramic search is one Pandora's box: pay &beta;, see what is inside, decide whether to keep opening.</em></blockquote>
</section>
<!-- 2. WHY IT MATTERS -->
<section id="matters">
<h2>Why This Benchmark Matters</h2>
<p>Almost every public RAG benchmark (Natural Questions, HotpotQA, TriviaQA, MS MARCO) ranks systems by <strong>retrieval recall</strong> or <strong>answer F1</strong>. None of them measure the marginal value of one more search call. Production RAG is the opposite: every additional Pinecone, Weaviate, Brave, or Ceramic call has measurable latency and dollar cost, and <em>&quot;should I search again or just answer?&quot;</em> is the live question every agent framework has to answer inside inference loops like LangChain agents, OpenAI function-calling, and Claude tool use.</p>
<p>Treating that decision as a <strong>first-class learning signal</strong> has been studied under names like <em>budgeted MDPs</em>, <em>cost-aware RL</em>, and <em>information-purchasing agents</em>, but always with synthetic environments: gridworlds, token games, mock APIs. To our knowledge there is <strong>no</strong> publicly released RL environment that connects an LLM agent to a real, live, vendor-graded search API as the only information channel; shapes reward according to a <em>theoretically optimal</em> sequential-search policy (Weitzman's Pandora's Box); and ships in OpenEnv format so any GRPO, PPO, or DPO trainer can plug in. <code>SearchEconomicsEnv</code> is that environment.</p>
<p>Weitzman (1979) studies an agent who must decide which of N boxes to open (opening box <em>i</em> costs <code>c_i</code> and reveals a stochastic prize <code>X_i</code>) before committing. The optimal policy is shockingly clean: compute a <strong>reservation value</strong> <em>z_i</em> for each box such that <code>E[max(X_i, z_i)] &minus; c_i = z_i</code>, open boxes in decreasing order of <em>z_i</em>, and stop the first time the best observed prize exceeds the highest unopened reservation value. It is one of the few sequential-search problems with a closed-form optimal policy, and the mapping to our environment is direct:</p>
<div class="table-wrap">
<table>
<thead><tr><th>Pandora's Box</th><th>SearchEconomicsEnv</th></tr></thead>
<tbody>
<tr><td>Box</td><td>One Ceramic search call</td></tr>
<tr><td>Cost of opening</td><td><code>&minus;&beta;</code> (negative reward per search)</td></tr>
<tr><td>Prize</td><td>Information gain about the answer</td></tr>
<tr><td>Commit decision</td><td><code>commit</code> action with final answer</td></tr>
<tr><td>Reservation value</td><td>Implicitly learned by the policy</td></tr>
</tbody>
</table>
</div>
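<p>To make the reservation-value rule concrete, here is a minimal sketch for boxes with discrete prize distributions. It is an illustration of Weitzman's rule, not code from the environment: the helper names (<code>expected_surplus</code>, <code>reservation_value</code>, <code>pandora_rule</code>), the bisection solver, and the zero-valued outside option are all assumptions made for the example.</p>

```python
def expected_surplus(prizes, probs, z):
    """E[max(X - z, 0)] for a discrete prize distribution."""
    return sum(p * max(x - z, 0.0) for x, p in zip(prizes, probs))


def reservation_value(prizes, probs, cost, tol=1e-9):
    """Solve E[max(X, z)] - cost = z, i.e. E[max(X - z, 0)] = cost, by bisection."""
    # At lo the surplus is at least `cost`; at hi it is zero. It decreases in z.
    lo, hi = min(prizes) - cost, max(prizes)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_surplus(prizes, probs, mid) > cost:
            lo = mid  # surplus still exceeds the cost, so z lies higher
        else:
            hi = mid
    return (lo + hi) / 2.0


def pandora_rule(boxes, draw):
    """Open boxes in decreasing reservation value; stop once the best prize beats the rest.

    boxes: list of (name, prizes, probs, cost) tuples.
    draw:  callable returning the realised prize for a box name.
    Assumes an outside option worth 0.0 before any box is opened.
    """
    ranked = sorted(
        ((reservation_value(p, pr, c), name) for name, p, pr, c in boxes),
        reverse=True,
    )
    best, opened = 0.0, []
    for z, name in ranked:
        if best >= z:  # best observed prize already beats every unopened box
            break
        best = max(best, draw(name))
        opened.append(name)
    return best, opened
```

<p>For a box that pays 1 with probability 0.5 and 0 otherwise, at cost 0.1, the condition reduces to <code>0.5 &middot; (1 &minus; z) = 0.1</code>, so the solver should return a reservation value close to 0.8.</p>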
</section>
<!-- 3. PRIOR WORK & NOVELTY -->
<section id="prior-work">
<h2>Prior Work &amp; Novelty</h2>
<p>Most &quot;LLM + retrieval&quot; and &quot;LLM + tool use&quot; work lands in one of six buckets. None of them occupies the cell we target:</p>
<div class="table-wrap">
<table>
<thead><tr><th>Existing work</th><th>What it does</th><th>What it lacks vs. SearchEconomicsEnv</th></tr></thead>
<tbody>
<tr><td><strong>HotpotQA leaderboard</strong><br><span style="font-size:.85em;color:var(--muted)">Yang et al., <a href="https://arxiv.org/abs/1809.09600" target="_blank" style="color:var(--accent2)">arXiv:1809.09600</a>, EMNLP 2018</span></td><td>Ranks systems by EM and F1 on a fixed retrieval pipeline</td><td>No notion of search cost, no agent-controlled retrieval, no RL signal</td></tr>
<tr><td><strong>LangChain ReAct agents</strong></td><td>Multi-step tool use including search</td><td>No reward, no learning, no cost shaping</td></tr>
<tr><td><strong>WebGPT</strong><br><span style="font-size:.85em;color:var(--muted)">Nakano et al., <a href="https://arxiv.org/abs/2112.09332" target="_blank" style="color:var(--accent2)">arXiv:2112.09332</a>, 2021</span></td><td>RL-trained browsing agent on Bing</td><td>Closed-source, no public env, no theoretical reward grounding</td></tr>
<tr><td><strong>Toolformer</strong><br><span style="font-size:.85em;color:var(--muted)">Schick et al., <a href="https://arxiv.org/abs/2302.04761" target="_blank" style="color:var(--accent2)">arXiv:2302.04761</a>, 2023</span></td><td>LLM learns when to call tools via self-supervision</td><td>Self-supervised, no per-call cost, no MDP</td></tr>
<tr><td><strong>BoxRL / Pandora gym envs</strong></td><td>Toy Pandora's Box implementations</td><td>No LLM, no real API, no language at all</td></tr>
<tr><td><strong>ReasoningEconomicsEnv</strong> (our sibling)</td><td>Token-budget RL on math problems via MetaMathQA + SymPy</td><td>Single-step decision, no real external action, no theoretical optimum</td></tr>
<tr class="novel"><td><strong>SearchEconomicsEnv (ours)</strong></td><td>Multi-step MDP, structured search/commit actions, live Ceramic API, Weitzman-shaped composite reward, OpenEnv-compliant</td><td>No trained policy yet (<a href="#future">future work</a>)</td></tr>
</tbody>
</table>
</div>
<blockquote>To our knowledge, no publicly released RL environment combines all three of: a <strong>theoretically optimal</strong> reward formalism (Weitzman's Pandora's Box), a <strong>live vendor search API</strong> as the only information channel (Ceramic AI), and the <strong>OpenEnv</strong> contract so any GRPO, PPO, or DPO trainer can plug in. The environment, action schema, and reward semantics are new; the method (RLVR / GRPO on verifiable signals) is shared with the Recon and negotiation-RLVR lineages.</blockquote>
<p>This project is the <strong>second</strong> environment in a research line on <em>economic</em> constraints for LLM agents. The first, <a href="#foundations"><code>ReasoningEconomicsEnv</code></a>, budgets reasoning <em>tokens</em> across a batch of math problems. SearchEconomicsEnv replaces token allocation with sequential search-or-commit decisions, MetaMathQA with HotpotQA, and local DeepSeek inference with the live Ceramic Search API. The conceptual shift is from a one-step budgeted regression to a genuine stopping-problem MDP.</p>
</section>
<!-- 4. WHAT IT IS -->
<section id="design">
<h2>What SearchEconomicsEnv Is</h2>
<blockquote>An OpenEnv-native MDP where an LLM agent decides, at every step, whether to issue another Ceramic search or commit a final answer. A shared search budget and per-step cost turn the problem into a real sequential stopping problem, not a bandit.</blockquote>
<p>Each episode plays out like this:</p>
<ul>
<li>The environment samples a stratified batch of HotpotQA questions (default 10, difficulty mix <code>{easy: 0.3, medium: 0.4, hard: 0.3}</code>).</li>
<li>At reset, every question is batch-encoded to a 384-dim embedding via <code>sentence-transformers/all-MiniLM-L6-v2</code> (or a deterministic hash encoder as fallback).</li>
<li>On each step, the agent emits a structured <code>search</code> or <code>commit</code> action. A search fires a Ceramic call, returns snippets and their scores, and pays <code>&minus;&beta;</code>. A commit grades the answer with EM and token-F1, pays the composite commit reward, and advances to the next question.</li>
<li>Searches draw from a <strong>pooled</strong> budget of <code>int(search_budget_ratio &times; num_questions)</code> credits (default 30 searches for 10 questions), so frugality on easy questions buys effort on hard ones.</li>
<li>Hitting the per-question cap (default 5) force-commits the current question as wrong. Exhausting the pooled budget does the same and also force-commits every remaining question. <code>done=True</code> is therefore reachable in two ways: all questions committed, or budget burned.</li>
</ul>
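<p>The pooled-budget arithmetic in the list above can be sketched as a small ledger. This is a toy stand-in, not the environment's own bookkeeping: the class name <code>BudgetLedger</code> and its methods are hypothetical, and the ratio of 3.0 is inferred from the stated default of 30 credits for 10 questions.</p>

```python
class BudgetLedger:
    """Toy bookkeeping for the pooled search budget (hypothetical helper)."""

    def __init__(self, num_questions=10, search_budget_ratio=3.0,
                 max_searches_per_question=5):
        # Pooled credits shared by every question in the episode.
        self.initial = int(search_budget_ratio * num_questions)
        self.remaining = self.initial
        self.cap = max_searches_per_question
        self.used_this_question = 0

    def can_search(self):
        """True while pooled credits remain and the per-question cap is not hit."""
        return self.remaining > 0 and self.cap > self.used_this_question

    def search(self):
        if not self.can_search():
            raise RuntimeError("no credits left or per-question cap reached")
        self.remaining -= 1
        self.used_this_question += 1

    def commit(self):
        """Move to the next question; the per-question counter resets."""
        self.used_this_question = 0

    @property
    def budget_remaining_ratio(self):
        return self.remaining / self.initial


# Spend little on three questions and the cap on one hard question.
demo = BudgetLedger()
for searches in [1, 1, 5, 2]:
    for _ in range(searches):
        demo.search()
    demo.commit()
print(demo.remaining, round(demo.budget_remaining_ratio, 2))  # 21 0.7
```

<p>The credits saved on the cheap questions stay in the pool, which is what lets a frugal policy afford the five-search question without running dry.</p>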
<p>The conceptual diff from the sibling reasoning environment is the cleanest summary of what changed:</p>
<div class="table-wrap">
<table>
<thead><tr><th>Aspect</th><th>ReasoningEconomicsEnv</th><th>SearchEconomicsEnv</th></tr></thead>
<tbody>
<tr><td><strong>Agent decision</strong></td><td>How many tokens to allocate</td><td>When to search vs. commit</td></tr>
<tr><td><strong>Episode MDP</strong></td><td>1 <code>step()</code> per question</td><td>N <code>step()</code>s per question</td></tr>
<tr><td><strong>Action</strong></td><td><code>response: str</code></td><td><code>action_type</code> + <code>query</code> or <code>answer</code></td></tr>
<tr><td><strong>Budget unit</strong></td><td>Tokens (10&ndash;800)</td><td>Searches (pooled, ~3&times; <code>num_questions</code>)</td></tr>
<tr><td><strong>Info source</strong></td><td>Local DeepSeek inference</td><td>Ceramic Search API (+ fallback)</td></tr>
<tr><td><strong>Dataset</strong></td><td>MetaMathQA + NuminaMath-TIR</td><td>HotpotQA (difficulty-stratified)</td></tr>
<tr><td><strong>Grading</strong></td><td>SymPy symbolic + numeric</td><td>EM + token-F1, robust answer extraction</td></tr>
<tr><td><strong>Reward shape</strong></td><td>Correctness &plusmn; token cost</td><td>Per-step search cost + composite commit reward</td></tr>
</tbody>
</table>
</div>
<p>Two of those rows do most of the conceptual work. First, <em>1 step() to N step()s per question</em> turns a one-shot allocation into a genuine sequential-decision MDP. Second, <em>DeepSeek to Ceramic</em> replaces a local model that <strong>generates</strong> solutions with a live retrieval API that <strong>returns documents</strong>; the agent now has to formulate its own answer from snippets, and the environment never produces an answer for it.</p>
</section>
<!-- 5. ENVIRONMENT DESIGN -->
<section id="env-design">
<h2>Environment Design &amp; Schemas</h2>
<p>The core contract is two Pydantic types exchanged over the OpenEnv WebSocket:</p>
<pre><code>from typing import Literal

from pydantic import BaseModel

<span class="c"># One search hit; field types are assumed from the title/url/description/score note below</span>
class SearchResult(BaseModel):
    title: str
    url: str
    description: str
    score: float

<span class="c"># Action (agent &rarr; env)</span>
class SearchEconAction(BaseModel):
    action_type: Literal["search", "commit"]
    query: str | None = None   <span class="c"># required when action_type == "search"</span>
    answer: str | None = None  <span class="c"># required when action_type == "commit"</span>

<span class="c"># Observation (env &rarr; agent)</span>
class SearchEconObservation(BaseModel):
    <span class="c"># Question</span>
    question: str
    question_embedding: list[float]   <span class="c"># 384-dim</span>
    question_idx: int
    question_done: bool
    <span class="c"># Budget</span>
    searches_remaining: int
    searches_used_this_question: int
    max_searches_per_question: int
    budget_remaining_ratio: float
    <span class="c"># Last search results (empty on reset / after commit)</span>
    ceramic_results: list[SearchResult]   <span class="c"># title/url/description/score</span>
    top_score: float
    score_variance: float
    search_latency_s: float
    <span class="c"># Accumulated context for the current question</span>
    context_window: list[str]   <span class="c"># max 5 snippets, 300 chars each</span>
    <span class="c"># Episode tracking</span>
    step_idx: int
    questions_remaining: int
    accuracy_so_far: float
    history: list[dict]   <span class="c"># per-commit EM / F1 / quality / mode</span>
    <span class="c"># Plumbing</span>
    done: bool
    reward: float | None
    metadata: dict</code></pre>
<p>The action is intentionally schema-strict. <code>model_post_init</code> raises <code>ValueError</code> if the field required by the chosen <code>action_type</code> is unset; a malformed action from the trainer's JSON parser falls through to a <code>commit</code> with an empty answer (guaranteed wrong, charges no extra search). This is what lets a GRPO trainer treat the LLM's output as a <em>parsable</em> signal and penalise malformed JSON through the ordinary reward, without any special-case logic.</p>
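<p>The fall-through behaviour described above might look like the following on the trainer side. This is a sketch under assumptions, not the project's parser: <code>parse_action</code> and <code>FALLBACK</code> are hypothetical names, and plain dictionaries stand in for the Pydantic model so the example needs only the standard library.</p>

```python
import json

# Assumed fallback: a commit with an empty answer, which is graded as wrong
# and spends no search credit.
FALLBACK = {"action_type": "commit", "answer": ""}


def parse_action(raw):
    """Turn raw model text into a search/commit dict, or the fallback commit."""
    try:
        obj = json.loads(raw)
    except (TypeError, ValueError):  # not a string, or not valid JSON
        return dict(FALLBACK)
    if not isinstance(obj, dict):
        return dict(FALLBACK)
    kind = obj.get("action_type")
    query, answer = obj.get("query"), obj.get("answer")
    if kind == "search" and isinstance(query, str) and query.strip():
        return {"action_type": "search", "query": query}
    if kind == "commit" and isinstance(answer, str):
        return {"action_type": "commit", "answer": answer}
    # Wrong action_type, or the field it requires is missing.
    return dict(FALLBACK)
```

<p>Because every failure mode collapses into the same wrong commit, the trainer needs no separate penalty term for formatting errors.</p>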
<p>The observation is rich because the agent needs both the question content <em>and</em> the search-state context to make a Weitzman-rational decision. The schema is deliberately large so that the same observation drives both a learned RL policy and a hand-coded threshold baseline.</p>
<h3>How the grader turns a raw model string into a reward</h3>
<p>Training an LLM with GRPO means the environment receives text, not a dataclass. Extraction has to be tolerant of reasoning traces, structured JSON, or bare answers, and it has to be deterministic and fast enough to run inside a reward function.</p>
<ol>
<li><strong>Extraction ladder.</strong> The grader in <code>env/answer_grading.py</code> tries, in order: (a) strip markdown code fences (<code>```json</code>, <code>```</code>) and retry; (b) parse the remainder as JSON and read <code>obj[&quot;answer&quot;]</code>, which also covers payloads where <code>obj[&quot;type&quot;] == &quot;commit&quot;</code>; (c) take the first line matching <code>^Answer:</code> or <code>^Final answer:</code> (case-insensitive); (d) take the last non-empty line of the raw text. Anything that still fails to yield a non-empty string falls through to a wrong-commit with <code>q = 0</code>.</li>
<li><strong>Direct string comparison against HotpotQA gold.</strong> The extracted answer is normalised (lowercase, articles stripped, punctuation stripped, whitespace collapsed) and compared to the <code>answer</code> field of the sampled HotpotQA row. <strong>Exact Match</strong> is exact normalised equality. <strong>Token-F1</strong> is the harmonic mean of precision and recall over the multiset overlap of those same normalised tokens. There is no semantic scorer, no embedding similarity, and no LLM judge in the default grader, by design: GRPO throughput cannot tolerate LLM-judge latency in the inner loop.</li>
<li><strong>LLM-as-judge is a planned v2.</strong> A future release will add an optional LLM-as-judge grading path behind a config flag, for evaluation runs that want to catch cases where the model's answer is factually equivalent to the gold string but not string-equivalent (<code>Zemeckis</code> vs <code>Robert Zemeckis</code>, <code>US</code> vs <code>United States</code>). v1 ships with deterministic EM + token-F1 only, which is the right default for a training-loop reward function but is known to under-credit semantically correct answers; the composite reward's partial-F1 term exists precisely to soften this.</li>
</ol>
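<p>The ladder and the string comparison above can be sketched in a few dozen lines. This is an approximation written for this post, not the contents of <code>env/answer_grading.py</code>: every function name and regular expression below is an assumption, and the JSON step reads only <code>obj["answer"]</code>.</p>

```python
import json
import re
import string
from collections import Counter

FENCE = re.compile("`" * 3 + r"(?:json)?")  # markdown code-fence markers
ANSWER_LINE = re.compile(r"^(?:final answer|answer):\s*(.+)$",
                         re.IGNORECASE | re.MULTILINE)
ARTICLES = re.compile(r"\b(a|an|the)\b")
PUNCT = set(string.punctuation)


def extract_answer(raw):
    """Fences, then JSON, then an 'Answer:' line, then the last non-empty line."""
    text = FENCE.sub("", raw).strip()
    try:
        obj = json.loads(text)
        if isinstance(obj, dict) and isinstance(obj.get("answer"), str):
            return obj["answer"].strip()
    except ValueError:
        pass  # not JSON, keep descending the ladder
    match = ANSWER_LINE.search(text)
    if match:
        return match.group(1).strip()
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def normalize(s):
    """Lowercase, strip articles and punctuation, collapse whitespace."""
    s = ARTICLES.sub(" ", s.lower())
    s = "".join(ch for ch in s if ch not in PUNCT)
    return " ".join(s.split())


def token_f1(pred, gold):
    """Harmonic mean of precision and recall over shared normalised tokens."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def quality(raw, gold):
    """q = 1.0 on exact match, else token-F1; 0.0 if nothing was extracted."""
    pred = extract_answer(raw)
    if not pred:
        return 0.0
    if normalize(pred) == normalize(gold):
        return 1.0
    return token_f1(pred, gold)
```

<p>Under this sketch, a commit of <code>Zemeckis</code> against the gold string <code>Robert Zemeckis</code> misses exact match but keeps partial credit: one shared token gives precision 1 and recall 0.5, so token-F1 is about 0.67.</p>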
<div class="episode-trace">
<div class="trace-step">
<div class="step-marker"></div>
<div class="step-label">reset(seed=42)</div>
<div class="step-content">Returns observation for question 0. 10 HotpotQA questions loaded, 30 search credits, <code>context_window = []</code>.</div>
</div>
<div class="trace-step">
<div class="step-marker"></div>
<div class="step-label">step(search, query="...")</div>
<div class="step-content">Ceramic call &rarr; observation with snippets, <code>top_score</code>, <code>score_variance</code>. Reward = <code>&minus;&beta;</code>.</div>
</div>
<div class="trace-step">
<div class="step-marker"></div>
<div class="step-label">step(search, query="...")</div>
<div class="step-content">Another call. <code>context_window</code> now holds 2 snippets. <code>searches_remaining</code> decrements.</div>
</div>
<div class="trace-step">
<div class="step-marker good"></div>
<div class="step-label">step(commit, answer="...")</div>
<div class="step-content">Grade with EM + token-F1 &rarr; composite commit reward. <code>question_idx</code> advances.</div>
</div>
<div class="trace-step">
<div class="step-marker"></div>
<div class="step-label">&hellip; continue until &hellip;</div>
<div class="step-content">All questions committed <strong>or</strong> shared budget exhausted. <code>done=True</code> on the terminal step.</div>
</div>
</div>
</section>
<!-- 6. REWARD MATH -->
<section id="reward">
<h2>Reward: Weitzman's Pandora's Box in Math</h2>
<p>The per-step reward has two modes. Every <code>search</code> pays a flat cost:</p>
<div class="math-display" aria-label="Per-step search cost">
\[
R_t^{\text{search}} = -\beta
\]
</div>
<p class="math-note">Default <code>&beta; = 0.1</code>. Every search costs the same, regardless of whether it returned useful information. This matches Weitzman's assumption that opening costs are paid up-front.</p>
<p>Every <code>commit</code> pays the <strong>composite commit reward</strong>:</p>
<div class="math-display" aria-label="Composite commit reward">
\[
R_t^{\text{commit}}
= R_{\text{wrong}}
+ q \cdot (R_{\text{right}} - R_{\text{wrong}})
+ \eta \cdot \gamma \cdot \frac{B_t}{B_0}
\]
</div>
<p class="math-note">
<code>q &isin; [0, 1]</code> is the grader quality (<code>1.0</code> if exact match, else token-F1);
<code>&eta; = 1</code> iff <code>q &ge; q_min</code> (default <code>q_min = 1.0</code>, so only full-EM commits earn the efficiency bonus);
<code>B_t / B_0</code> is the fraction of search credits remaining;
and <code>&gamma; = 0.1</code> is the efficiency bonus weight.
</p>
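<p>As a sanity check on the formula, here is a small sketch of the two reward modes with the documented defaults. The function names and signatures are illustrative assumptions, not the environment's API.</p>

```python
def search_reward(beta: float = 0.1) -> float:
    """Flat per-search cost, paid regardless of what the search returns."""
    return -beta


def commit_reward(
    q: float,
    budget_remaining: int,
    budget_total: int,
    r_right: float = 1.0,
    r_wrong: float = -0.1,
    gamma: float = 0.1,
    q_min: float = 1.0,
) -> float:
    """Composite commit reward: R_wrong + q*(R_right - R_wrong) + eta*gamma*B_t/B_0."""
    # eta gates the efficiency bonus: 1 only when quality reaches q_min.
    eta = 1.0 if q >= q_min else 0.0
    bonus = eta * gamma * budget_remaining / budget_total
    return r_wrong + q * (r_right - r_wrong) + bonus
```

<p>With these defaults, an exact-match commit with 28 of 30 credits left earns about 1.093, while a partial answer with token-F1 of 0.6 earns 0.56 and no efficiency bonus.</p>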
<p>Setting the expected marginal benefit of one more search equal to its certain cost gives an <strong>indifference threshold</strong>. Each additional search improves quality by <code>&Delta;q</code>, worth <code>&Delta;q &middot; (R_right &minus; R_wrong) = 1.1 &middot; &Delta;q</code> in expected reward. It costs <code>&beta; = 0.1</code> directly, plus a lost efficiency bonus of <code>&gamma; / B_0 &asymp; 0.003</code> for <code>B_0 = 30</code>. Solving gives:</p>
<div class="math-display" aria-label="Reservation value / indifference threshold">
\[
\Delta q^{*}
= \frac{\beta + \gamma / B_0}{R_{\text{right}} - R_{\text{wrong}}}
\;\approx\; \frac{0.1 + 0.003}{1.1}
\;\approx\; 0.094
\]
</div>
<p class="math-note">A rational agent should keep searching whenever it expects to improve quality by about <strong>9 percentage points</strong> or more. This is the Weitzman reservation value, made concrete on our defaults.</p>
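<p>The threshold is easy to recompute for other settings. The helper below is a sketch with hypothetical names; it only restates the formula above and assumes the efficiency bonus is in play (<code>&eta; = 1</code>).</p>

```python
def reservation_threshold(
    beta: float = 0.1,
    gamma: float = 0.1,
    budget_total: int = 30,
    r_right: float = 1.0,
    r_wrong: float = -0.1,
) -> float:
    """Minimum expected quality gain that justifies one more search."""
    marginal_cost = beta + gamma / budget_total  # direct cost + lost bonus
    return marginal_cost / (r_right - r_wrong)


def should_search(expected_gain: float, **reward_params: float) -> bool:
    """Search only if the expected gain in q meets the threshold."""
    return expected_gain >= reservation_threshold(**reward_params)
```

<p>On the defaults <code>reservation_threshold()</code> is about 0.094; doubling <code>beta</code> to 0.2 raises it to about 0.185, so a costlier search demands roughly twice the expected improvement.</p>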
<p>A <strong>legacy binary</strong> grading mode (<code>commit_reward_mode="legacy_binary"</code>) restores a strict normalised-equality match without answer extraction, for ablations that isolate how much of the agent's improvement comes from partial-credit shaping versus genuine accuracy gains.</p>
<h3><code>EnvConfig</code> defaults (full)</h3>
<p>The complete set of knobs exposed by <code>SearchEconomicsEnv/env/config.py</code>, with the values the environment ships with. These are the numbers the abstract constants above resolve to.</p>
<div class="table-wrap">
<table>
<thead><tr><th>Field</th><th>Default</th><th>Notes</th></tr></thead>
<tbody>
<tr><td><code>num_questions</code></td><td class="num">10</td><td>Questions per episode; shared-budget denominator</td></tr>
<tr><td><code>max_searches_per_question</code></td><td class="num">5</td><td>Hard cap before force-commit on a single question</td></tr>
<tr><td><code>search_budget_ratio</code></td><td class="num">3.0</td><td>Pooled budget <code>B_0 = ratio &times; num_questions</code> (default 30)</td></tr>
<tr><td><code>use_shared_budget</code></td><td class="num">True</td><td>Frugality on easy questions funds hard ones</td></tr>
<tr><td><code>dataset</code></td><td><code>hotpotqa</code></td><td>HuggingFace dataset identifier</td></tr>
<tr><td><code>dataset_split</code></td><td><code>train</code></td><td>Split loaded on <code>reset()</code></td></tr>
<tr><td><code>difficulty_mix</code></td><td><code>{easy:0.3, medium:0.4, hard:0.3}</code></td><td>Stratified sampling proportions per episode</td></tr>
<tr><td><code>ceramic_api_key</code></td><td><code>&quot;&quot;</code></td><td>Empty string activates the deterministic fallback client</td></tr>
<tr><td><code>ceramic_timeout_s</code></td><td class="num">10.0</td><td>Per-request timeout for live Ceramic calls</td></tr>
<tr><td><code>max_results_per_search</code></td><td class="num">10</td><td>Snippets returned per successful search</td></tr>
<tr><td><code>beta</code></td><td class="num">0.1</td><td>Flat per-search cost</td></tr>
<tr><td><code>gamma</code></td><td class="num">0.1</td><td>Efficiency-bonus weight on <code>B_t / B_0</code></td></tr>
<tr><td><code>correct_reward</code></td><td class="num">+1.0</td><td><code>R_right</code> in the commit reward</td></tr>
<tr><td><code>incorrect_reward</code></td><td class="num">&minus;0.1</td><td><code>R_wrong</code> floor</td></tr>
<tr><td><code>grade_count_correct_mode</code></td><td><code>em_only</code></td><td>Criterion for the <em>correct count</em> metric (not the reward)</td></tr>
<tr><td><code>f1_count_threshold</code></td><td class="num">0.85</td><td>Token-F1 threshold when <code>grade_count_correct_mode</code> is permissive</td></tr>
<tr><td><code>commit_reward_mode</code></td><td><code>composite</code></td><td><code>composite</code> uses the shaped formula; <code>legacy_binary</code> for ablations</td></tr>
<tr><td><code>efficiency_bonus_min_quality</code></td><td class="num">1.0</td><td><code>q_min</code>; only full-EM commits earn the bonus</td></tr>
<tr><td><code>partial_reward_scale</code></td><td class="num">1.0</td><td>Multiplier on the <code>q &middot; (R_right - R_wrong)</code> term</td></tr>
<tr><td><code>embedding_model</code></td><td><code>all-MiniLM-L6-v2</code></td><td>Sentence-Transformers model for top-score computation</td></tr>
<tr><td><code>max_context_snippets</code></td><td class="num">5</td><td>Rolling context window size across searches</td></tr>
<tr><td><code>snippet_max_chars</code></td><td class="num">300</td><td>Character truncation per snippet before appending</td></tr>
<tr><td><code>seed</code></td><td><code>None</code></td><td>Episode-level RNG seed; <code>None</code> = environment-chosen</td></tr>
</tbody>
</table>
</div>
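<p>To make the budget arithmetic in the table concrete, the sketch below mirrors three of the fields in a small dataclass. It is an illustrative stand-in, not the real <code>EnvConfig</code>: the class name, the property names, and the use of <code>int()</code> for rounding are assumptions.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetConfig:
    """Hypothetical subset of the configuration that determines the budget."""

    num_questions: int = 10
    max_searches_per_question: int = 5
    search_budget_ratio: float = 3.0

    @property
    def pooled_budget(self) -> int:
        """B_0 = ratio x num_questions (rounding via int() is an assumption)."""
        return int(self.search_budget_ratio * self.num_questions)

    @property
    def max_fully_searched_questions(self) -> int:
        """How many questions can hit the per-question cap before B_0 runs out."""
        return self.pooled_budget // self.max_searches_per_question


cfg = BudgetConfig()
print(cfg.pooled_budget, cfg.max_fully_searched_questions)
```

<p>On the defaults the pooled budget is 30 credits, so only 6 of the 10 questions can be searched to the per-question cap; this is why frugality on easy questions matters.</p>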
</section>
<!-- 6b. AGENT POLICIES -->
<section id="agent-policies">
<h2>How the Agent Interacts with Ceramic: Dual Policies under one Reward</h2>
<p>A single <code>SearchEconAction</code> looks flat on the wire: a discriminator plus one string. Structurally, though, the agent is learning <strong>two</strong> policies that share a single scalar objective. Understanding the decomposition is the clearest way to see what the environment measures.</p>
<h3>Two policies, one objective</h3>
<p>On every observation the agent chooses between two decisions: (a) formulate a <em>query</em> and issue another search, or (b) <em>commit</em> an answer and move on. That dual choice is encoded as one <code>SearchEconAction</code> with an <code>action_type</code> discriminator, but it decomposes cleanly into two learned sub-policies riding inside the same LLM weights.</p>
<h3>The stopping policy: <code>search</code> vs <code>commit</code></h3>
<p>Deciding when to stop searching is the classic Weitzman reservation problem. The agent has to learn when the expected marginal quality gain <code>&Delta;q</code> from another search falls below the reservation threshold <code>&Delta;q* = (&beta; + &gamma; / B_0) / (R_right &minus; R_wrong) &asymp; 0.094</code> on the defaults. It gets no direct feature telling it &quot;you know enough to answer&quot;; it has to infer that from <code>top_score</code>, <code>score_variance</code>, what is already sitting in the <code>context_window</code>, the remaining shared budget <code>budget_remaining_ratio</code>, and its own internal representation of the question.</p>
<h3>The search-formulation policy: the content of <code>query</code></h3>
<p>Conditional on choosing to search, the agent also picks the <em>query string</em>. A good policy learns two related sub-behaviours: (i) <strong>query compression</strong>, emitting a bridging entity or a short factual sub-question rather than a verbatim copy of the original multi-hop question; and (ii) <strong>multi-hop decomposition</strong>, using the result of turn <em>N</em> to shape the query at turn <em>N+1</em>. Both are measurable from episode logs at evaluation time: average query token count, semantic similarity between consecutive queries within an episode, and the distribution of unique bridging entities per question.</p>
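<p>Two of these log-derived metrics can be sketched without any model dependency. The code below is an assumption-laden illustration: it counts whitespace tokens rather than LLM tokens, and it uses Jaccard token overlap as a cheap stand-in for the semantic similarity mentioned above.</p>

```python
def mean_query_length(queries: list[str]) -> float:
    """Average number of whitespace-separated tokens per query."""
    return sum(len(q.split()) for q in queries) / len(queries)


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; a lexical proxy for semantic similarity."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a.union(tokens_b)
    if not union:
        return 1.0
    return len(tokens_a.intersection(tokens_b)) / len(union)


def consecutive_similarity(queries: list[str]) -> list[float]:
    """Similarity between each query and the one issued right after it."""
    return [jaccard(a, b) for a, b in zip(queries, queries[1:])]
```

<p>A policy that repeats the same query scores 1.0 on every consecutive pair, whereas a decomposing policy should produce shorter queries with only moderate overlap between turns.</p>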
<h3>One scalar objective</h3>
<p>Both sub-policies share the same objective: maximise expected episode reward. The reward function makes the trade-off concrete. Every extra search costs <code>&beta;</code>; each unused search credit at a correct (full-EM) commit adds <code>&gamma; / B_0</code> to the efficiency bonus <code>&gamma; &middot; B_t / B_0</code>; and a correct commit pays <code>R_right &minus; R_wrong = 1.1</code> more than a wrong one. The agent that maximises expected reward is, by construction, the one that answers the most questions correctly with the fewest searches.</p>
<blockquote>The open question is not whether such a policy exists. Weitzman tells us it does, with a known closed form. The question is whether GRPO on raw reward discovers it, and how that policy factorises, inside a single set of LLM weights, across stopping versus query formulation.</blockquote>
</section>
<!-- 7. CERAMIC PARTNERSHIP -->
<section id="ceramic-partnership">
<h2>The Ceramic AI partnership</h2>
<p>Ceramic AI is the retrieval partner on this submission, and the deal is mutual. Their search API is the <em>only</em> information channel available to a trained agent, so every gradient step is, implicitly, a statement about which Ceramic results drive downstream task accuracy.</p>
<div class="table-wrap">
<table>
<thead><tr><th>What Ceramic gets</th><th>What we get</th></tr></thead>
<tbody>
<tr><td>First RL benchmark measuring downstream task value of their search API</td><td>Live, non-mockable agentic search. No synthetic simulation.</td></tr>
<tr><td>Real stress-test of search quality under cost-constrained RL dynamics</td><td>Publishable environment grounded in economic theory (Weitzman 1979).</td></tr>
<tr><td>Quantitative marketing asset of the form &quot;X quality points per search credit&quot;</td><td>Ceramic API key and rate-limit sponsorship for training runs.</td></tr>
<tr><td>Every gradient update implicitly signals which results drive task accuracy</td><td>Primary-authorship credit on the HuggingFace blog (OpenEnv track requirement).</td></tr>
</tbody>
</table>
</div>
<p>Three differentiators make this a competitive OpenEnv submission rather than yet another dataset wrapper: a <strong>theoretically grounded</strong> reward function (the discretised Weitzman cost-per-box formula, not a heuristic); a <strong>vendor partnership</strong> where a paying product is in the training loop; and a <strong>direct deployment relevance</strong> where every result shows up as a metric a production RAG team would put on a slide. See <a href="https://www.ceramic.ai/" target="_blank">Ceramic AI</a> and the <a href="https://docs.ceramic.ai/api/search/quickstart" target="_blank">Search API quickstart</a> for product and integration details.<span class="ceramic-inline-logo"><a href="https://www.ceramic.ai/" target="_blank" rel="noopener noreferrer"><img src="ceramic-logo.png" alt="Ceramic AI logo" class="ceramic-logo" loading="lazy" decoding="async"/></a></span></p>
<h3>Operational risk is the point, not a bug</h3>
<p>Production retrieval-augmented generation is defined, above all else, by the fact that every call to the search API has non-trivial latency (on the order of seconds), metered cost, and a non-zero failure rate. An environment that mocks the search API out of that reality is measuring a different problem: the combinatorics of query phrasing against a static index, not the economics of an agent making decisions under real API risk.</p>
<p><code>SearchEconomicsEnv</code> leaves that risk in the loop. Every <code>step(search)</code> in the live configuration is a real HTTP call to Ceramic, subject to the real <code>ceramic_timeout_s = 10.0</code> timeout, real per-key rate limits, and a real probability of returning low-quality or sparsely-populated snippets. The <code>&minus;&beta;</code> search cost charges the agent regardless of outcome, which is the economically correct thing to do: a deployed RAG agent pays its vendor for failed and empty retrievals the same as for successful ones.</p>
<p>What the agent has to learn, therefore, is not just a policy over <em>which query to issue</em> but a policy under <strong>API risk</strong>. The training signal rewards agents that stop before a marginal search that is likely to time out, return nothing useful, or exhaust the rate limit quota for the episode. That risk-aware stopping behaviour is the exact thing today's production agent frameworks (ReAct-style loops, plan-and-execute scaffolds) cannot express and cannot optimise for, because their search calls are treated as free oracle invocations at evaluation time.</p>
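<p>One way to read "stopping under API risk" is to fold a failure probability into the indifference threshold. The sketch below is an interpretation, not something the environment computes: <code>p_fail</code> is a hypothetical probability that a search times out or returns nothing useful, in which case the agent pays <code>&beta;</code> and gains nothing.</p>

```python
def risk_adjusted_threshold(
    p_fail: float,
    beta: float = 0.1,
    gamma: float = 0.1,
    budget_total: int = 30,
    r_right: float = 1.0,
    r_wrong: float = -0.1,
) -> float:
    """Quality gain a *successful* search must deliver to be worth attempting.

    The expected benefit is (1 - p_fail) * dq * (R_right - R_wrong) while the
    cost is unchanged, so the required gain scales by 1 / (1 - p_fail).
    """
    if p_fail >= 1.0 or 0.0 > p_fail:
        raise ValueError("p_fail must lie in [0, 1)")
    base = (beta + gamma / budget_total) / (r_right - r_wrong)
    return base / (1.0 - p_fail)
```

<p>Under these assumptions a 50% failure rate doubles the required gain from about 0.094 to about 0.188, which is the sense in which an unreliable API should make a well-trained agent stop earlier.</p>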
</section>
<!-- 8. ARCHITECTURE -->
<section id="architecture">
<h2>Architecture</h2>
<p>The environment is two strictly separated pieces: <strong>SearchEconomicsEnv</strong> (this repo, the OpenEnv environment) and <strong>SearchEconomicsPT</strong> (the future GRPO training client, forked from <code>ReasoningEconomicsPT</code>). They communicate exclusively over the OpenEnv WebSocket, with no in-process imports.</p>
<div class="mermaid-wrap">
<pre class="mermaid">
flowchart LR
subgraph PT ["SearchEconomicsPT (future, TRL GRPO + vLLM)"]
GRPO["GRPOTrainer<br/>TRL 1.0"]
RF["rollout_func"]
VLLM["vLLM<br/>colocate/server"]
PARSE["action_parser<br/>JSON + guardrails"]
end
subgraph ENV ["SearchEconomicsEnv (OpenEnv)"]
WS["FastAPI<br/>WebSocket"]
MDP["Search/Commit<br/>MDP"]
GRADE["Grader<br/>EM + token-F1"]
REW["Reward<br/>-&beta; / composite"]
end
subgraph EXT ["External"]
CER["Ceramic API"]
FB["FallbackClient<br/>hash-seeded"]
HQA["HotpotQA<br/>loader"]
end
GRPO --> RF
RF --> VLLM
VLLM -->|"generate"| PARSE
PARSE -->|"JSON action"| WS
WS --> MDP
MDP -->|"search"| CER
MDP -.->|"no key"| FB
CER -->|"snippets + scores"| MDP
FB -->|"snippets + scores"| MDP
HQA -->|"stratified batch"| MDP
MDP --> GRADE
GRADE --> REW
REW -->|"observation + reward"| WS
WS -->|"obs"| RF
style PT stroke-dasharray: 5 5
</pre>
<p class="mermaid-caption">Figure 1. System architecture. The environment is live today. The training client (dashed) is a planned fork of <code>ReasoningEconomicsPT</code>. Everything crosses the WebSocket, per OpenEnv contract.</p>
</div>
<p>The server-side wiring is straightforward but pinned carefully: <code>server/app.py</code> uses <code>openenv.core.env_server.create_app(...)</code> with <code>max_concurrent_envs=64</code>, so concurrent users get isolated episodes. <code>env_config_for_server()</code> reads either <code>CERAMIC_API_KEY</code> or the HuggingFace-Spaces convention <code>SEE_CERAMIC_API_KEY</code>, and patches it into the <code>EnvConfig</code>. If neither is set, the server silently uses the deterministic <code>FallbackCeramicClient</code>. The Docker build is a two-stage <code>uv sync</code> on top of <code>ghcr.io/meta-pytorch/openenv-base:latest</code>.</p>
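<p>The key-resolution step can be pictured as follows. This is a sketch of the behaviour described above, not the body of <code>env_config_for_server()</code>: the function names are hypothetical, and the precedence between the two variables is an assumption.</p>

```python
import os
from typing import Mapping, Optional


def resolve_ceramic_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first non-empty Ceramic key, or "" if none is set."""
    env = os.environ if env is None else env
    # Assumption: CERAMIC_API_KEY wins when both variables are present.
    for name in ("CERAMIC_API_KEY", "SEE_CERAMIC_API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return value
    return ""


def uses_fallback(env: Optional[Mapping[str, str]] = None) -> bool:
    """An empty key means the deterministic fallback client is used."""
    return resolve_ceramic_key(env) == ""
```

<p>Passing the environment in as a mapping keeps the helper testable without touching the real process environment.</p>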
</section>
<!-- 9. BASELINES -->
<section id="baselines">
<h2>Baselines &amp; Expected Behaviors</h2>
<p>Three baselines ship in <code>baselines/</code>, each under 30 lines. Together they bracket the achievable reward region: any policy worse than <code>NoSearchBaseline</code> is broken; any policy that uses more search than <code>AlwaysSearchBaseline</code> cannot exist; any policy that beats <code>ThresholdBaseline</code> at the same average number of searches has learned something Weitzman's reservation rule cannot exploit.</p>
<div class="table-wrap">
<table>
<thead><tr><th>Baseline</th><th>Policy</th><th>What it isolates</th></tr></thead>
<tbody>
<tr><td><strong>NoSearchBaseline</strong></td><td>Commits empty on every question</td><td>Zero-search reference for episode reward (<code>10 &middot; R_wrong = &minus;1.0</code> on the defaults); verifies the reward floor is finite</td></tr>
<tr><td><strong>AlwaysSearchBaseline</strong></td><td>Searches with the raw question verbatim until the per-question cap, then commits empty</td><td>Upper bound on search cost; verifies that budget enforcement actually fires</td></tr>
<tr><td><strong>ThresholdBaseline(&tau;=10.0)</strong></td><td>Searches while <code>top_score &lt; &tau;</code>, then commits using the first context-window snippet truncated to 50 chars</td><td>Approximation to Weitzman's reservation rule. The threshold is a tunable hyperparameter; sweep <code>{5, 10, 15, 20}</code> for a static-policy Pareto frontier.</td></tr>
</tbody>
</table>
</div>
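<p>For orientation, the threshold rule in the last row can be written as a pure function of the observation. This is a sketch, not the shipped <code>ThresholdBaseline</code>: it assumes the observation is available as a dict with <code>question</code>, <code>top_score</code>, <code>searches_remaining</code>, and a <code>context_window</code> of strings, and that the raw question is used as the query.</p>

```python
def threshold_policy(obs: dict, tau: float = 10.0) -> dict:
    """Search while top_score is below tau, then commit the first snippet."""
    # Assumption: top_score is missing or None before the first search.
    top_score = obs.get("top_score") or 0.0
    can_search = obs.get("searches_remaining", 0) > 0
    if can_search and tau > top_score:
        return {"action_type": "search", "query": obs["question"]}
    snippets = obs.get("context_window") or []
    # Commit the first snippet truncated to 50 characters (empty if none).
    answer = snippets[0][:50] if snippets else ""
    return {"action_type": "commit", "answer": answer}
```

<p>Sweeping <code>tau</code> over a few values and recording average reward against average searches per question would trace out the static-policy frontier that a learned policy has to beat.</p>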
<p>A successfully post-trained policy should show four behaviours that distinguish it from a base LLM that either always searches or always answers:</p>
<ul>
<li><strong>Query compression.</strong> Generate focused bridging queries (<em>&quot;film directed by X that starred Y&quot;</em>) rather than copying the full question verbatim. Measurable as decreasing average query token count over training.</li>
<li><strong>Implicit stopping threshold.</strong> Commit once Ceramic's <code>top_score</code> crosses a learned threshold, not after a fixed number of searches. Measurable as commit-step-distribution becoming bimodal: high-score-fast-commit versus low-score-extended-search.</li>
<li><strong>Budget-aware allocation.</strong> Spend more searches on hard questions (low initial <code>top_score</code>) and commit fast on easy ones to bank credits. Measurable as positive correlation between the HotpotQA difficulty label and per-question search count.</li>
<li><strong>Multi-hop decomposition.</strong> Break two-part questions into sequential queries, using the first result to refine the second. Measurable as moderate (not near-1) semantic similarity between consecutive queries within an episode, indicating refinement rather than repetition.</li>
</ul>
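<p>The budget-aware allocation check in the list above reduces to a correlation over episode logs. The sketch below assumes a hypothetical log format of <code>(difficulty_label, search_count)</code> pairs and an ordinal encoding of the difficulty labels; neither is taken from the environment.</p>

```python
from math import sqrt

# Assumption: difficulty labels are ordinal and mapped to 0, 1, 2.
DIFFICULTY_RANK = {"easy": 0, "medium": 1, "hard": 2}


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def allocation_correlation(log: list[tuple[str, int]]) -> float:
    """Correlation between question difficulty and searches spent on it."""
    ranks = [float(DIFFICULTY_RANK[label]) for label, _ in log]
    searches = [float(count) for _, count in log]
    return pearson(ranks, searches)
```

<p>A policy that spends the same number of searches on every question scores 0.0 here; a budget-aware policy should move this value clearly above zero over training.</p>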
<p>Demonstrating each one with a side-by-side trace (base versus post-trained) is the shape of the money-shot figure for the eventual paper.</p>
</section>
<!-- 10. EPISODE TRACES -->
<section id="traces">
<h2>Episode Traces: Schematics of Failure vs Target Behaviour</h2>
<p>Two schematics of episode shape on the multi-hop question <em>&quot;Who directed the film that won the 1994 Academy Award for Best Picture?&quot;</em> (gold answer: <strong>Robert Zemeckis</strong>). <strong>These are not real rollouts.</strong> The numbers below (<code>top_score</code>, per-step reward) are illustrative targets computed from the reward formula under hypothetical <code>q</code> and <code>B_t/B_0</code> values. They show the structural difference between a policy that has internalised <code>&Delta;q*</code> and one that has not.</p>
<h3>Schematic: the Failure Mode (Expected)</h3>
<p>What an untuned agent that treats search as free and copies the question verbatim on every call would look like.</p>
<div class="episode-trace">
<div class="trace-step">
<div class="step-marker"></div>
<div class="step-label">Turn 1 &middot; search</div>
<div class="step-content">
Query: <code>&quot;Who directed the film that won the 1994 Academy Award for Best Picture?&quot;</code><br>
<code>top_score = 6.3</code>, 3 snippets, all tangential. Reward <code>-0.1</code>.
</div>
</div>
<div class="trace-step">
<div class="step-marker"></div>
<div class="step-label">Turns 2-5 &middot; search (verbatim repeat)</div>
<div class="step-content">
Same query 4 more times. <code>top_score</code> oscillates in <code>[5.9, 6.4]</code>. Hits the per-question cap <code>max_searches_per_question = 5</code>.
</div>
</div>
<div class="trace-step">
<div class="step-marker terminal"></div>
<div class="step-label">Turn 6 &middot; force-commit (cap hit)</div>
<div class="step-content">
Environment force-commits with an empty answer. EM = 0, F1 = 0, quality = 0. Reward = <code>R_wrong = &minus;0.1</code>.
</div>
</div>
<div class="trace-step">
<div class="step-marker terminal"></div>
<div class="step-label">Turns 7+ &middot; cascade</div>
<div class="step-content">
Behaviour repeats on questions 2 through 6. At five searches per question, the 30-credit shared budget is exhausted after question 6; the remaining four questions are force-committed as wrong with <code>forced=True</code> history flags.
</div>
</div>
<div class="trace-verdict bad">
Episode reward on the defaults: <code>&minus;30 &middot; &beta; &minus; 10 &middot; 0.1 = &minus;4.0</code> (all 30 shared credits spent, all 10 commits wrong). The agent has learned nothing about <em>when</em> to stop, because it treats every search as costless.
</div>
</div>
<h3>Schematic: the Target Behaviour (Expected)</h3>
<p>What a converged post-trained policy <em>should</em> look like in principle: decompose the question into a bridge lookup, then a director lookup, then commit with plenty of budget left. No trained model has produced this trace; it is the target the reward function points at.</p>
<div class="episode-trace">
<div class="trace-step">
<div class="step-marker good"></div>
<div class="step-label">Turn 1 &middot; search (bridge entity)</div>
<div class="step-content">
Query: <code>&quot;1994 Academy Award Best Picture winner&quot;</code><br>
<code>top_score = 12.8</code>, top snippet names <em>Forrest Gump</em>. Reward <code>&minus;0.1</code>.
</div>
</div>
<div class="trace-step">
<div class="step-marker good"></div>
<div class="step-label">Turn 2 &middot; search (director lookup)</div>
<div class="step-content">
Query: <code>&quot;Forrest Gump director 1994&quot;</code><br>
<code>top_score = 14.2</code>, top snippet names <strong>Robert Zemeckis</strong>. Reward <code>&minus;0.1</code>.
</div>
</div>
<div class="trace-step">
<div class="step-marker good"></div>
<div class="step-label">Turn 3 &middot; commit</div>
<div class="step-content">
<code>{&quot;action_type&quot;:&quot;commit&quot;, &quot;answer&quot;:&quot;Robert Zemeckis&quot;}</code><br>
EM = 1, q = 1.0. Budget remaining: 28 / 30 credits. Efficiency bonus <code>&eta; &middot; &gamma; &middot; 28/30 &asymp; 0.093</code>.
</div>
</div>
<div class="trace-verdict good">
Reward on this question: <code>&minus;0.1 + (&minus;0.1) + (R_wrong + 1.0 &middot; 1.1 + 0.093) = &minus;0.2 + 1.093 = 0.893</code>. Two searches, one commit, net positive reward, and 28 credits banked for the next nine questions.
</div>
</div>
<blockquote>The difference between these schematics is the thing the environment is designed to teach: whether the agent has internalised the reservation value <code>&Delta;q*</code>. Whether a GRPO-trained LLM actually lands on the target schematic is an open empirical question, blocked today on compute, not on environment readiness.</blockquote>
</section>
<!-- 11. ENGINEERING LESSON: CERAMIC BUG -->
<section id="engineering">
<h2>Engineering Lesson: the Pydantic Zero-Results Bug</h2>
<p>A real-world debugging story always lands. Mid-implementation, when we first ran the environment against a live Ceramic API key, every search returned zero results. Fallback tests passed. Hermetic tests passed. Integration tests appeared to &quot;work&quot;; they just always saw <code>len(resp.results) == 0</code>.</p>
<h3>Root cause</h3>
<p>The <code>ceramic_ai</code> SDK (v1.2.1) returns a <code>SearchResponse</code> Pydantic model whose structure is:</p>
<pre><code>SearchResponse
  .request_id: str
  .result: Result                          <span class="c"># Pydantic, NOT a dict</span>
    .results: list[ResultResult]
    .search_metadata: ResultSearchMetadata
      .execution_time: float
    .total_results: int
ResultResult
  .title: str
  .url: str
  .description: str
  .score: float</code></pre>
<p>Our initial parser, written against the SDK's <em>documented JSON schema</em> before we had a key, treated it as a nested dict because that is what the docs show as the response shape:</p>
<pre><code><span class="c"># Buggy: treats a Pydantic model as a dict tree</span>
if hasattr(raw, "__dict__"):
    raw = raw.__dict__
result_block = raw.get("result", {}) if isinstance(raw, dict) else {}
raw_results = result_block.get("results", []) if isinstance(result_block, dict) else []</code></pre>
<p>The failure mode is subtle. <code>raw.__dict__</code> on a Pydantic model gives <code>{&quot;result&quot;: &lt;Result object&gt;, &quot;request_id&quot;: &quot;...&quot;}</code>. <code>result_block</code> is then a <code>Result</code> <em>object</em>, so <code>isinstance(result_block, dict)</code> returns <code>False</code>, the <code>else []</code> branch fires, and you get an empty results list on every call. The fallback client &quot;worked&quot; by accident because dataclasses <em>do</em> have a usable <code>__dict__</code>.</p>
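<p>The failure is easy to reproduce without the SDK. In the sketch below, plain Python classes stand in for the Pydantic models (an assumption made so the example has no dependencies); the class and function names are illustrative.</p>

```python
class ResultItem:
    def __init__(self, title: str, score: float) -> None:
        self.title = title
        self.score = score


class Result:
    def __init__(self, results: list) -> None:
        self.results = results


class SearchResponse:
    def __init__(self, request_id: str, result: Result) -> None:
        self.request_id = request_id
        self.result = result


def parse_as_dict_tree(raw) -> list:
    """Dict-style parsing: only the top level is converted to a dict."""
    if hasattr(raw, "__dict__"):
        raw = raw.__dict__
    result_block = raw.get("result", {}) if isinstance(raw, dict) else {}
    # result_block is still an object here, so the else branch always fires.
    return result_block.get("results", []) if isinstance(result_block, dict) else []


def parse_by_attribute(raw) -> list:
    """Attribute-style parsing with a defensive fallback."""
    try:
        return raw.result.results or []
    except AttributeError:
        return []


resp = SearchResponse("req-1", Result([ResultItem("Forrest Gump", 14.2)]))
print(len(parse_as_dict_tree(resp)), len(parse_by_attribute(resp)))
```

<p>The dict-style parser returns an empty list for a response that clearly contains one result, while attribute access recovers it.</p>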
<h3>The fix</h3>
<p>Switch to attribute access on the Pydantic hierarchy directly, with a defensive fallback for schema drift:</p>
<pre><code><span class="c"># Fixed: type-aware attribute access with defensive fallback</span>
try:
    result_obj = raw.result
    raw_results = result_obj.results or []
    exec_time = result_obj.search_metadata.execution_time
    total = result_obj.total_results
except AttributeError:
    logger.warning("Unexpected Ceramic response shape: %s", type(raw))
    raw_results, exec_time, total = [], elapsed, 0</code></pre>
<p>We also pass <code>max_results</code> and <code>timeout</code> through to the SDK on construction and on the <code>search()</code> call, so the client no longer over-fetches and truncates in Python. The fixed code lives in <code>SearchEconomicsEnv/ceramic/ceramic_client.py</code>.</p>
<h3>Regression test</h3>
<p><code>tests/test_ceramic_client.py::test_ceramic_client_parses_sdk_pydantic_response</code> constructs a real <code>SearchResponse</code> with a <code>Result</code> holding a <code>ResultResult</code> and a <code>ResultSearchMetadata</code>, monkeypatches <code>ceramic_ai.Ceramic</code> with a fake that returns it, and asserts every parsed field. It would have failed against the original parser and prevents the regression from coming back.</p>
<blockquote>Integrators against <code>ceramic_ai</code>, <code>openai</code>, <code>anthropic</code>, or any other Pydantic-based SDK can hit this same failure mode. <strong>Pydantic SDKs from FastAPI services often look like dicts in docs and behave like objects in memory.</strong> Parse by attribute, test against real constructors.</blockquote>
<h3>The fallback client pattern (why the test suite is hermetic)</h3>
<p>The environment also ships a <code>FallbackCeramicClient</code>, a SHA-256-seeded deterministic stub with the same interface as the real client. Synthetic scores live in <code>[1.0, 20.0]</code> with deterministic dispersion across queries. Without this pattern, the test suite would require a live Ceramic key to pass, CI in HF Spaces would break on every cold start before the secret is injected, and reproducing results would require every reader to have a Ceramic account. With it, the environment is runnable end-to-end with zero external dependencies, which is what makes the OpenEnv submission self-contained.</p>
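<p>The idea behind a hash-seeded stub can be sketched as below. This is not the code of <code>FallbackCeramicClient</code>: the function name, the way the query and rank are combined, and the mapping onto <code>[1.0, 20.0]</code> are all assumptions chosen for illustration.</p>

```python
import hashlib


def fallback_scores(
    query: str, n_results: int = 10, lo: float = 1.0, hi: float = 20.0
) -> list[float]:
    """Deterministic pseudo-scores in [lo, hi], derived from SHA-256."""
    scores = []
    for rank in range(n_results):
        # Hash the query together with the rank so each slot gets its own value.
        digest = hashlib.sha256(f"{query}#{rank}".encode("utf-8")).digest()
        # Map the first 8 bytes onto the unit interval [0, 1].
        unit = int.from_bytes(digest[:8], "big") / float(2**64 - 1)
        scores.append(round(lo + unit * (hi - lo), 3))
    # Best score first, matching how ranked search results are usually returned.
    return sorted(scores, reverse=True)
```

<p>Because the scores depend only on the query string, two runs with the same query see the same values, which is what makes tests built on such a stub repeatable without network access.</p>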
</section>
<!-- 11b. RISK REGISTER -->
<section id="risks">
<h2>Risk Register</h2>
<p>Shipping a vendor-API-backed RL environment means the failure surface extends beyond our own code. This register is the complete list of known risks we have mitigations for, transcribed from the migration plan. It is published here rather than hidden in internal docs because a reviewer reading this post should know exactly what can go wrong in a real training run, and what is already in place to contain it.</p>
<div class="table-wrap">
<table>
<thead><tr><th>Risk</th><th>Likelihood</th><th>Impact</th><th>Mitigation</th></tr></thead>
<tbody>
<tr>
<td>Ceramic API rate limits during training</td>
<td>High</td>
<td>Training stalls</td>
<td>Deterministic <code>FallbackCeramicClient</code> for dev and CI; rate-limit increase from Ceramic for production runs</td>
</tr>
<tr>
<td>Ceramic latency of 5&ndash;10 s per search step</td>
<td>High</td>
<td>10&times; slower episode collection</td>
<td>Batch where possible, consider an async client, profile on-path; <code>ceramic_timeout_s = 10.0</code> bounds the worst case</td>
</tr>
<tr>
<td>HotpotQA answer normalisation insufficient (e.g. <code>Zemeckis</code> vs <code>Robert Zemeckis</code>)</td>
<td>Medium</td>
<td>False negatives during grading</td>
<td>Token-F1 fallback already shipped; composite reward absorbs short-string mismatches via the <code>q</code> term; LLM-as-judge planned for v2</td>
</tr>
<tr>
<td>LLM emits malformed JSON actions</td>
<td>High</td>
<td>Wasted training signal</td>
<td>Robust parser with fall-through to an empty commit (guaranteed wrong, not a crash); parse-failure rate logged as a first-class metric</td>
</tr>
<tr>
<td>vLLM generation cap mismatch across distributed ranks</td>
<td>Medium</td>
<td>Hung GPUs on variable-length episodes</td>
<td>Recompute <code>DIST_SERVER_GENERATES_PER_EPISODE</code> as <code>max_searches_per_question &times; num_questions + num_questions</code></td>
</tr>
<tr>
<td><code>sentence-transformers</code> slow on CPU in Docker</td>
<td>Low</td>
<td>Slow <code>reset()</code> in production</td>
<td>Switch to <code>_FallbackEncoder</code> in deployments where top-score embedding quality is not load-bearing</td>
</tr>
<tr>
<td>Pydantic SDK schema drift breaks the response parser</td>
<td>Low</td>
<td>Silent zero-result returns (the exact bug fixed in the engineering section above)</td>
<td>Defensive <code>try / except</code> around attribute access; regression test constructs a real <code>SearchResponse</code> and asserts every parsed field</td>
</tr>
</tbody>
</table>
</div>
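<p>The malformed-JSON mitigation in the table can be illustrated with a small parser that never raises. This is a sketch under assumptions, not the planned <code>action_parser</code>: the function name is hypothetical, and the field names follow the <code>action_type</code>, <code>query</code>, and <code>answer</code> fields used elsewhere in this post.</p>

```python
import json

# Each valid action type and the one string field it must carry.
REQUIRED_FIELD = {"search": "query", "commit": "answer"}
EMPTY_COMMIT = {"action_type": "commit", "answer": ""}


def parse_action(text: str) -> tuple[dict, bool]:
    """Return (action, parsed_ok); fall through to an empty commit on failure."""
    try:
        payload = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return dict(EMPTY_COMMIT), False
    if not isinstance(payload, dict):
        return dict(EMPTY_COMMIT), False
    action_type = payload.get("action_type")
    field = REQUIRED_FIELD.get(action_type) if isinstance(action_type, str) else None
    if field is None or not isinstance(payload.get(field), str):
        return dict(EMPTY_COMMIT), False
    return {"action_type": action_type, field: payload[field]}, True
```

<p>The boolean flag makes the parse-failure rate straightforward to log: count the <code>False</code> results per batch and divide by the number of generations.</p>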
</section>
<!-- 12. FOUNDATIONS -->
<section id="foundations">
<h2>Foundations &amp; Citations</h2>
<div class="table-wrap">
<table>
<thead><tr><th>Foundation</th><th>Role in this project</th><th>Citation</th></tr></thead>
<tbody>
<tr><td><strong>Weitzman's Pandora's Box</strong></td><td>Closed-form optimal stopping rule; source of the reservation-value framing and the per-step <code>&minus;&beta;</code> cost</td><td>Weitzman, M. L. (1979). <em>Optimal Search for the Best Alternative</em>. <em>Econometrica</em> 47(3), 641&ndash;654. <a href="https://doi.org/10.2307/1914302" target="_blank" style="color:var(--accent2)">DOI</a></td></tr>
<tr><td><strong>HotpotQA</strong></td><td>Multi-hop question source; loaded via HuggingFace <code>datasets</code> as <code>hotpot_qa</code> / <code>fullwiki</code></td><td>Yang et al., <em>HotpotQA: A Dataset for Diverse, Explainable Multi-hop QA</em>, EMNLP 2018. <a href="https://arxiv.org/abs/1809.09600" target="_blank" style="color:var(--accent2)">arXiv:1809.09600</a> &middot; <a href="https://huggingface.co/datasets/hotpot_qa" target="_blank" style="color:var(--accent2)">Dataset card</a></td></tr>
<tr><td><strong>OpenEnv</strong></td><td>WebSocket environment contract, Space deployment, <code>create_app</code> FastAPI factory</td><td><a href="https://github.com/meta-pytorch/OpenEnv" target="_blank" style="color:var(--accent2)">meta-pytorch/OpenEnv</a></td></tr>
<tr><td><strong>TRL + GRPO</strong></td><td>Planned training path for <code>SearchEconomicsPT</code>: critic-free RL with group-relative advantages</td><td>Shao et al., <em>DeepSeekMath</em>. <a href="https://arxiv.org/abs/2402.03300" target="_blank" style="color:var(--accent2)">arXiv:2402.03300</a></td></tr>
<tr><td><strong>Ceramic AI</strong></td><td>Live search API and partnership; every environment step's information channel</td><td><a href="https://www.ceramic.ai/" target="_blank" style="color:var(--accent2)">ceramic.ai</a> &middot; <a href="https://docs.ceramic.ai/api/search/quickstart" target="_blank" style="color:var(--accent2)">Search API quickstart</a> &middot; <code>ceramic_ai</code> on PyPI</td></tr>
<tr><td><strong>ReasoningEconomicsEnv / PT</strong></td><td>Sibling project. Structural template for the two-repo split, the rollout function pattern, and the DDP padding strategy</td><td>Same monorepo</td></tr>
</tbody>
</table>
</div>
</section>
<!-- 13. QUICK START -->
<section id="quickstart">
<h2>Quick Start</h2>
<pre><code><span class="c"># 1. Local dev install</span>
cd SearchEconomicsEnv &amp;&amp; uv sync --all-extras
<span class="c"># 2. Hermetic unit tests (no network, no API key)</span>
uv run pytest tests/ -v -m "not integration"
<span class="c"># 3. Live Ceramic integration tests</span>
export CERAMIC_API_KEY="your-key-here"
uv run pytest tests/ -v -m integration
<span class="c"># 4. Run the OpenEnv FastAPI server</span>
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
<span class="c"># 5. Validate against the OpenEnv spec</span>
uv run openenv validate
<span class="c"># 6. Build and run the Docker container</span>
docker build -t search-economics-env:latest -f server/Dockerfile .
docker run -p 8000:8000 -e CERAMIC_API_KEY=... search-economics-env:latest
<span class="c"># 7. Push to HuggingFace Spaces</span>
openenv push</code></pre>
<p>Canonical client-side episode loop (replace the <code>if</code> branch with an LLM call to plug in any policy):</p>
<pre><code>from client import SearchEconomicsEnvClient
from env.models import SearchEconAction

with SearchEconomicsEnvClient(base_url="http://localhost:8000") as client:
    result = client.reset(seed=42)
    obs = result.observation
    while not obs.done:
        if obs.searches_remaining &gt; 0:
            <span class="c"># Budget left: spend one search credit on the raw question</span>
            result = client.step(SearchEconAction(action_type="search", query=obs.question))
        else:
            <span class="c"># Budget exhausted: commit a placeholder answer to close the question</span>
            result = client.step(SearchEconAction(action_type="commit", answer="unknown"))
        obs = result.observation</code></pre>
<p>All episodes are seeded and reproducible from <code>(seed, num_questions, difficulty_mix)</code>. The hermetic path needs no external fixtures; a Ceramic key enables the live retrieval mode.</p>
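<p>As one hedged illustration of a non-LLM policy for the <code>if</code> branch in the loop above, the sketch below implements a static marginal-gain threshold in plain Python. The function name <code>decide</code>, the <code>score_history</code> list, and the reading of the threshold as a minimum gain in retrieval score between consecutive searches are assumptions made for illustration, not the environment's API. The default of 0.094 simply mirrors the reservation value quoted in the conclusion.</p>

```python
def decide(searches_remaining, score_history, threshold=0.094):
    """Return "search" or "commit" for a hypothetical static-threshold policy.

    score_history holds one retrieval score per search already made
    (assumed here to be a value like the observation's top_score).
    """
    if searches_remaining == 0:
        return "commit"  # no credits left, so committing is the only move
    if not score_history:
        return "search"  # nothing retrieved yet, so always buy the first search
    # Assumption: the score before the first search counts as 0.0.
    previous = score_history[-2] if len(score_history) >= 2 else 0.0
    gain = score_history[-1] - previous
    # Keep searching only while the last search paid for itself.
    return "search" if gain >= threshold else "commit"


print(decide(3, []))           # first turn of a question
print(decide(2, [0.5, 0.55]))  # small gain from the second search
```

<p>Under these assumptions the caller appends the latest score after every search step and maps the returned string onto a <code>SearchEconAction</code>.</p>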
</section>
<div class="callout">
<div class="q">Do RL-trained LLMs Rediscover Weitzman's Optimal Stopping Rule from Raw Reward Signal Alone?</div>
<div class="sub">The environment is built, Dockerised, OpenEnv-validated, and Ceramic-integrated. The post-training run is the experiment this submission sets up.</div>
</div>
<!-- 14. FUTURE WORK -->
<section id="future">
<h2>Future work</h2>
<p>The post-training counterpart is scoped in the migration documents but not yet implemented. <code>SearchEconomicsPT</code> will be a fork of <code>ReasoningEconomicsPT</code> (TRL GRPO + vLLM + accelerate, already wired for distributed training on CARC and Lambda) with the following modifications:</p>
<ul>
<li><strong>System prompt</strong> rewritten for the search task: <em>&quot;Each turn, output JSON: <code>{&quot;type&quot;:&quot;search&quot;,&quot;query&quot;:...}</code> or <code>{&quot;type&quot;:&quot;commit&quot;,&quot;answer&quot;:...}</code>. Budget: N searches.&quot;</em></li>
<li><strong><code>format_observation_prompt</code></strong> rewritten to render searches-remaining, <code>top_score</code>, and the rolling context window, instead of the reasoning env's token budget.</li>
<li><strong><code>apply_action</code></strong> parses the LLM's text output (JSON or plain text) into a structured <code>SearchEconAction</code>.</li>
<li><strong>Variable episode length.</strong> Each question now takes between 1 and N+1 generation calls (up to N searches plus the final commit), not exactly one. The vLLM generation cap <code>DIST_SERVER_GENERATES_PER_EPISODE</code> therefore has to be recalculated as <code>max_searches_per_question &middot; num_questions + num_questions</code>.</li>
<li><strong>Per-step max-new-tokens</strong> capped at around 150 tokens (each decision is short), versus the reasoning env's variable cap that scaled with remaining budget.</li>
<li><strong>Robust JSON parser</strong> with fall-through to empty-commit on malformed output. The bad reward from the empty commit acts as the gradient signal that trains away from malformed JSON.</li>
</ul>
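<p>The parsing and generation-cap items above can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned <code>SearchEconomicsPT</code> code: <code>parse_action</code> and <code>generation_cap</code> are hypothetical names, the parser returns plain dicts rather than <code>SearchEconAction</code> objects, and anything that is not a well-formed action (including plain text) falls through to an empty commit.</p>

```python
import json


def parse_action(text):
    """Parse raw model output into an action dict (hypothetical helper)."""
    empty_commit = {"type": "commit", "answer": ""}
    # Take the outermost brace-delimited span so surrounding chatter is ignored.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1 or start > end:
        return empty_commit
    try:
        payload = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return empty_commit
    if not isinstance(payload, dict):
        return empty_commit
    query, answer = payload.get("query"), payload.get("answer")
    if payload.get("type") == "search" and isinstance(query, str) and query.strip():
        return {"type": "search", "query": query.strip()}
    if payload.get("type") == "commit" and isinstance(answer, str):
        return {"type": "commit", "answer": answer}
    return empty_commit  # valid JSON, but not a well-formed action


def generation_cap(max_searches_per_question, num_questions):
    """Worst case: every question uses all its searches plus one commit."""
    return max_searches_per_question * num_questions + num_questions


print(parse_action('Sure! {"type": "commit", "answer": "Paris"}'))
print(parse_action("no json here"))
print(generation_cap(3, 5))
```

<p>Returning the empty commit on every failure path keeps the rollout loop free of exceptions, so malformed output is penalised only through the reward, as described in the last bullet.</p>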
<p>Target model artefacts: <strong>Qwen3-8B</strong> or <strong>Qwen3-14B</strong> fine-tuned via GRPO on <code>SearchEconomicsEnv</code>, to be published on the Hugging Face Hub when training completes (no public model URL yet). Evaluation artefacts: a Pareto frontier plot of accuracy versus searches used across checkpoints, plus ablations on <code>&beta;</code> sensitivity, budget ratio, and query-formulation quality. Target publication: NeurIPS 2026 Datasets &amp; Benchmarks or ICLR 2027, with the framing <em>&quot;Do RL-trained LLMs rediscover optimal sequential search? Evidence from a Pandora's Box environment.&quot;</em></p>
</section>
<!-- 15. CONCLUSION -->
<section>
<h2>Conclusion</h2>
<p><strong>SearchEconomicsEnv</strong> reframes an agentic-retrieval problem as a verifiable, theoretically grounded RL task. Ceramic AI as the live information channel, HotpotQA as the question source, and a Weitzman-shaped composite reward give us a sequential MDP where every component is auditable, every reward is grounded in arithmetic, and the optimal policy has a known closed form.</p>
<p>The engineering contributions shipped with the environment form a pattern that the next multi-turn, verifiable-reward OpenEnv submission can reuse: robust answer extraction across raw, JSON, and prefix-formatted model outputs; a hermetic fallback client for zero-dependency CI; Pydantic-aware SDK parsing with a regression test; and a two-stage Docker build on the OpenEnv base image. The Pydantic zero-results bug in particular is a textbook case of &quot;type-aware versus schema-aware parsing&quot; that any SDK integrator will recognise.</p>
<p>The research question remains open: can a GRPO-trained LLM rediscover the reservation rule <code>&Delta;q* &asymp; 0.094</code> from reward alone, and if so, does it outperform the static threshold baseline at the same average number of searches? The pipeline to answer that question is built, validated, and documented. The next artefact is a trained checkpoint.</p>
</section>
<div class="footer">
<p>SearchEconomicsEnv &middot; AgentX OpenEnv Track &middot; USC in partnership with Ceramic AI</p>
<p style="margin-top:.5rem;">
<a href="https://github.com/sharma-yash01/SearchEconomicsEnv" target="_blank">GitHub</a> &middot;
<a href="https://huggingface.co/spaces/yashu2000/search-economics-env" target="_blank">HF Space</a> &middot;
<a href="https://rdi.berkeley.edu/agentx-agentbeats.html" target="_blank">AgentX</a> &middot;
<a href="https://github.com/meta-pytorch/OpenEnv" target="_blank">OpenEnv Framework</a> &middot;
<a href="https://huggingface.co/docs/trl/en/openenv" target="_blank">TRL x OpenEnv</a> &middot;
<a href="https://www.ceramic.ai/" target="_blank">Ceramic AI</a>
</p>
</div>
</div>
</body>
</html>