OpenEnv · AgentX Submission

SearchEconomicsEnv

A Pandora's-Box RL environment where LLMs learn how often to search, not just how to answer. Multi-hop HotpotQA under a hard search-credit budget, with rewards shaped by Weitzman's 1979 optimal-stopping theorem.

AgentX OpenEnv Track | Yashaswi Sharma (USC) · Defu Cao (USC) · Muyan Weng (USC) | in partnership with Ceramic AI (Lucas Han)

Environment shipped. Trained policy pending.

The OpenEnv environment, Ceramic integration, baselines, hermetic fallback path, and Docker deploy wiring are complete. We have no GRPO-trained checkpoint yet: we exhausted our compute budget before a run converged. Every episode trace, banner snippet, and reward number in this post that is not directly read from EnvConfig defaults is an illustrative target, not measured model output.


How Often Should an Agent Search?

Modern LLM agents are taught to use tools but almost never taught how often to use them. Retrieval-augmented agents either fire a search on every turn or never search at all, and each search is treated as free. SearchEconomicsEnv changes that. It is an OpenEnv-compliant RL environment where an LLM answers multi-hop HotpotQA questions under a hard search-credit budget, and each Ceramic search costs real reward.

The reward shape is lifted directly from Martin Weitzman's 1979 Pandora's Box model of optimal sequential search: open a box (issue a search) for cost β, or stop and commit. The research question is whether GRPO-trained LLMs rediscover Weitzman's optimal threshold rule from the raw reward signal alone, without ever being told what optimal stopping is. The environment is fully Dockerised, OpenEnv-validated, and Ceramic-AI-integrated.

Every Ceramic search is one Pandora's box: pay β, see what is inside, decide whether to keep opening.

Why this Benchmark Matters

Almost every public RAG benchmark (Natural Questions, HotpotQA, TriviaQA, MS MARCO) ranks systems by retrieval recall or answer F1. None of them measure the marginal value of one more search call. Production RAG is the opposite: every additional Pinecone, Weaviate, Brave, or Ceramic call has measurable latency and dollar cost, and "should I search again or just answer?" is the live question every agent framework has to answer inside inference loops like LangChain agents, OpenAI function-calling, and Claude tool use.

Treating that decision as a first-class learning signal has been studied under names like budgeted MDPs, cost-aware RL, and information-purchasing agents, but always with synthetic environments: gridworlds, token games, mock APIs. To our knowledge there is no publicly released RL environment that connects an LLM agent to a real, live, vendor-graded search API as the only information channel; shapes reward according to a theoretically optimal sequential-search policy (Weitzman's Pandora's Box); and ships in OpenEnv format so any GRPO, PPO, or DPO trainer can plug in. SearchEconomicsEnv is that environment.

Weitzman (1979) studies an agent who must decide which of N boxes to open (opening box i costs c_i and reveals a stochastic prize X_i) before committing. The optimal policy is shockingly clean: compute a reservation value z_i for each box such that E[max(X_i, z_i)] − c_i = z_i, open boxes in decreasing order of z_i, and stop the first time the best observed prize exceeds the highest unopened reservation value. It is one of the few sequential-search problems with a closed-form optimal policy, and the mapping to our environment is direct:

Pandora's Box	SearchEconomicsEnv
Box	One Ceramic search call
Cost of opening	`−β` (negative reward per search)
Prize	Information gain about the answer
Commit decision	`commit` action with final answer
Reservation value	Implicitly learned by the policy
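
The reservation value is defined implicitly, so in practice it is solved numerically. A minimal sketch, assuming a discrete prize distribution and an outside option of 0; the names `reservation_value` and `pandora_rule` are illustrative and do not come from the environment, where the policy has to learn this threshold implicitly:

```python
def reservation_value(prizes, probs, cost, tol=1e-9):
    """Solve E[max(X - z, 0)] = cost for z (equivalent to E[max(X, z)] - cost = z)."""
    def expected_excess(z):
        return sum(p * max(x - z, 0.0) for x, p in zip(prizes, probs))

    lo = min(prizes) - cost  # expected_excess(lo) >= cost
    hi = max(prizes)         # expected_excess(hi) == 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_excess(mid) > cost:
            lo = mid  # z is too small: opening is still worth more than it costs
        else:
            hi = mid
    return (lo + hi) / 2


def pandora_rule(boxes, outside_option=0.0):
    """boxes: list of (reservation_value, realised_prize) pairs.

    Opens boxes in decreasing reservation value and stops as soon as the best
    prize in hand is at least the highest unopened reservation value.
    Returns (indices_opened_in_order, best_prize).
    """
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0], reverse=True)
    best, opened = outside_option, []
    for i in order:
        z, prize = boxes[i]
        if best >= z:
            break
        opened.append(i)
        best = max(best, prize)
    return opened, best
```

For a box that pays 0 or 1 with equal probability and costs 0.1 to open, the equation reduces to 0.5 · (1 − z) = 0.1, so z = 0.8.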

Prior Work & Novelty

Most "LLM + retrieval" and "LLM + tool use" work lands in one of six buckets. None of them occupies the cell we target:

Existing work	What it does	What it lacks vs. SearchEconomicsEnv
HotpotQA leaderboard (Yang et al., arXiv:1809.09600, EMNLP 2018)	Ranks systems by EM and F1 on a fixed retrieval pipeline	No notion of search cost, no agent-controlled retrieval, no RL signal
LangChain ReAct agents	Multi-step tool use including search	No reward, no learning, no cost shaping
WebGPT (Nakano et al., arXiv:2112.09332, 2021)	RL-trained browsing agent on Bing	Closed-source, no public env, no theoretical reward grounding
Toolformer (Schick et al., arXiv:2302.04761, 2023)	LLM learns when to call tools via self-supervision	Self-supervised, no per-call cost, no MDP
BoxRL / Pandora gym envs	Toy Pandora's Box implementations	No LLM, no real API, no language at all
ReasoningEconomicsEnv (our sibling)	Token-budget RL on math problems via MetaMathQA + SymPy	Single-step decision, no real external action, no theoretical optimum
SearchEconomicsEnv (ours)	Multi-step MDP, structured search/commit actions, live Ceramic API, Weitzman-shaped composite reward, OpenEnv-compliant	No trained policy yet (future work)

To our knowledge, no publicly released RL environment combines all three of: a theoretically optimal reward formalism (Weitzman's Pandora's Box), a live vendor search API as the only information channel (Ceramic AI), and the OpenEnv contract so any GRPO, PPO, or DPO trainer can plug in. The environment, action schema, and reward semantics are new; the method (RLVR / GRPO on verifiable signals) is shared with the Recon and negotiation-RLVR lineages.

This project is the second environment in a research line on economic constraints for LLM agents. The first, ReasoningEconomicsEnv, budgets reasoning tokens across a batch of math problems. SearchEconomicsEnv replaces token allocation with sequential search-or-commit decisions, MetaMathQA with HotpotQA, and local DeepSeek inference with the live Ceramic Search API. The conceptual shift is from a one-step budgeted regression to a genuine stopping-problem MDP.

What SearchEconomicsEnv is

An OpenEnv-native MDP where an LLM agent decides, at every step, whether to issue another Ceramic search or commit a final answer. A shared search budget and per-step cost turn the problem into a real sequential stopping problem, not a bandit.

Each episode plays out like this:

The environment samples a stratified batch of HotpotQA questions (default 10, difficulty mix {easy: 0.3, medium: 0.4, hard: 0.3}).
At reset, every question is batch-encoded to a 384-dim embedding via sentence-transformers/all-MiniLM-L6-v2 (or a deterministic hash encoder as fallback).
On each step, the agent emits a structured search or commit action. A search fires a Ceramic call, returns snippets and their scores, and pays −β. A commit grades the answer with EM and token-F1, pays the composite commit reward, and advances to the next question.
Searches draw from a pooled budget of int(search_budget_ratio × num_questions) credits (default 30 searches for 10 questions), so frugality on easy questions buys effort on hard ones.
Hitting the per-question cap (default 5) force-commits the current question as wrong. Exhausting the shared budget does the same and also force-commits every remaining question. done=True is reachable in two ways: all questions committed, or budget burned.
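
The budget bookkeeping in the steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the environment's code: `BudgetTracker` and its status strings are hypothetical names, and the exact moment at which the real environment fires a force-commit is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetTracker:
    """Hypothetical helper mirroring the pooled-budget rules described above."""
    num_questions: int = 10
    search_budget_ratio: float = 3.0
    max_searches_per_question: int = 5
    searches_remaining: int = field(init=False, default=0)
    searches_used_this_question: int = field(init=False, default=0)

    def __post_init__(self):
        # Pooled budget: int(search_budget_ratio x num_questions), 30 on the defaults.
        self.searches_remaining = int(self.search_budget_ratio * self.num_questions)

    def can_search(self) -> bool:
        return (self.searches_remaining > 0
                and self.searches_used_this_question < self.max_searches_per_question)

    def record_search(self) -> str:
        """Charge one credit and report 'ok', 'cap_hit', or 'budget_exhausted'."""
        if not self.can_search():
            raise RuntimeError("no search credit available for this question")
        self.searches_remaining -= 1
        self.searches_used_this_question += 1
        if self.searches_remaining == 0:
            return "budget_exhausted"  # every remaining question gets force-committed
        if self.searches_used_this_question == self.max_searches_per_question:
            return "cap_hit"           # only the current question gets force-committed
        return "ok"

    def next_question(self) -> None:
        self.searches_used_this_question = 0
```

Because the pool is shared, a question that commits after one search leaves four extra credits for a harder question later in the episode, up to that question's own cap.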

The conceptual diff from the sibling reasoning environment is the cleanest summary of what changed:

Aspect	ReasoningEconomicsEnv	SearchEconomicsEnv
Agent decision	How many tokens to allocate	When to search vs. commit
Episode MDP	1 `step()` per question	N `step()`s per question
Action	`response: str`	`action_type` + `query` or `answer`
Budget unit	Tokens (10–800)	Searches (pooled, ~3× `num_questions`)
Info source	Local DeepSeek inference	Ceramic Search API (+ fallback)
Dataset	MetaMathQA + NuminaMath-TIR	HotpotQA (difficulty-stratified)
Grading	SymPy symbolic + numeric	EM + token-F1, robust answer extraction
Reward shape	Correctness ± token cost	Per-step search cost + composite commit reward

Two of those rows do most of the conceptual work. First, going from 1 step() to N step()s per question turns a one-shot allocation into a genuine sequential-decision MDP. Second, swapping DeepSeek for Ceramic replaces a local model that generates solutions with a live retrieval API that returns documents; the agent now has to formulate its own answer from snippets, and the environment never produces an answer for it.

Environment Design & Schemas

The core contract is two Pydantic types exchanged over the OpenEnv WebSocket:

from typing import Literal

from pydantic import BaseModel

# Action (agent → env)
class SearchEconAction(BaseModel):
    action_type: Literal["search", "commit"]
    query:  str | None = None   # required when action_type == "search"
    answer: str | None = None   # required when action_type == "commit"

# Observation (env → agent)
class SearchEconObservation(BaseModel):
    # Question
    question: str
    question_embedding: list[float]          # 384-dim
    question_idx: int
    question_done: bool

    # Budget
    searches_remaining: int
    searches_used_this_question: int
    max_searches_per_question: int
    budget_remaining_ratio: float

    # Last search results (empty on reset / after commit)
    ceramic_results: list[SearchResult]      # title/url/description/score
    top_score: float
    score_variance: float
    search_latency_s: float

    # Accumulated context for the current question
    context_window: list[str]                # max 5 snippets, 300 chars each

    # Episode tracking
    step_idx: int
    questions_remaining: int
    accuracy_so_far: float
    history: list[dict]                      # per-commit EM / F1 / quality / mode

    # Plumbing
    done: bool
    reward: float | None
    metadata: dict

The action is intentionally schema-strict. model_post_init raises ValueError if the field required by action_type is unset; a malformed action from the trainer's JSON parser falls through to a commit with an empty answer (guaranteed wrong, charges no extra search). This lets a GRPO trainer treat the LLM's output as a parsable signal: malformed JSON simply earns the wrong-commit reward, so "your JSON was malformed" reaches the policy as negative reward without any special-case logic.
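
A minimal sketch of that fall-through, using a stdlib dataclass as a stand-in for the Pydantic model so the example runs without dependencies. `Action`, `EMPTY_COMMIT`, and `parse_action` are hypothetical names, and the real parser lives in the training client, so its details may differ:

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    """Stdlib stand-in for SearchEconAction (the real type is a Pydantic model)."""
    action_type: str
    query: Optional[str] = None
    answer: Optional[str] = None

    def __post_init__(self):
        # Same idea as the model_post_init check: the field that matches
        # action_type must be present.
        if self.action_type not in ("search", "commit"):
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if self.action_type == "search" and not (isinstance(self.query, str) and self.query):
            raise ValueError("search requires a non-empty query")
        if self.action_type == "commit" and not isinstance(self.answer, str):
            raise ValueError("commit requires an answer")


# Guaranteed-wrong commit that charges no extra search.
EMPTY_COMMIT = Action(action_type="commit", answer="")


def parse_action(text: str) -> Action:
    """Parse model output into an Action; any malformed input becomes an empty commit."""
    try:
        obj = json.loads(text)
        return Action(obj["action_type"], obj.get("query"), obj.get("answer"))
    except (ValueError, KeyError, TypeError, AttributeError):
        return EMPTY_COMMIT
```

The parser never raises, so a rollout with broken JSON still produces a well-defined (and unattractive) reward instead of crashing the episode.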

The observation is rich because the agent needs both the question content and the search-state context to make a Weitzman-rational decision. The schema is deliberately large so that the same observation drives both a learned RL policy and a hand-coded threshold baseline.

How the grader turns a raw model string into a reward

Training an LLM with GRPO means the environment receives text, not a dataclass. Extraction has to be tolerant to reasoning traces, structured JSON, or bare answers, and it has to be deterministic and fast enough to run inside a reward function.

Extraction ladder. The grader in env/answer_grading.py tries, in order: (a) strip markdown code fences (```json, ```) and retry; (b) parse the remainder as JSON and read obj["answer"], including when the object is a full action with obj["type"] == "commit"; (c) first line matching ^Answer: or ^Final answer: (case-insensitive); (d) the last non-empty line of the raw text. Anything that still fails to extract a non-empty string falls through to a wrong-commit with q = 0.
Direct string comparison against HotpotQA gold. The extracted answer is normalised (lowercase, articles stripped, punctuation stripped, whitespace collapsed) and compared to the answer field of the sampled HotpotQA row. Exact Match is exact normalised equality. Token-F1 is multiset overlap on those same normalised tokens. There is no semantic scorer, no embedding similarity, and no LLM judge in the default grader, by design: GRPO throughput cannot tolerate LLM-judge latency in the inner loop.
LLM-as-judge is a planned v2. A future release will add an optional LLM-as-judge grading path behind a config flag, for evaluation runs that want to catch cases where the model's answer is factually equivalent to the gold string but not string-equivalent (Zemeckis vs Robert Zemeckis, US vs United States). v1 ships with deterministic EM + token-F1 only, which is the right default for a training-loop reward function but is known to under-credit semantically correct answers; the composite reward's partial-F1 term exists precisely to soften this.
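
The ladder and the two metrics fit in a short, dependency-free sketch. This is our reading of the description above, not a copy of env/answer_grading.py: the function names are illustrative, and the normalisation order (lowercase, punctuation, articles, whitespace) follows the common SQuAD-style convention, which is an assumption.

```python
import json
import re
import string
from collections import Counter

FENCE = "`" * 3  # built programmatically so the example can sit inside a fenced block


def extract_answer(text: str) -> str:
    # (a) strip markdown code fences
    cleaned = re.sub(FENCE + r"(?:json)?", "", text).strip()
    # (b) JSON object with an "answer" field
    try:
        obj = json.loads(cleaned)
        if isinstance(obj, dict) and isinstance(obj.get("answer"), str):
            return obj["answer"].strip()
    except ValueError:
        pass
    # (c) first line starting with "Answer:" or "Final answer:"
    m = re.search(r"^(?:final answer|answer):[ \t]*(.+)$", cleaned,
                  flags=re.IGNORECASE | re.MULTILINE)
    if m:
        return m.group(1).strip()
    # (d) last non-empty line; an empty string means a wrong-commit with q = 0
    lines = [ln.strip() for ln in cleaned.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def normalize(s: str) -> str:
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)


def token_f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def quality(pred: str, gold: str) -> float:
    """Grader quality q: 1.0 on exact match, otherwise token-F1."""
    return 1.0 if exact_match(pred, gold) else token_f1(pred, gold)
```

On the under-crediting example above, `Zemeckis` against the gold `Robert Zemeckis` has precision 1 and recall 1/2, so q = 2/3 rather than 0.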

reset(seed=42)

Returns observation for question 0. 10 HotpotQA questions loaded, 30 search credits, context_window = [].

step(search, query="...")

Ceramic call → observation with snippets, top_score, score_variance. Reward = −β.

step(search, query="...")

Another call. context_window now holds 2 snippets. searches_remaining decrements.

step(commit, answer="...")

Grade with EM + token-F1 → composite commit reward. question_idx advances.

… continue until …

All questions committed or shared budget exhausted. done=True on the terminal step.

Reward: Weitzman's Pandora's Box in Math

The per-step reward has two modes. Every search pays a flat cost:

R_t^{\text{search}} = -\beta

Default β = 0.1. Every search costs the same, regardless of whether it returned useful information. This matches Weitzman's assumption that opening costs are paid up-front.

Every commit pays the composite commit reward:

R_t^{\text{commit}} = R_{\text{wrong}} + q \cdot (R_{\text{right}} - R_{\text{wrong}}) + \eta \cdot \gamma \cdot \frac{B_t}{B_0}

q ∈ [0, 1] is the grader quality (1.0 if exact match, else token-F1); R_right = +1.0 and R_wrong = −0.1 are the correct_reward and incorrect_reward defaults; η = 1 iff q ≥ q_min (default q_min = 1.0, so only full-EM commits earn the efficiency bonus); B_t / B_0 is the fraction of search credits remaining; and γ = 0.1 is the efficiency bonus weight.

Setting the expected marginal benefit of one more search equal to its certain cost gives an indifference threshold. Suppose one more search improves quality by Δq; that is worth Δq · (R_right − R_wrong) = 1.1 · Δq in expected reward. The search costs β = 0.1 directly, plus γ / B_0 ≈ 0.003 of efficiency bonus (for B_0 = 30) on a commit that would otherwise earn it. Solving gives:

\Delta q^{*} = \frac{\beta + \gamma / B_0}{R_{\text{right}} - R_{\text{wrong}}} \;\approx\; \frac{0.1 + 0.003}{1.1} \;\approx\; 0.094

A rational agent should keep searching whenever it expects to improve quality by about 9 percentage points or more. This is the Weitzman reservation value, made concrete on our defaults.
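
The two reward modes and the threshold are small enough to check by hand. A sketch with the shipped defaults as keyword arguments; the function names are illustrative, and `partial_reward_scale` is left at its default of 1.0 and therefore omitted:

```python
def search_reward(beta: float = 0.1) -> float:
    """Flat per-search cost, paid whether or not the search was useful."""
    return -beta


def commit_reward(q: float, budget_remaining: int, budget_total: int,
                  r_right: float = 1.0, r_wrong: float = -0.1,
                  gamma: float = 0.1, q_min: float = 1.0) -> float:
    """Composite commit reward: R_wrong + q (R_right - R_wrong) + eta * gamma * B_t / B_0."""
    eta = 1.0 if q >= q_min else 0.0
    return r_wrong + q * (r_right - r_wrong) + eta * gamma * budget_remaining / budget_total


def indifference_threshold(beta: float = 0.1, gamma: float = 0.1, b0: int = 30,
                           r_right: float = 1.0, r_wrong: float = -0.1) -> float:
    """Quality gain at which one more search exactly pays for itself."""
    return (beta + gamma / b0) / (r_right - r_wrong)
```

On the defaults, `indifference_threshold()` comes out at roughly 0.094. Two searches followed by an exact-match commit with 28 of 30 credits left sum to roughly 0.893, while a commit with q = 0.5 earns 0.45 and no efficiency bonus.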

A legacy binary grading mode (commit_reward_mode="legacy_binary") restores a strict normalised-equality match without answer extraction, for ablations that isolate how much of the agent's improvement comes from partial-credit shaping versus genuine accuracy gains.

`EnvConfig` defaults (full)

The complete set of knobs exposed by SearchEconomicsEnv/env/config.py, with the values the environment ships with. These are the numbers the abstract constants above resolve to.

Field	Default	Notes
`num_questions`	10	Questions per episode; shared-budget denominator
`max_searches_per_question`	5	Hard cap before force-commit on a single question
`search_budget_ratio`	3.0	Pooled budget `B_0 = ratio × num_questions` (default 30)
`use_shared_budget`	True	Frugality on easy questions funds hard ones
`dataset`	`hotpotqa`	HuggingFace dataset identifier
`dataset_split`	`train`	Split loaded on `reset()`
`difficulty_mix`	`{easy:0.3, medium:0.4, hard:0.3}`	Stratified sampling proportions per episode
`ceramic_api_key`	`""`	Empty string activates the deterministic fallback client
`ceramic_timeout_s`	10.0	Per-request timeout for live Ceramic calls
`max_results_per_search`	10	Snippets returned per successful search
`beta`	0.1	Flat per-search cost
`gamma`	0.1	Efficiency-bonus weight on `B_t / B_0`
`correct_reward`	+1.0	`R_right` in the commit reward
`incorrect_reward`	−0.1	`R_wrong` floor
`grade_count_correct_mode`	`em_only`	Criterion for the correct count metric (not the reward)
`f1_count_threshold`	0.85	Token-F1 threshold when `grade_count_correct_mode` is permissive
`commit_reward_mode`	`composite`	`composite` uses the shaped formula; `legacy_binary` for ablations
`efficiency_bonus_min_quality`	1.0	`q_min`; only full-EM commits earn the bonus
`partial_reward_scale`	1.0	Multiplier on the `q · (R_right - R_wrong)` term
`embedding_model`	`all-MiniLM-L6-v2`	Sentence-Transformers model for top-score computation
`max_context_snippets`	5	Rolling context window size across searches
`snippet_max_chars`	300	Character truncation per snippet before appending
`seed`	`None`	Episode-level RNG seed; `None` = environment-chosen

How the Agent Interacts with Ceramic: Dual Policies under one Reward

A single SearchEconAction looks flat on the wire: a discriminator plus one string. Structurally, though, the agent is learning two policies that share a single scalar objective. Understanding the decomposition is the clearest way to see what the environment measures.

Two policies, one objective

On every observation the agent chooses between two decisions: (a) formulate a query and issue another search, or (b) commit an answer and move on. That dual choice is encoded as one SearchEconAction with an action_type discriminator, but it decomposes cleanly into two learned sub-policies riding inside the same LLM weights.

The stopping policy: `search` vs `commit`

Deciding when to stop searching is the classic Weitzman reservation problem. The agent has to learn when the expected marginal quality gain Δq from another search falls below the reservation threshold Δq* = (β + γ / B_0) / (R_right − R_wrong) ≈ 0.094 on the defaults. It gets no direct feature telling it "you know enough to answer"; it has to infer that from top_score, score_variance, what is already sitting in the context_window, the remaining shared budget (budget_remaining_ratio), and its own internal representation of the question.

The search-formulation policy: the content of `query`

Conditional on choosing to search, the agent also picks the query string. A good policy learns two related sub-behaviours: (i) query compression, emitting a bridging entity or a short factual sub-question rather than a verbatim copy of the original multi-hop question; and (ii) multi-hop decomposition, using the result of turn N to shape the query at turn N+1. Both are measurable from episode logs at evaluation time: average query token count, semantic similarity between consecutive queries within an episode, and the distribution of unique bridging entities per question.
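
Two of those log metrics can be sketched without any model in the loop. The version below uses token-set Jaccard overlap as a cheap stand-in for semantic similarity; that substitution is ours, as are the function names, and an embedding-based cosine similarity would slot into the same place:

```python
def jaccard(a, b) -> float:
    """Token-set overlap in [0, 1]; a lexical stand-in for semantic similarity."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def query_metrics(queries):
    """Average query length and average similarity between consecutive queries."""
    tokenised = [q.lower().split() for q in queries]
    avg_tokens = sum(len(t) for t in tokenised) / len(tokenised) if tokenised else 0.0
    sims = [jaccard(x, y) for x, y in zip(tokenised, tokenised[1:])]
    avg_sim = sum(sims) / len(sims) if sims else 0.0
    return {"avg_query_tokens": avg_tokens, "avg_consecutive_similarity": avg_sim}
```

Verbatim repetition of the original question shows up as similarity 1.0 and a long average query, while a bridge-then-refine pair of short queries scores low on both.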

One scalar objective

Both sub-policies share the same optimiser: maximise expected episode reward. The reward function makes the trade-off concrete. Every extra search costs β; a correct commit pays an efficiency bonus of γ · B_t / B_0, so every unused search credit is worth γ / B_0; a correct-versus-wrong commit pays R_right − R_wrong = 1.1. The agent that maximises expected reward is, by construction, the agent that answers the most questions correctly using the fewest searches.

The open question is not whether such a policy exists. Weitzman tells us it does, with a known closed form. The question is whether GRPO on raw reward discovers it, and how that policy factorises, inside a single set of LLM weights, across stopping versus query formulation.

The Ceramic AI partnership

Ceramic AI is the retrieval partner on this submission, and the deal is mutual. Their search API is the only information channel available to a trained agent, so every gradient step is, implicitly, a statement about which Ceramic results drive downstream task accuracy.

What Ceramic gets	What we get
First RL benchmark measuring downstream task value of their search API	Live, non-mockable agentic search. No synthetic simulation.
Real stress-test of search quality under cost-constrained RL dynamics	Publishable environment grounded in economic theory (Weitzman 1979).
Quantitative marketing asset of the form "X quality points per search credit"	Ceramic API key and rate-limit sponsorship for training runs.
Every gradient update implicitly signals which results drive task accuracy	Primary-authorship credit on the HuggingFace blog (OpenEnv track requirement).

Three differentiators make this a competitive OpenEnv submission rather than yet another dataset wrapper: a theoretically grounded reward function (the discretised Weitzman cost-per-box formula, not a heuristic); a vendor partnership that puts a paying product in the training loop; and direct deployment relevance, since every result maps to a metric a production RAG team would put on a slide. See Ceramic AI and the Search API quickstart for product and integration details.

Operational risk is the point, not a bug

Production retrieval-augmented generation is defined, above all else, by the fact that every call to the search API has non-trivial latency (on the order of seconds), metered cost, and a non-zero failure rate. An environment that mocks the search API out of that reality is measuring a different problem: the combinatorics of query phrasing against a static index, not the economics of an agent making decisions under real API risk.

SearchEconomicsEnv leaves that risk in the loop. Every step(search) in the live configuration is a real HTTP call to Ceramic, subject to the real ceramic_timeout_s = 10.0 timeout, real per-key rate limits, and a real probability of returning low-quality or sparsely-populated snippets. The −β search cost charges the agent regardless of outcome, which is the economically correct thing to do: a deployed RAG agent pays its vendor for failed and empty retrievals the same as for successful ones.
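
The "charge regardless of outcome" rule is easy to state in code. A minimal sketch, assuming the search backend is any callable that takes a query and may raise; `charged_search` is a hypothetical helper, and how the real environment handles a failed call internally is not shown in this post:

```python
import time


def charged_search(search_fn, query: str, beta: float = 0.1):
    """Run one search and return (results, reward, latency_s).

    The reward is -beta whether the call succeeds, returns nothing, or fails.
    """
    start = time.monotonic()
    try:
        results = list(search_fn(query))
    except Exception:
        # Timeouts, rate-limit errors, and parse failures all look the same to
        # the agent: no snippets, and the credit is still spent.
        results = []
    latency = time.monotonic() - start
    return results, -beta, latency
```

From the policy's point of view, a flaky or slow backend lowers the expected Δq of the next search while its cost stays fixed, which is exactly what pushes the stopping decision earlier.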

What the agent has to learn, therefore, is not just a policy over which query to issue but a policy under API risk. The training signal rewards agents that stop before a marginal search that is likely to time out, return nothing useful, or exhaust the rate limit quota for the episode. That risk-aware stopping behaviour is the exact thing today's production agent frameworks (ReAct-style loops, plan-and-execute scaffolds) cannot express and cannot optimise for, because their search calls are treated as free oracle invocations at evaluation time.

Architecture

The environment is two strictly separated pieces: SearchEconomicsEnv (this repo, the OpenEnv environment) and SearchEconomicsPT (the future GRPO training client, forked from ReasoningEconomicsPT). They communicate exclusively over the OpenEnv WebSocket, with no in-process imports.

flowchart LR
    subgraph PT ["SearchEconomicsPT (future, TRL GRPO + vLLM)"]
        GRPO["GRPOTrainer<br/>TRL 1.0"]
        RF["rollout_func"]
        VLLM["vLLM<br/>colocate/server"]
        PARSE["action_parser<br/>JSON + guardrails"]
    end
    subgraph ENV ["SearchEconomicsEnv (OpenEnv)"]
        WS["FastAPI<br/>WebSocket"]
        MDP["Search/Commit<br/>MDP"]
        GRADE["Grader<br/>EM + token-F1"]
        REW["Reward<br/>-β / composite"]
    end
    subgraph EXT ["External"]
        CER["Ceramic API"]
        FB["FallbackClient<br/>hash-seeded"]
        HQA["HotpotQA<br/>loader"]
    end

    GRPO --> RF
    RF --> VLLM
    VLLM -->|"generate"| PARSE
    PARSE -->|"JSON action"| WS
    WS --> MDP
    MDP -->|"search"| CER
    MDP -.->|"no key"| FB
    CER -->|"snippets + scores"| MDP
    FB -->|"snippets + scores"| MDP
    HQA -->|"stratified batch"| MDP
    MDP --> GRADE
    GRADE --> REW
    REW -->|"observation + reward"| WS
    WS -->|"obs"| RF

    style PT stroke-dasharray: 5 5

Figure 1. System architecture. The environment is live today. The training client (dashed) is a planned fork of ReasoningEconomicsPT. Everything crosses the WebSocket, per OpenEnv contract.

The server-side wiring is straightforward but pinned carefully: server/app.py uses openenv.core.env_server.create_app(...) with max_concurrent_envs=64, so concurrent users get isolated episodes. env_config_for_server() reads either CERAMIC_API_KEY or the HuggingFace-Spaces convention SEE_CERAMIC_API_KEY, and patches it into the EnvConfig. If neither is set, the server silently uses the deterministic FallbackCeramicClient. The Docker build is a two-stage uv sync on top of ghcr.io/meta-pytorch/openenv-base:latest.
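
The key-resolution step can be sketched as below. The two variable names come from the paragraph above; the helper names and the precedence order (CERAMIC_API_KEY first) are assumptions, since the actual behaviour of env_config_for_server() is not spelled out here:

```python
import os


def resolve_ceramic_key(environ=None) -> str:
    """Return the first non-empty key found, or "" to select the fallback client."""
    environ = os.environ if environ is None else environ
    # Assumed precedence: the plain name wins over the Spaces-style name.
    for name in ("CERAMIC_API_KEY", "SEE_CERAMIC_API_KEY"):
        value = environ.get(name, "").strip()
        if value:
            return value
    return ""


def uses_fallback(environ=None) -> bool:
    """True when no key is configured, i.e. the deterministic fallback client is used."""
    return resolve_ceramic_key(environ) == ""
```

Returning an empty string rather than raising matches the `ceramic_api_key` default in EnvConfig, where an empty value activates the deterministic fallback client.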

Baselines & Expected Behaviors

Three baselines ship in baselines/, each under 30 lines. Together they bracket the achievable reward region: any policy worse than NoSearchBaseline is broken; any policy that uses more search than AlwaysSearchBaseline cannot exist; any policy that beats ThresholdBaseline at the same average number of searches has learned something Weitzman's reservation rule cannot exploit.

Baseline	Policy	What it isolates
NoSearchBaseline	Commits empty on every question	Zero-search sanity floor on episode reward; verifies the reward floor is finite
AlwaysSearchBaseline	Searches with the raw question verbatim until the per-question cap, then commits empty	Upper bound on search cost; verifies that budget enforcement actually fires
ThresholdBaseline(τ=10.0)	Searches while `top_score < τ`, then commits using the first context-window snippet truncated to 50 chars	Approximation to Weitzman's reservation rule. The threshold is a tunable hyperparameter; sweep `{5, 10, 15, 20}` for a static-policy Pareto frontier.
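
A sketch of the threshold rule as a pure function over an observation dict. This is not the code in baselines/: the function name is illustrative, searching with the raw question text is an assumption (the table does not say which query the baseline issues), and the explicit budget guard is added here so the example never requests a search it cannot pay for.

```python
def threshold_policy(obs: dict, tau: float = 10.0) -> dict:
    """Search while top_score < tau, then commit the first context snippet (50 chars)."""
    can_search = (
        obs["searches_remaining"] > 0
        and obs["searches_used_this_question"] < obs["max_searches_per_question"]
    )
    if obs["top_score"] < tau and can_search:
        # Assumption: the baseline queries with the question verbatim.
        return {"action_type": "search", "query": obs["question"]}
    context = obs["context_window"]
    answer = context[0][:50] if context else ""
    return {"action_type": "commit", "answer": answer}
```

Sweeping `tau` over `{5, 10, 15, 20}` and recording mean reward against mean searches per question gives the static-policy frontier that a learned policy has to beat.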

A successfully post-trained policy should demonstrate four qualitatively different behaviors from a base LLM that either always searches or always answers:

Query compression. Generate focused bridging queries ("film directed by X that starred Y") rather than copying the full question verbatim. Measurable as decreasing average query token count over training.
Implicit stopping threshold. Commit once Ceramic's top_score crosses a learned threshold, not after a fixed number of searches. Measurable as commit-step-distribution becoming bimodal: high-score-fast-commit versus low-score-extended-search.
Budget-aware allocation. Spend more searches on hard questions (low initial top_score) and commit fast on easy ones to bank credits. Measurable as positive correlation between the HotpotQA difficulty label and per-question search count.
Multi-hop decomposition. Break two-part questions into sequential queries, using the first result to refine the second. Measurable as moderate (not near-1) semantic similarity between consecutive queries within an episode, indicating refinement rather than repetition.

Demonstrating each one with a side-by-side trace (base versus post-trained) is the shape of the money-shot figure for the eventual paper.

Episode Traces: Schematics of Failure vs Target Behaviour

Two schematics of episode shape on the multi-hop question "Who directed the film that won the 1994 Academy Award for Best Picture?" (gold answer: Robert Zemeckis). These are not real rollouts. The numbers below (top_score, per-step reward) are illustrative targets computed from the reward formula under hypothetical q and B_t/B_0 values. They show the structural difference between a policy that has internalised Δq* and one that has not.

Schematic: the Failure Mode (Expected)

What an untuned agent that treats search as free and copies the question verbatim on every call would look like.

Turn 1 · search

Query: "Who directed the film that won the 1994 Academy Award for Best Picture?"
top_score = 6.3, 3 snippets, all tangential. Reward -0.1.

Turns 2-5 · search (verbatim repeat)

Same query 4 more times. top_score oscillates in [5.9, 6.4]. Hits the per-question cap max_searches_per_question = 5.

Turn 6 · force-commit (cap hit)

Environment force-commits with an empty answer. EM = 0, F1 = 0, quality = 0. Reward = R_wrong = −0.1.

Turns 7+ · cascade

Behaviour repeats on questions 2 through 6. Six questions at five searches each burn the full 30-credit shared budget; the remaining questions are force-committed as wrong with forced=True history flags.

Episode reward is dominated by −30 · β − 10 · 0.1 = −4.0. The agent has learned nothing about when to stop, because it treats every search as costless.

Schematic: the Target Behaviour (Expected)

What a converged post-trained policy should look like in principle: decompose the question into a bridge lookup, then a director lookup, then commit with plenty of budget left. No trained model has produced this trace; it is the target the reward function points at.

Turn 1 · search (bridge entity)

Query: "1994 Academy Award Best Picture winner"
top_score = 12.8, top snippet names Forrest Gump. Reward −0.1.

Turn 2 · search (director lookup)

Query: "Forrest Gump director 1994"
top_score = 14.2, top snippet names Robert Zemeckis. Reward −0.1.

Turn 3 · commit

{"type":"commit", "answer":"Robert Zemeckis"}
EM = 1, q = 1.0. Budget remaining: 28 / 30 credits. Efficiency bonus η · γ · 28/30 ≈ 0.093.

Reward on this question: −0.1 + (−0.1) + (R_wrong + 1.0 · 1.1 + 0.093) = −0.2 + 1.093 = 0.893. Two searches, one commit, net positive reward, and 28 credits banked for the next nine questions.

The difference between these schematics is the thing the environment is designed to teach: whether the agent has internalised the reservation value Δq*. Whether a GRPO-trained LLM actually lands on the target schematic is an open empirical question, blocked today on compute, not on environment readiness.

Engineering Lesson: the Pydantic Zero-Results Bug

A real-world debugging story always lands. Mid-implementation, when we first ran the environment against a live Ceramic API key, every search returned zero results. Fallback tests passed. Hermetic tests passed. Integration tests appeared to "work"; they just always saw len(resp.results) == 0.

Root cause

The ceramic_ai SDK (v1.2.1) returns a SearchResponse Pydantic model whose structure is:

SearchResponse
    .request_id: str
    .result: Result                     # Pydantic, NOT a dict
        .results: list[ResultResult]
        .search_metadata: ResultSearchMetadata
            .execution_time: float
        .total_results: int
ResultResult
    .title: str
    .url: str
    .description: str
    .score: float

Our initial parser, written against the SDK's documented JSON schema before we had a key, treated it as a nested dict because that is what the docs show as the response shape:

# Buggy: treats a Pydantic model as a dict tree
if hasattr(raw, "__dict__"):
    raw = raw.__dict__
result_block = raw.get("result", {}) if isinstance(raw, dict) else {}
raw_results  = result_block.get("results", []) if isinstance(result_block, dict) else []

The failure mode is subtle. raw.__dict__ on a Pydantic model gives {"result": <Result object>, "request_id": "..."}. result_block is then a Result object, so isinstance(result_block, dict) returns False, the else [] branch fires, and you get an empty results list on every call. The fallback client "worked" by accident because dataclasses do have a usable __dict__.

The fix

Switch to attribute access on the Pydantic hierarchy directly, with a defensive fallback for schema drift:

# Fixed: type-aware attribute access with defensive fallback
try:
    result_obj  = raw.result
    raw_results = result_obj.results or []
    exec_time   = result_obj.search_metadata.execution_time
    total       = result_obj.total_results
except AttributeError:
    logger.warning("Unexpected Ceramic response shape: %s", type(raw))
    raw_results, exec_time, total = [], elapsed, 0

We also pass max_results and timeout through to the SDK on construction and on the search() call, so the client no longer over-fetches and truncates in Python. The fixed code lives in SearchEconomicsEnv/ceramic/ceramic_client.py.
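
The failure is reproducible without the SDK. The sketch below uses `types.SimpleNamespace` objects as stand-ins for the Pydantic response classes (an assumption made so the example has no dependencies; like Pydantic models, they expose attributes and a `__dict__` but are not dicts), and the two parser functions are trimmed copies of the snippets above:

```python
from types import SimpleNamespace


def parse_buggy(raw):
    """Original logic: assumes the response is a tree of dicts."""
    if hasattr(raw, "__dict__"):
        raw = raw.__dict__
    result_block = raw.get("result", {}) if isinstance(raw, dict) else {}
    # result_block is an object, not a dict, so this always yields [].
    return result_block.get("results", []) if isinstance(result_block, dict) else []


def parse_fixed(raw):
    """Fixed logic: attribute access with a defensive fallback."""
    try:
        return list(raw.result.results or [])
    except AttributeError:
        return []


# Stand-in for SearchResponse -> Result -> list[ResultResult]
hit = SimpleNamespace(title="Forrest Gump", url="https://example.invalid/fg",
                      description="1994 film directed by Robert Zemeckis", score=14.2)
response = SimpleNamespace(
    request_id="req-1",
    result=SimpleNamespace(results=[hit], total_results=1,
                           search_metadata=SimpleNamespace(execution_time=0.12)),
)
```

Here `parse_buggy(response)` returns an empty list while `parse_fixed(response)` returns the one hit, which is the "zero results on every call" symptom in miniature.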

Regression test

tests/test_ceramic_client.py::test_ceramic_client_parses_sdk_pydantic_response constructs a real SearchResponse with a Result holding a ResultResult and a ResultSearchMetadata, monkeypatches ceramic_ai.Ceramic with a fake that returns it, and asserts every parsed field. It would have failed against the original parser and prevents the regression from coming back.

Anyone integrating against ceramic_ai, openai, anthropic, or any other Pydantic-based SDK can hit this exact failure mode. Pydantic response models often look like dicts in the docs and behave like objects in memory. Parse by attribute, test against real constructors.

The fallback client pattern (why the test suite is hermetic)

The environment also ships a FallbackCeramicClient, a SHA-256-seeded deterministic stub with the same interface as the real client. Synthetic scores live in [1.0, 20.0] with deterministic dispersion across queries. Without this pattern, the test suite would require a live Ceramic key to pass, CI in HF Spaces would break on every cold start before the secret is injected, and reproducing results would require every reader to have a Ceramic account. With it, the environment is runnable end-to-end with zero external dependencies, which is what makes the OpenEnv submission self-contained.
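
The idea behind a hash-seeded stub can be sketched as follows. This is an illustration of the pattern, not the shipped FallbackCeramicClient: the function name, the result fields' contents, and the way the digest is mapped to a score are all assumptions.

```python
import hashlib


def fallback_search(query: str, max_results: int = 10):
    """Deterministic synthetic results: same query in, same snippets and scores out."""
    results = []
    for rank in range(max_results):
        digest = hashlib.sha256(f"{query}|{rank}".encode("utf-8")).digest()
        unit = int.from_bytes(digest[:8], "big") / 2**64  # uniform-looking value in [0, 1)
        results.append({
            "title": f"Synthetic result {rank} for: {query}",
            "url": f"https://example.invalid/{digest.hex()[:12]}",
            "description": f"Deterministic placeholder snippet {rank}.",
            "score": round(1.0 + 19.0 * unit, 3),  # scores live in [1.0, 20.0]
        })
    # Highest score first, like a ranked search response.
    results.sort(key=lambda r: r["score"], reverse=True)
    return results
```

Because nothing depends on wall-clock time, process state, or Python's randomised `hash()`, two machines running the same query get byte-identical results, which is what keeps the test suite hermetic.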

Risk Register

Shipping a vendor-API-backed RL environment means the failure surface extends beyond our own code. This register is the complete list of known risks we have mitigations for, transcribed from the migration plan. It is published here rather than hidden in internal docs because a reviewer reading this post should know exactly what can go wrong in a real training run, and what is already in place to contain it.

Risk	Likelihood	Impact	Mitigation
Ceramic API rate limits during training	High	Training stalls	Deterministic `FallbackCeramicClient` for dev and CI; rate-limit increase from Ceramic for production runs
Ceramic latency of 5–10 s per search step	High	10× slower episode collection	Batch where possible, consider an async client, profile on-path; `ceramic_timeout_s = 10.0` bounds the worst case
HotpotQA answer normalisation insufficient (e.g. `Zemeckis` vs `Robert Zemeckis`)	Medium	False negatives during grading	Token-F1 fallback already shipped; composite reward absorbs short-string mismatches via the `q` term; LLM-as-judge planned for v2
LLM emits malformed JSON actions	High	Wasted training signal	Robust parser with fall-through to an empty commit (guaranteed wrong, not a crash); parse-failure rate logged as a first-class metric
vLLM generation cap mismatch across distributed ranks	Medium	Hung GPUs on variable-length episodes	Recompute `DIST_SERVER_GENERATES_PER_EPISODE` as `max_searches_per_question × num_questions + num_questions`
`sentence-transformers` slow on CPU in Docker	Low	Slow `reset()` in production	Switch to `_FallbackEncoder` in deployments where top-score embedding quality is not load-bearing
Pydantic SDK schema drift breaks the response parser	Low	Silent zero-result returns (the exact bug fixed in the engineering section above)	Defensive `try / except` around attribute access; regression test constructs a real `SearchResponse` and asserts every parsed field
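The generation-cap mitigation in the register is plain arithmetic: every question needs at most `max_searches_per_question` search generates plus one commit generate. A small worked sketch follows, with a hypothetical helper name and illustrative values that are not EnvConfig defaults.

```python
def generates_per_episode(max_searches_per_question: int, num_questions: int) -> int:
    """Upper bound on generates in one episode: all searches, plus one commit per question."""
    return max_searches_per_question * num_questions + num_questions


# Illustrative values only: 3 searches per question, 10 questions per episode.
cap = generates_per_episode(max_searches_per_question=3, num_questions=10)
# 3 * 10 + 10 = 40, the same as (3 + 1) * 10.
```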

Foundations & Citations

Foundation	Role in this project	Citation
Weitzman's Pandora's Box	Closed-form optimal stopping rule; source of the reservation-value framing and the per-step `−β` cost	Weitzman, M. L. (1979). Optimal Search for the Best Alternative. Econometrica 47(3), 641–654. DOI
HotpotQA	Multi-hop question source; loaded via HuggingFace `datasets` as `hotpot_qa` / `fullwiki`	Yang et al., HotpotQA: A Dataset for Diverse, Explainable Multi-hop QA, EMNLP 2018. arXiv:1809.09600 · Dataset card
OpenEnv	WebSocket environment contract, Space deployment, `create_app` FastAPI factory	meta-pytorch/OpenEnv
TRL + GRPO	Planned training path for `SearchEconomicsPT`: critic-free RL with group-relative advantages	Shao et al., DeepSeekMath. arXiv:2402.03300
Ceramic AI	Live search API and partnership; every environment step's information channel	ceramic.ai · Search API quickstart · `ceramic_ai` on PyPI
ReasoningEconomicsEnv / PT	Sibling project. Structural template for the two-repo split, the rollout function pattern, and the DDP padding strategy	Same monorepo

Quick Start

# 1. Local dev install
cd SearchEconomicsEnv && uv sync --all-extras

# 2. Hermetic unit tests (no network, no API key)
uv run pytest tests/ -v -m "not integration"

# 3. Live Ceramic integration tests
export CERAMIC_API_KEY="your-key-here"
uv run pytest tests/ -v -m integration

# 4. Run the OpenEnv FastAPI server
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000

# 5. Validate against the OpenEnv spec
uv run openenv validate

# 6. Build and run the Docker container
docker build -t search-economics-env:latest -f server/Dockerfile .
docker run -p 8000:8000 -e CERAMIC_API_KEY=... search-economics-env:latest

# 7. Push to HuggingFace Spaces
openenv push

Canonical client-side episode loop (replace the if branch with an LLM call to plug in any policy):

from client import SearchEconomicsEnvClient
from env.models import SearchEconAction

with SearchEconomicsEnvClient(base_url="http://localhost:8000") as client:
    result = client.reset(seed=42)
    obs = result.observation
    while not obs.done:
        if obs.searches_remaining > 0:
            result = client.step(SearchEconAction(action_type="search", query=obs.question))
        else:
            result = client.step(SearchEconAction(action_type="commit", answer="unknown"))
        obs = result.observation

All episodes are seeded and reproducible from (seed, num_questions, difficulty_mix). No external fixtures needed for the hermetic path; a Ceramic key enables the live retrieval mode.
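To swap in a different policy, the branch in the loop above can be factored into a function that returns the action fields. The sketch below is a hedged illustration: `choose_action` and `score_threshold` are hypothetical, it assumes the observation exposes a `top_score` alongside `searches_remaining`, and the threshold is an arbitrary number rather than the Weitzman reservation value.

```python
def choose_action(searches_remaining: int, top_score: float, question: str,
                  score_threshold: float = 10.0) -> dict:
    """Return keyword arguments for an action: keep searching while evidence looks weak."""
    if searches_remaining > 0 and top_score < score_threshold:
        return {"action_type": "search", "query": question}
    # Out of budget, or the best retrieved score already clears the threshold.
    # "unknown" is a placeholder; a real policy would put the model's answer here.
    return {"action_type": "commit", "answer": "unknown"}


weak_evidence = choose_action(2, 4.0, "Who directed Forrest Gump?")
strong_evidence = choose_action(2, 15.0, "Who directed Forrest Gump?")
out_of_budget = choose_action(0, 4.0, "Who directed Forrest Gump?")
```

Inside the loop, the returned dict would be unpacked into the action constructor, for example `SearchEconAction(**choose_action(...))`, assuming the action model accepts those fields as the loop above suggests.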

Do RL-trained LLMs Rediscover Weitzman's Optimal Stopping Rule from Raw Reward Signal Alone?

The environment is built, Dockerised, OpenEnv-validated, and Ceramic-integrated. The post-training run is the experiment this submission sets up.

Future work

The post-training counterpart is scoped in the migration documents but not yet implemented. SearchEconomicsPT will be a fork of ReasoningEconomicsPT (TRL GRPO + vLLM + accelerate, already wired for distributed training on CARC and Lambda) with the following modifications:

System prompt rewritten for the search task: "Each turn, output JSON: {"type":"search","query":...} or {"type":"commit","answer":...}. Budget: N searches."
format_observation_prompt rewritten to render searches-remaining, top_score, and the rolling context window, instead of the reasoning env's token budget.
apply_action parses the LLM's text output (JSON or plain text) into a structured SearchEconAction.
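Variable episode length. Each question now takes between 1 generate (an immediate commit) and max_searches_per_question + 1 generates (every allowed search, then a commit), not exactly one. The vLLM generation cap DIST_SERVER_GENERATES_PER_EPISODE therefore has to be recalculated as max_searches_per_question · num_questions + num_questions.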
Per-step max-new-tokens capped at around 150 tokens (each decision is short), versus the reasoning env's variable cap that scaled with remaining budget.
Robust JSON parser with fall-through to empty-commit on malformed output. The bad reward from the empty commit acts as the gradient signal that trains away from malformed JSON.
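The parse-then-fall-through behaviour in the last item can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned apply_action: `ParsedAction` and `parse_action` are hypothetical names, the JSON keys follow the system prompt quoted above, and plain-text (non-JSON) answers are not handled.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class ParsedAction:
    """Hypothetical stand-in for the structured action."""
    action_type: str
    query: str = ""
    answer: str = ""
    parse_failed: bool = False


# Grab the outermost {...} span so surrounding chatter does not break parsing.
_JSON_OBJECT = re.compile(r"\{.*\}", re.DOTALL)


def parse_action(text: str) -> ParsedAction:
    match = _JSON_OBJECT.search(text or "")
    if match:
        try:
            payload = json.loads(match.group(0))
        except json.JSONDecodeError:
            payload = None
        if isinstance(payload, dict):
            kind = payload.get("type")
            query = payload.get("query")
            answer = payload.get("answer")
            if kind == "search" and isinstance(query, str) and query.strip():
                return ParsedAction("search", query=query.strip())
            if kind == "commit" and isinstance(answer, str):
                return ParsedAction("commit", answer=answer.strip())
    # Fall through: an empty commit is guaranteed wrong but never crashes the rollout.
    return ParsedAction("commit", answer="", parse_failed=True)


ok_search = parse_action('{"type":"search","query":"Who directed Forrest Gump?"}')
ok_commit = parse_action('Sure! {"type":"commit","answer":"Robert Zemeckis"} Done.')
malformed = parse_action('{"type":"search"')
```

Keeping a `parse_failed` flag on the fallback action is one way to log the parse-failure rate separately from ordinary wrong answers.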

Target model artefacts: Qwen3-8B or Qwen3-14B fine-tuned via GRPO on SearchEconomicsEnv, to be published on the Hugging Face Hub when training completes (no public model URL yet). Evaluation artefacts: a Pareto frontier plot of accuracy versus searches used across checkpoints, plus ablations on β sensitivity, budget ratio, and query-formulation quality. Target publication: NeurIPS 2026 Datasets & Benchmarks or ICLR 2027, with the framing "Do RL-trained LLMs rediscover optimal sequential search? Evidence from a Pandora's Box environment."
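The planned Pareto frontier can be computed from per-checkpoint pairs of (average searches used, accuracy). The sketch below shows one way to do that; the function name is hypothetical and the numbers are made-up placeholders, not measured results.

```python
def pareto_frontier(points):
    """Keep points not dominated by another (fewer searches and higher accuracy are better)."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, visit the most accurate first.
    for searches, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            frontier.append((searches, accuracy))
            best_accuracy = accuracy
    return frontier


# Placeholder (avg_searches, accuracy) pairs for illustration only.
checkpoints = [(1.0, 0.40), (2.0, 0.55), (2.0, 0.50), (3.0, 0.52), (4.0, 0.60)]
frontier = pareto_frontier(checkpoints)
# → [(1.0, 0.40), (2.0, 0.55), (4.0, 0.60)]
```

A checkpoint stays on the frontier only if no cheaper or equally cheap checkpoint is at least as accurate, which is the comparison the accuracy-versus-searches plot is meant to make visible.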

Conclusion

SearchEconomicsEnv reframes an agentic-retrieval problem as a verifiable, theoretically grounded RL task. Ceramic AI as the live information channel, HotpotQA as the question source, and a Weitzman-shaped composite reward give us a sequential MDP where every component is auditable, every reward is grounded in arithmetic, and the optimal policy has a known closed form.

The engineering contributions shipped with the environment (robust answer extraction across raw, JSON, and prefix-formatted model outputs; hermetic fallback client for zero-dependency CI; Pydantic-aware SDK parsing with a regression test; two-stage Docker build with the OpenEnv base image) are the pattern the next multi-turn, verifiable-reward OpenEnv submission will need. The Pydantic zero-results bug in particular is a textbook case of "type-aware versus schema-aware parsing" that any SDK integrator will recognise.

The research question remains open: can a GRPO-trained LLM rediscover the reservation rule Δq* ≈ 0.094 from reward alone, and if so, does it outperform the static threshold baseline at the same average number of searches? The pipeline to answer that question is built, validated, and documented. The next artefact is a trained checkpoint.