A Pandora's-Box RL environment where LLMs learn how often to search, not just how to answer. Multi-hop HotpotQA under a hard search-credit budget, with rewards shaped by Weitzman's 1979 optimal-stopping theorem.
Figure: the search → observe → commit loop. Snippet titles, scores, and the Robert Zemeckis answer shown in the banner are illustrative placeholders for visual clarity, not retrieved Ceramic results or trained-model output.

EnvConfig defaults is an illustrative target, not measured model output.

Modern LLM agents are taught to use tools but almost never taught how often to use them. Retrieval-augmented agents either fire a search on every turn, or never, and the cost of a search is treated as a free lunch. SearchEconomicsEnv changes that. It is an OpenEnv-compliant RL environment where an LLM answers multi-hop HotpotQA questions under a hard search-credit budget, and each Ceramic search costs real reward.
The reward shape is lifted directly from Martin Weitzman's 1979 Pandora's Box model of optimal sequential search: open a box (issue a search) for cost β, or stop and commit. The research question is whether GRPO-trained LLMs rediscover Weitzman's optimal threshold rule from the raw reward signal alone, without ever being told what optimal stopping is. The environment is fully Dockerised, OpenEnv-validated, and Ceramic-AI-integrated.
Every Ceramic search is one Pandora's box: pay β, see what is inside, decide whether to keep opening.
Almost every public RAG benchmark (Natural Questions, HotpotQA, TriviaQA, MS MARCO) ranks systems by retrieval recall or answer F1. None of them measure the marginal value of one more search call. Production RAG is the opposite: every additional Pinecone, Weaviate, Brave, or Ceramic call has measurable latency and dollar cost, and "should I search again or just answer?" is the live question every agent framework has to answer inside inference loops like LangChain agents, OpenAI function-calling, and Claude tool use.
Treating that decision as a first-class learning signal has been studied under names like budgeted MDPs, cost-aware RL, and information-purchasing agents, but always with synthetic environments: gridworlds, token games, mock APIs. To our knowledge there is no publicly released RL environment that connects an LLM agent to a real, live, vendor-graded search API as the only information channel; shapes reward according to a theoretically optimal sequential-search policy (Weitzman's Pandora's Box); and ships in OpenEnv format so any GRPO, PPO, or DPO trainer can plug in. SearchEconomicsEnv is that environment.
Weitzman (1979) studies an agent who must decide which of N boxes to open (opening box i costs c_i and reveals a stochastic prize X_i) before committing. The optimal policy is shockingly clean: compute a reservation value z_i for each box such that E[max(X_i, z_i)] − c_i = z_i, open boxes in decreasing order of z_i, and stop the first time the best observed prize exceeds the highest unopened reservation value. It is one of the few sequential-search problems with a closed-form optimal policy, and the mapping to our environment is direct:
| Pandora's Box | SearchEconomicsEnv |
|---|---|
| Box | One Ceramic search call |
| Cost of opening | −β (negative reward per search) |
| Prize | Information gain about the answer |
| Commit decision | commit action with final answer |
| Reservation value | Implicitly learned by the policy |
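To make the reservation value concrete, here is a minimal sketch that solves Weitzman's condition numerically for a box with a discrete prize distribution. It uses the equivalent form E[max(X − z, 0)] = c of the equation E[max(X, z)] − c = z. The function names, the bisection solver, and the two-point prize distribution are illustrative assumptions, not code from the environment.

```python
from typing import Sequence


def reservation_value(prizes: Sequence[float], probs: Sequence[float],
                      cost: float, tol: float = 1e-9) -> float:
    """Solve E[max(X - z, 0)] = cost for z by bisection (discrete prize X)."""
    if cost <= 0:
        raise ValueError("cost must be positive")

    def expected_gain(z: float) -> float:
        return sum(p * max(x - z, 0.0) for x, p in zip(prizes, probs))

    lo = min(prizes) - cost   # expected_gain(lo) >= cost
    hi = max(prizes)          # expected_gain(hi) == 0 < cost
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_gain(mid) > cost:
            lo = mid          # gain still exceeds the cost: z lies higher
        else:
            hi = mid
    return (lo + hi) / 2.0


def should_stop(best_observed: float,
                unopened_reservation_values: Sequence[float]) -> bool:
    """Weitzman's rule: stop once the best prize beats every unopened box."""
    if not unopened_reservation_values:
        return True
    return best_observed >= max(unopened_reservation_values)


# Hypothetical box: prize 0 or 1 with equal probability, opening cost 0.1.
z = reservation_value(prizes=[0.0, 1.0], probs=[0.5, 0.5], cost=0.1)
```

For this toy box the reservation value is 0.8, since 0.5 · (1 − 0.8) = 0.1. In the environment the prize distribution is not given to the agent, which is why the reservation value has to be learned implicitly.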
Most "LLM + retrieval" and "LLM + tool use" work lands in one of six buckets. None of them occupies the cell we target:
| Existing work | What it does | What it lacks vs. SearchEconomicsEnv |
|---|---|---|
| HotpotQA leaderboard Yang et al., arXiv:1809.09600, EMNLP 2018 | Ranks systems by EM and F1 on a fixed retrieval pipeline | No notion of search cost, no agent-controlled retrieval, no RL signal |
| LangChain ReAct agents | Multi-step tool use including search | No reward, no learning, no cost shaping |
| WebGPT Nakano et al., arXiv:2112.09332, 2021 | RL-trained browsing agent on Bing | Closed-source, no public env, no theoretical reward grounding |
| Toolformer Schick et al., arXiv:2302.04761, 2023 | LLM learns when to call tools via self-supervision | Self-supervised, no per-call cost, no MDP |
| BoxRL / Pandora gym envs | Toy Pandora's Box implementations | No LLM, no real API, no language at all |
| ReasoningEconomicsEnv (our sibling) | Token-budget RL on math problems via MetaMathQA + SymPy | Single-step decision, no real external action, no theoretical optimum |
| SearchEconomicsEnv (ours) | Multi-step MDP, structured search/commit actions, live Ceramic API, Weitzman-shaped composite reward, OpenEnv-compliant | No trained policy yet (future work) |
To our knowledge, no publicly released RL environment combines all three of: a theoretically optimal reward formalism (Weitzman's Pandora's Box), a live vendor search API as the only information channel (Ceramic AI), and the OpenEnv contract so any GRPO, PPO, or DPO trainer can plug in. The environment, action schema, and reward semantics are new; the method (RLVR / GRPO on verifiable signals) is shared with the Recon and negotiation-RLVR lineages.
This project is the second environment in a research line on economic constraints for LLM agents. The first, ReasoningEconomicsEnv, budgets reasoning tokens across a batch of math problems. SearchEconomicsEnv replaces token allocation with sequential search-or-commit decisions, MetaMathQA with HotpotQA, and local DeepSeek inference with the live Ceramic Search API. The conceptual shift is from a one-step budgeted regression to a genuine stopping-problem MDP.
An OpenEnv-native MDP where an LLM agent decides, at every step, whether to issue another Ceramic search or commit a final answer. A shared search budget and per-step cost turn the problem into a real sequential stopping problem, not a bandit.
Each episode plays out like this:
1. reset() samples the episode's questions from HotpotQA, stratified by difficulty (default mix {easy: 0.3, medium: 0.4, hard: 0.3}).
2. Each question is embedded with sentence-transformers/all-MiniLM-L6-v2 (or a deterministic hash encoder as fallback).
3. At every step the agent emits a search or commit action. A search fires a Ceramic call, returns snippets and their scores, and pays −β. A commit grades the answer with EM and token-F1, pays the composite commit reward, and advances to the next question.
4. The episode shares a pooled budget of int(search_budget_ratio × num_questions) credits (default 30 searches for 10 questions), so frugality on easy questions buys effort on hard ones.
5. done=True is reachable in two ways: all questions committed, or budget burned.

The conceptual diff from the sibling reasoning environment is the cleanest summary of what changed:
| Aspect | ReasoningEconomicsEnv | SearchEconomicsEnv |
|---|---|---|
| Agent decision | How many tokens to allocate | When to search vs. commit |
| Episode MDP | 1 step() per question | N step()s per question |
| Action | response: str | action_type + query or answer |
| Budget unit | Tokens (10–800) | Searches (pooled, ~3× num_questions) |
| Info source | Local DeepSeek inference | Ceramic Search API (+ fallback) |
| Dataset | MetaMathQA + NuminaMath-TIR | HotpotQA (difficulty-stratified) |
| Grading | SymPy symbolic + numeric | EM + token-F1, robust answer extraction |
| Reward shape | Correctness ± token cost | Per-step search cost + composite commit reward |
Two of those rows do most of the conceptual work. First, 1 step() to N step()s per question turns a one-shot allocation into a genuine sequential-decision MDP. Second, DeepSeek to Ceramic replaces a local model that generates solutions with a live retrieval API that returns documents; the agent now has to formulate its own answer from snippets, and the environment never produces an answer for it.
The core contract is two Pydantic types exchanged over the OpenEnv WebSocket:
```python
from typing import Literal

from pydantic import BaseModel


class SearchResult(BaseModel):
    # Assumed shape, inferred from the title/url/description/score comment below
    title: str
    url: str
    description: str
    score: float


# Action (agent → env)
class SearchEconAction(BaseModel):
    action_type: Literal["search", "commit"]
    query: str | None = None   # required when action_type == "search"
    answer: str | None = None  # required when action_type == "commit"


# Observation (env → agent)
class SearchEconObservation(BaseModel):
    # Question
    question: str
    question_embedding: list[float]  # 384-dim
    question_idx: int
    question_done: bool
    # Budget
    searches_remaining: int
    searches_used_this_question: int
    max_searches_per_question: int
    budget_remaining_ratio: float
    # Last search results (empty on reset / after commit)
    ceramic_results: list[SearchResult]  # title/url/description/score
    top_score: float
    score_variance: float
    search_latency_s: float
    # Accumulated context for the current question
    context_window: list[str]  # max 5 snippets, 300 chars each
    # Episode tracking
    step_idx: int
    questions_remaining: int
    accuracy_so_far: float
    history: list[dict]  # per-commit EM / F1 / quality / mode
    # Plumbing
    done: bool
    reward: float | None
    metadata: dict
```
The action is intentionally schema-strict. model_post_init raises ValueError if the field required by the chosen action_type is unset; a malformed action from the trainer's JSON parser falls through to a commit with an empty answer (guaranteed wrong, charges no extra search). This is what lets a GRPO trainer treat the LLM's output as a parsable signal and turn "your JSON was malformed" into negative reward without any special-case logic.
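The fall-through behaviour can be sketched with the standard library alone. This is a hedged illustration, not the trainer's actual parser: the function name parse_action, the EMPTY_COMMIT constant, and the choice to accept either a "type" or an "action_type" key are assumptions made for the example.

```python
import json
from typing import Any

# Guaranteed-wrong commit used whenever the model output cannot be parsed.
EMPTY_COMMIT: dict[str, Any] = {"action_type": "commit", "answer": ""}


def parse_action(text: str) -> dict[str, Any]:
    """Map raw LLM text to an action dict, falling through to an empty commit."""
    try:
        obj = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return dict(EMPTY_COMMIT)
    if not isinstance(obj, dict):
        return dict(EMPTY_COMMIT)

    kind = obj.get("type") or obj.get("action_type")
    query, answer = obj.get("query"), obj.get("answer")
    if kind == "search" and isinstance(query, str) and query.strip():
        return {"action_type": "search", "query": query.strip()}
    if kind == "commit" and isinstance(answer, str):
        return {"action_type": "commit", "answer": answer}
    # Wrong discriminator or missing required field: no search is charged.
    return dict(EMPTY_COMMIT)
```

Because every failure path returns the same empty commit, the reward function never needs a separate "parse error" branch: the malformed output is simply graded as a wrong answer.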
The observation is rich because the agent needs both the question content and the search-state context to make a Weitzman-rational decision. The schema is deliberately large so that the same observation drives both a learned RL policy and a hand-coded threshold baseline.
Training an LLM with GRPO means the environment receives text, not a dataclass. Extraction has to be tolerant to reasoning traces, structured JSON, or bare answers, and it has to be deterministic and fast enough to run inside a reward function.
env/answer_grading.py tries, in order: (a) strip markdown code fences (```json, ```) and retry; (b) parse the remainder as JSON and read obj["answer"], including when obj["type"] == "commit"; (c) take the first line matching ^Answer: or ^Final answer: (case-insensitive); (d) take the last non-empty line of the raw text. Anything that still fails to extract a non-empty string falls through to a wrong-commit with q = 0.

The gold answer is the answer field of the sampled HotpotQA row. Exact Match is exact normalised equality. Token-F1 is multiset overlap on the same normalised tokens. There is no semantic scorer, no embedding similarity, and no LLM judge in the default grader, by design: GRPO throughput cannot tolerate LLM-judge latency in the inner loop.

Answer normalisation is known to be insufficient for some surface-form variants (Zemeckis vs Robert Zemeckis, US vs United States). v1 ships with deterministic EM + token-F1 only, which is the right default for a training-loop reward function but is known to under-credit semantically correct answers; the composite reward's partial-F1 term exists precisely to soften this.

Over an episode, the observation evolves as follows:

- On reset, context_window = [].
- After a search, the observation carries top_score and score_variance. Reward = −β.
- After a further search, context_window now holds 2 snippets and searches_remaining decrements.
- After a commit, question_idx advances.
- done=True on the terminal step.

The per-step reward has two modes. Every search pays a flat cost:

$$r_{\text{search}} = -\beta$$
Default β = 0.1. Every search costs the same, regardless of whether it returned useful information. This matches Weitzman's assumption that opening costs are paid up-front.
Every commit pays the composite commit reward:

$$r_{\text{commit}} = R_{\text{wrong}} + q \cdot (R_{\text{right}} - R_{\text{wrong}}) + \eta \cdot \gamma \cdot \frac{B_t}{B_0}$$

where

- q ∈ [0, 1] is the grader quality (1.0 if exact match, else token-F1);
- η = 1 iff q ≥ q_min (default q_min = 1.0, so only full-EM commits earn the efficiency bonus);
- B_t / B_0 is the fraction of search credits remaining;
- and γ = 0.1 is the efficiency bonus weight.
Setting the expected marginal benefit of one more search equal to its certain cost gives an indifference threshold. Each additional search improves quality by Δq, worth Δq · (R_right − R_wrong) = 1.1 · Δq in expected reward. It costs β = 0.1 directly, plus a lost efficiency bonus of γ / B_0 ≈ 0.003 for B_0 = 30. Solving gives:

$$\Delta q^{*} = \frac{\beta + \gamma / B_0}{R_{\text{right}} - R_{\text{wrong}}} = \frac{0.1 + 0.1 / 30}{1.1} \approx 0.094$$
A rational agent should keep searching whenever it expects to improve quality by about 9 percentage points or more. This is the Weitzman reservation value, made concrete on our defaults.
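The arithmetic can be checked with a short sketch. It assumes the composite commit reward has the form R_wrong + q · (R_right − R_wrong) + η · γ · B_t / B_0, which is the form the worked totals elsewhere in this post imply, and it leaves partial_reward_scale at its default of 1.0. The class and function names are illustrative, not the environment's own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardParams:
    beta: float = 0.1      # flat per-search cost
    gamma: float = 0.1     # efficiency-bonus weight
    r_right: float = 1.0   # R_right
    r_wrong: float = -0.1  # R_wrong
    q_min: float = 1.0     # quality needed to earn the efficiency bonus
    b0: int = 30           # pooled budget B_0


def search_reward(p: RewardParams) -> float:
    return -p.beta


def commit_reward(q: float, budget_left: int, p: RewardParams) -> float:
    eta = 1.0 if q >= p.q_min else 0.0
    bonus = eta * p.gamma * budget_left / p.b0
    return p.r_wrong + q * (p.r_right - p.r_wrong) + bonus


def indifference_threshold(p: RewardParams) -> float:
    """Quality gain at which one more search exactly pays for itself."""
    return (p.beta + p.gamma / p.b0) / (p.r_right - p.r_wrong)


P = RewardParams()
threshold = indifference_threshold(P)
# Two searches followed by an exact-match commit with 28 credits left.
two_search_total = 2 * search_reward(P) + commit_reward(1.0, 28, P)
```

On the defaults, threshold evaluates to roughly 0.094 and two_search_total to roughly 0.893, matching the figures quoted in the text.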
A legacy binary grading mode (commit_reward_mode="legacy_binary") restores a strict normalised-equality match without answer extraction, for ablations that isolate how much of the agent's improvement comes from partial-credit shaping versus genuine accuracy gains.
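The extraction cascade and the deterministic EM / token-F1 grader described earlier can be sketched as follows. This is a minimal approximation, not the contents of env/answer_grading.py: the normalisation step (lowercasing, stripping punctuation and English articles) is an assumption borrowed from common QA evaluation practice, and all function names are hypothetical.

```python
import json
import re
import string
from collections import Counter


def extract_answer(text: str) -> str:
    """Fences -> JSON "answer" -> "Answer:" line -> last non-empty line."""
    body = re.sub(r"```[a-zA-Z]*", "", text).strip()
    try:
        obj = json.loads(body)
        if isinstance(obj, dict) and isinstance(obj.get("answer"), str):
            return obj["answer"].strip()
    except json.JSONDecodeError:
        pass
    for line in body.splitlines():
        m = re.match(r"^\s*(?:final answer|answer)\s*:\s*(.+)$", line, re.IGNORECASE)
        if m:
            return m.group(1).strip()
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def normalise(s: str) -> str:
    # Assumed normalisation: lowercase, drop punctuation and articles.
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def token_f1(pred: str, gold: str) -> float:
    p, g = normalise(pred).split(), normalise(gold).split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())  # multiset overlap
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def quality(pred: str, gold: str) -> float:
    """q = 1.0 on exact match after normalisation, else token-F1."""
    return 1.0 if normalise(pred) == normalise(gold) else token_f1(pred, gold)
```

Under these assumptions, committing "Zemeckis" against the gold answer "Robert Zemeckis" scores q ≈ 0.67 rather than 0, which is the partial-credit behaviour the legacy binary mode switches off.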
EnvConfig defaults (full)

The complete set of knobs exposed by SearchEconomicsEnv/env/config.py, with the values the environment ships with. These are the numbers the abstract constants above resolve to.
| Field | Default | Notes |
|---|---|---|
| num_questions | 10 | Questions per episode; shared-budget denominator |
| max_searches_per_question | 5 | Hard cap before force-commit on a single question |
| search_budget_ratio | 3.0 | Pooled budget B_0 = ratio × num_questions (default 30) |
| use_shared_budget | True | Frugality on easy questions funds hard ones |
| dataset | hotpotqa | HuggingFace dataset identifier |
| dataset_split | train | Split loaded on reset() |
| difficulty_mix | {easy:0.3, medium:0.4, hard:0.3} | Stratified sampling proportions per episode |
| ceramic_api_key | "" | Empty string activates the deterministic fallback client |
| ceramic_timeout_s | 10.0 | Per-request timeout for live Ceramic calls |
| max_results_per_search | 10 | Snippets returned per successful search |
| beta | 0.1 | Flat per-search cost |
| gamma | 0.1 | Efficiency-bonus weight on B_t / B_0 |
| correct_reward | +1.0 | R_right in the commit reward |
| incorrect_reward | −0.1 | R_wrong floor |
| grade_count_correct_mode | em_only | Criterion for the correct count metric (not the reward) |
| f1_count_threshold | 0.85 | Token-F1 threshold when grade_count_correct_mode is permissive |
| commit_reward_mode | composite | composite uses the shaped formula; legacy_binary for ablations |
| efficiency_bonus_min_quality | 1.0 | q_min; only full-EM commits earn the bonus |
| partial_reward_scale | 1.0 | Multiplier on the q · (R_right − R_wrong) term |
| embedding_model | all-MiniLM-L6-v2 | Sentence-Transformers model for top-score computation |
| max_context_snippets | 5 | Rolling context window size across searches |
| snippet_max_chars | 300 | Character truncation per snippet before appending |
| seed | None | Episode-level RNG seed; None = environment-chosen |
A single SearchEconAction looks flat on the wire: a discriminator plus one string. Structurally, though, the agent is learning two policies that share a single scalar objective. Understanding the decomposition is the clearest way to see what the environment measures.
On every observation the agent chooses between two decisions: (a) formulate a query and issue another search, or (b) commit an answer and move on. That dual choice is encoded as one SearchEconAction with an action_type discriminator, but it decomposes cleanly into two learned sub-policies riding inside the same LLM weights.
search vs commit

Deciding when to stop searching is the classic Weitzman reservation problem. The agent has to learn when the expected marginal quality gain Δq from another search falls below the reservation threshold Δq* = (β + γ / B_0) / (R_right − R_wrong) ≈ 0.094 on the defaults. It gets no direct feature telling it "you know enough to answer"; it has to infer that from top_score, score_variance, what is already sitting in the context_window, the remaining shared budget budget_remaining_ratio, and its own internal representation of the question.

query

Conditional on choosing to search, the agent also picks the query string. A good policy learns two related sub-behaviours: (i) query compression, emitting a bridging entity or a short factual sub-question rather than a verbatim copy of the original multi-hop question; and (ii) multi-hop decomposition, using the result of turn N to shape the query at turn N+1. Both are measurable from episode logs at evaluation time: average query token count, semantic similarity between consecutive queries within an episode, and the distribution of unique bridging entities per question.
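Two of those log-level metrics can be sketched in a few lines. The log format (a list of query strings per question) is hypothetical, and Jaccard token overlap is used here only as a cheap lexical stand-in for the semantic similarity mentioned above; an embedding-based cosine similarity would be the closer match.

```python
def mean_query_tokens(queries: list[str]) -> float:
    """Average whitespace-token count across the queries of one question."""
    if not queries:
        return 0.0
    return sum(len(q.split()) for q in queries) / len(queries)


def consecutive_jaccard(queries: list[str]) -> list[float]:
    """Token-set overlap between each query and the one issued just before it."""
    sims = []
    for prev, cur in zip(queries, queries[1:]):
        a, b = set(prev.lower().split()), set(cur.lower().split())
        union = a | b
        sims.append(len(a & b) / len(union) if union else 1.0)
    return sims


# Queries taken from the target schematic later in this post.
decomposed = [
    "1994 Academy Award Best Picture winner",
    "Forrest Gump director 1994",
]
avg_tokens = mean_query_tokens(decomposed)
overlaps = consecutive_jaccard(decomposed)
```

A policy that copies the question verbatim on every turn would show consecutive overlap of 1.0, while a decomposing policy like the one above shows low overlap (here 1/9, the shared token being "1994").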
Both sub-policies share the same optimiser: maximise expected episode reward. The reward function makes the trade-off concrete. Every extra search costs β; every unused search credit on a correct commit pays γ · B_t / B_0; a correct-versus-wrong commit pays R_right − R_wrong = 1.1. The agent that maximises expected reward is, by construction, the agent that answers the most questions correctly using the fewest searches.
The open question is not whether such a policy exists. Weitzman tells us it does, with a known closed form. The question is whether GRPO on raw reward discovers it, and how that policy factorises, inside a single set of LLM weights, across stopping versus query formulation.
Ceramic AI is the retrieval partner on this submission, and the deal is mutual. Their search API is the only information channel available to a trained agent, so every gradient step is, implicitly, a statement about which Ceramic results drive downstream task accuracy.
| What Ceramic gets | What we get |
|---|---|
| First RL benchmark measuring downstream task value of their search API | Live, non-mockable agentic search. No synthetic simulation. |
| Real stress-test of search quality under cost-constrained RL dynamics | Publishable environment grounded in economic theory (Weitzman 1979). |
| Quantitative marketing asset of the form "X quality points per search credit" | Ceramic API key and rate-limit sponsorship for training runs. |
| Every gradient update implicitly signals which results drive task accuracy | Primary-authorship credit on the HuggingFace blog (OpenEnv track requirement). |
Three differentiators make this a competitive OpenEnv submission rather than yet another dataset wrapper: a theoretically grounded reward function (the discretised Weitzman cost-per-box formula, not a heuristic); a vendor partnership where a paying product is in the training loop; and a direct deployment relevance where every result shows up as a metric a production RAG team would put on a slide. See Ceramic AI and the Search API quickstart for product and integration details.
Production retrieval-augmented generation is defined, above all else, by the fact that every call to the search API has non-trivial latency (on the order of seconds), metered cost, and a non-zero failure rate. An environment that mocks the search API out of that reality is measuring a different problem: the combinatorics of query phrasing against a static index, not the economics of an agent making decisions under real API risk.
SearchEconomicsEnv leaves that risk in the loop. Every step(search) in the live configuration is a real HTTP call to Ceramic, subject to the real ceramic_timeout_s = 10.0 timeout, real per-key rate limits, and a real probability of returning low-quality or sparsely-populated snippets. The −β search cost charges the agent regardless of outcome, which is the economically correct thing to do: a deployed RAG agent pays its vendor for failed and empty retrievals the same as for successful ones.
What the agent has to learn, therefore, is not just a policy over which query to issue but a policy under API risk. The training signal rewards agents that stop before a marginal search that is likely to time out, return nothing useful, or exhaust the rate limit quota for the episode. That risk-aware stopping behaviour is the exact thing today's production agent frameworks (ReAct-style loops, plan-and-execute scaffolds) cannot express and cannot optimise for, because their search calls are treated as free oracle invocations at evaluation time.
The environment is two strictly separated pieces: SearchEconomicsEnv (this repo, the OpenEnv environment) and SearchEconomicsPT (the future GRPO training client, forked from ReasoningEconomicsPT). They communicate exclusively over the OpenEnv WebSocket, no in-process imports.
```mermaid
flowchart LR
    subgraph PT ["SearchEconomicsPT (future, TRL GRPO + vLLM)"]
        GRPO["GRPOTrainer<br/>TRL 1.0"]
        RF["rollout_func"]
        VLLM["vLLM<br/>colocate/server"]
        PARSE["action_parser<br/>JSON + guardrails"]
    end
    subgraph ENV ["SearchEconomicsEnv (OpenEnv)"]
        WS["FastAPI<br/>WebSocket"]
        MDP["Search/Commit<br/>MDP"]
        GRADE["Grader<br/>EM + token-F1"]
        REW["Reward<br/>-β / composite"]
    end
    subgraph EXT ["External"]
        CER["Ceramic API"]
        FB["FallbackClient<br/>hash-seeded"]
        HQA["HotpotQA<br/>loader"]
    end
    GRPO --> RF
    RF --> VLLM
    VLLM -->|"generate"| PARSE
    PARSE -->|"JSON action"| WS
    WS --> MDP
    MDP -->|"search"| CER
    MDP -.->|"no key"| FB
    CER -->|"snippets + scores"| MDP
    FB -->|"snippets + scores"| MDP
    HQA -->|"stratified batch"| MDP
    MDP --> GRADE
    GRADE --> REW
    REW -->|"observation + reward"| WS
    WS -->|"obs"| RF
    style PT stroke-dasharray: 5 5
```
Figure 1. System architecture. The environment is live today. The training client (dashed) is a planned fork of ReasoningEconomicsPT. Everything crosses the WebSocket, per OpenEnv contract.
The server-side wiring is straightforward but pinned carefully: server/app.py uses openenv.core.env_server.create_app(...) with max_concurrent_envs=64, so concurrent users get isolated episodes. env_config_for_server() reads either CERAMIC_API_KEY or the HuggingFace-Spaces convention SEE_CERAMIC_API_KEY, and patches it into the EnvConfig. If neither is set, the server silently uses the deterministic FallbackCeramicClient. The Docker build is a two-stage uv sync on top of ghcr.io/meta-pytorch/openenv-base:latest.
Three baselines ship in baselines/, each under 30 lines. Together they bracket the achievable reward region: any policy worse than NoSearchBaseline is broken; any policy that uses more search than AlwaysSearchBaseline cannot exist; any policy that beats ThresholdBaseline at the same average number of searches has learned something Weitzman's reservation rule cannot exploit.
| Baseline | Policy | What it isolates |
|---|---|---|
| NoSearchBaseline | Commits empty on every question | Lower bound on episode reward; verifies the loss floor is finite |
| AlwaysSearchBaseline | Searches with the raw question verbatim until the per-question cap, then commits empty | Upper bound on search cost; verifies that budget enforcement actually fires |
| ThresholdBaseline(τ=10.0) | Searches while top_score < τ, then commits using the first context-window snippet truncated to 50 chars | Approximation to Weitzman's reservation rule. The threshold is a tunable hyperparameter, sweep {5, 10, 15, 20} for a static-policy Pareto frontier. |
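As a rough illustration of how small these baselines are, the threshold policy from the table can be sketched as a pure function of the observation. The Obs dataclass is a trimmed, hypothetical stand-in for SearchEconObservation, and the budget and per-question cap guards are assumptions added so the sketch never requests a search it cannot pay for.

```python
from dataclasses import dataclass, field


@dataclass
class Obs:
    # Trimmed stand-in for SearchEconObservation (illustrative only).
    question: str
    top_score: float = 0.0
    context_window: list[str] = field(default_factory=list)
    searches_remaining: int = 30
    searches_used_this_question: int = 0
    max_searches_per_question: int = 5


def threshold_policy(obs: Obs, tau: float = 10.0) -> dict:
    """Search while top_score < tau, then commit the first snippet (50 chars)."""
    can_search = (
        obs.searches_remaining > 0
        and obs.searches_used_this_question < obs.max_searches_per_question
    )
    if obs.top_score < tau and can_search:
        return {"action_type": "search", "query": obs.question}
    answer = obs.context_window[0][:50] if obs.context_window else ""
    return {"action_type": "commit", "answer": answer}
```

Sweeping tau over {5, 10, 15, 20} with a function like this is enough to trace the static-policy Pareto frontier described in the table.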
A successfully post-trained policy should demonstrate four qualitatively different behaviors from a base LLM that either always searches or always answers:
- Commits when top_score crosses a learned threshold, not after a fixed number of searches. Measurable as commit-step-distribution becoming bimodal: high-score-fast-commit versus low-score-extended-search.
- Spends more searches on hard questions (low top_score) and commits fast on easy ones to bank credits. Measurable as positive correlation between the HotpotQA difficulty label and per-question search count.

Demonstrating each one with a side-by-side trace (base versus post-trained) is the shape of the money-shot figure for the eventual paper.
Two schematics of episode shape on the multi-hop question "Who directed the film that won the 1994 Academy Award for Best Picture?" (gold answer: Robert Zemeckis). These are not real rollouts. The numbers below (top_score, per-step reward) are illustrative targets computed from the reward formula under hypothetical q and B_t/B_0 values. They show the structural difference between a policy that has internalised Δq* and one that has not.
What an untuned agent that treats search as free and copies the question verbatim on every call would look like.
"Who directed the film that won the 1994 Academy Award for Best Picture?" → top_score = 6.3, 3 snippets, all tangential. Reward −0.1.
top_score oscillates in [5.9, 6.4]. Hits the per-question cap max_searches_per_question = 5.
R_wrong = −0.1.
forced=True history flags.
−5 · β − 10 · 0.1 = −1.5. The agent has learned nothing about when to stop, because it treats every search as costless.
What a converged post-trained policy should look like in principle: decompose the question into a bridge lookup, then a director lookup, then commit with plenty of budget left. No trained model has produced this trace; it is the target the reward function points at.
"1994 Academy Award Best Picture winner" → top_score = 12.8, top snippet names Forrest Gump. Reward −0.1.
"Forrest Gump director 1994" → top_score = 14.2, top snippet names Robert Zemeckis. Reward −0.1.
{"type":"commit", "answer":"Robert Zemeckis"} → exact match (q = 1), efficiency bonus η · γ · 28/30 ≈ 0.093.
−0.1 + (−0.1) + (R_wrong + 1.0 · 1.1 + 0.093) = −0.2 + 1.093 = 0.893. Two searches, one commit, net positive reward, and 28 credits banked for the next nine questions.
The difference between these schematics is the thing the environment is designed to teach: whether the agent has internalised the reservation value Δq*. Whether a GRPO-trained LLM actually lands on the target schematic is an open empirical question, blocked today on compute, not on environment readiness.
A real-world debugging story always lands. Mid-implementation, when we first ran the environment against a live Ceramic API key, every search returned zero results. Fallback tests passed. Hermetic tests passed. Integration tests appeared to "work"; they just always saw len(resp.results) == 0.
The ceramic_ai SDK (v1.2.1) returns a SearchResponse Pydantic model whose structure is:
```
SearchResponse
  .request_id: str
  .result: Result                          # Pydantic, NOT a dict
    .results: list[ResultResult]
    .search_metadata: ResultSearchMetadata
      .execution_time: float
    .total_results: int

ResultResult
  .title: str
  .url: str
  .description: str
  .score: float
```
Our initial parser, written against the SDK's documented JSON schema before we had a key, treated it as a nested dict because that is what the docs show as the response shape:
```python
# Buggy: treats a Pydantic model as a dict tree
if hasattr(raw, "__dict__"):
    raw = raw.__dict__
result_block = raw.get("result", {}) if isinstance(raw, dict) else {}
raw_results = result_block.get("results", []) if isinstance(result_block, dict) else []
```
The failure mode is subtle. raw.__dict__ on a Pydantic model gives {"result": <Result object>, "request_id": "..."}. result_block is then a Result object, so isinstance(result_block, dict) returns False, the else [] branch fires, and you get an empty results list on every call. The fallback client "worked" by accident because dataclasses do have a usable __dict__.
Switch to attribute access on the Pydantic hierarchy directly, with a defensive fallback for schema drift:
# Fixed: type-aware attribute access with defensive fallback
try:
result_obj = raw.result
raw_results = result_obj.results or []
exec_time = result_obj.search_metadata.execution_time
total = result_obj.total_results
except AttributeError:
logger.warning("Unexpected Ceramic response shape: %s", type(raw))
raw_results, exec_time, total = [], elapsed, 0
We also pass max_results and timeout through to the SDK on construction and on the search() call, so the client no longer over-fetches and truncates in Python. The fixed code lives in SearchEconomicsEnv/ceramic/ceramic_client.py.
tests/test_ceramic_client.py::test_ceramic_client_parses_sdk_pydantic_response constructs a real SearchResponse with a Result holding a ResultResult and a ResultSearchMetadata, monkeypatches ceramic_ai.Ceramic with a fake that returns it, and asserts every parsed field. It would have failed against the original parser and prevents the regression from coming back.
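The failure mode is easy to reproduce without the SDK. The sketch below uses plain Python classes as stand-ins for the Pydantic models (an assumption made so the example has no dependencies); the class and function names are hypothetical, and the two parsers mirror the buggy and fixed snippets above in reduced form.

```python
class FakeItem:
    def __init__(self, title: str, score: float) -> None:
        self.title, self.score = title, score


class FakeResult:
    def __init__(self, results: list) -> None:
        self.results = results


class FakeResponse:
    # Stand-in for SearchResponse: attributes hold objects, not dicts.
    def __init__(self, request_id: str, result: FakeResult) -> None:
        self.request_id, self.result = request_id, result


def parse_buggy(raw) -> list:
    """Dict-tree parsing: only the top level is ever converted to a dict."""
    if hasattr(raw, "__dict__"):
        raw = raw.__dict__
    result_block = raw.get("result", {}) if isinstance(raw, dict) else {}
    # result_block is a FakeResult object here, so this branch yields [].
    return result_block.get("results", []) if isinstance(result_block, dict) else []


def parse_fixed(raw) -> list:
    """Attribute access, with a defensive fallback for unexpected shapes."""
    try:
        return list(raw.result.results or [])
    except AttributeError:
        return []


resp = FakeResponse("req-1", FakeResult([FakeItem("Forrest Gump", 12.8)]))
buggy_count = len(parse_buggy(resp))
fixed_count = len(parse_fixed(resp))
```

The buggy parser returns an empty list for a response that plainly contains one result, without raising, which is why the original bug survived every test that did not construct a real response object.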
Any integrator working against ceramic_ai, openai, anthropic, or any other Pydantic-based SDK can hit this exact failure mode. Pydantic SDKs from FastAPI services often look like dicts in docs and behave like objects in memory. Parse by attribute, test against real constructors.
The environment also ships a FallbackCeramicClient, a SHA-256-seeded deterministic stub with the same interface as the real client. Synthetic scores live in [1.0, 20.0] with deterministic dispersion across queries. Without this pattern, the test suite would require a live Ceramic key to pass, CI in HF Spaces would break on every cold start before the secret is injected, and reproducing results would require every reader to have a Ceramic account. With it, the environment is runnable end-to-end with zero external dependencies, which is what makes the OpenEnv submission self-contained.
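The core idea of a hash-seeded stub fits in a few lines. This is a sketch of the pattern only, not the FallbackCeramicClient itself: the function name, the "query|rank" hashing scheme, and the default of three results are assumptions; the [1.0, 20.0] score range is taken from the description above.

```python
import hashlib


def fallback_scores(query: str, n: int = 3,
                    lo: float = 1.0, hi: float = 20.0) -> list[float]:
    """Deterministic pseudo-scores in [lo, hi], derived from SHA-256 of the query."""
    scores = []
    for rank in range(n):
        digest = hashlib.sha256(f"{query}|{rank}".encode("utf-8")).digest()
        # Map the first 8 bytes of the digest to a fraction in [0, 1].
        frac = int.from_bytes(digest[:8], "big") / (2**64 - 1)
        scores.append(lo + frac * (hi - lo))
    return sorted(scores, reverse=True)  # best score first, like a ranked result list


first = fallback_scores("Forrest Gump director 1994")
second = fallback_scores("Forrest Gump director 1994")
```

Because the scores depend only on the query string, the same query produces the same result list on every machine and every run, which is what keeps the hermetic test path reproducible.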
Shipping a vendor-API-backed RL environment means the failure surface extends beyond our own code. This register is the complete list of known risks we have mitigations for, transcribed from the migration plan. It is published here rather than hidden in internal docs because a reviewer reading this post should know exactly what can go wrong in a real training run, and what is already in place to contain it.
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Ceramic API rate limits during training | High | Training stalls | Deterministic FallbackCeramicClient for dev and CI; rate-limit increase from Ceramic for production runs |
| Ceramic latency of 5–10 s per search step | High | 10× slower episode collection | Batch where possible, consider an async client, profile on-path; ceramic_timeout_s = 10.0 bounds the worst case |
| HotpotQA answer normalisation insufficient (e.g. Zemeckis vs Robert Zemeckis) | Medium | False negatives during grading | Token-F1 fallback already shipped; composite reward absorbs short-string mismatches via the q term; LLM-as-judge planned for v2 |
| LLM emits malformed JSON actions | High | Wasted training signal | Robust parser with fall-through to an empty commit (guaranteed wrong, not a crash); parse-failure rate logged as a first-class metric |
| vLLM generation cap mismatch across distributed ranks | Medium | Hung GPUs on variable-length episodes | Recompute DIST_SERVER_GENERATES_PER_EPISODE as max_searches_per_question × num_questions + num_questions |
| sentence-transformers slow on CPU in Docker | Low | Slow reset() in production | Switch to _FallbackEncoder in deployments where top-score embedding quality is not load-bearing |
| Pydantic SDK schema drift breaks the response parser | Low | Silent zero-result returns (the exact bug fixed in the engineering section above) | Defensive try / except around attribute access; regression test constructs a real SearchResponse and asserts every parsed field |
| Foundation | Role in this project | Citation |
|---|---|---|
| Weitzman's Pandora's Box | Closed-form optimal stopping rule; source of the reservation-value framing and the per-step −β cost | Weitzman, M. L. (1979). Optimal Search for the Best Alternative. Econometrica 47(3), 641–654. DOI |
| HotpotQA | Multi-hop question source; loaded via HuggingFace datasets as hotpot_qa / fullwiki | Yang et al., HotpotQA: A Dataset for Diverse, Explainable Multi-hop QA, EMNLP 2018. arXiv:1809.09600 · Dataset card |
| OpenEnv | WebSocket environment contract, Space deployment, create_app FastAPI factory | meta-pytorch/OpenEnv |
| TRL + GRPO | Planned training path for SearchEconomicsPT: critic-free RL with group-relative advantages | Shao et al., DeepSeekMath. arXiv:2402.03300 |
| Ceramic AI | Live search API and partnership; every environment step's information channel | ceramic.ai · Search API quickstart · ceramic_ai on PyPI |
| ReasoningEconomicsEnv / PT | Sibling project. Structural template for the two-repo split, the rollout function pattern, and the DDP padding strategy | Same monorepo |
```bash
# 1. Local dev install
cd SearchEconomicsEnv && uv sync --all-extras

# 2. Hermetic unit tests (no network, no API key)
uv run pytest tests/ -v -m "not integration"

# 3. Live Ceramic integration tests
export CERAMIC_API_KEY="your-key-here"
uv run pytest tests/ -v -m integration

# 4. Run the OpenEnv FastAPI server
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000

# 5. Validate against the OpenEnv spec
uv run openenv validate

# 6. Build and run the Docker container
docker build -t search-economics-env:latest -f server/Dockerfile .
docker run -p 8000:8000 -e CERAMIC_API_KEY=... search-economics-env:latest

# 7. Push to HuggingFace Spaces
openenv push
```
Canonical client-side episode loop (replace the if branch with an LLM call to plug in any policy):
```python
from client import SearchEconomicsEnvClient
from env.models import SearchEconAction

with SearchEconomicsEnvClient(base_url="http://localhost:8000") as client:
    result = client.reset(seed=42)
    obs = result.observation
    while not obs.done:
        if obs.searches_remaining > 0:
            # Placeholder policy: search with the raw question while credits remain
            result = client.step(SearchEconAction(action_type="search", query=obs.question))
        else:
            # Out of credits: commit a dummy answer to advance the episode
            result = client.step(SearchEconAction(action_type="commit", answer="unknown"))
        obs = result.observation
```
All episodes are seeded and reproducible from (seed, num_questions, difficulty_mix). No external fixtures needed for the hermetic path; a Ceramic key enables the live retrieval mode.
The post-training counterpart is scoped in the migration documents but not yet implemented. SearchEconomicsPT will be a fork of ReasoningEconomicsPT (TRL GRPO + vLLM + accelerate, already wired for distributed training on CARC and Lambda) with the following modifications:
- A system prompt that instructs the model to emit {"type":"search","query":...} or {"type":"commit","answer":...}, ending with "Budget: N searches."
- format_observation_prompt rewritten to render searches-remaining, top_score, and the rolling context window, instead of the reasoning env's token budget.
- apply_action parses the LLM's text output (JSON or plain text) into a structured SearchEconAction.
- DIST_SERVER_GENERATES_PER_EPISODE has to be recalculated as max_searches_per_question · num_questions + num_questions.

Target model artefacts: Qwen3-7B or Qwen3-14B fine-tuned via GRPO on SearchEconomicsEnv, to be published on the Hugging Face Hub when training completes (no public model URL yet). Evaluation artefacts: a Pareto frontier plot of accuracy versus searches used across checkpoints, plus ablations on β sensitivity, budget ratio, and query-formulation quality. Target publication: NeurIPS 2026 Datasets & Benchmarks or ICLR 2027, with the framing "Do RL-trained LLMs rediscover optimal sequential search? Evidence from a Pandora's Box environment."
SearchEconomicsEnv reframes an agentic-retrieval problem as a verifiable, theoretically grounded RL task. Ceramic AI as the live information channel, HotpotQA as the question source, and a Weitzman-shaped composite reward give us a sequential MDP where every component is auditable, every reward is grounded in arithmetic, and the optimal policy has a known closed form.
The engineering contributions shipped with the environment (robust answer extraction across raw, JSON, and prefix-formatted model outputs; hermetic fallback client for zero-dependency CI; Pydantic-aware SDK parsing with a regression test; two-stage Docker build with the OpenEnv base image) are the pattern the next multi-turn, verifiable-reward OpenEnv submission will need. The Pydantic zero-results bug in particular is a textbook case of "type-aware versus schema-aware parsing" that any SDK integrator will recognise.
The research question remains open: can a GRPO-trained LLM rediscover the reservation rule Δq* ≈ 0.094 from reward alone, and if so, does it outperform the static threshold baseline at the same average number of searches? The pipeline to answer that question is built, validated, and documented. The next artefact is a trained checkpoint.