---
title: ReasoningEconomicsEnv
sdk: docker
app_port: 8000
tags:
  - openenv
  - reasoning-economic-env
  - rl
  - math
---

# ReasoningEconomicsEnv

**An RL environment for learning to allocate reasoning compute under budget constraints.**

> Modern reasoning models like DeepSeek-R1 "think" by generating internal tokens before
> answering. More tokens = deeper reasoning = better answers — but tokens cost compute and
> money. How should an agent decide how much to think on each problem?

ReasoningEconomicsEnv frames this as a sequential decision problem: an agent faces a series of math questions with a fixed total token budget and must learn to **allocate tokens wisely** — spending less on easy questions, more on hard ones.

Built on [Meta's OpenEnv framework](https://github.com/meta-pytorch/OpenEnv) for the [AgentX–AgentBeats Competition](https://rdi.berkeley.edu/agentx-agentbeats) hosted by Berkeley RDI.

---

## How It Works

```
Episode (10 questions, 4000-token budget)
┌─────────────────────────────────────────────────────────┐
│ 1. Agent observes: question embedding, remaining budget │
│ 2. Agent decides: token allocation (50–800)             │
│ 3. Solver attempts question with that token limit       │
│ 4. Reward = correctness − β·cost + γ·efficiency_bonus   │
│ 5. Repeat until all questions answered or budget gone   │
└─────────────────────────────────────────────────────────┘
```

**Reward formula:** `R = correctness(±1/−0.1) − β·(tokens_used/budget) + γ·(savings/budget)`

---

## Quick Start

```bash
pip install -e .
# Run the OpenEnv server
uvicorn reasonbudget_gym.server.app:app --port 8000

# In another terminal — use the Python client
python -c "
from reasonbudget_gym.client import ReasonBudgetClient
client = ReasonBudgetClient()
obs = client.reset()
result = client.step(200)
print(result.reward, result.done)
"
```

**Or run baseline evaluation locally:**

```bash
python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
python -m reasonbudget_gym.eval.plots eval_results.json
```

---

## Baselines

| Agent | Mean Accuracy | Mean Reward | Budget Used |
|-------|---------------|-------------|-------------|
| `uniform` | 0.780 | 7.620 | 100.0% |
| `greedy_max` | 0.840 | 4.163 | 100.0% |
| `oracle` | 0.728 | 6.933 | 98.3% |
| `bandit` | 0.744 | 6.526 | 98.8% |

Evaluation command:

```bash
python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
```

![Baseline comparison](docs/agent_comparison.png)
![Budget pacing](docs/budget_pacing.png)

---

## Observation Space

| Field | Shape | Description |
|-------|-------|-------------|
| `question_embedding` | 384-dim | Sentence-transformer encoding |
| `remaining_budget` | int | Tokens left in episode |
| `questions_remaining` | int | Questions left |
| `budget_per_remaining` | float | remaining / questions_left |
| `accuracy_so_far` | float | Running accuracy [0, 1] |
| `history` | list | Past (allocated, used, correct) tuples |

**Action:** integer token allocation, clamped to `[min_tokens, max_tokens]` and the remaining budget.

---

## Data

The repo ships with a deterministic offline question bundle and response cache under `reasonbudget_gym/data/`, so demos and tests work without external services. A **synthetic cache** (`reasonbudget_gym/data/response_cache.json`) simulates realistic DeepSeek-R1 accuracy curves across 4 difficulty tiers: `gsm8k`, `math_l1_l2`, `math_l3`, `math_l4_l5`.
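To illustrate what such a cache encodes, a saturating accuracy-vs-tokens curve per difficulty tier might be modeled as below. The tier parameters, the `simulated_accuracy` helper, and the exponential form are all illustrative assumptions, not the shipped cache's actual values or implementation:

```python
import math

# Hypothetical per-tier parameters: (accuracy with minimal thinking,
# accuracy ceiling, token scale at which returns diminish).
TIERS = {
    "gsm8k":      (0.60, 0.95, 150),
    "math_l1_l2": (0.45, 0.90, 250),
    "math_l3":    (0.25, 0.80, 400),
    "math_l4_l5": (0.10, 0.60, 600),
}

def simulated_accuracy(tier: str, tokens: int) -> float:
    """Saturating curve: more thinking tokens -> higher accuracy,
    with diminishing returns past the tier's token scale."""
    base, ceiling, scale = TIERS[tier]
    return base + (ceiling - base) * (1.0 - math.exp(-tokens / scale))
```

Curves like this are what make allocation interesting: easy tiers saturate after a few hundred tokens, so extra budget is better spent on the harder tiers.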
The sampler also caches MiniLM embeddings to `reasonbudget_gym/data/embeddings.npy` after the first run.

Regenerate the synthetic cache with:

```bash
python reasonbudget_gym/data/generate_synthetic_cache.py
```

---

## Deployment (Docker / HF Spaces)

```bash
docker build -t reasoning-economic-env .
docker run -p 8000:8000 reasoning-economic-env
curl http://localhost:8000/health
```

---

## Related Work

- **[MAS-TTS](https://github.com/jincan333/MAS-TTS):** Allocates reasoning across *agents* on one problem, whereas our approach allocates across *questions* for a single agent.
- **[AgentTTS](https://arxiv.org/abs/2508.00890):** Test-time compute-optimal scaling across multi-stage complex tasks.

---

## Citation

Part of the AgentX–AgentBeats Competition (Berkeley RDI, 2026). Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) by Meta/PyTorch.
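---

## Example: Full-Episode Loop

The `reset()`/`step()` client calls from Quick Start compose into a full-episode loop. A minimal sketch of the `uniform` baseline follows; the `run_uniform_episode` helper and the fixed 400-token allocation are illustrative, not part of the package (requires the server from Quick Start running on `localhost:8000`):

```python
def run_uniform_episode(client, allocation: int = 400) -> float:
    """Run one episode, requesting the same token allocation every step.

    `client` is anything exposing the reset()/step() interface shown in
    Quick Start; the environment clamps `allocation` to
    [min_tokens, max_tokens] and the remaining budget.
    """
    client.reset()
    total_reward, done = 0.0, False
    while not done:
        result = client.step(allocation)
        total_reward += result.reward
        done = result.done
    return total_reward

if __name__ == "__main__":
    from reasonbudget_gym.client import ReasonBudgetClient
    print(run_uniform_episode(ReasonBudgetClient()))
```

A learning agent would replace the constant `allocation` with a policy over the observation fields above (question embedding, remaining budget, history).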