---
title: ReasoningEconomicsEnv
sdk: docker
app_port: 8000
tags:
  - openenv
  - reasoning-economic-env
  - rl
  - math
---
# ReasoningEconomicsEnv
An RL environment for learning to allocate reasoning compute under budget constraints.
Modern reasoning models like DeepSeek-R1 "think" by generating internal tokens before answering. More tokens mean deeper reasoning and better answers, but tokens cost compute and money. How should an agent decide how much to think on each problem?

ReasoningEconomicsEnv frames this as a sequential decision problem: an agent faces a series of math questions with a fixed total token budget and must learn to allocate tokens wisely, spending less on easy questions and more on hard ones.

Built on Meta's OpenEnv framework for the AgentX–AgentBeats Competition hosted by Berkeley RDI.
## How It Works
```
Episode (10 questions, 4000 token budget)
┌──────────────────────────────────────────────────────────┐
│ 1. Agent observes: question embedding, remaining budget  │
│ 2. Agent decides: token allocation (50–800)              │
│ 3. Solver attempts question with that token limit        │
│ 4. Reward = correctness − β·cost + γ·efficiency_bonus    │
│ 5. Repeat until all questions answered or budget gone    │
└──────────────────────────────────────────────────────────┘
```
**Reward formula:** `R = correctness (±1 / −0.1) − β·(tokens_used/budget) + γ·(savings/budget)`
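A minimal sketch of how this reward might be computed. The `beta` and `gamma` values and the correctness mapping (+1 for a correct answer, −0.1 otherwise) are assumptions for illustration, not values taken from the repo:

```python
def step_reward(correct: bool, tokens_used: int, budget: int,
                beta: float = 0.1, gamma: float = 0.05) -> float:
    """Hypothetical reward: correctness - beta*cost + gamma*savings.

    Assumed mapping: correct answers score +1.0, incorrect -0.1.
    Cost and savings are both normalized by the episode budget.
    """
    correctness = 1.0 if correct else -0.1
    cost = tokens_used / budget
    savings = (budget - tokens_used) / budget
    return correctness - beta * cost + gamma * savings
```

With these weights, a correct answer that spends the whole budget still beats a wrong answer that spends nothing, so the cost term shapes allocation without dominating correctness.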
## Quick Start
```bash
pip install -e .

# Run the OpenEnv server
uvicorn reasonbudget_gym.server.app:app --port 8000

# In another terminal, use the Python client
python -c "
from reasonbudget_gym.client import ReasonBudgetClient
client = ReasonBudgetClient()
obs = client.reset()
result = client.step(200)
print(result.reward, result.done)
"
```
Or run baseline evaluation locally:
```bash
python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
python -m reasonbudget_gym.eval.plots eval_results.json
```
## Baselines
| Agent | Mean Accuracy | Mean Reward | Budget Used |
|---|---|---|---|
| uniform | 0.780 | 7.620 | 100.0% |
| greedy_max | 0.840 | 4.163 | 100.0% |
| oracle | 0.728 | 6.933 | 98.3% |
| bandit | 0.744 | 6.526 | 98.8% |
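For intuition, a `uniform`-style baseline plausibly splits the remaining budget evenly across the remaining questions. This is a sketch of that idea under assumed defaults (`min_tokens=50`, `max_tokens=800`), not the repo's actual implementation:

```python
def uniform_allocation(remaining_budget: int, questions_remaining: int,
                       min_tokens: int = 50, max_tokens: int = 800) -> int:
    """Hypothetical 'uniform' policy: equal share of budget per question,
    clamped to the environment's action bounds."""
    per_question = remaining_budget // max(questions_remaining, 1)
    return max(min_tokens, min(per_question, max_tokens))
```

At episode start (4000 tokens, 10 questions) this allocates 400 tokens per question regardless of difficulty, which is why difficulty-aware policies can beat it on accuracy per token.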
Evaluation command:

```bash
python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
```
## Observation Space
| Field | Shape | Description |
|---|---|---|
| `question_embedding` | 384-dim | Sentence-transformer encoding |
| `remaining_budget` | int | Tokens left in episode |
| `questions_remaining` | int | Questions left |
| `budget_per_remaining` | float | remaining / questions_left |
| `accuracy_so_far` | float | Running accuracy [0, 1] |
| `history` | list | Past (allocated, used, correct) tuples |
**Action:** integer token allocation, clamped to `[min_tokens, max_tokens]` and the remaining budget.
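A sketch of that clamping rule, assuming the bounds clamp is applied first and the remaining-budget cap second (the environment's actual order of operations may differ):

```python
def clamp_action(requested: int, remaining_budget: int,
                 min_tokens: int = 50, max_tokens: int = 800) -> int:
    """Clamp a requested allocation to [min_tokens, max_tokens],
    then cap it by whatever budget is left in the episode."""
    bounded = max(min_tokens, min(requested, max_tokens))
    return min(bounded, remaining_budget)
```

Note that near the end of an episode the remaining budget can fall below `min_tokens`, in which case this sketch simply returns the remainder.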
## Data
The repo ships with a deterministic offline question bundle and response cache under `reasonbudget_gym/data/`, so demos and tests work without external services.

A synthetic cache (`reasonbudget_gym/data/response_cache.json`) simulates realistic DeepSeek-R1 accuracy curves across 4 difficulty tiers: `gsm8k`, `math_l1_l2`, `math_l3`, and `math_l4_l5`. The sampler also caches MiniLM embeddings to `reasonbudget_gym/data/embeddings.npy` after the first run.
Regenerate the synthetic cache with:

```bash
python reasonbudget_gym/data/generate_synthetic_cache.py
```
## Deployment (Docker / HF Spaces)

```bash
docker build -t reasoning-economic-env .
docker run -p 8000:8000 reasoning-economic-env
curl http://localhost:8000/health
```
## Related Work

- **MAS-TTS**: allocates reasoning across agents on one problem, versus our approach of allocating across questions for a single agent.
- **AgentTTS**: test-time compute-optimal scaling across multi-stage complex tasks.
## Citation

Part of the AgentX–AgentBeats Competition (Berkeley RDI, 2026). Built on OpenEnv by Meta/PyTorch.

