---
title: ReasoningEconomicsEnv
sdk: docker
app_port: 8000
tags:
- openenv
- reasoning-economic-env
- rl
- math
---
# ReasoningEconomicsEnv
**An RL environment for learning to allocate reasoning compute under budget constraints.**
> Modern reasoning models like DeepSeek-R1 "think" by generating internal tokens before
> answering. More tokens = deeper reasoning = better answers, but tokens cost compute and
> money. How should an agent decide how much to think on each problem?
ReasoningEconomicsEnv frames this as a sequential decision problem: an agent faces a series
of math questions with a fixed total token budget and must learn to **allocate tokens wisely**:
spending less on easy questions and more on hard ones.
Built on [Meta's OpenEnv framework](https://github.com/meta-pytorch/OpenEnv) for the
[AgentX–AgentBeats Competition](https://rdi.berkeley.edu/agentx-agentbeats) hosted by
Berkeley RDI.
---
## How It Works
```
Episode (10 questions, 4000 token budget)
┌──────────────────────────────────────────────────────────┐
│ 1. Agent observes: question embedding, remaining budget  │
│ 2. Agent decides: token allocation (50–800)              │
│ 3. Solver attempts question with that token limit        │
│ 4. Reward = correctness − β·cost + γ·efficiency_bonus    │
│ 5. Repeat until all questions answered or budget gone    │
└──────────────────────────────────────────────────────────┘
```
**Reward formula:** `R = correctness(±1/−0.1) − β·(tokens_used/budget) + γ·(savings/budget)`
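The reward formula can be sketched in Python. Note that the `beta` and `gamma` defaults and the exact definition of `savings` below are illustrative assumptions, not the environment's actual constants:

```python
def step_reward(correct: bool, tokens_used: int, allocated: int, budget: int,
                beta: float = 0.5, gamma: float = 0.2) -> float:
    """Sketch of the per-step reward: correctness score, minus a cost
    penalty proportional to tokens spent, plus an efficiency bonus for
    allocated tokens the solver did not need."""
    correctness = 1.0 if correct else -0.1   # assumed mapping of the ±1/−0.1 term
    cost = tokens_used / budget
    savings = max(allocated - tokens_used, 0) / budget  # assumed savings definition
    return correctness - beta * cost + gamma * savings
```

For example, a correct answer that used 200 of 300 allocated tokens against a 4000-token budget would score `1.0 − 0.5·0.05 + 0.2·0.025 = 0.98` under these assumed coefficients.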
---
## Quick Start
```bash
pip install -e .
# Run the OpenEnv server
uvicorn reasonbudget_gym.server.app:app --port 8000
# In another terminal: use the Python client
python -c "
from reasonbudget_gym.client import ReasonBudgetClient
client = ReasonBudgetClient()
obs = client.reset()
result = client.step(200)
print(result.reward, result.done)
"
```
**Or run baseline evaluation locally:**
```bash
python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
python -m reasonbudget_gym.eval.plots eval_results.json
```
---
## Baselines
| Agent | Mean Accuracy | Mean Reward | Budget Used |
|-------|---------------|-------------|-------------|
| `uniform` | 0.780 | 7.620 | 100.0% |
| `greedy_max` | 0.840 | 4.163 | 100.0% |
| `oracle` | 0.728 | 6.933 | 98.3% |
| `bandit` | 0.744 | 6.526 | 98.8% |
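As a reference point, the `uniform` baseline can be sketched as an even split of the remaining budget over the remaining questions. This is an illustrative reimplementation, not the repo's code; the clamping bounds mirror the action space described below:

```python
def uniform_allocation(remaining_budget: int, questions_remaining: int,
                       min_tokens: int = 50, max_tokens: int = 800) -> int:
    """Uniform baseline: split the remaining budget evenly across the
    questions still to come, clamped to the action bounds and to the
    tokens actually left in the episode."""
    per_question = remaining_budget // max(questions_remaining, 1)
    # Clamp to [min_tokens, max_tokens], then to what is actually left.
    return min(max(per_question, min_tokens), max_tokens, remaining_budget)
```

At the start of a fresh episode this yields `uniform_allocation(4000, 10) == 400`, i.e. an even 400-token share per question.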
Evaluation command:
```bash
python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
```
![Baseline comparison](docs/agent_comparison.png)
![Budget pacing](docs/budget_pacing.png)
---
## Observation Space
| Field | Shape | Description |
|-------|-------|-------------|
| `question_embedding` | 384-dim | Sentence-transformer encoding |
| `remaining_budget` | int | Tokens left in episode |
| `questions_remaining` | int | Questions left |
| `budget_per_remaining` | float | remaining_budget / questions_remaining |
| `accuracy_so_far` | float | Running accuracy [0, 1] |
| `history` | list | Past (allocated, used, correct) tuples |
**Action:** integer token allocation, clamped to `[min_tokens, max_tokens]` and remaining budget.
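A policy can combine these observation fields into a simple pacing heuristic. The `adaptive_allocation` function below is a hypothetical example (not part of the repo) that starts from the even split and spends more when running accuracy is low, on the assumption that wrong answers signal harder questions ahead:

```python
def adaptive_allocation(obs: dict, min_tokens: int = 50, max_tokens: int = 800) -> int:
    """Hypothetical pacing heuristic over the observation fields:
    scale the even split up as running accuracy drops (1.0x at perfect
    accuracy, up to 1.5x when every answer so far was wrong)."""
    base = obs["budget_per_remaining"]
    scale = 1.0 + 0.5 * (1.0 - obs["accuracy_so_far"])
    action = int(base * scale)
    # Clamp to the action bounds and to the tokens actually remaining.
    return min(max(action, min_tokens), max_tokens, obs["remaining_budget"])
```

With `budget_per_remaining = 400` and a running accuracy of 0.5, this allocates `int(400 · 1.25) = 500` tokens.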
---
## Data
The repo ships with a deterministic offline question bundle and response cache under
`reasonbudget_gym/data/`, so demos and tests work without external services.
A **synthetic cache** (`reasonbudget_gym/data/response_cache.json`) simulates realistic
DeepSeek-R1 accuracy curves across 4 difficulty tiers: `gsm8k`, `math_l1_l2`, `math_l3`,
`math_l4_l5`. The sampler also caches MiniLM embeddings to
`reasonbudget_gym/data/embeddings.npy` after the first run.
Regenerate the synthetic cache with:
```bash
python reasonbudget_gym/data/generate_synthetic_cache.py
```
---
## Deployment (Docker / HF Spaces)
```bash
docker build -t reasoning-economic-env .
docker run -p 8000:8000 reasoning-economic-env
curl http://localhost:8000/health
```
---
## Related Work
- **[MAS-TTS](https://github.com/jincan333/MAS-TTS):** Allocates reasoning across *agents* on
one problem vs. our approach of allocating across *questions* for a single agent.
- **[AgentTTS](https://arxiv.org/abs/2508.00890):** Test-time compute-optimal scaling across
multi-stage complex tasks.
---
## Citation
Part of the AgentX–AgentBeats Competition (Berkeley RDI, 2026).
Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) by Meta/PyTorch.