---
title: ReasoningEconomicsEnv
sdk: docker
app_port: 8000
tags:
  - openenv
  - reasoning-economic-env
  - rl
  - math
---

# ReasoningEconomicsEnv

**An RL environment for learning to allocate reasoning compute under budget constraints.**

> Modern reasoning models like DeepSeek-R1 "think" by generating internal tokens before
> answering. More tokens = deeper reasoning = better answers — but tokens cost compute and
> money. How should an agent decide how much to think on each problem?

ReasoningEconomicsEnv frames this as a sequential decision problem: an agent faces a series of math questions with a fixed total token budget and must learn to **allocate tokens wisely** — spending less on easy questions, more on hard ones.

Built on [Meta's OpenEnv framework](https://github.com/meta-pytorch/OpenEnv) for the [AgentX–AgentBeats Competition](https://rdi.berkeley.edu/agentx-agentbeats) hosted by Berkeley RDI.

---

## How It Works

```
Episode (10 questions, 4000-token budget)
┌─────────────────────────────────────────────────────────┐
│ 1. Agent observes: question embedding, remaining budget │
│ 2. Agent decides: token allocation (50–800)             │
│ 3. Solver attempts question with that token limit       │
│ 4. Reward = correctness − β·cost + γ·efficiency_bonus   │
│ 5. Repeat until all questions answered or budget gone   │
└─────────────────────────────────────────────────────────┘
```

**Reward formula:** `R = correctness(±1/−0.1) − β·(tokens_used/budget) + γ·(savings/budget)`

---

## Quick Start

```bash
pip install -e .
# Run the OpenEnv server
uvicorn reasonbudget_gym.server.app:app --port 8000

# In another terminal — use the Python client
python -c "
from reasonbudget_gym.client import ReasonBudgetClient
client = ReasonBudgetClient()
obs = client.reset()
result = client.step(200)
print(result.reward, result.done)
"
```

**Or run baseline evaluation locally:**

```bash
python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
python -m reasonbudget_gym.eval.plots eval_results.json
```

---

## Baselines

| Agent | Mean Accuracy | Mean Reward | Budget Used |
|-------|---------------|-------------|-------------|
| `uniform` | 0.780 | 7.620 | 100.0% |
| `greedy_max` | 0.840 | 4.163 | 100.0% |
| `oracle` | 0.728 | 6.933 | 98.3% |
| `bandit` | 0.744 | 6.526 | 98.8% |

Evaluation command:

```bash
python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
```

![Baseline comparison](docs/agent_comparison.png)
![Budget pacing](docs/budget_pacing.png)

---

## Observation Space

| Field | Shape | Description |
|-------|-------|-------------|
| `question_embedding` | 384-dim | Sentence-transformer encoding |
| `remaining_budget` | int | Tokens left in episode |
| `questions_remaining` | int | Questions left |
| `budget_per_remaining` | float | remaining / questions_left |
| `accuracy_so_far` | float | Running accuracy [0, 1] |
| `history` | list | Past (allocated, used, correct) tuples |

**Action:** integer token allocation, clamped to `[min_tokens, max_tokens]` and the remaining budget.

---

## Data

The repo ships with a deterministic offline question bundle and response cache under `reasonbudget_gym/data/`, so demos and tests work without external services. A **synthetic cache** (`reasonbudget_gym/data/response_cache.json`) simulates realistic DeepSeek-R1 accuracy curves across 4 difficulty tiers: `gsm8k`, `math_l1_l2`, `math_l3`, `math_l4_l5`.
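To illustrate what such a cache encodes, a saturating accuracy-vs-tokens curve per difficulty tier might be modeled as below. The tier parameters, the `simulated_accuracy` helper, and the exponential form are all illustrative assumptions, not the shipped cache's actual values or implementation:

```python
import math

# Hypothetical per-tier parameters: (accuracy with minimal thinking,
# accuracy ceiling, token scale at which returns diminish).
TIERS = {
    "gsm8k":      (0.60, 0.95, 150),
    "math_l1_l2": (0.45, 0.90, 250),
    "math_l3":    (0.25, 0.80, 400),
    "math_l4_l5": (0.10, 0.60, 600),
}

def simulated_accuracy(tier: str, tokens: int) -> float:
    """Saturating curve: more thinking tokens -> higher accuracy,
    with diminishing returns past the tier's token scale."""
    base, ceiling, scale = TIERS[tier]
    return base + (ceiling - base) * (1.0 - math.exp(-tokens / scale))
```

Curves like this are what make allocation interesting: easy tiers saturate after a few hundred tokens, so extra budget is better spent on the harder tiers.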
The sampler also caches MiniLM embeddings to `reasonbudget_gym/data/embeddings.npy` after the first run.

Regenerate the synthetic cache with:

```bash
python reasonbudget_gym/data/generate_synthetic_cache.py
```

---

## Deployment (Docker / HF Spaces)

```bash
docker build -t reasoning-economic-env .
docker run -p 8000:8000 reasoning-economic-env
curl http://localhost:8000/health
```

---

## Related Work

- **[MAS-TTS](https://github.com/jincan333/MAS-TTS):** Allocates reasoning across *agents* on one problem, whereas our approach allocates across *questions* for a single agent.
- **[AgentTTS](https://arxiv.org/abs/2508.00890):** Test-time compute-optimal scaling across multi-stage complex tasks.

---

## Citation

Part of the AgentX–AgentBeats Competition (Berkeley RDI, 2026). Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) by Meta/PyTorch.
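---

## Example: Full-Episode Loop

The `reset()`/`step()` client calls from Quick Start compose into a full-episode loop. A minimal sketch of the `uniform` baseline follows; the `run_uniform_episode` helper and the fixed 400-token allocation are illustrative, not part of the package (requires the server from Quick Start running on `localhost:8000`):

```python
def run_uniform_episode(client, allocation: int = 400) -> float:
    """Run one episode, requesting the same token allocation every step.

    `client` is anything exposing the reset()/step() interface shown in
    Quick Start; the environment clamps `allocation` to
    [min_tokens, max_tokens] and the remaining budget.
    """
    client.reset()
    total_reward, done = 0.0, False
    while not done:
        result = client.step(allocation)
        total_reward += result.reward
        done = result.done
    return total_reward

if __name__ == "__main__":
    from reasonbudget_gym.client import ReasonBudgetClient
    print(run_uniform_episode(ReasonBudgetClient()))
```

A learning agent would replace the constant `allocation` with a policy over the observation fields above (question embedding, remaining budget, history).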