---
title: ReasoningEconomicsEnv
sdk: docker
app_port: 8000
tags:
  - openenv
  - reasoning-economic-env
  - rl
  - math
---

# ReasoningEconomicsEnv

**An RL environment for learning to allocate reasoning compute under budget constraints.**

> Modern reasoning models like DeepSeek-R1 "think" by generating internal tokens before
> answering. More tokens = deeper reasoning = better answers, but tokens cost compute and
> money. How should an agent decide how much to think on each problem?

ReasoningEconomicsEnv frames this as a sequential decision problem: an agent faces a series
of math questions with a fixed total token budget and must learn to **allocate tokens wisely**,
spending less on easy questions and more on hard ones.

Built on [Meta's OpenEnv framework](https://github.com/meta-pytorch/OpenEnv) for the
[AgentX-AgentBeats Competition](https://rdi.berkeley.edu/agentx-agentbeats) hosted by
Berkeley RDI.

---

## How It Works

```
Episode (10 questions, 4000 token budget)
┌──────────────────────────────────────────────────────────┐
│  1. Agent observes: question embedding, remaining budget │
│  2. Agent decides: token allocation (50–800)             │
│  3. Solver attempts question with that token limit       │
│  4. Reward = correctness − β·cost + γ·efficiency_bonus   │
│  5. Repeat until all questions answered or budget gone   │
└──────────────────────────────────────────────────────────┘
```

**Reward formula:** `R = correctness(±1/−0.1) − β·(tokens_used/budget) + γ·(savings/budget)`
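The reward formula can be sketched as a plain function. This is an illustrative reading, not the package's implementation: it assumes +1 for a correct answer and −0.1 otherwise, and that `savings` means allocated-but-unused tokens; `beta` and `gamma` values are placeholders.

```python
def step_reward(correct: bool, tokens_used: int, allocated: int,
                budget: int, beta: float = 0.5, gamma: float = 0.2) -> float:
    # Correctness term: +1 if solved, small penalty otherwise (assumed reading
    # of the formula above).
    correctness = 1.0 if correct else -0.1
    # Cost term: fraction of the episode budget actually consumed.
    cost = beta * (tokens_used / budget)
    # Efficiency bonus: reward for allocating tokens that went unused.
    savings = gamma * ((allocated - tokens_used) / budget)
    return correctness - cost + savings
```

A correct answer with few tokens spent scores near +1; a wrong answer that burned a large allocation is penalized on both the correctness and cost terms.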

---

## Quick Start

```bash
pip install -e .

# Run the OpenEnv server
uvicorn reasonbudget_gym.server.app:app --port 8000

# In another terminal, use the Python client
python -c "
from reasonbudget_gym.client import ReasonBudgetClient
client = ReasonBudgetClient()
obs = client.reset()
result = client.step(200)
print(result.reward, result.done)
"
```

**Or run baseline evaluation locally:**

```bash
python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
python -m reasonbudget_gym.eval.plots eval_results.json
```

---

## Baselines

| Agent | Mean Accuracy | Mean Reward | Budget Used |
|-------|---------------|-------------|-------------|
| `uniform` | 0.780 | 7.620 | 100.0% |
| `greedy_max` | 0.840 | 4.163 | 100.0% |
| `oracle` | 0.728 | 6.933 | 98.3% |
| `bandit` | 0.744 | 6.526 | 98.8% |

Evaluation command:

```bash
python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
```

![Baseline comparison](docs/agent_comparison.png)

![Budget pacing](docs/budget_pacing.png)

---

## Observation Space

| Field | Shape | Description |
|-------|-------|-------------|
| `question_embedding` | 384-dim | Sentence-transformer encoding |
| `remaining_budget` | int | Tokens left in episode |
| `questions_remaining` | int | Questions left |
| `budget_per_remaining` | float | `remaining_budget / questions_remaining` |
| `accuracy_so_far` | float | Running accuracy [0, 1] |
| `history` | list | Past (allocated, used, correct) tuples |

**Action:** integer token allocation, clamped to `[min_tokens, max_tokens]` and remaining budget.
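The clamping rule described above can be sketched as follows (a minimal illustration of the stated semantics, not the environment's actual code; the default bounds are assumptions):

```python
def clamp_action(action: int, remaining_budget: int,
                 min_tokens: int = 50, max_tokens: int = 800) -> int:
    # Clamp to the configured [min_tokens, max_tokens] range first,
    # then cap at whatever budget is actually left in the episode.
    return min(max(action, min_tokens), max_tokens, remaining_budget)
```

Note that near the end of an episode the remaining budget may be below `min_tokens`, in which case the allocation is simply whatever is left.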

---

## Data

The repo ships with a deterministic offline question bundle and response cache under
`reasonbudget_gym/data/`, so demos and tests work without external services.

A **synthetic cache** (`reasonbudget_gym/data/response_cache.json`) simulates realistic
DeepSeek-R1 accuracy curves across 4 difficulty tiers: `gsm8k`, `math_l1_l2`, `math_l3`,
`math_l4_l5`. The sampler also caches MiniLM embeddings to
`reasonbudget_gym/data/embeddings.npy` after the first run.

Regenerate the synthetic cache with:

```bash
python reasonbudget_gym/data/generate_synthetic_cache.py
```

---

## Deployment (Docker / HF Spaces)

```bash
docker build -t reasoning-economic-env .
docker run -p 8000:8000 reasoning-economic-env
curl http://localhost:8000/health
```

---

## Related Work

- **[MAS-TTS](https://github.com/jincan333/MAS-TTS):** allocates reasoning compute across
  *agents* working on one problem, whereas ReasoningEconomicsEnv allocates across
  *questions* for a single agent.
- **[AgentTTS](https://arxiv.org/abs/2508.00890):** Test-time compute-optimal scaling across
  multi-stage complex tasks.

---

## Citation

Part of the AgentX-AgentBeats Competition (Berkeley RDI, 2026).
Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) by Meta/PyTorch.