---
title: ReasoningEconomicsEnv
sdk: docker
app_port: 8000
tags:
  - openenv
  - reasoning-economic-env
  - rl
  - math
---

# ReasoningEconomicsEnv

An RL environment for learning to allocate reasoning compute under budget constraints.

Modern reasoning models like DeepSeek-R1 "think" by generating internal tokens before answering. More tokens = deeper reasoning = better answers, but tokens cost compute and money. How should an agent decide how much to think on each problem?

ReasoningEconomicsEnv frames this as a sequential decision problem: an agent faces a series of math questions with a fixed total token budget and must learn to allocate tokens wisely, spending less on easy questions and more on hard ones.

Built on Meta's OpenEnv framework for the AgentX–AgentBeats Competition hosted by Berkeley RDI.


## How It Works

```
Episode (10 questions, 4000-token budget)
┌──────────────────────────────────────────────────────────┐
│  1. Agent observes: question embedding, remaining budget │
│  2. Agent decides: token allocation (50–800)             │
│  3. Solver attempts question with that token limit       │
│  4. Reward = correctness − β·cost + γ·efficiency_bonus   │
│  5. Repeat until all questions answered or budget gone   │
└──────────────────────────────────────────────────────────┘
```

Reward formula: `R = correctness (±1 / −0.1) − β·(tokens_used/budget) + γ·(savings/budget)`
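The per-question reward can be sketched in Python. This is a hedged illustration, not the environment's actual code: the `beta` and `gamma` defaults are made-up values, `savings` is interpreted as allocated-minus-used tokens, and correctness is read as +1 for a right answer and −0.1 otherwise (one reading of the ±1/−0.1 notation above):

```python
def step_reward(correct: bool, tokens_used: int, allocated: int,
                budget: int, beta: float = 0.5, gamma: float = 0.2) -> float:
    """Correctness minus a token-cost penalty plus a frugality bonus.

    beta/gamma defaults are illustrative assumptions, not the env's real values.
    """
    correctness = 1.0 if correct else -0.1              # +1 right, -0.1 wrong (assumed)
    cost_penalty = beta * (tokens_used / budget)        # pay for tokens actually used
    savings = allocated - tokens_used                   # tokens allocated but not spent
    return correctness - cost_penalty + gamma * (savings / budget)
```

With these toy coefficients, a correct answer that uses 200 of 300 allocated tokens out of a 4000-token budget yields a reward just under 1.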


## Quick Start

```bash
pip install -e .

# Run the OpenEnv server
uvicorn reasonbudget_gym.server.app:app --port 8000

# In another terminal, use the Python client
python -c "
from reasonbudget_gym.client import ReasonBudgetClient
client = ReasonBudgetClient()
obs = client.reset()
result = client.step(200)
print(result.reward, result.done)
"
```

Or run the baseline evaluation locally:

```bash
python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
python -m reasonbudget_gym.eval.plots eval_results.json
```

## Baselines

| Agent | Mean Accuracy | Mean Reward | Budget Used |
|---|---|---|---|
| uniform | 0.780 | 7.620 | 100.0% |
| greedy_max | 0.840 | 4.163 | 100.0% |
| oracle | 0.728 | 6.933 | 98.3% |
| bandit | 0.744 | 6.526 | 98.8% |
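For reference, the uniform baseline amounts to splitting the remaining budget evenly over the remaining questions. A minimal sketch, assuming per-question allocations are clamped to the env's 50–800 action range (the function name and clamping details are assumptions, not the repo's actual baseline code):

```python
def uniform_allocation(remaining_budget: int, questions_remaining: int,
                       min_tokens: int = 50, max_tokens: int = 800) -> int:
    # Even split of what's left, clamped to the per-question action range.
    share = remaining_budget // max(questions_remaining, 1)
    return max(min_tokens, min(share, max_tokens))
```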

Evaluation command:

```bash
python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
```

*(Figures: baseline comparison and budget pacing, generated by `reasonbudget_gym.eval.plots`.)*


## Observation Space

| Field | Shape | Description |
|---|---|---|
| `question_embedding` | 384-dim | Sentence-transformer encoding |
| `remaining_budget` | int | Tokens left in episode |
| `questions_remaining` | int | Questions left |
| `budget_per_remaining` | float | remaining / questions_left |
| `accuracy_so_far` | float | Running accuracy in [0, 1] |
| `history` | list | Past (allocated, used, correct) tuples |
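A learned policy will typically flatten these fields into one feature vector. A minimal sketch, assuming the observation is exposed as a dict keyed by the field names above (the dict layout is an assumption; `history` is left out here since it is variable-length):

```python
def featurize(obs: dict) -> list:
    # 384-dim question embedding followed by the scalar budget/progress features.
    return list(obs["question_embedding"]) + [
        float(obs["remaining_budget"]),
        float(obs["questions_remaining"]),
        float(obs["budget_per_remaining"]),
        float(obs["accuracy_so_far"]),
    ]
```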

Action: integer token allocation, clamped to [min_tokens, max_tokens] and remaining budget.
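The clamping described above might look like this (a sketch of the stated behavior; the actual environment code may differ, e.g. in how a near-empty budget interacts with `min_tokens`):

```python
def clamp_action(requested: int, min_tokens: int, max_tokens: int,
                 remaining_budget: int) -> int:
    # Clamp into the configured range first, then cap at what's left in the budget.
    allocation = max(min_tokens, min(requested, max_tokens))
    return min(allocation, remaining_budget)
```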


## Data

The repo ships with a deterministic offline question bundle and response cache under `reasonbudget_gym/data/`, so demos and tests work without external services.

A synthetic cache (`reasonbudget_gym/data/response_cache.json`) simulates realistic DeepSeek-R1 accuracy curves across four difficulty tiers: `gsm8k`, `math_l1_l2`, `math_l3`, `math_l4_l5`. The sampler also caches MiniLM embeddings to `reasonbudget_gym/data/embeddings.npy` after the first run.
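The cached curves follow the usual shape: more thinking tokens raise the solve probability with diminishing returns, and harder tiers need more tokens for the same accuracy. A toy illustration of that shape only; the functional form, constants, and `difficulty` parameterization are assumptions, not the repo's actual generator:

```python
import math

def simulated_accuracy(tokens: int, difficulty: float) -> float:
    # Saturating curve: accuracy rises with tokens but levels off below 1.0;
    # larger `difficulty` shifts the curve right (harder tiers need more tokens).
    return 0.95 * (1.0 - math.exp(-tokens / (150.0 * difficulty)))
```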

Regenerate the synthetic cache with:

```bash
python reasonbudget_gym/data/generate_synthetic_cache.py
```

## Deployment (Docker / HF Spaces)

```bash
docker build -t reasoning-economic-env .
docker run -p 8000:8000 reasoning-economic-env
curl http://localhost:8000/health
```

## Related Work

  • MAS-TTS: Allocates reasoning across agents on one problem vs. our approach of allocating across questions for a single agent.
  • AgentTTS: Test-time compute-optimal scaling across multi-stage complex tasks.

## Citation

Part of the AgentX–AgentBeats Competition (Berkeley RDI, 2026). Built on OpenEnv by Meta/PyTorch.