Andrew Lara committed on
Commit
ee91164
·
0 Parent(s):

Deploy landing page update to Space

Browse files
Files changed (48) hide show
  1. .dockerignore +7 -0
  2. .env.example +5 -0
  3. .gitattributes +2 -0
  4. .github/workflows/ci.yml +18 -0
  5. .gitignore +6 -0
  6. CODEX_CONTEXT.md +110 -0
  7. Dockerfile +23 -0
  8. README.md +150 -0
  9. docs/agent_comparison.png +3 -0
  10. docs/budget_pacing.png +3 -0
  11. eval_results.json +0 -0
  12. openenv.yaml +12 -0
  13. pyproject.toml +37 -0
  14. reasonbudget_gym/__init__.py +3 -0
  15. reasonbudget_gym/baselines/__init__.py +11 -0
  16. reasonbudget_gym/baselines/bandit.py +82 -0
  17. reasonbudget_gym/baselines/greedy_max.py +18 -0
  18. reasonbudget_gym/baselines/oracle.py +46 -0
  19. reasonbudget_gym/baselines/uniform.py +22 -0
  20. reasonbudget_gym/client.py +51 -0
  21. reasonbudget_gym/data/__init__.py +0 -0
  22. reasonbudget_gym/data/embeddings.npy +3 -0
  23. reasonbudget_gym/data/generate_synthetic_cache.py +156 -0
  24. reasonbudget_gym/data/response_cache.json +0 -0
  25. reasonbudget_gym/env/__init__.py +5 -0
  26. reasonbudget_gym/env/config.py +37 -0
  27. reasonbudget_gym/env/episode_sampler.py +210 -0
  28. reasonbudget_gym/env/models.py +43 -0
  29. reasonbudget_gym/env/reason_budget_env.py +167 -0
  30. reasonbudget_gym/env/reward.py +44 -0
  31. reasonbudget_gym/eval/__init__.py +0 -0
  32. reasonbudget_gym/eval/evaluate.py +121 -0
  33. reasonbudget_gym/eval/plots.py +89 -0
  34. reasonbudget_gym/policy/__init__.py +3 -0
  35. reasonbudget_gym/policy/allocation_policy.py +127 -0
  36. reasonbudget_gym/server/__init__.py +0 -0
  37. reasonbudget_gym/server/app.py +233 -0
  38. reasonbudget_gym/solver/__init__.py +4 -0
  39. reasonbudget_gym/solver/base.py +26 -0
  40. reasonbudget_gym/solver/cached_solver.py +98 -0
  41. reasonbudget_gym/solver/live_solver.py +62 -0
  42. reasonbudget_gym/tests/__init__.py +0 -0
  43. reasonbudget_gym/tests/test_config.py +27 -0
  44. reasonbudget_gym/tests/test_integration.py +43 -0
  45. reasonbudget_gym/tests/test_reward.py +23 -0
  46. reasonbudget_gym/training/__init__.py +0 -0
  47. reasonbudget_gym/training/ppo_train.py +252 -0
  48. requirements.txt +8 -0
.dockerignore ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ .git
2
+ .venv
3
+ .pytest_cache
4
+ __pycache__
5
+ *.py[cod]
6
+ .DS_Store
7
+ runs
.env.example ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ # Optional: only needed for live solver (GPU inference)
2
+ TOGETHER_API_KEY=your_key_here
3
+
4
+ # Optional: for HF Spaces deployment
5
+ HF_TOKEN=your_token_here
.gitattributes ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ docs/*.png filter=lfs diff=lfs merge=lfs -text
2
+ reasonbudget_gym/data/*.npy filter=lfs diff=lfs merge=lfs -text
.github/workflows/ci.yml ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ name: CI
2
+
3
+ on:
4
+ push:
5
+ pull_request:
6
+
7
+ jobs:
8
+ test:
9
+ runs-on: ubuntu-latest
10
+ steps:
11
+ - uses: actions/checkout@v4
12
+ - uses: actions/setup-python@v5
13
+ with:
14
+ python-version: "3.11"
15
+ - name: Install dependencies
16
+ run: pip install -e ".[dev]"
17
+ - name: Run tests
18
+ run: python -m pytest reasonbudget_gym/tests/ -v
.gitignore ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ .DS_Store
2
+ .venv/
3
+ .pytest_cache/
4
+ __pycache__/
5
+ *.py[cod]
6
+ runs/
CODEX_CONTEXT.md ADDED
@@ -0,0 +1,110 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Codex Context — ReasoningEconomicsEnv
2
+
3
+ ## Project
4
+
5
+ - Repo root: `/Users/andrew/Mac/RL Research`
6
+ - GitHub repo: `git@github.com:laraandrew/reasoningeconomicsenv.git`
7
+ - Active branch: `polish-and-deploy`
8
+ - Hugging Face Space: `landrew9/CollabReasoning`
9
+ - Package: `reasonbudget_gym`
10
+ - Goal: RL environment for token-budget allocation, competition submission, Docker-based HF Space deployment
11
+
12
+ ## Remotes
13
+
14
+ - `origin`: `git@github.com:laraandrew/reasoningeconomicsenv.git`
15
+ - `hf`: `https://huggingface.co/spaces/landrew9/CollabReasoning`
16
+
17
+ ## Current State
18
+
19
+ - `main` and `polish-and-deploy` originally pointed to the same base commit.
20
+ - Work on `polish-and-deploy` is pushed to GitHub through commit `efdc42b`.
21
+ - The shipped cache works:
22
+ - `CachedSolver(EnvConfig())._cache` loads 500 entries.
23
+ - The environment now defaults to an offline-safe path for cached runs:
24
+ - `EpisodeSampler` uses deterministic bundled questions when the cached solver is active.
25
+ - Real question embeddings are enabled and cached at:
26
+ - `reasonbudget_gym/data/embeddings.npy`
27
+ - README now contains measured evaluation metrics and embedded plot assets.
28
+ - CI exists at `.github/workflows/ci.yml`.
29
+ - Dockerfile was slimmed to a runtime-only serving image suitable for HF Spaces.
30
+ - The Hugging Face Space repo was force-updated from a clean temporary clone because
31
+ Hugging Face rejected the branch's historical raw binary blobs.
32
+ - The live Space is currently:
33
+ - Hub page: `https://huggingface.co/spaces/landrew9/CollabReasoning`
34
+ - Host: `https://landrew9-collabreasoning.hf.space`
35
+ - Runtime stage: `RUNNING`
36
+ - Health endpoint: `/health`
37
+ - Root path originally returned `404`; a landing page at `/` was then added in `server/app.py`
38
+
39
+ ## Local Tooling
40
+
41
+ - Hugging Face CLI installed globally via the official installer.
42
+ - Binary path: `/Users/andrew/.local/bin/hf`
43
+ - Reported version at install time: `1.8.0`
44
+ - Installer added `/Users/andrew/.local/bin` to `/Users/andrew/.zshrc`
45
+ - `git-lfs` and `git-xet` are installed and initialized globally.
46
+ - `.gitattributes` now tracks:
47
+ - `docs/*.png`
48
+ - `reasonbudget_gym/data/*.npy`
49
+
50
+ ## Verified Commands
51
+
52
+ - Tests:
53
+ - `.venv/bin/python -m pytest reasonbudget_gym/tests/ -v`
54
+ - Result: `8 passed`
55
+ - Eval:
56
+ - `.venv/bin/python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json`
57
+ - Plot generation:
58
+ - `.venv/bin/python -c "from reasonbudget_gym.eval.plots import agent_comparison, budget_pacing; agent_comparison('eval_results.json', 'docs/agent_comparison.png'); budget_pacing('eval_results.json', 'docs/budget_pacing.png')"`
59
+ - PPO smoke test:
60
+ - `.venv/bin/python -m reasonbudget_gym.training.ppo_train --n_episodes 100 --output_dir runs/smoke`
61
+ - Completed successfully and wrote checkpoints.
62
+ - Docker:
63
+ - `docker build -t reasoning-economic-env .`
64
+ - `docker run -d -p 8000:8000 --name reasoning-economic-env-test reasoning-economic-env`
65
+ - `curl http://127.0.0.1:8000/health`
66
+ - Result: `{"status":"ok","env":"ReasonBudgetEnv","version":"0.1.0"}`
67
+
68
+ ## Current Eval Numbers
69
+
70
+ From `eval_results.json` with `--n_episodes 50 --seed 42`:
71
+
72
+ | Agent | Mean Accuracy | Mean Reward | Budget Used |
73
+ |---|---:|---:|---:|
74
+ | `uniform` | 0.780 | 7.620 | 100.0% |
75
+ | `greedy_max` | 0.840 | 4.163 | 100.0% |
76
+ | `oracle` | 0.728 | 6.933 | 98.3% |
77
+ | `bandit` | 0.744 | 6.526 | 98.8% |
78
+
79
+ ## Important Files
80
+
81
+ - `reasonbudget_gym/env/episode_sampler.py`
82
+ - `reasonbudget_gym/env/config.py`
83
+ - `reasonbudget_gym/solver/cached_solver.py`
84
+ - `reasonbudget_gym/eval/evaluate.py`
85
+ - `reasonbudget_gym/server/app.py`
86
+ - `Dockerfile`
87
+ - `README.md`
88
+ - `.github/workflows/ci.yml`
89
+ - `eval_results.json`
90
+ - `docs/agent_comparison.png`
91
+ - `docs/budget_pacing.png`
92
+
93
+ ## Git History Added On This Branch
94
+
95
+ - `29b6ad0` Add gitignore for local dev artifacts
96
+ - `ecd0ab1` Use bundled questions for cached offline runs
97
+ - `9e122a2` Cache MiniLM question embeddings
98
+ - `c4d6234` Add GitHub Actions test workflow
99
+ - `fc6c606` Add baseline eval results and README plots
100
+ - `280a6de` Slim Docker image for HF deployment
101
+ - `fc4c73c` Add living Codex context file
102
+ - `efdc42b` Track Space binaries with Xet
103
+
104
+ ## Notes For Next Codex
105
+
106
+ - Keep `HANDOFF.md` deleted; update this file instead.
107
+ - Do not remove `reasonbudget_gym/data/response_cache.json` or `reasonbudget_gym/data/embeddings.npy`; they are part of the current offline/demo story.
108
+ - The Docker image should stay lean; avoid reintroducing `sentence-transformers`, `datasets`, or training dependencies into the serving image unless truly needed.
109
+ - If enabling the live solver later, configure secrets in Hugging Face Space settings rather than hard-coding them.
110
+ - The local repo may also have an `hf` remote pointing at the Space repo; if so, pushes there will trigger Space rebuilds.
Dockerfile ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ FROM python:3.11-slim
2
+
3
+ WORKDIR /app
4
+ ENV PYTHONDONTWRITEBYTECODE=1 \
5
+ PYTHONUNBUFFERED=1
6
+
7
+ # Copy only the files needed to serve the packaged environment.
8
+ COPY pyproject.toml README.md openenv.yaml ./
9
+ COPY reasonbudget_gym ./reasonbudget_gym
10
+
11
+ # The Space serves the bundled cached environment, so it only needs the
12
+ # lightweight runtime deps plus an editable install of this package.
13
+ RUN pip install --no-cache-dir \
14
+ "fastapi>=0.110.0" \
15
+ "uvicorn[standard]>=0.29.0" \
16
+ "pydantic>=2.0" \
17
+ "numpy>=1.24" \
18
+ "hatchling" \
19
+ && pip install --no-cache-dir --no-deps -e .
20
+
21
+ EXPOSE 8000
22
+
23
+ CMD ["uvicorn", "reasonbudget_gym.server.app:app", "--host", "0.0.0.0", "--port", "8000"]
README.md ADDED
@@ -0,0 +1,150 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: ReasoningEconomicsEnv
3
+ sdk: docker
4
+ app_port: 8000
5
+ tags:
6
+ - openenv
7
+ - reasoning-economic-env
8
+ - rl
9
+ - math
10
+ ---
11
+
12
+ # ReasoningEconomicsEnv
13
+
14
+ **An RL environment for learning to allocate reasoning compute under budget constraints.**
15
+
16
+ > Modern reasoning models like DeepSeek-R1 "think" by generating internal tokens before
17
+ > answering. More tokens = deeper reasoning = better answers — but tokens cost compute and
18
+ > money. How should an agent decide how much to think on each problem?
19
+
20
+ ReasoningEconomicsEnv frames this as a sequential decision problem: an agent faces a series
21
+ of math questions with a fixed total token budget and must learn to **allocate tokens wisely**
22
+ — spending less on easy questions, more on hard ones.
23
+
24
+ Built on [Meta's OpenEnv framework](https://github.com/meta-pytorch/OpenEnv) for the
25
+ [AgentX–AgentBeats Competition](https://rdi.berkeley.edu/agentx-agentbeats) hosted by
26
+ Berkeley RDI.
27
+
28
+ ---
29
+
30
+ ## How It Works
31
+
32
+ ```
33
+ Episode (10 questions, 4000 token budget)
34
+ ┌─────────────────────────────────────────────────────────┐
35
+ │ 1. Agent observes: question embedding, remaining budget │
36
+ │ 2. Agent decides: token allocation (50–800) │
37
+ │ 3. Solver attempts question with that token limit │
38
+ │ 4. Reward = correctness − β·cost + γ·efficiency_bonus │
39
+ │ 5. Repeat until all questions answered or budget gone │
40
+ └─────────────────────────────────────────────────────────┘
41
+ ```
42
+
43
+ **Reward formula:** `R = correctness(+1 if correct / −0.1 if wrong) − β·(tokens_used/budget) + γ·(savings/budget)`
44
+
45
+ ---
46
+
47
+ ## Quick Start
48
+
49
+ ```bash
50
+ pip install -e .
51
+
52
+ # Run the OpenEnv server
53
+ uvicorn reasonbudget_gym.server.app:app --port 8000
54
+
55
+ # In another terminal — use the Python client
56
+ python -c "
57
+ from reasonbudget_gym.client import ReasonBudgetClient
58
+ client = ReasonBudgetClient()
59
+ obs = client.reset()
60
+ result = client.step(200)
61
+ print(result.reward, result.done)
62
+ "
63
+ ```
64
+
65
+ **Or run baseline evaluation locally:**
66
+
67
+ ```bash
68
+ python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
69
+ python -m reasonbudget_gym.eval.plots eval_results.json
70
+ ```
71
+
72
+ ---
73
+
74
+ ## Baselines
75
+
76
+ | Agent | Mean Accuracy | Mean Reward | Budget Used |
77
+ |-------|---------------|-------------|-------------|
78
+ | `uniform` | 0.780 | 7.620 | 100.0% |
79
+ | `greedy_max` | 0.840 | 4.163 | 100.0% |
80
+ | `oracle` | 0.728 | 6.933 | 98.3% |
81
+ | `bandit` | 0.744 | 6.526 | 98.8% |
82
+
83
+ Evaluation command:
84
+
85
+ ```bash
86
+ python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
87
+ ```
88
+
89
+ ![Baseline comparison](docs/agent_comparison.png)
90
+
91
+ ![Budget pacing](docs/budget_pacing.png)
92
+
93
+ ---
94
+
95
+ ## Observation Space
96
+
97
+ | Field | Shape | Description |
98
+ |-------|-------|-------------|
99
+ | `question_embedding` | 384-dim | Sentence-transformer encoding |
100
+ | `remaining_budget` | int | Tokens left in episode |
101
+ | `questions_remaining` | int | Questions left |
102
+ | `budget_per_remaining` | float | remaining / questions_left |
103
+ | `accuracy_so_far` | float | Running accuracy [0, 1] |
104
+ | `history` | list | Past (allocated, used, correct) tuples |
105
+
106
+ **Action:** integer token allocation, clamped to `[min_tokens, max_tokens]` and remaining budget.
107
+
108
+ ---
109
+
110
+ ## Data
111
+
112
+ The repo ships with a deterministic offline question bundle and response cache under
113
+ `reasonbudget_gym/data/`, so demos and tests work without external services.
114
+
115
+ A **synthetic cache** (`reasonbudget_gym/data/response_cache.json`) simulates realistic
116
+ DeepSeek-R1 accuracy curves across 4 difficulty tiers: `gsm8k`, `math_l1_l2`, `math_l3`,
117
+ `math_l4_l5`. The sampler also caches MiniLM embeddings to
118
+ `reasonbudget_gym/data/embeddings.npy` after the first run.
119
+
120
+ Regenerate the synthetic cache with:
121
+
122
+ ```bash
123
+ python reasonbudget_gym/data/generate_synthetic_cache.py
124
+ ```
125
+
126
+ ---
127
+
128
+ ## Deployment (Docker / HF Spaces)
129
+
130
+ ```bash
131
+ docker build -t reasoning-economic-env .
132
+ docker run -p 8000:8000 reasoning-economic-env
133
+ curl http://localhost:8000/health
134
+ ```
135
+
136
+ ---
137
+
138
+ ## Related Work
139
+
140
+ - **[MAS-TTS](https://github.com/jincan333/MAS-TTS):** Allocates reasoning across *agents* on
141
+ one problem vs. our approach of allocating across *questions* for a single agent.
142
+ - **[AgentTTS](https://arxiv.org/abs/2508.00890):** Test-time compute-optimal scaling across
143
+ multi-stage complex tasks.
144
+
145
+ ---
146
+
147
+ ## Citation
148
+
149
+ Part of the AgentX–AgentBeats Competition (Berkeley RDI, 2026).
150
+ Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) by Meta/PyTorch.
docs/agent_comparison.png ADDED

Git LFS Details

  • SHA256: 03e0666f33493659fdea9a45064a34556fa191a590544a2da2ed84bc65d21739
  • Pointer size: 130 Bytes
  • Size of remote file: 55.4 kB
docs/budget_pacing.png ADDED

Git LFS Details

  • SHA256: 2ab4d18ea1b3763880ac25fdd4ecd7f77b0289acf754dba9b7b85c3dfc54a525
  • Pointer size: 131 Bytes
  • Size of remote file: 142 kB
eval_results.json ADDED
The diff for this file is too large to render. See raw diff
 
openenv.yaml ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ spec_version: 1
2
+ name: reasoning-economic-env
3
+ type: space
4
+ runtime: fastapi
5
+ app: reasonbudget_gym.server.app:app
6
+ port: 8000
7
+ tags:
8
+ - openenv
9
+ - reasoning-economic-env
10
+ - rl
11
+ - math
12
+ - token-budget
pyproject.toml ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [build-system]
2
+ requires = ["hatchling"]
3
+ build-backend = "hatchling.build"
4
+
5
+ [project]
6
+ name = "reasonbudget-gym"
7
+ version = "0.1.0"
8
+ description = "RL environment for learning to allocate reasoning compute under budget constraints"
9
+ requires-python = ">=3.10"
10
+ dependencies = [
11
+ "fastapi>=0.110.0",
12
+ "uvicorn[standard]>=0.29.0",
13
+ "pydantic>=2.0",
14
+ "numpy>=1.24",
15
+ "datasets>=2.18.0",
16
+ "sentence-transformers>=2.7.0",
17
+ "matplotlib>=3.8",
18
+ "seaborn>=0.13",
19
+ ]
20
+
21
+ [project.optional-dependencies]
22
+ dev = [
23
+ "pytest>=8.0",
24
+ "httpx>=0.27",
25
+ ]
26
+ train = [
27
+ "torch>=2.2",
28
+ ]
29
+ live = [
30
+ "together>=1.2",
31
+ ]
32
+
33
+ [tool.hatch.build.targets.wheel]
34
+ packages = ["reasonbudget_gym"]
35
+
36
+ [tool.pytest.ini_options]
37
+ testpaths = ["reasonbudget_gym/tests"]
reasonbudget_gym/__init__.py ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ """ReasoningEconomicsEnv — token budget allocation RL environment."""
2
+
3
+ __version__ = "0.1.0"
reasonbudget_gym/baselines/__init__.py ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from .uniform import UniformBaseline
2
+ from .greedy_max import GreedyMaxBaseline
3
+ from .oracle import DifficultyOracleBaseline
4
+ from .bandit import LinUCBBaseline
5
+
6
+ __all__ = [
7
+ "UniformBaseline",
8
+ "GreedyMaxBaseline",
9
+ "DifficultyOracleBaseline",
10
+ "LinUCBBaseline",
11
+ ]
reasonbudget_gym/baselines/bandit.py ADDED
@@ -0,0 +1,82 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ LinUCB bandit baseline: learns token allocation from question embeddings.
3
+
4
+ Uses a simplified contextual bandit (LinUCB) where:
5
+ - Context = question embedding (384-dim, but we project to 16-dim for speed)
6
+ - Arms = discrete budget tiers [50, 100, 200, 400, 800]
7
+ - Reward = observed correctness signal
8
+ """
9
+ import math
10
+ import numpy as np
11
+ from ..env.models import Observation
12
+ from ..env.config import EnvConfig
13
+
14
PROJ_DIM = 16  # Projection dimension for efficiency


class LinUCBBaseline:
    """
    Linear UCB contextual bandit for token allocation.

    Projects 384-dim embeddings to PROJ_DIM via a fixed random projection,
    then maintains a separate LinUCB arm for each budget tier.
    Context = projected embedding + 4 budget/progress scalars;
    arms = the discrete budget tiers from the config.
    """

    name = "bandit"

    def __init__(self, config: "EnvConfig", alpha: float = 1.0, seed: int = 42):
        """
        Args:
            config: environment configuration (tiers, dims, budget totals).
            alpha: UCB exploration coefficient (larger = more exploration).
            seed: seed for the fixed random projection matrix.
        """
        self.config = config
        self.alpha = alpha
        self.tiers = config.budget_tiers
        # Use a private RandomState instead of np.random.seed() so we do not
        # clobber the process-global RNG for other components.
        # RandomState(seed).randn produces the exact same values the previous
        # global-seed implementation did.
        rng = np.random.RandomState(seed)
        # Fixed random projection matrix: embedding_dim -> PROJ_DIM.
        self._proj = rng.randn(config.embedding_dim, PROJ_DIM) / math.sqrt(PROJ_DIM)
        d = PROJ_DIM + 4  # projected embedding + 4 scalar features
        # Per-arm LinUCB parameters: A = I + sum(x x^T), b = sum(r x).
        self._A = {t: np.eye(d) for t in self.tiers}
        self._b = {t: np.zeros(d) for t in self.tiers}
        self._last_context = None
        self._last_arm = None

    def _context(self, obs: "Observation") -> np.ndarray:
        """Build the context vector: projected embedding + normalized scalars."""
        emb = np.array(obs.question_embedding, dtype=float)
        proj = emb @ self._proj
        scalars = np.array([
            obs.remaining_budget / self.config.total_budget,
            obs.questions_remaining / self.config.questions_per_episode,
            obs.budget_per_remaining / self.config.max_tokens,
            obs.accuracy_so_far,
        ])
        return np.concatenate([proj, scalars])

    def get_action(self, obs: "Observation") -> int:
        """Pick the affordable tier with the highest UCB score."""
        ctx = self._context(obs)
        self._last_context = ctx

        # Only consider tiers we can afford; fall back to the smallest tier.
        affordable = [t for t in self.tiers if t <= obs.remaining_budget]
        if not affordable:
            affordable = [self.tiers[0]]

        ucb_scores = {}
        for arm in affordable:
            A_inv = np.linalg.inv(self._A[arm])
            theta = A_inv @ self._b[arm]
            # UCB = point estimate + alpha * confidence width.
            ucb = theta @ ctx + self.alpha * math.sqrt(ctx @ A_inv @ ctx)
            ucb_scores[arm] = ucb

        best_arm = max(ucb_scores, key=ucb_scores.__getitem__)
        self._last_arm = best_arm
        return best_arm

    def update(self, reward: float):
        """Update LinUCB parameters after observing a reward for the last pull."""
        if self._last_context is None or self._last_arm is None:
            return
        ctx = self._last_context
        arm = self._last_arm
        self._A[arm] += np.outer(ctx, ctx)
        self._b[arm] += reward * ctx
        # Consume the pending (context, arm) pair so an accidental double
        # update() cannot apply the same observation twice.
        self._last_context = None
        self._last_arm = None

    def reset(self):
        """Clear per-episode transient state (learned parameters persist)."""
        self._last_context = None
        self._last_arm = None
reasonbudget_gym/baselines/greedy_max.py ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """GreedyMax baseline: always allocate the maximum fair share."""
2
+ from ..env.models import Observation
3
+ from ..env.config import EnvConfig
4
+
5
+
6
+ class GreedyMaxBaseline:
7
+ """Allocates max_tokens each step regardless of budget state."""
8
+
9
+ name = "greedy_max"
10
+
11
+ def __init__(self, config: EnvConfig):
12
+ self.config = config
13
+
14
+ def get_action(self, obs: Observation) -> int:
15
+ return min(self.config.max_tokens, obs.remaining_budget)
16
+
17
+ def reset(self):
18
+ pass
reasonbudget_gym/baselines/oracle.py ADDED
@@ -0,0 +1,46 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ DifficultyOracle baseline: knows question difficulty, allocates proportionally.
3
+
4
+ This is an upper bound — in practice the agent doesn't see difficulty labels.
5
+ """
6
+ from ..env.models import Observation
7
+ from ..env.config import EnvConfig
8
+
9
+ # Token multipliers by difficulty (relative units)
10
+ DIFFICULTY_MULTIPLIERS = {
11
+ "gsm8k": 0.5,
12
+ "math_l1_l2": 0.75,
13
+ "math_l3": 1.25,
14
+ "math_l4_l5": 2.0,
15
+ }
16
+ DEFAULT_MULTIPLIER = 1.0
17
+
18
+
19
class DifficultyOracleBaseline:
    """
    Oracle policy that scales its allocation by the known difficulty tier.

    The evaluation harness feeds it the true label via set_difficulty() after
    env.step() returns info; unseen labels fall back to a neutral multiplier
    (uniform behaviour). This is an upper-bound reference, not a fair agent.
    """

    name = "oracle"

    def __init__(self, config: EnvConfig):
        self.config = config
        self._current_difficulty = "gsm8k"

    def set_difficulty(self, difficulty: str):
        """Called by evaluation harness after env.step() returns info."""
        self._current_difficulty = difficulty

    def get_action(self, obs: Observation) -> int:
        multiplier = DIFFICULTY_MULTIPLIERS.get(
            self._current_difficulty, DEFAULT_MULTIPLIER
        )
        # Fair share of the remaining budget, scaled by difficulty.
        fair_share = obs.remaining_budget / max(1, obs.questions_remaining)
        scaled = int(fair_share * multiplier)
        # Clamp to the configured per-question range, then to what is left.
        bounded = min(max(scaled, self.config.min_tokens), self.config.max_tokens)
        bounded = min(bounded, obs.remaining_budget)
        # Final floor mirrors the env's minimum even when the budget is nearly gone.
        return max(self.config.min_tokens, bounded)

    def reset(self):
        self._current_difficulty = "gsm8k"
reasonbudget_gym/baselines/uniform.py ADDED
@@ -0,0 +1,22 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Uniform allocation baseline: split budget equally across all questions."""
2
+ from ..env.models import Observation
3
+ from ..env.config import EnvConfig
4
+
5
+
6
class UniformBaseline:
    """Even-split baseline: give every remaining question an equal share."""

    name = "uniform"

    def __init__(self, config: EnvConfig):
        self.config = config

    def get_action(self, obs: Observation) -> int:
        # Degenerate case: nothing left to answer, spend the minimum.
        if not obs.questions_remaining:
            return self.config.min_tokens
        share = int(obs.remaining_budget / obs.questions_remaining)
        lo, hi = self.config.min_tokens, self.config.max_tokens
        return min(max(share, lo), hi)

    def reset(self):
        pass
reasonbudget_gym/client.py ADDED
@@ -0,0 +1,51 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Remote HTTP client for the ReasonBudgetEnv server."""
2
+ from dataclasses import dataclass
3
+ from typing import Any, Dict, List, Optional
4
+
5
+ try:
6
+ import requests
7
+ _OK = True
8
+ except ImportError:
9
+ _OK = False
10
+
11
+
12
+ @dataclass
13
+ class RemoteObs:
14
+ question_embedding: List[float]
15
+ remaining_budget: int
16
+ questions_remaining: int
17
+ budget_per_remaining: float
18
+ accuracy_so_far: float
19
+ history: List[dict]
20
+ done: bool
21
+ episode_id: Optional[str] = None
22
+
23
+
24
+ @dataclass
25
+ class RemoteResult:
26
+ observation: RemoteObs
27
+ reward: float
28
+ done: bool
29
+ info: Dict[str, Any]
30
+
31
+
32
class ReasonBudgetClient:
    """Thin HTTP wrapper around a running ReasonBudgetEnv server."""

    def __init__(self, base_url: str = "http://localhost:8000"):
        if not _OK:
            raise ImportError("pip install requests")
        self.url = base_url.rstrip("/")

    def health(self):
        """Return the server's /health payload."""
        return requests.get(f"{self.url}/health").json()

    def info(self):
        """Return the server's /info payload."""
        return requests.get(f"{self.url}/info").json()

    def reset(self, seed=None) -> RemoteObs:
        """Start a new episode; returns the initial observation."""
        resp = requests.post(f"{self.url}/reset", json={"seed": seed})
        resp.raise_for_status()
        return RemoteObs(**resp.json())

    def step(self, token_allocation: int) -> RemoteResult:
        """Allocate tokens for the current question and advance the episode."""
        resp = requests.post(f"{self.url}/step", json={"token_allocation": token_allocation})
        resp.raise_for_status()
        payload = resp.json()
        return RemoteResult(
            observation=RemoteObs(**payload["observation"]),
            reward=payload["reward"],
            done=payload["done"],
            info=payload["info"],
        )
reasonbudget_gym/data/__init__.py ADDED
File without changes
reasonbudget_gym/data/embeddings.npy ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a2f5b49d773c0f605b508c0c2d2c736305cd346654b9b63af8d3a3856cd71f12
3
+ size 768128
reasonbudget_gym/data/generate_synthetic_cache.py ADDED
@@ -0,0 +1,156 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Generate a synthetic response cache for offline/demo use.
3
+
4
+ Simulates realistic DeepSeek-R1 accuracy curves across difficulty tiers
5
+ WITHOUT any API calls. Uses MetaMathQA dataset if available, otherwise
6
+ falls back to synthetic arithmetic questions.
7
+
8
+ Usage:
9
+ python reasonbudget_gym/data/generate_synthetic_cache.py
10
+ # or from package root:
11
+ python -m reasonbudget_gym.data.generate_synthetic_cache
12
+
13
+ Output: data/response_cache.json (~500 questions × 5 budget tiers)
14
+ """
15
+ import hashlib
16
+ import json
17
+ import random
18
+ import sys
19
+ from pathlib import Path
20
+
21
+ # ---------------------------------------------------------------------------
22
+ # Accuracy model (from task spec — based on real DeepSeek-R1 behaviour)
23
+ # ---------------------------------------------------------------------------
24
+ ACCURACY_TABLE = {
25
+ # 50 100 200 400 800
26
+ "gsm8k": [0.85, 0.92, 0.95, 0.96, 0.97],
27
+ "math_l1_l2":[0.55, 0.70, 0.80, 0.88, 0.92],
28
+ "math_l3": [0.25, 0.40, 0.55, 0.70, 0.80],
29
+ "math_l4_l5":[0.08, 0.15, 0.30, 0.50, 0.65],
30
+ }
31
+ BUDGET_TIERS = [50, 100, 200, 400, 800]
32
+ N_QUESTIONS = 500
33
+ EVAL_FRACTION = 0.1
34
+ SEED = 42
35
+
36
+ # ---------------------------------------------------------------------------
37
+ # Helpers
38
+ # ---------------------------------------------------------------------------
39
+
40
+ def _make_qid(text: str, idx: int) -> str:
41
+ h = hashlib.md5(f"{idx}:{text}".encode()).hexdigest()[:12]
42
+ return f"q_{h}"
43
+
44
+
45
+ def _infer_difficulty(item: dict) -> str:
46
+ source = item.get("type", "").lower()
47
+ if "gsm" in source or "grade" in source:
48
+ return "gsm8k"
49
+ resp = item.get("response", "")
50
+ n = len(resp.split())
51
+ if n < 80: return "gsm8k"
52
+ if n < 150: return "math_l1_l2"
53
+ if n < 250: return "math_l3"
54
+ return "math_l4_l5"
55
+
56
+
57
+ def _extract_answer(response: str) -> str:
58
+ import re
59
+ for pat in [r"[Tt]he answer is[:\s]+([^\n.]+)", r"####\s*(.+)", r"=\s*([^\n]+)$"]:
60
+ m = re.search(pat, response)
61
+ if m:
62
+ return m.group(1).strip()
63
+ lines = [l.strip() for l in response.split("\n") if l.strip()]
64
+ return lines[-1] if lines else "unknown"
65
+
66
+
67
def _load_questions(n: int):
    """Load up to *n* questions, preferring MetaMathQA, else synthetic ones.

    Returns a list of (qid, question_text, answer, difficulty) tuples.
    """
    try:
        # Optional dependency: only taken when `datasets` is installed.
        from datasets import load_dataset
        ds = load_dataset("meta-math/MetaMathQA", split="train", trust_remote_code=True)
        items = list(ds.select(range(min(n, len(ds)))))
        questions = []
        for idx, item in enumerate(items):
            qtext = item.get("query", f"Question {idx}")
            answer = _extract_answer(item.get("response", ""))
            diff = _infer_difficulty(item)
            questions.append((_make_qid(qtext, idx), qtext, answer, diff))
        print(f" Loaded {len(questions)} questions from MetaMathQA")
        return questions
    except Exception as e:
        # Broad catch is deliberate: any import/network/dataset failure falls
        # back to the dependency-free synthetic generator.
        print(f" Dataset unavailable ({e}), using synthetic questions")
        return _synthetic_questions(n)
83
+
84
+
85
def _synthetic_questions(n: int):
    """Pure-Python fallback — no external deps."""
    rng = random.Random(SEED)
    templates = [
        ("If {a} people each have {b} apples, how many apples total?", lambda a, b, c: a * b),
        ("What is {a} + {b} * {c}?", lambda a, b, c: a + b * c),
        ("A car travels at {a} km/h for {b} hours. Distance?", lambda a, b, c: a * b),
        ("Solve: {a}x = {b}. What is x?", lambda a, b, c: b // a if a else 0),
    ]
    difficulties = ["gsm8k", "math_l1_l2", "math_l3", "math_l4_l5"]
    questions = []
    for i in range(n):
        # Draw the three operands in the same order as before so the RNG
        # stream (and thus the generated set) stays identical.
        a = rng.randint(2, 20)
        b = rng.randint(1, 15)
        c = rng.randint(1, 10)
        template, solve = templates[i % len(templates)]
        text = template.format(a=a, b=b, c=c)
        answer = str(solve(a, b, c))
        # Rotate through tiers so every difficulty is represented evenly.
        tier = difficulties[i % len(difficulties)]
        questions.append((_make_qid(text, i), text, answer, tier))
    return questions
104
+
105
+
106
+ # ---------------------------------------------------------------------------
107
+ # Cache generation
108
+ # ---------------------------------------------------------------------------
109
+
110
def generate_cache(output_path: str | None = None):
    """Generate the synthetic response cache and write it as JSON.

    Simulates per-tier correctness from ACCURACY_TABLE and token usage at
    70-95% of each tier's budget, then holds out the tail EVAL_FRACTION of
    questions as an eval split.

    Args:
        output_path: destination file; defaults to response_cache.json next
            to this module.

    Returns:
        The path the cache was written to.
    """
    if output_path is None:
        # Resolve relative to this file's parent directory so the script
        # works regardless of the current working directory.
        output_path = str(Path(__file__).parent / "response_cache.json")

    rng = random.Random(SEED)
    print(f"Generating synthetic cache ({N_QUESTIONS} questions × {len(BUDGET_TIERS)} tiers)...")
    questions = _load_questions(N_QUESTIONS)

    entries = {}
    for qid, qtext, answer, difficulty in questions:
        entries[qid] = {}
        acc_curve = ACCURACY_TABLE[difficulty]
        for tier_idx, tier in enumerate(BUDGET_TIERS):
            # Sample correctness from the tier's accuracy curve.
            p_correct = acc_curve[tier_idx]
            correct = rng.random() < p_correct
            # tokens_used: 70-95% of budget with some noise
            used = int(tier * rng.uniform(0.70, 0.95))
            entries[qid][str(tier)] = {
                "answer": answer if correct else "unknown",
                "was_correct": correct,
                "tokens_used": used,
                "response_text": "[synthetic cache entry]",
            }

    # Hold out the tail fraction of questions as the eval split. Guard the
    # n_eval == 0 case: questions[-0:] would otherwise select the WHOLE
    # list, marking every question as held out.
    n_eval = int(len(questions) * EVAL_FRACTION)
    eval_ids = [qid for qid, *_ in questions[-n_eval:]] if n_eval else []

    cache = {"entries": entries, "eval_ids": eval_ids}
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(cache, f)

    n_correct = sum(
        1 for qdata in entries.values()
        for entry in qdata.values() if entry["was_correct"]
    )
    total = len(entries) * len(BUDGET_TIERS)
    print(f" Written: {output_path}")
    print(f" Questions: {len(entries)} | Overall accuracy: {n_correct/total:.1%}")
    print(f" Eval holdout: {len(eval_ids)} questions")
    return output_path
+
153
+
154
+ if __name__ == "__main__":
155
+ path = sys.argv[1] if len(sys.argv) > 1 else None
156
+ generate_cache(path)
reasonbudget_gym/data/response_cache.json ADDED
The diff for this file is too large to render. See raw diff
 
reasonbudget_gym/env/__init__.py ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ from .reason_budget_env import ReasonBudgetEnv
2
+ from .config import EnvConfig
3
+ from .models import Observation, StepResult, QuestionMeta
4
+
5
+ __all__ = ["ReasonBudgetEnv", "EnvConfig", "Observation", "StepResult", "QuestionMeta"]
reasonbudget_gym/env/config.py ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Environment configuration dataclass."""
2
+ from dataclasses import dataclass, field
3
+ from typing import List
4
+
5
+
6
@dataclass
class EnvConfig:
    """Configuration for the ReasonBudgetEnv."""

    # Token budget
    total_budget: int = 4000   # total tokens available per episode
    min_tokens: int = 50       # floor for a single per-question allocation
    max_tokens: int = 800      # ceiling for a single per-question allocation
    budget_tiers: List[int] = field(default_factory=lambda: [50, 100, 200, 400, 800])

    # Episode structure
    questions_per_episode: int = 10
    seed: int = 42             # RNG seed for reproducible episode sampling

    # Reward weights (consumed by env.reward.compute_reward)
    correct_reward: float = 1.0
    wrong_penalty: float = -0.1
    cost_penalty_weight: float = 0.0002  # β
    efficiency_bonus_weight: float = 0.3  # γ

    # Solver
    solver_type: str = "cached"  # "cached" | "live"
    cache_path: str = "data/response_cache.json"  # resolved relative to package root or CWD

    # Dataset
    dataset_name: str = "meta-math/MetaMathQA"
    max_cache_questions: int = 500  # cap on questions loaded into the sampler

    # Embedding
    embedding_dim: int = 384
    embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    embedding_cache_path: str = "data/embeddings.npy"
reasonbudget_gym/env/episode_sampler.py ADDED
@@ -0,0 +1,210 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Sample episodes (sequences of questions) from the dataset."""
2
+ import hashlib
3
+ import random
4
+ from pathlib import Path
5
+ from typing import List
6
+
7
+ from .config import EnvConfig
8
+ from .models import QuestionMeta
9
+
10
+
11
+ # Difficulty label heuristics based on dataset origin
12
+ _DIFFICULTY_LEVELS = ["gsm8k", "math_l1_l2", "math_l3", "math_l4_l5"]
13
+
14
+
15
+ def _infer_difficulty(item: dict) -> str:
16
+ """Heuristically assign difficulty from MetaMathQA metadata."""
17
+ source = item.get("type", "").lower()
18
+ query = item.get("query", "")
19
+
20
+ if "gsm" in source or "grade" in source:
21
+ return "gsm8k"
22
+
23
+ # Use a simple proxy: length of the chain of thought as difficulty signal
24
+ response = item.get("response", "")
25
+ cot_len = len(response.split())
26
+ if cot_len < 80:
27
+ return "gsm8k"
28
+ elif cot_len < 150:
29
+ return "math_l1_l2"
30
+ elif cot_len < 250:
31
+ return "math_l3"
32
+ else:
33
+ return "math_l4_l5"
34
+
35
+
36
+ def _extract_answer(response: str) -> str:
37
+ """Extract the final answer from a MetaMathQA response."""
38
+ # MetaMathQA answers follow "The answer is X" or "#### X"
39
+ import re
40
+ patterns = [
41
+ r"[Tt]he answer is[:\s]+([^\n.]+)",
42
+ r"####\s*(.+)",
43
+ r"=\s*([^\n]+)$",
44
+ ]
45
+ for pat in patterns:
46
+ m = re.search(pat, response)
47
+ if m:
48
+ return m.group(1).strip()
49
+ # Fallback: last non-empty line
50
+ lines = [l.strip() for l in response.split("\n") if l.strip()]
51
+ return lines[-1] if lines else "unknown"
52
+
53
+
54
+ def _make_question_id(text: str, idx: int) -> str:
55
+ h = hashlib.md5(f"{idx}:{text}".encode()).hexdigest()[:12]
56
+ return f"q_{h}"
57
+
58
+
59
class EpisodeSampler:
    """Loads questions from MetaMathQA and provides episode batches.

    Question loading is lazy (first call to sample_episode or
    get_all_questions) and happens exactly once per instance.
    """

    def __init__(self, config: EnvConfig, split: str = "train"):
        self.config = config
        # Dedicated RNG so episode sampling is reproducible per config.seed.
        self.rng = random.Random(config.seed)
        self._questions: List[QuestionMeta] = []
        self._loaded = False  # guards the one-time _load()
        self._split = split

    def _resolve_cache_path(self) -> Path:
        """Resolve the solver response-cache path relative to the project."""
        return self._resolve_project_path(self.config.cache_path)

    def _resolve_project_path(self, relative_path: str) -> Path:
        """Resolve a relative path against the package root, else the CWD.

        Prefers the package-relative location; the CWD-relative path is
        used only when it exists and the package-relative one does not.
        """
        path = Path(relative_path)
        if path.is_absolute():
            return path

        pkg_root = Path(__file__).resolve().parent.parent
        package_relative = pkg_root / relative_path
        cwd_relative = Path.cwd() / relative_path
        if package_relative.exists() or not cwd_relative.exists():
            return package_relative
        return cwd_relative

    def _load_bundled_questions(self) -> List[dict]:
        """Return deterministic synthetic questions that align with the shipped cache."""
        from ..data.generate_synthetic_cache import _synthetic_questions

        questions = []
        for qid, qtext, answer, difficulty in _synthetic_questions(self.config.max_cache_questions):
            questions.append({
                "question_id": qid,
                "query": qtext,
                "answer": answer,
                "difficulty": difficulty,
            })
        return questions

    def _apply_embeddings(self, embeddings) -> None:
        # Pair questions with embedding rows; zip stops at the shorter side.
        for question, embedding in zip(self._questions, embeddings):
            question.embedding = [float(x) for x in embedding]

    def _load_embeddings(self) -> None:
        """Attach sentence embeddings, reusing a cached .npy file when valid."""
        import numpy as np

        if not self._questions:
            return

        expected_shape = (len(self._questions), self.config.embedding_dim)
        embeddings_path = self._resolve_project_path(self.config.embedding_cache_path)

        # Fast path: reuse a cached embedding matrix of exactly the right shape.
        if embeddings_path.exists():
            try:
                cached = np.load(embeddings_path)
                if cached.shape == expected_shape:
                    self._apply_embeddings(cached)
                    return
            except Exception:
                pass  # corrupt/unreadable cache: fall through and recompute

        try:
            from sentence_transformers import SentenceTransformer

            model = SentenceTransformer(self.config.embedding_model)
            texts = [question.question_text for question in self._questions]
            embeddings = np.asarray(
                model.encode(texts, show_progress_bar=False),
                dtype=np.float32,
            )
            if embeddings.shape != expected_shape:
                return
            embeddings_path.parent.mkdir(parents=True, exist_ok=True)
            np.save(embeddings_path, embeddings)
            self._apply_embeddings(embeddings)
        except Exception:
            # Fall back to zero embeddings when the model is unavailable.
            return

    def _load(self):
        """Populate self._questions once; subsequent calls are no-ops."""
        if self._loaded:
            return

        cache_path = self._resolve_cache_path()
        # Cached-solver mode with a shipped cache: use the bundled synthetic
        # questions whose ids match the cache entries.
        if self.config.solver_type == "cached" and cache_path.exists():
            items = self._load_bundled_questions()
        else:
            try:
                from datasets import load_dataset

                ds = load_dataset(
                    self.config.dataset_name,
                    split=self._split,
                    trust_remote_code=True,
                )
                # Limit to first max_cache_questions for speed
                items = list(ds.select(range(min(self.config.max_cache_questions, len(ds)))))
            except Exception:
                # Offline fallback: generate synthetic questions
                items = self._synthetic_questions(self.config.max_cache_questions)

        self._questions = []
        for idx, item in enumerate(items):
            qtext = item.get("query", item.get("question", f"Question {idx}"))
            response = item.get("response", item.get("answer", ""))
            answer = _extract_answer(response)
            difficulty = item.get("difficulty", _infer_difficulty(item))
            qid = item.get("question_id", _make_question_id(qtext, idx))
            # Embedding is a zero-vector placeholder; real embedding done lazily
            self._questions.append(QuestionMeta(
                question_id=qid,
                question_text=qtext,
                ground_truth=answer,
                difficulty=difficulty,
                embedding=[0.0] * self.config.embedding_dim,
            ))
        self._load_embeddings()
        self._loaded = True

    def _synthetic_questions(self, n: int) -> List[dict]:
        """Generate synthetic questions when dataset is unavailable.

        NOTE(review): every template's answer is computed as a + b * c, so
        only the arithmetic template's text actually matches its answer —
        fine for an offline smoke-test corpus, not semantically consistent.
        """
        templates = [
            ("If a train travels at {v} km/h for {t} hours, how far does it go?", "{r}"),
            ("What is {a} + {b} * {c}?", "{r}"),
            ("Solve for x: {a}x + {b} = {c}", "{r}"),
            ("A store sells apples for ${p} each. How much do {n} apples cost?", "${r}"),
        ]
        items = []
        rng = random.Random(42)  # fixed seed: the corpus is deterministic
        difficulties = ["gsm8k", "math_l1_l2", "math_l3", "math_l4_l5"]
        for i in range(n):
            tmpl, ans_tmpl = templates[i % len(templates)]
            a, b, c = rng.randint(1, 20), rng.randint(1, 10), rng.randint(1, 15)
            v, t, p, nn = rng.randint(60, 200), rng.randint(1, 10), rng.randint(1, 5), rng.randint(2, 20)
            r = a + b * c
            query = tmpl.format(v=v, t=t, a=a, b=b, c=c, p=p, n=nn)
            answer = str(r)
            diff = difficulties[i % len(difficulties)]
            # Fake CoT length based on difficulty
            cot_lengths = {"gsm8k": 50, "math_l1_l2": 120, "math_l3": 200, "math_l4_l5": 300}
            fake_cot = " word" * cot_lengths[diff]
            items.append({"query": query, "response": f"{fake_cot} The answer is {answer}", "type": diff})
        return items

    def sample_episode(self) -> List[QuestionMeta]:
        """Sample a sequence of questions_per_episode questions.

        Uses random.choices, i.e. sampling WITH replacement — a question
        may repeat within one episode.
        """
        self._load()
        return self.rng.choices(self._questions, k=self.config.questions_per_episode)

    def get_all_questions(self) -> List[QuestionMeta]:
        """Return a shallow copy of the full loaded question list."""
        self._load()
        return list(self._questions)
reasonbudget_gym/env/models.py ADDED
@@ -0,0 +1,43 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Data models shared across the environment."""
2
+ from dataclasses import dataclass, field
3
+ from typing import List, Optional
4
+
5
+
6
@dataclass
class QuestionMeta:
    """Metadata for a single question in an episode."""
    question_id: str        # stable id; also keys the solver response cache
    question_text: str      # the question shown to the solver
    ground_truth: str       # canonical final answer string
    difficulty: str         # gsm8k | math_l1_l2 | math_l3 | math_l4_l5
    embedding: List[float]  # 384-dim sentence-transformer encoding
14
+
15
+
16
@dataclass
class StepInfo:
    """Record of one completed step."""
    tokens_allocated: int  # tokens granted to the solver (charged to the budget)
    tokens_used: int       # tokens actually consumed (capped at tokens_allocated)
    was_correct: bool      # whether the solver's answer matched ground truth
22
+
23
+
24
@dataclass
class Observation:
    """Full observation returned to the agent at each step."""
    question_embedding: List[float]   # 384-dim; zero vector when episode is over
    remaining_budget: int             # tokens left in the episode budget
    questions_remaining: int          # questions not yet attempted
    budget_per_remaining: float       # remaining_budget / questions_remaining (0.0 when none left)
    accuracy_so_far: float            # fraction of attempted questions answered correctly
    history: List[StepInfo]           # one entry per completed step
    done: bool = False                # True once the episode has ended
    episode_id: Optional[str] = None  # short uuid assigned at reset()
35
+
36
+
37
@dataclass
class StepResult:
    """Result of env.step()."""
    observation: Observation  # next observation after the step
    reward: float             # scalar step reward (see env.reward.compute_reward)
    done: bool                # True when the episode terminated on this step
    info: dict = field(default_factory=dict)  # diagnostics: correctness, tokens, difficulty, ...
reasonbudget_gym/env/reason_budget_env.py ADDED
@@ -0,0 +1,167 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ReasonBudgetEnv — core RL environment for token budget allocation.
3
+
4
+ Compatible with OpenEnv v0.2.x (step/reset interface).
5
+ """
6
+ import uuid
7
+ from typing import Optional
8
+
9
+ from .config import EnvConfig
10
+ from .episode_sampler import EpisodeSampler
11
+ from .models import Observation, StepInfo, StepResult
12
+ from .reward import compute_reward
13
+
14
+
15
class ReasonBudgetEnv:
    """
    Sequential token budget allocation environment.

    At each step the agent observes the current question embedding plus
    budget state and chooses how many tokens to allocate. The solver
    attempts the question with that limit and returns a correctness signal.
    """

    def __init__(self, config: Optional[EnvConfig] = None):
        self.config = config or EnvConfig()
        self.sampler = EpisodeSampler(self.config)
        self._solver = None  # created lazily on first access (see `solver`)
        self._reset_state()

    # ------------------------------------------------------------------
    # Solver lazy init
    # ------------------------------------------------------------------
    @property
    def solver(self):
        """Lazily instantiate the solver selected by config.solver_type."""
        if self._solver is None:
            if self.config.solver_type == "cached":
                from ..solver.cached_solver import CachedSolver
                self._solver = CachedSolver(self.config)
            else:
                from ..solver.live_solver import LiveSolver
                self._solver = LiveSolver(self.config)
        return self._solver

    # ------------------------------------------------------------------
    # Core interface
    # ------------------------------------------------------------------
    def reset(self) -> Observation:
        """Start a new episode. Returns first observation."""
        self._reset_state()
        self._episode = self.sampler.sample_episode()
        self._episode_id = str(uuid.uuid4())[:8]
        return self._make_observation()

    def step(self, token_allocation: int) -> StepResult:
        """Take a step: allocate tokens to the current question.

        The allocation is clamped to [min_tokens, max_tokens] and to the
        remaining budget (with a min_tokens floor even if that overspends
        the budget on the final question).

        Raises:
            RuntimeError: if called after the episode has finished.
        """
        if self._done:
            raise RuntimeError("Episode is done. Call reset() first.")

        # Clamp allocation to valid range and remaining budget
        token_allocation = max(self.config.min_tokens,
                               min(token_allocation, self.config.max_tokens))
        token_allocation = min(token_allocation, self._remaining_budget)
        if token_allocation < self.config.min_tokens:
            token_allocation = self.config.min_tokens  # allow overspend on last Q

        question = self._episode[self._step_idx]
        budget_before = self._remaining_budget

        # Solve
        result = self.solver.solve(
            question_id=question.question_id,
            question_text=question.question_text,
            ground_truth=question.ground_truth,
            token_budget=token_allocation,
        )

        # Accounting — the budget is charged the full ALLOCATION, not just
        # the tokens the solver actually used.
        tokens_used = min(result.tokens_used, token_allocation)
        self._remaining_budget -= token_allocation
        self._step_idx += 1
        self._history.append(StepInfo(
            tokens_allocated=token_allocation,
            tokens_used=tokens_used,
            was_correct=result.was_correct,
        ))
        if result.was_correct:
            self._correct_count += 1

        # Reward
        reward = compute_reward(
            was_correct=result.was_correct,
            tokens_allocated=token_allocation,
            tokens_used=tokens_used,
            remaining_budget_before=budget_before,
            questions_remaining=len(self._episode) - self._step_idx,
            config=self.config,
        )

        # Done? Either all questions answered or too little budget left
        # to fund even a minimum allocation.
        self._done = (
            self._step_idx >= len(self._episode)
            or self._remaining_budget < self.config.min_tokens
        )

        obs = self._make_observation()
        return StepResult(
            observation=obs,
            reward=reward,
            done=self._done,
            info={
                "was_correct": result.was_correct,
                "tokens_allocated": token_allocation,
                "tokens_used": tokens_used,
                "difficulty": question.difficulty,
                "remaining_budget": self._remaining_budget,
                "step": self._step_idx,
            },
        )

    # ------------------------------------------------------------------
    # Helpers
    # ------------------------------------------------------------------
    def _reset_state(self):
        """Zero out all per-episode bookkeeping."""
        self._episode = []
        self._episode_id = None
        self._step_idx = 0
        self._remaining_budget = self.config.total_budget
        self._history = []
        self._correct_count = 0
        self._done = False

    def _make_observation(self) -> Observation:
        """Build the observation for the CURRENT step index.

        After the final question, the embedding is a zero vector.
        """
        if self._step_idx < len(self._episode):
            q = self._episode[self._step_idx]
            emb = q.embedding
        else:
            emb = [0.0] * self.config.embedding_dim

        n_remaining = max(0, len(self._episode) - self._step_idx)
        bpr = self._remaining_budget / n_remaining if n_remaining > 0 else 0.0
        acc = self._correct_count / self._step_idx if self._step_idx > 0 else 0.0

        return Observation(
            question_embedding=emb,
            remaining_budget=self._remaining_budget,
            questions_remaining=n_remaining,
            budget_per_remaining=bpr,
            accuracy_so_far=acc,
            history=list(self._history),
            done=self._done,
            episode_id=self._episode_id,
        )

    @property
    def observation_dim(self) -> int:
        """Flat observation dimension (embedding + 4 scalars)."""
        return self.config.embedding_dim + 4

    def flat_observation(self, obs: Observation) -> list:
        """Flatten observation to a list of floats for the policy.

        The four scalars are normalized to roughly [0, 1] using the
        corresponding config maxima.
        """
        scalars = [
            obs.remaining_budget / self.config.total_budget,
            obs.questions_remaining / self.config.questions_per_episode,
            obs.budget_per_remaining / self.config.max_tokens,
            obs.accuracy_so_far,
        ]
        return list(obs.question_embedding) + scalars
reasonbudget_gym/env/reward.py ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Reward function for the ReasonBudgetEnv."""
2
+ from .config import EnvConfig
3
+
4
+
5
def compute_reward(
    was_correct: bool,
    tokens_allocated: int,
    tokens_used: int,
    remaining_budget_before: int,
    questions_remaining: int,
    config: EnvConfig,
) -> float:
    """
    Compute the per-step reward.

    R = correctness_term − β * cost_penalty + γ * efficiency_bonus

    correctness_term: +correct_reward if correct, else wrong_penalty
    cost_penalty: tokens_used / total_budget (fractional spend)
    efficiency_bonus: savings as fraction of budget if correct,
        0 if wrong (no credit for being cheap and wrong)
    """
    correctness_term = config.correct_reward if was_correct else config.wrong_penalty

    # Fractional token spend for this step.
    cost_penalty = tokens_used / config.total_budget

    # Underspend bonus is granted only for correct answers.
    efficiency_bonus = 0.0
    if was_correct and tokens_allocated > 0:
        savings = max(0, tokens_allocated - tokens_used)
        efficiency_bonus = savings / config.total_budget

    return float(
        correctness_term
        - config.cost_penalty_weight * cost_penalty
        + config.efficiency_bonus_weight * efficiency_bonus
    )
+ return float(reward)
reasonbudget_gym/eval/__init__.py ADDED
File without changes
reasonbudget_gym/eval/evaluate.py ADDED
@@ -0,0 +1,121 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Evaluation harness: run all 4 baselines for N episodes and collect metrics.
3
+
4
+ Usage:
5
+ python -m reasonbudget_gym.eval.evaluate --n_episodes 50 --seed 42 --output eval_results.json
6
+ """
7
+ import argparse
8
+ import json
9
+ import numpy as np
10
+
11
+ from ..env.config import EnvConfig
12
+ from ..env.reason_budget_env import ReasonBudgetEnv
13
+ from ..baselines import UniformBaseline, GreedyMaxBaseline, DifficultyOracleBaseline, LinUCBBaseline
14
+
15
+
16
def run_episode(env, agent, config):
    """Roll out one full episode and collect per-step metrics.

    Works with any agent exposing get_action(obs); the optional hooks
    set_difficulty (oracle-style agents) and update (learning agents)
    are invoked only when present.
    """
    obs = env.reset()
    # Oracle-style agents are told the FIRST question's difficulty up front.
    if hasattr(agent, "set_difficulty") and env._episode:
        agent.set_difficulty(env._episode[0].difficulty)
    done = False
    total_reward = 0.0
    correct = 0
    tokens = 0
    steps = 0
    per_step = []

    while not done:
        action = agent.get_action(obs)
        result = env.step(action)
        # Learning agents (e.g. bandits) update online from the step reward.
        if hasattr(agent, "update"):
            agent.update(result.reward)
        # Reveal the NEXT question's difficulty before the next get_action.
        if hasattr(agent, "set_difficulty"):
            ep = env._episode
            if env._step_idx < len(ep):
                agent.set_difficulty(ep[env._step_idx].difficulty)

        total_reward += result.reward
        if result.info.get("was_correct"):
            correct += 1
        tokens += result.info.get("tokens_allocated", 0)
        steps += 1
        per_step.append({
            "step": steps,
            "tokens_allocated": result.info.get("tokens_allocated", 0),
            "was_correct": result.info.get("was_correct", False),
            "difficulty": result.info.get("difficulty", "unknown"),
            "remaining_budget": result.info.get("remaining_budget", 0),
            "reward": result.reward,
        })
        done = result.done
        obs = result.observation

    return {
        "total_reward": total_reward,
        "accuracy": correct / max(1, steps),  # guard against zero-step episodes
        "total_tokens_used": tokens,          # sum of ALLOCATED tokens
        "budget_utilization": tokens / config.total_budget,
        "steps": steps,
        "per_step": per_step,
    }
61
+
62
+
63
def evaluate_agent(name, agent, config, n_episodes, seed):
    """Run one agent for n_episodes and aggregate reward/accuracy stats.

    Each episode reseeds the sampler with seed + episode_index so runs
    are reproducible and comparable across agents.
    """
    env = ReasonBudgetEnv(config)
    agent.reset()

    episodes = []
    for offset in range(n_episodes):
        episode_seed = seed + offset
        env.config.seed = episode_seed
        env.sampler.rng.seed(episode_seed)
        episodes.append(run_episode(env, agent, config))

    def _stats(key):
        # Mean/std over episodes for one scalar metric.
        values = [ep[key] for ep in episodes]
        return float(np.mean(values)), float(np.std(values))

    mean_reward, std_reward = _stats("total_reward")
    mean_accuracy, std_accuracy = _stats("accuracy")
    mean_utilization, _ = _stats("budget_utilization")
    return {
        "agent": name,
        "n_episodes": n_episodes,
        "mean_reward": mean_reward,
        "std_reward": std_reward,
        "mean_accuracy": mean_accuracy,
        "std_accuracy": std_accuracy,
        "mean_budget_utilization": mean_utilization,
        "episodes": episodes,
    }
85
+
86
+
87
def main():
    """CLI entry point: evaluate every baseline and dump JSON results."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_episodes", type=int, default=50)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--output", type=str, default="eval_results.json")
    args = parser.parse_args()

    config = EnvConfig(seed=args.seed)
    agents = {
        "uniform": UniformBaseline(config),
        "greedy_max": GreedyMaxBaseline(config),
        "oracle": DifficultyOracleBaseline(config),
        "bandit": LinUCBBaseline(config, seed=args.seed),
    }

    print(f"Evaluating {len(agents)} agents × {args.n_episodes} episodes")
    results = {}
    for agent_name, agent in agents.items():
        print(f" {agent_name}...", end=" ", flush=True)
        summary = evaluate_agent(agent_name, agent, config, args.n_episodes, args.seed)
        results[agent_name] = summary
        print(f"acc={summary['mean_accuracy']:.3f} reward={summary['mean_reward']:.3f}")

    # Summary table.
    print(f"\n{'Agent':<15} {'Accuracy':>10} {'Reward':>10} {'Budget%':>10}")
    print("-" * 50)
    for agent_name, summary in results.items():
        print(f"{agent_name:<15} {summary['mean_accuracy']:>10.3f} {summary['mean_reward']:>10.3f} {summary['mean_budget_utilization']:>9.1%}")

    with open(args.output, "w") as f:
        json.dump(results, f, indent=2)
    print(f"\nSaved to {args.output}")


if __name__ == "__main__":
    main()
reasonbudget_gym/eval/plots.py ADDED
@@ -0,0 +1,89 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Generate comparison charts from eval_results.json."""
2
+ import json
3
+ from pathlib import Path
4
+
5
+
6
def agent_comparison(results_path: str, output_path: str = "docs/agent_comparison.png"):
    """Render a two-panel bar chart (accuracy, reward) comparing agents.

    Reads the evaluation results JSON and writes a PNG to output_path,
    creating parent directories as needed.
    """
    import numpy as np
    import matplotlib
    matplotlib.use("Agg")  # headless backend: no display required
    import matplotlib.pyplot as plt

    with open(results_path) as f:
        results = json.load(f)

    agents = list(results.keys())
    accs = [results[a]["mean_accuracy"] for a in agents]
    rewards = [results[a]["mean_reward"] for a in agents]
    acc_stds = [results[a]["std_accuracy"] for a in agents]
    x = np.arange(len(agents))
    # NOTE(review): exactly four colors; with more than four agents the
    # color list no longer matches the bar count — confirm if agents grow.
    colors = ["#4C72B0", "#DD8452", "#55A868", "#C44E52"]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    fig.suptitle("Baseline Agent Comparison — ReasoningEconomicsEnv", fontsize=13, fontweight="bold")

    # Left panel: accuracy with std error bars and per-bar value labels.
    bars1 = ax1.bar(x, accs, 0.5, yerr=acc_stds, capsize=4, color=colors, alpha=0.85)
    ax1.set_title("Mean Accuracy")
    ax1.set_xticks(x); ax1.set_xticklabels(agents, rotation=15)
    ax1.set_ylim(0, 1.1)
    for b, v in zip(bars1, accs):
        ax1.text(b.get_x() + b.get_width()/2, v + 0.03, f"{v:.3f}", ha="center", fontsize=9)

    # Right panel: mean episode reward with per-bar value labels.
    bars2 = ax2.bar(x, rewards, 0.5, color=colors, alpha=0.85)
    ax2.set_title("Mean Episode Reward")
    ax2.set_xticks(x); ax2.set_xticklabels(agents, rotation=15)
    for b, v in zip(bars2, rewards):
        ax2.text(b.get_x() + b.get_width()/2, v + abs(v)*0.03 + 0.01, f"{v:.3f}", ha="center", fontsize=9)

    plt.tight_layout()
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(output_path, dpi=150, bbox_inches="tight")
    plt.close()
    print(f"Saved: {output_path}")
43
+
44
+
45
def budget_pacing(results_path: str, output_path: str = "docs/budget_pacing.png", budget: int = 4000):
    """Plot mean cumulative token spend per question for every agent.

    Reads the evaluation results JSON, plots mean ± 1 SD cumulative
    allocation curves per agent, and draws the episode budget as a
    horizontal reference line.

    Args:
        results_path: JSON file produced by the evaluation harness.
        output_path: destination PNG path (parents created as needed).
        budget: episode token budget drawn as the reference line.
            Default 4000 preserves the previously hard-coded value
            (matches EnvConfig.total_budget's default).
    """
    import numpy as np
    import matplotlib
    matplotlib.use("Agg")  # headless backend: no display required
    import matplotlib.pyplot as plt

    with open(results_path) as f:
        results = json.load(f)

    fig, ax = plt.subplots(figsize=(10, 5))
    colors = {"uniform": "#4C72B0", "greedy_max": "#DD8452", "oracle": "#55A868", "bandit": "#C44E52"}

    for name, res in results.items():
        episodes = res["episodes"]
        max_steps = max(len(ep["per_step"]) for ep in episodes)
        # Cumulative-spend matrix [episode, step]; episodes that ended early
        # are padded with their final total so the mean stays flat.
        mat = np.zeros((len(episodes), max_steps))
        for i, ep in enumerate(episodes):
            cumsum = 0
            for j, s in enumerate(ep["per_step"]):
                cumsum += s["tokens_allocated"]
                mat[i, j] = cumsum
            mat[i, len(ep["per_step"]):] = cumsum

        mean = mat.mean(0); std = mat.std(0)
        steps = np.arange(1, max_steps + 1)
        c = colors.get(name, "gray")
        ax.plot(steps, mean, label=name, color=c, linewidth=2)
        ax.fill_between(steps, mean - std, mean + std, color=c, alpha=0.15)

    # Budget reference line (parameterized; was hard-coded at 4000).
    ax.axhline(y=budget, color="black", linestyle="--", linewidth=1.5, label=f"Budget ({budget})")
    ax.set_xlabel("Question #"); ax.set_ylabel("Cumulative Tokens")
    ax.set_title("Budget Pacing by Agent — Mean ± 1 SD", fontweight="bold")
    ax.legend(); ax.grid(True, alpha=0.3)
    plt.tight_layout()
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(output_path, dpi=150, bbox_inches="tight")
    plt.close()
    print(f"Saved: {output_path}")
83
+
84
+
85
if __name__ == "__main__":
    import sys
    # Optional CLI argument: path to the evaluation results JSON.
    results_file = sys.argv[1] if len(sys.argv) > 1 else "eval_results.json"
    agent_comparison(results_file)
    budget_pacing(results_file)
reasonbudget_gym/policy/__init__.py ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ from .allocation_policy import AllocationPolicy
2
+
3
+ __all__ = ["AllocationPolicy"]
reasonbudget_gym/policy/allocation_policy.py ADDED
@@ -0,0 +1,127 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ MLP policy for token budget allocation.
3
+
4
+ Architecture:
5
+ - Shared trunk: FC(obs_dim → 256) → ReLU → FC(256 → 128) → ReLU
6
+ - Actor head: FC(128 → 1) producing mean, with learned log_std
7
+ Output: Gaussian(mean, std) over normalised token allocation
8
+ - Value head: FC(128 → 1)
9
+
10
+ The action is a continuous value in [0, 1] representing fraction of
11
+ max_tokens to allocate. It is scaled to an integer allocation at step time.
12
+ """
13
+ import math
14
+ import numpy as np
15
+
16
+ try:
17
+ import torch
18
+ import torch.nn as nn
19
+ import torch.nn.functional as F
20
+ from torch.distributions import Normal
21
+ _TORCH_AVAILABLE = True
22
+ except ImportError:
23
+ _TORCH_AVAILABLE = False
24
+
25
+
26
def _require_torch():
    """Raise a helpful ImportError when PyTorch is not installed."""
    if _TORCH_AVAILABLE:
        return
    raise ImportError(
        "PyTorch is required for AllocationPolicy. "
        "Install it with: pip install torch"
    )
32
+
33
+
34
class _PolicyNet(nn.Module if _TORCH_AVAILABLE else object):
    """Neural network backbone for the allocation policy.

    The base class degrades to `object` when torch is missing so this
    module can still be imported; instantiation then raises in __init__.
    """

    def __init__(self, obs_dim: int, hidden: int = 256):
        if not _TORCH_AVAILABLE:
            raise ImportError("PyTorch required")
        super().__init__()
        # Shared trunk: obs_dim -> hidden -> hidden//2, ReLU activations.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden // 2),
            nn.ReLU(),
        )
        self.actor_mean = nn.Linear(hidden // 2, 1)
        # State-independent log standard deviation, learned directly.
        self.log_std = nn.Parameter(torch.zeros(1))
        self.value_head = nn.Linear(hidden // 2, 1)

    def forward(self, x):
        h = self.trunk(x)
        mean = torch.sigmoid(self.actor_mean(h))  # in (0, 1)
        std = torch.exp(self.log_std).clamp(0.01, 1.0)  # keep exploration bounded
        value = self.value_head(h)
        return mean, std, value
57
+
58
+
59
class AllocationPolicy:
    """
    High-level wrapper around the policy network.

    Provides get_action() compatible with the baseline interface,
    plus PPO-specific methods (evaluate_actions, value).
    """

    def __init__(self, obs_dim: int, max_tokens: int, min_tokens: int = 50):
        _require_torch()
        self.obs_dim = obs_dim
        self.max_tokens = max_tokens
        self.min_tokens = min_tokens
        self.net = _PolicyNet(obs_dim)
        self.optimizer = torch.optim.Adam(self.net.parameters(), lr=3e-4)

    def _obs_to_tensor(self, obs_flat: list) -> "torch.Tensor":
        # Shape [1, obs_dim]: the network expects a batch dimension.
        return torch.tensor(obs_flat, dtype=torch.float32).unsqueeze(0)

    def get_action(self, obs_flat: list) -> tuple:
        """
        Returns (action_int, log_prob, value_estimate).

        action_int: token allocation as integer
        """
        self.net.eval()
        with torch.no_grad():
            x = self._obs_to_tensor(obs_flat)
            mean, std, value = self.net(x)
            dist = Normal(mean, std)
            # NOTE(review): log_prob is taken of the CLAMPED sample, which
            # slightly mis-weights boundary actions — confirm this is
            # acceptable for the PPO update.
            sample = dist.sample().clamp(0.0, 1.0)
            log_prob = dist.log_prob(sample).squeeze()

        # Scale the [0, 1] fraction into [min_tokens, max_tokens].
        frac = sample.item()
        action_int = int(self.min_tokens + frac * (self.max_tokens - self.min_tokens))
        return action_int, log_prob.item(), value.squeeze().item()

    def evaluate_actions(self, obs_batch, action_fracs):
        """
        Compute log_probs, entropy, values for a batch.

        obs_batch: Tensor [B, obs_dim]
        action_fracs: Tensor [B, 1] (normalised actions in [0,1])
        """
        self.net.train()
        mean, std, values = self.net(obs_batch)
        dist = Normal(mean, std)
        log_probs = dist.log_prob(action_fracs)
        entropy = dist.entropy()
        return log_probs, entropy, values

    def save(self, path: str):
        """Serialize network + optimizer state and scaling params to path."""
        _require_torch()
        torch.save({
            "net_state": self.net.state_dict(),
            "optimizer_state": self.optimizer.state_dict(),
            "obs_dim": self.obs_dim,
            "max_tokens": self.max_tokens,
            "min_tokens": self.min_tokens,
        }, path)

    def load(self, path: str):
        """Restore network + optimizer state produced by save()."""
        _require_torch()
        ckpt = torch.load(path, map_location="cpu")
        self.net.load_state_dict(ckpt["net_state"])
        self.optimizer.load_state_dict(ckpt["optimizer_state"])

    def reset(self):
        # Baseline-interface hook; this policy keeps no episode state.
        pass
reasonbudget_gym/server/__init__.py ADDED
File without changes
reasonbudget_gym/server/app.py ADDED
@@ -0,0 +1,233 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ OpenEnv-compatible FastAPI server for ReasonBudgetEnv.
3
+ Entry point: reasonbudget_gym.server.app:app
4
+ """
5
+ from typing import Any, Dict, List, Optional
6
+
7
+ from fastapi import FastAPI, HTTPException
8
+ from fastapi.responses import HTMLResponse
9
+ from pydantic import BaseModel
10
+
11
+ from ..env.config import EnvConfig
12
+ from ..env.models import Observation
13
+ from ..env.reason_budget_env import ReasonBudgetEnv
14
+
15
+
16
class ResetRequest(BaseModel):
    """Request body for POST /reset."""
    # Optional RNG seed; when set, /reset re-seeds the episode sampler first.
    seed: Optional[int] = None
18
+
19
+
20
class StepRequest(BaseModel):
    """Request body for POST /step."""
    # Number of reasoning tokens to allocate to the current question.
    token_allocation: int
22
+
23
+
24
class ObsResponse(BaseModel):
    """JSON-serialisable mirror of the env's Observation (built by _to_obs_response)."""
    question_embedding: List[float]
    remaining_budget: int
    questions_remaining: int
    budget_per_remaining: float
    accuracy_so_far: float
    # One record per answered question:
    # {"tokens_allocated", "tokens_used", "was_correct"}.
    history: List[dict]
    done: bool
    episode_id: Optional[str] = None
33
+
34
+
35
class StepResponse(BaseModel):
    """Response body for POST /step: one environment transition."""
    observation: ObsResponse
    reward: float
    done: bool
    # Step metadata passed through from env.step; presumably includes
    # "was_correct" (training code reads it) — verify against the env.
    info: Dict[str, Any]
40
+
41
+
42
def _to_obs_response(obs: Observation) -> ObsResponse:
    """Convert an internal Observation into the wire-format ObsResponse."""
    step_records = [
        {
            "tokens_allocated": step.tokens_allocated,
            "tokens_used": step.tokens_used,
            "was_correct": step.was_correct,
        }
        for step in obs.history
    ]
    return ObsResponse(
        question_embedding=obs.question_embedding,
        remaining_budget=obs.remaining_budget,
        questions_remaining=obs.questions_remaining,
        budget_per_remaining=obs.budget_per_remaining,
        accuracy_so_far=obs.accuracy_so_far,
        history=step_records,
        done=obs.done,
        episode_id=obs.episode_id,
    )
58
+
59
+
60
def _landing_page() -> str:
    # Static, dependency-free HTML served at GET "/" — a single literal,
    # no templating. Edit the markup here directly.
    return """<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <title>ReasoningEconomicsEnv</title>
    <style>
      :root {
        color-scheme: light;
        --bg: #f6f4ec;
        --card: #fffdf6;
        --ink: #182022;
        --muted: #536067;
        --accent: #0e7c66;
        --border: #d8d1c0;
      }
      body {
        margin: 0;
        font-family: "Iowan Old Style", "Palatino Linotype", serif;
        background:
          radial-gradient(circle at top left, rgba(14, 124, 102, 0.16), transparent 30%),
          linear-gradient(180deg, #f8f6ef 0%, var(--bg) 100%);
        color: var(--ink);
      }
      main {
        max-width: 840px;
        margin: 0 auto;
        padding: 48px 24px 64px;
      }
      .eyebrow {
        text-transform: uppercase;
        letter-spacing: 0.12em;
        font-size: 0.78rem;
        color: var(--accent);
        margin-bottom: 12px;
      }
      h1 {
        font-size: clamp(2.2rem, 6vw, 4.2rem);
        line-height: 0.95;
        margin: 0 0 18px;
      }
      p {
        font-size: 1.08rem;
        line-height: 1.65;
        color: var(--muted);
        margin: 0 0 18px;
      }
      .card {
        margin-top: 32px;
        background: var(--card);
        border: 1px solid var(--border);
        border-radius: 18px;
        padding: 24px;
        box-shadow: 0 14px 50px rgba(24, 32, 34, 0.08);
      }
      .links {
        display: flex;
        flex-wrap: wrap;
        gap: 12px;
        margin: 24px 0 12px;
      }
      a {
        color: inherit;
      }
      .button {
        display: inline-block;
        text-decoration: none;
        border-radius: 999px;
        padding: 12px 18px;
        border: 1px solid var(--border);
        background: white;
        font-weight: 600;
      }
      .button.primary {
        background: var(--accent);
        border-color: var(--accent);
        color: white;
      }
      ul {
        margin: 18px 0 0;
        padding-left: 20px;
        color: var(--muted);
      }
      code {
        font-family: "SFMono-Regular", "Menlo", monospace;
        font-size: 0.95em;
      }
    </style>
  </head>
  <body>
    <main>
      <div class="eyebrow">AgentX&ndash;AgentBeats / OpenEnv</div>
      <h1>ReasoningEconomicsEnv</h1>
      <p>
        A reinforcement learning environment for allocating reasoning tokens across
        multi-question episodes under a fixed compute budget.
      </p>
      <p>
        This Space serves the environment as an API-first FastAPI app. Use the links
        below to inspect the schema, check runtime health, or query environment metadata.
      </p>

      <div class="links">
        <a class="button primary" href="/docs">Open API Docs</a>
        <a class="button" href="/health">Health Check</a>
        <a class="button" href="/info">Environment Info</a>
      </div>

      <section class="card">
        <p><strong>Core endpoints</strong></p>
        <ul>
          <li><code>GET /health</code> returns service status.</li>
          <li><code>GET /info</code> returns observation/action-space metadata.</li>
          <li><code>POST /reset</code> starts a fresh episode.</li>
          <li><code>POST /step</code> advances the environment by one allocation decision.</li>
        </ul>
      </section>
    </main>
  </body>
</html>
"""
182
+
183
+
184
def create_fastapi_app() -> FastAPI:
    """
    Build the FastAPI app wrapping a single ReasonBudgetEnv instance.

    NOTE(review): one env instance is shared by every request, so concurrent
    clients would interleave episodes. Fine for single-client OpenEnv use;
    revisit if the Space ever serves multiple agents at once.
    """
    app = FastAPI(title="ReasoningEconomicsEnv", version="0.1.0")
    config = EnvConfig()
    env = ReasonBudgetEnv(config)

    @app.get("/", response_class=HTMLResponse)
    def index():
        # Human-friendly landing page; API consumers use /docs, /info, /reset, /step.
        return _landing_page()

    @app.get("/health")
    def health():
        return {"status": "ok", "env": "ReasonBudgetEnv", "version": "0.1.0"}

    @app.get("/info")
    def info():
        # Metadata describing observation/action spaces for clients.
        return {
            "name": "ReasoningEconomicsEnv",
            "observation_dim": env.observation_dim,
            "action_space": {"type": "integer", "min": config.min_tokens, "max": config.max_tokens},
            "total_budget": config.total_budget,
            "questions_per_episode": config.questions_per_episode,
        }

    @app.post("/reset", response_model=ObsResponse)
    def reset(req: ResetRequest):
        # Re-seed before resetting so the new episode is reproducible.
        if req.seed is not None:
            env.config.seed = req.seed
            env.sampler.rng.seed(req.seed)
        return _to_obs_response(env.reset())

    @app.post("/step", response_model=StepResponse)
    def step(req: StepRequest):
        try:
            result = env.step(req.token_allocation)
        except RuntimeError as e:
            # Bad client action (e.g. stepping a finished episode) -> 400.
            # Chain the original exception so server logs keep the cause.
            raise HTTPException(status_code=400, detail=str(e)) from e
        return StepResponse(
            observation=_to_obs_response(result.observation),
            reward=result.reward,
            done=result.done,
            info=result.info,
        )

    return app
228
+
229
+
230
# OpenEnv looks for `app` at module level
# (e.g. `uvicorn reasonbudget_gym.server.app:app`).
app = create_fastapi_app()
# backwards-compat alias for callers expecting a `create_app` factory
create_app = create_fastapi_app
reasonbudget_gym/solver/__init__.py ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ from .base import BaseSolver, SolverResult
2
+ from .cached_solver import CachedSolver
3
+
4
+ __all__ = ["BaseSolver", "SolverResult", "CachedSolver"]
reasonbudget_gym/solver/base.py ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Abstract base class for solvers."""
2
+ from abc import ABC, abstractmethod
3
+ from dataclasses import dataclass
4
+
5
+
6
@dataclass
class SolverResult:
    """Result of a solver attempt."""
    answer: str               # extracted final answer text
    was_correct: bool         # whether `answer` matched the ground truth
    tokens_used: int          # tokens actually consumed by the attempt
    response_text: str = ""   # raw (possibly truncated) model output
13
+
14
+
15
class BaseSolver(ABC):
    """Attempt to answer a question within a token budget."""

    @abstractmethod
    def solve(
        self,
        question_id: str,
        question_text: str,
        ground_truth: str,
        token_budget: int,
    ) -> SolverResult:
        """
        Answer `question_text` using at most `token_budget` tokens.

        `ground_truth` is passed in so implementations can grade their own
        answer; `question_id` lets cache-backed solvers index pre-computed
        results.
        """
        ...
reasonbudget_gym/solver/cached_solver.py ADDED
@@ -0,0 +1,98 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ CachedSolver — serves pre-computed solver results from a JSON cache.
3
+
4
+ Cache schema:
5
+ {
6
+ "entries": {
7
+ "<question_id>": {
8
+ "<budget_tier>": {
9
+ "answer": str,
10
+ "was_correct": bool,
11
+ "tokens_used": int,
12
+ "response_text": str
13
+ }
14
+ }
15
+ },
16
+ "eval_ids": [list of question_ids for holdout]
17
+ }
18
+ """
19
+ import json
20
+ import os
21
+ from pathlib import Path
22
+
23
+ from .base import BaseSolver, SolverResult
24
+ from ..env.config import EnvConfig
25
+
26
+
27
class CachedSolver(BaseSolver):
    """
    Look up pre-computed results from a JSON cache file.

    If the cache file is missing (or a question is absent from it), solve()
    falls back to a deterministic heuristic so episodes still run.
    """

    # Tiers assumed when no cache file is present.
    _DEFAULT_TIERS = (50, 100, 200, 400, 800)

    def __init__(self, config: EnvConfig):
        self.config = config
        self._cache: dict = {}
        self._eval_ids: list = []
        # Sorted int budget tiers; refreshed by _load_cache when a cache exists.
        self._tiers: list = list(self._DEFAULT_TIERS)
        self._load_cache()

    def _load_cache(self):
        """Load the JSON cache, resolving a relative path against the package root, then cwd."""
        cache_path = Path(self.config.cache_path)
        if not cache_path.is_absolute():
            pkg_root = Path(__file__).parent.parent
            candidate = pkg_root / self.config.cache_path
            if candidate.exists():
                cache_path = candidate
        if not cache_path.exists():
            # Cache missing — solve() will use the fallback heuristic.
            return
        with open(cache_path, encoding="utf-8") as f:
            data = json.load(f)
        self._cache = data.get("entries", {})
        self._eval_ids = data.get("eval_ids", [])
        if self._cache:
            # Hoist tier parsing out of the per-call path. Assumes every
            # entry shares the same budget tiers as the first one (matches
            # the cache schema documented at the top of this module).
            first_entry = next(iter(self._cache.values()))
            if first_entry:
                self._tiers = sorted(int(t) for t in first_entry)

    def _nearest_tier(self, budget: int) -> str:
        """Return the largest tier <= budget (or the smallest tier overall)."""
        best = self._tiers[0]
        for tier in self._tiers:
            if tier <= budget:
                best = tier
        return str(best)

    def solve(
        self,
        question_id: str,
        question_text: str,
        ground_truth: str,
        token_budget: int,
    ) -> SolverResult:
        """Serve the cached result nearest to `token_budget`, or a fallback."""
        entries = self._cache.get(question_id)
        if not entries:
            # Unknown question (or empty tier dict, which previously crashed
            # with an IndexError) — simulate deterministically instead.
            return self._fallback(ground_truth, token_budget)

        entry = entries.get(self._nearest_tier(token_budget))
        if entry is None:
            # Tier missing for this question: use its smallest recorded tier.
            smallest = str(min(int(k) for k in entries))
            entry = entries[smallest]
        return SolverResult(
            answer=entry["answer"],
            was_correct=entry["was_correct"],
            tokens_used=entry["tokens_used"],
            response_text=entry.get("response_text", ""),
        )

    def _fallback(self, ground_truth: str, token_budget: int) -> SolverResult:
        """Deterministic heuristic used when a question is not in the cache."""
        import hashlib
        # Stable pseudo-randomness keyed on the answer text.
        h = int(hashlib.md5(ground_truth.encode()).hexdigest(), 16)
        # Correctness probability grows with budget, capped at 0.9.
        p = min(0.9, 0.3 + token_budget / 1600)
        correct = (h % 100) < int(p * 100)
        used = int(token_budget * 0.85)
        return SolverResult(
            answer=ground_truth if correct else "unknown",
            was_correct=correct,
            tokens_used=used,
            response_text="[fallback]",
        )
reasonbudget_gym/solver/live_solver.py ADDED
@@ -0,0 +1,62 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ LiveSolver — calls DeepSeek-R1 via Together AI for real inference.
3
+
4
+ Requires TOGETHER_API_KEY environment variable.
5
+ Not used by default; switch solver_type to 'live' to enable.
6
+ """
7
+ import os
8
+ import re
9
+ from .base import BaseSolver, SolverResult
10
+ from ..env.config import EnvConfig
11
+
12
+
13
class LiveSolver(BaseSolver):
    """Call DeepSeek-R1 API for live reasoning."""

    MODEL = "deepseek-ai/DeepSeek-R1"
    # Compiled once: matches 'The answer is <answer>' up to end of line/sentence.
    _ANSWER_RE = re.compile(r"[Tt]he answer is[:\s]+([^\n.]+)")

    def __init__(self, config: EnvConfig):
        self.config = config
        try:
            from together import Together
        except ImportError as e:
            # Chain the cause so the missing-package traceback is preserved.
            raise ImportError("Install `together` package: pip install together") from e
        api_key = os.environ.get("TOGETHER_API_KEY")
        if api_key is None:
            raise EnvironmentError("TOGETHER_API_KEY not set")
        self._client = Together(api_key=api_key)

    def solve(
        self,
        question_id: str,
        question_text: str,
        ground_truth: str,
        token_budget: int,
    ) -> SolverResult:
        """Run one live completion capped at `token_budget` tokens and grade it."""
        prompt = (
            f"Solve the following math problem step by step.\n\n"
            f"Problem: {question_text}\n\n"
            f"Provide your final answer as: The answer is <answer>"
        )
        response = self._client.chat.completions.create(
            model=self.MODEL,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=token_budget,
        )
        text = response.choices[0].message.content or ""
        # If the API omits usage stats, assume the full budget was consumed.
        tokens_used = response.usage.completion_tokens if response.usage else token_budget

        # Extract the final answer; fall back to the last non-empty-ish line.
        m = self._ANSWER_RE.search(text)
        answer = m.group(1).strip() if m else text.strip().split("\n")[-1]

        # Loose symmetric-containment grading: tolerates formatting around
        # the answer, but can over-credit very short answers.
        was_correct = (
            ground_truth.strip().lower() in answer.lower()
            or answer.lower() in ground_truth.strip().lower()
        )
        return SolverResult(
            answer=answer,
            was_correct=was_correct,
            tokens_used=tokens_used,
            response_text=text[:500],  # cap the stored transcript
        )
reasonbudget_gym/tests/__init__.py ADDED
File without changes
reasonbudget_gym/tests/test_config.py ADDED
@@ -0,0 +1,27 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
import sys
from pathlib import Path
# Make the repo root importable when this file is run directly (not installed).
sys.path.insert(0, str(Path(__file__).parent.parent.parent))
4
+
5
+ from reasonbudget_gym.env.config import EnvConfig
6
+
7
+
8
def test_default_config():
    """Default EnvConfig values match the documented environment settings."""
    cfg = EnvConfig()
    assert cfg.total_budget == 4000
    assert cfg.min_tokens == 50
    assert cfg.max_tokens == 800
    assert len(cfg.budget_tiers) == 5
    assert cfg.questions_per_episode == 10
15
+
16
+
17
def test_custom_config():
    """Constructor overrides are honoured."""
    cfg = EnvConfig(total_budget=2000, questions_per_episode=5)
    assert cfg.total_budget == 2000
    assert cfg.questions_per_episode == 5
21
+
22
+
23
def test_reward_weights():
    """Reward weights must have sane signs by default."""
    cfg = EnvConfig()
    assert cfg.correct_reward > 0
    assert cfg.cost_penalty_weight >= 0
    assert cfg.efficiency_bonus_weight >= 0
reasonbudget_gym/tests/test_integration.py ADDED
@@ -0,0 +1,43 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
import sys
from pathlib import Path
# Make the repo root importable when this file is run directly (not installed).
sys.path.insert(0, str(Path(__file__).parent.parent.parent))
4
+
5
+ from reasonbudget_gym.env.config import EnvConfig
6
+ from reasonbudget_gym.env.reason_budget_env import ReasonBudgetEnv
7
+ from reasonbudget_gym.baselines.uniform import UniformBaseline
8
+
9
+
10
def test_full_episode():
    """A uniform baseline finishes a 5-question episode in exactly 5 steps."""
    config = EnvConfig(questions_per_episode=5, total_budget=2000)
    env = ReasonBudgetEnv(config)
    agent = UniformBaseline(config)

    obs = env.reset()
    assert obs.questions_remaining == 5
    assert not obs.done

    episode_reward = 0.0
    step_count = 0
    done = False
    while not done:
        step_result = env.step(agent.get_action(obs))
        episode_reward += step_result.reward
        done = step_result.done
        obs = step_result.observation
        step_count += 1
        assert step_count <= 20  # guard against an env that never terminates

    assert obs.done
    assert step_count == 5
    assert -20 < episode_reward < 20
33
+
34
+
35
def test_reset_clears_state():
    """A second reset() must discard budget, progress, and history."""
    config = EnvConfig(questions_per_episode=3, total_budget=1000)
    env = ReasonBudgetEnv(config)
    env.reset()
    for _ in range(2):
        env.step(100)
    obs = env.reset()
    assert obs.questions_remaining == 3
    assert obs.remaining_budget == 1000
    assert len(obs.history) == 0
reasonbudget_gym/tests/test_reward.py ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
import sys
from pathlib import Path
# Make the repo root importable when this file is run directly (not installed).
sys.path.insert(0, str(Path(__file__).parent.parent.parent))
4
+
5
+ from reasonbudget_gym.env.config import EnvConfig
6
+ from reasonbudget_gym.env.reward import compute_reward
7
+
8
+
9
def test_correct_positive():
    """A correct answer well under budget should earn a positive reward."""
    reward = compute_reward(True, 200, 180, 2000, 5, EnvConfig())
    assert reward > 0
12
+
13
+
14
def test_wrong_expensive_negative():
    """A wrong, maximally expensive answer should be penalised below zero."""
    cfg = EnvConfig(cost_penalty_weight=0.5)
    reward = compute_reward(False, 800, 800, 4000, 1, cfg)
    assert reward < 0
17
+
18
+
19
def test_efficiency_bonus():
    """Spending fewer tokens than allocated should score higher when correct."""
    cfg = EnvConfig(efficiency_bonus_weight=0.5)
    frugal = compute_reward(True, 400, 50, 2000, 5, cfg)
    exhaustive = compute_reward(True, 400, 400, 2000, 5, cfg)
    assert frugal > exhaustive
reasonbudget_gym/training/__init__.py ADDED
File without changes
reasonbudget_gym/training/ppo_train.py ADDED
@@ -0,0 +1,252 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ PPO training loop for the token budget allocation policy.
3
+
4
+ Usage:
5
+ python -m reasonbudget_gym.training.ppo_train \
6
+ --n_episodes 500 \
7
+ --ppo_epochs 4 \
8
+ --clip_eps 0.2 \
9
+ --value_coef 0.5 \
10
+ --entropy_coef 0.01 \
11
+ --output_dir runs/ppo_run1
12
+
13
+ Training procedure:
14
+ 1. Roll out N episodes using the current policy (stochastic)
15
+ 2. Compute returns and GAE advantages
16
+ 3. Run PPO update for K epochs over the rollout buffer
17
+ 4. Log metrics and save checkpoint every 50 episodes
18
+ """
19
+ import argparse
20
+ import json
21
+ import math
22
+ import os
23
+ import random
24
+ import sys
25
+ from pathlib import Path
26
+
27
+ try:
28
+ import torch
29
+ import torch.nn.functional as F
30
+ _TORCH_AVAILABLE = True
31
+ except ImportError:
32
+ _TORCH_AVAILABLE = False
33
+
34
+ from ..env.config import EnvConfig
35
+ from ..env.reason_budget_env import ReasonBudgetEnv
36
+
37
+
38
def collect_rollout(env: ReasonBudgetEnv, policy, config: EnvConfig) -> dict:
    """Run one full episode with the current policy and record per-step data."""
    obs = env.reset()
    flat_obs = env.flat_observation(obs)

    buffer = {
        "observations": [],
        "action_fracs": [],
        "log_probs": [],
        "rewards": [],
        "values": [],
        "dones": [],
    }
    episode_reward = 0.0
    n_correct = 0
    n_steps = 0
    action_span = max(1, config.max_tokens - config.min_tokens)

    done = False
    while not done:
        action_int, log_prob, value = policy.get_action(flat_obs)
        result = env.step(action_int)

        # Store the action normalised to [0, 1] for the Gaussian policy.
        frac = (action_int - config.min_tokens) / action_span
        frac = max(0.0, min(1.0, frac))

        buffer["observations"].append(flat_obs)
        buffer["action_fracs"].append(frac)
        buffer["log_probs"].append(log_prob)
        buffer["rewards"].append(result.reward)
        buffer["values"].append(value)
        buffer["dones"].append(result.done)

        if result.info.get("was_correct"):
            n_correct += 1
        episode_reward += result.reward
        n_steps += 1
        done = result.done
        flat_obs = env.flat_observation(result.observation)

    buffer["total_reward"] = episode_reward
    buffer["accuracy"] = n_correct / max(1, n_steps)
    buffer["steps"] = n_steps
    return buffer
82
+
83
+
84
def compute_returns_and_advantages(
    rewards, values, dones, gamma=0.99, lam=0.95
):
    """
    Compute GAE(lambda) advantages and the matching discounted returns.

    rewards/values/dones are equal-length per-step lists from one rollout.
    returns[t] == advantages[t] + values[t], so the critic target is the
    advantage-corrected value estimate.
    """
    n = len(rewards)
    advantages = [0.0] * n
    gae = 0.0
    next_value = 0.0

    for t in reversed(range(n)):
        # A terminal step cuts the bootstrap: no value flows past episode end.
        nonterminal = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
        next_value = values[t]

    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns
105
+
106
+
107
def ppo_update(policy, rollouts: list, clip_eps: float, value_coef: float,
               entropy_coef: float, ppo_epochs: int, batch_size: int = 64):
    """
    Run PPO-Clip updates over the collected rollouts.

    rollouts: list of dicts from collect_rollout, each augmented with
    "advantages" and "returns". Mutates the policy in place via its
    optimizer and returns the mean loss over all minibatch updates.
    """
    # Local import keeps module import safe when torch is absent (train()
    # already guards on _TORCH_AVAILABLE before calling this).
    import torch

    # Flatten all rollouts into one buffer.
    all_obs, all_fracs, all_lp, all_adv, all_ret = [], [], [], [], []
    for r in rollouts:
        all_obs.extend(r["observations"])
        all_fracs.extend(r["action_fracs"])
        all_lp.extend(r["log_probs"])
        all_adv.extend(r["advantages"])
        all_ret.extend(r["returns"])

    # Shape everything to [N, 1] to line up with the network outputs.
    obs_t = torch.tensor(all_obs, dtype=torch.float32)
    fracs_t = torch.tensor(all_fracs, dtype=torch.float32).unsqueeze(1)
    old_lp_t = torch.tensor(all_lp, dtype=torch.float32).unsqueeze(1)
    adv_t = torch.tensor(all_adv, dtype=torch.float32).unsqueeze(1)
    ret_t = torch.tensor(all_ret, dtype=torch.float32).unsqueeze(1)

    # Normalise advantages (zero mean, unit variance) for update stability.
    adv_t = (adv_t - adv_t.mean()) / (adv_t.std() + 1e-8)

    n = obs_t.shape[0]
    total_loss = 0.0
    n_updates = 0

    for _ in range(ppo_epochs):
        # Fresh shuffle each epoch so minibatches differ between passes.
        idx = torch.randperm(n)
        for start in range(0, n, batch_size):
            b_idx = idx[start:start + batch_size]
            b_obs = obs_t[b_idx]
            b_fracs = fracs_t[b_idx]
            b_old_lp = old_lp_t[b_idx]
            b_adv = adv_t[b_idx]
            b_ret = ret_t[b_idx]

            new_lp, entropy, values = policy.evaluate_actions(b_obs, b_fracs)

            # PPO-Clip surrogate: take the pessimistic (min) of the raw and
            # clipped importance-weighted advantages.
            ratio = torch.exp(new_lp - b_old_lp)
            surr1 = ratio * b_adv
            surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * b_adv
            actor_loss = -torch.min(surr1, surr2).mean()
            value_loss = F.mse_loss(values, b_ret)
            # Negative entropy so that maximising entropy lowers the loss.
            entropy_loss = -entropy.mean()

            loss = actor_loss + value_coef * value_loss + entropy_coef * entropy_loss

            policy.optimizer.zero_grad()
            loss.backward()
            # Clip the global gradient norm to keep updates bounded.
            torch.nn.utils.clip_grad_norm_(policy.net.parameters(), 0.5)
            policy.optimizer.step()

            total_loss += loss.item()
            n_updates += 1

    return total_loss / max(1, n_updates)
164
+
165
+
166
def train(args):
    """
    Run the PPO training loop described in the module docstring.

    Collects one episode per iteration, performs a PPO update on it, logs a
    10-episode rolling average, checkpoints every 50 episodes, and writes
    the full metric history to <output_dir>/training_history.json.
    """
    if not _TORCH_AVAILABLE:
        print("ERROR: PyTorch not installed. Run: pip install torch", file=sys.stderr)
        sys.exit(1)

    config = EnvConfig(seed=args.seed)
    env = ReasonBudgetEnv(config)
    obs_dim = env.observation_dim

    # Imported lazily: the policy module requires torch.
    from ..policy.allocation_policy import AllocationPolicy
    policy = AllocationPolicy(
        obs_dim=obs_dim,
        max_tokens=config.max_tokens,
        min_tokens=config.min_tokens,
    )

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    history = []
    print(f"Starting PPO training: {args.n_episodes} episodes, obs_dim={obs_dim}")
    print(f"Output dir: {output_dir}")

    for episode in range(1, args.n_episodes + 1):
        # Collect one on-policy rollout and attach its GAE targets.
        rollout = collect_rollout(env, policy, config)
        advantages, returns = compute_returns_and_advantages(
            rollout["rewards"], rollout["values"], rollout["dones"]
        )
        rollout["advantages"] = advantages
        rollout["returns"] = returns

        # PPO update on this single-episode buffer.
        loss = ppo_update(
            policy, [rollout],
            clip_eps=args.clip_eps,
            value_coef=args.value_coef,
            entropy_coef=args.entropy_coef,
            ppo_epochs=args.ppo_epochs,
        )

        record = {
            "episode": episode,
            "reward": rollout["total_reward"],
            "accuracy": rollout["accuracy"],
            "loss": loss,
            "steps": rollout["steps"],
        }
        history.append(record)

        if episode % 10 == 0:
            recent = history[-10:]
            avg_r = sum(r["reward"] for r in recent) / len(recent)
            avg_a = sum(r["accuracy"] for r in recent) / len(recent)
            print(f" Ep {episode:4d} | avg_reward={avg_r:.3f} | avg_acc={avg_a:.3f} | loss={loss:.4f}")

        if episode % 50 == 0:
            ckpt_path = output_dir / f"policy_ep{episode}.pt"
            policy.save(str(ckpt_path))
            print(f" Saved checkpoint: {ckpt_path}")

    # Final save of weights and metrics.
    policy.save(str(output_dir / "policy_final.pt"))
    with open(output_dir / "training_history.json", "w") as f:
        json.dump(history, f, indent=2)

    print(f"\nTraining complete. Final checkpoint: {output_dir / 'policy_final.pt'}")
    # BUG FIX: the summary previously divided by a hard-coded 10, which
    # mis-reported the averages whenever fewer than 10 episodes were run.
    final_window = history[-10:]
    avg_reward = sum(r["reward"] for r in final_window) / len(final_window)
    avg_accuracy = sum(r["accuracy"] for r in final_window) / len(final_window)
    print(f"Last {len(final_window)} episodes — avg reward: {avg_reward:.3f}, "
          f"avg accuracy: {avg_accuracy:.3f}")
236
+
237
+
238
def main():
    """CLI entry point: parse PPO hyperparameters and launch training."""
    parser = argparse.ArgumentParser(description="PPO training for ReasonBudgetEnv")
    parser.add_argument("--n_episodes", type=int, default=500)
    parser.add_argument("--ppo_epochs", type=int, default=4)
    parser.add_argument("--clip_eps", type=float, default=0.2)
    parser.add_argument("--value_coef", type=float, default=0.5)
    parser.add_argument("--entropy_coef", type=float, default=0.01)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--output_dir", type=str, default="runs/ppo_run1")
    parsed = parser.parse_args()
    train(parsed)
249
+
250
+
251
# Support `python -m reasonbudget_gym.training.ppo_train ...` execution.
if __name__ == "__main__":
    main()
requirements.txt ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ fastapi>=0.110.0
2
+ uvicorn[standard]>=0.29.0
3
+ pydantic>=2.0
4
+ numpy>=1.24
5
+ datasets>=2.18.0
6
+ sentence-transformers>=2.7.0
7
+ matplotlib>=3.8
8
+ seaborn>=0.13