Spaces: Siteshcodes/bug-triage-env


Siteshcodes committed on Apr 12

Commit 703aa57 (1 parent: 1893444)

v2.0: multi-step episodes, procedural bugs, semantic grading, sessions, 71 tests


Files changed (21)

.gitignore +0 -0
README.md +163 -102
__pycache__/client.cpython-314.pyc +0 -0
__pycache__/model.cpython-314.pyc +0 -0
baseline.py +41 -24
bug_triage_client.py +0 -75
client.py +81 -17
inference.py +257 -69
model.py +21 -9
openenv.yaml +26 -8
pyproject.toml +11 -2
server/__pycache__/__init__.cpython-314.pyc +0 -0
server/__pycache__/task.cpython-314.pyc +0 -0
server/app.py +161 -81
server/environment.py +263 -42
server/requirements.txt +2 -1
server/task.py +725 -78
tests/__init__.py +1 -0
tests/test_api.py +190 -0
tests/test_environment.py +205 -0
tests/test_grading.py +253 -0

.gitignore CHANGED

Binary files a/.gitignore and b/.gitignore differ

README.md CHANGED

@@ -9,85 +9,127 @@ tags:
   - openenv
 ---
-# 🐛 Bug Triage Environment
 > **OpenEnv RL environment for the Meta PyTorch Hackathon x Scaler School of Technology**
-An OpenEnv reinforcement learning environment where an AI agent triages GitHub-style bug reports — assigning priority, labels, team ownership, and milestone — exactly as a senior engineer would.
 **Live:** [https://siteshcodes-bug-triage-env.hf.space](https://siteshcodes-bug-triage-env.hf.space)
 **GitHub:** [https://github.com/Siteshcodes/bug-triage-env](https://github.com/Siteshcodes/bug-triage-env)
 ---
-## Why This Environment?
-Every software team triages dozens of bug reports weekly. Getting prioritization wrong delays critical fixes and wastes engineering time. This environment trains and evaluates agents on real triage decision-making, with graders that reflect actual engineering judgment.
-**Key features:**
-- 🎯 Simulates a real-world engineering task (not a game or toy)
-- 📊 3 tasks of increasing difficulty with deterministic graders
-- 🔄 Meaningful partial-credit reward function
-- 🛡️ Security escalation penalty for missed critical vulnerabilities
-- 📦 Full OpenEnv spec compliance: `step()` / `reset()` / `state()`
 ---
 ## Action Space
-| Field           | Type      | Values                                          |
-|-----------------|-----------|-------------------------------------------------|
-| `priority`      | string    | `P0` · `P1` · `P2` · `P3`                      |
-| `labels`        | list[str] | `bug` · `performance` · `security` · `ux` · `data-integrity` · `payments` … |
-| `assigned_team` | string    | `backend` · `frontend` · `infra` · `security` · `devx` |
-| `milestone`     | string    | `hotfix` · `v2.1` · `backlog`                   |
-| `reasoning`     | string    | Free-form explanation of triage decision         |
 ## Observation Space
-| Field        | Type      | Description                              |
-|--------------|-----------|------------------------------------------|
-| `bug_report` | BugReport | Title, body, author, labels_hint, comments |
-| `task_id`    | string    | Current difficulty: `easy` / `medium` / `hard` |
-| `score`      | float     | Score from grader (0.0–1.0)              |
-| `reward`     | float     | Reward from last action (0.0–1.0)        |
-| `feedback`   | string    | Human-readable grader feedback           |
-| `done`       | bool      | Episode complete flag                    |
 ---
 ## Tasks
 ### Task 1 — Easy: Priority Assignment
-Assign a single P0–P3 priority to a bug report.
 - **Grader:** `server.task:priority_match`
-- **Scoring:** exact match → 0.95, one level off → 0.50, else → 0.05
-- **Weight:** priority 100%
-- **Reward range:** (0.0, 1.0) — strictly exclusive
 ### Task 2 — Medium: Priority + Labels + Team
-Assign priority, category labels, and team routing.
 - **Grader:** `server.task:priority_label_team`
-- **Scoring:** priority 45% + label Jaccard similarity 40% + team routing 15%
-- **Reward range:** (0.0, 1.0) — strictly exclusive
 ### Task 3 — Hard: Full Triage
-Full triage: priority, labels, team, and milestone. Security escalation failures are penalized.
 - **Grader:** `server.task:full_triage`
 - **Scoring:** priority 35% + labels 30% + team 20% + milestone 15%
-- **Penalty:** −0.15 for missing security escalation (e.g., SQL injection assigned to `backend` instead of `security`)
-- **Reward range:** (0.0, 1.0) — strictly exclusive
 ---
 ## Reward Function
-Rewards provide meaningful partial-credit signals at every step:
-- **Priority:** Close-but-wrong gets partial credit (0.50 for 1-level off vs 0.05 for 2+ levels off vs 0.95 for exact match)
-- **Labels:** Jaccard similarity between predicted and expected label sets (continuous signal)
-- **Team routing:** Binary accuracy, weighted per task difficulty
-- **Security escalation:** Hard penalty (−0.15) discourages ignoring critical security signals
-- **Clamping:** All scores strictly within (0.0, 1.0) — never exactly 0 or 1
 ---
@@ -107,7 +149,13 @@ docker build -t bug-triage-env .
 docker run -p 7860:7860 bug-triage-env
 ```
-### Run Inference (Hackathon Submission Script)
 ```bash
 pip install openai openenv-core requests pydantic
 export API_BASE_URL=https://router.huggingface.co/v1
@@ -119,77 +167,79 @@ python inference.py
 ### Environment Variables
-| Variable       | Description                          | Required |
-|----------------|--------------------------------------|----------|
-| `API_BASE_URL` | LLM API endpoint                     | Yes      |
-| `MODEL_NAME`   | Model identifier for inference       | Yes      |
-| `HF_TOKEN`     | Hugging Face / API key               | Yes      |
-| `ENV_BASE_URL` | Bug Triage environment URL           | Optional |
----
-## Baseline Scores
-Evaluated with `meta-llama/Llama-3.3-70B-Instruct` via HuggingFace router (temperature=0):
-| Task       | Difficulty | Score |
-|------------|------------|-------|
-| Easy       | easy       | 0.95  |
-| Medium     | medium     | 0.50  |
-| Hard       | hard       | 0.85  |
-| **Average**|            | **0.77** |
-> Scores vary per run due to random bug sampling from a pool of 5 bugs per task.
 ---
 ## API Endpoints
-| Method | Endpoint         | Description                        |
-|--------|------------------|------------------------------------|
-| GET    | `/`              | Health check                       |
-| POST   | `/reset`         | Start new episode for a task       |
-| POST   | `/step`          | Submit triage action               |
-| GET    | `/state`         | Get current episode state          |
-| GET    | `/tasks`         | List all tasks with grader info    |
-| GET    | `/tasks/{id}`    | Get specific task metadata         |
-### Example: Reset + Step
 ```bash
-# Reset for easy task
 curl -X POST https://siteshcodes-bug-triage-env.hf.space/reset \
   -H "Content-Type: application/json" \
-  -d '{"task_id": "easy"}'
-# Submit triage action
 curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
   -H "Content-Type: application/json" \
-  -d '{"action": {"priority": "P0", "labels": ["bug"], "assigned_team": "backend", "milestone": "hotfix", "reasoning": "App crash affecting all users"}}'
 ```
 ---
 ## Inference Log Format
-The inference script emits structured logs per the OpenEnv spec:
 ```
 [START] task=easy env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
-[STEP] step=1 action=priority=P0,team=backend,milestone=hotfix reward=0.95 done=true error=null
-[END] success=true steps=1 score=0.95 rewards=0.95
 [START] task=medium env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
-[STEP] step=1 action=priority=P0,team=backend,milestone=hotfix reward=0.85 done=true error=null
-[END] success=true steps=1 score=0.85 rewards=0.85
 [START] task=hard env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
-[STEP] step=1 action=priority=P0,team=security,milestone=hotfix reward=0.72 done=true error=null
-[END] success=true steps=1 score=0.72 rewards=0.72
 ```
-Each task gets its own `[START]` → `[STEP]` → `[END]` block.
 ---
 ## Project Structure
@@ -197,16 +247,24 @@ Each task gets its own `[START]` → `[STEP]` → `[END]` block.
 ```
 bug-triage-env/
 ├── server/
-│   ├── app.py             # FastAPI + OpenEnv stateful endpoints
-│   ├── environment.py     # BugTriageEnvironment (reset/step/state)
-│   ├── task.py            # 15 bug reports + 3 graders
 │   ├── __init__.py
-│   └── requirements.txt
 ├── model.py               # Pydantic models (TriageAction, TriageObservation, TriageState)
-├── inference.py           # OpenAI client submission script (per-task logs)
-├── openenv.yaml           # OpenEnv spec manifest (3 tasks with graders)
-├── Dockerfile             # Docker container config
-├── pyproject.toml         # Package metadata
 └── README.md
 ```
@@ -214,17 +272,20 @@ bug-triage-env/
 ## OpenEnv Spec Compliance
-| Requirement                         | Status |
-|-------------------------------------|--------|
 | Typed models (Action/Observation/State) | ✅ |
-| `step()` / `reset()` / `state()` API   | ✅ |
-| `openenv.yaml` manifest                | ✅ |
-| 3+ tasks with graders (easy→hard)      | ✅ |
-| Reward range strictly (0.0, 1.0)       | ✅ |
 | Baseline inference with reproducible scores | ✅ |
-| Dockerfile builds                       | ✅ |
-| Deployed on HF Spaces                  | ✅ |
-| Structured `[START]/[STEP]/[END]` logs  | ✅ |
 ---

   - openenv
 ---
+# 🐛 Bug Triage Environment v2.0
 > **OpenEnv RL environment for the Meta PyTorch Hackathon x Scaler School of Technology**
+A multi-step reinforcement learning environment where an AI agent investigates and triages GitHub-style bug reports — deciding priority, labels, team ownership, and milestone — just like a senior engineer would.
 **Live:** [https://siteshcodes-bug-triage-env.hf.space](https://siteshcodes-bug-triage-env.hf.space)
 **GitHub:** [https://github.com/Siteshcodes/bug-triage-env](https://github.com/Siteshcodes/bug-triage-env)
 ---
+## What Makes This Different
+| Feature | v1.0 (before) | v2.0 (now) |
+|---------|---------------|------------|
+| Episode length | 1 step (quiz) | Multi-step investigation |
+| Bug pool | 15 handcrafted | 200+ procedurally generated |
+| Label matching | Exact string | Semantic (synonym-aware) |
+| Concurrency | Broken (global state) | Session-based, thread-safe |
+| Information reveal | Everything at once | Progressive (title → body → comments → logs) |
+| Tests | None | 50+ unit & integration tests |
+| Grading depth | String matching | Weighted scoring + reasoning bonus |
+---
+## Multi-Step Investigation
+Unlike simple Q&A environments, the agent must **investigate before deciding**:
+```
+reset()     → Agent sees: bug title + body preview
+step(read_body)      → Full description revealed
+step(read_comments)  → User comments revealed
+step(check_logs)     → Stack traces + severity signals revealed
+step(submit, ...)    → Final triage graded (reward returned)
+```
+Each investigation action consumes one step from a limited per-task budget (up to 4, 5, or 6 steps for easy, medium, and hard). The agent must learn **when it has enough information to decide correctly**, balancing accuracy against efficiency.
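The budget trade-off described here can be illustrated with a small stopping policy. This is a minimal sketch, not the repository's agent: `next_action` and its `reserve` parameter are hypothetical, and the observation is assumed to be a plain dict carrying the `steps_taken`, `max_steps`, and `*_visible` fields listed in the Observation Space table.

```python
from typing import Mapping

# Investigation actions in the order the README reveals information.
# The flag names mirror the observation fields; the policy itself is hypothetical.
INVESTIGATIONS = [
    ("read_body", "body_visible"),
    ("read_comments", "comments_visible"),
    ("check_logs", "logs_visible"),
]


def next_action(obs: Mapping[str, object], reserve: int = 1) -> str:
    """Pick the next action_type, always keeping `reserve` steps for the final submit."""
    remaining = int(obs["max_steps"]) - int(obs["steps_taken"])
    if remaining <= reserve:
        return "submit"  # budget nearly spent: decide with what has been revealed
    for action_type, flag in INVESTIGATIONS:
        if not obs.get(flag, False):
            return action_type  # reveal the next hidden piece of information
    return "submit"  # everything is visible, nothing left to investigate
```

A learned policy would presumably replace the fixed ordering with a decision based on the bug content, for example skipping `check_logs` when the title already signals a documentation typo.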
 ---
 ## Action Space
+| Field | Type | Values |
+|-------|------|--------|
+| `action_type` | string | `read_body` · `read_comments` · `check_logs` · `check_similar` · `submit` |
+| `priority` | string | `P0` · `P1` · `P2` · `P3` (only for submit) |
+| `labels` | list[str] | `bug` · `performance` · `security` · `ux` · `data-integrity` · `payments` … |
+| `assigned_team` | string | `backend` · `frontend` · `infra` · `security` · `devx` |
+| `milestone` | string | `hotfix` · `v2.1` · `backlog` |
+| `reasoning` | string | Free-form explanation (earns bonus points) |
 ## Observation Space
+| Field | Type | Description |
+|-------|------|-------------|
+| `bug_report` | BugReport | Title, body, author, labels_hint, comments, stack_trace |
+| `task_id` | string | Current difficulty: `easy` / `medium` / `hard` |
+| `score` | float | Score from grader (0.0–1.0) |
+| `reward` | float | Reward from last action (0.0–1.0) |
+| `feedback` | string | Human-readable grader feedback |
+| `done` | bool | Episode complete flag |
+| `body_visible` | bool | Whether full body has been revealed |
+| `comments_visible` | bool | Whether comments have been revealed |
+| `logs_visible` | bool | Whether logs/stack traces have been revealed |
+| `steps_taken` | int | Steps used so far |
+| `max_steps` | int | Maximum steps allowed |
 ---
 ## Tasks
 ### Task 1 — Easy: Priority Assignment
+Assign a single P0–P3 priority. Up to 4 steps.
 - **Grader:** `server.task:priority_match`
+- **Scoring:** exact → 0.95, ±1 → 0.50, ±2 → 0.20, else → 0.05
+- **Reward range:** (0.0, 1.0)
 ### Task 2 — Medium: Priority + Labels + Team
+Assign priority, category labels, and team routing. Up to 5 steps.
 - **Grader:** `server.task:priority_label_team`
+- **Scoring:** priority 45% + label Jaccard (semantic) 40% + team 15%
+- **Reward range:** (0.0, 1.0)
 ### Task 3 — Hard: Full Triage
+Full triage with security escalation penalty. Up to 6 steps.
 - **Grader:** `server.task:full_triage`
 - **Scoring:** priority 35% + labels 30% + team 20% + milestone 15%
+- **Penalty:** −0.15 for missing security escalation
+- **Bonus:** up to +0.15 for relevant reasoning
+- **Reward range:** (0.0, 1.0)
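To make the scoring rules above concrete, here is a minimal sketch of how the graduated priority score and the hard-task weighting could be combined. It is not the code in `server/task.py`: the function names, the `EPSILON` clamp margin, and the treatment of team and milestone as 0.0/1.0 values are assumptions chosen for illustration.

```python
PRIORITY_LEVELS = ["P0", "P1", "P2", "P3"]
# Score by distance between predicted and expected level (from the README).
DISTANCE_SCORES = {0: 0.95, 1: 0.50, 2: 0.20}
EPSILON = 0.01  # assumed margin that keeps scores strictly inside (0.0, 1.0)


def priority_score(predicted: str, expected: str) -> float:
    """Graduated partial credit: exact 0.95, off by one 0.50, off by two 0.20, else 0.05."""
    distance = abs(PRIORITY_LEVELS.index(predicted) - PRIORITY_LEVELS.index(expected))
    return DISTANCE_SCORES.get(distance, 0.05)


def full_triage_score(priority: float, labels: float, team: float, milestone: float,
                      missed_security: bool = False, reasoning_bonus: float = 0.0) -> float:
    """Weighted hard-task score with penalty, capped bonus, and clamping."""
    raw = 0.35 * priority + 0.30 * labels + 0.20 * team + 0.15 * milestone
    if missed_security:
        raw -= 0.15  # security escalation penalty
    raw += min(max(reasoning_bonus, 0.0), 0.15)  # bonus is capped at +0.15
    return min(max(raw, EPSILON), 1.0 - EPSILON)
```

Under these assumptions, a fully correct hard-task answer without a reasoning bonus scores 0.35 × 0.95 + 0.30 + 0.20 + 0.15 = 0.9825, so the clamp only matters once a bonus or penalty pushes the raw value outside the open interval.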
 ---
 ## Reward Function
+- **Priority:** Graduated partial credit (0.95 → 0.50 → 0.20 → 0.05)
+- **Labels:** Semantic Jaccard similarity with synonym matching (e.g., "defect" ≈ "bug")
+- **Team routing:** Binary accuracy, weighted per difficulty
+- **Security escalation:** Hard penalty (−0.15) for ignoring security signals
+- **Reasoning bonus:** Up to +0.15 for mentioning relevant signals
+- **Efficiency:** +0.05 bonus for correct answers with minimal investigation
+- **Clamping:** All scores strictly within (0.0, 1.0)
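The synonym-aware label matching can be sketched as a Jaccard similarity computed over canonicalised labels. This is an illustration under assumptions: only the "defect" ≈ "bug" pair comes from the README, while the other `SYNONYMS` entries, the function names, and the choice to score two empty label sets as 1.0 are hypothetical.

```python
from typing import Iterable

# Maps a synonym to its canonical label. Only "defect" -> "bug" is taken from
# the README; the remaining entries are hypothetical examples.
SYNONYMS = {
    "defect": "bug",
    "vulnerability": "security",
    "perf": "performance",
}


def canonical(label: str) -> str:
    """Normalise case and whitespace, then collapse known synonyms."""
    key = label.strip().lower()
    return SYNONYMS.get(key, key)


def semantic_jaccard(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """Jaccard similarity |P ∩ E| / |P ∪ E| over canonical labels."""
    p = {canonical(label) for label in predicted}
    e = {canonical(label) for label in expected}
    if not p and not e:
        return 1.0  # assumption: two empty label sets count as a perfect match
    return len(p & e) / len(p | e)
```

Because the score is a ratio of set sizes, adding a spurious label lowers it just as missing an expected one does, which gives a continuous signal rather than a pass/fail check.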
+---
+## Procedural Bug Generation
+The environment generates bugs from **7 template categories**:
+| Category | Example Bugs |
+|----------|-------------|
+| `crash` | Service crashes, unhandled exceptions, segfaults |
+| `security` | SQL injection, XSS, auth bypass, data exposure |
+| `performance` | Memory leaks, slow queries, CPU spikes |
+| `ui_bug` | Layout breaks, dark mode issues, accessibility |
+| `data_corruption` | Race conditions, encoding issues, stale cache |
+| `documentation` | Typos, outdated docs, missing guides |
+| `api_bug` | Rate limiting bugs, pagination issues, webhook failures |
+Each category combines 5-6 title templates, 2 body templates, and 6-12 variable values, which yields hundreds of unique bugs. The 15 original handcrafted bugs are preserved as a high-quality subset; each sampled bug has a 40% chance of being drawn from it.
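A template-based generator of this kind could look roughly like the sketch below. The templates, variable values, handcrafted entry, and function names are all hypothetical placeholders; only the overall scheme (title templates × body templates × variables, with a 40% chance of returning a handcrafted bug) follows the README.

```python
import random

# Hypothetical templates for a single "crash" category.
TITLE_TEMPLATES = [
    "{service} crashes on {trigger}",
    "Unhandled exception in {service} during {trigger}",
]
BODY_TEMPLATES = [
    "After the latest deploy, {service} fails whenever {trigger} happens.",
    "Users report that {service} stops responding on {trigger}.",
]
VARIABLES = {
    "service": ["auth-api", "billing-worker", "search-indexer"],
    "trigger": ["startup", "login", "file upload"],
}
HANDCRAFTED = [{"title": "Checkout page returns 500", "body": "Placeholder handcrafted bug."}]
HANDCRAFTED_PROBABILITY = 0.4  # 40% chance per sample, as stated in the README


def count_combinations() -> int:
    """Number of distinct procedural bugs this category can produce."""
    total = len(TITLE_TEMPLATES) * len(BODY_TEMPLATES)
    for options in VARIABLES.values():
        total *= len(options)
    return total


def generate_bug(rng: random.Random) -> dict:
    """Sample one bug; a seeded rng makes the result reproducible."""
    if rng.random() < HANDCRAFTED_PROBABILITY:
        return dict(rng.choice(HANDCRAFTED), source="handcrafted")
    values = {name: rng.choice(options) for name, options in VARIABLES.items()}
    return {
        "title": rng.choice(TITLE_TEMPLATES).format(**values),
        "body": rng.choice(BODY_TEMPLATES).format(**values),
        "source": "procedural",
    }
```

Passing an explicit `random.Random(seed)` instead of using the module-level generator keeps concurrent sessions from interfering with each other's sampling.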
 ---
 docker run -p 7860:7860 bug-triage-env
 ```
+### Run Tests
+```bash
+pip install -e ".[dev]"
+pytest tests/ -v
+```
+### Run Inference (Hackathon Submission)
 ```bash
 pip install openai openenv-core requests pydantic
 export API_BASE_URL=https://router.huggingface.co/v1
 ### Environment Variables
+| Variable | Description | Required |
+|----------|-------------|----------|
+| `API_BASE_URL` | LLM API endpoint | Yes |
+| `MODEL_NAME` | Model identifier for inference | Yes |
+| `HF_TOKEN` | Hugging Face / API key | Yes |
+| `ENV_BASE_URL` | Bug Triage environment URL | Optional |
 ---
 ## API Endpoints
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/` | Interactive demo frontend |
+| GET | `/health` | Health check + active sessions |
+| POST | `/reset` | Start new episode (returns session_id) |
+| POST | `/step` | Investigation or submit action |
+| GET | `/state` | Current episode state |
+| GET | `/tasks` | List all 3 tasks |
+| GET | `/tasks/{id}` | Task metadata |
+| GET | `/leaderboard` | Top agent scores |
+| POST | `/leaderboard/submit` | Submit agent scores |
+### Example: Multi-Step Episode
 ```bash
+# 1. Reset — get a bug and session_id
 curl -X POST https://siteshcodes-bug-triage-env.hf.space/reset \
   -H "Content-Type: application/json" \
+  -d '{"task_id": "hard"}'
+# 2. Investigate — read full body (use session_id from step 1)
 curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
   -H "Content-Type: application/json" \
+  -d '{"session_id": "...", "action": {"action_type": "read_body"}}'
+# 3. Investigate — read comments
+curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
+  -H "Content-Type: application/json" \
+  -d '{"session_id": "...", "action": {"action_type": "read_comments"}}'
+# 4. Submit triage decision
+curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
+  -H "Content-Type: application/json" \
+  -d '{"session_id": "...", "action": {"action_type": "submit", "priority": "P0", "labels": ["bug", "security"], "assigned_team": "security", "milestone": "hotfix", "reasoning": "SQL injection in production — critical security vulnerability"}}'
 ```
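The request bodies used in the curl calls above can also be built programmatically. The helpers below are a hypothetical sketch (the names `reset_payload` and `step_payload` are not part of the repository); the field names follow the curl examples and the `seed` key follows the `client.py` diff.

```python
from typing import Any, Dict, Optional

# Action types listed in the README's Action Space table.
ACTION_TYPES = {"read_body", "read_comments", "check_logs", "check_similar", "submit"}


def reset_payload(task_id: str, seed: Optional[int] = None) -> Dict[str, Any]:
    """Body for POST /reset; the seed is only sent when provided."""
    payload: Dict[str, Any] = {"task_id": task_id}
    if seed is not None:
        payload["seed"] = seed
    return payload


def step_payload(session_id: str, action_type: str, **fields: Any) -> Dict[str, Any]:
    """Body for POST /step; triage fields are only meaningful for 'submit'."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return {"session_id": session_id, "action": {"action_type": action_type, **fields}}
```

With the `requests` library, each payload could then be sent as `requests.post(url, json=payload, timeout=30)`, mirroring the calls in `client.py`.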
 ---
 ## Inference Log Format
+Structured logs per OpenEnv spec (3 tasks, each with its own block):
 ```
 [START] task=easy env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
+[STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
+[STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
+[STEP] step=3 action=priority=P0,team=backend,milestone=hotfix reward=0.95 done=true error=null
+[END] success=true steps=3 score=0.95 rewards=0.95
 [START] task=medium env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
+[STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
+[STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
+[STEP] step=3 action=priority=P0,team=backend,milestone=hotfix reward=0.85 done=true error=null
+[END] success=true steps=3 score=0.85 rewards=0.85
 [START] task=hard env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
+[STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
+[STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
+[STEP] step=3 action=priority=P0,team=security,milestone=hotfix reward=0.92 done=true error=null
+[END] success=true steps=3 score=0.92 rewards=0.92
 ```
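Since the `action` value in a `[STEP]` line can itself contain `=` signs, splitting on whitespace and `=` is not enough to read these logs back. The following parser is a minimal sketch that assumes the exact field order shown in the sample output; `parse_step_line` is a hypothetical helper, not part of `inference.py`.

```python
import re
from typing import Any, Dict, Optional

# Field order is assumed to match the sample log: step, action, reward, done, error.
STEP_PATTERN = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.+) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) error=(?P<error>.*)$"
)


def parse_step_line(line: str) -> Optional[Dict[str, Any]]:
    """Parse one [STEP] log line into typed fields, or return None for other lines."""
    match = STEP_PATTERN.match(line.strip())
    if match is None:
        return None
    error = match.group("error")
    return {
        "step": int(match.group("step")),
        "action": match.group("action"),
        "reward": float(match.group("reward")),
        "done": match.group("done") == "true",
        "error": None if error == "null" else error,
    }
```

Anchoring on the literal ` reward=` separator lets the `action` group absorb values such as `priority=P0,team=backend,milestone=hotfix` without being cut at the first `=`.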
 ---
 ## Project Structure
 ```
 bug-triage-env/
 ├── server/
+│   ├── app.py             # FastAPI routes + session management
+│   ├── environment.py     # Multi-step environment + SessionManager
+│   ├── task.py            # 200+ bugs (procedural + handcrafted) + semantic grading
 │   ├── __init__.py
+│   ├── requirements.txt
+│   └── static/
+│       └── index.html     # Interactive demo
+├── tests/
+│   ├── test_grading.py    # Grading logic tests
+│   ├── test_environment.py # Environment flow tests
+│   └── test_api.py        # HTTP endpoint integration tests
 ├── model.py               # Pydantic models (TriageAction, TriageObservation, TriageState)
+├── client.py              # HTTP client (single source of truth)
+├── inference.py           # Multi-step OpenAI agent (hackathon submission)
+├── baseline.py            # Groq baseline agent
+├── openenv.yaml           # OpenEnv spec manifest
+├── Dockerfile             # Docker config
+├── pyproject.toml         # Package metadata + dev deps
 └── README.md
 ```
 ## OpenEnv Spec Compliance
+| Requirement | Status |
+|-------------|--------|
 | Typed models (Action/Observation/State) | ✅ |
+| `step()` / `reset()` / `state()` API | ✅ |
+| `openenv.yaml` manifest | ✅ |
+| 3+ tasks with graders (easy → hard) | ✅ |
+| Reward range strictly (0.0, 1.0) | ✅ |
+| Multi-step episodes | ✅ |
 | Baseline inference with reproducible scores | ✅ |
+| Dockerfile builds | ✅ |
+| Deployed on HF Spaces | ✅ |
+| Structured `[START]/[STEP]/[END]` logs | ✅ |
+| Session-based concurrency | ✅ |
+| 50+ automated tests | ✅ |
 ---

__pycache__/client.cpython-314.pyc DELETED

Binary file (5.72 kB)

__pycache__/model.cpython-314.pyc DELETED

Binary file (4.18 kB)

baseline.py CHANGED

@@ -1,17 +1,16 @@
 # baseline.py
-# Runs a Groq-hosted LLaMA model against all 3 tasks
 # Set env vars: GROQ_API_KEY, ENV_BASE_URL (optional)
 import os
 import json
 from groq import Groq
 from client import BugTriageClient
 from model import TriageAction
-import time
-# ── config ─────────────────────────────────────────────────
 GROQ_API_KEY = os.getenv("GROQ_API_KEY")
-MODEL = "llama-3.3-70b-versatile"   # strong + free on Groq
 TEMPERATURE = 0.0
 MAX_TOKENS = 400
@@ -40,12 +39,19 @@ Milestones: hotfix | v2.1 | backlog"""
 def format_bug(obs) -> str:
     bug = obs.bug_report
-    return (
-        f"Title: {bug.title}\n\n"
-        f"Description:\n{bug.body}\n\n"
-        f"Existing labels: {', '.join(bug.labels_hint) or 'none'}\n"
-        f"Comments:\n" + "\n".join(f"  - {c}" for c in bug.comments)
-    )
 def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
@@ -60,7 +66,6 @@ def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
     )
     raw = response.choices[0].message.content.strip()
-    # strip accidental markdown fences
     if raw.startswith("```"):
         raw = raw.split("```")[1]
         if raw.startswith("json"):
@@ -68,6 +73,7 @@ def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
     data = json.loads(raw)
     return TriageAction(
         priority=data["priority"],
         labels=data.get("labels", []),
         assigned_team=data.get("assigned_team", "backend"),
@@ -78,26 +84,39 @@ def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
 def main():
     if not GROQ_API_KEY:
-        raise EnvironmentError("GROQ_API_KEY not set. Get a free key at console.groq.com")
     groq_client = Groq(api_key=GROQ_API_KEY)
     scores = {}
-    step_count = 0
     print("=" * 50)
-    print("  Bug Triage Env — Baseline Inference Script")
     print(f"  Model: {MODEL}")
     print("=" * 50)
     with BugTriageClient() as env:
-        obs = env.reset()
-        MAX_STEPS = 3
-        step_count = 0
-        while not obs.done and step_count < MAX_STEPS:
-            task = obs.task_id
-            print(f"\n── Task: {task.upper()} ──")
             print(f"  Bug: {obs.bug_report.title}")
             bug_text = format_bug(obs)
             action = call_model(groq_client, bug_text)
@@ -112,21 +131,19 @@ def main():
             print(f"  ✓ Reward:    {result.reward:.3f}")
             print(f"  ✓ Feedback:  {obs.feedback}")
-            scores[task] = result.reward
-            step_count += 1
             time.sleep(2)
     print("\n" + "=" * 50)
     print("  BASELINE SCORES")
     print("=" * 50)
-    task_order = ["easy", "medium", "hard"]
     total = 0.0
     for task in task_order:
         s = scores.get(task, 0.0)
         bar = "█" * int(s * 20) + "░" * (20 - int(s * 20))
         print(f"  {task:<8} {bar}  {s:.3f}")
         total += s
-    avg = total / max(step_count, 1)
     print(f"\n  Average score: {avg:.3f}")
     print("=" * 50)

 # baseline.py
+# Runs a Groq-hosted LLaMA model against all 3 tasks with multi-step investigation
 # Set env vars: GROQ_API_KEY, ENV_BASE_URL (optional)
 import os
 import json
+import time
 from groq import Groq
 from client import BugTriageClient
 from model import TriageAction
 GROQ_API_KEY = os.getenv("GROQ_API_KEY")
+MODEL = "llama-3.3-70b-versatile"
 TEMPERATURE = 0.0
 MAX_TOKENS = 400
 def format_bug(obs) -> str:
     bug = obs.bug_report
+    parts = [f"Title: {bug.title}", f"\nDescription:\n{bug.body}"]
+    if obs.comments_visible and bug.comments:
+        comments = "\n".join(f"  - {c}" for c in bug.comments)
+        parts.append(f"\nComments:\n{comments}")
+    if bug.labels_hint:
+        parts.append(f"\nExisting labels: {', '.join(bug.labels_hint)}")
+    if obs.logs_visible and bug.stack_trace:
+        parts.append(f"\nStack trace: {bug.stack_trace}")
+    return "\n".join(parts)
 def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
     )
     raw = response.choices[0].message.content.strip()
     if raw.startswith("```"):
         raw = raw.split("```")[1]
         if raw.startswith("json"):
     data = json.loads(raw)
     return TriageAction(
+        action_type="submit",
         priority=data["priority"],
         labels=data.get("labels", []),
         assigned_team=data.get("assigned_team", "backend"),
 def main():
     if not GROQ_API_KEY:
+        raise EnvironmentError(
+            "GROQ_API_KEY not set. Get a free key at console.groq.com")
     groq_client = Groq(api_key=GROQ_API_KEY)
     scores = {}
     print("=" * 50)
+    print("  Bug Triage Env — Baseline (Multi-Step Agent)")
     print(f"  Model: {MODEL}")
     print("=" * 50)
+    task_order = ["easy", "medium", "hard"]
     with BugTriageClient() as env:
+        for task_id in task_order:
+            obs = env.reset(task_id=task_id)
+            print(f"\n── Task: {task_id.upper()} ──")
             print(f"  Bug: {obs.bug_report.title}")
+            # Step 1: Read full body
+            if not obs.body_visible:
+                result = env.investigate("read_body")
+                obs = result.observation
+                print("  📖 Investigated: read_body")
+            # Step 2: Read comments
+            if not obs.comments_visible:
+                result = env.investigate("read_comments")
+                obs = result.observation
+                print("  💬 Investigated: read_comments")
+            # Step 3: Submit triage
             bug_text = format_bug(obs)
             action = call_model(groq_client, bug_text)
             print(f"  ✓ Reward:    {result.reward:.3f}")
             print(f"  ✓ Feedback:  {obs.feedback}")
+            scores[task_id] = result.reward
             time.sleep(2)
     print("\n" + "=" * 50)
     print("  BASELINE SCORES")
     print("=" * 50)
     total = 0.0
     for task in task_order:
         s = scores.get(task, 0.0)
         bar = "█" * int(s * 20) + "░" * (20 - int(s * 20))
         print(f"  {task:<8} {bar}  {s:.3f}")
         total += s
+    avg = total / max(len(scores), 1)
     print(f"\n  Average score: {avg:.3f}")
     print("=" * 50)
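The fence-stripping logic in `call_model` only handles replies that begin with a code fence. A slightly more tolerant variant, shown here as a hypothetical `extract_json` helper rather than the script's actual code, searches for a fenced block anywhere in the reply and falls back to parsing the whole text.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built at runtime to keep this listing fence-free

# Matches a fenced block with an optional "json" language tag.
FENCE_PATTERN = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)


def extract_json(raw: str) -> dict:
    """Return the JSON object from a model reply, with or without a code fence."""
    text = raw.strip()
    match = FENCE_PATTERN.search(text)
    if match:
        text = match.group(1)  # keep only the fenced payload
    return json.loads(text)  # raises json.JSONDecodeError on malformed output
```

Letting `json.JSONDecodeError` propagate keeps malformed model output visible to the caller, which can then decide whether to retry or record a failed step.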

bug_triage_client.py DELETED

@@ -1,75 +0,0 @@
-# client.py
-import os
-import requests
-from typing import Optional
-from model import TriageAction, TriageObservation, BugReport
-class StepResult:
-    def __init__(self, observation: TriageObservation, reward: float, done: bool, info: dict):
-        self.observation = observation
-        self.reward = reward
-        self.done = done
-        self.info = info
-def _parse_observation(data: dict) -> TriageObservation:
-    bug_data = data["bug_report"]
-    bug = BugReport(**bug_data)
-    return TriageObservation(
-        bug_report=bug,
-        task_id=data.get("task_id", "easy"),
-        score=data.get("score", 0.0),
-        feedback=data.get("feedback", ""),
-        done=data.get("done", False),
-        reward=data.get("reward", 0.0),
-    )
-class BugTriageClient:
-    def __init__(self, base_url: Optional[str] = None):
-        self.base_url = (
-            base_url
-            or os.getenv("ENV_BASE_URL", "https://siteshcodes-bug-triage-env.hf.space")
-        ).rstrip("/")
-        self.session = requests.Session()
-        self.session.headers.update({"Content-Type": "application/json"})
-    def reset(self) -> TriageObservation:
-        response = self.session.post(f"{self.base_url}/reset", json={}, timeout=30)
-        response.raise_for_status()
-        data = response.json()
-        obs_data = data.get("observation", data)
-        return _parse_observation(obs_data)
-    def step(self, action: TriageAction) -> StepResult:
-        try:
-            action_dict = action.model_dump()
-        except AttributeError:
-            action_dict = action.dict()
-        payload = {"action": action_dict}
-        response = self.session.post(f"{self.base_url}/step", json=payload, timeout=30)
-        response.raise_for_status()
-        data = response.json()
-        obs_data = data.get("observation", data)
-        obs = _parse_observation(obs_data)
-        return StepResult(
-            observation=obs,
-            reward=data.get("reward", obs.reward) or 0.0,
-            done=data.get("done", obs.done),
-            info={},
-        )
-    def state(self) -> dict:
-        response = self.session.get(f"{self.base_url}/state", timeout=30)
-        response.raise_for_status()
-        return response.json()
-    def close(self):
-        self.session.close()
-    def __enter__(self):
-        return self
-    def __exit__(self, *args):
-        self.close()

client.py CHANGED

@@ -1,12 +1,14 @@
-# client.py
 import os
 import requests
-from typing import Optional
 from model import TriageAction, TriageObservation, BugReport
 class StepResult:
-    def __init__(self, observation: TriageObservation, reward: float, done: bool, info: dict):
         self.observation = observation
         self.reward = reward
         self.done = done
@@ -14,11 +16,13 @@ class StepResult:
 def _parse_observation(data: dict) -> TriageObservation:
     bug_data = data["bug_report"]
     try:
         bug = BugReport.model_validate(bug_data)
     except Exception:
         bug = BugReport(**bug_data)
     return TriageObservation(
         bug_report=bug,
         task_id=data.get("task_id", "easy"),
@@ -26,10 +30,18 @@ def _parse_observation(data: dict) -> TriageObservation:
         feedback=data.get("feedback", ""),
         done=data.get("done", False),
         reward=data.get("reward", 0.0),
     )
 class BugTriageClient:
     def __init__(self, base_url: Optional[str] = None):
         self.base_url = (
             base_url
@@ -37,39 +49,91 @@ class BugTriageClient:
         ).rstrip("/")
         self.session = requests.Session()
         self.session.headers.update({"Content-Type": "application/json"})
-    def reset(self, task_id: str = "easy") -> TriageObservation:
         response = self.session.post(
-            f"{self.base_url}/reset",
-            json={"task_id": task_id},
-            timeout=30,
         )
         response.raise_for_status()
         data = response.json()
-        return _parse_observation(data.get("observation", data))
     def step(self, action: TriageAction) -> StepResult:
         try:
-            action_dict = action.model_dump()   # Pydantic v2
         except AttributeError:
-            action_dict = action.dict()         # Pydantic v1 fallback
         response = self.session.post(
-            f"{self.base_url}/step",
-            json={"action": action_dict},
-            timeout=30,
         )
         response.raise_for_status()
         data = response.json()
-        obs = _parse_observation(data.get("observation", data))
         return StepResult(
             observation=obs,
-            reward=data.get("reward", obs.reward) or 0.0,
             done=data.get("done", obs.done),
-            info={},
         )
     def state(self) -> dict:
-        response = self.session.get(f"{self.base_url}/state", timeout=30)
         response.raise_for_status()
         return response.json()

+# client.py — Single source of truth for environment client
 import os
 import requests
+from typing import Optional, List
 from model import TriageAction, TriageObservation, BugReport
 class StepResult:
+    """Result returned by env.step()."""
+    def __init__(self, observation: TriageObservation, reward: float,
+                 done: bool, info: dict):
         self.observation = observation
         self.reward = reward
         self.done = done
 def _parse_observation(data: dict) -> TriageObservation:
+    """Parse a JSON dict into a TriageObservation."""
     bug_data = data["bug_report"]
     try:
         bug = BugReport.model_validate(bug_data)
     except Exception:
         bug = BugReport(**bug_data)
     return TriageObservation(
         bug_report=bug,
         task_id=data.get("task_id", "easy"),
         feedback=data.get("feedback", ""),
         done=data.get("done", False),
         reward=data.get("reward", 0.0),
+        body_visible=data.get("body_visible", False),
+        comments_visible=data.get("comments_visible", False),
+        logs_visible=data.get("logs_visible", False),
+        similar_visible=data.get("similar_visible", False),
+        steps_taken=data.get("steps_taken", 0),
+        max_steps=data.get("max_steps", 6),
     )
 class BugTriageClient:
+    """HTTP client for the Bug Triage Environment server."""
     def __init__(self, base_url: Optional[str] = None):
         self.base_url = (
             base_url
         ).rstrip("/")
         self.session = requests.Session()
         self.session.headers.update({"Content-Type": "application/json"})
+        self._session_id: Optional[str] = None
+    @property
+    def session_id(self) -> Optional[str]:
+        return self._session_id
+    def reset(self, task_id: str = "easy", seed: Optional[int] = None) -> TriageObservation:
+        """Start a new episode. Stores session_id for subsequent step() calls."""
+        payload = {"task_id": task_id}
+        if seed is not None:
+            payload["seed"] = seed
+        if self._session_id:
+            payload["session_id"] = self._session_id
         response = self.session.post(
+            f"{self.base_url}/reset", json=payload, timeout=30,
         )
         response.raise_for_status()
         data = response.json()
+        self._session_id = data.get("session_id")
+        obs_data = data.get("observation", data)
+        return _parse_observation(obs_data)
     def step(self, action: TriageAction) -> StepResult:
+        """Send an action (investigation or submit) and get the result."""
         try:
+            action_dict = action.model_dump()
         except AttributeError:
+            action_dict = action.dict()
+        payload = {"action": action_dict}
+        if self._session_id:
+            payload["session_id"] = self._session_id
         response = self.session.post(
+            f"{self.base_url}/step", json=payload, timeout=30,
         )
         response.raise_for_status()
         data = response.json()
+        obs_data = data.get("observation", data)
+        obs = _parse_observation(obs_data)
+        reward = data.get("reward", obs.reward) or 0.0
+        reward = float(reward)
+        # Update session_id if server returned one
+        if "session_id" in data:
+            self._session_id = data["session_id"]
         return StepResult(
             observation=obs,
+            reward=reward,
             done=data.get("done", obs.done),
+            info=data.get("info", {}),
         )
+    def investigate(self, action_type: str) -> StepResult:
+        """Shortcut for investigation actions."""
+        action = TriageAction(action_type=action_type)
+        return self.step(action)
+    def submit(self, priority: str, labels: Optional[List[str]] = None,
+               assigned_team: str = "backend", milestone: str = "backlog",
+               reasoning: str = "") -> StepResult:
+        """Shortcut for submitting the final triage decision."""
+        action = TriageAction(
+            action_type="submit",
+            priority=priority,
+            labels=labels or ["bug"],
+            assigned_team=assigned_team,
+            milestone=milestone,
+            reasoning=reasoning,
+        )
+        return self.step(action)
     def state(self) -> dict:
+        """Get current environment state."""
+        params = {}
+        if self._session_id:
+            params["session_id"] = self._session_id
+        response = self.session.get(
+            f"{self.base_url}/state", params=params, timeout=30,
+        )
         response.raise_for_status()
         return response.json()
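
The session handling in `reset()`, `step()`, and `state()` above follows one pattern: attach `session_id` to the request only once the server has issued one. Below is a minimal sketch of that pattern, kept independent of `requests` so it can run on its own. The helper names `build_payload` and `next_session_id` are hypothetical and are not part of `client.py`; `next_session_id` mirrors `step()`, whereas `reset()` overwrites the stored id with whatever the server returns.

```python
from typing import Any, Dict, Optional


def build_payload(body: Dict[str, Any], session_id: Optional[str]) -> Dict[str, Any]:
    """Return a copy of body, adding session_id only when one is known."""
    payload = dict(body)
    if session_id:
        payload["session_id"] = session_id
    return payload


def next_session_id(response: Dict[str, Any], current: Optional[str]) -> Optional[str]:
    """Keep the current id unless the response carries a session_id key."""
    return response.get("session_id", current)


# Illustrative flow: the first request has no session, later ones reuse it.
session: Optional[str] = None
first = build_payload({"task_id": "easy"}, session)
session = next_session_id({"session_id": "abc123"}, session)  # made-up id
second = build_payload({"action": {"action_type": "read_body"}}, session)
```

Copying the body before adding the key keeps the caller's dict unchanged, which makes the helper safe to reuse across `reset` and `step` requests.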

inference.py CHANGED

@@ -20,6 +20,10 @@ from openai import OpenAI
 from model import TriageAction, TriageObservation, BugReport
 API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
 API_KEY      = os.getenv("HF_TOKEN") or os.getenv("API_KEY") or os.getenv("OPENAI_API_KEY")
 MODEL_NAME   = os.getenv("MODEL_NAME") or "meta-llama/Llama-3.3-70B-Instruct"
@@ -31,9 +35,9 @@ if not API_KEY:
 TASK_IDS                = ["easy", "medium", "hard"]
 BENCHMARK               = "bug-triage-env"
 TEMPERATURE             = 0.0
-MAX_TOKENS              = 400
-MAX_STEPS               = 1       # Each task is 1 step (reset → step → done)
-MAX_TOTAL_REWARD        = 1.0     # Per-task max reward
 SUCCESS_SCORE_THRESHOLD = 0.4
 print(f"[CONFIG] API_BASE_URL={API_BASE_URL}", flush=True)
@@ -41,7 +45,10 @@ print(f"[CONFIG] MODEL_NAME={MODEL_NAME}", flush=True)
 print(f"[CONFIG] ENV_BASE_URL={ENV_BASE_URL}", flush=True)
 print(f"[CONFIG] API_KEY={'set' if API_KEY else 'MISSING'}", flush=True)
-#inlined client
 def _parse_observation(data: dict) -> TriageObservation:
     try:
@@ -51,15 +58,22 @@ def _parse_observation(data: dict) -> TriageObservation:
     return TriageObservation(
         bug_report=bug,
         task_id=data.get("task_id", "easy"),
-        score=data.get("score", 0.05),
         feedback=data.get("feedback", ""),
         done=data.get("done", False),
-        reward=data.get("reward", 0.05),
     )
 class StepResult:
-    def __init__(self, observation: TriageObservation, reward: float, done: bool, info: dict):
         self.observation = observation
         self.reward = reward
         self.done = done
@@ -71,42 +85,53 @@ class BugTriageClient:
         self.base_url = (base_url or ENV_BASE_URL).rstrip("/")
         self.session = requests.Session()
         self.session.headers.update({"Content-Type": "application/json"})
     def reset(self, task_id: str = "easy") -> TriageObservation:
         print(f"[ENV] Resetting env for task={task_id}", flush=True)
         response = self.session.post(
-            f"{self.base_url}/reset",
-            json={"task_id": task_id},
-            timeout=30,
         )
         response.raise_for_status()
         data = response.json()
         return _parse_observation(data.get("observation", data))
     def step(self, action: TriageAction) -> StepResult:
-        print("[ENV] Sending step action...", flush=True)
         try:
             action_dict = action.model_dump()
         except AttributeError:
             action_dict = action.dict()
         response = self.session.post(
-            f"{self.base_url}/step",
-            json={"action": action_dict},
-            timeout=30,
         )
         response.raise_for_status()
         data = response.json()
         obs = _parse_observation(data.get("observation", data))
         reward = data.get("reward", obs.reward)
-        if reward is None or reward == 0:
-            reward = 0.05
         reward = float(reward)
-        reward = max(0.01, min(0.99, reward))
         return StepResult(
-            observation=obs,
-            reward=reward,
-            done=data.get("done", obs.done),
-            info={},
         )
     def close(self):
@@ -119,12 +144,14 @@ class BugTriageClient:
         self.close()
 SYSTEM_PROMPT = textwrap.dedent("""
-    You are a senior software engineering manager.
-    You will receive a bug report and must triage it. Respond ONLY with
-    valid JSON — no markdown, no explanation, no backticks.
     Return exactly this structure:
     {
@@ -143,22 +170,44 @@ SYSTEM_PROMPT = textwrap.dedent("""
     Teams: backend | frontend | infra | security | devx
     Milestones: hotfix | v2.1 | backlog
 """).strip()
 def log_start(task: str, env: str, model: str) -> None:
     print(f"[START] task={task} env={env} model={model}", flush=True)
-def log_step(
-    step: int,
-    action: str,
-    reward: float,
-    done: bool,
-    error: Optional[str] = None,
-) -> None:
     print(
         f"[STEP] step={step} action={action} "
         f"reward={reward:.2f} done={str(done).lower()} error={error or 'null'}",
@@ -166,7 +215,8 @@ def log_step(
     )
-def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
     print(
         f"[END] success={str(success).lower()} steps={steps} "
@@ -175,21 +225,97 @@ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> No
     )
 def format_bug(obs: TriageObservation) -> str:
     bug = obs.bug_report
-    comments = "\n".join(f"  - {c}" for c in bug.comments) if bug.comments else "  None"
-    return (
-        f"Title: {bug.title}\n\n"
-        f"Description:\n{bug.body}\n\n"
-        f"Existing labels: {', '.join(bug.labels_hint) if bug.labels_hint else 'none'}\n"
-        f"Comments:\n{comments}"
-    )
 def call_model(client: OpenAI, bug_text: str) -> TriageAction:
-    print("[LLM] Sending request to model...", flush=True)
     completion = client.chat.completions.create(
         model=MODEL_NAME,
@@ -218,6 +344,7 @@ def call_model(client: OpenAI, bug_text: str) -> TriageAction:
         data = {}
     action = TriageAction(
         priority=data.get("priority", "P2"),
         labels=data.get("labels", ["bug"]),
         assigned_team=data.get("assigned_team", "backend"),
@@ -233,12 +360,13 @@ def call_model(client: OpenAI, bug_text: str) -> TriageAction:
     return action
 def main() -> None:
     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
     all_scores = []
     with BugTriageClient(base_url=ENV_BASE_URL) as env:
@@ -247,32 +375,90 @@ def main() -> None:
             score = 0.0
             success = False
             steps_taken = 0
             log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
             try:
                 obs = env.reset(task_id=task_id)
-                action = call_model(client, format_bug(obs))
-                result = env.step(action)
-                reward = float(result.reward or 0.05)
-                reward = max(0.01, min(0.99, reward))
-                rewards.append(reward)
-                steps_taken = 1
-                action_str = (
-                    f"priority={action.priority},"
-                    f"team={action.assigned_team},"
-                    f"milestone={action.milestone}"
-                )
-                log_step(
-                    step=1,
-                    action=action_str,
-                    reward=reward,
-                    done=True,
-                )
-                score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
                 score = min(max(score, 0.01), 0.99)
                 success = score >= SUCCESS_SCORE_THRESHOLD
@@ -282,15 +468,17 @@ def main() -> None:
                 score = min(max(score, 0.01), 0.99)
                 success = False
-            # [END] for this task
             log_end(success, steps_taken, score, rewards)
             all_scores.append(score)
             time.sleep(0.5)
     avg_score = sum(all_scores) / len(all_scores) if all_scores else 0.0
-    print(f"[SUMMARY] tasks={len(all_scores)} avg_score={avg_score:.2f} scores={all_scores}", flush=True)
 if __name__ == "__main__":

 from model import TriageAction, TriageObservation, BugReport
+# ---------------------------------------------------------------------------
+#  CONFIG — uses env vars required by hackathon spec
+# ---------------------------------------------------------------------------
 API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
 API_KEY      = os.getenv("HF_TOKEN") or os.getenv("API_KEY") or os.getenv("OPENAI_API_KEY")
 MODEL_NAME   = os.getenv("MODEL_NAME") or "meta-llama/Llama-3.3-70B-Instruct"
 TASK_IDS                = ["easy", "medium", "hard"]
 BENCHMARK               = "bug-triage-env"
 TEMPERATURE             = 0.0
+MAX_TOKENS              = 500
+MAX_STEPS               = 4       # Max steps per task (investigate + submit)
+MAX_TOTAL_REWARD        = 1.0
 SUCCESS_SCORE_THRESHOLD = 0.4
 print(f"[CONFIG] API_BASE_URL={API_BASE_URL}", flush=True)
 print(f"[CONFIG] ENV_BASE_URL={ENV_BASE_URL}", flush=True)
 print(f"[CONFIG] API_KEY={'set' if API_KEY else 'MISSING'}", flush=True)
+# ---------------------------------------------------------------------------
+#  INLINED CLIENT — self-contained, no external dependency
+# ---------------------------------------------------------------------------
 def _parse_observation(data: dict) -> TriageObservation:
     try:
     return TriageObservation(
         bug_report=bug,
         task_id=data.get("task_id", "easy"),
+        score=data.get("score", 0.0),
         feedback=data.get("feedback", ""),
         done=data.get("done", False),
+        reward=data.get("reward", 0.0),
+        body_visible=data.get("body_visible", False),
+        comments_visible=data.get("comments_visible", False),
+        logs_visible=data.get("logs_visible", False),
+        similar_visible=data.get("similar_visible", False),
+        steps_taken=data.get("steps_taken", 0),
+        max_steps=data.get("max_steps", 6),
     )
 class StepResult:
+    def __init__(self, observation: TriageObservation, reward: float,
+                 done: bool, info: dict):
         self.observation = observation
         self.reward = reward
         self.done = done
         self.base_url = (base_url or ENV_BASE_URL).rstrip("/")
         self.session = requests.Session()
         self.session.headers.update({"Content-Type": "application/json"})
+        self._session_id: Optional[str] = None
     def reset(self, task_id: str = "easy") -> TriageObservation:
         print(f"[ENV] Resetting env for task={task_id}", flush=True)
+        payload = {"task_id": task_id}
+        if self._session_id:
+            payload["session_id"] = self._session_id
         response = self.session.post(
+            f"{self.base_url}/reset", json=payload, timeout=30,
         )
         response.raise_for_status()
         data = response.json()
+        self._session_id = data.get("session_id")
         return _parse_observation(data.get("observation", data))
     def step(self, action: TriageAction) -> StepResult:
+        print(f"[ENV] Sending step: action_type={action.action_type}", flush=True)
         try:
             action_dict = action.model_dump()
         except AttributeError:
             action_dict = action.dict()
+        payload = {"action": action_dict}
+        if self._session_id:
+            payload["session_id"] = self._session_id
         response = self.session.post(
+            f"{self.base_url}/step", json=payload, timeout=30,
         )
         response.raise_for_status()
         data = response.json()
         obs = _parse_observation(data.get("observation", data))
         reward = data.get("reward", obs.reward)
+        if reward is None:
+            reward = 0.0
         reward = float(reward)
+        if obs.done:
+            reward = max(0.01, min(0.99, reward))
+        if "session_id" in data:
+            self._session_id = data["session_id"]
         return StepResult(
+            observation=obs, reward=reward,
+            done=data.get("done", obs.done), info={},
         )
     def close(self):
         self.close()
+# ---------------------------------------------------------------------------
+#  LLM PROMPTS
+# ---------------------------------------------------------------------------
 SYSTEM_PROMPT = textwrap.dedent("""
+    You are a senior software engineering manager triaging a bug report.
+    You will receive a bug report (possibly with partial information).
+    Respond ONLY with valid JSON — no markdown, no explanation, no backticks.
     Return exactly this structure:
     {
     Teams: backend | frontend | infra | security | devx
     Milestones: hotfix | v2.1 | backlog
+    Important: Pay attention to security signals (SQL injection, XSS, auth bypass,
+    data exposure). Security bugs should almost always be P0 + security team + hotfix.
 """).strip()
+INVESTIGATION_PROMPT = textwrap.dedent("""
+    You are deciding whether to investigate further or submit your triage.
+    You have seen the following information about a bug. Based on what you see,
+    decide if you need more information or can triage now.
+    Respond with ONLY one of these JSON formats:
+    To investigate: {"action": "read_body"} or {"action": "read_comments"} or {"action": "check_logs"}
+    To submit:
+    {
+      "action": "submit",
+      "priority": "P0",
+      "labels": ["bug"],
+      "assigned_team": "backend",
+      "milestone": "hotfix",
+      "reasoning": "explanation"
+    }
+    Only investigate if the title and preview are genuinely ambiguous.
+    If the bug is clearly a typo or clearly critical, submit immediately.
+""").strip()
+# ---------------------------------------------------------------------------
+#  STRUCTURED LOGGING — strict [START]/[STEP]/[END] format
+# ---------------------------------------------------------------------------
 def log_start(task: str, env: str, model: str) -> None:
     print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool,
+             error: Optional[str] = None) -> None:
     print(
         f"[STEP] step={step} action={action} "
         f"reward={reward:.2f} done={str(done).lower()} error={error or 'null'}",
     )
+def log_end(success: bool, steps: int, score: float,
+            rewards: List[float]) -> None:
     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
     print(
         f"[END] success={str(success).lower()} steps={steps} "
     )
+# ---------------------------------------------------------------------------
+#  BUG FORMATTING
+# ---------------------------------------------------------------------------
 def format_bug(obs: TriageObservation) -> str:
+    """Format a bug observation into text the LLM can read."""
     bug = obs.bug_report
+    parts = [f"Title: {bug.title}"]
+    parts.append(f"\nDescription:\n{bug.body}")
+    if obs.comments_visible and bug.comments:
+        comments = "\n".join(f"  - {c}" for c in bug.comments)
+        parts.append(f"\nComments:\n{comments}")
+    if bug.labels_hint:
+        parts.append(f"\nExisting labels: {', '.join(bug.labels_hint)}")
+    if obs.logs_visible:
+        if bug.stack_trace:
+            parts.append(f"\nStack trace: {bug.stack_trace}")
+        if bug.affected_component:
+            parts.append(f"\nAffected component: {bug.affected_component}")
+        if bug.severity_signals:
+            parts.append(f"\nSeverity signals: {', '.join(bug.severity_signals)}")
+    if obs.similar_visible and bug.related_bugs:
+        parts.append(f"\nRelated bugs: {', '.join(bug.related_bugs)}")
+    # Add visibility context
+    visibility = []
+    if not obs.body_visible:
+        visibility.append("body (truncated)")
+    if not obs.comments_visible:
+        visibility.append("comments (hidden)")
+    if not obs.logs_visible:
+        visibility.append("logs (hidden)")
+    if visibility:
+        parts.append(f"\n[Hidden info: {', '.join(visibility)}]")
+    parts.append(f"\nSteps used: {obs.steps_taken}/{obs.max_steps}")
+    return "\n".join(parts)
+def format_bug_for_decision(obs: TriageObservation) -> str:
+    """Shorter format for the investigation decision."""
+    bug = obs.bug_report
+    text = f"Title: {bug.title}\nPreview: {bug.body[:150]}"
+    if obs.body_visible:
+        text += "\n\nFull body visible."
+    if obs.comments_visible and bug.comments:
+        text += f"\nComments: {len(bug.comments)} visible."
+    text += f"\nSteps remaining: {obs.max_steps - obs.steps_taken}"
+    return text
+# ---------------------------------------------------------------------------
+#  MODEL CALLS
+# ---------------------------------------------------------------------------
+def decide_action(client: OpenAI, obs: TriageObservation) -> dict:
+    """Ask the LLM whether to investigate or submit."""
+    bug_text = format_bug_for_decision(obs)
+    try:
+        completion = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": INVESTIGATION_PROMPT},
+                {"role": "user", "content": bug_text},
+            ],
+            temperature=TEMPERATURE,
+            max_tokens=200,
+            stream=False,
+        )
+        raw = (completion.choices[0].message.content or "").strip()
+        if raw.startswith("```"):
+            parts = raw.split("```")
+            raw = parts[1] if len(parts) > 1 else raw
+            if raw.startswith("json"):
+                raw = raw[4:].strip()
+        return json.loads(raw)
+    except Exception as e:
+        print(f"[DEBUG] Decision model call failed: {e}", flush=True)
+        return {"action": "submit"}
 def call_model(client: OpenAI, bug_text: str) -> TriageAction:
+    """Ask the LLM to triage the bug report."""
+    print("[LLM] Sending triage request to model...", flush=True)
     completion = client.chat.completions.create(
         model=MODEL_NAME,
         data = {}
     action = TriageAction(
+        action_type="submit",
         priority=data.get("priority", "P2"),
         labels=data.get("labels", ["bug"]),
         assigned_team=data.get("assigned_team", "backend"),
     return action
+# ---------------------------------------------------------------------------
+#  MAIN — multi-step agent with per-task [START]/[STEP]/[END] logging
+# ---------------------------------------------------------------------------
 def main() -> None:
     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
     all_scores = []
     with BugTriageClient(base_url=ENV_BASE_URL) as env:
             score = 0.0
             success = False
             steps_taken = 0
             log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
             try:
                 obs = env.reset(task_id=task_id)
+                for step_num in range(1, MAX_STEPS + 1):
+                    if obs.done:
+                        break
+                    # Fixed investigation schedule for efficiency:
+                    # step 1 reads the full body, step 2 reads the comments,
+                    # and the next step submits the triage decision.
+                    if step_num == 1 and not obs.body_visible:
+                        # First step: read the full body
+                        action = TriageAction(action_type="read_body")
+                        result = env.step(action)
+                        obs = result.observation
+                        steps_taken = step_num
+                        log_step(
+                            step=step_num,
+                            action="investigate:read_body",
+                            reward=0.0,
+                            done=result.done,
+                        )
+                        if result.done:
+                            rewards.append(result.reward)
+                            break
+                        continue
+                    elif step_num == 2 and not obs.comments_visible:
+                        # Second step: read comments for extra context
+                        action = TriageAction(action_type="read_comments")
+                        result = env.step(action)
+                        obs = result.observation
+                        steps_taken = step_num
+                        log_step(
+                            step=step_num,
+                            action="investigate:read_comments",
+                            reward=0.0,
+                            done=result.done,
+                        )
+                        if result.done:
+                            rewards.append(result.reward)
+                            break
+                        continue
+                    # Now submit the triage decision
+                    bug_text = format_bug(obs)
+                    action = call_model(client, bug_text)
+                    result = env.step(action)
+                    obs = result.observation
+                    steps_taken = step_num
+                    reward = float(result.reward or 0.0)
+                    if result.done:
+                        reward = max(0.01, min(0.99, reward))
+                    rewards.append(reward)
+                    action_str = (
+                        f"priority={action.priority},"
+                        f"team={action.assigned_team},"
+                        f"milestone={action.milestone}"
+                    )
+                    log_step(
+                        step=step_num,
+                        action=action_str,
+                        reward=reward,
+                        done=result.done,
+                    )
+                    if result.done:
+                        break
+                # Calculate score
+                if rewards:
+                    score = sum(rewards) / MAX_TOTAL_REWARD
+                else:
+                    score = 0.0
                 score = min(max(score, 0.01), 0.99)
                 success = score >= SUCCESS_SCORE_THRESHOLD
                 score = min(max(score, 0.01), 0.99)
                 success = False
             log_end(success, steps_taken, score, rewards)
             all_scores.append(score)
             time.sleep(0.5)
     avg_score = sum(all_scores) / len(all_scores) if all_scores else 0.0
+    print(
+        f"[SUMMARY] tasks={len(all_scores)} avg_score={avg_score:.2f} "
+        f"scores={all_scores}",
+        flush=True,
+    )
 if __name__ == "__main__":
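
`decide_action()` above tolerates replies wrapped in a Markdown code fence and falls back to a plain submit when the reply cannot be parsed. Here is a standalone sketch of that parsing step, assuming the same fallback. The function name `parse_llm_json` and the `FENCE` constant are hypothetical and do not appear in `inference.py`.

```python
import json
from typing import Any, Dict

FENCE = "`" * 3  # three backticks, built at runtime to keep this listing readable


def parse_llm_json(raw: str) -> Dict[str, Any]:
    """Strip an optional Markdown code fence, then parse the JSON reply.

    Returns a bare submit action when the text is not valid JSON.
    """
    text = raw.strip()
    if text.startswith(FENCE):
        parts = text.split(FENCE)
        text = parts[1] if len(parts) > 1 else text
        if text.startswith("json"):
            text = text[4:]  # drop the language tag after the opening fence
        text = text.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {"action": "submit"}
```

Unlike the original, which catches every exception around the API call as well, this sketch only guards the parsing step, so network errors would still surface to the caller.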

model.py CHANGED

@@ -1,13 +1,10 @@
 # model.py
-from typing import List
 from pydantic import BaseModel, Field
 from openenv.core.env_server import Action, Observation
 from openenv.core.env_server.types import State
 class BugReport(BaseModel):
     """A single GitHub-style bug report."""
     id: str
@@ -16,16 +13,21 @@ class BugReport(BaseModel):
     author: str
     labels_hint: List[str] = Field(default_factory=list)
     comments: List[str] = Field(default_factory=list)
     class Config:
         arbitrary_types_allowed = True
 class TriageAction(Action):
-    """What the agent submits as its triage decision."""
-    priority: str            # "P0" | "P1" | "P2" | "P3"
     labels: List[str] = Field(default_factory=list)
     assigned_team: str = "backend"
     milestone: str = "backlog"
@@ -36,7 +38,7 @@ class TriageAction(Action):
 class TriageObservation(Observation):
-    """What the agent sees after each step."""
     bug_report: BugReport
     task_id: str = "easy"
     score: float = 0.0
@@ -44,6 +46,14 @@ class TriageObservation(Observation):
     done: bool = False
     reward: float = 0.0
     class Config:
         arbitrary_types_allowed = True
@@ -51,10 +61,12 @@ class TriageObservation(Observation):
 class TriageState(State):
     """Internal episode state."""
     episode_id: str = ""
     current_task: str = "easy"
     step_count: int = 0
     total_score: float = 0.0
     tasks_completed: List[str] = Field(default_factory=list)
     class Config:
         arbitrary_types_allowed = True

 # model.py
+from typing import List, Optional, Dict, Any
 from pydantic import BaseModel, Field
 from openenv.core.env_server import Action, Observation
 from openenv.core.env_server.types import State
 class BugReport(BaseModel):
     """A single GitHub-style bug report."""
     id: str
     author: str
     labels_hint: List[str] = Field(default_factory=list)
     comments: List[str] = Field(default_factory=list)
+    severity_signals: List[str] = Field(default_factory=list)
+    related_bugs: List[str] = Field(default_factory=list)
+    stack_trace: str = ""
+    affected_component: str = ""
     class Config:
         arbitrary_types_allowed = True
 class TriageAction(Action):
+    """What the agent submits — either an investigation or a final triage decision."""
+    action_type: str = "submit"     # "read_body" | "read_comments" | "check_logs" | "check_similar" | "submit"
+    # Only used when action_type == "submit"
+    priority: str = "P2"
     labels: List[str] = Field(default_factory=list)
     assigned_team: str = "backend"
     milestone: str = "backlog"
 class TriageObservation(Observation):
+    """What the agent sees after each step — progressively reveals info."""
     bug_report: BugReport
     task_id: str = "easy"
     score: float = 0.0
     done: bool = False
     reward: float = 0.0
+    # Progressive visibility fields
+    body_visible: bool = False
+    comments_visible: bool = False
+    logs_visible: bool = False
+    similar_visible: bool = False
+    steps_taken: int = 0
+    max_steps: int = 6
     class Config:
         arbitrary_types_allowed = True
 class TriageState(State):
     """Internal episode state."""
     episode_id: str = ""
+    session_id: str = ""
     current_task: str = "easy"
     step_count: int = 0
     total_score: float = 0.0
     tasks_completed: List[str] = Field(default_factory=list)
+    actions_taken: List[str] = Field(default_factory=list)
     class Config:
         arbitrary_types_allowed = True
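
The new `action_type` values and the `*_visible` flags suggest a one-to-one mapping between investigation actions and the information they reveal. The sketch below illustrates that reading with a plain dataclass instead of the Pydantic models. The mapping, the `Visibility` class, and the `apply_action` helper are assumptions for illustration; the actual rules live in `server/environment.py`, which is not shown here, and the end-of-episode condition (submit, or reaching `max_steps`) is likewise assumed.

```python
from dataclasses import dataclass

# Assumed mapping from investigation action to the flag it would set.
INVESTIGATION_FLAGS = {
    "read_body": "body_visible",
    "read_comments": "comments_visible",
    "check_logs": "logs_visible",
    "check_similar": "similar_visible",
}


@dataclass
class Visibility:
    """Stand-in for the visibility fields on TriageObservation."""
    body_visible: bool = False
    comments_visible: bool = False
    logs_visible: bool = False
    similar_visible: bool = False
    steps_taken: int = 0
    max_steps: int = 6


def apply_action(vis: Visibility, action_type: str) -> bool:
    """Apply one action; return True when the episode should end (assumed rule)."""
    if action_type != "submit" and action_type not in INVESTIGATION_FLAGS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    vis.steps_taken += 1
    if action_type == "submit":
        return True
    setattr(vis, INVESTIGATION_FLAGS[action_type], True)
    return vis.steps_taken >= vis.max_steps
```

Rejecting unknown action types up front is one way to compensate for `action_type` being a free-form `str` in the model rather than a constrained type.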

openenv.yaml CHANGED

@@ -1,32 +1,43 @@
 spec_version: 1
 name: bug-triage-env
-version: "1.0.0"
 description: >
-  A reinforcement learning environment where an agent triages
-  GitHub-style bug reports by assigning priority, labels, team,
-  and milestone. 3 tasks of increasing difficulty (easy → medium → hard).
 endpoint: https://siteshcodes-bug-triage-env.hf.space
 tags:
   - openenv
   - bug-triage
   - real-world
   - nlp
 tasks:
   - id: easy
     name: Priority Assignment
-    description: Assign correct P0-P3 priority to a bug report
     difficulty: easy
     grader: server.task:priority_match
     reward_range: [0.0, 1.0]
   - id: medium
     name: Priority Labels and Team
-    description: Assign correct priority, labels, and team routing
     difficulty: medium
     grader: server.task:priority_label_team
     reward_range: [0.0, 1.0]
   - id: hard
     name: Full Triage
-    description: Full triage with priority, labels, team, milestone and security penalty
     difficulty: hard
     grader: server.task:full_triage
     reward_range: [0.0, 1.0]
@@ -35,6 +46,7 @@ endpoints:
   step: /step
   state: /state
 actions:
   priority: string
   labels: list
   assigned_team: string
@@ -46,4 +58,10 @@ observations:
   score: float
   reward: float
   feedback: string
-  done: bool

 spec_version: 1
 name: bug-triage-env
+version: "2.0.0"
 description: >
+  A multi-step reinforcement learning environment where an AI agent
+  investigates and triages GitHub-style bug reports by assigning
+  priority, labels, team, and milestone. Features progressive
+  information reveal, procedural bug generation (200+ unique bugs),
+  semantic label matching, and a security escalation penalty.
+  3 tasks of increasing difficulty (easy → medium → hard).
 endpoint: https://siteshcodes-bug-triage-env.hf.space
 tags:
   - openenv
   - bug-triage
   - real-world
   - nlp
+  - multi-step
 tasks:
   - id: easy
     name: Priority Assignment
+    description: >
+      Investigate a bug report and assign correct P0-P3 priority.
+      Use investigation actions to gather info before submitting.
     difficulty: easy
     grader: server.task:priority_match
     reward_range: [0.0, 1.0]
   - id: medium
     name: Priority Labels and Team
+    description: >
+      Investigate and assign correct priority, labels, and team
+      routing. More investigation steps available.
     difficulty: medium
     grader: server.task:priority_label_team
     reward_range: [0.0, 1.0]
   - id: hard
     name: Full Triage
+    description: >
+      Full triage with priority, labels, team, milestone and
+      security escalation penalty. Investigation is critical —
+      missing security signals is penalized.
     difficulty: hard
     grader: server.task:full_triage
     reward_range: [0.0, 1.0]
   step: /step
   state: /state
 actions:
+  action_type: string
   priority: string
   labels: list
   assigned_team: string
   score: float
   reward: float
   feedback: string
+  done: bool
+  body_visible: bool
+  comments_visible: bool
+  logs_visible: bool
+  similar_visible: bool
+  steps_taken: int
+  max_steps: int

pyproject.toml CHANGED

@@ -4,8 +4,8 @@ build-backend = "setuptools.backends.legacy:build"
 [project]
 name = "bug-triage-env"
-version = "1.0.0"
-description = "OpenEnv RL environment for bug report triage"
 requires-python = ">=3.11"
 dependencies = [
     "openenv-core>=0.2.0",
@@ -13,6 +13,15 @@ dependencies = [
     "uvicorn[standard]",
     "pydantic",
     "websockets",
     "groq",
 ]

 [project]
 name = "bug-triage-env"
+version = "2.0.0"
+description = "Multi-step OpenEnv RL environment for bug report triage"
 requires-python = ">=3.11"
 dependencies = [
     "openenv-core>=0.2.0",
     "uvicorn[standard]",
     "pydantic",
     "websockets",
+    "requests",
+    "openai",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=7.0",
+    "pytest-cov",
+    "httpx",
     "groq",
 ]

server/__pycache__/__init__.cpython-314.pyc DELETED

Binary file (434 Bytes)

server/__pycache__/task.cpython-314.pyc DELETED

Binary file (14.5 kB)

server/app.py CHANGED

@@ -1,18 +1,16 @@
 # server/app.py
 import sys
 import os
-import json
 sys.path.insert(0, "/app")
 sys.path.insert(0, "/app/server")
 from openenv.core.env_server import create_app
 from model import TriageAction, TriageObservation
-from environment import BugTriageEnvironment
 from task import sample_bug, grade_action, TASKS
-from fastapi import Response, Request
-from fastapi.responses import FileResponse
 from fastapi.staticfiles import StaticFiles
-from pydantic import BaseModel
 from typing import Optional, Dict, Any
 app = create_app(
@@ -22,39 +20,15 @@ app = create_app(
     env_name="bug-triage-env",
 )
-TASKS_META = [
-    {
-        "id": "easy",
-        "name": "Priority Assignment",
-        "description": "Assign correct P0-P3 priority to a bug report",
-        "difficulty": "easy",
-        "grader": "server.task:priority_match",
-        "reward_range": [0.0, 1.0]
-    },
-    {
-        "id": "medium",
-        "name": "Priority Labels and Team",
-        "description": "Assign correct priority, labels, and team routing",
-        "difficulty": "medium",
-        "grader": "server.task:priority_label_team",
-        "reward_range": [0.0, 1.0]
-    },
-    {
-        "id": "hard",
-        "name": "Full Triage",
-        "description": "Full triage with priority, labels, team, milestone and security penalty",
-        "difficulty": "hard",
-        "grader": "server.task:full_triage",
-        "reward_range": [0.0, 1.0]
-    }
-]
-_global_env = BugTriageEnvironment()
 routes_to_remove = []
 for route in app.routes:
     if hasattr(route, "path") and route.path in ("/reset", "/step", "/state"):
@@ -63,44 +37,60 @@ for route in routes_to_remove:
     app.routes.remove(route)
 @app.get("/health")
 def health():
-    return {"status": "ok", "env": "bug-triage-env"}
 @app.get("/")
 def root():
-    """Serve the interactive demo frontend at root."""
     static_dir = os.path.join(os.path.dirname(__file__), "static")
-    return FileResponse(os.path.join(static_dir, "index.html"))
 @app.get("/web")
 def web_ui():
     """Alias for the frontend."""
-    static_dir = os.path.join(os.path.dirname(__file__), "static")
-    return FileResponse(os.path.join(static_dir, "index.html"))
 @app.get("/tasks")
 def list_tasks():
     return TASKS_META
-@app.get("/tasks/easy")
-def task_easy():
-    return TASKS_META[0]
-@app.get("/tasks/medium")
-def task_medium():
-    return TASKS_META[1]
-@app.get("/tasks/hard")
-def task_hard():
-    return TASKS_META[2]
 @app.post("/reset")
 async def custom_reset(request: Request):
-    """Stateful reset — remembers the bug for the subsequent step() call."""
-    global _global_env
     body = {}
     try:
@@ -111,9 +101,20 @@ async def custom_reset(request: Request):
     task_id = body.get("task_id", "easy")
     seed = body.get("seed", None)
     episode_id = body.get("episode_id", None)
-    _global_env = BugTriageEnvironment()
-    obs = _global_env.reset(task_id=task_id, seed=seed, episode_id=episode_id)
     try:
         obs_dict = obs.model_dump(exclude={"reward", "done", "metadata"})
@@ -124,21 +125,31 @@ async def custom_reset(request: Request):
         obs_dict.pop("metadata", None)
     return {
         "observation": obs_dict,
-        "reward": obs.reward,
-        "done": obs.done,
     }
 @app.post("/step")
 async def custom_step(request: Request):
-    """Stateful step — uses the bug from the last reset() call."""
-    global _global_env
     body = await request.json()
     action_data = body.get("action", body)
     action = TriageAction(
         priority=action_data.get("priority", "P2"),
         labels=action_data.get("labels", ["bug"]),
         assigned_team=action_data.get("assigned_team", "backend"),
@@ -146,7 +157,7 @@ async def custom_step(request: Request):
         reasoning=action_data.get("reasoning", ""),
     )
-    obs = _global_env.step(action)
     try:
         obs_dict = obs.model_dump(exclude={"reward", "done", "metadata"})
@@ -156,53 +167,122 @@ async def custom_step(request: Request):
         obs_dict.pop("done", None)
         obs_dict.pop("metadata", None)
-    reward = float(obs.reward) if obs.reward is not None else 0.05
-    # Strictly clamp to open interval (0, 1)
-    reward = max(0.01, min(0.99, reward))
-    return {
         "observation": obs_dict,
         "reward": reward,
         "done": obs.done,
     }
 @app.get("/state")
-def custom_state():
     """Return current environment state."""
-    global _global_env
-    state = _global_env.get_state()
     try:
         return state.model_dump()
     except AttributeError:
         return state.dict()
 @app.post("/tasks/easy/reset")
-def reset_easy():
-    global _global_env
-    _global_env = BugTriageEnvironment()
-    obs = _global_env.reset(task_id="easy")
-    return {"task_id": "easy", "bug_report": obs.bug_report.model_dump(), "done": False, "reward": 0.05}
 @app.post("/tasks/medium/reset")
-def reset_medium():
-    global _global_env
-    _global_env = BugTriageEnvironment()
-    obs = _global_env.reset(task_id="medium")
-    return {"task_id": "medium", "bug_report": obs.bug_report.model_dump(), "done": False, "reward": 0.05}
 @app.post("/tasks/hard/reset")
-def reset_hard():
-    global _global_env
-    _global_env = BugTriageEnvironment()
-    obs = _global_env.reset(task_id="hard")
-    return {"task_id": "hard", "bug_report": obs.bug_report.model_dump(), "done": False, "reward": 0.05}
 def main():
     import uvicorn
     uvicorn.run(app, host="0.0.0.0", port=7860)
 if __name__ == "__main__":
     main()

 # server/app.py
 import sys
 import os
 sys.path.insert(0, "/app")
 sys.path.insert(0, "/app/server")
 from openenv.core.env_server import create_app
 from model import TriageAction, TriageObservation
+from environment import BugTriageEnvironment, SessionManager, TASKS_META
 from task import sample_bug, grade_action, TASKS
+from fastapi import Response, Request, HTTPException
+from fastapi.responses import FileResponse, JSONResponse
 from fastapi.staticfiles import StaticFiles
 from typing import Optional, Dict, Any
 app = create_app(
     env_name="bug-triage-env",
 )
+# Session manager replaces the broken global state
+sessions = SessionManager(max_sessions=500, ttl_seconds=600)
+# Fallback env for backward-compatible (non-session) requests
+_fallback_env = BugTriageEnvironment()
+_fallback_answer = None
+# Remove default routes from create_app — we override them
 routes_to_remove = []
 for route in app.routes:
     if hasattr(route, "path") and route.path in ("/reset", "/step", "/state"):
     app.routes.remove(route)
+# ---------------------------------------------------------------------------
+#  CORE ENDPOINTS
+# ---------------------------------------------------------------------------
 @app.get("/health")
 def health():
+    return {
+        "status": "ok",
+        "env": "bug-triage-env",
+        "version": "2.0.0",
+        "active_sessions": sessions.active_count,
+    }
 @app.get("/")
 def root():
+    """Serve the interactive demo frontend."""
     static_dir = os.path.join(os.path.dirname(__file__), "static")
+    index_path = os.path.join(static_dir, "index.html")
+    if os.path.exists(index_path):
+        return FileResponse(index_path)
+    return {"message": "Bug Triage Environment v2.0.0", "docs": "/docs"}
 @app.get("/web")
 def web_ui():
     """Alias for the frontend."""
+    return root()
 @app.get("/tasks")
 def list_tasks():
     return TASKS_META
+@app.get("/tasks/{task_id}")
+def get_task(task_id: str):
+    for t in TASKS_META:
+        if t["id"] == task_id:
+            return t
+    raise HTTPException(404, detail={
+        "error": "task_not_found",
+        "message": f"Task '{task_id}' not found. Valid: easy, medium, hard",
+    })
+# ---------------------------------------------------------------------------
+#  SESSION-BASED RESET / STEP / STATE
+# ---------------------------------------------------------------------------
 @app.post("/reset")
 async def custom_reset(request: Request):
+    """Start a new episode. Returns a session_id for subsequent step() calls."""
+    global _fallback_env, _fallback_answer
     body = {}
     try:
     task_id = body.get("task_id", "easy")
     seed = body.get("seed", None)
     episode_id = body.get("episode_id", None)
+    session_id = body.get("session_id", None)
+    # If session_id provided, reuse that session
+    if session_id:
+        env = sessions.get_session(session_id)
+        if env is None:
+            session_id, env = sessions.create_session()
+    else:
+        session_id, env = sessions.create_session()
+    obs = env.reset(task_id=task_id, seed=seed, episode_id=episode_id)
+    # Also update fallback for backward compatibility
+    _fallback_env = env
     try:
         obs_dict = obs.model_dump(exclude={"reward", "done", "metadata"})
         obs_dict.pop("metadata", None)
     return {
+        "session_id": session_id,
         "observation": obs_dict,
+        "reward": 0.0,
+        "done": False,
     }
 @app.post("/step")
 async def custom_step(request: Request):
+    """Process an action — either investigation or final triage submission."""
+    global _fallback_env
     body = await request.json()
     action_data = body.get("action", body)
+    session_id = body.get("session_id", None)
+    # Find the right environment
+    env = None
+    if session_id:
+        env = sessions.get_session(session_id)
+    if env is None:
+        env = _fallback_env
     action = TriageAction(
+        action_type=action_data.get("action_type", "submit"),
         priority=action_data.get("priority", "P2"),
         labels=action_data.get("labels", ["bug"]),
         assigned_team=action_data.get("assigned_team", "backend"),
         reasoning=action_data.get("reasoning", ""),
     )
+    obs = env.step(action)
     try:
         obs_dict = obs.model_dump(exclude={"reward", "done", "metadata"})
         obs_dict.pop("done", None)
         obs_dict.pop("metadata", None)
+    reward = float(obs.reward) if obs.reward is not None else 0.0
+    reward = max(0.01, min(0.99, reward)) if obs.done else 0.0
+    response_data = {
         "observation": obs_dict,
         "reward": reward,
         "done": obs.done,
     }
+    if session_id:
+        response_data["session_id"] = session_id
+    # Cleanup session when episode is done
+    if obs.done and session_id:
+        sessions.remove_session(session_id)
+    return response_data
 @app.get("/state")
+def custom_state(session_id: Optional[str] = None):
     """Return current environment state."""
+    env = None
+    if session_id:
+        env = sessions.get_session(session_id)
+    if env is None:
+        env = _fallback_env
+    state = env.get_state()
     try:
         return state.model_dump()
     except AttributeError:
         return state.dict()
+# ---------------------------------------------------------------------------
+#  PER-TASK SHORTCUT ENDPOINTS
+# ---------------------------------------------------------------------------
 @app.post("/tasks/easy/reset")
+async def reset_easy():
+    session_id, env = sessions.create_session()
+    obs = env.reset(task_id="easy")
+    return {
+        "session_id": session_id,
+        "task_id": "easy",
+        "bug_report": obs.bug_report.model_dump(),
+        "done": False,
+        "reward": 0.0,
+    }
 @app.post("/tasks/medium/reset")
+async def reset_medium():
+    session_id, env = sessions.create_session()
+    obs = env.reset(task_id="medium")
+    return {
+        "session_id": session_id,
+        "task_id": "medium",
+        "bug_report": obs.bug_report.model_dump(),
+        "done": False,
+        "reward": 0.0,
+    }
 @app.post("/tasks/hard/reset")
+async def reset_hard():
+    session_id, env = sessions.create_session()
+    obs = env.reset(task_id="hard")
+    return {
+        "session_id": session_id,
+        "task_id": "hard",
+        "bug_report": obs.bug_report.model_dump(),
+        "done": False,
+        "reward": 0.0,
+    }
+# ---------------------------------------------------------------------------
+#  LEADERBOARD
+# ---------------------------------------------------------------------------
+_leaderboard = []
+@app.get("/leaderboard")
+def get_leaderboard():
+    """Return top 50 agent scores."""
+    return sorted(_leaderboard, key=lambda x: x.get("avg_score", 0), reverse=True)[:50]
+@app.post("/leaderboard/submit")
+async def submit_to_leaderboard(request: Request):
+    """Submit agent scores to the leaderboard."""
+    body = await request.json()
+    entry = {
+        "agent_name": body.get("agent_name", "anonymous"),
+        "model": body.get("model", "unknown"),
+        "scores": body.get("scores", {}),
+        "avg_score": body.get("avg_score", 0.0),
+    }
+    _leaderboard.append(entry)
+    rank = sorted(
+        _leaderboard, key=lambda x: x.get("avg_score", 0), reverse=True
+    ).index(entry) + 1
+    return {"status": "submitted", "rank": rank, "total_entries": len(_leaderboard)}
+# ---------------------------------------------------------------------------
+#  ENTRYPOINT
+# ---------------------------------------------------------------------------
 def main():
     import uvicorn
     uvicorn.run(app, host="0.0.0.0", port=7860)
 if __name__ == "__main__":
     main()
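
Note on the session flow in server/app.py: `/reset` now returns a `session_id`, `/step` reads it from the request body next to `action`, and the session is removed once `done` is true. A minimal client-side sketch of one episode follows. The helper names `run_episode` and `make_http_post` and the `PostFn` type are hypothetical and not part of this commit; only the paths and payload keys are taken from the diff.

```python
from typing import Any, Callable, Dict, List

# Hypothetical transport type: POST a JSON payload to a path, return the decoded reply.
PostFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_episode(post: PostFn, task_id: str, investigations: List[str],
                triage: Dict[str, Any]) -> Dict[str, Any]:
    """Reset, send investigation actions, then submit. Returns the final reply."""
    reply = post("/reset", {"task_id": task_id})
    session_id = reply["session_id"]
    for action_type in investigations:
        reply = post("/step", {"session_id": session_id,
                               "action": {"action_type": action_type}})
        if reply["done"]:
            # The environment forces a submission once the step limit is hit.
            return reply
    return post("/step", {"session_id": session_id,
                          "action": {"action_type": "submit", **triage}})


def make_http_post(base_url: str) -> PostFn:
    """Build a PostFn on top of the `requests` library (imported lazily)."""
    import requests  # third-party; only needed for real HTTP calls

    def post(path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        response = requests.post(base_url + path, json=payload, timeout=30)
        response.raise_for_status()
        return response.json()

    return post
```

Passing the transport in as a function keeps the episode logic testable without a running server. Against a live deployment one would call something like `run_episode(make_http_post(base_url), "easy", ["read_body"], {"priority": "P1"})`, where `base_url` is the Space endpoint.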

server/environment.py CHANGED

@@ -3,25 +3,39 @@ import sys
 sys.path.insert(0, "/app")
 sys.path.insert(0, "/app/server")
 import uuid
 from openenv.core.env_server.interfaces import Environment
 from model import TriageAction, TriageObservation, TriageState, BugReport
 from task import grade_action, sample_bug
 VALID_TASKS = ["easy", "medium", "hard"]
 TASKS_META = [
-    {"id": "easy", "name": "Priority Assignment", "grader": "server.task:priority_match",
      "difficulty": "easy", "reward_range": [0.0, 1.0],
-     "description": "Assign a single P0-P3 priority to a bug report"},
-    {"id": "medium", "name": "Priority Labels and Team", "grader": "server.task:priority_label_team",
      "difficulty": "medium", "reward_range": [0.0, 1.0],
-     "description": "Assign priority, labels, and team routing"},
-    {"id": "hard", "name": "Full Triage", "grader": "server.task:full_triage",
      "difficulty": "hard", "reward_range": [0.0, 1.0],
-     "description": "Full triage with security escalation penalty"},
 ]
 class BugTriageEnvironment(Environment):
     SUPPORTS_CONCURRENT_SESSIONS = True
@@ -29,13 +43,25 @@ class BugTriageEnvironment(Environment):
         super().__init__()
         self._current_task_key: str = "easy"
         self._episode_done: bool = False
-        self._current_bug: BugReport = sample_bug("easy")
         self._state = TriageState(
             episode_id=str(uuid.uuid4()),
             current_task="easy",
             step_count=0,
-            total_score=0.05,
             tasks_completed=[],
         )
     def get_metadata(self):
@@ -43,72 +69,203 @@ class BugTriageEnvironment(Environment):
             from openenv.core.env_server.types import EnvironmentMetadata
             return EnvironmentMetadata(
                 name="bug-triage-env",
-                description="Bug triage RL environment with 3 tasks of increasing difficulty",
-                version="1.0.0",
                 author="Siteshcodes",
                 tasks=TASKS_META,
             )
         except Exception:
             return {
                 "name": "bug-triage-env",
-                "description": "Bug triage RL environment with 3 tasks of increasing difficulty",
-                "version": "1.0.0",
                 "author": "Siteshcodes",
                 "tasks": TASKS_META,
             }
-    def reset(self, task_id: str = "easy", seed: int = None, episode_id: str = None, **kwargs) -> TriageObservation:
-        """Start a fresh episode for the specified task."""
         if task_id not in VALID_TASKS:
             task_id = "easy"
         self._current_task_key = task_id
         self._episode_done = False
-        self._current_bug = sample_bug(task_id)
         self._state = TriageState(
             episode_id=episode_id or str(uuid.uuid4()),
             current_task=task_id,
             step_count=0,
-            total_score=0.05,
             tasks_completed=[],
         )
-        return TriageObservation(
-            bug_report=self._current_bug,
-            task_id=task_id,
-            score=0.05,
-            feedback=f"Episode started for task: {task_id}. Triage this bug report.",
-            done=False,
-            reward=0.05,
         )
     def step(self, action: TriageAction) -> TriageObservation:
-        """Process the agent's triage action — one step, then done."""
         if self._episode_done:
-            return TriageObservation(
-                bug_report=self._current_bug,
-                task_id=self._current_task_key,
-                score=0.05,
                 feedback="Episode already complete. Call reset() to start a new episode.",
-                done=True,
-                reward=0.05,
             )
-        self._state.step_count += 1
-        task_key = self._current_task_key
-        score, feedback = grade_action(task_key, self._current_bug, action)
         self._state.total_score += score
-        self._state.tasks_completed.append(task_key)
         self._episode_done = True
-        return TriageObservation(
-            bug_report=self._current_bug,
-            task_id=task_key,
-            score=round(score, 3),
-            feedback=feedback,
-            done=True,
-            reward=round(score, 3),
         )
     @property
@@ -116,4 +273,68 @@ class BugTriageEnvironment(Environment):
         return self._state
     def get_state(self) -> TriageState:
-        return self._state

 sys.path.insert(0, "/app")
 sys.path.insert(0, "/app/server")
 import uuid
+import time
+from typing import Dict, Optional, Tuple
 from openenv.core.env_server.interfaces import Environment
 from model import TriageAction, TriageObservation, TriageState, BugReport
 from task import grade_action, sample_bug
 VALID_TASKS = ["easy", "medium", "hard"]
+MAX_STEPS_PER_TASK = {"easy": 4, "medium": 5, "hard": 6}
 TASKS_META = [
+    {"id": "easy", "name": "Priority Assignment",
+     "grader": "server.task:priority_match",
      "difficulty": "easy", "reward_range": [0.0, 1.0],
+     "description": "Investigate a bug report and assign a P0-P3 priority. "
+                    "Use investigation actions to gather info before submitting."},
+    {"id": "medium", "name": "Priority Labels and Team",
+     "grader": "server.task:priority_label_team",
      "difficulty": "medium", "reward_range": [0.0, 1.0],
+     "description": "Investigate and assign priority, labels, and team routing. "
+                    "More investigation steps available."},
+    {"id": "hard", "name": "Full Triage",
+     "grader": "server.task:full_triage",
      "difficulty": "hard", "reward_range": [0.0, 1.0],
+     "description": "Full triage with priority, labels, team, milestone, "
+                    "and security escalation penalty. Investigation is critical."},
 ]
+INVESTIGATION_ACTIONS = {"read_body", "read_comments", "check_logs", "check_similar"}
 class BugTriageEnvironment(Environment):
+    """Multi-step bug triage environment with progressive information reveal."""
     SUPPORTS_CONCURRENT_SESSIONS = True
         super().__init__()
         self._current_task_key: str = "easy"
         self._episode_done: bool = False
+        self._current_bug: Optional[BugReport] = None
+        self._current_answer: Optional[dict] = None
+        self._step_count: int = 0
+        self._max_steps: int = 4
+        self._actions_taken: list = []
+        # Progressive visibility
+        self._body_visible: bool = False
+        self._comments_visible: bool = False
+        self._logs_visible: bool = False
+        self._similar_visible: bool = False
         self._state = TriageState(
             episode_id=str(uuid.uuid4()),
             current_task="easy",
             step_count=0,
+            total_score=0.0,
             tasks_completed=[],
+            actions_taken=[],
         )
     def get_metadata(self):
             from openenv.core.env_server.types import EnvironmentMetadata
             return EnvironmentMetadata(
                 name="bug-triage-env",
+                description="Multi-step bug triage RL environment with progressive "
+                            "information reveal and 3 difficulty levels",
+                version="2.0.0",
                 author="Siteshcodes",
                 tasks=TASKS_META,
             )
         except Exception:
             return {
                 "name": "bug-triage-env",
+                "description": "Multi-step bug triage RL environment",
+                "version": "2.0.0",
                 "author": "Siteshcodes",
                 "tasks": TASKS_META,
             }
+    def _build_observation(self, score=0.0, feedback="", done=False,
+                           reward=0.0) -> TriageObservation:
+        """Build observation with current visibility state."""
+        bug = self._current_bug
+        # Create a visibility-filtered view of the bug
+        visible_bug = BugReport(
+            id=bug.id,
+            title=bug.title,
+            body=bug.body if self._body_visible else bug.body[:120] + "..." if len(bug.body) > 120 else bug.body,
+            author=bug.author,
+            labels_hint=bug.labels_hint,
+            comments=bug.comments if self._comments_visible else [],
+            severity_signals=bug.severity_signals if self._logs_visible else [],
+            related_bugs=bug.related_bugs if self._similar_visible else [],
+            stack_trace=bug.stack_trace if self._logs_visible else "",
+            affected_component=bug.affected_component if self._logs_visible else "",
+        )
+        return TriageObservation(
+            bug_report=visible_bug,
+            task_id=self._current_task_key,
+            score=round(score, 3),
+            feedback=feedback,
+            done=done,
+            reward=round(reward, 3),
+            body_visible=self._body_visible,
+            comments_visible=self._comments_visible,
+            logs_visible=self._logs_visible,
+            similar_visible=self._similar_visible,
+            steps_taken=self._step_count,
+            max_steps=self._max_steps,
+        )
+    def reset(self, task_id: str = "easy", seed: int = None,
+              episode_id: str = None, **kwargs) -> TriageObservation:
+        """Start a fresh episode for the given task."""
         if task_id not in VALID_TASKS:
             task_id = "easy"
         self._current_task_key = task_id
         self._episode_done = False
+        self._step_count = 0
+        self._max_steps = MAX_STEPS_PER_TASK.get(task_id, 4)
+        self._actions_taken = []
+        # Reset visibility — title + truncated body are always visible
+        self._body_visible = False
+        self._comments_visible = False
+        self._logs_visible = False
+        self._similar_visible = False
+        # Sample a bug and its answer
+        self._current_bug, self._current_answer = sample_bug(task_id, seed=seed)
         self._state = TriageState(
             episode_id=episode_id or str(uuid.uuid4()),
             current_task=task_id,
             step_count=0,
+            total_score=0.0,
             tasks_completed=[],
+            actions_taken=[],
         )
+        feedback = (
+            f"Episode started for task: {task_id}. "
+            f"You see the bug title and a preview. "
+            f"Use investigation actions (read_body, read_comments, check_logs, check_similar) "
+            f"to reveal more information, then submit your triage. "
+            f"You have {self._max_steps} steps max."
+        )
+        return self._build_observation(
+            score=0.0, feedback=feedback, done=False, reward=0.0,
         )
     def step(self, action: TriageAction) -> TriageObservation:
+        """Process agent's action — either investigate or submit final triage."""
         if self._episode_done:
+            return self._build_observation(
+                score=0.0,
                 feedback="Episode already complete. Call reset() to start a new episode.",
+                done=True, reward=0.0,
+            )
+        self._step_count += 1
+        self._state.step_count = self._step_count
+        action_type = getattr(action, "action_type", "submit")
+        self._actions_taken.append(action_type)
+        self._state.actions_taken = list(self._actions_taken)
+        # Check if max steps reached — force submission
+        if self._step_count >= self._max_steps and action_type != "submit":
+            action_type = "submit"
+        # --- Investigation actions ---
+        if action_type in INVESTIGATION_ACTIONS:
+            feedback = self._handle_investigation(action_type)
+            return self._build_observation(
+                score=0.0, feedback=feedback, done=False, reward=0.0,
+            )
+        # --- Submit action ---
+        return self._handle_submission(action)
+    def _handle_investigation(self, action_type: str) -> str:
+        """Reveal information based on the investigation action."""
+        if action_type == "read_body":
+            if self._body_visible:
+                return "Full body already revealed. Choose another action or submit."
+            self._body_visible = True
+            return (
+                f"Full bug description revealed. "
+                f"Steps used: {self._step_count}/{self._max_steps}."
             )
+        elif action_type == "read_comments":
+            if self._comments_visible:
+                return "Comments already revealed. Choose another action or submit."
+            self._comments_visible = True
+            n = len(self._current_bug.comments)
+            return (
+                f"Revealed {n} comment(s). "
+                f"Steps used: {self._step_count}/{self._max_steps}."
+            )
+        elif action_type == "check_logs":
+            if self._logs_visible:
+                return "Logs already revealed. Choose another action or submit."
+            self._logs_visible = True
+            has_trace = bool(self._current_bug.stack_trace)
+            return (
+                f"System logs revealed. {'Stack trace available.' if has_trace else 'No stack trace.'} "
+                f"Steps used: {self._step_count}/{self._max_steps}."
+            )
+        elif action_type == "check_similar":
+            if self._similar_visible:
+                return "Similar bugs already revealed. Choose another action or submit."
+            self._similar_visible = True
+            n = len(self._current_bug.related_bugs)
+            return (
+                f"Found {n} related bug(s). "
+                f"Steps used: {self._step_count}/{self._max_steps}."
+            )
+        return f"Unknown investigation action: {action_type}"
+    def _handle_submission(self, action: TriageAction) -> TriageObservation:
+        """Grade the agent's final triage submission."""
+        score, feedback = grade_action(
+            self._current_task_key, self._current_bug, action,
+            answer=self._current_answer,
+        )
+        # Apply time efficiency bonus/penalty
+        # Fewer steps = better (if the answer is good)
+        investigation_steps = self._step_count - 1  # subtract the submit step
+        if investigation_steps == 0 and score >= 0.7:
+            # Got it right without investigating — impressive!
+            efficiency_bonus = 0.05
+            feedback += " | ⚡ Efficiency bonus: +0.05 (correct with minimal investigation)"
+        elif investigation_steps >= 3 and score >= 0.7:
+            # Took many steps but got it right — slight penalty for slowness
+            efficiency_penalty = 0.02 * (investigation_steps - 2)
+            score = score - efficiency_penalty
+            feedback += f" | ⏱ Time penalty: -{efficiency_penalty:.2f} ({investigation_steps} investigation steps)"
+        elif investigation_steps == 0 and score < 0.5:
+            # Rushed and got it wrong — penalty
+            feedback += " | ⚠ Consider investigating before submitting next time"
+        if investigation_steps == 0 and score >= 0.7:
+            score += 0.05
+        score = max(0.01, min(0.99, score))
         self._state.total_score += score
+        self._state.tasks_completed.append(self._current_task_key)
         self._episode_done = True
+        return self._build_observation(
+            score=score, feedback=feedback, done=True, reward=score,
         )
     @property
         return self._state
     def get_state(self) -> TriageState:
+        return self._state
+# ---------------------------------------------------------------------------
+#  SESSION MANAGER — handles concurrent sessions safely
+# ---------------------------------------------------------------------------
+class SessionManager:
+    """Thread-safe session management for multiple concurrent agents."""
+    def __init__(self, max_sessions: int = 1000, ttl_seconds: int = 600):
+        self._sessions: Dict[str, BugTriageEnvironment] = {}
+        self._timestamps: Dict[str, float] = {}
+        self._max_sessions = max_sessions
+        self._ttl = ttl_seconds
+    def create_session(self) -> Tuple[str, BugTriageEnvironment]:
+        """Create a new session and return (session_id, env)."""
+        self._cleanup_expired()
+        session_id = str(uuid.uuid4())
+        env = BugTriageEnvironment()
+        self._sessions[session_id] = env
+        self._timestamps[session_id] = time.time()
+        # Enforce max after adding
+        while len(self._sessions) > self._max_sessions:
+            oldest = min(self._timestamps, key=self._timestamps.get)
+            if oldest == session_id:
+                break
+            self._sessions.pop(oldest, None)
+            self._timestamps.pop(oldest, None)
+        return session_id, env
+    def get_session(self, session_id: str) -> Optional[BugTriageEnvironment]:
+        """Get an existing session's environment, or None if expired/missing."""
+        if session_id not in self._sessions:
+            return None
+        # Refresh TTL on access
+        self._timestamps[session_id] = time.time()
+        return self._sessions[session_id]
+    def remove_session(self, session_id: str) -> None:
+        """Remove a session after episode completes."""
+        self._sessions.pop(session_id, None)
+        self._timestamps.pop(session_id, None)
+    def _cleanup_expired(self) -> None:
+        """Remove sessions that exceeded TTL."""
+        now = time.time()
+        expired = [
+            sid for sid, ts in self._timestamps.items()
+            if now - ts > self._ttl
+        ]
+        for sid in expired:
+            self._sessions.pop(sid, None)
+            self._timestamps.pop(sid, None)
+        # Also enforce max sessions (remove oldest)
+        while len(self._sessions) > self._max_sessions:
+            oldest = min(self._timestamps, key=self._timestamps.get)
+            self._sessions.pop(oldest, None)
+            self._timestamps.pop(oldest, None)
+    @property
+    def active_count(self) -> int:
+        return len(self._sessions)
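
The `SessionManager` above takes no lock in the lines shown, so compound operations such as `min(self._timestamps, ...)` can race with inserts from another thread. A minimal sketch of one way to guard the same operations with a `threading.Lock` follows. `LockedSessionStore` and its `factory` parameter are hypothetical names used for illustration and are not part of this commit.

```python
import threading
import time
import uuid
from typing import Any, Callable, Dict, Optional, Tuple


class LockedSessionStore:
    """Hypothetical lock-guarded variant of the session manager (illustration only)."""

    def __init__(self, factory: Callable[[], Any], max_sessions: int = 1000,
                 ttl_seconds: float = 600.0):
        self._factory = factory              # builds one environment per session
        self._sessions: Dict[str, Any] = {}
        self._timestamps: Dict[str, float] = {}
        self._max_sessions = max_sessions
        self._ttl = ttl_seconds
        self._lock = threading.Lock()

    def create_session(self) -> Tuple[str, Any]:
        env = self._factory()                # construct outside the lock
        session_id = str(uuid.uuid4())
        with self._lock:
            self._evict_locked(reserve=1)    # make room before inserting
            self._sessions[session_id] = env
            self._timestamps[session_id] = time.time()
        return session_id, env

    def get_session(self, session_id: str) -> Optional[Any]:
        with self._lock:
            env = self._sessions.get(session_id)
            if env is not None:
                self._timestamps[session_id] = time.time()  # refresh TTL on access
            return env

    def remove_session(self, session_id: str) -> None:
        with self._lock:
            self._sessions.pop(session_id, None)
            self._timestamps.pop(session_id, None)

    def _evict_locked(self, reserve: int = 0) -> None:
        """Drop expired sessions, then the oldest ones. Caller must hold the lock."""
        now = time.time()
        for sid in [s for s, ts in self._timestamps.items() if now - ts > self._ttl]:
            self._sessions.pop(sid, None)
            self._timestamps.pop(sid, None)
        while self._sessions and len(self._sessions) + reserve > self._max_sessions:
            oldest = min(self._timestamps, key=self._timestamps.get)
            self._sessions.pop(oldest, None)
            self._timestamps.pop(oldest, None)

    @property
    def active_count(self) -> int:
        with self._lock:
            return len(self._sessions)
```

Evicting before the insert (with `reserve=1`) means the new session can never be chosen as the eviction victim, which removes the need for the `oldest == session_id` check.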

server/requirements.txt CHANGED

@@ -4,4 +4,5 @@ uvicorn[standard]
 pydantic
 websockets
 openai
-httpx

+httpx
+requests

server/task.py CHANGED

@@ -1,16 +1,426 @@
 # server/task.py
 import sys
 import random
 sys.path.insert(0, "/app")
-from typing import Tuple, List
 from model import BugReport, TriageAction
-# BUG REPORT DATASET
-TASKS = {
     "easy": {
         "bugs": [
             BugReport(
@@ -22,6 +432,9 @@ TASKS = {
                 author="user123",
                 labels_hint=[],
                 comments=["Confirmed on iOS and Android.", "Happens every time."],
             ),
             BugReport(
                 id="easy-002",
@@ -31,6 +444,9 @@ TASKS = {
                 author="docs_fan",
                 labels_hint=["documentation"],
                 comments=[],
             ),
             BugReport(
                 id="easy-003",
@@ -40,6 +456,9 @@ TASKS = {
                 author="power_user",
                 labels_hint=["performance"],
                 comments=["Noticed after the last deploy.", "CPU spikes to 100%."],
             ),
             BugReport(
                 id="easy-004",
@@ -49,7 +468,11 @@ TASKS = {
                      "Affects all users attempting password reset.",
                 author="support_team",
                 labels_hint=["bug"],
-                comments=["Reported by 12 users this week.", "Started after email service migration."],
             ),
             BugReport(
                 id="easy-005",
@@ -59,9 +482,11 @@ TASKS = {
                 author="intern_dev",
                 labels_hint=["documentation"],
                 comments=[],
             ),
         ],
-        # Ground truth for grader
         "answers": {
             "easy-001": {"priority": "P0"},
             "easy-002": {"priority": "P3"},
@@ -82,6 +507,9 @@ TASKS = {
                 author="store_owner",
                 labels_hint=["bug"],
                 comments=["Revenue impact confirmed.", "Happening since Tuesday."],
             ),
             BugReport(
                 id="med-002",
@@ -92,6 +520,9 @@ TASKS = {
                 author="moderator_jane",
                 labels_hint=[],
                 comments=["GDPR concern — deleted content still visible."],
             ),
             BugReport(
                 id="med-003",
@@ -101,6 +532,9 @@ TASKS = {
                 author="safari_user",
                 labels_hint=["bug", "ux"],
                 comments=["Only on Safari, not Chrome/Firefox."],
             ),
             BugReport(
                 id="med-004",
@@ -110,7 +544,11 @@ TASKS = {
                      "Affects users with international data.",
                 author="data_analyst",
                 labels_hint=["bug"],
-                comments=["Encoding issue — UTF-8 not respected.", "Workaround: manual copy-paste."],
             ),
             BugReport(
                 id="med-005",
@@ -120,7 +558,11 @@ TASKS = {
                      "The unblock logic has a bug — it never clears the blocked flag.",
                 author="api_user",
                 labels_hint=["bug"],
-                comments=["Affects CI/CD pipelines hitting the API.", "Retry-After header is wrong."],
             ),
         ],
         "answers": {
@@ -144,6 +586,10 @@ TASKS = {
                 author="security_researcher",
                 labels_hint=[],
                 comments=["Critical. Affects production.", "Do not discuss publicly."],
             ),
             BugReport(
                 id="hard-002",
@@ -155,6 +601,9 @@ TASKS = {
                 author="devops_alice",
                 labels_hint=["performance"],
                 comments=["Verified with heap profiler.", "Started in v1.9."],
             ),
             BugReport(
                 id="hard-003",
@@ -167,7 +616,12 @@ TASKS = {
                      "Risk is low-probability but affects data integrity.",
                 author="qa_bot",
                 labels_hint=["bug"],
-                comments=["Reproduced with locust at 50 concurrent users.", "Sequential mode avoids it."],
             ),
             BugReport(
                 id="hard-004",
@@ -178,7 +632,12 @@ TASKS = {
                      "This is a session management security vulnerability.",
                 author="pentest_team",
                 labels_hint=["security"],
-                comments=["Verified on staging.", "OWASP A07 — Identification and Authentication Failures."],
             ),
             BugReport(
                 id="hard-005",
@@ -189,126 +648,316 @@ TASKS = {
                      "Triggered in production twice this week. Requires process kill to recover.",
                 author="oncall_eng",
                 labels_hint=["bug", "performance"],
-                comments=["PagerDuty alert fired twice.", "Needs exponential backoff + max retry cap."],
             ),
         ],
         "answers": {
             "hard-001": {
-                "priority": "P0",
-                "labels": ["bug", "security"],
-                "assigned_team": "security",
-                "milestone": "hotfix",
             },
             "hard-002": {
-                "priority": "P1",
-                "labels": ["bug", "performance"],
-                "assigned_team": "backend",
-                "milestone": "v2.1",
             },
             "hard-003": {
-                "priority": "P1",
-                "labels": ["bug", "data-integrity"],
-                "assigned_team": "backend",
-                "milestone": "v2.1",
             },
             "hard-004": {
-                "priority": "P0",
-                "labels": ["bug", "security"],
-                "assigned_team": "security",
-                "milestone": "hotfix",
             },
             "hard-005": {
-                "priority": "P0",
-                "labels": ["bug", "performance"],
-                "assigned_team": "backend",
-                "milestone": "hotfix",
             },
         },
     },
 }
-# TASK SAMPLER  — picks a random bug each reset
-def sample_bug(task_key: str) -> BugReport:
-    """Return a random bug from the given task's pool."""
-    return random.choice(TASKS[task_key]["bugs"])
-# GRADERS
 PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}
 def _priority_score(predicted: str, correct: str) -> float:
     if predicted == correct:
         return 0.95
-    diff = abs(PRIORITY_ORDER.get(predicted, 99) - PRIORITY_ORDER.get(correct, 99))
-    return 0.5 if diff == 1 else 0.05
 def _label_score(predicted: List[str], correct: List[str]) -> float:
-    pred_set = set(l.lower() for l in predicted)
-    corr_set = set(l.lower() for l in correct)
-    if not corr_set:
         return 0.95
-    intersection = pred_set & corr_set
-    union = pred_set | corr_set
-    raw = len(intersection) / len(union)
     return max(0.05, min(0.95, raw))
-def grade_action(task_key, bug, action):
-    answer = TASKS[task_key]["answers"][bug.id]
     feedback_parts = []
     if task_key == "easy":
         score = _priority_score(action.priority, answer["priority"])
         symbol = "✓" if score >= 0.9 else "~" if score >= 0.4 else "✗"
-        feedback_parts.append(f"Priority: {symbol} (got {action.priority}, expected {answer['priority']})")
         score = max(0.01, min(0.99, score))
         return round(score, 3), " | ".join(feedback_parts)
     elif task_key == "medium":
         p_score = _priority_score(action.priority, answer["priority"])
-        l_score = _label_score(action.labels, answer["labels"])
         expected_team = answer.get("assigned_team", "")
         t_score = 0.95 if expected_team and action.assigned_team.lower() == expected_team.lower() else 0.05
-        score = 0.45 * p_score + 0.40 * l_score + 0.15 * t_score
-        feedback_parts.append(f"Priority: {p_score:.2f} (got {action.priority}, expected {answer['priority']})")
-        feedback_parts.append(f"Labels: {l_score:.2f}")
-        feedback_parts.append(f"Team: {t_score:.2f} (got {action.assigned_team}, expected {expected_team})")
         score = max(0.01, min(0.99, score))
         return round(score, 3), " | ".join(feedback_parts)
     else:  # hard
         p_score = _priority_score(action.priority, answer["priority"])
-        l_score = _label_score(action.labels, answer["labels"])
         t_score = 0.95 if action.assigned_team.lower() == answer["assigned_team"].lower() else 0.05
         m_score = 0.95 if action.milestone.lower() == answer["milestone"].lower() else 0.05
-        score = 0.35 * p_score + 0.30 * l_score + 0.20 * t_score + 0.15 * m_score
-        feedback_parts.append(f"Priority: {p_score:.2f} (got {action.priority}, expected {answer['priority']})")
-        feedback_parts.append(f"Labels: {l_score:.2f}")
-        feedback_parts.append(f"Team: {t_score:.2f} (got {action.assigned_team}, expected {answer['assigned_team']})")
-        feedback_parts.append(f"Milestone: {m_score:.2f} (got {action.milestone}, expected {answer['milestone']})")
         if answer.get("assigned_team") == "security" and action.assigned_team.lower() != "security":
             score = max(0.01, score - 0.15)
             feedback_parts.append("⚠ Security escalation missed (-0.15)")
         score = max(0.01, min(0.99, score))
         return round(score, 3), " | ".join(feedback_parts)
 def priority_match(*args, **kwargs):
     if len(args) < 2:
         return 0.5
-    bug = args[0]
-    action = args[1]
     score, _ = grade_action("easy", bug, action)
     return float(score)
@@ -316,10 +965,7 @@ def priority_match(*args, **kwargs):
 def priority_label_team(*args, **kwargs):
     if len(args) < 2:
         return 0.5
-    bug = args[0]
-    action = args[1]
     score, _ = grade_action("medium", bug, action)
     return float(score)
@@ -327,17 +973,18 @@ def priority_label_team(*args, **kwargs):
 def full_triage(*args, **kwargs):
     if len(args) < 2:
         return 0.5
-    bug = args[0]
-    action = args[1]
     score, _ = grade_action("hard", bug, action)
     return float(score)
 __all__ = [
     "priority_match",
     "priority_label_team",
     "full_triage",
     "sample_bug",
     "grade_action",
-    "TASKS",
 ]

 # server/task.py
 import sys
 import random
+import hashlib
 sys.path.insert(0, "/app")
+from typing import Tuple, List, Dict, Any
 from model import BugReport, TriageAction
+# ---------------------------------------------------------------------------
+#  LABEL SYNONYM MAP — allows semantic matching
+# ---------------------------------------------------------------------------
+LABEL_SYNONYMS: Dict[str, set] = {
+    "bug":              {"defect", "issue", "error", "fault", "broken"},
+    "security":         {"vulnerability", "cve", "exploit", "auth", "injection"},
+    "performance":      {"perf", "slow", "latency", "optimization", "speed", "memory"},
+    "ux":               {"ui", "frontend", "user-experience", "design", "usability"},
+    "data-integrity":   {"data-loss", "corruption", "data", "consistency"},
+    "payments":         {"billing", "payment", "stripe", "checkout", "revenue"},
+    "documentation":    {"docs", "typo", "readme", "wiki"},
+    "infrastructure":   {"infra", "devops", "deploy", "ci", "cd", "docker"},
+    "api":              {"endpoint", "rest", "graphql", "http", "request"},
+    "database":         {"db", "sql", "query", "migration", "schema"},
+}
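
How the grader consumes `LABEL_SYNONYMS` is not visible in this part of the diff. One plausible reading, sketched below, is that each label is mapped to its canonical key before the set comparison, reusing the Jaccard overlap from the removed `_label_score`. `canonical_label` and `semantic_label_overlap` are hypothetical helpers, and the two-entry `SYNONYMS` map is a trimmed copy for illustration.

```python
from typing import Dict, Iterable, Set

# Trimmed copy of the synonym map, for illustration only.
SYNONYMS: Dict[str, Set[str]] = {
    "bug": {"defect", "issue", "error", "fault", "broken"},
    "performance": {"perf", "slow", "latency", "optimization", "speed", "memory"},
}


def canonical_label(label: str, synonyms: Dict[str, Set[str]] = SYNONYMS) -> str:
    """Map a label to its canonical key; unknown labels pass through lowercased."""
    norm = label.strip().lower()
    if norm in synonyms:
        return norm
    for canonical, alternatives in synonyms.items():
        if norm in alternatives:
            return canonical
    return norm


def semantic_label_overlap(predicted: Iterable[str], correct: Iterable[str]) -> float:
    """Jaccard overlap computed on canonical labels (assumed scoring rule)."""
    pred = {canonical_label(label) for label in predicted}
    corr = {canonical_label(label) for label in correct}
    if not pred and not corr:
        return 1.0
    return len(pred & corr) / len(pred | corr)
```

With this mapping, a prediction of `["defect", "perf"]` is treated the same as `["bug", "performance"]`, while a label that appears in no synonym set is only matched literally.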
+# ---------------------------------------------------------------------------
+#  BUG TEMPLATE SYSTEM — generates hundreds of unique bugs
+# ---------------------------------------------------------------------------
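
The templates that follow pair `str.format` placeholders with lists of candidate values and per-severity keyword lists. The generator that consumes them is outside this excerpt, so the following is only a sketch of how such a template might be instantiated and labelled. `render_bug`, `pick_severity`, and the small `DEMO_TEMPLATE` are hypothetical, and the rule that the first severity with a matching keyword wins is an assumption.

```python
import random
from typing import Any, Dict, Tuple

# Tiny stand-in with the same shape as one template entry (illustration only).
DEMO_TEMPLATE: Dict[str, Any] = {
    "titles": ["{service} crashes on {trigger}"],
    "bodies": ["When a user {trigger}, the {service} crashes. "
               "Affects {impact}. {workaround}"],
    "vars": {
        "service": ["auth service", "search API"],
        "trigger": ["logs in with SSO", "exports data to CSV"],
        "impact": ["100% of users on this flow", "approximately 30% of sessions"],
        "workaround": ["No known workaround. Users are blocked.",
                       "Workaround: users can retry after clearing browser cache."],
    },
    "answer_template": {
        "severe": {"priority": "P0"},
        "moderate": {"priority": "P1"},
    },
    "severity_keywords": {
        "severe": ["100%", "blocked"],
        "moderate": ["retry", "30%"],
    },
}


def pick_severity(text: str, template: Dict[str, Any]) -> str:
    """Return the first severity whose keyword occurs in the text (assumed rule)."""
    lowered = text.lower()
    for severity, keywords in template["severity_keywords"].items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return severity
    # No keyword matched: fall back to the first answer key (e.g. "default").
    return next(iter(template["answer_template"]))


def render_bug(template: Dict[str, Any], seed: int) -> Tuple[str, str, Dict[str, str]]:
    """Fill one title and one body from the template, reproducibly for a given seed."""
    rng = random.Random(seed)
    values = {name: rng.choice(options) for name, options in template["vars"].items()}
    title = rng.choice(template["titles"]).format(**values)  # unused keys are ignored
    body = rng.choice(template["bodies"]).format(**values)
    severity = pick_severity(title + " " + body, template)
    return title, body, dict(template["answer_template"][severity])
```

Seeding a private `random.Random` keeps generation reproducible per episode without touching the global random state.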
+_BUG_TEMPLATES = {
+    "crash": {
+        "titles": [
+            "{service} crashes on {trigger}",
+            "{service} throws {error_type} when {trigger}",
+            "Fatal error in {service} during {trigger}",
+            "Unhandled exception in {service}: {error_type}",
+            "{service} segfaults under {condition}",
+        ],
+        "bodies": [
+            "When a user {trigger}, the {service} crashes immediately. "
+            "Error: {error_type}. Stack trace points to {component}. "
+            "Affects {impact}. {workaround}",
+            "The {service} is failing with {error_type} every time a user {trigger}. "
+            "No error message is shown to the user — the process just dies. "
+            "Impact: {impact}. {workaround}",
+        ],
+        "vars": {
+            "service": ["auth service", "payment gateway", "search API", "notification worker",
+                        "session manager", "user profile service", "file upload handler",
+                        "webhook processor", "background job runner", "cache layer"],
+            "trigger": ["submits a form with special characters", "uploads a file larger than 10MB",
+                        "logs in with SSO", "resets their password", "exports data to CSV",
+                        "switches between tabs rapidly", "uses the bulk import feature",
+                        "accesses the admin panel", "triggers a webhook", "runs a scheduled job"],
+            "error_type": ["NullPointerException", "SegmentationFault", "OutOfMemoryError",
+                           "ConnectionTimeoutException", "StackOverflowError",
+                           "IndexOutOfBoundsException", "TypeError", "KeyError"],
+            "component": ["UserController.java:142", "PaymentService.py:89",
+                          "AuthMiddleware.ts:56", "SearchIndex.go:203",
+                          "NotificationQueue.rb:77", "FileHandler.py:234"],
+            "impact": ["100% of users on this flow", "all mobile users", "EU region users only",
+                       "users with accounts older than 1 year", "approximately 30% of sessions",
+                       "every request during peak hours"],
+            "workaround": ["No workaround exists — the feature is completely broken.",
+                           "Workaround: users can retry after clearing browser cache.",
+                           "Temporary fix: restart the service every 2 hours.",
+                           "No known workaround. Users are blocked."],
+            "condition": ["high concurrent load", "memory pressure above 80%",
+                          "when connection pool is exhausted", "after running for 6+ hours"],
+        },
+        "answer_template": {
+            "severe": {"priority": "P0", "labels": ["bug"], "assigned_team": "backend", "milestone": "hotfix"},
+            "moderate": {"priority": "P1", "labels": ["bug"], "assigned_team": "backend", "milestone": "v2.1"},
+        },
+        "severity_keywords": {
+            "severe": ["100%", "all mobile", "No workaround", "completely broken", "blocked",
+                       "SegmentationFault", "OutOfMemoryError"],
+            "moderate": ["retry", "30%", "Temporary fix", "restart"],
+        },
+    },
+    "security": {
+        "titles": [
+            "SQL injection vulnerability in {endpoint}",
+            "XSS attack possible via {input_field}",
+            "Authentication bypass in {service}",
+            "Sensitive data exposed in {location}",
+            "{credential_type} not invalidated after {event}",
+            "SSRF vulnerability in {endpoint}",
+        ],
+        "bodies": [
+            "The {endpoint} endpoint does not sanitize {input_field} inputs. "
+            "Crafted queries can {exploit_result}. PoC attached and verified on {env}. "
+            "Treat as confidential — do not discuss publicly until patched. {additional_context}",
+            "When a user {event}, existing {credential_type} remain valid for {duration}. "
+            "An attacker who {attack_vector} can continue to access the account. "
+            "This is a {vuln_category} vulnerability. {additional_context}",
+        ],
+        "vars": {
+            "endpoint": ["/api/search", "/api/users", "/api/export", "/admin/query",
+                         "/api/upload", "/graphql", "/api/webhook"],
+            "input_field": ["search query", "username field", "file upload name",
+                            "comment body", "profile bio", "webhook URL"],
+            "service": ["login flow", "OAuth callback", "API gateway", "admin panel",
+                        "password reset", "2FA verification"],
+            "location": ["API error responses", "debug logs shipped to client",
+                         "public S3 bucket", "unencrypted cookies", "localStorage"],
+            "credential_type": ["JWT tokens", "session cookies", "API keys", "OAuth tokens"],
+            "event": ["changes their password", "revokes API access",
+                      "is suspended by admin", "enables 2FA"],
+            "exploit_result": ["dump the entire user table including password hashes",
+                               "execute arbitrary JavaScript in other users' browsers",
+                               "access any user's account without credentials",
+                               "read internal service endpoints via SSRF"],
+            "env": ["production", "staging", "production replica"],
+            "duration": ["up to 24 hours", "indefinitely", "until manual cache clear",
+                         "for the full token TTL (7 days)"],
+            "attack_vector": ["previously stole a token", "intercepted a session cookie",
+                              "obtained a leaked API key"],
+            "vuln_category": ["session management", "access control",
+                              "injection", "broken authentication"],
+            "additional_context": [
+                "OWASP A03 — Injection.",
+                "OWASP A07 — Identification and Authentication Failures.",
+                "CVSS score estimated at 9.1 (Critical).",
+                "Compliance impact: potential GDPR violation if user PII is exfiltrated.",
+                "Bounty hunter reported this 48 hours ago — disclosure deadline approaching.",
+            ],
+        },
+        "answer_template": {
+            "default": {"priority": "P0", "labels": ["bug", "security"],
+                        "assigned_team": "security", "milestone": "hotfix"},
+        },
+        "severity_keywords": {"default": []},
+    },
+    "performance": {
+        "titles": [
+            "{page} loads slowly for {dataset_size}",
+            "Memory leak in {service} causes OOM after {duration}",
+            "API response time degrades under {load_condition}",
+            "{operation} takes {duration} for {dataset_size}",
+            "CPU spikes to 100% when {trigger}",
+        ],
+        "bodies": [
+            "When {condition}, the {page} takes {response_time} to load. "
+            "{diagnostic_info}. {impact}. {workaround}",
+            "The {service} allocates memory during {operation} and never frees it. "
+            "Server runs out of memory every {duration}. {diagnostic_info}. "
+            "{workaround}",
+        ],
+        "vars": {
+            "page": ["dashboard", "analytics page", "user list", "search results",
+                     "audit log", "reports page", "admin overview"],
+            "service": ["background job processor", "cache warming service",
+                        "log aggregator", "image resizer", "ETL pipeline"],
+            "dataset_size": ["large datasets (10k+ rows)", "enterprise accounts",
+                             "tables with 100k+ entries", "files over 50MB"],
+            "duration": ["6 hours", "4 hours", "12 hours", "30+ seconds",
+                         "2+ minutes", "an entire day"],
+            "load_condition": ["concurrent load", "peak traffic", "batch processing",
+                               "more than 50 simultaneous users"],
+            "operation": ["bulk export", "report generation", "data migration",
+                          "full-text search", "image processing"],
+            "trigger": ["running bulk exports", "processing large uploads",
+                        "generating PDF reports", "reindexing search"],
+            "condition": ["a dataset has more than 10k rows",
+                          "multiple users trigger exports simultaneously",
+                          "the nightly ETL job runs alongside user traffic"],
+            "response_time": ["30+ seconds", "over a minute", "2-3 minutes",
+                              "timeout after 60 seconds"],
+            "diagnostic_info": ["CPU spikes to 100%", "Heap profiler confirms the leak",
+                                "Database EXPLAIN shows full table scan",
+                                "N+1 query pattern detected in APM",
+                                "Garbage collector running every 500ms"],
+            "impact": ["Affects power users with large accounts",
+                       "All users experience slowness during peak hours",
+                       "Requires manual restart to recover",
+                       "Operational overhead: scheduled restarts every 4 hours"],
+            "workaround": ["Workaround: export data and use offline tools.",
+                           "Workaround: scheduled restarts every 4 hours.",
+                           "No workaround — users just wait.",
+                           "Workaround: paginate results (but UX is degraded)."],
+        },
+        "answer_template": {
+            "severe": {"priority": "P1", "labels": ["bug", "performance"],
+                       "assigned_team": "backend", "milestone": "v2.1"},
+            "moderate": {"priority": "P2", "labels": ["bug", "performance"],
+                         "assigned_team": "backend", "milestone": "v2.1"},
+        },
+        "severity_keywords": {
+            "severe": ["OOM", "100%", "manual restart", "timeout", "No workaround",
+                       "all users", "never frees"],
+            "moderate": ["Workaround", "power users", "paginate"],
+        },
+    },
+    "ui_bug": {
+        "titles": [
+            "{ui_element} breaks layout on {browser}",
+            "{ui_element} not rendering correctly in {mode}",
+            "Responsive layout broken on {device}",
+            "{feature} toggle not persisting across {context}",
+            "Accessibility: {ui_element} missing {a11y_attr}",
+        ],
+        "bodies": [
+            "Switching to {mode} on {browser} causes {ui_element} to {visual_issue}. "
+            "{other_browsers}. {workaround}",
+            "On {device}, the {ui_element} is {visual_issue}. "
+            "Tested on {browser}. {impact}. {workaround}",
+        ],
+        "vars": {
+            "ui_element": ["navigation bar", "sidebar menu", "modal dialog",
+                           "dropdown selector", "data table", "footer",
+                           "toast notifications", "breadcrumb trail"],
+            "browser": ["Safari 16", "Firefox ESR", "Chrome on Android",
+                        "Edge on Windows", "iOS Safari", "Samsung Internet"],
+            "mode": ["dark mode", "high contrast mode", "RTL layout",
+                     "compact view", "print view"],
+            "device": ["iPhone SE", "tablets in portrait", "screens below 768px",
+                       "ultra-wide monitors", "4K displays"],
+            "feature": ["dark mode", "compact view", "language preference",
+                        "notification settings"],
+            "context": ["page reloads", "different tabs", "sessions",
+                        "browser restarts"],
+            "visual_issue": ["overlap the main content", "disappear entirely",
+                             "render with incorrect colors", "become unclickable",
+                             "overflow beyond the viewport"],
+            "other_browsers": ["Chrome and Firefox are unaffected.",
+                               "Only reproducible on this specific browser.",
+                               "Affects all WebKit-based browsers."],
+            "a11y_attr": ["ARIA labels", "keyboard focus indicators",
+                          "screen reader text", "proper heading hierarchy"],
+            "impact": ["Cosmetic issue, no functional impact.",
+                       "Users cannot access the affected feature.",
+                       "Usability is degraded but the feature works."],
+            "workaround": ["Workaround: use a different browser.",
+                           "Workaround: manually resize the window.",
+                           "No workaround for this browser.",
+                           "Workaround: disable the feature in settings."],
+        },
+        "answer_template": {
+            "severe": {"priority": "P2", "labels": ["bug", "ux"],
+                       "assigned_team": "frontend", "milestone": "v2.1"},
+            "moderate": {"priority": "P3", "labels": ["bug", "ux"],
+                         "assigned_team": "frontend", "milestone": "backlog"},
+        },
+        "severity_keywords": {
+            "severe": ["cannot access", "unclickable", "disappear", "No workaround"],
+            "moderate": ["Cosmetic", "different browser", "resize"],
+        },
+    },
+    "data_corruption": {
+        "titles": [
+            "Race condition in {feature}: {consequence}",
+            "Data inconsistency in {feature} under concurrent writes",
+            "{export_format} export produces corrupted output for {edge_case}",
+            "Stale data served from cache after {trigger}",
+            "Duplicate records created when {trigger}",
+        ],
+        "bodies": [
+            "Under concurrent load, {feature} can {consequence} due to a race condition "
+            "in {root_cause}. Frequency: {frequency}. {impact}. {workaround}",
+            "When {feature} data contains {edge_case}, the exported {export_format} file "
+            "is corrupted and cannot be {consumer}. {impact}. {workaround}",
+        ],
+        "vars": {
+            "feature": ["file upload", "order processing", "user registration",
+                        "inventory update", "comment system", "permission assignment"],
+            "consequence": ["files occasionally overwrite each other",
+                            "orders are duplicated or lost",
+                            "users get assigned wrong permissions",
+                            "inventory counts become negative"],
+            "root_cause": ["temp file naming logic", "lack of database locking",
+                           "non-atomic read-modify-write cycle",
+                           "missing unique constraint"],
+            "frequency": ["approximately 1 in 10,000 operations",
+                          "consistently under 50+ concurrent users",
+                          "intermittently — hard to reproduce",
+                          "every time the batch job runs"],
+            "edge_case": ["non-ASCII characters (e.g., café, naïve)",
+                          "values containing commas or quotes",
+                          "null or empty fields",
+                          "timestamps crossing DST boundaries"],
+            "export_format": ["CSV", "Excel", "JSON", "PDF"],
+            "consumer": ["opened in Excel", "parsed by downstream services",
+                         "imported back into the system"],
+            "trigger": ["double-clicking the submit button",
+                        "cache TTL expires during a write operation",
+                        "two users edit the same record simultaneously",
+                        "the nightly sync job overlaps with user activity"],
+            "impact": ["Potential data loss confirmed.",
+                       "No data loss confirmed yet, but risk exists.",
+                       "Affects users with international data.",
+                       "Breaks downstream pipeline processing."],
+            "workaround": ["Workaround: enable sequential mode in settings.",
+                           "Workaround: manually re-export after cleanup.",
+                           "No reliable workaround — data must be manually verified.",
+                           "Workaround: add a mutex lock externally (operational overhead)."],
+        },
+        "answer_template": {
+            "severe": {"priority": "P1", "labels": ["bug", "data-integrity"],
+                       "assigned_team": "backend", "milestone": "v2.1"},
+            "moderate": {"priority": "P2", "labels": ["bug", "data-integrity"],
+                         "assigned_team": "backend", "milestone": "v2.1"},
+        },
+        "severity_keywords": {
+            "severe": ["data loss", "No reliable workaround", "consistently",
+                       "permissions", "overwrite", "negative"],
+            "moderate": ["No data loss", "intermittently", "sequential mode",
+                         "re-export", "non-ASCII"],
+        },
+    },
+    "documentation": {
+        "titles": [
+            "Typo in {location}",
+            "Outdated {doc_type} on {page}",
+            "Missing documentation for {feature}",
+            "Incorrect {doc_element} in {location}",
+        ],
+        "bodies": [
+            "There is a {issue_type} on the {page}: {detail}. No functional impact, "
+            "purely cosmetic. {extra}",
+            "The {doc_type} for {feature} is {issue_type}. {detail}. {extra}",
+        ],
+        "vars": {
+            "location": ["homepage docs", "API reference", "README", "changelog",
+                         "contributing guide", "onboarding wiki"],
+            "doc_type": ["installation guide", "API documentation", "changelog",
+                         "migration guide", "code comments"],
+            "page": ["landing page", "docs homepage", "getting started page",
+                     "FAQ section", "footer"],
+            "feature": ["new webhook API", "batch processing endpoint",
+                        "SSO integration", "rate limiting"],
+            "doc_element": ["code example", "endpoint URL", "parameter description",
+                            "copyright year", "version number"],
+            "issue_type": ["a typo", "outdated", "missing", "incorrect", "misleading"],
+            "detail": ["'Welccome' should be 'Welcome'",
+                       "references removed v1.x API that no longer exists",
+                       "completely undocumented despite being a core feature",
+                       "shows '© 2022' but should be '© 2024'",
+                       "the curl example uses the wrong HTTP method"],
+            "extra": ["", "Low priority — does not block any workflow.",
+                      "New users have reported confusion.",
+                      "Only noticed by contributors reading source code."],
+        },
+        "answer_template": {
+            "default": {"priority": "P3", "labels": ["documentation"],
+                        "assigned_team": "devx", "milestone": "backlog"},
+        },
+        "severity_keywords": {"default": []},
+    },
+    "api_bug": {
+        "titles": [
+            "API rate limiter {issue} after {trigger}",
+            "{endpoint} returns {status_code} instead of {expected_code}",
+            "Pagination broken on {endpoint}: {symptom}",
+            "Webhook delivery {issue} for {event_type} events",
+            "API versioning: {endpoint} behaves differently on v1 vs v2",
+        ],
+        "bodies": [
+            "After receiving a {status_code} response, {consequence}. "
+            "The {root_cause}. {impact} {workaround}",
+            "The {endpoint} endpoint {symptom} when {trigger}. "
+            "Expected behavior: {expected}. Actual: {actual}. {impact}",
+        ],
+        "vars": {
+            "endpoint": ["/api/users", "/api/search", "/api/export",
+                         "/api/webhooks", "/api/billing", "/api/analytics"],
+            "issue": ["blocks legitimate users", "fails silently",
+                      "returns incorrect retry headers", "drops events"],
+            "trigger": ["a 429 error", "rate limit window resets",
+                        "a burst of requests from CI/CD", "server restart"],
+            "status_code": ["429", "500", "502", "504", "403"],
+            "expected_code": ["200", "201", "204", "404"],
+            "symptom": ["returns duplicate entries",
+                        "skips items between pages",
+                        "returns empty page despite more data existing"],
+            "event_type": ["payment.completed", "user.created",
+                           "subscription.cancelled", "deployment.finished"],
+            "consequence": ["legitimate users remain blocked for 1 hour",
+                            "data is silently lost with no error",
+                            "downstream services receive stale data"],
+            "root_cause": ["unblock logic has a bug — it never clears the blocked flag",
+                           "cursor-based pagination uses wrong sort order",
+                           "retry-after header reports milliseconds instead of seconds"],
+            "expected": ["200 OK with paginated results",
+                         "successful delivery with retry on failure",
+                         "proper rate limit reset after window expires"],
+            "actual": ["empty response with 200 status",
+                       "permanent block until manual intervention",
+                       "events dropped without any error log"],
+            "impact": ["Affects CI/CD pipelines hitting the API.",
+                       "External integrations break silently.",
+                       "Customer-facing dashboards show wrong data.",
+                       "Retry-After header causes clients to wait too long."],
+            "workaround": ["Workaround: manually clear Redis key.",
+                           "Workaround: add client-side deduplication.",
+                           "No workaround — requires server-side fix.",
+                           "Workaround: pin API version to v1 in headers."],
+        },
+        "answer_template": {
+            "severe": {"priority": "P1", "labels": ["bug", "api"],
+                       "assigned_team": "backend", "milestone": "v2.1"},
+            "moderate": {"priority": "P2", "labels": ["bug", "api"],
+                         "assigned_team": "backend", "milestone": "v2.1"},
+        },
+        "severity_keywords": {
+            "severe": ["silently lost", "permanent block", "No workaround",
+                       "dropped", "external integrations"],
+            "moderate": ["Workaround", "pin API", "deduplication"],
+        },
+    },
+}
+# The original handcrafted bugs — kept as a gold-standard subset
+_HANDCRAFTED_BUGS = {
     "easy": {
         "bugs": [
             BugReport(
                 author="user123",
                 labels_hint=[],
                 comments=["Confirmed on iOS and Android.", "Happens every time."],
+                severity_signals=["100% of users", "crashes", "no workaround"],
+                stack_trace="NullPointerException at AuthController.java:87",
+                affected_component="auth-service",
             ),
             BugReport(
                 id="easy-002",
                 author="docs_fan",
                 labels_hint=["documentation"],
                 comments=[],
+                severity_signals=["cosmetic", "no functional impact"],
+                stack_trace="",
+                affected_component="docs",
             ),
             BugReport(
                 id="easy-003",
                 author="power_user",
                 labels_hint=["performance"],
                 comments=["Noticed after the last deploy.", "CPU spikes to 100%."],
+                severity_signals=["workaround exists", "power users only"],
+                stack_trace="",
+                affected_component="dashboard",
             ),
             BugReport(
                 id="easy-004",
                      "Affects all users attempting password reset.",
                 author="support_team",
                 labels_hint=["bug"],
+                comments=["Reported by 12 users this week.",
+                           "Started after email service migration."],
+                severity_signals=["all users", "never dispatched"],
+                stack_trace="",
+                affected_component="email-service",
             ),
             BugReport(
                 id="easy-005",
                 author="intern_dev",
                 labels_hint=["documentation"],
                 comments=[],
+                severity_signals=["no functional impact"],
+                stack_trace="",
+                affected_component="frontend",
             ),
         ],
         "answers": {
             "easy-001": {"priority": "P0"},
             "easy-002": {"priority": "P3"},
                 author="store_owner",
                 labels_hint=["bug"],
                 comments=["Revenue impact confirmed.", "Happening since Tuesday."],
+                severity_signals=["revenue loss", "silently", "every failed checkout"],
+                stack_trace="Stripe API: card_declined at PaymentService.py:145",
+                affected_component="payment-service",
             ),
             BugReport(
                 id="med-002",
                 author="moderator_jane",
                 labels_hint=[],
                 comments=["GDPR concern — deleted content still visible."],
+                severity_signals=["GDPR violation", "deleted content visible"],
+                stack_trace="",
+                affected_component="search-index",
             ),
             BugReport(
                 id="med-003",
                 author="safari_user",
                 labels_hint=["bug", "ux"],
                 comments=["Only on Safari, not Chrome/Firefox."],
+                severity_signals=["workaround exists", "single browser"],
+                stack_trace="",
+                affected_component="frontend-css",
             ),
             BugReport(
                 id="med-004",
                      "Affects users with international data.",
                 author="data_analyst",
                 labels_hint=["bug"],
+                comments=["Encoding issue — UTF-8 not respected.",
+                           "Workaround: manual copy-paste."],
+                severity_signals=["corrupted", "workaround exists"],
+                stack_trace="",
+                affected_component="export-service",
             ),
             BugReport(
                 id="med-005",
                      "The unblock logic has a bug — it never clears the blocked flag.",
                 author="api_user",
                 labels_hint=["bug"],
+                comments=["Affects CI/CD pipelines hitting the API.",
+                           "Retry-After header is wrong."],
+                severity_signals=["permanent block", "never clears", "bug in logic"],
+                stack_trace="",
+                affected_component="api-gateway",
             ),
         ],
         "answers": {
                 author="security_researcher",
                 labels_hint=[],
                 comments=["Critical. Affects production.", "Do not discuss publicly."],
+                severity_signals=["SQL injection", "password hashes", "production",
+                                  "confidential"],
+                stack_trace="",
+                affected_component="search-api",
             ),
             BugReport(
                 id="hard-002",
                 author="devops_alice",
                 labels_hint=["performance"],
                 comments=["Verified with heap profiler.", "Started in v1.9."],
+                severity_signals=["memory leak", "OOM", "manual restart", "never frees"],
+                stack_trace="HeapDump: JobProcessor.process() -> 50MB/call, never GC'd",
+                affected_component="job-processor",
             ),
             BugReport(
                 id="hard-003",
                      "Risk is low-probability but affects data integrity.",
                 author="qa_bot",
                 labels_hint=["bug"],
+                comments=["Reproduced with locust at 50 concurrent users.",
+                           "Sequential mode avoids it."],
+                severity_signals=["race condition", "data integrity",
+                                  "workaround exists", "low-probability"],
+                stack_trace="",
+                affected_component="file-upload",
             ),
             BugReport(
                 id="hard-004",
                      "This is a session management security vulnerability.",
                 author="pentest_team",
                 labels_hint=["security"],
+                comments=["Verified on staging.",
+                           "OWASP A07 — Identification and Authentication Failures."],
+                severity_signals=["JWT not invalidated", "attacker", "security vulnerability",
+                                  "stolen token"],
+                stack_trace="",
+                affected_component="auth-service",
             ),
             BugReport(
                 id="hard-005",
                      "Triggered in production twice this week. Requires process kill to recover.",
                 author="oncall_eng",
                 labels_hint=["bug", "performance"],
+                comments=["PagerDuty alert fired twice.",
+                           "Needs exponential backoff + max retry cap."],
+                severity_signals=["infinite loop", "100%", "production",
+                                  "process kill", "starves other services"],
+                stack_trace="Thread dump: WebhookRetrier.retry() → recursive call, no exit",
+                affected_component="webhook-service",
             ),
         ],
         "answers": {
             "hard-001": {
+                "priority": "P0", "labels": ["bug", "security"],
+                "assigned_team": "security", "milestone": "hotfix",
             },
             "hard-002": {
+                "priority": "P1", "labels": ["bug", "performance"],
+                "assigned_team": "backend", "milestone": "v2.1",
             },
             "hard-003": {
+                "priority": "P1", "labels": ["bug", "data-integrity"],
+                "assigned_team": "backend", "milestone": "v2.1",
             },
             "hard-004": {
+                "priority": "P0", "labels": ["bug", "security"],
+                "assigned_team": "security", "milestone": "hotfix",
             },
             "hard-005": {
+                "priority": "P0", "labels": ["bug", "performance"],
+                "assigned_team": "backend", "milestone": "hotfix",
             },
         },
     },
 }
+# Combine into single TASKS dict (backward compatible)
+TASKS = _HANDCRAFTED_BUGS
+# ---------------------------------------------------------------------------
+#  PROCEDURAL BUG GENERATOR
+# ---------------------------------------------------------------------------
+def _determine_severity(text: str, keywords: Dict[str, list]) -> str:
+    """Check which severity level the generated text matches."""
+    text_lower = text.lower()
+    for level, kws in keywords.items():
+        if level == "default":
+            return "default"
+        hits = sum(1 for kw in kws if kw.lower() in text_lower)
+        if hits >= 1:
+            return level
+    # fallback to first non-default key
+    return list(keywords.keys())[0] if keywords else "moderate"
+def generate_bug(task_key: str, seed: int = None) -> Tuple[BugReport, dict]:
+    """Generate a procedural bug report with its correct answer."""
+    rng = random.Random(seed)
+    # Weight categories by difficulty
+    weights = {
+        "easy": {"documentation": 3, "ui_bug": 3, "performance": 2,
+                 "crash": 1, "api_bug": 1},
+        "medium": {"crash": 3, "performance": 3, "api_bug": 2,
+                   "data_corruption": 2, "ui_bug": 1},
+        "hard": {"security": 4, "crash": 3, "data_corruption": 3,
+                 "performance": 2, "api_bug": 2},
+    }
+    task_weights = weights.get(task_key, weights["medium"])
+    categories = []
+    for cat, w in task_weights.items():
+        categories.extend([cat] * w)
+    category = rng.choice(categories)
+    template = _BUG_TEMPLATES[category]
+    # Pick random variable values
+    chosen_vars = {}
+    for var_name, options in template["vars"].items():
+        chosen_vars[var_name] = rng.choice(options)
+    # Build title and body
+    title_tmpl = rng.choice(template["titles"])
+    body_tmpl = rng.choice(template["bodies"])
+    # Safe format — ignore missing keys
+    def safe_format(tmpl, vars_dict):
+        result = tmpl
+        for k, v in vars_dict.items():
+            result = result.replace("{" + k + "}", v)
+        return result
+    title = safe_format(title_tmpl, chosen_vars)
+    body = safe_format(body_tmpl, chosen_vars)
+    # Generate unique ID from seed (seed=0 is valid, so compare against None)
+    id_num = seed if seed is not None else rng.randint(0, 999999)
+    bug_id = f"gen-{id_num:06d}"
+    # Pick author
+    authors = ["user_report", "qa_engineer", "support_team", "dev_oncall",
+               "security_bot", "customer_jane", "automated_monitor",
+               "intern_dev", "senior_eng", "pm_feedback"]
+    author = rng.choice(authors)
+    # Build comments
+    comment_templates = [
+        "Confirmed on our side.", "Reproduced in staging.",
+        "Multiple reports from users.", "Started after last deployment.",
+        "Urgent — customer escalation.", "Low priority — no user complaints.",
+        "Needs investigation.", "Related to ticket from last sprint.",
+    ]
+    num_comments = rng.randint(0, 3)
+    comments = rng.sample(comment_templates, min(num_comments, len(comment_templates)))
+    # Determine severity and answer
+    full_text = f"{title} {body} {' '.join(comments)}"
+    severity_kws = template.get("severity_keywords", {})
+    severity = _determine_severity(full_text, severity_kws)
+    answer_templates = template["answer_template"]
+    answer = dict(answer_templates.get(severity, list(answer_templates.values())[0]))
+    # For easy tasks, only priority matters
+    if task_key == "easy":
+        answer = {"priority": answer["priority"]}
+    elif task_key == "medium":
+        answer.pop("milestone", None)
+    bug = BugReport(
+        id=bug_id,
+        title=title,
+        body=body,
+        author=author,
+        labels_hint=rng.sample(["bug", "needs-triage", "reported"], rng.randint(0, 2)),
+        comments=comments,
+        severity_signals=[],
+        stack_trace="",
+        affected_component=chosen_vars.get("service", chosen_vars.get("endpoint", "")),
+    )
+    return bug, answer
+# ---------------------------------------------------------------------------
+#  BUG SAMPLER: mixes handcrafted bugs (40%) with procedural ones (60%)
+# ---------------------------------------------------------------------------
+def sample_bug(task_key: str, seed: int = None) -> Tuple[BugReport, dict]:
+    """Return a bug and its answer. Mixes handcrafted + procedural."""
+    rng = random.Random(seed)
+    # 40% chance of handcrafted, 60% procedural
+    if rng.random() < 0.4 and task_key in _HANDCRAFTED_BUGS:
+        bugs = _HANDCRAFTED_BUGS[task_key]["bugs"]
+        bug = rng.choice(bugs)
+        answer = _HANDCRAFTED_BUGS[task_key]["answers"][bug.id]
+        return bug, answer
+    else:
+        gen_seed = seed if seed is not None else rng.randint(0, 999999)
+        return generate_bug(task_key, seed=gen_seed)
+# ---------------------------------------------------------------------------
+#  GRADING — with semantic label matching
+# ---------------------------------------------------------------------------
 PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}
 def _priority_score(predicted: str, correct: str) -> float:
+    """Score priority assignment with partial credit for near-misses."""
     if predicted == correct:
         return 0.95
+    pred_rank = PRIORITY_ORDER.get(predicted, 99)
+    corr_rank = PRIORITY_ORDER.get(correct, 99)
+    diff = abs(pred_rank - corr_rank)
+    if diff == 1:
+        return 0.5
+    elif diff == 2:
+        return 0.2
+    return 0.05
+def _normalize_label(label: str) -> str:
+    """Normalize a label to its canonical form."""
+    label_lower = label.lower().strip()
+    for canonical, synonyms in LABEL_SYNONYMS.items():
+        if label_lower == canonical or label_lower in synonyms:
+            return canonical
+    return label_lower
 def _label_score(predicted: List[str], correct: List[str]) -> float:
+    """Score labels using semantic matching via synonym groups."""
+    pred_normalized = set(_normalize_label(l) for l in predicted)
+    corr_normalized = set(_normalize_label(l) for l in correct)
+    if not corr_normalized:
         return 0.95
+    intersection = pred_normalized & corr_normalized
+    union = pred_normalized | corr_normalized
+    raw = len(intersection) / len(union) if union else 0.0
     return max(0.05, min(0.95, raw))
+def _reasoning_score(reasoning: str, answer: dict) -> float:
+    """Bonus for reasoning that mentions relevant signals."""
+    if not reasoning or len(reasoning.strip()) < 10:
+        return 0.0
+    key_signals = {
+        "P0": ["production", "all users", "data loss", "security", "crash",
+               "revenue", "injection", "vulnerability", "100%"],
+        "P1": ["major", "significant", "no workaround", "broken",
+               "gdpr", "blocked", "leak", "never"],
+        "P2": ["degraded", "workaround", "partial", "slow",
+               "affected", "power users"],
+        "P3": ["minor", "cosmetic", "docs", "typo", "low",
+               "no functional impact"],
+    }
+    expected_priority = answer.get("priority", "P2")
+    signals = key_signals.get(expected_priority, [])
+    reasoning_lower = reasoning.lower()
+    hits = sum(1 for s in signals if s in reasoning_lower)
+    return min(0.15, hits * 0.05)
+def grade_action(task_key: str, bug: BugReport, action: TriageAction,
+                 answer: dict = None) -> Tuple[float, str]:
+    """Grade the agent's triage action against the correct answer."""
+    # Backward compatibility: look up answer from handcrafted if not provided
+    if answer is None:
+        if task_key in _HANDCRAFTED_BUGS and bug.id in _HANDCRAFTED_BUGS[task_key]["answers"]:
+            answer = _HANDCRAFTED_BUGS[task_key]["answers"][bug.id]
+        else:
+            return 0.5, "No answer key found for this bug."
     feedback_parts = []
+    reasoning_bonus = _reasoning_score(action.reasoning, answer)
     if task_key == "easy":
         score = _priority_score(action.priority, answer["priority"])
         symbol = "✓" if score >= 0.9 else "~" if score >= 0.4 else "✗"
+        feedback_parts.append(
+            f"Priority: {symbol} (got {action.priority}, expected {answer['priority']})")
+        score = score + reasoning_bonus
         score = max(0.01, min(0.99, score))
         return round(score, 3), " | ".join(feedback_parts)
     elif task_key == "medium":
         p_score = _priority_score(action.priority, answer["priority"])
+        l_score = _label_score(action.labels, answer.get("labels", []))
         expected_team = answer.get("assigned_team", "")
         t_score = 0.95 if expected_team and action.assigned_team.lower() == expected_team.lower() else 0.05
+        score = 0.45 * p_score + 0.40 * l_score + 0.15 * t_score + reasoning_bonus
+        feedback_parts.append(
+            f"Priority: {p_score:.2f} (got {action.priority}, expected {answer['priority']})")
+        feedback_parts.append(f"Labels: {l_score:.2f} (semantic match)")
+        feedback_parts.append(
+            f"Team: {t_score:.2f} (got {action.assigned_team}, expected {expected_team})")
+        if reasoning_bonus > 0:
+            feedback_parts.append(f"Reasoning bonus: +{reasoning_bonus:.2f}")
         score = max(0.01, min(0.99, score))
         return round(score, 3), " | ".join(feedback_parts)
     else:  # hard
         p_score = _priority_score(action.priority, answer["priority"])
+        l_score = _label_score(action.labels, answer.get("labels", []))
         t_score = 0.95 if action.assigned_team.lower() == answer["assigned_team"].lower() else 0.05
         m_score = 0.95 if action.milestone.lower() == answer["milestone"].lower() else 0.05
+        score = 0.35 * p_score + 0.30 * l_score + 0.20 * t_score + 0.15 * m_score + reasoning_bonus
+        feedback_parts.append(
+            f"Priority: {p_score:.2f} (got {action.priority}, expected {answer['priority']})")
+        feedback_parts.append(f"Labels: {l_score:.2f} (semantic match)")
+        feedback_parts.append(
+            f"Team: {t_score:.2f} (got {action.assigned_team}, expected {answer['assigned_team']})")
+        feedback_parts.append(
+            f"Milestone: {m_score:.2f} (got {action.milestone}, expected {answer['milestone']})")
+        if reasoning_bonus > 0:
+            feedback_parts.append(f"Reasoning bonus: +{reasoning_bonus:.2f}")
+        # Security escalation penalty
         if answer.get("assigned_team") == "security" and action.assigned_team.lower() != "security":
             score = max(0.01, score - 0.15)
             feedback_parts.append("⚠ Security escalation missed (-0.15)")
         score = max(0.01, min(0.99, score))
         return round(score, 3), " | ".join(feedback_parts)
+# ---------------------------------------------------------------------------
+#  NAMED GRADER FUNCTIONS — referenced by openenv.yaml
+# ---------------------------------------------------------------------------
 def priority_match(*args, **kwargs):
     if len(args) < 2:
         return 0.5
+    bug, action = args[0], args[1]
     score, _ = grade_action("easy", bug, action)
     return float(score)
 def priority_label_team(*args, **kwargs):
     if len(args) < 2:
         return 0.5
+    bug, action = args[0], args[1]
     score, _ = grade_action("medium", bug, action)
     return float(score)
 def full_triage(*args, **kwargs):
     if len(args) < 2:
         return 0.5
+    bug, action = args[0], args[1]
     score, _ = grade_action("hard", bug, action)
     return float(score)
 __all__ = [
     "priority_match",
     "priority_label_team",
     "full_triage",
     "sample_bug",
+    "generate_bug",
     "grade_action",
+    "TASKS",
+    "LABEL_SYNONYMS",
 ]
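
Review note: the hard-task score above is a weighted sum with a security-escalation penalty, and the arithmetic is easier to follow on a worked case. The sketch below is a minimal, self-contained restatement of those rules, not the module itself. The names `priority_score`, `label_score`, `hard_score` and `ANSWER` are hypothetical stand-ins, and synonym normalization via `LABEL_SYNONYMS` is assumed away, so labels are only lowercased.

```python
from typing import Dict, List

# Hypothetical restatement of the grading rules shown in the diff above.
PRIORITY_ORDER: Dict[str, int] = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


def priority_score(predicted: str, correct: str) -> float:
    """Exact match earns 0.95; near-misses earn partial credit by rank distance."""
    if predicted == correct:
        return 0.95
    diff = abs(PRIORITY_ORDER.get(predicted, 99) - PRIORITY_ORDER.get(correct, 99))
    return {1: 0.5, 2: 0.2}.get(diff, 0.05)


def label_score(predicted: List[str], correct: List[str]) -> float:
    """Jaccard overlap of lowercased labels, clamped to [0.05, 0.95].

    Assumption: no synonym groups, unlike the real _normalize_label.
    """
    pred = {label.lower().strip() for label in predicted}
    corr = {label.lower().strip() for label in correct}
    if not corr:
        return 0.95
    raw = len(pred & corr) / len(pred | corr)
    return max(0.05, min(0.95, raw))


def hard_score(priority: str, labels: List[str], team: str, milestone: str,
               answer: dict, reasoning_bonus: float = 0.0) -> float:
    """Weighted sum used for the hard task, plus the security penalty."""
    p = priority_score(priority, answer["priority"])
    l = label_score(labels, answer["labels"])
    t = 0.95 if team.lower() == answer["assigned_team"].lower() else 0.05
    m = 0.95 if milestone.lower() == answer["milestone"].lower() else 0.05
    score = 0.35 * p + 0.30 * l + 0.20 * t + 0.15 * m + reasoning_bonus
    # Missing a security escalation costs an extra 0.15.
    if answer["assigned_team"] == "security" and team.lower() != "security":
        score = max(0.01, score - 0.15)
    # Clamp so the reward always stays strictly between 0 and 1.
    return round(max(0.01, min(0.99, score)), 3)


# Answer key copied from the hard-001 entry in the diff.
ANSWER = {"priority": "P0", "labels": ["bug", "security"],
          "assigned_team": "security", "milestone": "hotfix"}

perfect = hard_score("P0", ["bug", "security"], "security", "hotfix", ANSWER)
near_miss = hard_score("P1", ["bug"], "backend", "hotfix", ANSWER)
print(perfect, near_miss)
```

In the near-miss case the agent is one priority level off, supplies half the labels, and routes to the wrong team, so the security penalty is applied on top of the lower component scores. The final clamp is what lets the tests assert `0 < reward < 1` for every submission.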

tests/__init__.py ADDED

	@@ -0,0 +1 @@


+# tests/__init__.py

tests/test_api.py ADDED

	@@ -0,0 +1,190 @@

+# tests/test_api.py
+"""Integration tests for the FastAPI endpoints."""
+import sys
+import os
+sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
+sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(__file__)), "server"))
+import pytest
+# These tests require fastapi and httpx
+try:
+    from fastapi.testclient import TestClient
+    from server.app import app
+    HAS_DEPS = True
+except ImportError:
+    HAS_DEPS = False
+pytestmark = pytest.mark.skipif(not HAS_DEPS, reason="FastAPI/httpx not installed")
+@pytest.fixture
+def client():
+    return TestClient(app)
+class TestHealthEndpoint:
+    def test_health_returns_ok(self, client):
+        r = client.get("/health")
+        assert r.status_code == 200
+        data = r.json()
+        assert data.get("status") in ("ok", "healthy")
+class TestTaskEndpoints:
+    def test_list_tasks(self, client):
+        r = client.get("/tasks")
+        assert r.status_code == 200
+        tasks = r.json()
+        assert len(tasks) == 3
+        ids = [t["id"] for t in tasks]
+        assert "easy" in ids
+        assert "medium" in ids
+        assert "hard" in ids
+    def test_get_specific_task(self, client):
+        r = client.get("/tasks/easy")
+        assert r.status_code == 200
+        assert r.json()["id"] == "easy"
+    def test_get_nonexistent_task(self, client):
+        r = client.get("/tasks/impossible")
+        assert r.status_code == 404
+class TestResetEndpoint:
+    def test_reset_returns_observation(self, client):
+        r = client.post("/reset", json={"task_id": "easy"})
+        assert r.status_code == 200
+        data = r.json()
+        assert "observation" in data
+        assert "session_id" in data
+        assert data["done"] is False
+    def test_reset_with_empty_body(self, client):
+        r = client.post("/reset", json={})
+        assert r.status_code == 200
+    def test_reset_returns_bug_report(self, client):
+        r = client.post("/reset", json={"task_id": "medium"})
+        data = r.json()
+        obs = data["observation"]
+        assert "bug_report" in obs
+        assert "title" in obs["bug_report"]
+class TestStepEndpoint:
+    def test_investigation_step(self, client):
+        # Reset first
+        r = client.post("/reset", json={"task_id": "easy"})
+        session_id = r.json()["session_id"]
+        # Investigate
+        r = client.post("/step", json={
+            "session_id": session_id,
+            "action": {"action_type": "read_body"},
+        })
+        assert r.status_code == 200
+        data = r.json()
+        assert data["done"] is False
+    def test_submit_step(self, client):
+        # Reset
+        r = client.post("/reset", json={"task_id": "easy"})
+        session_id = r.json()["session_id"]
+        # Submit
+        r = client.post("/step", json={
+            "session_id": session_id,
+            "action": {
+                "action_type": "submit",
+                "priority": "P0",
+                "labels": ["bug"],
+                "assigned_team": "backend",
+            },
+        })
+        assert r.status_code == 200
+        data = r.json()
+        assert data["done"] is True
+        assert 0 < data["reward"] < 1
+    def test_full_episode_flow(self, client):
+        # Reset
+        r = client.post("/reset", json={"task_id": "hard"})
+        assert r.status_code == 200
+        session_id = r.json()["session_id"]
+        # Investigate: read body
+        r = client.post("/step", json={
+            "session_id": session_id,
+            "action": {"action_type": "read_body"},
+        })
+        assert r.status_code == 200
+        assert r.json()["done"] is False
+        # Investigate: read comments
+        r = client.post("/step", json={
+            "session_id": session_id,
+            "action": {"action_type": "read_comments"},
+        })
+        assert r.status_code == 200
+        assert r.json()["done"] is False
+        # Submit triage
+        r = client.post("/step", json={
+            "session_id": session_id,
+            "action": {
+                "action_type": "submit",
+                "priority": "P0",
+                "labels": ["bug", "security"],
+                "assigned_team": "security",
+                "milestone": "hotfix",
+                "reasoning": "Critical security vulnerability in production",
+            },
+        })
+        assert r.status_code == 200
+        data = r.json()
+        assert data["done"] is True
+        assert 0 < data["reward"] < 1
+    def test_backward_compatible_no_session(self, client):
+        """Old-style requests without session_id should still work."""
+        r = client.post("/reset", json={"task_id": "easy"})
+        assert r.status_code == 200
+        r = client.post("/step", json={
+            "action": {
+                "priority": "P0",
+                "labels": ["bug"],
+            },
+        })
+        assert r.status_code == 200
+class TestStateEndpoint:
+    def test_state_returns_data(self, client):
+        client.post("/reset", json={"task_id": "easy"})
+        r = client.get("/state")
+        assert r.status_code == 200
+        data = r.json()
+        assert "current_task" in data
+        assert "step_count" in data
+class TestLeaderboard:
+    def test_get_empty_leaderboard(self, client):
+        r = client.get("/leaderboard")
+        assert r.status_code == 200
+        assert isinstance(r.json(), list)
+    def test_submit_to_leaderboard(self, client):
+        r = client.post("/leaderboard/submit", json={
+            "agent_name": "test-agent",
+            "model": "test-model",
+            "scores": {"easy": 0.9, "medium": 0.7, "hard": 0.5},
+            "avg_score": 0.7,
+        })
+        assert r.status_code == 200
+        data = r.json()
+        assert data["status"] == "submitted"
+        assert "rank" in data

tests/test_environment.py ADDED

	@@ -0,0 +1,205 @@

+# tests/test_environment.py
+"""Tests for the environment logic in server/environment.py"""
+import sys
+import os
+sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
+sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(__file__)), "server"))
+import pytest
+from model import TriageAction, TriageObservation
+from server.environment import BugTriageEnvironment, SessionManager
+class TestEnvironmentReset:
+    def test_reset_returns_observation(self):
+        env = BugTriageEnvironment()
+        obs = env.reset(task_id="easy")
+        assert isinstance(obs, TriageObservation)
+        assert obs.bug_report is not None
+        assert obs.done is False
+        assert obs.task_id == "easy"
+    def test_reset_different_tasks(self):
+        env = BugTriageEnvironment()
+        for task_id in ["easy", "medium", "hard"]:
+            obs = env.reset(task_id=task_id)
+            assert obs.task_id == task_id
+            assert obs.done is False
+    def test_reset_invalid_task_defaults_to_easy(self):
+        env = BugTriageEnvironment()
+        obs = env.reset(task_id="nonexistent")
+        assert obs.task_id == "easy"
+    def test_reset_shows_truncated_body(self):
+        env = BugTriageEnvironment()
+        obs = env.reset(task_id="easy")
+        # Body should be truncated (not fully visible) on reset
+        assert obs.body_visible is False
+    def test_reset_hides_comments(self):
+        env = BugTriageEnvironment()
+        obs = env.reset(task_id="easy")
+        assert obs.comments_visible is False
+    def test_reset_clears_previous_state(self):
+        env = BugTriageEnvironment()
+        env.reset(task_id="easy")
+        env.step(TriageAction(action_type="submit", priority="P0"))
+        # Reset should clear everything
+        obs = env.reset(task_id="medium")
+        assert obs.done is False
+        assert obs.task_id == "medium"
+        assert obs.steps_taken == 0
+class TestEnvironmentInvestigation:
+    def test_read_body_reveals_full_body(self):
+        env = BugTriageEnvironment()
+        env.reset(task_id="easy")
+        obs = env.step(TriageAction(action_type="read_body"))
+        assert obs.body_visible is True
+        assert obs.done is False
+        assert obs.steps_taken == 1
+    def test_read_comments_reveals_comments(self):
+        env = BugTriageEnvironment()
+        env.reset(task_id="easy")
+        obs = env.step(TriageAction(action_type="read_comments"))
+        assert obs.comments_visible is True
+        assert obs.done is False
+    def test_check_logs_reveals_logs(self):
+        env = BugTriageEnvironment()
+        env.reset(task_id="easy")
+        obs = env.step(TriageAction(action_type="check_logs"))
+        assert obs.logs_visible is True
+        assert obs.done is False
+    def test_duplicate_investigation_gives_feedback(self):
+        env = BugTriageEnvironment()
+        env.reset(task_id="easy")
+        env.step(TriageAction(action_type="read_body"))
+        obs = env.step(TriageAction(action_type="read_body"))
+        assert "already" in obs.feedback.lower()
+    def test_step_count_increments(self):
+        env = BugTriageEnvironment()
+        env.reset(task_id="easy")
+        obs1 = env.step(TriageAction(action_type="read_body"))
+        assert obs1.steps_taken == 1
+        obs2 = env.step(TriageAction(action_type="read_comments"))
+        assert obs2.steps_taken == 2
+class TestEnvironmentSubmission:
+    def test_submit_returns_done(self):
+        env = BugTriageEnvironment()
+        env.reset(task_id="easy")
+        obs = env.step(TriageAction(action_type="submit", priority="P0"))
+        assert obs.done is True
+    def test_submit_returns_valid_score(self):
+        env = BugTriageEnvironment()
+        env.reset(task_id="easy")
+        obs = env.step(TriageAction(action_type="submit", priority="P0"))
+        assert 0 < obs.score < 1
+        assert 0 < obs.reward < 1
+    def test_investigate_then_submit(self):
+        env = BugTriageEnvironment()
+        env.reset(task_id="medium")
+        env.step(TriageAction(action_type="read_body"))
+        env.step(TriageAction(action_type="read_comments"))
+        obs = env.step(TriageAction(
+            action_type="submit", priority="P0",
+            labels=["bug"], assigned_team="backend",
+        ))
+        assert obs.done is True
+        assert 0 < obs.score < 1
+    def test_double_submit_stays_done(self):
+        env = BugTriageEnvironment()
+        env.reset(task_id="easy")
+        env.step(TriageAction(action_type="submit", priority="P0"))
+        obs = env.step(TriageAction(action_type="submit", priority="P1"))
+        assert obs.done is True
+        assert "already complete" in obs.feedback.lower()
+    def test_max_steps_forces_submit(self):
+        env = BugTriageEnvironment()
+        obs = env.reset(task_id="easy")
+        max_steps = obs.max_steps
+        # Use all steps investigating
+        for _ in range(max_steps - 1):
+            obs = env.step(TriageAction(action_type="read_body"))
+            if obs.done:
+                break
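+        # The final allowed step should force a submit even for an investigation action
+        if not obs.done:
+            obs = env.step(TriageAction(
+                action_type="read_comments",  # will be forced to submit
+                priority="P0",
+            ))
+        # Either way, the episode must have ended once the step limit is reached
+        assert obs.done is True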
+class TestEnvironmentState:
+    def test_state_tracks_steps(self):
+        env = BugTriageEnvironment()
+        env.reset(task_id="easy")
+        env.step(TriageAction(action_type="read_body"))
+        state = env.get_state()
+        assert state.step_count == 1
+        assert "read_body" in state.actions_taken
+    def test_state_tracks_completed_tasks(self):
+        env = BugTriageEnvironment()
+        env.reset(task_id="easy")
+        env.step(TriageAction(action_type="submit", priority="P0"))
+        state = env.get_state()
+        assert "easy" in state.tasks_completed
+class TestSessionManager:
+    def test_create_session(self):
+        mgr = SessionManager(max_sessions=10, ttl_seconds=60)
+        session_id, env = mgr.create_session()
+        assert session_id is not None
+        assert isinstance(env, BugTriageEnvironment)
+        assert mgr.active_count == 1
+    def test_get_session(self):
+        mgr = SessionManager()
+        session_id, env = mgr.create_session()
+        retrieved = mgr.get_session(session_id)
+        assert retrieved is env
+    def test_get_missing_session(self):
+        mgr = SessionManager()
+        assert mgr.get_session("nonexistent") is None
+    def test_remove_session(self):
+        mgr = SessionManager()
+        session_id, _ = mgr.create_session()
+        mgr.remove_session(session_id)
+        assert mgr.get_session(session_id) is None
+        assert mgr.active_count == 0
+    def test_max_sessions_enforced(self):
+        mgr = SessionManager(max_sessions=3, ttl_seconds=60)
+        for _ in range(5):
+            mgr.create_session()
+        assert mgr.active_count <= 3
+    def test_multiple_sessions_independent(self):
+        mgr = SessionManager()
+        sid1, env1 = mgr.create_session()
+        sid2, env2 = mgr.create_session()
+        env1.reset(task_id="easy")
+        env2.reset(task_id="hard")
+        assert env1.get_state().current_task == "easy"
+        assert env2.get_state().current_task == "hard"

tests/test_grading.py ADDED

	@@ -0,0 +1,253 @@

+# tests/test_grading.py
+"""Tests for the grading logic in server/task.py"""
+import sys
+import os
+sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
+sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(__file__)), "server"))
+import pytest
+from model import BugReport, TriageAction
+from server.task import (
+    _priority_score, _label_score, _normalize_label, _reasoning_score,
+    grade_action, generate_bug, sample_bug, TASKS, LABEL_SYNONYMS,
+)
+# ── Priority Scoring ──────────────────────────────────────
+class TestPriorityScoring:
+    def test_exact_match_gives_high_score(self):
+        assert _priority_score("P0", "P0") == 0.95
+    def test_all_exact_matches(self):
+        for p in ["P0", "P1", "P2", "P3"]:
+            assert _priority_score(p, p) == 0.95
+    def test_off_by_one_gives_partial_credit(self):
+        assert _priority_score("P0", "P1") == 0.5
+        assert _priority_score("P1", "P2") == 0.5
+        assert _priority_score("P2", "P3") == 0.5
+    def test_off_by_two_gives_low_credit(self):
+        assert _priority_score("P0", "P2") == 0.2
+        assert _priority_score("P1", "P3") == 0.2
+    def test_completely_wrong_gives_minimum(self):
+        assert _priority_score("P0", "P3") == 0.05
+    def test_invalid_priority(self):
+        assert _priority_score("P9", "P0") == 0.05
+        assert _priority_score("invalid", "P0") == 0.05
+# ── Label Scoring ─────────────────────────────────────────
+class TestLabelScoring:
+    def test_perfect_match(self):
+        score = _label_score(["bug", "security"], ["bug", "security"])
+        assert score >= 0.9
+    def test_partial_overlap(self):
+        score = _label_score(["bug"], ["bug", "security"])
+        assert 0.3 < score < 0.7  # ~50% Jaccard
+    def test_no_overlap(self):
+        score = _label_score(["docs"], ["bug", "security"])
+        assert score == 0.05  # clamped minimum
+    def test_empty_correct_labels(self):
+        score = _label_score(["bug"], [])
+        assert score == 0.95  # nothing expected => full credit
+    def test_synonym_matching(self):
+        # "defect" is a synonym for "bug"
+        score = _label_score(["defect"], ["bug"])
+        assert score >= 0.9  # should match via synonym
+    def test_case_insensitive(self):
+        score = _label_score(["BUG", "Security"], ["bug", "security"])
+        assert score >= 0.9
+# ── Label Normalization ───────────────────────────────────
+class TestLabelNormalization:
+    def test_canonical_stays_same(self):
+        assert _normalize_label("bug") == "bug"
+        assert _normalize_label("security") == "security"
+    def test_synonym_maps_to_canonical(self):
+        assert _normalize_label("defect") == "bug"
+        assert _normalize_label("vulnerability") == "security"
+        assert _normalize_label("slow") == "performance"
+        assert _normalize_label("ui") == "ux"
+    def test_unknown_label_passes_through(self):
+        assert _normalize_label("my-custom-label") == "my-custom-label"
+    def test_case_insensitive(self):
+        assert _normalize_label("BUG") == "bug"
+        assert _normalize_label("Vulnerability") == "security"
+# ── Reasoning Scoring ─────────────────────────────────────
+class TestReasoningScoring:
+    def test_empty_reasoning_gives_zero(self):
+        assert _reasoning_score("", {"priority": "P0"}) == 0.0
+    def test_short_reasoning_gives_zero(self):
+        assert _reasoning_score("bad", {"priority": "P0"}) == 0.0
+    def test_relevant_reasoning_gives_bonus(self):
+        score = _reasoning_score(
+            "This is a critical security vulnerability affecting production and causing data loss",
+            {"priority": "P0"},
+        )
+        assert score > 0
+    def test_bonus_capped_at_max(self):
+        score = _reasoning_score(
+            "production down all users data loss security crash revenue injection vulnerability 100%",
+            {"priority": "P0"},
+        )
+        assert score <= 0.15
+# ── Grade Action ──────────────────────────────────────────
+class TestGradeAction:
+    @pytest.fixture
+    def easy_bug(self):
+        return TASKS["easy"]["bugs"][0]  # easy-001: P0
+    @pytest.fixture
+    def medium_bug(self):
+        return TASKS["medium"]["bugs"][0]  # med-001: P0, payments, backend
+    @pytest.fixture
+    def hard_bug(self):
+        return TASKS["hard"]["bugs"][0]  # hard-001: P0, security, hotfix
+    def test_easy_perfect_answer(self, easy_bug):
+        action = TriageAction(priority="P0")
+        score, feedback = grade_action("easy", easy_bug, action)
+        assert 0.9 <= score <= 0.99
+        assert "✓" in feedback
+    def test_easy_wrong_answer(self, easy_bug):
+        action = TriageAction(priority="P3")
+        score, feedback = grade_action("easy", easy_bug, action)
+        assert score < 0.2
+    def test_medium_perfect_answer(self, medium_bug):
+        action = TriageAction(
+            priority="P0",
+            labels=["bug", "payments"],
+            assigned_team="backend",
+        )
+        score, feedback = grade_action("medium", medium_bug, action)
+        assert score > 0.8
+    def test_hard_security_penalty(self, hard_bug):
+        # hard-001 requires security team; assigning backend should be penalized
+        action_wrong = TriageAction(
+            priority="P0",
+            labels=["bug", "security"],
+            assigned_team="backend",  # Wrong! Should be security
+            milestone="hotfix",
+        )
+        action_right = TriageAction(
+            priority="P0",
+            labels=["bug", "security"],
+            assigned_team="security",
+            milestone="hotfix",
+        )
+        score_wrong, fb_wrong = grade_action("hard", hard_bug, action_wrong)
+        score_right, fb_right = grade_action("hard", hard_bug, action_right)
+        assert score_right > score_wrong
+        assert "Security escalation missed" in fb_wrong
+    def test_all_scores_in_valid_range(self):
+        """Every grading result must be in (0, 1) — open interval."""
+        for task_key in ["easy", "medium", "hard"]:
+            for bug in TASKS[task_key]["bugs"]:
+                for priority in ["P0", "P1", "P2", "P3"]:
+                    action = TriageAction(
+                        priority=priority,
+                        labels=["bug"],
+                        assigned_team="backend",
+                        milestone="backlog",
+                    )
+                    score, feedback = grade_action(task_key, bug, action)
+                    assert 0 < score < 1, (
+                        f"Score {score} out of range for {bug.id} "
+                        f"with priority={priority}"
+                    )
+                    assert isinstance(feedback, str)
+                    assert len(feedback) > 0
+# ── Procedural Bug Generation ─────────────────────────────
+class TestBugGeneration:
+    def test_generate_produces_valid_bug(self):
+        bug, answer = generate_bug("easy", seed=42)
+        assert isinstance(bug, BugReport)
+        assert bug.id.startswith("gen-")
+        assert len(bug.title) > 5
+        assert len(bug.body) > 20
+        assert "priority" in answer
+    def test_different_seeds_produce_different_bugs(self):
+        bug1, _ = generate_bug("easy", seed=1)
+        bug2, _ = generate_bug("easy", seed=2)
+        # Different seeds should differ in title or body; a collision on both is very unlikely
+        assert bug1.title != bug2.title or bug1.body != bug2.body
+    def test_same_seed_produces_same_bug(self):
+        bug1, ans1 = generate_bug("easy", seed=42)
+        bug2, ans2 = generate_bug("easy", seed=42)
+        assert bug1.title == bug2.title
+        assert bug1.body == bug2.body
+        assert ans1 == ans2
+    def test_easy_bugs_have_only_priority(self):
+        for seed in range(10):
+            _, answer = generate_bug("easy", seed=seed)
+            assert "priority" in answer
+            # easy should NOT include milestone
+            assert "milestone" not in answer
+    def test_hard_bugs_have_full_answer(self):
+        for seed in range(50):
+            _, answer = generate_bug("hard", seed=seed)
+            assert "priority" in answer
+    def test_all_difficulties(self):
+        for difficulty in ["easy", "medium", "hard"]:
+            bug, answer = generate_bug(difficulty, seed=100)
+            assert isinstance(bug, BugReport)
+            assert "priority" in answer
+    def test_sample_bug_returns_tuple(self):
+        bug, answer = sample_bug("easy", seed=42)
+        assert isinstance(bug, BugReport)
+        assert isinstance(answer, dict)
+    def test_generated_bugs_are_gradeable(self):
+        """Generated bugs should work with the grading system."""
+        for difficulty in ["easy", "medium", "hard"]:
+            for seed in range(5):
+                bug, answer = generate_bug(difficulty, seed=seed)
+                action = TriageAction(
+                    priority=answer["priority"],
+                    labels=answer.get("labels", ["bug"]),
+                    assigned_team=answer.get("assigned_team", "backend"),
+                    milestone=answer.get("milestone", "backlog"),
+                )
+                score, feedback = grade_action(difficulty, bug, action, answer=answer)
+                assert 0 < score < 1, (
+                    f"Score {score} for {bug.id} ({difficulty})"
+                )